Monday, July 23, 2012

Replication and snitches in Cassandra

In the previous post we saw how data distribution in Cassandra works. But what if one of the nodes fails? If you want to ensure fault tolerance, you need to enable data replication. When you create a keyspace in Cassandra, you must decide the replica placement strategy: the number of replicas and how those replicas are distributed across nodes in the cluster. The replication strategy relies on the cluster-configured snitch to help it determine the physical location of nodes and their proximity to each other.
The total number of replicas across the cluster is often referred to as the replication factor. A replication factor of 1 means there is only one copy of a row in the cluster. A replication factor of 2 means there are two copies of a row stored in the cluster. The replication factor is controlled at the keyspace level.

The different types of replication strategies are:

Simple strategy: This strategy places the original row on a node determined by the partitioner, and then a copy of the row is placed on the next node walking clockwise around the cluster.
Network topology strategy: This is a more sophisticated strategy than the simple strategy. Network topology strategy gives you more control over where the rows are placed in your database cluster. It allows you to place different rows on different racks in a data center, or to carry out geographical replication between multiple data centers. You can specify the replication strategy when creating a keyspace like:

CREATE KEYSPACE your_keySpace
  WITH strategy_class = 'NetworkTopologyStrategy'
  AND strategy_options:DC1 = 4;

Here strategy_options:DC1 = 4 specifies a replication factor of 4 for the data center named DC1.
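To make the clockwise walk of the simple strategy concrete, here is a hypothetical Python sketch (illustration only, not Cassandra's actual implementation): given a ring of nodes ordered by token, the first copy goes to the primary node and each additional copy goes to the next node clockwise.

```python
# Hypothetical sketch of SimpleStrategy-style replica placement:
# the first replica lives on the node owning the row's token, and
# additional copies are placed on the next nodes walking clockwise.

def place_replicas(ring, primary_index, replication_factor):
    """ring is a list of node names ordered by token position.
    Returns the nodes holding copies of a row whose primary node
    is ring[primary_index]."""
    return [ring[(primary_index + i) % len(ring)]
            for i in range(replication_factor)]

nodes = ["node1", "node2", "node3", "node4"]
print(place_replicas(nodes, 2, 3))  # ['node3', 'node4', 'node1']
```

Note how the walk wraps around the end of the ring: with a replication factor of 3 starting at node3, the last copy lands back on node1.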

Cassandra uses a snitch to define how nodes are grouped together within the overall network topology. The snitch is configured in the cassandra.yaml configuration file via the endpoint_snitch setting.

The four basic snitches you can define are:

Simple snitch: This is the default snitch. It does not recognize data center or rack information, so copies of a row are simply placed on the next available nodes walking clockwise around the ring.
Rack inferring snitch: The rack inferring snitch automatically tries to place copies of rows on different racks in your data center. It uses the node's IP address to infer the network topology: the second octet of the IP address identifies the data center where a node is located, and the third octet identifies the rack.
Property file snitch: Determines the location of nodes from a user-defined properties file (cassandra-topology.properties) that maps each node to a data center and rack.
EC2 snitch: If you run Cassandra on Amazon EC2, you can use the EC2 snitch, which queries the AWS API for the node's region and availability zone and treats them as the data center and rack, respectively.
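The rack inferring snitch's IP convention is easy to illustrate. The sketch below is a hypothetical Python rendering of that rule only (the real snitch is implemented inside Cassandra): the second octet names the data center and the third names the rack.

```python
# Hypothetical sketch of the rack inferring snitch's convention:
# for an IP like 10.20.30.5, octet 2 ("20") identifies the data
# center and octet 3 ("30") identifies the rack.

def infer_location(ip_address):
    octets = ip_address.split(".")
    return {"datacenter": octets[1], "rack": octets[2]}

print(infer_location("10.20.30.5"))  # {'datacenter': '20', 'rack': '30'}
```

So two nodes at 10.20.30.5 and 10.20.31.8 would be seen as sharing a data center but sitting on different racks.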

Wednesday, July 18, 2012

Cassandra Architecture - Data Partitioning

Before starting a Cassandra cluster, you must choose how the data will be divided across the nodes in the cluster. This involves choosing a partitioner for the cluster. Cassandra uses a ring architecture. The ring is divided up into ranges equal to the number of nodes, with each node being responsible for one or more ranges of the overall data. Before a node can join the ring, it must be assigned a token. The token determines the node’s position on the ring and the range of data it is responsible for.
Once a partitioner is chosen, it cannot be changed without reloading all of the data. This makes it very important to choose and configure the correct partitioner before initializing the cluster.
The important distinction between the partitioners is order preservation (OP). Users can define their own partitioners by implementing IPartitioner, or they can use one of the native partitioners.

Random Partitioner
RandomPartitioner is the default choice for Cassandra. It uses an MD5 hash function to map row keys to tokens, which distributes keys evenly across the cluster; the hash of the row key determines the node placement. The consistent hashing algorithm used by random partitioning ensures that when nodes are added to the cluster, the minimum possible set of data is affected. The hashing algorithm creates an MD5 hash value of the row key in the range 0 to 2^127. Each node in the cluster is assigned a token representing a position in this range, which determines the row keys placed on that node. For example, a row with the row key 'Prajeesh' (holding, say, the column value 'Scrum Master') is assigned a hash value like 98002736AD65AB, which determines the node whose range stores the row.

Notice that the keys are not in order. With RandomPartitioner, the keys are evenly distributed across the ring using hashes, but you sacrifice order, which means any range query needs to query all nodes in the ring.
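The mapping from row key to token can be sketched in a few lines of Python (an illustration of the idea, not Cassandra's internals): the token is derived from the MD5 hash of the key, so two keys that sort next to each other alphabetically land at unrelated positions on the ring.

```python
# Sketch, not Cassandra internals: RandomPartitioner derives a token
# from the MD5 hash of the row key, so placement depends on the hash
# value rather than on the key's natural sort order.
import hashlib

def md5_token(row_key):
    """Map a row key to a token in the range [0, 2^127)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** 127)

# Alphabetically adjacent keys get unrelated tokens:
print(md5_token("Albert"))
print(md5_token("Alice"))
```

This is exactly why a range query over row keys must consult every node when using RandomPartitioner: consecutive keys are scattered across the ring.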

Order Preserving Partitioners
The Order Preserving Partitioners preserve the order of the row keys as they are mapped into the token space. This allows range scans over rows, meaning you can scan rows as though you were moving a cursor through a traditional index. For example, if your application has user names as the row key, you can scan rows for users whose names fall between Albert and Amy. This type of query would not be possible with randomly partitioned row keys, since the keys are stored in the order of their MD5 hash (not sequentially).
An advantage of using OPP is that range queries are simplified: the query need not consult each node in the ring to fetch the data, but can go directly to the nodes holding the relevant range of row keys.
A disadvantage of using OPP is that the ring can become unbalanced over time. If your application tends to write or update a sequential block of rows at a time, those writes are not distributed across the cluster but concentrated on a single node. That node ends up holding more data than the rest, disturbing the even distribution of data across nodes.
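The Albert-to-Amy scan described above can be sketched in plain Python (a hypothetical illustration of why order preservation helps, not Cassandra code): once keys are kept in sorted order, a range scan touches only a contiguous slice.

```python
# Sketch of an OPP-style range scan: with row keys stored in sorted
# order, fetching all users between two bounds is a contiguous slice
# rather than a scatter across every node.
import bisect

user_keys = sorted(["Albert", "Alice", "Amy", "Bob", "Zoe"])

def range_scan(keys, start, end):
    """Return all keys between start and end, inclusive."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return keys[lo:hi]

print(range_scan(user_keys, "Albert", "Amy"))  # ['Albert', 'Alice', 'Amy']
```

Under RandomPartitioner the same query would have to hash and probe every key range, since MD5 destroys the alphabetical adjacency the slice relies on.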