In the previous post we’ve seen how data distribution in Cassandra
works. But what if one of the nodes fails? If you want to ensure fault
tolerance, you need to enable data replication. When you create a keyspace in Cassandra, you must decide
the replica placement strategy: the number of replicas and how those replicas
are distributed across nodes in the cluster. The replication strategy relies on
the cluster-configured snitch to help it determine the physical location of
nodes and their proximity to each other.
The total number of replicas across the cluster is often
referred to as the replication factor. A
replication factor of 1 means there is only one copy of a row in a cluster. A
replication factor of 2 means there are two copies of a row stored in a
cluster. This replication factor is controlled at a key space level.
The different types of replication
strategies are:
Simple strategy:
This strategy places the original row on a node determined by a practitioner
and then a copy of the row is placed on the next node walking clock-wise in the
cluster.
Network topology
strategy: This is more sophisticated
strategy than the simple strategy. Network topology strategy gives you more
control over where the rows are placed in your database cluster. This allows
you to place different rows on racks in a data center or to carry out
geographical replication between multiple data centers. You can specify the
replication strategy when creating a keyspace like:
CREATE KEYSPACE your_keySpace WITH
strategy_class = 'NetworkTopologyStrategy' AND strategy_options:DC1 = 4; where DC1 = 4 specifies the replication factor.
Cassandra uses a snitch to define how nodes are grouped
together within the overall network topology. This again can be defined in the
configuration file cassandra.yaml
The four basic snitches you can
define are:
Simple snitch:
This is the default snitch used and uses a simple strategy of placing the copy
of the row on the next available node walking clockwise through the nodes.
Rack inferring snitch:
The rack inferring snitch automatically
tries to place copies of rows on different racks in your data center. The rack inferring
snitch uses the node's IP address to infer the network topology. The second
unit of the IP address is used to identify the data center where a node is
location and the third unit identifies the rack.
Property file snitch:
Is used to determine the location of the nodes by referring to the
cassandra-topology.properties file.
EC2 snitch: If
you want to use cassandra in the cloud, you have to use the EC2 snitch which
uses the AWS API to request the region and the respective availability zone.