Sunday, June 10, 2012

Using Apache Cassandra with .NET - Part 1

With the uninterrupted growth of unstructured and semi-structured data, storing, supporting and maintaining that information has become a major challenge. Loading data of this type into a relational database for analysis would take too much time and cost too much money. The demand for handling huge data volumes and scaling elastically with the desired performance has led to the concept of big data. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data.
Apache Cassandra is a standout among the NoSQL/post-relational database solutions on the market for many reasons. Some of the core features of Cassandra are:
  • Highly scalable peer-to-peer architecture based on the best of Amazon Dynamo and Google BigTable. Cassandra is widely considered a NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data.
  • Increased throughput for both read and write operations as nodes are added to the cluster. Data is replicated to multiple nodes to protect against loss during node failure, and new machines can be added incrementally while online to increase the capacity and data protection of your Cassandra cluster.
  • Data safety through its append-only commit log. Users no longer have to trade off durability to keep up with immense write streams.
  • Transparent fault detection and recovery using gossip protocols to detect machine failure and recover when a machine is brought back into the cluster – all without your application noticing.
  • Support for multiple data centres, with simple configuration options for setting how many copies of your data you want in each data centre.
  • Caching on each node, which simplifies development.
  • Incremental and dynamic expansion, as the Cassandra ring lets you add nodes easily without manually migrating data from one node to another.
  • Cassandra runs on commodity machines and requires no expensive or special hardware.
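To make the ring idea in the list above more concrete, here is a minimal C# sketch of consistent hashing. It is an illustration only, not Cassandra's actual implementation: the Ring class, its method names, the single token per node and the 64-bit truncation of the MD5 hash are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical, simplified ring: each node owns one token, and a row key is
// stored on the first nodes found when walking clockwise from the key's token.
public sealed class Ring
{
    // Token -> node name, kept sorted so the ring can be walked in order.
    private readonly SortedDictionary<ulong, string> nodesByToken =
        new SortedDictionary<ulong, string>();

    // Hash a string to a position on the ring (first 8 bytes of its MD5 hash).
    public static ulong Token(string value)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(value));
            return BitConverter.ToUInt64(hash, 0);
        }
    }

    // Adding a node only inserts one more token; existing tokens do not move.
    public void AddNode(string name)
    {
        nodesByToken[Token(name)] = name;
    }

    // Returns up to replicationFactor nodes, starting at the key's position
    // and wrapping around to the start of the ring.
    public List<string> ReplicasFor(string rowKey, int replicationFactor)
    {
        ulong keyToken = Token(rowKey);
        return nodesByToken.Where(p => p.Key >= keyToken)
            .Concat(nodesByToken.Where(p => p.Key < keyToken))
            .Select(p => p.Value)
            .Take(replicationFactor)
            .ToList();
    }
}

public static class RingDemo
{
    public static void Main()
    {
        var ring = new Ring();
        ring.AddNode("node-a");
        ring.AddNode("node-b");
        ring.AddNode("node-c");

        // Two copies of the row: one on the owning node, one on the next node.
        Console.WriteLine(string.Join(", ", ring.ReplicasFor("tweet-1", 2)));

        // After a node is added, a key either keeps its owner or moves to the new node.
        ring.AddNode("node-d");
        Console.WriteLine(string.Join(", ", ring.ReplicasFor("tweet-1", 2)));
    }
}
```

In this sketch, adding a node changes the owner only for keys that fall between the new node's token and the token before it, which is why no manual migration between the other nodes is needed.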

Cassandra’s data model
The Cassandra data model has four main concepts: cluster, keyspace, column and column family.
  • A cluster is the outermost structure in Cassandra (also called the ring). Cassandra is designed to be spread across several machines that function together and appear as a single instance to the end user. Cassandra allocates data to nodes in the cluster by arranging them in a ring. A cluster contains many nodes (machines) and can contain multiple keyspaces.
  • A keyspace is the top-level element of a schema. It is a container that groups multiple column families, typically one keyspace per application.
  • A column contains a name, value and timestamp. A column must have a name, and the name can be a static label (such as “name” or “email”) or it can be dynamically set when the column is created by your application.
  • A column family contains multiple columns referenced by row keys. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns.
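The concepts above can be sketched as a few nested in-memory collections. This is a hedged illustration of the shape of the model, not how Cassandra stores data: the Column, ColumnFamily and Keyspace classes and their members are hypothetical names for this example, values are reduced to strings, and the rule that the write with the newer timestamp wins is shown in its simplest form.

```csharp
using System;
using System.Collections.Generic;

// A column is a (name, value, timestamp) triple.
public sealed class Column
{
    public string Name { get; }
    public string Value { get; }
    public long Timestamp { get; }

    public Column(string name, string value, long timestamp)
    {
        Name = name;
        Value = value;
        Timestamp = timestamp;
    }
}

// A column family maps each row key to that row's own set of columns.
public sealed class ColumnFamily
{
    private readonly Dictionary<string, SortedDictionary<string, Column>> rows =
        new Dictionary<string, SortedDictionary<string, Column>>();

    // Insert or overwrite a column; an older timestamp never replaces a newer one.
    public void Set(string rowKey, Column column)
    {
        if (!rows.TryGetValue(rowKey, out var columns))
        {
            columns = new SortedDictionary<string, Column>(StringComparer.Ordinal);
            rows[rowKey] = columns;
        }
        if (!columns.TryGetValue(column.Name, out var existing) ||
            column.Timestamp >= existing.Timestamp)
        {
            columns[column.Name] = column;
        }
    }

    // Returns the named column, or null when the row or column does not exist.
    public Column Get(string rowKey, string columnName)
    {
        if (rows.TryGetValue(rowKey, out var columns) &&
            columns.TryGetValue(columnName, out var column))
        {
            return column;
        }
        return null;
    }

    // Returns the row's columns ordered by name, or nothing for an unknown row key.
    public IEnumerable<Column> GetRow(string rowKey)
    {
        return rows.TryGetValue(rowKey, out var columns)
            ? (IEnumerable<Column>)columns.Values
            : new Column[0];
    }
}

// A keyspace groups column families, typically one keyspace per application.
public sealed class Keyspace
{
    private readonly Dictionary<string, ColumnFamily> families =
        new Dictionary<string, ColumnFamily>();

    public string Name { get; }

    public Keyspace(string name)
    {
        Name = name;
    }

    public ColumnFamily GetOrAddColumnFamily(string name)
    {
        if (!families.TryGetValue(name, out var family))
        {
            family = new ColumnFamily();
            families[name] = family;
        }
        return family;
    }
}

public static class DataModelDemo
{
    public static void Main()
    {
        var tweets = new Keyspace("Twitter").GetOrAddColumnFamily("Tweets");

        // Rows in the same column family may carry different columns.
        tweets.Set("tweet-1", new Column("text", "Hello Cassandra", 1));
        tweets.Set("tweet-1", new Column("user", "someone", 1));
        tweets.Set("tweet-2", new Column("text", "Only one column here", 2));

        foreach (var column in tweets.GetRow("tweet-1"))
        {
            Console.WriteLine(column.Name + " = " + column.Value);
        }
    }
}
```

The row key plays the role of a primary key, while the columns inside a row are looked up by name, which is why two rows in the same column family need not share a schema.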

Configuring Cassandra using DataStax OpsCenter
DataStax OpsCenter is a visual management and monitoring solution for big data platforms, designed to manage and monitor Apache Cassandra™ database clusters. Through an intuitive point-and-click interface, a user can understand the state of one or more clusters through a centralized dashboard that provides an at-a-glance view of the status, activity, and issues across all monitored clusters. Users can easily create new database objects, manage maintenance activities, and perform other database tasks such as backups in a visual manner.

You can add a keyspace to the available cluster from the OpsCenter interface.

After adding the keyspace, you can add a column family to it.

Later you can also use the CLI to change the schema structure or add new data to the column family.

Accessing Cassandra in .NET
Now let’s write some code to access the tweets we stored in the Tweets column family. We’ll be using Cassandraemon, a LINQ client for Apache Cassandra.
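// Requires System.Linq (for Count()) and the Cassandraemon client library.
public void OpenAConnectionTest()
{
    // Connect to the local node on port 9160 and use the "Twitter" keyspace.
    using (var context = new CassandraContext("localhost", 9160, "Twitter"))
    {
        // Query the "text" column in the "Tweets" column family.
        var tweets = from x in context.ColumnList
                     where x.ColumnFamily == "Tweets"
                     && x.Column == "text"
                     select x.ToObject<Tweet>();
        Assert.IsTrue(tweets.Count() > 0);
    }
}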


Jonathan Ellis said...

Thanks for the post, Prajeesh!

Out of curiosity, what led you to choose Cassandraemon over Fluent Cassandra?

Prajeesh Prathap said...

Hi Jonathan,
I landed on Cassandraemon when I was looking for a .NET library for Cassandra. It looks like Fluent Cassandra is trending and is simpler. Thanks for bringing this to my notice.

dokieboy said...

We chose to use Fluent Cassandra as a better fit for our .NET code, and it has worked well.

Unknown said...

Coming late to this article, it looks like Cassandraemon more closely matches "the modern .NET way to DAL" by using 'using' and IQueryable/LINQ, whereas "Fluent Cassandra" looks to be a more proprietary alien thing that looks like it was a direct translation from a foreign tongue (i.e. Java).

Unknown said...

The "official" client is simple DataStax provider at