Friday, October 11, 2013

Mapping CQL's sets and maps to column families

In this post we are going to explore how CQL implements sets and maps in Cassandra’s column store.

(in a bizarre twist of fate, John Berryman. created this post http://www.planetcassandra.org/blog/post/understanding-how-cql3-maps-to-cassandras-internal-data--structure yesterday on the same subject, I swear I hadn't seen it when I started working on this post, yesterday as well !  It's just how it goes sometimes,  Johns post is great it has to be said !. )

In CQL version 3 wide tables have been supported through the use of sets, maps and lists.  These features have been supported since Cassandra 1.2 (http://www.datastax.com/dev/blog/cql3_collections) and should now be the de facto way of creating “wide tables”  the canonical example of sets is the use of multiple email addresses for a user .  In the relational world you might create a email address table with a foreign key pointing to the user id for each address.   This is going to cause a join just for any request that needs details of the user and their valid addresses. 

Suppose we create a simple keyspace in the usual fashion:

create keyspace Keyspace3 WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};

In a  Cassandra (from 1.2) you would create a table like this:

CREATE TABLE Users (   
    id uuid Primary Key,
    name text,    
    email_addresses set) ;

(This is similar to  Sylvain Lebresne’s example here http://www.datastax.com/dev/blog/cql3_collections)

We can insert data into the table (a user with 2 email addresses like this):

insert into users(id,name,email_addresses) values (88b8fd18-b1ed-4e96-bf79-4280797cba81,'tim',{'tim@example.org','timothy@example.org'});

This user has a UUID, a name and two email addresses.   You can of course get the email addresses with a select command:

select email_addresses from Users;

which will return the addresses as a set:

email_addresses
-------------------------------------------------------
            {'tim@example.org', 'timothy@example.org'}

However, how is this implemented in the column store ?  If you had used a thrift based interface (such as Hector) you may have created the column family and had the following structure:

Id: 88b8fd18-b1ed-4e96-bf79-4280797cba81 (Key)
    name: tim
  email_address: 'tim@example.org'
  email_address: 'timothy@example.org'

but how is it implemented in CQL3 ?  If you fire up Cassandra-cli you can use the list command to see what is stored in the column family:

LifeintheAirAge:bin Administrator$ ./cassandra-cli
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 2.0.0

Please consider using the more convenient cqlsh instead of CLI
CQL3 is fully backwards compatible with Thrift data; see http://www.datastax.com/dev/blog/thrift-to-cql3

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use keyspace3;
Authenticated to keyspace: keyspace3
[default@keyspace3] list users;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 88b8fd18-b1ed-4e96-bf79-4280797cba81
=> (name=, value=, timestamp=1381497810072000)
=> (name=email_addresses:74696d406578616d706c652e6f7267, value=, timestamp=1381497810072000)
=> (name=email_addresses:74696d6f746879406578616d706c652e6f7267, value=, timestamp=1381497810072000)
=> (name=name, value=74696d, timestamp=1381497810072000)

So we can see the rowkey as expected and the name of the user as a name value pair (the value is in ASCII in hex in this case 746976d is tim).

But for the email_addresses the values are not set.  The values of the email addresses is encoded into the name along with the “column” schema name.  name=email_addresses:74696d406578616d706c652e6f7267  is the column name email_addresses followed by tim@example.org in ASCII hex) .  Why do this ? Why not have the name as email_addresses and the value as the the hex email address ?  One reason perhaps is because this allows us to implement maps ina similar way with out needing special cases.   Suppose we alter table to include a map, we want to store details about our user, but we don’t yet know which details the user will provide (A contrived example I’ll grant you).  You can alter the table as follows:

alter table users add details map;

and insert some details as follows:

update users set details= {'tel' : '555 232341', 'twitter' : '@andycobley'} where id =88b8fd18-b1ed-4e96-bf79-4280797cba81;

What does our column now look like? Using the list command we get :

RowKey: 88b8fd18-b1ed-4e96-bf79-4280797cba81
=> (name=, value=, timestamp=1381498805511000)
=> (name=details:74656c, value=3031333832333435303738, timestamp=1381498805511000)
=> (name=details:74776974746572, value=40616e6479636f626c6579, timestamp=1381498805511000)

You can see the map key is stored with the column name  in the name part of the column family name value  pair. So  name=details:74656c contains ‘tel’ as a ASCII hex value.  The map value is simply stored in the value part of the column family name value pair.

So, we’ve seen how CQL3’s  maps and sets map on to the column family name value pairs by storing the CQL table’s column name in the name part of the column family name value pair.  It’s quite simple and elegant really.

(as ever I’m more than happy to receive corrections or further explanations !)

2 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. I truly like the method for your giving which is really amazing. I see deliberately by this post. Much appreciated and keep it up
    Best Website Design

    ReplyDelete