Saturday, July 31, 2010

Adding an Author index to our Bloggy App

There’s one more point about Arin’s design (WTF is a SuperColumn? An Intro to the Cassandra Data Model), which lets you search by tag, or for all posts by searching on a default tag (“__notag__”). But what if (as is likely) we want to get all posts by one author? The answer is to add a new ColumnFamily to our keyspace that looks the same as the TaggedPosts ColumnFamily but uses the author’s name as the row key. So it will look like this:
AuthorPosts : { // CF
    // blog entries created by "Andy"
    Andy: {  // Row key is the author name
        // column names are TimeUUIDType, value is the row key into BlogEntries
        timeuuid_1 : i-got-a-new-guitar,
        timeuuid_2 : another-cool-guitar,
    },
    // blog entries created by any author
    _All-Authors_: {  // Row key is the made-up "all authors" name
        // column names are TimeUUIDType, value is the row key into BlogEntries
        timeuuid_1 : i-got-a-new-guitar,
        timeuuid_2 : another-cool-guitar,
    }
}
We’ve used a made-up row key, _All-Authors_, for a row that stores all posts from all authors. In the conf file we add a column family definition like this:
<ColumnFamily CompareWith="TimeUUIDType" Name="AuthorPosts"/>
We can add the post indexes to our ColumnFamily like this:
ColumnPath authorsColumnPath = new ColumnPath("AuthorPosts");

authorsColumnPath.setColumn(asByteArray(timeUUID));
ks.insert(authorValue, authorsColumnPath, slugValue.getBytes());
// And do it for the all-authors row
ks.insert("_All-Authors_", authorsColumnPath, slugValue.getBytes());
Here authorValue is a string containing the author’s name that we used earlier in the code. timeUUID was created earlier in the code when we added the TaggedPosts columns. See the previous post for details of creating this value.

The interesting thing about this is that we are using ColumnFamilies as indexes. In traditional SQL we would simply have done something like “Select * from Posts where Author like ‘Andy’ order by postdate”. Here in Cassandra we create the indexes ourselves as Column Families, which predetermines how we can search the data. Careful design is needed, I think!
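To make the "ColumnFamily as index" idea concrete, here is a minimal in-memory sketch of the AuthorPosts layout. It is not Cassandra or Hector code: the class and method names are made up for illustration, a plain long timestamp stands in for the TimeUUID column name, and a TreeMap plays the part of the sorted columns in a row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the AuthorPosts column family (illustration only).
final class AuthorIndexSketch {
    static final String ALL_AUTHORS = "_All-Authors_";

    // row key -> (column name -> value). The TreeMap keeps columns sorted by
    // name, which is what CompareWith does on the server. A long timestamp is
    // an assumption standing in for the TimeUUID; unlike a TimeUUID, two posts
    // with the same timestamp would overwrite each other here.
    private final Map<String, TreeMap<Long, String>> rows = new HashMap<>();

    // Every post is written twice: once under its author, once under the
    // made-up all-authors row.
    void addPost(String author, long postTime, String slug) {
        rows.computeIfAbsent(author, k -> new TreeMap<>()).put(postTime, slug);
        rows.computeIfAbsent(ALL_AUTHORS, k -> new TreeMap<>()).put(postTime, slug);
    }

    // Returns up to 'count' slugs from one row, newest first.
    List<String> latest(String rowKey, int count) {
        List<String> slugs = new ArrayList<>();
        TreeMap<Long, String> row = rows.get(rowKey);
        if (row == null) {
            return slugs;
        }
        for (String slug : row.descendingMap().values()) {
            if (slugs.size() == count) {
                break;
            }
            slugs.add(slug);
        }
        return slugs;
    }
}
```

The only queries this structure answers cheaply are "latest posts for one row key", which is exactly the point: the search has been decided at write time.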

Creating the TaggedPosts Column Family

Now it’s time to deal with the TaggedPosts Column Family. I like to think of this as the indexing mechanism for our application: it’s this Column Family that allows us to get all posts, or the posts for a particular tag. Because the column names are TimeUUIDType, Arin (whose design we are working from, remember: WTF is a SuperColumn? An Intro to the Cassandra Data Model) points out that getting the latest 10 entries is going to be very efficient.

So our entries for this Column family are going to look like:

Tag: {
    TimeofPost: TitleofPost,
    TimeofPost: TitleofPost,
}

Also remember that Arin’s design has denormalised the tags in the blog entry so they look like tag1,tag2,tag3. In our test code we’ll use an array of tags for our test entry.

First up we are going to need a ColumnPath for this Column family:

ColumnPath tagsColumnPath = new ColumnPath("TaggedPosts");

So here’s the code:

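// columnName and value are the String variables declared earlier in the code
String Tags[] = {"Daily", "Ramblings", "_No-Tag_"};
columnName = "tags";
value = "";
for (int i = 0; i < Tags.length; i++) {
    // Build the denormalised, comma-separated tag list for the blog entry
    value = value + Tags[i] + ",";
    // Index the post under this tag: column name is the time UUID, value is the slug
    String tagKey = Tags[i];
    tagsColumnPath.setColumn(asByteArray(timeUUID));
    ks.insert(tagKey, tagsColumnPath, slugValue.getBytes());
}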


The only point to note here is that slugValue was stored earlier in the code and is essentially the title of the post.
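The loop above leaves a trailing comma on the denormalised tag string. As a small aside, here is one way the tag list could be joined and split with the standard library only. The class name is hypothetical, and this is just a sketch of the string handling, not part of the Cassandra calls.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Helper for the comma-separated "tags" value (illustrative sketch).
final class TagListSketch {
    // Joins tags as tag1,tag2,tag3 with no trailing comma.
    static String join(List<String> tags) {
        return String.join(",", tags);
    }

    // Splits a stored value back into tags. String.split drops trailing empty
    // strings, so a value that ends in a comma is handled as well.
    static List<String> split(String stored) {
        if (stored.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(stored.split(","));
    }
}
```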

Now, there is one major point to note: the timeUUID. There are some problems creating this value, which is essentially the time of the post. For details of the problems see:

http://wiki.apache.org/cassandra/FAQ#working_with_timeuuid_in_java

Essentially, to create this UUID we are going to use Johann Burkard’s UUID library, available from http://johannburkard.de/software/uuid/, and some of the code detailed in the Apache Cassandra FAQ. So our timeUUID is generated as:

java.util.UUID timeUUID=getTimeUUID();

Where getTimeUUID() is taken from the Cassandra FAQ:

public static java.util.UUID getTimeUUID()
{
     return java.util.UUID.fromString(new com.eaio.uuid.UUID().toString());
}

And that’s all we need to create the TaggedPosts Column Family.
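The code above calls asByteArray(timeUUID) without showing it. The version in the Cassandra FAQ may differ in detail, so treat the following as one plausible sketch: it packs the two 64-bit halves of a java.util.UUID into 16 bytes, most significant half first. The class name and the reverse helper are additions for illustration.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Converts a java.util.UUID to and from its 16-byte form (illustrative sketch).
final class UuidBytesSketch {
    // Most significant 8 bytes first, then the least significant 8 bytes.
    static byte[] asByteArray(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Rebuilds the UUID from the 16 bytes produced by asByteArray.
    static UUID fromByteArray(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }
}
```

A quick sanity check is to call version() on the UUID before storing it: a time-based UUID reports version 1, while UUID.randomUUID() gives version 4, which is not what a TimeUUIDType column is meant to hold.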

Friday, July 30, 2010

Trouble with Time UUIDs and Java

This is a placeholder: I'm having problems generating time UUIDs and passing them to Cassandra. Currently I'm looking at:

http://wiki.apache.org/cassandra/FAQ#working_with_timeuuid_in_java

for an answer.

Starting to write a Cassandra app in Java

I’m going to explore using Java to create an application that uses Cassandra as a datastore. To do this I’m going to implement the Bloggy App that is described in Arin Sarkissian’s introduction to Cassandra:

WTF is a SuperColumn? An Intro to the Cassandra Data Model

Creating the keyspace

Now, assuming you’ve got Cassandra up and running, you’ll need to create the keyspace for the app, which describes the column families and other config (such as sorting options on the columns). You’ll need to read Arin’s web page for more detail, but here, from that page, is the config that needs to be added to storage-conf.xml to create the keyspace. You’ll need to do this on each node in your cluster, and you’ll need to restart Cassandra on each node for the keyspace to be created. Add this to the Keyspaces section of the file:

<Keyspace Name="BloggyAppy">

<!-- other keyspace config stuff -->
<!-- This is a test app from : http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model -->

<!-- CF definitions -->
<ColumnFamily CompareWith="BytesType" Name="Authors"/>

<ColumnFamily CompareWith="BytesType" Name="BlogEntries"/>
<ColumnFamily CompareWith="TimeUUIDType" Name="TaggedPosts"/>
<ColumnFamily CompareWith="TimeUUIDType" Name="Comments"
    CompareSubcolumnsWith="BytesType" ColumnType="Super"/>


<ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>

<!-- Number of replicas of the data -->

<ReplicationFactor>2</ReplicationFactor>

<!--
~ EndPointSnitch: Setting this to the class that implements
~ AbstractEndpointSnitch, which lets Cassandra know enough
~ about your network topology to route requests efficiently.
~ Out of the box, Cassandra provides org.apache.cassandra.locator.EndPointSnitch,
~ and PropertyFileEndPointSnitch is available in contrib/.
-->
<EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>

</Keyspace>

Writing data to the keyspace


I’m planning on using Java to create my application, so I’ll need a way to connect to the database. Cassandra uses Thrift as its API, but I’ll use a higher-level client, in this case Hector. Download the latest version from Hector Downloads and make sure the files are in your classpath. There are a couple of example files (and the code here is very heavily based on these examples) on the GitHub wiki. Also look in the test section of the source code on GitHub for more examples.

More info on Hector is here http://prettyprint.me/2010/02/23/hector-a-java-cassandra-client/

Connecting to the database

Connecting to the database is nice and easy: get a pool instance and borrow a client.
CassandraClientPool pool = CassandraClientPoolFactory.INSTANCE.get();
CassandraClient client = pool.borrowClient("xxx.yy.36.151", 9160);

Remember to release the connection once you’re done with it:

pool.releaseClient(client);

Writing an entry

Before we can write anything to Cassandra we need to set the keyspace we are going to use. In this case we are going to use our blog application keyspace, BloggyAppy:


Keyspace ks = client.getKeyspace("BloggyAppy");

Suppose we want to add a “record” (to borrow from RDBMS terms); in this case let’s add an author record to the Authors column family. First get a column path to the Authors column family:

ColumnPath columnPath = new ColumnPath("Authors");

So what we want to do is add a number of “fields” (which are name/value pairs) to our “record”. Suppose our “record” is going to look like this:


Andy
Tel == 01555 XXXXX
Email == andy@blogspot.org
Address == Blogspot

“Andy” is going to be our key, and Tel, Email and Address are each columns (name:data pairs) in that key. So to add the Andy key with an email address:

String key = "Andy";
String columnName = "Email";
String value = "andy@blogspot.org";

columnPath.setColumn(columnName.getBytes());
ks.insert(key, columnPath, value.getBytes());

So here we set the column path (Email) and then add this column path, with a value, to the key (Andy). Note that the value is stored as an array of bytes. We can go on like this to set the telephone number:

columnName = "Tel";
value = "01555 XXXXX";
columnPath.setColumn(columnName.getBytes());
ks.insert(key, columnPath, value.getBytes());

If we want to add a new “record” (say for Joe), just change the key (key = "Joe") and start adding “fields”. Note we haven’t defined how many fields a key has or what the fields are. They are added as needed, and not all may be present. This is a major difference from a traditional RDBMS. One last thing: our bloggy app (as defined in Arin’s article) needs a pubDate in a blog entry key. This needs to be stored as unixtime. We can do that like this:

columnName = "pubDate";
long now = System.currentTimeMillis();
// Convert the long to its decimal string form before storing it
value = Long.toString(now);
columnPath.setColumn(columnName.getBytes());
ks.insert(key, columnPath, value.getBytes());

The important thing is that we convert the long value (now) to a string before inserting it into the key.
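Here is a small standalone sketch of that conversion, with the reverse step for when we read the value back. The class and method names are made up. One thing to be aware of: System.currentTimeMillis() returns milliseconds, whereas "unixtime" usually means seconds, so the sketch includes the division in case seconds are what readers of the column expect.

```java
import java.nio.charset.StandardCharsets;

// Encodes a timestamp as the bytes of its decimal string, and back (sketch).
final class PubDateSketch {
    // long -> "1280577600000" -> bytes, ready to be used as a column value.
    static byte[] encode(long epochMillis) {
        return Long.toString(epochMillis).getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> string -> long, for when the column is read back.
    static long decode(byte[] stored) {
        return Long.parseLong(new String(stored, StandardCharsets.UTF_8));
    }

    // Milliseconds to whole seconds since the epoch.
    static long toEpochSeconds(long epochMillis) {
        return epochMillis / 1000L;
    }
}
```

Naming the charset explicitly avoids depending on the platform default, which a bare getBytes() call does.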

Next time: starting to get some of this info out of Cassandra.

Thursday, July 29, 2010

A very simple 2 node Cassandra cluster on Windows XP

Today I’ve been setting up a tiny Cassandra cluster in our teaching lab. I’m (for my sins) running this on a couple of Windows XP boxes. This means that for now Cassandra needs to be run from the command prompt and the machine left running. Setting up on Windows is fine, just make sure your JAVA_HOME is set correctly before running. To get more than one machine talking to each other do the following.

Open the storage-conf.xml file and look for ListenAddress. Change this to be the IP address of the machine you’re working on:

<ListenAddress>xxx.yyy.36.151</ListenAddress>
<!-- internal communications port -->
<StoragePort>7000</StoragePort>

Do this on both machines. Now look for the Seeds config entry. Change this so it lists the IP addresses of both machines.

<Seeds>
<Seed>xxx.yyy.36.151</Seed>
<Seed>xxx.yyy.36.150</Seed>
</Seeds>
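I also changed the number of replicas of the data to the number of machines, though to be frank I’m not quite sure I needed to (ReplicationFactor sets how many copies of each row are kept, so it does not have to match the number of nodes).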

<ReplicationFactor>2</ReplicationFactor>

One other thing: before you are tempted to start either machine, change the ClusterName. I’ve found that trying to change the cluster name after starting Cassandra can cause problems.

Tuesday, July 27, 2010

Reset vs Cancel in HTML forms

I’ve been in an interesting discussion today on Twitter about the use of the reset button in forms. The questioner asks if buttons should be [cancel][submit] or [submit][cancel]. My objection is that the cancel button should actually be [reset]. The questioner countered that all OS dialog boxes have a cancel button, not a reset. That may be true, but when creating web pages we are not dealing with OS dialog boxes. The difference is simple: with an OS dialog box, hitting cancel closes the box; hitting an HTML reset button clears the form.

Jakob Nielsen has a post dating from 2000, Reset and Cancel Buttons, in which he argues that the reset button is bad and shouldn’t be used. His main problem is that most designers put the reset button next to the submit button, so it can be hit by mistake. He also argues that the reset button isn’t really needed: who needs to clear an entire form and start again? There is also a chance that having the reset button there will slow users down. Reset does seem unneeded in most cases, provided the user can return each element of a form to its default state.

But what about an explicit Cancel button that closes the form and returns the user to a default page? This would be the equivalent of an OS dialog box, so it would typically be followed by an “Are you sure? yes/no” dialog box, which would need to repopulate the form if “no” was selected. In my opinion a cancel button is useful for:

1: Pop-out forms, where the cancel button just closes the form.
2: Multi-form pages with the user filling in a lot of information. A confirm cancellation button is really important here.

So reset buttons should be used sparingly: does a user really need to clear the entire form? Cancel buttons should be used to make sure the transaction/form is really cancelled and to return the user to the default or last non-form page.

Tuesday, July 20, 2010

A few Cassandra links

Introduction to Apache Cassandra:

http://www.nosqldatabases.com/main/2010/7/13/introduction-to-apache-cassandra.html

Cassandra: Principles and Application (pdf paper)

Cassandra: Principles and Application

A Quick Introduction to the Cassandra Data Model:

http://maxgrinev.com/2010/07/09/a-quick-introduction-to-the-cassandra-data-model/

Do You Really Need SQL to Do It All in Cassandra?

http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/

Update Idempotency: Why It is Important in Cassandra Applications

http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/

And of course the famous WTF is a SuperColumn? An Intro to the Cassandra Data Model

http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model

A collection of excellent articles on using Java with Cassandra:

http://www.sodeso.nl/?p=80


Cassandra: Fact vs fiction

Distributed deletes in the Cassandra database

Please add more in the comments

How bad cache data in the DNS affected iPhones

This isn't really about web programming, but it's the sort of annoying thing that might affect any website you're running in the craziest way.

Imagine the problem: all your clients can see your websites, or so you think. Then you notice that if your iPhone is connecting via 3G your website is inaccessible. Everything works on wireless though. This happened to me recently and it wasn't just my iPhone. Other websites (including ones that are in the parent domain to mine) were all accessible. It was crazy!

So I downloaded an app, iNetFactory, that allowed me to do an nslookup. This showed that the site in question could not be resolved in the DNS. Other sites running off the same DNS server were resolving just fine. This was puzzling.

To cut a long story short, after peering at the DNS configuration (it's a Windows DNS server) I decided to look at the server's cache. To do this on a Windows DNS box you'll need to open dnsmgmt and click on View/Advanced. Drilling down to my parent domain, I noticed that the cache contained an entry for my domain (which it shouldn't). The entries in it were well formed, but incorrect. I decided to nuke the cache (right click on Cached Lookups and choose Clear Cache). This cured the problem.

It's not pretty and I'm not sure how the bad entry got in there, but I've had reports from other iPhone users that the sites are now back and accessible. It's possible this was a case of cache poisoning, or possibly a machine with an old entry (it did resemble an old entry) had managed to do it.

I'm keeping an eye on it !

Tuesday, July 6, 2010

Final step for handling put data, URL decoding

As we saw earlier, we can get the PUT data from the body of the HTTP request and decode it into name/value pairs. However, our values are URL encoded. That is, spaces have been converted to "+" characters and other characters are encoded in %XX hex style. See:

URL encoding at Wikipedia

We need to decode this into plain text. Fortunately the standard java.net package has a URLDecoder class that will do the job for us:

URLDecoder documentation at Sun.com

This has two methods. The simpler one (taking only the string to decode as an argument) has been deprecated, so we won't use it. The second method takes the string to be decoded and a string naming the character encoding. This is usually (but not always) UTF-8. Both methods are static, so there is no need to create a URLDecoder instance. So our code to decode the PUT values is now:

System.out.println("String was " + URLDecoder.decode((String) hm.get("Software"), "UTF-8"));

Remember from last time the name/value pair is stored in a hashmap (here hm). "Software" is the name of the field we are going to retrieve.

One last thing to do before sending this off for storing in a database. In order to avoid cross-site scripting attacks we should escape any HTML in the value field. This is to stop users putting text such as <script>alert("test")</script> into the input. We'll use the StringEscapeUtils class from Apache Commons Lang to deal with this:

String escape utils

Our code for dealing with the name/value pairs now looks like:

String Software = org.apache.commons.lang.StringEscapeUtils.escapeHtml(
      URLDecoder.decode((String) hm.get("Software"), "UTF-8"));
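To see the two steps in isolation, here is a self-contained sketch that needs nothing beyond the JDK. The decode step uses the real java.net.URLDecoder. The escapeHtml method is a hand-rolled stand-in covering only the five characters that matter most in HTML, so it is not the Commons Lang implementation and should not be taken as equivalent to it. The class name is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// URL-decodes a form value and then escapes it for HTML (illustrative sketch).
final class DecodeAndEscapeSketch {
    // Turns "SQL+Server" into "SQL Server" and %XX sequences into characters.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    // Minimal stand-in for an HTML escaper: handles & < > " and ' only.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Decode first, then escape; the other order would leave %3C to become < afterwards.
    static String clean(String encoded) {
        return escapeHtml(decode(encoded));
    }
}
```

The order matters: escaping before decoding would let an encoded script tag slip through untouched.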

Decoding PUT data

As we saw yesterday, for an HTTP PUT command the data arrives in the body of the HTTP request. A simple way to read that data, as we saw, was:

InputStream is = request.getInputStream();
char ch;
for (int i = 0; i < request.getContentLength(); i++) {
    ch = (char) is.read();
    System.out.print(ch);
}

which will just read the data and send it to stdout. However, what we want to do is get the data and use it as if it had been sent over as standard parameters. If you look at the data output from the above code you'll see it is sent as name/value pairs delimited by &. So if our data is being sent from the jQuery ajax call as follows:

data: { Module: $('#Module').val(), Software: $('#Software').val()}

This will be encoded as (for example)

Module=ac31004&Software=SQL+Server

Notice that spaces have been encoded as + characters.

We need to decode this, turning the input into name/value pairs which can be sent to our database update code. There are probably lots of ways to do this, some more efficient than others, but we'll look at one way using a hashmap.

In your servlet code create a hashmap as a field of the servlet class (remember to import it: import java.util.HashMap;):

private HashMap hm = new HashMap();

Now in our servlet init method add entries for the names of the input fields in the original HTML form:

hm.put("Module", "");
hm.put("Software", "");

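In our put method read the contents of the request body into a byte array and convert it to the string (input) that we will tokenise:

InputStream is = request.getInputStream();
byte Buffer[] = new byte[request.getContentLength()];
is.read(Buffer);
// Convert the raw bytes to the string used by the tokeniser below
String input = new String(Buffer, "UTF-8");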
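One caveat with a single read call: InputStream.read(byte[]) is allowed to return before the buffer is full, and getContentLength() returns -1 when the length is not known. A slightly more defensive sketch, using only java.io and with made-up class and method names, might look like this:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Reads a request body completely, however the stream chunks it (sketch).
final class ReadBodySketch {
    static byte[] readBody(InputStream in, int contentLength) {
        try {
            if (contentLength < 0) {
                // Length unknown: keep reading until end of stream.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
                return out.toByteArray();
            }
            // Length known: loop until that many bytes have arrived.
            byte[] buffer = new byte[contentLength];
            int offset = 0;
            while (offset < contentLength) {
                int n = in.read(buffer, offset, contentLength - offset);
                if (n == -1) {
                    break; // stream ended early
                }
                offset += n;
            }
            // Trim in case the body was shorter than advertised.
            return Arrays.copyOf(buffer, offset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the servlet this would be called as readBody(request.getInputStream(), request.getContentLength()), and the result turned into a string as before.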

We can now split this into name value pairs by string tokenising on the & character:

StringTokenizer st = new StringTokenizer (input,"&");

We can now read through all these pairs and string tokenise on the "=" character. We can then assume that the first of the pair is the name and the second the value. Using the hashmap we created earlier, we can look to see if the name is in the hashmap and, if it is, replace the value with the one we've just got from the name/value pair. Doing this we restrict the input to only those fields we defined in the init method when we set up the hashmap. Here's the code:

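StringTokenizer st = new StringTokenizer(input, "&");
while (st.hasMoreTokens()) {
    String inputPair = st.nextToken();
    StringTokenizer st2 = new StringTokenizer(inputPair, "=");
    // First token should be the name of the input field
    String name = st2.nextToken();
    // Second token is the value; it is missing when the field was left empty
    String var = st2.hasMoreTokens() ? st2.nextToken() : "";
    // Only accept fields that were registered in init()
    if (hm.containsKey(name)) {
        hm.put(name, var);
    }
}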


Finally, to use these values we can just get them from the hashmap:

String Software=(String)hm.get("Software");
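Putting the pieces together, here is a standalone sketch of the whole parse step that can be run outside a servlet. It differs from the code above in a few ways, all of them my own choices rather than anything required: it uses split and indexOf instead of StringTokenizer, it URL-decodes names and values as it goes, and it returns a fresh map per call instead of updating a field shared by every request (a servlet instance normally serves many requests at once, so a shared map can mix up values between users). The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Parses "a=1&b=2" style bodies, keeping only whitelisted field names (sketch).
final class FormBodySketch {
    static Map<String, String> parse(String body, Set<String> allowedNames) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            // Split on the first '=' only; a missing '=' means an empty value.
            int eq = pair.indexOf('=');
            String name = decode(eq < 0 ? pair : pair.substring(0, eq));
            String value = eq < 0 ? "" : decode(pair.substring(eq + 1));
            // Same restriction as the hashmap set up in init(): unknown names are ignored.
            if (allowedNames.contains(name)) {
                fields.put(name, value);
            }
        }
        return fields;
    }

    private static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }
}
```

For the example body Module=ac31004&Software=SQL+Server with Module and Software allowed, the map ends up with Module mapped to ac31004 and Software mapped to SQL Server.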

Monday, July 5, 2010

HTTP PUT, jquery and Java Servlets

If you are trying to create a RESTful interface then you need to implement the HTTP PUT method to allow updates. Now, browsers do not allow PUT (or DELETE) as a form method, so the easiest thing to do is use AJAX (XMLHttpRequest actually) to send over the data. You could hand-roll the XMLHttpRequest, but that's reinventing the wheel. Instead we can use jQuery:

http://api.jquery.com/jQuery.ajax/

So a PUT can be done as follows (Module and Software are the ids of input fields in our HTML; $("a") attaches the handler to the "a href" links):

$(document).ready(function() {
    // do stuff when DOM is ready
    $("a").click(function() {
        $.ajax({
            type: 'PUT',
            url: "/Courses/Software",
            processData: true,
            data: { Module: $('#Module').val(), Software: $('#Software').val() },
            error: function(data) {
                $('.result').html(data);
                alert('Error in put.' + data);
            },
            success: function(data) {
                $('.result').html(data);
                alert('Load was performed with ' + data);
            }
        });
        alert($('#Module').val() + " : " + $('#Software').val());
    });
});


The problem is how to handle this in the Java servlet. You are probably aware that you can use a doPut(HttpServletRequest request, HttpServletResponse response) method in the servlet to handle the HTTP PUT. The problem is how to get at the data. My first attempt was to just get the parameters:

System.out.println("Software:doPut"+request.getParameter("Module"));
System.out.println("Software:doPut"+request.getParameter("name"));

But that doesn't work. For PUT the data is in the body of the request. You can see this by looking at the content length:

System.out.println("Content length "+request.getContentLength());

So we need to read the body in our servlet (doPut) like this (simple example):

InputStream is = request.getInputStream();
char ch;
for (int i = 0; i < request.getContentLength(); i++) {
    ch = (char) is.read();
    System.out.print(ch);
}

I'll leave decoding this into name/value pairs until next time.