Wednesday, October 30, 2013

Hadoop 2 on Ubuntu on Azure.

This is to be read in conjunction with http://ac31004.blogspot.co.uk/2013/10/installing-hadoop-2-on-mac_29.html

Fire up an Azure Ubuntu server and ssh to it.

Install a Java JDK (you'll need root privileges, hence sudo):
sudo apt-get install default-jdk

On your home machine, download a copy of Hadoop and secure copy it to the Azure machine (your username and machine name will be different):
scp hadoop-2.2.0.tar.gz user@Hadoopmachine.cloudapp.net:

Unzip it and untar it
gunzip hadoop-2.2.0.tar.gz
tar xvf  hadoop-2.2.0.tar

You'll still need to set up the environment variables:
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_INSTALL=/home/user/hadoop-2.2.0
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin


Also add JAVA_HOME and HADOOP_INSTALL, and change PATH, in /etc/environment. See http://trentrichardson.com/2010/02/10/how-to-set-java_home-in-ubuntu/ for details.
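As a rough sketch of what those /etc/environment entries could look like: the helper below (environment_lines is a made-up name, not part of Hadoop or Ubuntu) just prints the three lines so you can review them before editing the file. It assumes the same paths as the exports above. As far as I know /etc/environment holds plain KEY="value" pairs and does not expand $VARIABLES, so the PATH has to be written out in full.

```shell
# Hypothetical helper: print KEY="value" lines suitable for /etc/environment.
#   $1 = Hadoop install directory
#   $2 = the existing PATH value that should be extended
environment_lines() {
    hadoop_install=$1
    base_path=$2
    # JAVA_HOME path is the one used in the exports above (an assumption
    # about where default-jdk puts its symlink on your machine).
    printf 'JAVA_HOME="%s"\n' "/usr/lib/jvm/default-java"
    printf 'HADOOP_INSTALL="%s"\n' "$hadoop_install"
    # No $VAR expansion here on purpose: every path is spelled out in full.
    printf 'PATH="%s:%s/bin:%s/sbin"\n' "$base_path" "$hadoop_install" "$hadoop_install"
}

# Example: environment_lines /home/user/hadoop-2.2.0 "$PATH"
```

Replace the existing PATH line in /etc/environment with the printed one rather than adding a second PATH line, then log out and back in for the change to take effect.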

After setting up core-site.xml and hdfs-site.xml you'll need to make the namenode and datanode directories:

mkdir -p /home/hadoop/yarn/namenode
mkdir /home/hadoop/yarn/datanode

Everything else should be the same.

Tuesday, October 29, 2013

Installing Hadoop 2 on a Mac

I've had a lot of trouble getting Hadoop 2 and YARN running on my Mac. There are some tutorials out there but they are often for beta and alpha versions of the Hadoop 2.0 family. These are the steps I used to get Hadoop 2.2.0 working on my Mac running OS X 10.9.

Note: watch for version differences in this blog. It was written for Hadoop 2.2.0; we are currently on 2.6.2, so the version number will need to be changed throughout.

Get hadoop from http://www.apache.org/dyn/closer.cgi/hadoop/common/

Make sure JAVA_HOME is set (this example is for Java 6 on your machine):
export JAVA_HOME=`/usr/libexec/java_home -v1.6`
(Note: your Java version should be 1.7 or 1.8, so change the -v argument to match.)

point HADOOP_INSTALL to the hadoop installation directory
export HADOOP_INSTALL=/Applications/hadoop-2.2.0

And set the path
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

You can test that hadoop is found with:
hadoop version

Make sure ssh is set up on your machine:
System Preferences -> Sharing -> Remote Login is ticked

Try:
ssh <username>@localhost

where <username> is the name you used to log on.

In $HADOOP_INSTALL/etc/hadoop, these are the conf files I changed.

core-site.xml

 <configuration>  
 <property>  
   <name>fs.default.name</name>  
   <value>hdfs://localhost:9000</value>  
  </property>  
 </configuration>  


hdfs-site.xml

 <configuration>  
 <property>  
   <name>dfs.replication</name>  
   <value>1</value>  
  </property>  
  <property>  
   <name>dfs.namenode.name.dir</name>  
   <value>file:/Users/Administrator/hadoop/namenode</value>  
  </property>  
  <property>  
   <name>dfs.datanode.data.dir</name>  
   <value>file:/Users/Administrator/hadoop/datanode</value>  
  </property>  
 </configuration>  


Make the directories for the namenode and datanode data (note the file above and the mkdir below will need to reflect where you  want to store the files, I've stored mine in the home directory of the Administrator user on my Mac).

mkdir -p /Users/Administrator/hadoop/namenode
mkdir -p /Users/Administrator/hadoop/datanode

hadoop namenode -format

yarn-site.xml
 <configuration>  
 <!-- Site specific YARN configuration properties -->  
 <property>  
 <name>yarn.resourcemanager.address</name>  
 <value>localhost:8032</value>  
 </property>  
 <property>  
 <name>yarn.nodemanager.aux-services</name>  
 <value>mapreduce_shuffle</value>  
 </property>  
 </configuration>  


start-dfs.sh
start-yarn.sh
jps

should give
9430 ResourceManager
9325 SecondaryNameNode
9513 NodeManager
9225 DataNode
9916 Jps
9140 NameNode
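
If you'd rather not eyeball the list, here is a small sketch that reads the jps output and prints any of the five daemons above that are missing. The function name missing_daemons is my own invention for illustration; only jps itself and the daemon names come from the listing above.

```shell
# Hypothetical helper: read `jps` output on stdin and print the name of each
# expected Hadoop daemon that is not listed.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # jps prints "<pid> <name>"; compare the second field exactly so that
        # SecondaryNameNode is not mistaken for NameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Example: jps | missing_daemons
```

No output means all five daemons are running; anything printed is a daemon whose log file is worth a look.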

If not, check the log files. If the datanode has not started and you get an incompatible IDs error, stop everything, delete the datanode directory and recreate it.
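
The stop, delete and recreate steps can be sketched as a small shell function. This is only an illustration: reset_datanode_dir is a hypothetical name, and it assumes the Hadoop sbin scripts (stop-yarn.sh, stop-dfs.sh) are on your PATH as set up earlier. Be aware that deleting the datanode directory throws away any HDFS blocks stored in it.

```shell
# Hypothetical helper: stop Hadoop, then wipe and recreate a datanode directory.
#   $1 = the datanode directory (the dfs.datanode.data.dir path, without "file:")
reset_datanode_dir() {
    dir=$1
    # Refuse to run on an empty path or on the filesystem root.
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "usage: reset_datanode_dir <datanode-dir>" >&2
        return 1
    fi
    # Stop the daemons if the Hadoop scripts can be found on the PATH.
    if command -v stop-yarn.sh >/dev/null 2>&1; then stop-yarn.sh; fi
    if command -v stop-dfs.sh >/dev/null 2>&1; then stop-dfs.sh; fi
    # Delete the old directory (and its stale ID) and start with an empty one.
    rm -rf "$dir"
    mkdir -p "$dir"
}

# Example: reset_datanode_dir /Users/Administrator/hadoop/datanode
```

Afterwards run start-dfs.sh and start-yarn.sh again and recheck with jps.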

Try an ls:
hadoop fs -ls

if you get

ls: `.': No such file or directory

then there is no home directory in the hadoop file system.  So

hadoop fs -mkdir /user
hadoop fs -mkdir /user/<username>
where <username> is the name you are logged onto the machine with.

Now change to the $HADOOP_INSTALL directory and upload a file:

hadoop fs -put LICENSE.txt


finally try a mapreduce job:

cd share/hadoop/mapreduce
hadoop jar ./hadoop-mapreduce-examples-2.2.0.jar wordcount LICENSE.txt out

Friday, October 11, 2013

Mapping CQL's sets and maps to column families

In this post we are going to explore how CQL implements sets and maps in Cassandra’s column store.

(In a bizarre twist of fate, John Berryman created this post, http://www.planetcassandra.org/blog/post/understanding-how-cql3-maps-to-cassandras-internal-data--structure, yesterday on the same subject. I swear I hadn't seen it when I started working on this post, also yesterday! It's just how it goes sometimes. John's post is great, it has to be said!)

In CQL version 3, wide tables are supported through the use of sets, maps and lists. These collections have been supported since Cassandra 1.2 (http://www.datastax.com/dev/blog/cql3_collections) and should now be the de facto way of creating “wide tables”. The canonical example of a set is storing multiple email addresses for a user. In the relational world you might create an email address table with a foreign key pointing to the user id for each address. That forces a join for any request that needs details of the user and their valid addresses.

Suppose we create a simple keyspace in the usual fashion:

create keyspace Keyspace3 WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};

In Cassandra (from 1.2) you would create a table like this:

CREATE TABLE Users (   
    id uuid Primary Key,
    name text,    
    email_addresses set<text>);

(This is similar to  Sylvain Lebresne’s example here http://www.datastax.com/dev/blog/cql3_collections)

We can insert data into the table (a user with 2 email addresses) like this:

insert into users(id,name,email_addresses) values (88b8fd18-b1ed-4e96-bf79-4280797cba81,'tim',{'tim@example.org','timothy@example.org'});

This user has a UUID, a name and two email addresses.   You can of course get the email addresses with a select command:

select email_addresses from Users;

which will return the addresses as a set:

email_addresses
-------------------------------------------------------
            {'tim@example.org', 'timothy@example.org'}

However, how is this implemented in the column store? If you had used a Thrift based interface (such as Hector) you might have created the column family with the following structure:

Id: 88b8fd18-b1ed-4e96-bf79-4280797cba81 (Key)
    name: tim
    email_address: 'tim@example.org'
    email_address: 'timothy@example.org'

But how is it implemented in CQL3? If you fire up cassandra-cli you can use the list command to see what is stored in the column family:

LifeintheAirAge:bin Administrator$ ./cassandra-cli
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 2.0.0

Please consider using the more convenient cqlsh instead of CLI
CQL3 is fully backwards compatible with Thrift data; see http://www.datastax.com/dev/blog/thrift-to-cql3

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use keyspace3;
Authenticated to keyspace: keyspace3
[default@keyspace3] list users;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 88b8fd18-b1ed-4e96-bf79-4280797cba81
=> (name=, value=, timestamp=1381497810072000)
=> (name=email_addresses:74696d406578616d706c652e6f7267, value=, timestamp=1381497810072000)
=> (name=email_addresses:74696d6f746879406578616d706c652e6f7267, value=, timestamp=1381497810072000)
=> (name=name, value=74696d, timestamp=1381497810072000)

So we can see the row key as expected, and the name of the user as a name value pair (the value is ASCII encoded as hex; in this case 74696d is tim).
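
If you want to check the hex strings in the listing for yourself, here is a minimal sketch of a decoder in plain shell. hex_to_ascii is a name I've made up; it is not a Cassandra tool, and it assumes the hex string encodes printable ASCII text.

```shell
# Hypothetical helper: turn a string of hex digits (as shown by cassandra-cli)
# back into ASCII text.
hex_to_ascii() {
    hex=$1
    # An odd number of digits cannot be split into whole bytes.
    if [ $(( ${#hex} % 2 )) -ne 0 ]; then
        echo "hex_to_ascii: odd number of hex digits" >&2
        return 1
    fi
    out=""
    while [ -n "$hex" ]; do
        pair=${hex%"${hex#??}"}   # the first two hex digits
        hex=${hex#??}             # everything after them
        # Convert the pair to a three digit octal escape and let printf emit the byte.
        out="$out$(printf "\\$(printf '%03o' "0x$pair")")"
    done
    printf '%s\n' "$out"
}

# Example: hex_to_ascii 74696d
```

Running it on the part after email_addresses: in the listing above gives back the email addresses that were inserted.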

But for email_addresses the values are not set. Each email address is encoded into the name along with the “column” schema name: name=email_addresses:74696d406578616d706c652e6f7267 is the column name email_addresses followed by tim@example.org in ASCII hex. Why do this? Why not have the name as email_addresses and the value as the hex email address? One reason, perhaps, is that this allows us to implement maps in a similar way without needing special cases. Suppose we alter the table to include a map: we want to store details about our user, but we don’t yet know which details the user will provide (a contrived example, I’ll grant you). You can alter the table as follows:

alter table users add details map<text, text>;

and insert some details as follows:

update users set details= {'tel' : '555 232341', 'twitter' : '@andycobley'} where id =88b8fd18-b1ed-4e96-bf79-4280797cba81;

What does our column family now look like? Using the list command we get:

RowKey: 88b8fd18-b1ed-4e96-bf79-4280797cba81
=> (name=, value=, timestamp=1381498805511000)
=> (name=details:74656c, value=3031333832333435303738, timestamp=1381498805511000)
=> (name=details:74776974746572, value=40616e6479636f626c6579, timestamp=1381498805511000)

You can see the map key is stored with the column name in the name part of the column family name value pair. So name=details:74656c contains ‘tel’ as an ASCII hex value. The map value is simply stored in the value part of the column family name value pair.
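
Going the other way, you can predict the name cassandra-cli will show for a given set element or map key. The sketch below is only a model of what the CLI displays (the CQL column name, a colon, then the key in hex); it says nothing about the exact bytes Cassandra writes to disk, and cell_name is a made-up helper name.

```shell
# Hypothetical helper: build the cell name that cassandra-cli displays for a
# collection entry, i.e. "<cql column>:<key as ASCII hex>".
#   $1 = CQL column name (e.g. details or email_addresses)
#   $2 = set element or map key (e.g. tel)
cell_name() {
    column=$1
    key=$2
    # od prints each byte as two hex digits; strip the spaces and newlines.
    hex=$(printf '%s' "$key" | od -An -tx1 | tr -d ' \n')
    printf '%s:%s\n' "$column" "$hex"
}

# Example: cell_name details tel
```

For a set the element itself plays the part of the key and the value is left empty, which is why sets and maps can share the same layout.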

So, we’ve seen how CQL3’s maps and sets map on to the column family name value pairs by storing the CQL table’s column name in the name part of the column family name value pair. It’s quite simple and elegant really.

(as ever I’m more than happy to receive corrections or further explanations !)