Friday, September 24, 2010

Assembling a local Cassandra cluster in Linux



In order to build an Apache Cassandra [1] cluster exclusively for availability and replication testing, here is a simple solution based on a single Linux instance, with no virtualization at all.

The idea is to initialize every node, run a test client, and then manually kill some node processes in order to check the service's continuous availability and data replication under several replication factors.

Cassandra's latest stable version at the time of this post was 0.6.5, and that is the release considered in this study.

For a cluster of 3 nodes, we'll create the following file system structure on Linux:

/opt/apache-cassandra-0.6.5/nodes
|
|-- node1
|   |-- bin
|   |-- conf
|   |-- data
|   |-- log
|   `-- txs
|
|-- node2
|   |-- bin
|   |-- conf
|   |-- data
|   |-- log
|   `-- txs
|
`-- node3
    |-- bin
    |-- conf
    |-- data
    |-- log
    `-- txs


That is, each node has its own set of directories, namely:
- bin: binaries (mostly Shell Scripts) tuned for the instance
- conf: configuration files for the node
- data: database binary files
- log: system log in text format
- txs: transaction logs (i.e. commit logs)

In the following sections the steps will be described in detail.


Cassandra installation



First of all, get the required package for Cassandra from its download site. Extract the file contents into the /opt/apache-cassandra-0.6.5/ directory (you'll need administrator privileges to do so). Make sure your regular user owns this entire directory tree (use chown if needed).
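
As a rough sketch of these steps, assuming the binary tarball of 0.6.5 (adjust the tarball path to wherever you saved the download, and replace youruser with your regular Linux account):

$ cd /opt
$ sudo tar xzf /path/to/apache-cassandra-0.6.5-bin.tar.gz
$ sudo chown -R youruser:youruser apache-cassandra-0.6.5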

Make sure you already have a Java Virtual Machine (JVM) installed on the system, at least version 1.6. Cassandra is implemented in Java, so it needs a JVM to run.
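
A quick way to confirm the JVM is present and recent enough (the exact output depends on your Java vendor):

$ java -version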


Supplying additional network interfaces



As we are building a pseudo-cluster, assume there are no network interfaces beyond the existing ones. Since Cassandra is a distributed system, each participating node must have its own exclusive IP address. Fortunately, we can simulate this on Linux by creating aliases for the existing loopback interface (i.e., "lo").

Therefore, as the root user, we'll issue the commands below to create the aliases lo:2 and lo:3, pointing to 127.0.0.2 and 127.0.0.3, respectively:

# ifconfig lo:2 127.0.0.2 up
# ifconfig lo:3 127.0.0.3 up


You can check the outcome of these commands by invoking ifconfig with no arguments:

$ ifconfig
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:30848 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30848 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2946793 (2.9 MB)  TX bytes:2946793 (2.9 MB)

lo:2      Link encap:Local Loopback
          inet addr:127.0.0.2  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1

lo:3      Link encap:Local Loopback
          inet addr:127.0.0.3  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1


That way, node1's IP is 127.0.0.1, node2's 127.0.0.2, and so forth.


Registering node hostnames locally



Instead of using IP addresses in the subsequent steps, we'll assign hostnames to them and use those names when referencing a cluster node.

As we have no DNS in this configuration, we need to edit the /etc/hosts file and add those IP mappings there. Change this file according to the example below:

/etc/hosts:
127.0.0.1    localhost    node1
127.0.0.2    node2
127.0.0.3    node3


From now on, we'll refer to the nodes simply as node1, node2, and node3. To test the hostnames, try pinging the nodes as exemplified:

$ ping node2
PING node2 (127.0.0.2) 56(84) bytes of data.
64 bytes from node2 (127.0.0.2): icmp_seq=1 ttl=64 time=0.018 ms
64 bytes from node2 (127.0.0.2): icmp_seq=2 ttl=64 time=0.015 ms
^C
--- node2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.015/0.016/0.018/0.004 ms



Creating the first node structure



We'll first create the directory structure for node1 and then replicate it to the remaining nodes. Switch to Cassandra's root directory, create the subdirectories, and copy the files by issuing the following shell commands as your regular Linux user:

$ cd /opt/apache-cassandra-0.6.5/
$ mkdir -p nodes/node1
$ cp -R bin/ conf/ nodes/node1/
$ cd nodes/node1


The next step is to modify some default settings in the configuration files. Here I opted to use relative paths instead of absolute paths for simplicity.

Edit a single line in the conf/log4j.properties file:

# Edit the next line to point to your logs directory
log4j.appender.R.File=./log/system.log


Now, several tag values must be changed in conf/storage-conf.xml. Be sure to modify the correct tags, as indicated below:

  <Keyspaces>
    <Keyspace Name="Keyspace1">
      ...

      <ReplicationFactor>2</ReplicationFactor>
      ...

    </Keyspace>
  </Keyspaces>
  ...

  <CommitLogDirectory>./txs</CommitLogDirectory>
  <DataFileDirectories>
      <DataFileDirectory>./data</DataFileDirectory>
  </DataFileDirectories>
  ...

  <ListenAddress>node1</ListenAddress>
  <StoragePort>7000</StoragePort>
  ...

  <ThriftAddress></ThriftAddress>
  <ThriftPort>9160</ThriftPort>
  <ThriftFramedTransport>false</ThriftFramedTransport>
  ...
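
To double-check the values you just edited, a plain grep over the file works well enough (run from the node1 directory):

$ grep -E 'ReplicationFactor|CommitLogDirectory|DataFileDirectory|ListenAddress|ThriftPort' conf/storage-conf.xml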


As we are building a special directory arrangement, we'll need to adjust some paths in the scripts. Thus, edit the file bin/cassandra.in.sh, changing the lines below accordingly:

for jar in $cassandra_home/../../lib/*.jar; do
    CLASSPATH=$CLASSPATH:$jar
done


The first node is finally done. So let's turn our attention to the other two nodes.


Creating the remaining nodes



We're going to take advantage of the first node's directory structure to create the two remaining ones. Thus, issue the commands below to perform the cloning:

$ cd /opt/apache-cassandra-0.6.5/nodes/
$ mkdir node2 node3
$ cp -R node1/* node2
$ cp -R node1/* node3


When you're done, run the tree command (if available) as illustrated below to check the whole structure:

$ tree -L 2
.
|-- node1
|   |-- bin
|   `-- conf
|-- node2
|   |-- bin
|   `-- conf
`-- node3
    |-- bin
    `-- conf


If you compare this structure to the one at the beginning of this article, you'll notice that some subdirectories are missing (i.e., log, data, txs). That's because they will be created automatically when each server starts up for the first time.


Final settings on nodes



The first node acts as a contact point, i.e., a seed. The other nodes start the Gossip-based protocol by connecting to it.
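
For reference, this is controlled by the Seeds block in conf/storage-conf.xml. The default file already lists 127.0.0.1, which is exactly node1 in our mapping, so it can stay as it is (or be written with the hostname, as below) on all three nodes:

  <Seeds>
      <Seed>node1</Seed>
  </Seeds>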

Besides that, some settings are specific to each node. Firstly, each node must listen on its own hostname (i.e., node1, node2, node3). Thus, edit nodeX/conf/storage-conf.xml for each node, replacing the line below accordingly (this example applies to node2):

  <ListenAddress>node2</ListenAddress>
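
If you prefer to script this instead of editing by hand, a sed substitution like the following does the trick for node2 and node3 (run from the nodes/ directory; it assumes the copied files still carry the node1 value):

$ sed -i 's|<ListenAddress>node1</ListenAddress>|<ListenAddress>node2</ListenAddress>|' node2/conf/storage-conf.xml
$ sed -i 's|<ListenAddress>node1</ListenAddress>|<ListenAddress>node3</ListenAddress>|' node3/conf/storage-conf.xml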


Monitoring in Cassandra is provided through JMX, which by default listens on TCP port 8080 on every host. As we are building the cluster on a single real instance, that is no longer possible: the JMX interfaces will all be bound to the same host, so we must change the port on each node. Therefore, node1 will listen on 8081, node2 on 8082, and node3 on 8083.

This parameter is configured in nodeX/bin/cassandra.in.sh, so we only need to change the port, as exemplified below (applying to node2):

# Arguments to pass to the JVM
JVM_OPTS=" \
        -ea \
        -Xms1G \
        -Xmx1G \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:SurvivorRatio=8 \
        -XX:MaxTenuringThreshold=1 \
        -XX:+HeapDumpOnOutOfMemoryError \
        -Dcom.sun.management.jmxremote.port=8082 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"
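
Again, sed can apply this change to all three copies at once (run from the nodes/ directory; it assumes each cassandra.in.sh still carries the default port 8080):

$ sed -i 's/jmxremote.port=8080/jmxremote.port=8081/' node1/bin/cassandra.in.sh
$ sed -i 's/jmxremote.port=8080/jmxremote.port=8082/' node2/bin/cassandra.in.sh
$ sed -i 's/jmxremote.port=8080/jmxremote.port=8083/' node3/bin/cassandra.in.sh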



Starting up every server



For this step it is handy to open a separate Linux terminal for each node. Since the configuration files use relative paths (./data, ./log, ./txs), start each server from inside its own node directory so those paths resolve under the correct node:

$ cd /opt/apache-cassandra-0.6.5/nodes/node1 && ./bin/cassandra -f

$ cd /opt/apache-cassandra-0.6.5/nodes/node2 && ./bin/cassandra -f

$ cd /opt/apache-cassandra-0.6.5/nodes/node3 && ./bin/cassandra -f


Pay attention to the console output of each node: any simple misconfiguration will be reported there. Also note the relationship between the nodes; they should discover one another automatically within seconds.


Check services availability



With every node in the cluster up and running, their respective TCP ports should appear in the output of a netstat command.

To check the TCP ports Cassandra is listening on, look for 9160 (Thrift service), 7000 (internal storage), and 808X (JMX interface). Take a look at the example below of a successful cluster startup.

$ netstat -lptn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      -             
tcp6       0      0 127.0.0.3:9160          :::*                    LISTEN      8520/java     
tcp6       0      0 127.0.0.2:9160          :::*                    LISTEN      8424/java     
tcp6       0      0 127.0.0.1:9160          :::*                    LISTEN      8336/java     
tcp6       0      0 :::46954                :::*                    LISTEN      8424/java     
tcp6       0      0 :::53418                :::*                    LISTEN      8336/java     
tcp6       0      0 :::49035                :::*                    LISTEN      8520/java     
tcp6       0      0 :::80                   :::*                    LISTEN      -             
tcp6       0      0 :::42737                :::*                    LISTEN      8520/java     
tcp6       0      0 :::8081                 :::*                    LISTEN      8336/java     
tcp6       0      0 :::8082                 :::*                    LISTEN      8424/java     
tcp6       0      0 :::8083                 :::*                    LISTEN      8520/java     
tcp6       0      0 :::60310                :::*                    LISTEN      8424/java     
tcp6       0      0 :::46167                :::*                    LISTEN      8336/java     
tcp6       0      0 ::1:631                 :::*                    LISTEN      -             
tcp6       0      0 127.0.0.3:7000          :::*                    LISTEN      8520/java     
tcp6       0      0 127.0.0.2:7000          :::*                    LISTEN      8424/java     
tcp6       0      0 127.0.0.1:7000          :::*                    LISTEN      8336/java     
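
If the listing is too noisy, the same output can be narrowed down to the relevant ports with a plain grep:

$ netstat -lptn | grep -E ':9160|:7000|:808[1-3]'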


The last test was from a network perspective. You might also want to ask Cassandra itself whether these nodes are talking to each other and participating in a healthy partitioning scheme. To do so, another check is to invoke nodetool's ring command.

$ cd /opt/apache-cassandra-0.6.5/

$ ./bin/nodetool -h localhost -p 8081 ring
Address       Status     Load          Range                                      Ring
                                       142865723918937898194528652808268231850  
127.0.0.1     Up         3,1 KB        39461784941927371686416024510057184051     |<--|
127.0.0.3     Up         3,1 KB        54264004217607518447601711663387808864     |   |
127.0.0.2     Up         2,68 KB       142865723918937898194528652808268231850    |-->|


Here it is, the local cluster is up and running with three nodes! :D




You can also check Cassandra's availability by using its simple client application:

$ ./bin/cassandra-cli --host node1 --port 9160
Connected to: "Test Cluster" on node1/9160
Welcome to cassandra CLI.

cassandra> set Keyspace1.Standard1['rowkey']['column'] = 'value'
Value inserted.

cassandra> get Keyspace1.Standard1['rowkey']['column']          
=> (column=636f6c756d6e, value=value, timestamp=1285273581745000)

cassandra> get Keyspace1.Standard1['rowkey']          
=> (column=636f6c756d6e, value=value, timestamp=1285273581745000)
Returned 1 results.

cassandra> del Keyspace1.Standard1['rowkey']                    
row removed.


As we set the replication factor to 2 for the keyspace Keyspace1, each value saved in the Cassandra cluster is expected to be replicated to two nodes.

Do you really believe it is happening? If not, save another value (using set) and then kill one of the three node processes. Do not kill more than one! Then, retrieve that key again (using get). The system should be capable of automatically recovering from this intentional disaster.
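
A possible run of this experiment is sketched below; the row key and value are arbitrary, and node3 was picked as the victim only because the CLI session is connected to node1:

$ ./bin/cassandra-cli --host node1 --port 9160
cassandra> set Keyspace1.Standard1['failover']['column'] = 'still here'
(now press Ctrl+C in node3's terminal to stop that node)
cassandra> get Keyspace1.Standard1['failover']['column']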

You can easily make a node participate in the cluster again just by executing its respective bin/cassandra command. Take a look at all the console outputs after every step; they are very interesting!
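
For instance, to bring node3 back, just restart it from its own directory as before:

$ cd /opt/apache-cassandra-0.6.5/nodes/node3 && ./bin/cassandra -f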


References



[1] Apache Cassandra: http://cassandra.apache.org/