These are my takeaways from the DataStax Cassandra Virtual Training course. The course is divided into seven sessions, which together comprise 135 modules.
Session #1:
Big data is an amount of data that is difficult to process with traditional tools because it does not fit on one computer. RDBMSs are designed to scale up, not scale out. NoSQL databases depart from ACID-compliant databases: they relax those requirements in order to be massively distributed and scalable. They are designed around the CAP theorem, which says that a distributed system can provide at most two of the following three guarantees:
- Consistency: Data on all nodes is the same at any given time. A system that does not provide this lets nodes be out of sync for a few milliseconds, which is acceptable for some applications. Strong consistency across a large distributed system makes the system slower and less available, because every node must be updated before a write completes. Such systems therefore provide eventual consistency. In Cassandra this can be tuned.
- Availability: Every request to the system receives a response, either a result or an error.
- Partition Tolerance: The system keeps working as a whole even when a few nodes are down or cannot reach each other. These systems partition their data across multiple machines and replicate it across nodes to achieve this, which also supports availability. Because data is replicated, updating every replica takes time, so it is done asynchronously or with a delay (it takes a few milliseconds to update all nodes).
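The tuning mentioned under Consistency comes down to how many replicas must acknowledge a read or a write. Here is a minimal sketch of that arithmetic, assuming a replication factor RF and the usual rule of thumb that a read sees the latest write when the read and write replica counts overlap (R + W > RF). The class and method names are made up for illustration and are not part of Cassandra or its drivers.

```java
// Hypothetical helper for illustration only; not a Cassandra API.
final class ConsistencyMath {
    // Number of replicas that form a quorum for a given replication factor.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when any set of read replicas must share at least one node
    // with any set of write replicas.
    static boolean overlaps(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("quorum for RF=" + rf + ": " + q);
        System.out.println("quorum read + quorum write overlap: " + overlaps(q, q, rf));
        System.out.println("one-replica read + one-replica write overlap: " + overlaps(1, 1, rf));
    }
}
```

With RF = 3, quorum reads and writes overlap, while single-replica reads and writes do not, which is the eventual-consistency end of the dial.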
Cassandra takes its initial design from Amazon Dynamo (distribution and replication of data) and Google Bigtable (the column family model, which writes without reading first: new data is always appended or upserted with a primary key, column name, and timestamp). It can survive a complete data center failure if the system spans multiple data centers, because data is replicated across data centers.
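The "write without read" idea can be illustrated with a toy in-memory model: a write only appends a cell, and a read picks the cell with the newest timestamp. This is a simplified sketch, not Cassandra's storage engine. All names are hypothetical, and the tie-break for equal timestamps (the later append wins) is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; names are illustrative, not Cassandra internals.
final class AppendOnlyCells {
    // One stored cell: primary key, column name, value, and write timestamp.
    static final class Cell {
        final String key, column, value;
        final long timestamp;

        Cell(String key, String column, String value, long timestamp) {
            this.key = key;
            this.column = column;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final List<Cell> log = new ArrayList<>();

    // Write path: append only, never look up the previous value.
    void upsert(String key, String column, String value, long timestamp) {
        log.add(new Cell(key, column, value, timestamp));
    }

    // Read path: among all cells for (key, column), the newest timestamp wins.
    String read(String key, String column) {
        Cell newest = null;
        for (Cell c : log) {
            if (c.key.equals(key) && c.column.equals(column)
                    && (newest == null || c.timestamp >= newest.timestamp)) {
                newest = c;
            }
        }
        return newest == null ? null : newest.value;
    }

    public static void main(String[] args) {
        AppendOnlyCells table = new AppendOnlyCells();
        table.upsert("user1", "email", "first@example.com", 10L);
        table.upsert("user1", "email", "second@example.com", 20L);
        table.upsert("user1", "email", "stale@example.com", 5L); // arrives late, older timestamp
        System.out.println(table.read("user1", "email"));
    }
}
```

The late-arriving write with the older timestamp is stored but never returned, so insert and update are the same operation from the writer's point of view.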
Cassandra uses peer-to-peer replication, and any node can handle reads and writes, so there is no single point of failure.
It has a SQL-like query language called CQL (a denormalized SQL). cqlsh is a command-line tool that ships with Cassandra for running DDL and DML queries. CQL supports almost all SQL constructs except those used for joins and aggregation.
It differs from SQL on an RDBMS in that, without joins and aggregation, all the data a query needs must be stored in one table. So when data modeling, we model the queries, not the data. Writes put data in pre-joined form.
A table in Cassandra holds a query result, and duplication is encouraged if the same data needs to be queried in a different form. For example, think of a many-to-many relationship in the relational model as two full tables in Cassandra, with the data duplicated but inverted.
So different queries may need different tables that store the same data differently. Posts on the coming sessions will have more data modeling examples.
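The two-table idea for a many-to-many relationship can be sketched in plain Java, with in-memory maps standing in for Cassandra tables. The user/group domain and every name below are hypothetical. The point is only that one logical write goes to both "tables", so each query reads from a single one.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: two maps play the role of two denormalized tables.
final class Membership {
    // "Table" answering: which groups does this user belong to?
    private final Map<String, Set<String>> groupsByUser = new HashMap<>();
    // "Table" answering: which users are in this group? Same data, inverted.
    private final Map<String, Set<String>> usersByGroup = new HashMap<>();

    // One logical write is stored twice, once per query.
    void join(String user, String group) {
        groupsByUser.computeIfAbsent(user, k -> new TreeSet<>()).add(group);
        usersByGroup.computeIfAbsent(group, k -> new TreeSet<>()).add(user);
    }

    Set<String> groupsOf(String user) {
        return groupsByUser.getOrDefault(user, Collections.<String>emptySet());
    }

    Set<String> usersIn(String group) {
        return usersByGroup.getOrDefault(group, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        Membership m = new Membership();
        m.join("alice", "admins");
        m.join("alice", "devs");
        m.join("bob", "devs");
        System.out.println("groups of alice: " + m.groupsOf("alice"));
        System.out.println("users in devs: " + m.usersIn("devs"));
    }
}
```

Neither lookup needs a join. The cost is the extra write and the duplicated storage, which is the trade this modeling style accepts.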
DataStax provides Cassandra installers for easy installation and configuration, and OpsCenter, a management console. The enterprise edition of DataStax Cassandra offers security features (including Kerberos), tight Hadoop integration that replaces HDFS with Cassandra, and integration with Solr.
Interacting with Cassandra:
The Cassandra daemon process must be started before any interaction. It is registered as a service and started automatically if installed with the DataStax installers.
- Using cqlsh to execute queries and commands.
- Using DataStax OpsCenter to monitor the Cassandra cluster and add nodes to it.
- Using CCM, a command-line tool that installs multiple nodes on one machine for testing (without any VM).
- Using drivers developed for programming languages such as Java, C#, and Python.
They are based on the native binary protocol. Make sure the connection requirements at https://github.com/datastax/java-driver/wiki/Connection-requirements are met; generally the defaults satisfy them.
Cassandra data type to Java data type mapping: http://www.datastax.com/documentation/cql/3.0/webhelp/cql/cql_reference/cql_data_types_c.html
UUID: Cassandra supports two types of UUID:
1. Type 1 (timeuuid) contains a timestamp with 100 ns precision, the version, and the MAC address of the computer that generated the UUID.
2. Type 4 (uuid) has all random bits except for the version digit, which is the 13th digit of the hexadecimal representation, and the variant bits.
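Both kinds can be inspected with the standard `java.util.UUID` class, as in the minimal sketch below. The class name, helper names, and the sample type 1 value are assumptions for illustration. The sample was hand-built so that its timestamp falls exactly on the Unix epoch, and the JDK itself only generates type 4 UUIDs, so the type 1 value is parsed from a string.

```java
import java.util.UUID;

// Illustrative helper built on java.util.UUID; the names are hypothetical.
final class UuidKinds {
    // 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    static final long UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000L;

    // The 13th hexadecimal digit carries the version.
    static char versionDigit(UUID uuid) {
        return uuid.toString().replace("-", "").charAt(12);
    }

    // Converts the 100 ns timestamp of a type 1 UUID to Unix milliseconds.
    // UUID.timestamp() throws UnsupportedOperationException for other versions.
    static long unixMillis(UUID timeUuid) {
        return (timeUuid.timestamp() - UUID_EPOCH_OFFSET_100NS) / 10_000;
    }

    public static void main(String[] args) {
        UUID random = UUID.randomUUID(); // type 4
        System.out.println(random + " version digit: " + versionDigit(random));

        // Hand-built type 1 sample (assumption: used only for demonstration).
        UUID timeBased = UUID.fromString("13814000-1dd2-11b2-8080-808080808080");
        System.out.println(timeBased + " version digit: " + versionDigit(timeBased));
        System.out.println("unix millis: " + unixMillis(timeBased));
        System.out.println("node field (hex): " + Long.toHexString(timeBased.node()));
    }
}
```

The node field is where a type 1 generator normally places the MAC address. In the sample it is just a filler value.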
In this session, Java, Cassandra, and the Eclipse IDE are installed and sample Java MVC projects are set up.
Session #2