This
blog is about using best-of-breed database technologies and
integrating them to sync data in near real time. VoltDB
specializes in OLTP, while Vertica is an analytic platform
specialized for OLAP.
Here,
I will describe the installation required to get both databases
running (a single-node installation with K-safety of 0). Setting up a
cluster with K-safety needs additional host machines and configuration.
Since this post does not cover cluster setup, K-safety, or the other
configuration required for production use, it is aimed not at DBAs but
at developers, programmers, and ETL engineers who want to install the
databases on a local/development machine to test an application or ETL jobs.
Further
posts will cover writing UDFs (user-defined functions) and other
topics from a database developer's perspective. Feel free to request
anything related to this.
OLTP and OLAP in
traditional DBMS and NewSQL DBMS (VoltDB and Vertica)
This blog is about emerging NewSQL
DBMSs and how this paradigm shift outperforms traditional RDBMSs. I
will discuss the OLAP database Vertica Analytic Platform and the
high-performance OLTP RDBMS VoltDB. Both originate from
database research pioneer Michael
Stonebraker (Vertica has since been acquired by HP). The idea behind these
two DBMSs is that traditional RDBMSs tend to provide an all-in-one
(general-purpose, OLTP+OLAP) solution and were not designed for large
volumes of data (Big Data); they are either not scalable or very complex
to scale and maintain. OLTP and OLAP have different needs, so these two
DBMSs were redesigned and developed from the ground up, each purpose-built
for one of those needs. Both scale out horizontally: simply add
more machines to the cluster to handle more load, provide K-safety, and so on.
To get the full set of functionality (high-volume
transactions plus analytics with complex queries), we need to
integrate both DBMSs. For example, daily business activities need an OLTP
database into which data is populated (inserts/updates), while analyzing
that data (historical analysis) with a BI (Business Intelligence)
solution needs an OLAP database. The two have different requirements:
OLTP needs a write-optimized DBMS and OLAP needs a read-optimized
one. OLTP and OLAP data modeling also differ. In OLTP the
schema is normalized, for example 3NF, while OLAP uses a snowflake or
star schema in which fact tables are surrounded by dimension tables
(dimensional modeling).
Traditional DBMSs are general purpose,
so the same DBMS was used for both. NewSQL evolved as purpose-built
systems, each optimized for a specific use.
To demonstrate this, we will consider the
following use case:
An eCommerce website (popular, and hence
high volume) stores the orders placed and also needs to analyze user
trends, for example by geographic location.
We will use the following technologies:
- VoltDB [Open Source Edition, 3.5.0.1] for the OLTP database. Size ~19MB
- Vertica [Community Edition, 6.1.2] for the OLAP database. Size ~80MB
- Talend DI for ETL (synchronizing from VoltDB to Vertica). Size ~600MB
The OS is Fedora 19 64-bit (x86_64
architecture; check the Linux architecture with the 'uname -m' command) on
decent hardware.
Step #1: Install and
configure VoltDB, Vertica and Talend
Detailed installation steps and fine-tuned
configuration can be found in the corresponding installation guides
or getting-started documents.
Install
VoltDB:
To download, go to voltdb.com
and register for the download. A
download link will be sent to your email. Download VoltDB
(voltdb-3.5.0.1.tar.gz) and extract it; the extracted directory is referred
to as VOLTDB_HOME in this tutorial. Add VOLTDB_HOME/bin to $PATH (edit
the ~/.bashrc file). Installation done!
This installation contains documentation, a
web console (localhost:8080/studio) and a JSON-based REST API for
accessing VoltDB; the REST API is documented at
http://voltdb.com/docs/UsingVoltDB/ProgLangJson.php
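As a quick illustration of the JSON API, the sketch below issues an ad hoc query over HTTP from plain Java. It is only a sketch: it assumes the sample database from Step #2 is already running with the HTTP port and JSON API enabled, and that the endpoint and parameter names (/api/1.0/, Procedure, Parameters) match the documentation linked above; check that page for the exact interface of your VoltDB version.

JsonApiExample.java

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class JsonApiExample {
    public static void main(String[] args) throws Exception {
        // The JSON interface takes a procedure name and a JSON array of parameters as URL query arguments.
        // @AdHoc is VoltDB's built-in procedure for running an arbitrary SQL string.
        String params = URLEncoder.encode("[\"select count(*) from orders\"]", "UTF-8");
        URL url = new URL("http://localhost:8080/api/1.0/?Procedure=@AdHoc&Parameters=" + params);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // JSON document containing status and result tables
        }
        reader.close();
        conn.disconnect();
    }
}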
Install
Vertica:
Go to my.vertica.com
(register and log in) and download Vertica for Fedora
(vertica-6.1.2-0.x86_64.FC.rpm).
Prepare Linux for installation:
Disable
SELinux by editing the SELinux configuration file: in a terminal, log in
as root, execute the command vi /etc/sysconfig/selinux and set
"SELINUX=disabled" in this file.
Edit
"/etc/pam.d/su" (vi /etc/pam.d/su)
and add "session required pam_limits.so" if it is not already in the file.
Now install Vertica
with "rpm -ivh vertica-6.1.2-0.x86_64.FC.rpm" (change to the directory
where the rpm was downloaded before executing the command). A few inputs
will be asked for, such as the installation path and the dbadmin user.
After installation a
new Linux user "dbadmin" will be created.
Switch to the "dbadmin" user
with the command "su - dbadmin" in the terminal to set up and create the
database.
Install
Talend DI:
Download Talend Open Studio for Data
Integration v5.3.1 from
http://www.talend.com/download/data-integration
and extract the archive. The extracted folder is referred to as TALEND_HOME.
Step #2: Set up
databases for both DBMSs
Data
Modeling:
As this tutorial is meant to demonstrate the
technologies, we will not focus on modeling concepts; instead we will have a
single table in both databases and sync its data. You can learn data modeling at
http://www.learndatamodeling.com
Orders table OLTP [VoltDB]:
order_id <primary key>
item_id <item identifier, referencing some warehouse table>
product_id <identifies a product in some product table>
product_category <category of product>
user_id <customer related information>
user_country
user_city
user_age
In a real application, product- and user-related
information would be stored in separate tables and referenced from here;
for simplicity there is a single table.
product_category will be the partition
column, because both the application and the business can be separated
along it and it produces evenly distributed data. If there were
separate product and user tables, they could be partitioned on
product_category and user_country respectively, as they are expected
to hold large amounts of data; if the expected data is small, they can be
kept as replicated tables. See more on partitioning and replication in
the "Using VoltDB" document and at
http://voltdb.com/resources/volt-university/tutorials/section-1-4/
Orders table OLAP [Vertica]:
order_id <primary key>
product_id <identifies a product in some product table>
product_category <category of product>
user_id <customer related information>
user_country
user_city
user_age
create_date <populated by the ETL>
In a real application, product and user
would be separate dimension tables and orders would be the fact table.
The partitioning strategy remains the same. The columns differ between
the two databases' tables because not all of the OLTP data is required
for analytics.
Create
the VoltDB database and configure it for synchronization:
VoltDB
compiles DDL and Java stored procedures into a single jar, called a
catalog, which is then deployed to VoltDB. We need to bundle transactions
(data access logic) as stored procedures written in Java. We can
execute ad hoc queries, but they do not take advantage of VoltDB's
architecture and will not be as performant. We also configure
incremental export of data so it can be synced to Vertica using the ETL.
There
is a deployment.xml configuration file to configure and enable
features of the VoltDB database.
We need the following
to compile and generate the database catalog:
DDL to create the regular and export tables, register the stored procedures
written in Java, and partition the tables and procedures.
sample.sql

--create table for storing orders data
--add the partition column to the primary key to guarantee uniqueness across all
--partitions in the database and avoid unique constraint violations while repartitioning
create table orders (
    order_id integer not null,
    item_id integer,
    product_id integer,
    product_category varchar(30) not null,
    user_id integer,
    user_country varchar(20),
    user_city varchar(20),
    user_age integer,
    primary key(order_id, product_category)
);

--table for exporting selected columns from the orders table
--only insert is allowed for export tables, as data is queued and
--fetched by an export client. This feature is for incremental sync
--of data to an external system
create table orders_export (
    order_id integer not null,
    product_id integer,
    product_category varchar(30) not null,
    user_id integer,
    user_country varchar(20),
    user_city varchar(20),
    user_age integer
);

--VoltDB does not support auto increment; to implement it we can have a table
--to store max + 1 as the next value of the identifier field and query this table
--in a stored procedure.
create table auto_increment (
    table_name varchar(50) not null,
    next_value integer,
    primary key(table_name)
);

--Mark orders_export table as export only
EXPORT TABLE orders_export;

--Partition orders table. No need to partition the export table as no data is stored for it.
partition table orders on column product_category;

--This is a small table and suited as a replicated table, but we need to write
--to it while getting and incrementing the next value for a table, so partition it on
--its primary key
partition table auto_increment on column table_name;

--register the stored procedures written in Java
CREATE PROCEDURE FROM CLASS SaveOrder;
CREATE PROCEDURE FROM CLASS AutoIncrement;

--Partition each stored procedure on the same column as its table and provide the parameter
--index of the partition column in the arguments passed to the procedure. By default it is
--expected as the first argument; in our procedure it is the 4th argument (index 3).
PARTITION PROCEDURE SaveOrder ON TABLE orders COLUMN product_category PARAMETER 3;
PARTITION PROCEDURE AutoIncrement ON TABLE auto_increment COLUMN table_name;
Deployment
configuration file.
deployment.xml

<?xml version="1.0"?>
<deployment>
    <!-- Single host local deployment -->
    <cluster hostcount="1" sitesperhost="2"/>

    <!-- Directories for storing snapshots, export overflow and other files generated by VoltDB -->
    <paths>
        <exportoverflow path="/home/lalit/Softwares/VoltDB/sample/export"/>
        <snapshots path="/home/lalit/Softwares/VoltDB/sample/snapshots"/>
        <voltdbroot path="/home/lalit/Softwares/VoltDB/sample/root"/>
    </paths>

    <!-- Enable the web console and REST API to interact with VoltDB. Apart from the JDBC
         driver and Java client, VoltDB can be accessed using the JSON based REST API to
         execute queries -->
    <httpd enabled="true">
        <jsonapi enabled="true"/>
    </httpd>

    <!-- VoltDB is an in-memory database and provides durability by writing data to file
         at regular intervals. Also, before shutdown the database should be paused and
         saved to ensure all data is written to disk, and on startup it should be restored.
         This configuration saves snapshots to the path configured in <paths> every 5
         minutes and keeps the 3 most recent snapshots.
         Snapshots save all data in tables except tables marked as export only. -->
    <snapshot prefix="sample" frequency="5m" retain="3"/>

    <!-- This configuration enables export functionality and uses the export-to-file export
         client to write exported data to file. Other export clients are available, like the
         JDBC client to write data directly to another database and the Hadoop client to
         export data to Hadoop. One can write a custom export client as needed.
         Export is for integrating VoltDB with other systems. To export data we need to
         create tables marked as export only. All inserts to export-only tables go to a
         queue and the export client fetches from the queue, hence incremental export. On
         overflow of the queue, data is written to disk at the location specified in <paths>.
         Enabling the skipinternals option removes the transaction id, partition id,
         timestamp and similar metadata from the export and exports only the table data. -->
    <export enabled="true">
        <onserver exportto="file">
            <configuration>
                <property name="type">csv</property>
                <property name="nonce">sample</property>
                <property name="period">15</property>
                <property name="outdir">/home/lalit/Softwares/VoltDB/sample/export</property>
                <property name="skipinternals">true</property>
                <property name="with-schema">true</property>
            </configuration>
        </onserver>
    </export>
</deployment>
More on the deployment XML:
https://voltdb.com/docs/UsingVoltDB/ConfigStructure.php
Java classes for the stored procedures.
SaveOrder.java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

public class SaveOrder extends VoltProcedure {

    private final SQLStmt insert =
            new SQLStmt("insert into orders values (?, ?, ?, ?, ?, ?, ?, ?)");

    private final SQLStmt export =
            new SQLStmt("insert into orders_export values (?, ?, ?, ?, ?, ?, ?)");

    /**
     * VoltDB procedures are subclasses of {@link VoltProcedure} and run implicitly in a transaction.
     *
     * @param orderId
     * @param itemId
     * @param productId
     * @param productCategory
     * @param userId
     * @param userCountry
     * @param userCity
     * @param userAge
     * @return
     * @throws VoltAbortException
     */
    public long run(int orderId, int itemId, int productId, String productCategory,
                    int userId, String userCountry, String userCity, int userAge)
            throws VoltAbortException {
        // insert data into the orders table and the export table
        voltQueueSQL(insert, orderId, itemId, productId, productCategory, userId,
                userCountry, userCity, userAge);
        voltQueueSQL(export, orderId, productId, productCategory, userId, userCountry,
                userCity, userAge);
        voltExecuteSQL();
        // procedures must return long, Long, VoltTable or VoltTable[], so return a value
        return orderId;
    }
}
AutoIncrement.java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;
import org.voltdb.VoltType;

public class AutoIncrement extends VoltProcedure {

    private final SQLStmt autoIncrementSelect =
            new SQLStmt("select next_value from auto_increment where table_name=?");

    private final SQLStmt autoIncrementUpdate =
            new SQLStmt("update auto_increment set next_value = ? where table_name=?");

    private final SQLStmt autoIncrementInsert =
            new SQLStmt("insert into auto_increment values (?, ?)");

    public long run(String table) {
        // Get next value for the orders table, if null use 1
        voltQueueSQL(autoIncrementSelect, "orders");
        VoltTable[] result = voltExecuteSQL();
        Integer nextValueOrders = 1;
        if (result.length > 0 && result[0].getRowCount() > 0) {
            nextValueOrders = (Integer) result[0].fetchRow(0).get(0, VoltType.INTEGER);
        }
        // update the auto increment table
        if (nextValueOrders > 1) {
            voltQueueSQL(autoIncrementUpdate, nextValueOrders + 1, "orders");
        } else {
            voltQueueSQL(autoIncrementInsert, "orders", nextValueOrders + 1);
        }
        voltExecuteSQL();
        return nextValueOrders;
    }
}
Create
a Java project in Eclipse and add the jars from the VOLTDB_HOME/voltdb and
VOLTDB_HOME/lib folders. Also create a client that uses these procedures
to insert data and test the application.
Application.java
import java.io.IOException;

import org.voltdb.VoltTable;
import org.voltdb.VoltTableRow;
import org.voltdb.VoltType;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;
import org.voltdb.client.NoConnectionsException;
import org.voltdb.client.ProcCallException;

public class Application {

    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // TODO modify the AutoIncrement procedure to accept an int arg that sets the next value,
        // to avoid calling this get-and-increment for every row in a bulk load.
        int orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 1, 101, "CE", 1, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 2, 101, "CE", 2, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 3, 101, "CE", 3, "US", "New York", 34);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 4, 107, "APP", 4, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 5, 101, "CE", 5, "GB", "London", 23);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 6, 101, "CE", 6, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 7, 101, "CE", 7, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 8, 103, "APP", 8, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 9, 101, "CE", 9, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 10, 102, "CE", 10, "IN", "Mumbai", 25);

        client.drain();
        client.close();
    }

    private static int getNextValueForTable(Client client)
            throws NoConnectionsException, IOException, ProcCallException {
        ClientResponse response = client.callProcedure("AutoIncrement", "orders");
        if (response.getStatus() != ClientResponse.SUCCESS) {
            System.out.println("Failed to retrieve the next order id");
            System.exit(-1);
        }
        VoltTable[] results = response.getResults();
        if (results.length > 0) {
            VoltTable result = results[0];
            if (result.getRowCount() > 0) {
                VoltTableRow row = result.fetchRow(0);
                return ((Integer) row.get(0, VoltType.INTEGER)).intValue();
            }
        }
        return 1;
    }
}
Inserts
could also be done using sqlcmd (a
command line tool) or the REST API, but this approach was chosen to show how
to develop a VoltDB client application. Note that procedure calls can return values.
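A related point: besides the synchronous callProcedure calls used above, the VoltDB Java client can invoke procedures asynchronously with a callback, which is generally preferred for bulk loading. The following is a minimal, hedged sketch of that pattern; it reuses the SaveOrder procedure defined earlier with a hypothetical order id of 999, and error handling is kept to a bare minimum.

AsyncInsertExample.java

import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;
import org.voltdb.client.ProcedureCallback;

public class AsyncInsertExample {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");
        // Asynchronous invocation: the callback runs when the server responds,
        // so the calling thread is free to queue more invocations in the meantime.
        client.callProcedure(new ProcedureCallback() {
            @Override
            public void clientCallback(ClientResponse response) {
                if (response.getStatus() != ClientResponse.SUCCESS) {
                    System.err.println("SaveOrder failed: " + response.getStatusString());
                }
            }
        }, "SaveOrder", 999, 1, 101, "CE", 1, "IN", "Mumbai", 25);
        client.drain(); // block until all queued invocations have completed
        client.close();
    }
}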
Compile and create the
catalog:
Create
a directory named "sample" to keep all database-related files, and create
the sample.sql and deployment.xml files from the previous section in it.
Also copy the compiled SaveOrder.class and AutoIncrement.class files from
the Eclipse project's bin directory.
Open a
terminal and move to the sample directory.
Execute the
command voltdb compile --classpath="./" -o sample.jar
sample.sql to generate the
deployables for the database. Check the output for any errors; if it
succeeds, move to the next step.
Create the
directories configured in the deployment.xml <paths>
element (root, export and snapshots in this example). Now deploy the database
with the command voltdb create host localhost catalog
sample.jar deployment deployment.xml
In
another terminal use the sqlcmd command
to test the database, or open localhost:8080/studio
in a browser.
Now execute the Application.java main method (in Eclipse) to
insert data. Verify the data was inserted, and after
5 minutes check the data exported into the export folder.
Execute this query from the studio in the browser or from sqlcmd and
check the incremental export:
insert
into orders_export values(11, 101, 'CE', 1, 'IN', 'Mumbai', 25);
Create
the Vertica Database
For
data modeling we will use the same schema as the export table in VoltDB,
with an additional column populated by the ETL.
DDL
for creating the schema and table:
example.sql
--Only one database runs at a time in a Vertica cluster.
--Rather than creating databases we create schemas, but a schema differs
--from an Oracle schema in that it is not associated with a user.
create schema example;

create table example.orders (
    order_id integer not null primary key,
    product_id integer,
    product_category varchar(30) not null,
    user_id integer,
    user_country varchar(20),
    user_city varchar(20),
    user_age integer,
    create_date date
);
This script creates a schema and
one table in it. Vertica is a column-oriented database and
achieves its performance by means of partitioning and projections.
Partitioning, K-safety, and cluster scalability are similar to VoltDB.
Projections are collections of columns stored with a specific ordering
or grouping for particular queries; pre-join projections are pre-computed
inner joins of tables. For each table there is a super
projection, which contains all of its columns, and users can create
additional query-specific or pre-join projections. This is similar to a
materialized view in other DBMSs.
Partitioning and
projections let database designers plan the design early. Vertica
provides the Database Designer tool to create and deploy projections. It
requires the DDL schema, sample data in the tables, and sample queries in
order to propose projections. So let us load data into the table and create
a sample queries file to run the Database Designer.
Sample data must be similar to the
actual data; for example, it should not duplicate one record everywhere or
be completely random. If a column draws from a set of roughly 20 values,
generate a random number between 0 and 20 for it and insert about 100K records.
Below is a Java class that inserts records
using JDBC. Create a new class in the Java project and add the jars from
/opt/vertica/java/lib.
VerticaInsert.java
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;

public class VerticaInsert {

    public static void main(String[] args) throws Exception {
        Class.forName("com.vertica.jdbc.Driver");
        // connect to the "orders" database created with admintools in the next section
        Connection connection = DriverManager.getConnection(
                "jdbc:vertica://localhost:5433/orders", "dbadmin", "lalit");
        connection.createStatement().executeUpdate("set search_path to example");
        PreparedStatement statement = connection.prepareStatement(
                "insert into orders values(?, ?, ?, ?, ?, ?, ?, ?)");

        Random random = new Random();
        String[] prodCats = {"CE", "APP", "ACC", "ELEC", "PNS", "FRN"};
        String[] country = {"IN", "US", "GB", "SW", "AUS", "KOR", "JAP"};
        String[] city = {"Mumbai", "Delhi", "London", "Tokyo", "Melbourne", "New York", "Malaysia"};
        Date date = new Date(System.currentTimeMillis());

        for (int i = 0; i < 100000; i++) {
            if (i % 10000 == 0) date.setTime(date.getTime() - 84600);
            statement.setInt(1, i);
            statement.setInt(2, random.nextInt(100));
            statement.setString(3, prodCats[random.nextInt(6)]);
            statement.setInt(4, random.nextInt(1000));
            statement.setString(5, country[random.nextInt(7)]);
            statement.setString(6, city[random.nextInt(7)]);
            statement.setInt(7, random.nextInt(50));
            statement.setDate(8, date);
            statement.execute();
        }
        statement.close();
        connection.close();
    }
}
example_queries.sql
select count(*) from orders group by product_category;
select count(*) from orders group by user_country;

Add all complex, frequently used analytic queries here.
Create
the Database:
Switch
to the dbadmin Linux user to create and start the Vertica
database. Open a new terminal and execute the command "su -
dbadmin".
Create
a new folder named "example" and move into it. This folder
must be writable by dbadmin, so create it under that user's home directory
using the terminal.
Execute the
/opt/vertica/bin/admintools
command to launch the admin tools. Select "Configuration Menu" and
click OK.
Select
"Create Database", click OK, provide the database name "orders"
and click OK. On the
next screen provide a password
(optional) or skip it and click
OK.
Then
select the host (only one host will be visible). Keep the
paths as they are or provide the new directory you created; Vertica will
store its data files there.
Select Yes to create the database and OK on the success message. On the next screen press Cancel to exit the admin tools, and type the "vsql" command in the
terminal to interact with Vertica from the command line. Provide the
password if you set one during database creation.
Execute the
command \i example.sql at the
vsql prompt to create the schema and table.
Exit vsql by typing \q.
Now
run
VerticaInsert.java
from Eclipse to insert sample data for the Database Designer. This
will take several minutes; JDBC batch updates can be used to improve the speed (see the sketch below).
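For reference, here is a minimal sketch of the batched variant, using standard JDBC batching (addBatch/executeBatch) with the same connection settings and random data as VerticaInsert.java. The class name VerticaBatchInsert and the batch size of 1000 are just illustrative choices; the actual speedup depends on the Vertica JDBC driver.

VerticaBatchInsert.java

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;

public class VerticaBatchInsert {
    public static void main(String[] args) throws Exception {
        Class.forName("com.vertica.jdbc.Driver");
        Connection connection = DriverManager.getConnection(
                "jdbc:vertica://localhost:5433/orders", "dbadmin", "lalit");
        connection.createStatement().executeUpdate("set search_path to example");
        PreparedStatement statement = connection.prepareStatement(
                "insert into orders values(?, ?, ?, ?, ?, ?, ?, ?)");

        Random random = new Random();
        String[] prodCats = {"CE", "APP", "ACC", "ELEC", "PNS", "FRN"};
        String[] country = {"IN", "US", "GB", "SW", "AUS", "KOR", "JAP"};
        String[] city = {"Mumbai", "Delhi", "London", "Tokyo", "Melbourne", "New York", "Malaysia"};
        Date date = new Date(System.currentTimeMillis());

        for (int i = 0; i < 100000; i++) {
            statement.setInt(1, i);
            statement.setInt(2, random.nextInt(100));
            statement.setString(3, prodCats[random.nextInt(prodCats.length)]);
            statement.setInt(4, random.nextInt(1000));
            statement.setString(5, country[random.nextInt(country.length)]);
            statement.setString(6, city[random.nextInt(city.length)]);
            statement.setInt(7, random.nextInt(50));
            statement.setDate(8, date);
            statement.addBatch();            // queue the row instead of executing it immediately
            if ((i + 1) % 1000 == 0) {
                statement.executeBatch();    // send 1000 rows per round trip
            }
        }
        statement.executeBatch();            // flush any remaining rows
        statement.close();
        connection.close();
    }
}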
The
Database Designer can then analyze the schema, the nature of the data, and the queries to be
executed, and generate optimal partitions and projections.
Using the
Database Designer:
This step is not mandatory for the tutorial, but it is important
for designing an optimized Vertica database for a real application.
Again
open the admin tools, select Configuration Menu -> Run
Database Designer, select the "orders"
database and provide the password.
Provide a folder path to store the output of the Database Designer
and provide the design name "orders_design".
Select the Design Type as Comprehensive.
Select the "example" schema and select all design
options in the next window.
Provide the
path of the sample queries file example_queries.sql
and K-Safety as 2. K-Safety up to 3 is officially supported by
Vertica.
Select the designer's design priority as Balanced and click
OK. Proceed through the messages and exit the admin tools.
Database
design and setup is done. Now we can check the database using vsql.
Delete the sample data from the table so it can be used with the real data synced from the ETL.
To start the database, open admin
tools -> Start Database and follow the wizard; to shut it down, use admin tools
-> Stop
Database.
Step #3: Configure ETL
using Talend DI
Talend
provides an Eclipse-based IDE and has projects similar to Eclipse
workspaces. Generally one project (workspace) is created per ETL solution
for manageability.
Start Talend by double-clicking the
TOS_BD-linux-gtk-x86_64 executable in TALEND_HOME. The first time, Talend
will ask you to create a project. Create one and it will appear selected in the
project list. Click the Open button on the right side to start the
IDE, then close the welcome window to start creating the ETL.
Right-click on Job Designs in the
Repository section on the
left side and fill in the wizard. Provide the name "VoltDBToVerticaSync"
and click Finish. An empty
canvas will appear with a palette on the right side.
Now we are ready
to create the ETL job. The ETL flow has 3 steps:
- Pick up the files exported by VoltDB and process them.
- For each record in a file, map it to the table in Vertica (direct mapping, no transformation) and add the create_date column in the mapping as the current date.
- Write this to Vertica in bulk, since even with a 15-30 minute export interval there will be a lot of data.
Drag and drop the
following components from the palette onto the canvas of the created job:
- tFileList to iterate over the directory where the files exported from VoltDB are stored. This component supports configuration such as sorting files by name or modified date and file-name pattern filters.
- tFileInputDelimited to process a file row by row. We need to define a schema for the file: column types, positions, etc. It supports advanced CSV options.
- tMap to map inputs to outputs. This component has one row input, multiple lookup inputs, and many outputs. We can perform joins between the row input and lookup tables; in our case the exported CSV contains identifiers like product_id and user_id, so we could look up the product or user tables for additional columns by those identifiers. The row input is processed record by record, but lookups are loaded all at once before processing the row input.
- tVerticaOutput to write the output of the mapping to the Vertica database.
After dropping all the
components on the canvas, align them in a row and connect them. To connect
two components, right-click one and select "Row -> [Iterate |
Main | New Output]" as shown in the figure below.
After
connecting, there will be an error on the tVerticaOutput component: "Module
vertica-jdk5-6.0.0-0.jar
is required". To fix this,
go to the Modules tab, scroll down to locate tVerticaOutput and click
on it. Then make a copy of the Vertica Java connector jar
/opt/vertica/java/lib/vertica-jdk5-6.1.2-0.jar
and rename the copy to vertica-jdk5-6.0.0-0.jar.
Click on the jar icon and select this jar.
Now configure each component. To
configure one, click the component to select it and go to the "Component"
view.
tFileList
Enter the path of the
directory where VoltDB exports files and the file name pattern. Also sort
files in ascending order by modified date so they are processed in the
sequence in which they were created.
tFileInputDelimited
Specify the file name by typing "curr"
and pressing Ctrl+Space to get suggestions, then select the
suggestion shown in the image. Set the other file configuration as shown.
Configure the schema
for the input file:
Click the button
next to "Edit Schema" to open the dialog and add schema
rows by clicking the '+' button. Create each row in the specified
order (as in the file) and provide its name, type, etc.
Click OK when done
and 'Yes' when prompted to propagate the schema.
Note: Create
"product_category" instead of "prod_category" as the third
column.
tMap
Double-click on
tMap to open the mapper.
Click the yellow
header on the left to select all rows, then drag them to the 'orders'
box on the right and drop them there to map them.
Create the entry
highlighted in blue at the bottom right by clicking the '+' button.
Click OK to save
and close.
tVerticaOutput
As
export tables are insert only, updates must also be inserted into the
export tables to be exported. In this example the procedure does not update;
if some procedure does update, it must insert the updated data into the
export tables. The target system
(in our case Vertica) must then implement the sync in an insert-or-update manner
so data is merged/updated rather than duplicated; one simple way to do this
is sketched below.
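The exact way to do this depends on the options the output component offers in your Talend version, so as an illustration only, here is a hedged JDBC sketch of one simple insert-or-update approach (delete any existing row with the same key, then insert the new version), reusing the connection settings from VerticaInsert.java. UpsertOrderExample and its hard-coded row are hypothetical names used only for this example.

UpsertOrderExample.java

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertOrderExample {

    // Applies one exported order row so that re-processing the same row does not create duplicates.
    public static void upsert(Connection connection, int orderId, int productId, String productCategory,
                              int userId, String userCountry, String userCity, int userAge) throws Exception {
        // 1. remove any previous version of this order
        PreparedStatement delete = connection.prepareStatement("delete from orders where order_id = ?");
        delete.setInt(1, orderId);
        delete.executeUpdate();
        delete.close();
        // 2. insert the latest version, stamping create_date with the current date
        PreparedStatement insert = connection.prepareStatement(
                "insert into orders values (?, ?, ?, ?, ?, ?, ?, current_date)");
        insert.setInt(1, orderId);
        insert.setInt(2, productId);
        insert.setString(3, productCategory);
        insert.setInt(4, userId);
        insert.setString(5, userCountry);
        insert.setString(6, userCity);
        insert.setInt(7, userAge);
        insert.executeUpdate();
        insert.close();
    }

    public static void main(String[] args) throws Exception {
        Class.forName("com.vertica.jdbc.Driver");
        Connection connection = DriverManager.getConnection(
                "jdbc:vertica://localhost:5433/orders", "dbadmin", "lalit");
        connection.createStatement().executeUpdate("set search_path to example");
        upsert(connection, 11, 101, "CE", 1, "IN", "Mumbai", 25);
        connection.close();
    }
}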
There
is a trade-off between exporting the complete row and exporting only the
modified columns in the update scenario: if export is enabled only after
some time, earlier rows will not be there to update, and exporting the
complete row has the overhead of fetching the row, since update procedures
receive only the columns
to be updated. One option is to sync
all historical data before enabling the ongoing sync.
Sync
history data to the external system:
Another way to sync data to an external
system is to query the actual tables. To do this incrementally
you would need to store a timestamp with the data and query for rows after
the previously polled timestamp. But since this timestamp may not be suitable
for partitioning, the query would be multi-partition and therefore less
performant. So we will use this approach only for querying all data in a
one-time history sync.
To query all the data we can run a select
query against the tables using the generic JDBC component in Talend (there
is no dedicated component for VoltDB), or use the HTTP request or REST
components against the REST API provided by VoltDB. We will use the generic
JDBC component.
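Before wiring this into Talend, the same one-time extraction can be sanity-checked from plain Java. This is only a sketch under the assumption that the driver class is org.voltdb.jdbc.Driver and the URL form is jdbc:voltdb://host:21212 (the default client port); confirm both against the VoltDB JDBC documentation for your version.

VoltDbHistoryRead.java

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VoltDbHistoryRead {
    public static void main(String[] args) throws Exception {
        // voltdbclient-3.5.0.1.jar (plus its dependencies, e.g. guava) must be on the classpath.
        Class.forName("org.voltdb.jdbc.Driver");
        Connection connection = DriverManager.getConnection("jdbc:voltdb://localhost:21212");
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery("select * from orders");
        while (rs.next()) {
            System.out.println(rs.getInt("order_id") + ", " + rs.getString("product_category"));
        }
        rs.close();
        statement.close();
        connection.close();
    }
}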
Create a new job
"VoltDBToVerticaHistorySync",
drag and drop the tJDBCInput
component, and configure it:
To add the driver
jars, click the '+' button, then click the button at the right end of the
added row to browse.
Select
'External modules', browse to
"VOLTDB_HOME/voltdb/voltdbclient-3.5.0.1.jar" and click OK; this jar
contains the JDBC driver class for VoltDB. Also add "guava-12.0.1.jar" in the
same way, except that it is an internal module, so
rather than browsing we select it from the list.
Add tMap and
tVerticaOutput as in the previous job. Now run the job in Talend to sync the
one-time history data. We run this one in Talend because it is a one-time
process; the other job we will schedule to run at a regular
interval.
Run the job in Talend:
Also do this to
test the previous job.
Schedule the
VoltDBToVerticaSync
job using crontab:
Export the
job by right-clicking on it in the Repository section as shown below:
Define the export settings (the defaults are sufficient; just specify or note the export path) and click Finish.
The
archive will be created in TALEND_HOME with the job's name. Extract it
to any suitable location. There will be a VoltDBToVerticaSync_run.sh script
in the VoltDBToVerticaSync
folder. This script executes the job; make an entry for
it in crontab to
schedule it.
Open a
terminal, execute the command crontab -e and
make an entry in the vi editor:
*/15 * * * * <path to VoltDBToVerticaSync_run.sh>
to run at minutes 0, 15, 30 and 45 of every hour, or
5,20,35,50 * * * * <path to
VoltDBToVerticaSync_run.sh>
for a specific list of minutes.
Then save it using the ":wq" command. Done, the job is
scheduled!
Now,
every time an order is saved using the SaveOrder procedure in VoltDB,
it is also queued for export,
and the export client fetches it and writes it to a file every 15
minutes. The Talend job then processes the file and writes the data to Vertica.
Summary
We
created a demonstration of a complete database ecosystem for an
application, in which business data is stored in an OLTP system, an
OLAP system stores the data for historical analysis, and an ETL syncs
live data from the OLTP to the OLAP system.
For
this we used best-in-breed, purpose-built NewSQL DBMS
technologies.
That's
it! Feel free to post any doubts or issues faced while working through
this tutorial, or anything regarding the technologies used.