This
blog is about using best-of-breed database technologies and
integrating them to sync data in near real time. VoltDB
specializes in OLTP, while Vertica is an analytic platform
specialized for OLAP.
Here,
I will describe the installation required to get both databases
running (a single-node installation with K-safety of 0). Setting up a
cluster with K-safety needs additional host machines and configuration.
Since this post does not cover cluster setup, K-safety, or the other
configuration required for production use, it is aimed not at DBAs but
at developers, programmers, and ETL engineers who want to install the
databases on a local/development machine to test an application or ETL jobs.
Further
posts will cover writing UDFs (user-defined functions) and other
topics from a database developer's perspective. Feel free to request
anything related to this.
OLTP and OLAP in
traditional DBMS and NewSQL DBMS (VoltDB and Vertica)
This blog is about emerging NewSQL
DBMSs and how this paradigm shift outperforms traditional RDBMSs. I
will discuss the OLAP database Vertica Analytic Platform and the
high-performance OLTP RDBMS VoltDB. Both originate from
database research pioneer Michael
Stonebraker (Vertica has since been acquired by HP). The idea behind these
two DBMSs is that traditional RDBMSs tend to provide an all-in-one
(general-purpose, OLTP+OLAP) solution and were not designed for large
volumes of data (Big Data); they are either not scalable or very complex
to scale and maintain. OLTP and OLAP have different needs, so these two
DBMSs were redesigned and developed from the ground up, each purpose-built
for one of those needs. Both scale out horizontally: simply add
more machines to the cluster to handle more load, provide K-safety, and so on.
To get the full set of functionality (high-volume
transactions plus analytics with complex queries), we need to
integrate both DBMSs. For example, daily business activities need an OLTP
database into which data is populated (inserts/updates), while analyzing
that data (historical analysis) with a BI (Business Intelligence)
solution needs an OLAP database. The two have different requirements:
OLTP needs a write-optimized DBMS and OLAP needs a read-optimized
one. OLTP and OLAP data modeling also differ. In OLTP the
schema is normalized, for example 3NF, while OLAP uses a snowflake or
star schema in which fact tables are surrounded by dimension tables
(dimensional modeling).
Traditional DBMSs are general purpose,
so the same DBMS was used for both. NewSQL evolved as purpose-built
systems, each optimized for a specific use.
To demonstrate this, we will consider the
following use case:
An eCommerce website (popular, and hence
high volume) stores the orders placed and also needs to analyze user
trends, for example by geographic location.
We will use the following technologies:
- VoltDB [Open Source Edition, 3.5.0.1] for the OLTP database. Size ~19MB
- Vertica [Community Edition, 6.1.2] for the OLAP database. Size ~80MB
- Talend DI for ETL (synchronizing from VoltDB to Vertica). Size ~600MB
The OS is Fedora 19 64-bit (x86_64
architecture; check the Linux architecture with the 'uname -m' command) on
decent hardware.
Step #1: Install and
configure VoltDB, Vertica and Talend
Detailed installation steps and fine-tuned
configuration can be found in the corresponding installation guides
or getting-started documents.
Install
VoltDB:
To download, go to voltdb.com
and register for the download. A
download link will be sent to your email. Download VoltDB
(voltdb-3.5.0.1.tar.gz) and extract it; the extracted directory is referred
to as VOLTDB_HOME in this tutorial. Add VOLTDB_HOME/bin to $PATH (edit
the ~/.bashrc file). Installation done!
This installation contains documentation, a
web console (localhost:8080/studio) and a JSON-based REST API for
accessing VoltDB; the REST API is documented at
http://voltdb.com/docs/UsingVoltDB/ProgLangJson.php
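As a quick illustration of the JSON API, the sketch below issues an ad hoc query over HTTP from plain Java. It is only a sketch: it assumes the sample database from Step #2 is already running with the HTTP port and JSON API enabled, and that the endpoint and parameter names (/api/1.0/, Procedure, Parameters) match the documentation linked above; check that page for the exact interface of your VoltDB version.

JsonApiExample.java

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class JsonApiExample {
    public static void main(String[] args) throws Exception {
        // The JSON interface takes a procedure name and a JSON array of parameters as URL query arguments.
        // @AdHoc is VoltDB's built-in procedure for running an arbitrary SQL string.
        String params = URLEncoder.encode("[\"select count(*) from orders\"]", "UTF-8");
        URL url = new URL("http://localhost:8080/api/1.0/?Procedure=@AdHoc&Parameters=" + params);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // JSON document containing status and result tables
        }
        reader.close();
        conn.disconnect();
    }
}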
Install
Vertica:
Go to my.vertica.com
(register and log in) and download Vertica for Fedora
(vertica-6.1.2-0.x86_64.FC.rpm).
Prepare Linux for installation:
Disable
SELinux by editing the SELinux configuration file: in a terminal, log in
as root, execute the command vi /etc/sysconfig/selinux and set
"SELINUX=disabled" in this file.
Edit
"/etc/pam.d/su" (vi /etc/pam.d/su)
and add "session required pam_limits.so" if it is not already in the file.
Now install Vertica
with "rpm -ivh vertica-6.1.2-0.x86_64.FC.rpm" (change to the directory
where the rpm was downloaded before executing the command). A few inputs
will be asked for, such as the installation path and the dbadmin user.
After installation a
new Linux user "dbadmin" will be created.
Switch to the "dbadmin" user
with the command "su - dbadmin" in the terminal to set up and create the
database.
Install
Talend DI:
Download Talend Open Studio for Data
Integration v5.3.1 from
http://www.talend.com/download/data-integration
and extract the archive. The extracted folder is referred to as TALEND_HOME.
Step #2: Set up
databases for both DBMSs
Data
Modeling:
As this tutorial is meant to demonstrate the
technologies, we will not focus on modeling concepts; instead we will have a
single table in both databases and sync its data. You can learn data modeling at
http://www.learndatamodeling.com
Orders table OLTP [VoltDB]:
order_id <primary key>
item_id <item identifier, referencing some warehouse table>
product_id <identifies a product in some product table>
product_category <category of product>
user_id <customer related information>
user_country
user_city
user_age
In a real application, product- and user-related
information would be stored in separate tables and referenced from here;
for simplicity there is a single table.
product_category will be the partition
column, because both the application and the business can be separated
along it and it produces evenly distributed data. If there were
separate product and user tables, they could be partitioned on
product_category and user_country respectively, as they are expected
to hold large amounts of data; if the expected data is small, they can be
kept as replicated tables. See more on partitioning and replication in
the "Using VoltDB" document and at
http://voltdb.com/resources/volt-university/tutorials/section-1-4/
Orders table OLAP [Vertica]:
order_id <primary key>
product_id <identifies a product in some product table>
product_category <category of product>
user_id <customer related information>
user_country
user_city
user_age
create_date <populated by the ETL>
In a real application, product and user
would be separate dimension tables and orders would be the fact table.
The partitioning strategy remains the same. The columns differ between
the two databases' tables because not all of the OLTP data is required
for analytics.
Create
the VoltDB database and configure it for synchronization:
VoltDB
compiles DDL and Java stored procedures into a single jar, called a
catalog, which is then deployed to VoltDB. We need to bundle transactions
(data access logic) as stored procedures written in Java. We can
execute ad hoc queries, but they do not take advantage of VoltDB's
architecture and will not be as performant. We also configure
incremental export of data so it can be synced to Vertica using the ETL.
There
is a deployment.xml configuration file to configure and enable
features of the VoltDB database.
We need the following
to compile and generate the database catalog:
DDL to create the regular and export tables, register the stored procedures
written in Java, and partition the tables and procedures.
sample.sql

--create table for storing orders data
--add the partition column to the primary key to guarantee uniqueness across all
--partitions in the database and avoid unique constraint violations while repartitioning
create table orders (
    order_id integer not null,
    item_id integer,
    product_id integer,
    product_category varchar(30) not null,
    user_id integer,
    user_country varchar(20),
    user_city varchar(20),
    user_age integer,
    primary key(order_id, product_category)
);

--table for exporting selected columns from the orders table
--only insert is allowed for export tables, as data is queued and
--fetched by an export client. This feature is for incremental sync
--of data to an external system
create table orders_export (
    order_id integer not null,
    product_id integer,
    product_category varchar(30) not null,
    user_id integer,
    user_country varchar(20),
    user_city varchar(20),
    user_age integer
);

--VoltDB does not support auto increment; to implement it we can have a table
--to store max + 1 as the next value of the identifier field and query this table
--in a stored procedure.
create table auto_increment (
    table_name varchar(50) not null,
    next_value integer,
    primary key(table_name)
);

--Mark orders_export table as export only
EXPORT TABLE orders_export;

--Partition orders table. No need to partition the export table as no data is stored for it.
partition table orders on column product_category;

--This is a small table and suited as a replicated table, but we need to write
--to it while getting and incrementing the next value for a table, so partition it on
--its primary key
partition table auto_increment on column table_name;

--register the stored procedures written in Java
CREATE PROCEDURE FROM CLASS SaveOrder;
CREATE PROCEDURE FROM CLASS AutoIncrement;

--Partition each stored procedure on the same column as its table and provide the parameter
--index of the partition column in the arguments passed to the procedure. By default it is
--expected as the first argument; in our procedure it is the 4th argument (index 3).
PARTITION PROCEDURE SaveOrder ON TABLE orders COLUMN product_category PARAMETER 3;
PARTITION PROCEDURE AutoIncrement ON TABLE auto_increment COLUMN table_name;
Deployment
configuration file.
deployment.xml

<?xml version="1.0"?>
<deployment>
    <!-- Single host local deployment -->
    <cluster hostcount="1" sitesperhost="2"/>

    <!-- Directories for storing snapshots, export overflow and other files generated by VoltDB -->
    <paths>
        <exportoverflow path="/home/lalit/Softwares/VoltDB/sample/export"/>
        <snapshots path="/home/lalit/Softwares/VoltDB/sample/snapshots"/>
        <voltdbroot path="/home/lalit/Softwares/VoltDB/sample/root"/>
    </paths>

    <!-- Enable the web console and REST API to interact with VoltDB. Apart from the JDBC
         driver and Java client, VoltDB can be accessed using the JSON based REST API to
         execute queries -->
    <httpd enabled="true">
        <jsonapi enabled="true"/>
    </httpd>

    <!-- VoltDB is an in-memory database and provides durability by writing data to file
         at regular intervals. Also, before shutdown the database should be paused and
         saved to ensure all data is written to disk, and on startup it should be restored.
         This configuration saves snapshots to the path configured in <paths> every 5
         minutes and keeps the 3 most recent snapshots.
         Snapshots save all data in tables except tables marked as export only. -->
    <snapshot prefix="sample" frequency="5m" retain="3"/>

    <!-- This configuration enables export functionality and uses the export-to-file export
         client to write exported data to file. Other export clients are available, like the
         JDBC client to write data directly to another database and the Hadoop client to
         export data to Hadoop. One can write a custom export client as needed.
         Export is for integrating VoltDB with other systems. To export data we need to
         create tables marked as export only. All inserts to export-only tables go to a
         queue and the export client fetches from the queue, hence incremental export. On
         overflow of the queue, data is written to disk at the location specified in <paths>.
         Enabling the skipinternals option removes the transaction id, partition id,
         timestamp and similar metadata from the export and exports only the table data. -->
    <export enabled="true">
        <onserver exportto="file">
            <configuration>
                <property name="type">csv</property>
                <property name="nonce">sample</property>
                <property name="period">15</property>
                <property name="outdir">/home/lalit/Softwares/VoltDB/sample/export</property>
                <property name="skipinternals">true</property>
                <property name="with-schema">true</property>
            </configuration>
        </onserver>
    </export>
</deployment>
More on the deployment XML:
https://voltdb.com/docs/UsingVoltDB/ConfigStructure.php
Java classes for the stored procedures.
SaveOrder.java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

public class SaveOrder extends VoltProcedure {

    private final SQLStmt insert =
            new SQLStmt("insert into orders values (?, ?, ?, ?, ?, ?, ?, ?)");

    private final SQLStmt export =
            new SQLStmt("insert into orders_export values (?, ?, ?, ?, ?, ?, ?)");

    /**
     * VoltDB procedures are subclasses of {@link VoltProcedure} and run implicitly in a transaction.
     *
     * @param orderId
     * @param itemId
     * @param productId
     * @param productCategory
     * @param userId
     * @param userCountry
     * @param userCity
     * @param userAge
     * @return
     * @throws VoltAbortException
     */
    public long run(int orderId, int itemId, int productId, String productCategory,
                    int userId, String userCountry, String userCity, int userAge)
            throws VoltAbortException {
        // insert data into the orders table and the export table
        voltQueueSQL(insert, orderId, itemId, productId, productCategory, userId,
                userCountry, userCity, userAge);
        voltQueueSQL(export, orderId, productId, productCategory, userId, userCountry,
                userCity, userAge);
        voltExecuteSQL();
        // procedures must return long, Long, VoltTable or VoltTable[], so return a value
        return orderId;
    }
}
AutoIncrement.java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;
import org.voltdb.VoltType;

public class AutoIncrement extends VoltProcedure {

    private final SQLStmt autoIncrementSelect =
            new SQLStmt("select next_value from auto_increment where table_name=?");

    private final SQLStmt autoIncrementUpdate =
            new SQLStmt("update auto_increment set next_value = ? where table_name=?");

    private final SQLStmt autoIncrementInsert =
            new SQLStmt("insert into auto_increment values (?, ?)");

    public long run(String table) {
        // Get next value for the orders table, if null use 1
        voltQueueSQL(autoIncrementSelect, "orders");
        VoltTable[] result = voltExecuteSQL();
        Integer nextValueOrders = 1;
        if (result.length > 0 && result[0].getRowCount() > 0) {
            nextValueOrders = (Integer) result[0].fetchRow(0).get(0, VoltType.INTEGER);
        }
        // update the auto increment table
        if (nextValueOrders > 1) {
            voltQueueSQL(autoIncrementUpdate, nextValueOrders + 1, "orders");
        } else {
            voltQueueSQL(autoIncrementInsert, "orders", nextValueOrders + 1);
        }
        voltExecuteSQL();
        return nextValueOrders;
    }
}
Create
a Java project in Eclipse and add the jars from the VOLTDB_HOME/voltdb and
VOLTDB_HOME/lib folders. Also create a client that uses these procedures
to insert data and test the application.
Application.java
import java.io.IOException;

import org.voltdb.VoltTable;
import org.voltdb.VoltTableRow;
import org.voltdb.VoltType;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;
import org.voltdb.client.NoConnectionsException;
import org.voltdb.client.ProcCallException;

public class Application {

    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // TODO modify the AutoIncrement procedure to accept an int arg that sets the next value,
        // to avoid calling this get-and-increment for every row in a bulk load.
        int orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 1, 101, "CE", 1, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 2, 101, "CE", 2, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 3, 101, "CE", 3, "US", "New York", 34);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 4, 107, "APP", 4, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 5, 101, "CE", 5, "GB", "London", 23);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 6, 101, "CE", 6, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 7, 101, "CE", 7, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 8, 103, "APP", 8, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 9, 101, "CE", 9, "IN", "Mumbai", 25);

        orderId = getNextValueForTable(client);
        client.callProcedure("SaveOrder", orderId, 10, 102, "CE", 10, "IN", "Mumbai", 25);

        client.drain();
        client.close();
    }

    private static int getNextValueForTable(Client client)
            throws NoConnectionsException, IOException, ProcCallException {
        ClientResponse response = client.callProcedure("AutoIncrement", "orders");
        if (response.getStatus() != ClientResponse.SUCCESS) {
            System.out.println("Failed to retrieve the next order id");
            System.exit(-1);
        }
        VoltTable[] results = response.getResults();
        if (results.length > 0) {
            VoltTable result = results[0];
            if (result.getRowCount() > 0) {
                VoltTableRow row = result.fetchRow(0);
                return ((Integer) row.get(0, VoltType.INTEGER)).intValue();
            }
        }
        return 1;
    }
}
Inserts
could also be done using sqlcmd (a
command line tool) or the REST API, but this approach was chosen to show how
to develop a VoltDB client application. Note that procedure calls can return values.
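A related point: besides the synchronous callProcedure calls used above, the VoltDB Java client can invoke procedures asynchronously with a callback, which is generally preferred for bulk loading. The following is a minimal, hedged sketch of that pattern; it reuses the SaveOrder procedure defined earlier with a hypothetical order id of 999, and error handling is kept to a bare minimum.

AsyncInsertExample.java

import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;
import org.voltdb.client.ProcedureCallback;

public class AsyncInsertExample {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");
        // Asynchronous invocation: the callback runs when the server responds,
        // so the calling thread is free to queue more invocations in the meantime.
        client.callProcedure(new ProcedureCallback() {
            @Override
            public void clientCallback(ClientResponse response) {
                if (response.getStatus() != ClientResponse.SUCCESS) {
                    System.err.println("SaveOrder failed: " + response.getStatusString());
                }
            }
        }, "SaveOrder", 999, 1, 101, "CE", 1, "IN", "Mumbai", 25);
        client.drain(); // block until all queued invocations have completed
        client.close();
    }
}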
Compile and create the
catalog:
Create
a directory named "sample" to keep all database-related files, and create
the sample.sql and deployment.xml files from the previous section in it.
Also copy the compiled SaveOrder.class and AutoIncrement.class files from
the Eclipse project's bin directory.
Open a
terminal and move to the sample directory.
Execute the
command voltdb compile --classpath="./" -o sample.jar
sample.sql to generate the
deployables for the database. Check the output for any errors; if it
succeeds, move to the next step.
Create the
directories configured in the deployment.xml <paths>
element (root, export and snapshots in this example). Now deploy the database
with the command voltdb create host localhost catalog
sample.jar deployment deployment.xml
In
another terminal use the sqlcmd command
to test the database, or open localhost:8080/studio
in a browser.
Now execute the Application.java main method (in Eclipse) to
insert data. Verify the data was inserted, and after
5 minutes check the data exported into the export folder.
Execute this query from the studio in the browser or from sqlcmd and
check the incremental export:
insert
into orders_export values(11, 101, 'CE', 1, 'IN', 'Mumbai', 25);
Create
the Vertica Database
For
data modeling we will use the same schema as the export table in VoltDB,
with an additional column populated by the ETL.
DDL
for creating the schema and table:
example.sql
--Only one database runs at a time in a Vertica cluster.
--Rather than creating databases we create schemas, but a schema differs
--from an Oracle schema in that it is not associated with a user.
create schema example;

create table example.orders (
    order_id integer not null primary key,
    product_id integer,
    product_category varchar(30) not null,
    user_id integer,
    user_country varchar(20),
    user_city varchar(20),
    user_age integer,
    create_date date
);
This script creates a schema and
one table in it. Vertica is a column-oriented database and
achieves its performance by means of partitioning and projections.
Partitioning, K-safety, and cluster scalability are similar to VoltDB.
Projections are collections of columns stored with a specific ordering
or grouping for particular queries; pre-join projections are pre-computed
inner joins of tables. For each table there is a super
projection, which contains all of its columns, and users can create
additional query-specific or pre-join projections. This is similar to a
materialized view in other DBMSs.
Partitioning and
projections let database designers plan the design early. Vertica
provides the Database Designer tool to create and deploy projections. It
requires the DDL schema, sample data in the tables, and sample queries in
order to propose projections. So let us load data into the table and create
a sample queries file to run the Database Designer.
Sample data must be similar to the
actual data; for example, it should not duplicate one record everywhere or
be completely random. If a column draws from a set of roughly 20 values,
generate a random number between 0 and 20 for it and insert about 100K records.
Below is a Java class that inserts records
using JDBC. Create a new class in the Java project and add the jars from
/opt/vertica/java/lib.
VerticaInsert.java
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;

public class VerticaInsert {

    public static void main(String[] args) throws Exception {
        Class.forName("com.vertica.jdbc.Driver");
        // connect to the "orders" database created with admintools in the next section
        Connection connection = DriverManager.getConnection(
                "jdbc:vertica://localhost:5433/orders", "dbadmin", "lalit");
        connection.createStatement().executeUpdate("set search_path to example");
        PreparedStatement statement = connection.prepareStatement(
                "insert into orders values(?, ?, ?, ?, ?, ?, ?, ?)");

        Random random = new Random();
        String[] prodCats = {"CE", "APP", "ACC", "ELEC", "PNS", "FRN"};
        String[] country = {"IN", "US", "GB", "SW", "AUS", "KOR", "JAP"};
        String[] city = {"Mumbai", "Delhi", "London", "Tokyo", "Melbourne", "New York", "Malaysia"};
        Date date = new Date(System.currentTimeMillis());

        for (int i = 0; i < 100000; i++) {
            if (i % 10000 == 0) date.setTime(date.getTime() - 84600);
            statement.setInt(1, i);
            statement.setInt(2, random.nextInt(100));
            statement.setString(3, prodCats[random.nextInt(6)]);
            statement.setInt(4, random.nextInt(1000));
            statement.setString(5, country[random.nextInt(7)]);
            statement.setString(6, city[random.nextInt(7)]);
            statement.setInt(7, random.nextInt(50));
            statement.setDate(8, date);
            statement.execute();
        }
        statement.close();
        connection.close();
    }
}
example_queries.sql
select count(*) from orders group by product_category;
select count(*) from orders group by user_country;

Add all complex, frequently used analytic queries here.
Create
the Database:
Switch
to the dbadmin Linux user to create and start the Vertica
database. Open a new terminal and execute the command "su -
dbadmin".
Create
a new folder named "example" and move into it. This folder
must be writable by dbadmin, so create it under that user's home directory
using the terminal.
Execute the
/opt/vertica/bin/admintools
command to launch the admin tools. Select "Configuration Menu" and
click OK.
Select
"Create Database", click OK, provide the database name "orders"
and click OK. On the
next screen provide a password
(optional) or skip it and click
OK.
Then
select the host (only one host will be visible). Keep the
paths as they are or provide the new directory you created; Vertica will
store its data files there.
Select Yes to create the database and OK on the success message. On the next screen press Cancel to exit the admin tools, and type the "vsql" command in the
terminal to interact with Vertica from the command line. Provide the
password if you set one during database creation.
Execute the
command \i example.sql at the
vsql prompt to create the schema and table.
Exit vsql by typing \q.
Now
run
VerticaInsert.java
from Eclipse to insert sample data for the Database Designer. This
will take several minutes; JDBC batch updates can be used to improve the speed (see the sketch below).
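For reference, here is a minimal sketch of the batched variant, using standard JDBC batching (addBatch/executeBatch) with the same connection settings and random data as VerticaInsert.java. The class name VerticaBatchInsert and the batch size of 1000 are just illustrative choices; the actual speedup depends on the Vertica JDBC driver.

VerticaBatchInsert.java

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;

public class VerticaBatchInsert {
    public static void main(String[] args) throws Exception {
        Class.forName("com.vertica.jdbc.Driver");
        Connection connection = DriverManager.getConnection(
                "jdbc:vertica://localhost:5433/orders", "dbadmin", "lalit");
        connection.createStatement().executeUpdate("set search_path to example");
        PreparedStatement statement = connection.prepareStatement(
                "insert into orders values(?, ?, ?, ?, ?, ?, ?, ?)");

        Random random = new Random();
        String[] prodCats = {"CE", "APP", "ACC", "ELEC", "PNS", "FRN"};
        String[] country = {"IN", "US", "GB", "SW", "AUS", "KOR", "JAP"};
        String[] city = {"Mumbai", "Delhi", "London", "Tokyo", "Melbourne", "New York", "Malaysia"};
        Date date = new Date(System.currentTimeMillis());

        for (int i = 0; i < 100000; i++) {
            statement.setInt(1, i);
            statement.setInt(2, random.nextInt(100));
            statement.setString(3, prodCats[random.nextInt(prodCats.length)]);
            statement.setInt(4, random.nextInt(1000));
            statement.setString(5, country[random.nextInt(country.length)]);
            statement.setString(6, city[random.nextInt(city.length)]);
            statement.setInt(7, random.nextInt(50));
            statement.setDate(8, date);
            statement.addBatch();            // queue the row instead of executing it immediately
            if ((i + 1) % 1000 == 0) {
                statement.executeBatch();    // send 1000 rows per round trip
            }
        }
        statement.executeBatch();            // flush any remaining rows
        statement.close();
        connection.close();
    }
}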
The
Database Designer can then analyze the schema, the nature of the data, and the queries to be
executed, and generate optimal partitions and projections.
Using the
Database Designer:
This step is not mandatory for the tutorial, but it is important
for designing an optimized Vertica database for a real application.
Again
open the admin tools, select Configuration Menu -> Run
Database Designer, select the "orders"
database and provide the password.
Provide a folder path to store the output of the Database Designer
and provide the design name "orders_design".
Select the Design Type as Comprehensive.
Select the "example" schema and select all design
options in the next window.
Provide the
path of the sample queries file example_queries.sql
and K-Safety as 2. K-Safety up to 3 is officially supported by
Vertica.
Select the designer's design priority as Balanced and click
OK. Proceed through the messages and exit the admin tools.
Database
design and setup is done. Now we can check the database using vsql.
Delete the sample data from the table so it can be used with the real data synced from the ETL.
To start the database, open admin
tools -> Start Database and follow the wizard; to shut it down, use admin tools
-> Stop
Database.
Step #3: Configure ETL
using Talend DI
Talend
provides an Eclipse-based IDE and has projects similar to Eclipse
workspaces. Generally one project (workspace) is created per ETL solution
for manageability.
Start Talend by double-clicking the
TOS_BD-linux-gtk-x86_64 executable in TALEND_HOME. The first time, Talend
will ask you to create a project. Create one and it will appear selected in the
project list. Click the Open button on the right side to start the
IDE, then close the welcome window to start creating the ETL.
Right-click on Job Designs in the
Repository section on the
left side and fill in the wizard. Provide the name "VoltDBToVerticaSync"
and click Finish. An empty
canvas will appear with a palette on the right side.
Now we are ready
to create the ETL job. The ETL flow has 3 steps:
- Pick up the files exported by VoltDB and process them.
- For each record in a file, map it to the table in Vertica (direct mapping, no transformation) and add the create_date column in the mapping as the current date.
- Write this to Vertica in bulk, since even with a 15-30 minute export interval there will be a lot of data.
Drag and drop the
following components from the palette onto the canvas of the created job:
- tFileList to iterate over the directory where the files exported from VoltDB are stored. This component supports configuration such as sorting files by name or modified date and file-name pattern filters.
- tFileInputDelimited to process a file row by row. We need to define a schema for the file: column types, positions, etc. It supports advanced CSV options.
- tMap to map inputs to outputs. This component has one row input, multiple lookup inputs, and many outputs. We can perform joins between the row input and lookup tables; in our case the exported CSV contains identifiers like product_id and user_id, so we could look up the product or user tables for additional columns by those identifiers. The row input is processed record by record, but lookups are loaded all at once before processing the row input.
- tVerticaOutput to write the output of the mapping to the Vertica database.
After dropping all the
components on the canvas, align them in a row and connect them. To connect
two components, right-click one and select "Row -> [Iterate |
Main | New Output]" as shown in the figure below.
After
connecting, there will be an error on the tVerticaOutput component: "Module
vertica-jdk5-6.0.0-0.jar
is required". To fix this,
go to the Modules tab, scroll down to locate tVerticaOutput and click
on it. Then make a copy of the Vertica Java connector jar
/opt/vertica/java/lib/vertica-jdk5-6.1.2-0.jar
and rename the copy to vertica-jdk5-6.0.0-0.jar.
Click on the jar icon and select this jar.
Now configure each component. To
configure one, click the component to select it and go to the "Component"
view.
tFileList
Enter the path of the
directory where VoltDB exports files and the file name pattern. Also sort
files in ascending order by modified date so they are processed in the
sequence in which they were created.
tFileInputDelimited
Specify the file name by typing "curr"
and pressing Ctrl+Space to get suggestions, then select the
suggestion shown in the image. Set the other file configuration as shown.
Configure the schema
for the input file:
Click the button
next to "Edit Schema" to open the dialog and add schema
rows by clicking the '+' button. Create each row in the specified
order (as in the file) and provide its name, type, etc.
Click OK when done
and 'Yes' when prompted to propagate the schema.
Note: Create
"product_category" instead of "prod_category" as the third
column.
tMap
Double-click on
tMap to open the mapper.
Click the yellow
header on the left to select all rows, then drag them to the 'orders'
box on the right and drop them there to map them.
Create the entry
highlighted in blue at the bottom right by clicking the '+' button.
Click OK to save
and close.
tVerticaOutput
As
export tables are insert only, updates must also be inserted into the
export tables to be exported. In this example the procedure does not update;
if some procedure does update, it must insert the updated data into the
export tables. The target system
(in our case Vertica) must then implement the sync in an insert-or-update manner
so data is merged/updated rather than duplicated; one simple way to do this
is sketched below.
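The exact way to do this depends on the options the output component offers in your Talend version, so as an illustration only, here is a hedged JDBC sketch of one simple insert-or-update approach (delete any existing row with the same key, then insert the new version), reusing the connection settings from VerticaInsert.java. UpsertOrderExample and its hard-coded row are hypothetical names used only for this example.

UpsertOrderExample.java

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertOrderExample {

    // Applies one exported order row so that re-processing the same row does not create duplicates.
    public static void upsert(Connection connection, int orderId, int productId, String productCategory,
                              int userId, String userCountry, String userCity, int userAge) throws Exception {
        // 1. remove any previous version of this order
        PreparedStatement delete = connection.prepareStatement("delete from orders where order_id = ?");
        delete.setInt(1, orderId);
        delete.executeUpdate();
        delete.close();
        // 2. insert the latest version, stamping create_date with the current date
        PreparedStatement insert = connection.prepareStatement(
                "insert into orders values (?, ?, ?, ?, ?, ?, ?, current_date)");
        insert.setInt(1, orderId);
        insert.setInt(2, productId);
        insert.setString(3, productCategory);
        insert.setInt(4, userId);
        insert.setString(5, userCountry);
        insert.setString(6, userCity);
        insert.setInt(7, userAge);
        insert.executeUpdate();
        insert.close();
    }

    public static void main(String[] args) throws Exception {
        Class.forName("com.vertica.jdbc.Driver");
        Connection connection = DriverManager.getConnection(
                "jdbc:vertica://localhost:5433/orders", "dbadmin", "lalit");
        connection.createStatement().executeUpdate("set search_path to example");
        upsert(connection, 11, 101, "CE", 1, "IN", "Mumbai", 25);
        connection.close();
    }
}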
There
is a trade-off between exporting the complete row and exporting only the
modified columns in the update scenario: if export is enabled only after
some time, earlier rows will not be there to update, and exporting the
complete row has the overhead of fetching the row, since update procedures
receive only the columns
to be updated. One option is to sync
all historical data before enabling the ongoing sync.
Sync
history data to the external system:
Another way to sync data to an external
system is to query the actual tables. To do this incrementally
you would need to store a timestamp with the data and query for rows after
the previously polled timestamp. But since this timestamp may not be suitable
for partitioning, the query would be multi-partition and therefore less
performant. So we will use this approach only for querying all data in a
one-time history sync.
To query all the data we can run a select
query against the tables using the generic JDBC component in Talend (there
is no dedicated component for VoltDB), or use the HTTP request or REST
components against the REST API provided by VoltDB. We will use the generic
JDBC component.
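Before wiring this into Talend, the same one-time extraction can be sanity-checked from plain Java. This is only a sketch under the assumption that the driver class is org.voltdb.jdbc.Driver and the URL form is jdbc:voltdb://host:21212 (the default client port); confirm both against the VoltDB JDBC documentation for your version.

VoltDbHistoryRead.java

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VoltDbHistoryRead {
    public static void main(String[] args) throws Exception {
        // voltdbclient-3.5.0.1.jar (plus its dependencies, e.g. guava) must be on the classpath.
        Class.forName("org.voltdb.jdbc.Driver");
        Connection connection = DriverManager.getConnection("jdbc:voltdb://localhost:21212");
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery("select * from orders");
        while (rs.next()) {
            System.out.println(rs.getInt("order_id") + ", " + rs.getString("product_category"));
        }
        rs.close();
        statement.close();
        connection.close();
    }
}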
Create a new job
"VoltDBToVerticaHistorySync",
drag and drop the tJDBCInput
component, and configure it:
To add the driver
jars, click the '+' button, then click the button at the right end of the
added row to browse.
Select
'External modules', browse to
"VOLTDB_HOME/voltdb/voltdbclient-3.5.0.1.jar" and click OK; this jar
contains the JDBC driver class for VoltDB. Also add "guava-12.0.1.jar" in the
same way, except that it is an internal module, so
rather than browsing we select it from the list.
Add tMap and
tVerticaOutput as in the previous job. Now run the job in Talend to sync the
one-time history data. We run this one in Talend because it is a one-time
process; the other job we will schedule to run at a regular
interval.
Run the job in Talend:
Also do this to
test the previous job.
Schedule the
VoltDBToVerticaSync
job using crontab:
Export the
job by right-clicking on it in the Repository section as shown below:
Define the export settings (the defaults are sufficient; just specify or note the export path) and click Finish.
The
archive will be created in TALEND_HOME with the job's name. Extract it
to any suitable location. There will be a VoltDBToVerticaSync_run.sh script
in the VoltDBToVerticaSync
folder. This script executes the job; make an entry for
it in crontab to
schedule it.
Open a
terminal, execute the command crontab -e and
make an entry in the vi editor:
*/15 * * * * <path to VoltDBToVerticaSync_run.sh>
to run at minutes 0, 15, 30 and 45 of every hour, or
5,20,35,50 * * * * <path to
VoltDBToVerticaSync_run.sh>
for a specific list of minutes.
Then save it using the ":wq" command. Done, the job is
scheduled!
Now,
every time an order is saved using the SaveOrder procedure in VoltDB,
it is also queued for export,
and the export client fetches it and writes it to a file every 15
minutes. The Talend job then processes the file and writes the data to Vertica.
Summary
We
created a demonstration of a complete database ecosystem for an
application, in which business data is stored in an OLTP system, an
OLAP system stores the data for historical analysis, and an ETL syncs
live data from the OLTP to the OLAP system.
For
this we used best-in-breed, purpose-built NewSQL DBMS
technologies.
That's
it! Feel free to post any doubts or issues faced while working through
this tutorial, or anything regarding the technologies used.