
Saturday, September 2, 2017

User Defined Function (UDF) : HP Vertica

A UDF combines repeated (or complex) SQL or other tasks into one unit that can be reused. Let's explore this feature in Vertica.
There are five types of UDx (User Defined extensions):

  • UD SQL function: Uses built-in functions, arguments, and operators in a single SQL expression and returns the result of that expression (see the JDBC sketch below).
  • UD Scalar function: Takes one row as input and produces one row as output. Can be written in C++ and R.
  • UD Aggregate function: Takes one column of data as input and produces one column as output. Can be written in C++.
  • UD Analytic function: Calculates one output row from multiple input rows. Unlike a transform function, it takes input rows one by one, so it can read multiple rows before returning one output row. Can be written in C++.
  • UD Transform function: Takes multiple rows as input and outputs multiple rows. Can be written in C++ and R.
Fenced mode: Vertica executes UDx code in a separate process so that it does not consume significant resources of the main Vertica process.
All functions written in C++ can run in fenced or unfenced mode, but functions written in R always run in fenced mode.
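
As an illustration of the simplest type, the UD SQL function, here is a minimal sketch that creates and calls one over JDBC. This is only a sketch: the connection URL, credentials, and function name are placeholders, and the CREATE FUNCTION form should be checked against the Vertica documentation for your version.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerticaUdfDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust host, database, user and password.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://localhost:5433/testdb", "dbadmin", "password");
             Statement stmt = conn.createStatement()) {

            // A UD SQL function wraps a single SQL expression into a reusable unit.
            stmt.execute("CREATE OR REPLACE FUNCTION zero_if_null(x INT) RETURN INT "
                    + "AS BEGIN RETURN (CASE WHEN x IS NOT NULL THEN x ELSE 0 END); END;");

            // Once created, it can be used like any built-in function.
            try (ResultSet rs = stmt.executeQuery("SELECT zero_if_null(NULL::INT)")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1)); // prints 0
                }
            }
        }
    }
}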

To install and set up Vertica, see my previous post. A single-node installation of Vertica on a developer machine is recommended for testing UDx.



Create Web Service SOAP request from WSDL and call service

In general, we generate stubs or use an XML builder to create a SOAP request. That works for web services we already know. Here, I describe creating a SOAP request (a sample template) from any WSDL, the way SoapUI does. After generating this request XML, we can prompt the user to fill in the request data or, if the values are known, pass them in and call the service. This can be useful for building a SoapUI-like tool embedded in a web application or in an integration solution.

Let's start with the first scenario:

  1. Create a request template.
  2. Generate an HTML form from the request XML to collect the values to be passed.
  3. Generate the request from the collected values and call the service.
In the second scenario, we generate the request from known values of all parameters; no user interaction is required.

To do this I will use Membrane SOA (an Apache 2 licensed product) and develop the web application using Vert.x 2 and jQuery. Vert.x is a Node.js-like framework for the JVM; a sketch of the template-generation step follows.
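
For the template-generation step, a rough sketch using the Membrane soa-model classes is below. The class names and the createRequest(portType, operation, binding) call follow the soa-model samples but should be verified against the library version you use; the WSDL URL and the port type, operation, and binding names are placeholders.

import com.predic8.wsdl.Definitions;
import com.predic8.wsdl.WSDLParser;
import com.predic8.wstool.creator.RequestTemplateCreator;
import com.predic8.wstool.creator.SOARequestCreator;
import groovy.xml.MarkupBuilder;
import java.io.StringWriter;

public class RequestTemplateDemo {
    public static void main(String[] args) {
        // Parse the WSDL (placeholder URL).
        WSDLParser parser = new WSDLParser();
        Definitions wsdl = parser.parse("http://example.com/service?wsdl");

        // Build a sample SOAP request with placeholders for the values, much like SoapUI does.
        StringWriter writer = new StringWriter();
        SOARequestCreator creator =
                new SOARequestCreator(wsdl, new RequestTemplateCreator(), new MarkupBuilder(writer));
        creator.createRequest("MyPortType", "myOperation", "MyBinding"); // placeholder names

        // The generated request XML can be shown to the user or filled programmatically.
        System.out.println(writer);
    }
}

The generated XML is then either rendered as an HTML form (scenario 1) or filled directly with known values and posted to the service endpoint (scenario 2).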



My First Predictive Modeling With R : Using Caret Package


The caret package was developed to:

  • create a unified interface for modeling and prediction (interfaces to 147 models)
  • streamline model tuning using resampling
  • provide a variety of “helper” functions and classes for day-to-day model building tasks
  • increase computational efficiency using parallel processing

Tuesday, January 28, 2014

Mustache Template in Talend ETL

This blog post is about using Mustache templates in ETL/ELT to generate rich textual output such as HTML, JSON, or XML from data in the ETL flow.

First, download and install Talend if you don't have it already.

Get the tMustache component from here. Install the component as described in Step #1 of this tutorial. You also need to add the mustache module in TOS, similar to the Elasticsearch component.

Get sample jobs for TOS v5.0.3 and v5.3.1 from here and import them into Talend to understand how the component is used.

Configuration of tMustache component:


In the schema, add all the columns in the flow to both sides, as highlighted in the red rectangle. This makes the row data available to the next components in the flow. If the row must contain only the output column, as with the Elasticsearch component, use a tFilterColumns component right after tMustache.

Add one extra column, e.g. newColumn (highlighted in the green rectangle), to hold the output of the template execution. This column will be empty in the incoming connection and will be written by this component.
Select this column in the "Output Column" parameter.

In "Mustache Template" put any valid Mustache template. All variable names must be same as column labels in schema.

Try the sample jobs and let me know if there is any issue.

Wednesday, January 15, 2014

Talend Elasticsearch Indexing Tutorial

We are going to index data into an existing Elasticsearch server as part of an ETL/ELT flow.

Prerequisites:
Make sure the Elasticsearch server is running if you are testing on a local machine. In real deployments it is generally already running and indexing continuously. For local testing you can simply unzip Elasticsearch and run the elasticsearch sh/bat script.

NOTE:
Elasticsearch version notes 

[Skip this note if using an Elasticsearch version below 2.0.0]:
Before adding the component to a Job, add the dependent libraries in the Modules view of Talend Studio so the component can detect the ES version.
ES 2.x and above needs more libraries (as shown in the image below). Load them using a tLibraryLoad component or via the Java Routine "Edit Routine Library" option (right-click any routine used in the Job), as shown in the image.

Additional libraries for ES 2.x and above


[Skip this note if using an Elasticsearch version below 1.2.0]:
Elasticsearch 1.2.0 and above requires JDK 7, so make sure you have JDK 7 installed and that Talend Studio is using it.
Talend Studio JDK

See 1.2.0 release notes.

You also need to use Talend Studio version 5.2.1+.

Step #0: Set up Elasticsearch: Do any configuration you need and create the indexes, types, and mappings you need. In development you generally don't have to do anything: as the ETL job starts making requests, Elasticsearch adds this metadata with defaults, so you can simply import the jobs into Talend and run them.
In production, however, the defaults are not sensible and you typically need to specify mappings. In that case create all of them explicitly. For example, to control how a particular field is analyzed or stored:

{
    "tweet" : {
        "properties" : {
            "message" : {"type" : "string", "store" : "yes"},
            "sender" : {"type" : "string", "store" : "yes",
                        "index":"not analyzed"}
        }
    }
}

You can do this as a one-time step before running the ETL, or even within the ETL by sending an HTTP request using the HTTP/REST components in Talend; a minimal client-side sketch follows. Let me know if a separate component for this would be useful.
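
As a minimal client-side sketch, the mapping above can be PUT over the REST API with plain HttpURLConnection before the job runs. The host, port, index name (twitter) and type name (tweet) are placeholders, and the /{index}/_mapping/{type} URL form is the pre-2.x mapping endpoint.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PutMappingDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, index and type.
        URL url = new URL("http://localhost:9200/twitter/_mapping/tweet");

        // Same mapping as shown above, as a JSON string.
        String mapping = "{ \"tweet\": { \"properties\": {"
                + " \"message\": {\"type\": \"string\", \"store\": \"yes\"},"
                + " \"sender\": {\"type\": \"string\", \"store\": \"yes\", \"index\": \"not_analyzed\"}"
                + " } } }";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(mapping.getBytes("UTF-8"));
        }

        // 200 OK (or 201) means the mapping was accepted.
        System.out.println("HTTP " + conn.getResponseCode());
    }
}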

Once Elasticsearch is set up and running, we can import the sample job into Talend to explore how this component can be used.

Step #1: Download the job from here and import it into Talend. This component can be placed in an ETL flow to index data.
1.1: Get the Elasticsearch extension from Talend Exchange

How to install Talend extension/component from Talend Exchange

  • Click "Exchange" link.
  • Search for "elasticsearch" in Available Extension.
  • Check tElasticsearcIndex component appears in pallate under section fulltextsearch. If not present try Ctrl+Shift+F3 to reload pallate.
NOTE: at present talendforge has some issues in adding new revision, so I am not able to add this revision to talendforge. If your cluster name is not "elasticsearch" (the default) then you need to specify cluster name in component. This input is added in laterst version of component and hosted on Github. Latest version 1.0.3 uploaded to Talend Exchange.

1.2: Import the job into Talend Studio. After importing you may get a compilation error in Talend version 5.3 or higher. To resolve this, go to the "Modules" view and add two jars, elasticsearch and lucene-core. Copy these two jars from your Elasticsearch installation's lib folder into some folder and rename them to remove the version (making them elasticsearch.jar and lucene-core.jar).

Look for elasticsearch.jar and lucene-core.jar and add them as shown in the screenshots:


Browse to the directory you copied the jars into and add the two jars for the two modules:
 


Do the same for lucene-core.jar.

1.3: Job flow overview:

This job will:
  • Generate 100 rows.
  • Transform them.
  • Add a nested document.
  • Index documents.
Generate Rows:

Generate 100 random rows with one key column of type int (randomly generated between 0 and 100) and four randomly generated string columns.

Transform them:
Key1 is of type Object and a String array is passed to it (a nested document). The next section covers nested documents in more detail.

Reform for complex nested documents:
If you need more than two levels of nesting, pass all the data from tMap as a single flat document, like:
key:value
child.key=value1
child.child.key=value2
.....
And in tJavaRow restructure them into a nested Map:


//Code generated according to input schema and output schema
output_row.id = input_row.id;
output_row.key1 = input_row.key1;
output_row.key2 = input_row.key2;
output_row.key3 = input_row.key3;
output_row.key4 = input_row.key4;
output_row.key5 = input_row.key5;

Map<String, Object> nested = new HashMap<String, Object>();
nested.put("child1", "val1");
nested.put("child2", "val2");
Map<String, String> nestedChild = new HashMap<String, String>();
nestedChild.put("nestedChild1", "val1");
nestedChild.put("nestedChild2", "val2");
nested.put("child3", nestedChild);
output_row.key6 = nested;


Add the java.util.Map and java.util.HashMap imports in the advanced settings of tJavaRow.
Note: Because all flows in Talend are flat rows driven by the schema, nested structures have to be handled separately; a generic version of this restructuring is sketched below.
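
A generic version of that restructuring might look like the hypothetical helper below, which turns dotted keys (key, child.key, child.child.key) into nested Maps. The class and method names are made up for illustration; in a tJavaRow you would call something like this on the flat columns before assigning the result to the Object column.

import java.util.LinkedHashMap;
import java.util.Map;

public class NestedDocBuilder {

    // Hypothetical helper: turn flat dotted keys into nested Maps.
    @SuppressWarnings("unchecked")
    public static Map<String, Object> nest(Map<String, String> flat) {
        Map<String, Object> root = new LinkedHashMap<String, Object>();
        for (Map.Entry<String, String> entry : flat.entrySet()) {
            String[] path = entry.getKey().split("\\.");
            Map<String, Object> current = root;
            for (int i = 0; i < path.length - 1; i++) {
                Object next = current.get(path[i]);
                if (!(next instanceof Map)) {
                    next = new LinkedHashMap<String, Object>();
                    current.put(path[i], next);
                }
                current = (Map<String, Object>) next;
            }
            current.put(path[path.length - 1], entry.getValue());
        }
        return root;
    }

    public static void main(String[] args) {
        Map<String, String> flat = new LinkedHashMap<String, String>();
        flat.put("key", "value");
        flat.put("child.key", "value1");
        flat.put("child.child.key", "value2");
        // Prints {key=value, child={key=value1, child={key=value2}}}
        System.out.println(nest(flat));
    }
}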

Alternatively, the tMustache component described earlier can be used to build the nested document.
Index data:
Configure the Elasticsearch extension, for example:

The column names on the right side will be the keys in the indexed JSON document. For nested documents you can use the tMustache component instead of tJavaRow. Add the configuration parameters:
Host: Elasticsearch server host.
Port: Elasticsearch server port. 9300 is the default transport port used by the Java client; 9200 is the REST API port.
Index Name: The equivalent of a database in the SQL world.
Collection Name: The equivalent of a table in a database.
Cluster Name: Name of the cluster this host belongs to. Default is "elasticsearch". To check your cluster name:
Hit http://<HOST>:<REST PORT>/_cluster/health?pretty=true in a browser and read the "cluster_name" value, if the REST API (HTTP server) is enabled.
Or check "cluster.name" in the config/elasticsearch.yml file of your Elasticsearch installation. A rough sketch of what the component does with these settings follows.

Note: There is an additional level of hierarchy, the "mapping", under a table (type). Since documents do not have a strict schema, they are grouped by mapping. If a new document structure is encountered during indexing, a new mapping is created with default settings.
It is better to know all the possible document schemas and create the mappings you need in advance. Also, if documents with the same structure use different data types for a field, you will see odd behavior.

Run the job and view the indexed documents in the "Sense" Google Chrome plugin.

See Indexed Documents:
It shows there are 61 documents: we generated 100 random numbers and designated that column as the key, so duplicate keys were updated (upsert). Note that "key6" is nested.

Note: As this plugin is no longer available, hit http://<HOST>:<REST PORT>/<INDEX_NAME>/<COLLECTION_NAME>/_search instead.

That's it! Let me know if there is any issue.

Count (*) : Cassandra Data Modeling

How to model Cassandra for count queries?

Consider the same "playlist" database from the previous post.

We need the following queries on the tracks table:
  • Count the number of tracks. SQL: SELECT COUNT(*) FROM tracks;
  • Count the number of tracks for a particular genre. SQL: SELECT COUNT(*) FROM tracks WHERE genre=?;
Step #1:
Create table to store counts:
CREATE TABLE track_count (
   count counter,
   table_name text,
   genre text,
   PRIMARY KEY(table_name, genre)
);

This table can be used to store counts for all tables, hence the table_name column. If we need a count for only one table, this column can be removed. And if grouping is not required, or there are many distinct groups (so that implementing count-all would require reading all the records into memory), remove the grouping column (genre). Then, for a single-table count-all query, there will be only one cell in this table.

Generalized table to store counts for all tables in database:
CREATE TABLE track_count (
   count counter,
   table_name text,
   <grouping columns>,
   PRIMARY KEY(table_name, <grouping columns>...)
);
 

Step #2:
Update count on adding track:
INSERT INTO tracks (...) VALUES (...);
UPDATE track_count SET count=count+1 WHERE table_name='tracks' AND genre=?;

On every insert of a track, the count for its genre is incremented. Note that the counter update targets track_count and is issued separately from the insert, since counter updates cannot be mixed with regular statements in a single batch. As the number of genres is small, it is fine to read all the counter rows into memory and sum the counts to get the total number of records.

Step #3:
CQL queries modeling:

[SQL] : SELECT COUNT(*) FROM tracks;
[CQL] : SELECT count FROM track_count WHERE table_name='tracks';
Sum count in application code.

[SQL] : SELECT COUNT(*) FROM tracks WHERE genre=?;
[CQL] : SELECT count FROM track_count WHERE table_name='tracks' AND genre=?;

[SQL] : SELECT COUNT(*) FROM tracks GROUP BY genre;
[CQL] : SELECT count FROM track_count WHERE table_name='tracks';
 
In SQL we query the same table for counts; in CQL we create separate count table(s) and query those instead. A minimal sketch of summing the counts in application code follows.
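
To make "sum count in application code" concrete, here is a minimal sketch using the DataStax Java driver, assuming a playlist keyspace and the track_count table above; the contact point and keyspace name are placeholders.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TrackCountDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("playlist"); // placeholder keyspace

        // One counter row per genre; summing them gives the SQL SELECT COUNT(*) FROM tracks.
        ResultSet rs = session.execute(
                "SELECT genre, count FROM track_count WHERE table_name='tracks'");
        long total = 0;
        for (Row row : rs) {
            // Per-genre count, i.e. SELECT COUNT(*) ... GROUP BY genre.
            System.out.println(row.getString("genre") + " -> " + row.getLong("count"));
            total += row.getLong("count");
        }
        System.out.println("Total tracks: " + total);

        cluster.close();
    }
}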
