Streaming ETL is the newest technical challenge from two perspectives: ‘Big data acquisition from disparate applications across multiple tenants at the same time’ and ‘organizing the streams and computing the aggregations on Hadoop / an Analytics Datastore’.

Since Hadoop is meant to be a batch-processing, shared-nothing ETL tool, we need to keep it current through real-time streaming technologies like SQLStream.

Functional principle

One of the most important features of a Data Stream Management System (DSMS) is its ability to handle potentially infinite and rapidly changing data streams while still offering flexible processing, even though resources such as main memory are limited. The following table lists the principles of a DSMS and compares them to those of a traditional DBMS.

Database management system (DBMS)           | Data stream management system (DSMS)
Persistent data (relations)                 | Volatile data streams
Random access                               | Sequential access
One-time queries                            | Continuous queries
(Theoretically) unlimited secondary storage | Limited main memory
Only the current state is relevant          | Consideration of the order of the input
Relatively low update rate                  | Potentially extremely high update rate
Little or no time requirements              | Real-time requirements
Assumes exact data                          | Assumes outdated/inaccurate data
Plannable query processing                  | Variable data arrival and data characteristics

Ref : http://en.wikipedia.org/wiki/Data-stream_management_system
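
To make the "limited main memory" row of the table concrete, here is a minimal Python sketch of a sliding-window average over a stream that may never end. The function name sliding_average and the window size are illustrative assumptions and are not taken from any DSMS product; the point is only that memory use stays bounded by the window, no matter how many values arrive.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_average(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the average of the most recent `window_size` values after each arrival."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque(maxlen=window_size)  # bounded buffer: memory never grows past the window
    running_sum = 0.0
    for value in stream:
        if len(window) == window_size:
            running_sum -= window[0]  # this oldest value is evicted by the append below
        window.append(value)
        running_sum += value
        yield running_sum / len(window)


averages = list(sliding_average([1, 2, 3, 4], window_size=2))
```

Because the function is a generator, it can be attached to an unbounded source just as well as to the short list used here.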

SQLStream : Ref : Making the Elephant Fly

The most important concept for Streaming SQL is the stream. A stream is a continually updating data object. A stream is like a table with no end, but which does have a beginning (when the stream was established). The number of records in a stream can be infinite.

A stream is a schema object that is a relation but does not store data the way a finite relation (such as a table in a database) does. Instead, a stream implements a “publish-subscribe” protocol. It can be written to by multiple writers and read from by multiple readers.
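
The publish-subscribe behaviour described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the Stream class, its subscribe and publish methods, and the sample rows are all hypothetical and do not represent the SQLStream API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Stream:
    """Hypothetical in-memory stream: rows are pushed to readers, never stored."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Row], None]] = []

    def subscribe(self, callback: Callable[[Row], None]) -> None:
        """Register a reader that is called for every row published afterwards."""
        self._subscribers.append(callback)

    def publish(self, row: Row) -> None:
        """Deliver one row to every current reader."""
        for callback in self._subscribers:
            callback(row)


orders = Stream()
dashboard: List[Row] = []
audit_log: List[Row] = []
orders.subscribe(dashboard.append)  # reader 1
orders.subscribe(audit_log.append)  # reader 2

# Two independent writers publish into the same stream.
orders.publish({"writer": "app-1", "amount": 10})
orders.publish({"writer": "app-2", "amount": 25})
```

Both readers see both rows, while the stream object itself keeps no copy of them.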

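A conventional SQL application prepares and executes a statement with a SELECT… query and iterates through the returned result set until end of fetch is detected, that is, when there are no more rows to return. The application then returns to doing something else. A query against a stream, by contrast, has no natural end of fetch: it keeps returning rows for as long as the stream keeps producing them.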

Ref : http://www.sqlstream.com/docs/index.html?qs_stream_and_view.html
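
The difference between a one-time query and a continuous query can be illustrated with a rough Python sketch. The first function uses the standard sqlite3 module to show the conventional fetch-until-done pattern; the second is a plain generator standing in for a streaming query. The trades table, the ticks source and the price threshold are made-up examples, and the generator is an analogy, not SQLStream syntax.

```python
import itertools
import sqlite3
from typing import Iterable, Iterator, List, Tuple

Trade = Tuple[str, float]  # (symbol, price); hypothetical record shape


def one_time_query() -> List[Trade]:
    """Conventional SQL: run a SELECT and fetch until the result set ends."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO trades VALUES (?, ?)",
        [("ABC", 10.0), ("XYZ", 99.0), ("ABC", 12.0)],
    )
    rows = conn.execute(
        "SELECT symbol, price FROM trades WHERE price > 11 ORDER BY price"
    ).fetchall()
    conn.close()
    return rows


def continuous_query(stream: Iterable[Trade], min_price: float) -> Iterator[Trade]:
    """Streaming analogue: the filter runs on each row as it arrives and never
    reaches an end of fetch unless the source itself stops."""
    for symbol, price in stream:
        if price > min_price:
            yield symbol, price


def ticks() -> Iterator[Trade]:
    """Endless made-up source of trades with a steadily rising price."""
    for i in itertools.count():
        yield "ABC", 10.0 + i


finite_result = one_time_query()
# The consumer decides when to stop reading; the query itself would run forever.
first_three = list(itertools.islice(continuous_query(ticks(), 11), 3))
```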

SQLStream (SQL repurposed as a streaming API) offers a very impressive real-time, continuous data transformation and load flow.

An alternative flow is ‘Python Nginx ETL processes running in parallel’.

Business Use case :

Extraction of business data through Boomi (a great ‘Business Data Integration’ enabler) / any other Data Source Emits Streams =>  Transform streams fed to  (reduces the result in smaller chunks) => Load into HBase / Columnar Datastore => Query HBase / Hadapt / Columnar Datastore.

The following slide from SQLStream explains the above-mentioned flow (courtesy: SQLStream.com):

Note how the real-time alerts are channeled to Mobiles/Tablets/Long polling Browser Clients from the SQLStream.  SQLStream helps to keep Hadoop continuous and up-to-date which otherwise suffers from data aggregation lags.
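
As a rough illustration of the transform, load and alert steps in this flow, the Python sketch below chains two small functions. Everything in it is an assumption made for the example: the (tenant, amount) event shape, the chunk size, the threshold, and the plain dict that stands in for HBase or another columnar datastore.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, float]  # (tenant, amount); hypothetical record shape


def transform(events: Iterable[Event], chunk_size: int) -> Iterator[Dict[str, float]]:
    """Reduce every `chunk_size` events into one per-tenant summary."""
    chunk: Dict[str, float] = {}
    count = 0
    for tenant, amount in events:
        chunk[tenant] = chunk.get(tenant, 0.0) + amount
        count += 1
        if count == chunk_size:
            yield chunk
            chunk, count = {}, 0
    if chunk:
        yield chunk  # flush the final partial chunk


def load(summaries: Iterable[Dict[str, float]], store: Dict[str, float],
         alerts: List[str], threshold: float) -> None:
    """Merge each summary into the store; record an alert whenever a tenant's
    running total is above the threshold after a merge."""
    for summary in summaries:
        for tenant, subtotal in summary.items():
            store[tenant] = store.get(tenant, 0.0) + subtotal
            if store[tenant] > threshold:
                alerts.append(f"{tenant} exceeded {threshold}")


events = [("t1", 40.0), ("t2", 10.0), ("t1", 70.0), ("t2", 5.0), ("t1", 1.0)]
store: Dict[str, float] = {}  # stand-in for the datastore
alerts: List[str] = []        # stand-in for pushes to mobile or browser clients
load(transform(events, chunk_size=2), store, alerts, threshold=100.0)
```

The store is updated as each small chunk arrives, so a query against it sees totals that lag by at most one chunk rather than by a full batch run.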

There are various approaches for collecting streams, organizing the streams into ordered sets/graphs and persisting them. S4, Storm, Kafka, the Mongo streaming API and the python-etl module are the major streaming solutions.

But at the end of the day, we need a Data Flow Language in order to depict an ETL dataflow. That’s where SQL still outperforms others as the most credible dataflow language!

There is a huge benefit in using SQLStream over Hadoop batching:

  • Replace tuple by tuple instead of file by file
  • Declarative automatic optimization
  • Fine-grained parallelism (pipelined and superscalar)
  • No Hadoop data-lag due to data change and aggregation delays.
  • Both historical and continuous data analysis

courtesy : SQLStream.com
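
The "tuple by tuple instead of file by file" point in the list above can be sketched in Python as follows. The function names and the (tenant, amount) rows are hypothetical; the sketch only contrasts when results become available and makes no claim about how SQLStream or Hadoop are implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, float]  # (tenant, amount); hypothetical record shape


def batch_totals(file_of_rows: Iterable[Row]) -> Dict[str, float]:
    """File by file: no total is available until the whole input has been read."""
    totals: Dict[str, float] = defaultdict(float)
    for tenant, amount in file_of_rows:
        totals[tenant] += amount
    return dict(totals)


def streaming_totals(stream: Iterable[Row]) -> Iterator[Row]:
    """Tuple by tuple: emit the updated running total as soon as each row arrives."""
    totals: Dict[str, float] = defaultdict(float)
    for tenant, amount in stream:
        totals[tenant] += amount
        yield tenant, totals[tenant]


rows = [("t1", 5.0), ("t2", 3.0), ("t1", 2.0)]
final = batch_totals(rows)
updates = list(streaming_totals(rows))
```

The final totals agree in both styles; the streaming version simply exposes every intermediate state along the way, which is what removes the aggregation lag.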

None of the HBase query APIs (HBQL, JasperForge, Hive) has been designed to perform fast BI analytics for enterprise business applications.

If we can build a SQL query module on top of Hadoop (an open-source alternative to Hadapt, SQL for parallel DBs), then the complete ETL-MR-QUERY flow can be easily managed by any business user through simple SQL manager tools.

Ref:  http://www.slideshare.net/sqlstream/realtime-streaming-big-data-with-relational-streaming
