Streaming ETL is one of the newest technical challenges in big data. It involves two things: acquiring data from disparate applications across multiple tenants at the same time, and organizing the streams and computing aggregations on Hadoop or an analytics datastore.
Since Hadoop is designed as a batch-processing, shared-nothing platform, we need real-time streaming technologies such as SQLStream to keep it current.
One of the most important features of a Data Stream Management System (DSMS) is its ability to handle potentially infinite, rapidly changing data streams with flexible processing, even though resources such as main memory are limited. The following table lists the main principles of a DSMS and compares them with a traditional DBMS.
|Database management system (DBMS)|Data stream management system (DSMS)|
|Persistent data (relations)|Volatile data streams|
|Random access|Sequential access|
|One-time queries|Continuous queries|
|(Theoretically) unlimited secondary storage|Limited main memory|
|Only the current state is relevant|Consideration of the order of the input|
|Relatively low update rate|Potentially extremely high update rate|
|Little or no time requirements|Real-time requirements|
|Assumes exact data|Assumes outdated/inaccurate data|
|Plannable query processing|Variable data arrival and data characteristics|
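To make the "continuous query over limited main memory" rows concrete, here is a minimal Python sketch of a sliding-window average. It is only an illustration: the function name `windowed_average` and the sample readings are assumptions, not part of any DSMS product.

```python
from collections import deque
from typing import Iterable, Iterator


def windowed_average(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Continuously emit the average of the most recent `window_size` values."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque(maxlen=window_size)  # bounded memory: only the window is kept
    running_sum = 0.0
    for value in stream:  # sequential access, a single pass over the input
        if len(window) == window_size:
            running_sum -= window[0]  # this value is evicted by the append below
        window.append(value)
        running_sum += value
        yield running_sum / len(window)  # one result per arriving value


if __name__ == "__main__":
    readings = [10, 20, 30, 40, 50]  # hypothetical sensor readings
    print(list(windowed_average(readings, window_size=3)))
```

The input could just as well be an endless generator: memory use stays bounded by `window_size`, and a fresh result is produced for every arriving value instead of once per query.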
SQLStream (Ref: Making the Elephant Fly)
The most important concept for Streaming SQL is the stream. A stream is a continually updating data object. A stream is like a table with no end, but which does have a beginning (when the stream was established). The number of records in a stream can be infinite.
A stream is a schema object that is a relation, but unlike a finite relation (such as a table in a database) it does not store data. Instead, a stream implements a “publish-subscribe” protocol. It can be written to by multiple writers and read from by multiple readers.
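The publish-subscribe idea can be sketched in a few lines of Python. This is a toy in-memory model, not SQLStream's API; the class `Stream` and its `subscribe`/`publish` methods are hypothetical names chosen for illustration.

```python
from typing import Callable, List


class Stream:
    """Toy stream: rows are pushed to current subscribers and never stored."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers: List[Callable[[dict], None]] = []

    def subscribe(self, callback: Callable[[dict], None]) -> None:
        """Register a reader; it only sees rows published from now on."""
        self._subscribers.append(callback)

    def publish(self, row: dict) -> None:
        """Any writer may call this; every current reader receives the row."""
        for callback in self._subscribers:
            callback(row)


orders = Stream("orders")
dashboard_rows: List[dict] = []
audit_rows: List[dict] = []
orders.subscribe(dashboard_rows.append)  # reader 1
orders.subscribe(audit_rows.append)      # reader 2

orders.publish({"tenant": "a", "amount": 10})  # writer 1
orders.publish({"tenant": "b", "amount": 25})  # writer 2

late_rows: List[dict] = []
orders.subscribe(late_rows.append)  # joins late: earlier rows are gone
```

Both early readers receive both rows, while the late reader receives nothing, because the stream itself holds no data.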
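A conventional SQL application prepares and executes a statement with a SELECT… query and iterates through the returned result set until end of fetch is detected, when there are no more rows to return. The application then returns to doing something else. A streaming application, by contrast, runs a continuous query: end of fetch is never reached, and rows keep arriving for as long as the stream is open.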
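The difference between the two fetch loops can be sketched in Python. The one-time query uses the standard `sqlite3` module; the continuous query is modeled with a generator over an endless source. The table, the `order_source` generator and the filter are all made up for illustration, and this is not SQLStream syntax.

```python
import itertools
import sqlite3


def one_time_query() -> list:
    """Conventional SQL: iterate the result set until end of fetch, then stop."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 5), (2, 50), (3, 500)])
    cursor = conn.execute("SELECT id, amount FROM orders WHERE amount > 10 ORDER BY id")
    rows = [row for row in cursor]  # the loop ends when no rows remain
    conn.close()
    return rows


def order_source():
    """Hypothetical unbounded source: yields (id, amount) forever."""
    for order_id in itertools.count(1):
        yield (order_id, order_id * 10)


def continuous_query(source):
    """Same filter as above, but over a stream: there is no end of fetch."""
    for order_id, amount in source:
        if amount > 10:
            yield (order_id, amount)


if __name__ == "__main__":
    print(one_time_query())
    # Stop after three rows only so the demo terminates.
    print(list(itertools.islice(continuous_query(order_source()), 3)))
```

The only reason the second loop ends is the `islice` cut-off; left alone, the reader would keep receiving rows as long as the source produces them.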
SQLStream (SQL repurposed as a streaming API) offers a real-time, continuous data transformation and load flow.
An alternative flow is ‘Python Nginx ETL processes running in parallel’.
Business use case:
Extract business data through Boomi (a great ‘Business Data Integration’ enabler) or any other data source that emits streams => Transform streams fed to (reduces the result in smaller chunks) => Load into HBase / columnar datastore => Query HBase / Hadapt / columnar datastore.
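The extract => transform => load => query flow above can be sketched with Python generators. Everything here is an assumption made for illustration: the `(tenant, metric, value)` record shape, the function names, and the plain dict that stands in for HBase or a columnar datastore.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, str, float]  # hypothetical record: (tenant, metric, value)
Key = Tuple[str, str]


def extract(events: Iterable[Event]) -> Iterator[Event]:
    """Stand-in for a data source that emits a stream of events."""
    yield from events


def transform(events: Iterable[Event], chunk_size: int) -> Iterator[Dict[Key, float]]:
    """Reduce every `chunk_size` events to per-(tenant, metric) partial sums."""
    partial: Dict[Key, float] = defaultdict(float)
    count = 0
    for tenant, metric, value in events:
        partial[(tenant, metric)] += value
        count += 1
        if count == chunk_size:
            yield dict(partial)  # emit a small, already reduced chunk
            partial.clear()
            count = 0
    if partial:
        yield dict(partial)  # flush the final, possibly shorter chunk


def load(chunks: Iterable[Dict[Key, float]], store: Dict[Key, float]) -> None:
    """Merge each reduced chunk into the store (a dict standing in for HBase)."""
    for chunk in chunks:
        for key, subtotal in chunk.items():
            store[key] = store.get(key, 0.0) + subtotal


def query(store: Dict[Key, float], tenant: str) -> Dict[str, float]:
    """Read back the aggregates for one tenant."""
    return {metric: total for (t, metric), total in store.items() if t == tenant}


sample_events = [
    ("acme", "sales", 100.0), ("acme", "sales", 50.0), ("globex", "sales", 70.0),
    ("acme", "refunds", 5.0), ("globex", "sales", 30.0),
]
datastore: Dict[Key, float] = {}
load(transform(extract(sample_events), chunk_size=2), datastore)
```

After the run, `query(datastore, "acme")` returns the sales and refunds totals for that tenant. Because every stage is a generator, events flow through one at a time and the store is updated chunk by chunk rather than after a whole batch.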
The following slide from SQLStream explains the above-mentioned flow (courtesy: SQLStream.com).
Note how real-time alerts are channeled from SQLStream to mobiles, tablets and long-polling browser clients. SQLStream helps keep Hadoop continuously up to date; without it, Hadoop suffers from data aggregation lags.
There are various approaches to collecting streams, organizing them into ordered sets or graphs, and persisting them. S4, Storm, Kafka, the Mongo streaming API and the python-etl module are the major streaming solutions.
But at the end of the day, we need a data flow language in order to depict an ETL dataflow. That’s where SQL still outperforms the others as the most credible dataflow language!
There are big benefits to using SQLStream over Hadoop batching:
- Processing tuple by tuple instead of file by file
- Declarative, automatic optimization
- Fine-grained parallelism (pipelined and superscalar)
- No Hadoop data lag due to data change and aggregation delays
- Both historical and continuous data analysis
courtesy : SQLStream.com
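The "tuple by tuple instead of file by file" point can be illustrated with a small Python sketch. The function names and the sample values are assumptions; the `initial` parameter is a hypothetical way to seed the running result with a historical total, to hint at combining historical and continuous analysis.

```python
from typing import Iterable, Iterator, List


def batch_total(file_rows: List[float]) -> float:
    """File by file: the answer exists only after the whole batch has been read."""
    return sum(file_rows)


def streaming_totals(rows: Iterable[float], initial: float = 0.0) -> Iterator[float]:
    """Tuple by tuple: an updated answer is emitted after every single row."""
    total = initial  # may carry a total computed earlier from historical data
    for value in rows:
        total += value
        yield total


if __name__ == "__main__":
    rows = [5.0, 10.0, 20.0]  # hypothetical amounts
    print(batch_total(rows))             # one result, at the end
    print(list(streaming_totals(rows)))  # one result per row
```

Both approaches agree on the final figure, but the streaming version has a usable intermediate answer after every row, which is where the reduction in data lag comes from.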
None of the HBase query APIs (HBQL, JasperForge, Hive) has been designed to perform fast BI analytics for enterprise business applications.
- Adobe came up with ‘Saasbase Analytics’. http://www.slideshare.net/Hadoop_Summit/low-latancy-olap-with-hadoop-13386744
- Hadapt built an analytics query engine (a parallel Hadoop DB) on top of HDFS, which is way faster than HBase (http://engineering.linkedin.com/hadoop/recap-improving-hadoop-performance-1000x )
If we can build a SQL query module on top of Hadoop (an open-source alternative to Hadapt, i.e. SQL for a parallel DB), then the complete ETL-MR-QUERY flow can be managed easily by any business user through simple SQL manager tools.