The big data industry has largely solved the problem of ‘big data acquisition and persistence’ using daily ETL and batch analysis through the Hadoop ecosystem.

It’s time to think beyond batch processing and act in real time!

Now the maturing big data market is entering the new domain of ‘ad-hoc analytics on real-time streaming data’!

‘Dynamic data analysis’ is no longer driven by ‘traditional business intelligence’; it involves fast exploration of data patterns and the ability to run complex queries, whether approximate or exact!

Tremendous research work is underway to challenge conventional Hadoop wisdom and reimagine big data analytics by addressing the limitations of the current MapReduce framework!

Different projects address different limitations of Hadoop MR, some by augmenting and enhancing Hadoop and some by replacing Hadoop MR with entirely new ‘data analysis’ technology.

Though there is no one-stop shop for addressing all these challenges, Shark and Spark are the closest contenders, with the most promising technical enhancements over Hadoop MR’s limitations!

Need to preserve Data Locality 

Traditional Hadoop MR does not preserve ‘data locality’ during the map-to-reduce transition or between iterations. To pass data to the next job in an MR workflow, a job must write its output to HDFS, incurring communication overhead and extra processing time. The BSP (Bulk Synchronous Parallel) model was implemented (Pregel, Giraph, Hama) to solve this very important MR problem: Hama initiates peer-to-peer communication only when necessary, and peers focus on keeping locally processed data on local nodes.
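To make the BSP idea concrete, here is a minimal single-process sketch of the model (purely illustrative; the function names and structure are my own, not Hama’s API). Peers keep their partitions in local memory across supersteps, and messages are exchanged only at the barrier between supersteps:

```python
# Minimal sketch of the BSP model: peers keep data local across supersteps
# and send messages only when necessary, with a barrier between supersteps.

def bsp_run(partitions, superstep_fn, max_supersteps=10):
    """Run supersteps until no peer sends any message."""
    inboxes = [[] for _ in partitions]
    for _ in range(max_supersteps):
        outboxes = [[] for _ in partitions]
        for peer_id, local_data in enumerate(partitions):
            # superstep_fn may mutate local_data and post (dest_peer, msg) pairs
            for dest, msg in superstep_fn(peer_id, local_data, inboxes[peer_id]):
                outboxes[dest].append(msg)
        inboxes = outboxes                 # barrier: sends become visible next step
        if not any(inboxes):               # no traffic -> computation has converged
            break
    return partitions

# Toy use: each peer sums its local partition, then peer 0 aggregates.
def sum_step(peer_id, local, inbox):
    if "local_sum" not in local:
        local["local_sum"] = sum(local["values"])
        if peer_id != 0:
            yield (0, local["local_sum"])  # a single message, only when necessary
    elif peer_id == 0 and inbox:
        local["global_sum"] = local["local_sum"] + sum(inbox)

parts = [{"values": [1, 2]}, {"values": [3, 4]}, {"values": [5]}]
bsp_run(parts, sum_step)
print(parts[0]["global_sum"])  # 15
```

Note how, unlike an MR workflow, nothing is written back to a distributed filesystem between supersteps; only the small messages cross node boundaries.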

Need for Real-time processing and streaming ETL

Hadoop was purpose-built for ‘distributed batch processing’ using static input, output and processor configuration. But fast ad-hoc machine-learning queries require real-time distributed processing and real-time updates based on dynamically changing configuration, without requiring any code change. Saffron Memory Base is one of the coolest technologies offering real-time analytics on hybrid data.

The SQLstream Connector for Hadoop provides bi-directional, continuous integration with Hadoop HBase.

With SciDB, one can run a query the moment it occurs to the user. By contrast, Hadoop arguably imposes a huge burden of infrastructure setup, data preparation, map-reduce configuration and architectural coding. Both SciDB and SMB position themselves as complete replacements for Hadoop MR when it comes to complex data analysis.

Not built for complex analytic functions

Hadoop MR is not suitable for increasingly complex mathematical and graph functions like PageRank, BFS and matrix multiplication, which require repetitive MR jobs.
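To see why such algorithms punish MR, here is a toy PageRank written as repeated map/reduce-style passes (a hypothetical sketch, not any framework’s API). In Hadoop MR, every iteration of the loop below would be a separate job, writing its entire output to HDFS and reading it back:

```python
# Toy PageRank as repeated map/reduce passes. In Hadoop MR each loop
# iteration would be a full job with an HDFS round-trip in between.

from collections import defaultdict

def pagerank(links, iterations=50, d=0.85):
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):                 # one MR job per iteration
        contribs = defaultdict(float)
        for page, outs in links.items():        # "map": emit rank contributions
            for out in outs:
                contribs[out] += ranks[page] / len(outs)
        ranks = {page: (1 - d) / len(links) + d * contribs[page]
                 for page in links}             # "reduce": sum and damp
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))   # "c" receives links from both "a" and "b"
```

The per-iteration state here is tiny, yet a naive MR implementation would still serialize all of it to disk fifty times; that fixed overhead is what the iterative frameworks below attack.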

Many interesting research projects have spawned in recent times to fill this gap: BSP, Twister, HaLoop, RHadoop.

Need for rich Data Model and rich Query syntax

The existing MR query APIs have limited syntax for relational joins and group-bys. They do not directly support iteration or recursion in declarative form, and they cannot handle complex semi-structured nested scientific data. Here comes MRQL (MapReduce Query Language) to the rescue of Hadoop MR, supporting nested collections, trees, arbitrary query nesting, and user-defined types and functions.
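To appreciate what a declarative layer buys you, here is a toy reduce-side equi-join written out by hand in map/reduce style (an illustrative sketch). The tag-shuffle-regroup boilerplate below is precisely what a query language like MRQL, Hive or Pig generates from a one-line join:

```python
# Toy reduce-side equi-join in raw map/reduce style.

from collections import defaultdict
from itertools import product

users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

# map phase: tag each record with its source relation, keyed by join key
mapped = [(uid, ("U", name)) for uid, name in users] + \
         [(uid, ("O", item)) for uid, item in orders]

# shuffle phase: group all tagged records by key
groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

# reduce phase: per key, cross-product of the two tagged sides
joined = []
for key, records in groups.items():
    left  = [v for tag, v in records if tag == "U"]
    right = [v for tag, v in records if tag == "O"]
    joined.extend((key, l, r) for l, r in product(left, right))

print(sorted(joined))
# [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```

Every join, group-by or nesting level multiplies this kind of plumbing, which is why a rich declarative query syntax matters.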

Impala reads directly from HDFS and HBase data; it will add a columnar storage engine, a cost-based optimizer and other distinctly database-like features.

Need to optimize data flow and query execution

All the ‘big data analytics datastores’, both proprietary and open source, are trying their best to redefine ‘traditional Hadoop MR’.

Well, MapReduce simply does not make sense as an engine for querying!

So here comes Shark, a distributed in-memory MR framework with great speed of execution.

Enhance the speed and versatility of MR queries

Apache Hadoop is designed to achieve very high throughput, but not the sub-second latency needed for interactive data analysis and exploration. Here come Google Dremel and Apache Drill. Drill’s columnar query execution engine offers low-latency interactive reporting using DrQL, and the project encompasses a broad range of low-latency frameworks like the Mongo query language, Cascading and Plume. It adheres to the Hadoop philosophy of connecting to multiple storage systems, but broadens the scope and introduces enormous flexibility by supporting multiple query languages, data formats and data sources. Shark extends Hive syntax but employs very efficient column storage with an interactive multi-stage computing flow.
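The columnar idea these engines share can be shown in a few lines (an illustrative sketch, not any engine’s internals). A row store must touch every field of every row to aggregate one column, while a column store scans a single contiguous array, which is also far friendlier to the CPU cache:

```python
# Why columnar layout helps analytic queries: aggregate one field
# from a row layout vs. a column layout.

rows = [("2013-01-01", "US", 120),
        ("2013-01-02", "EU", 80),
        ("2013-01-03", "US", 200)]

# Row layout: the scan drags every field of every row through memory.
total_row_store = sum(r[2] for r in rows)

# Columnar layout: one array per column; the scan touches only "amount".
columns = {
    "date":   [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}
total_col_store = sum(columns["amount"])

print(total_row_store, total_col_store)  # both 400
```

On billions of rows with wide schemas, reading one column instead of all of them is a large constant-factor win before any other optimization kicks in.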

Ability to cache and reuse intermediate map outputs

HaLoop introduces recursive joins for efficient iterative data analysis by caching and reusing loop-invariant data across iterations, a significant improvement over conventional general-purpose MapReduce.
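The caching trick can be sketched in a few lines (a hypothetical illustration; the names below are mine, not HaLoop’s API). The expensive read of the loop-invariant input happens once, and every later iteration reuses the cached partition instead of re-scanning the same bytes:

```python
# Sketch of caching loop-invariant input across iterations.

reads = {"count": 0}

def read_partition_from_dfs(path):
    reads["count"] += 1            # stand-in for a full HDFS partition scan
    return [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]

class InvariantCache:
    """Keeps loop-invariant partitions in memory across iterations."""
    def __init__(self):
        self._cache = {}
    def get(self, path):
        if path not in self._cache:
            self._cache[path] = read_partition_from_dfs(path)
        return self._cache[path]

cache = InvariantCache()
for iteration in range(10):
    links = cache.get("/data/graph/part-0")   # loop-invariant graph structure
    # ... per-iteration work over `links` would go here ...

print(reads["count"])  # 1 read instead of 10
```

In an iterative graph job the link structure never changes between iterations, so this single optimization removes most of the repeated I/O.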

Twister offers a lightweight mechanism for managing configurable, long-running tasks, implements efficient pub/sub-based communication, and of course provides special support for ‘iterative MR computations’.

Leverage CPU cache and distributed memory access patterns

There are quite a few frameworks that store data in distributed memory instead of HDFS: GridGain, Hazelcast, Spark’s RDDs, Piccolo.

Hadoop was not designed to facilitate interactive analytics, so it took a few game changers to exploit the ‘CPU cache’ and ‘distributed memory’.

Need for Dynamic Memory Computation and Resource Allocation

In Hadoop MR, the output of every batch job needs to be dumped to secondary storage. Instead, it would be a good idea to constantly compute the data size, create peer processors accordingly, and make sure the collective memory does not exceed the total data size. It is important to understand that Hadoop MR is not the best fit for processing all types of data structures! The BSP model should be adopted for massive graph processing, where the bulk of the static data can remain in the filesystem while dynamic data is processed by peers and the results are kept in memory.

Need for Push-based Map Reduce Resource Management

Though Hadoop’s main strength is distributed processing over clusters, at peak load the utilization drops due to scheduling overhead. It is well known that a MapReduce cluster is divided into a fixed number of processor slots based on static configuration. So Facebook introduced Corona, a ‘push-based’ scheduler in which a cluster manager tracks nodes and dynamically allocates slots, making it easier to utilize all the slots based on cluster workload, for both map-reduce and non-MR applications.
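The utilization gap between the two approaches is easy to see with a toy model (a deliberately simplified sketch, not Facebook’s actual scheduler). With static slots, a map-heavy workload leaves every reduce slot idle; a dynamic manager can hand any free slot to any runnable task:

```python
# Toy comparison: fixed map/reduce slots vs. dynamic slot allocation.

def fixed_slots(tasks, map_slots=4, reduce_slots=4):
    """Static split: a map task can only occupy a map slot, and vice versa."""
    used_map = min(map_slots, tasks.count("map"))
    used_reduce = min(reduce_slots, tasks.count("reduce"))
    return used_map + used_reduce

def dynamic_slots(tasks, total_slots=8):
    """Push-based manager: any free slot goes to any runnable task."""
    return min(total_slots, len(tasks))

# Map-heavy workload: 8 map tasks waiting, no reduce tasks yet.
workload = ["map"] * 8
print(fixed_slots(workload), dynamic_slots(workload))  # 4 vs 8 busy slots
```

The same cluster runs twice as many tasks in the skewed phase simply because the slot boundary is gone; that is the essence of the push-based design.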

Reduction in Cluster size

‘Data analysis’ processors do not need as many machines as Hadoop nodes. For example, ParAccel can handle in a single node the data analysis that Hadoop needs (on average) 8 nodes to perform, thanks to advanced storage-schema optimization. Enhanced file systems like QFS offer a much lower replication factor (1.5) and higher throughput.

Avoid Data duplication 

Both ParAccel and Hadapt share a similar vision of analyzing data as close to the data node as possible, without moving it to a separate BI layer.

As opposed to the Hadoop-connector strategies of MPP analytic datastores, Hadapt processes data in an RDBMS layer sitting close to HDFS and load-balances queries across a virtualized environment of adaptive query engines.

Ensure Fault-tolerance & Reliability 

Apache Hama ensures reliability through process communication and barrier synchronization. Each peer uses checkpoint recovery, occasionally flushing the volatile part of its state to the DFS, and can roll back to the last checkpoint in the event of failure.

Per the Shark documentation, “RDDs track the series of transformations used to build them (their lineage) to recompute lost data.”
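Lineage-based recovery can be sketched in miniature (an illustrative toy, not Spark’s real RDD class). Instead of replicating data, each dataset remembers the transformation chain that built it, so a lost in-memory partition is simply recomputed:

```python
# Minimal sketch of lineage-based fault recovery.

class ToyRDD:
    def __init__(self, compute_base=None, parent=None, transform=None):
        self.compute_base = compute_base   # how to (re)load the base data
        self.parent = parent               # lineage pointer
        self.transform = transform         # function applied to parent's output
        self._cached = None                # in-memory data; may be "lost"

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda data: [fn(x) for x in data])

    def collect(self):
        if self._cached is None:           # lost or never computed:
            if self.parent is None:        # walk the lineage chain back
                self._cached = self.compute_base()
            else:
                self._cached = self.transform(self.parent.collect())
        return self._cached

    def lose_partition(self):              # simulate a node failure
        self._cached = None

base = ToyRDD(compute_base=lambda: [1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())    # [2, 4, 6]
doubled.lose_partition()    # in-memory data gone, lineage survives
print(doubled.collect())    # recomputed from lineage: [2, 4, 6]
```

Because the lineage is tiny compared to the data, this gives fault tolerance without the storage and network cost of replicating every intermediate result.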

Shark/Spark, with its compact Resilient Distributed Dataset (RDD)-based distributed in-memory computing, offers great hope to startups and open-source enthusiasts for building a lightning-fast data warehouse system.

References: