In Part-1 (published in 2012) we discussed limited options for streaming  real-time ETL and Analytics on Hadoop

Lets explore how Kafka works with Storm / Spark / Samza. We briefly touch upon SQLStream and DataTorrents

Streaming technology has matured a lot and its now possible to develop a fault-tolerant , scalable messaging and streaming solution.

Courtesy : O’reilly Data Newsletter


Data Source –> Pub/Sub –> Stream Processing  –> Store/Analyze  –> Consumer / Applications

Stream Ingestion :

Kafka :  high-throughput distributed messaging system acting as message broker for buffering events between Logstash agents or directly consuming messages from sources (apps / search sites/ agents)

{auto-discover clusters using zk_connect , maintain groups topics blacklists whitelists}
{manage offsets , consumer threads , queue size , max_bytes ( parallelism , buffers) }
{configure rebalancing options, decoder classes}
>> logstash <-> kafka
input { kafka { topic_id => "user_event_1" } }
output { kafka { topic_id => "user_event_1" } }

Reference :  , , custom decoder :

Why Kafka ?

Lets recall how messy can an Enterprise App become ?

Apps and Services (~20) -> ActiveMQ , Splunk , OLTP RDBS , KV Store, Monitoring , Cache, NFS, Hadoop (complexity ~20X20)–> Apps and Services (~20) ….  everyone talking to everyone else … O(N*N)

Such messy architecture leads to –>  Low throughput, lossy, lack of scalability, no central manager, no batch integration , no multi-subscriber scaling, lack of strong persistence, no stream procesing , no ordering guarantee , partitioning …

Without a reliable and complete data flow, a Hadoop cluster is a very expensive beast to maintain to accomplish Data Acquisition to Integration and Analytics !

The main bottleneck here is the fact that consumer is attached to source of data !

Lets put less pressure on Infrastructure as Data grows Exponentially ! We need efficient Broker !

In essence , the Broker should be very efficient ETL enabler ! We must pre-process (Extract / Transform / Load) and implement the common data patterns as much as possible before data reaches its Destination !

Business, Users, Finance, Operations => Unified Logs <= Hadoop, Search, Monitoring, DW, Social Graph, Reco Engine, Security, Email  …   Great … now Complexity ~ O(N*2)

So Kafka implemented a very fundamental concept — maintain Commit Logs and strong multi-publisher, multi-subscriber, multi-topic ! Built like a Distributed System !

Just like unix program – streams data between programs ..  (cat input | grep “foo” | wc -l)

kafkaScalability of a filesystem

Hundreds of MB/sec/server throughput ,

Many TB per server

Guarantees of a database

Messages strictly ordered , All data persistent

Distributed by default

Replication , Partitioning model


Producers, Consumers, and Brokers all fault tolerant and horizontally scalable

Stream Processing

interesting read :  in-stream-big-data-processing

excerpts from kafka + spark-streaming integration

// Define the Kafka parameters, broker list must be specified val kafkaParams = Map(“” -> “localhost:9092,anotherhost:9092”)

// Define which topics to read from val topics = Set(“sometopic”, “anothertopic”)

// Create the direct stream with the Kafka parameters and topics val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)

Spark modelling and Spark query best practices
Other approaches for Streaming ETL  and Real-time DataWarehousing (multiple dimensions) and Ad-hoc BI

Example of using Cassandra as the Datawarehouse with Spark streaming


Druid –  real time distributed OLAP

Doradus (Cassandra) –

Impala – Datawarehousing on old-fashioned Hadoop data

References : strata slides,

Samza – is another great real-time stream analysis software.

process real-time data and re-process historical data when logic changes in the same framework 

Now , how does the overall stack look like !

Just to highlight again why Kafka is so popular as Streaming Log Bus

  • sequence of concurrent updates to nodes  (through data consistency)
  • highly available data replication
  • ‘strong commit’ guarantee to Leader (writer)
  • external data subscription feeds
  • handle / restore replica failure
  • rebalance data between nodes

Low-latency Queries :

Results of processed stream should be stored into serving nodes (backed by btrees / sstable / inverted index) i.e. Cassandra / Mongo / Elastic Search so that clients (producing streams) can quickly lookup the processed results.

Overall Best Practice :  publish every data change (DB log, feeds) to a stream so that latest version of a stream (Samza can partition, replicate , reconstruct and avoid data loss) can be queried faster than corresponding DB and for any changes in DB , current data can be joined with historical data – beating CAP)

Jay Kreps (author of Kafka) provided a list of recommended articles:

DataTorrent  (culmination of streaming and fast batch)

End-2-End Pipeline :  high-volume auto-scaling fault-tolerant event-streaming system

meeting SLA of campaign policy by meeting the no of impressions promised to Ad providers and improve Ad placement performances.

Large pool of Ad servers generating huge number of events, DT provides fault-tolerant operators over flumes, 2 days worth of de-duping on 10B events happening in 2 ms which is fed to DT real-time in-memory queue connected to a computing layer which calculates an Ad placement strategy (from 45 mins to 1 mins) -> real-time dashboards .

DataTorrent adopts a very intelligent approach to auto-scale the processing VMs based on the partitions (size of incoming events)

Ref :

SQLStream : Ref : Making the Elephant Fly

The most important concept for Streaming SQL is the stream. A stream is a continually updating data object. A stream is like a table with no end, but which does have a beginning (when the stream was established). The number of records in a stream can be infinite.

A stream is a schema object that is a relation but which does not store data like as a finite relation (such as a table in a database). Instead, a stream implements a “publish-subscribe” protocol. It can be written to by multiple writers and read from by multiple readers.

A conventional SQL application prepares and executes a statement with a SELECT… query and iterates through the returned result set until end of fetch is detected, when there are no more rows to return. The application then returns to doing something else.

Ref :

Academic Papers, Systems, Talks, and Blogs

  • These are good overviews of state machine and primary-backup replication.
  • PacificA is a generic framework for implementing log-based distributed storage systems at Microsoft.
  • Spanner—Not everyone loves logical time for their logs. Google’s new database tries to use physical time and models the uncertainty of clock drift directly by treating the timestamp as a range.
  • Datanomic: “Deconstructing the database” is a great presentation by Rich Hickey, the creator of Clojure, on his startup’s database product.
  • “A Survey of Rollback-Recovery Protocols in Message-Passing Systems“—I found this to be a very helpful introduction to fault tolerance and the practical application of logs to recovery outside databases.
  • “The Reactive Manifesto”—I’m actually not quite sure what is meant by reactive programming, but I think it means the same thing as “event driven.” This link doesn’t have much information, but this class by Martin Odersky (of Scala fame) looks fascinating.
  • Deconstructing the Database
  • Paxos!
    • Leslie Lamport has an interesting history of how the algorithm was created in the 1980s but was not published until 1998 because the reviewers didn’t like the Greek parable in the paper and he didn’t want to change it.
    • Once the original paper was published, it wasn’t well understood. Lamport tried again and this time even included a few of the “uninteresting details,” such as how to put his algorithm to use using actual computers. It is still not widely understood.
    • Fred Schneider and Butler Lampson each give a more detailed overview of applying Paxos in real systems.
    • A few Google engineers summarize their experience with implementing Paxos in Chubby.
    • I actually found all of the Paxos papers pretty painful to understand but dutifully struggled through them. But you don’t need to because this video by John Ousterhout (of log-structured filesystem fame) will make it all very simple. Somehow these consensus algorithms are much better presented by drawing them as the communication rounds unfold, rather than in a static presentation in a paper. Ironically, this video, which I consider the easiest overview of Paxos to understand, was created in an attempt to show that Paxos was hard to understand.
    • “Using Paxos to Build a Scalable Consistent Data Store”—This is a cool paper on using a log to build a data store. Jun, one of the coauthors, is also one of the earliest engineers on Kafka.
  • Paxos has competitors! Actually each of these map a lot more closely to the implementation of a log and are probably more suitable for practical implementation:
    • “Viewstamped Replication” by Barbara Liskov is an early algorithm to directly model log replication.
    • Zab is the algorithm used internally by Zookeeper.
    • RAFT is an attempt at a more understandable consensus algorithm. The video presentation, also by John Ousterhout, is great, too.
  • You can see the role of the log in action in different real distributed databases:
    • PNUTS is a system that attempts to apply the log-centric design of traditional distributed databases on a large scale.
    • HBase and Bigtable both give another example of logs in modern databases.
    • LinkedIn’s own distributed database, Espresso, like PNUTS, uses a log for replication, but takes a slightly different approach by using the underlying table itself as the source of the log.
  • If you find yourself comparison shopping for a replication algorithm, this paper might help you out.
  • Replication: Theory and Practice is a great book that collects a number of summary papers on replication in distributed systems. Many of the chapters are online (for example, 1, 4, 5, 6, 7, and8).
  • Stream processing. This is a bit too broad to summarize, but here are a few things I liked: