1. N1QL enables Real-time Aggregation on Multi-Dimensional data (simulating relational joins over json documents)

N1QL extends SQL to handle Nested , Heterogeneous , Distributed data and allows ad-hoc query on secondary index

Courtesy @Couchbase Workshop

N1QL allows us to choose which elements are returned at the top level.

This is incredibly useful as it reduces the need to write complex parsing logic:

  • Within the application.
  • In Client side front end frameworks (Angular/Backbone etc.)

SELECT a.name, s.flight, s.utc, r.sourceairport, r.destinationairport, r.equipment FROM `travel-sample` r UNNEST r.schedule s JOIN `travel-sample` a ON KEYS r.airlineid WHERE r.sourceairport=’LHR’ AND r.destinationairport=’LAX’ AND s.day=2 ORDER BY a.name

sample code :  https://github.com/kaniska/couchbase_analytics/tree/master


2. Spark SQL allows seamless aggregation of data by joining heterogeneous datasources 

connect existing BI tools through JDBC

share data with multiple users through cached tables in jdbc driver

use columnar formats (skip unnecessary columns , use required columns)  , store data as a block of columns for a specific column

use partitioning (on filter keys like /year=2015/month=08/ etc.) , force shuffle through repartition ( if we want parallelism like fork-join and then coalesce)

pushing predicates to storages (postgress , mysql)

uses data frames (a distributed collection of rows organized into named columns) , (map-partitions shuffled RDDs , for a well-expressed dataframe , the optimizer to does the magic)

df = sqlContext.read.format(“json”).option(“samplingRatio”,”0.1″).load(“/data/marketing1.json”)


// ETL  –>  use a library to read finance staging data


sqlCtx.table(“revenue”).groupBy(“dept”).agg(“dept”, avg(“amount”)).collect()

the cool stuff is custom Python, R , Java , Scala code can be blended with Dataframe operations :  here an attribute ‘City’ from udf is merged with ‘User Id’ from user table

zipToCity = udf(lambda zipCode:  <custom python GeoLoc library>)

def add_demography(events):

u = sqlCtx.table(“users”)

events.join(u, events_user_id == u.user_id).withColumn(“city”, zipToCity(df.zip))

// courtesy : databricks demo

and the best part is since all operations are lazy a great deal of optimization can be performed .

for example, now when we call the add_demographics to find events info and then find training data , users can be scanned while executing Query Plan and data can be columnized and compressed even before executing actual Query .

events = add_demographics(sqlCtx.load(“/data/events”, “parquet”))

training_data = events.where(events.city == “San Francisco”).select(events.timestamp).collect()

Precomputing Joined data and Preprocessing filters

// courtesy : databricks demo

Reference : https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html , https://forums.databricks.com/

Demo code : https://github.com/databricks/spark-pr-dashboard

Online Classes : https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x  ,  https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x

3. Streaming Solr documents meets Spark Dataframes


Special Notes  : Comparison between SPark and ElasticSearch use cases :

Spark :  Data Discovery -> to find patterns , cherry pick signals from noise , find trends

ElasticSearch : Deep data exploration by Query -> Instant Analytics Dashboard , complex boolean queries , timelines , aggregations , future trends