It's very important to be able to perform fast ad-hoc queries on current data, historical snapshots, and future trends by querying various static and predictive versions of the same data.
The goal is fast aggregation over billions of N-dimensional records, with flexible drill-downs and sub-second queries that surface results immediately on real-time data.
It's very easy to get lost in the sea of jargon and analytics offerings!
While building a fast analytics platform, we need to keep in mind that just having a fast data-processing engine is not good enough!
- We still need a ‘streaming data integration’ layer where SaaS, on-premise business data, and machine data can be collected uninterruptedly.
- Raw data needs to be stored in a file system optimized for analytics (index-able, searchable, version-able, compressible).
- Then comes the data engine, which needs to constantly discover and correlate data patterns, reuse pre-computed results, and support user-defined functions alongside built-in analytical functions.
- Finally, a much-needed sleek widget-based UI (or at least a user-friendly query-DSL editor).
- And the entire solution needs to be served from the cloud as a multi-tenant SaaS application, complying with all kinds of data-integrity, data-security, backup, and recovery requirements.
There are very few commercial SaaS analytics offerings (though there are many on-premise and appliance-based solutions) that satisfy all the above customer demands while fulfilling the promise of blazing-fast ad-hoc queries on arbitrary real-time datasets!
Tableau, SiSense, Lattice Engines, EdgeSpring, Ayasdi, and Hadapt clearly emerge as some of the most innovative commercial big-data business-analytics products, supporting most of the above features!
We shall first take a look at the open-source projects that can quickly enable creating numerous combinations of KPIs without waiting weeks or months to analyze daily-changing records.
A word of caution for open-source enthusiasts: you still need to build your own eye-catching prisms/widgets to visually represent the data analyzed by the following frameworks!
1. Druid : one of the most promising open-source projects.
In an RDBMS, ad-hoc queries that reach across multiple hierarchical dimensions (in some cases multiple facts) at arbitrary depth are painfully slow! Even range caching and parallel map/reduce cannot gain much speed.
Even NoSQL key/value stores mostly lack secondary indices, support range scans only on primary keys, and see processing cost increase exponentially with varying dimensional combinations.
Example (~500k records):
• 11 dimensions, depth 5 : 4.5 hours on a 15-node Hadoop cluster
• 14 dimensions, depth 5 : 9 hours on a 25-node Hadoop cluster
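The blow-up is easy to see: with d dimensions, every subset of the dimension set is a candidate group-by, so pre-computing all roll-ups means 2^d cuboids, before hierarchy depth multiplies the cost further. A minimal sketch:

```python
from itertools import combinations

def cuboid_count(n_dims):
    """Number of distinct group-by combinations (cuboids) over n dimensions:
    every subset of the dimension set is a potential roll-up target."""
    return 2 ** n_dims

# Enumerate the roll-ups explicitly for a tiny 3-dimension example.
dims = ["country", "product", "channel"]
cuboids = [c for r in range(len(dims) + 1) for c in combinations(dims, r)]

print(len(cuboids))       # 8 == 2**3
print(cuboid_count(11))   # 2048 roll-ups for the 11-dimension workload
print(cuboid_count(14))   # 16384 for 14 dimensions
```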
Druid is trying to address all these problems: it provides a distributed column store well-suited for fast data scans and fast filtering, with simplified processing logic.
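As a concrete illustration, Druid is queried by POSTing JSON over HTTP; the sketch below builds the shape of a groupBy query (the datasource, dimension, and metric names here are hypothetical):

```python
import json

# Hypothetical datasource and field names; the overall shape follows
# Druid's JSON groupBy query API.
query = {
    "queryType": "groupBy",
    "dataSource": "sales_events",
    "granularity": "day",
    "dimensions": ["country", "product"],
    "aggregations": [
        {"type": "longSum", "name": "units", "fieldName": "units_sold"}
    ],
    "intervals": ["2013-01-01/2013-02-01"],
}

payload = json.dumps(query)  # body of the HTTP POST to the Druid broker
```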
2. Shark (Spark on Hive) : a very intuitive approach to massively parallel data analysis; Shark offers fast data analysis using coarse-grained distributed memory.
As per the AMP Lab: "a combination of conceptually simple “engineering” changes to the engine (e.g., in-memory storage) and more involved architectural changes (e.g., partial DAG execution) can alleviate them. We also find that a somewhat surprising variable not considered in detail in MapReduce systems, the task scheduling overhead, actually has a dramatic effect on performance, and greatly improves load balancing if minimized."
3. Splout SQL : another promising open-source project, which aims to remove the bottleneck between web-layer ad-hoc query tools and big datasets distributed across many nodes.
We will soon try out both of these open-source offerings and evaluate how well they suit analyzing business data to create dynamic KPIs.
4. Impala : It was a well-researched finding in 2013 that datastores initially designed for querying big data (HBase, Cassandra, MongoDB) are not well-suited for fast ad-hoc queries, due to the intrinsic compromises they make to satisfy schemaless, sorted column-family / key-value structures!
So the Impala project started from the ground up to reach directly into HDFS and allow raw queries against the filesystem! It is at a nascent stage and needs to overcome many limitations, such as non-indexed HDFS.
Yet HBase will remain the first choice for querying big data involving a few column families, where just scanning keys can take us to the relevant cells!
HBase is enjoying strong community support and surging ahead: http://blog.sematext.com/2012/04/22/hbase-real-time-analytics-rollbacks-via-append-based-updates/
Avro on HBase offers compact JSON data for fast visualization. http://people.apache.org/~scottcarey/ApacheAvroOverview.pdf
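For reference, an Avro record schema is itself a plain JSON document; a minimal event schema might look like the following (the record and field names are illustrative). The writer stores the schema once and encodes rows compactly against it:

```python
import json

# Illustrative record and field names; Avro schemas are plain JSON documents.
schema = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url",     "type": "string"},
        {"name": "ts",      "type": "long"},
    ],
}

schema_json = json.dumps(schema)  # what an Avro library would parse
```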
> Impala-QL will be available as a high-level DSL for users.
5. jQuery+Node.js+Redis+MongoDB : there is a polyglot approach to this ad-hoc query problem!
I got a chance to play with this stack, and I explain the steps in detail in a separate blog post.
For medium-sized streaming data, we can fetch events through a Redis queue in real time, store both raw data and pre-computed aggregations in MongoDB, and keep the web UI updated through long-polling! Further drill-downs and ad-hoc queries on MongoDB (mostly leveraging in-memory map/reduce functions) are then ridiculously fast! There is very little price to pay for data conversions and payloads, as everything is JSON (UI > middleware > DB).
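A minimal sketch of that write path, with a plain deque standing in for the Redis queue and in-memory dicts standing in for the Mongo raw and aggregate collections (all names are illustrative):

```python
from collections import defaultdict, deque

# Stand-ins for the real stores: a deque for the Redis queue, a list for
# the raw-events collection, and a dict for the pre-computed aggregates.
event_queue = deque()
raw_events = []
aggregates = defaultdict(lambda: {"count": 0, "revenue": 0.0})

def ingest(event):
    event_queue.append(event)

def drain():
    """Consume queued events: store the raw document and bump the
    pre-computed roll-up, mirroring a Mongo upsert with $inc."""
    while event_queue:
        e = event_queue.popleft()
        raw_events.append(e)
        agg = aggregates[(e["country"], e["product"])]
        agg["count"] += 1
        agg["revenue"] += e["amount"]

ingest({"country": "US", "product": "widget", "amount": 10.0})
ingest({"country": "US", "product": "widget", "amount": 5.0})
drain()
print(aggregates[("US", "widget")])  # {'count': 2, 'revenue': 15.0}
```

Drill-downs then read the pre-computed roll-ups directly instead of re-scanning the raw events.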
6. SciDB : ad-hoc mathematical queries on big scientific data.
Through the array data model and its attendant query languages, SciDB provides ACID semantics and allows SQL-like (i.e., declarative) data manipulation and analysis. This is the cornerstone of ad-hoc, flexible, complex analytics, something that is not even theoretically possible with key-value stores. SciDB offers a SQL-like Array Query Language (AQL) and an Array Functional Language (AFL).
Proprietary and Commercial solutions :
1. Lattice Engines : SalesPrism is simply leading ad-hoc business analytics with its powerful built-in sales-prediction algorithms.
2. SiSense : a disruptive on-demand analytics solution offering dazzling prisms; it is really shaking the BI world!
ElastiCube's query processing engine, dubbed Elastic IQ, keeps performance usage patterns history for each column. Elastic IQ uses this information, along with hardware type and free resources currently available, to generate a query execution plan, optimized for fastest response and minimal RAM/CPU consumption. A query execution plan is made up of thousands of hardware-optimized instructions that can be easily processed in parallel across different CPUs/cores. Many of these instructions are shared across different queries and different users, so Elastic IQ uses 'instruction recycling' to execute each instruction once, and then re-use the result. This does not only apply to two identical queries, but even to different queries that share partial data.
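The "instruction recycling" idea can be approximated with plain memoization: a shared sub-computation is executed once and its result reused across different queries. This is a toy sketch of the concept, not SiSense's actual engine:

```python
from functools import lru_cache

# Toy column of values; the real engine works on compressed column blocks.
SALES = {"2013-01": [3, 5, 2], "2013-02": [7, 1]}

calls = {"n": 0}

@lru_cache(maxsize=None)
def month_total(month):
    """A shared 'instruction': computed once, then recycled across queries."""
    calls["n"] += 1
    return sum(SALES[month])

def query_q1_total():
    return month_total("2013-01") + month_total("2013-02")

def query_jan_only():
    return month_total("2013-01")

q1 = query_q1_total()   # executes both instructions
jan = query_jan_only()  # recycles the cached January result
print(q1, jan, calls["n"])  # 18 10 2
```

Note that the second query shares only partial data with the first, yet triggers no new computation.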
Copyright : SiSense, Reference : http://www.sisense.com/product/technology
3. Hadapt : It has taken a few advanced steps to make Hadoop work with a parallel database, optimizing queries and improving in-memory performance. http://engineering.linkedin.com/hadoop/recap-improving-hadoop-performance-1000x
> Hadapt offers a convenient SQL interface for end users to perform complex queries on big data.
4. Birst : the most promising comprehensive real-time analytics commercial solution (raw data collection, arbitrary KPI definition, actionable insights).
5. Vectorwise and ParAccel : they are leading the innovation in radical ad-hoc analytics with their individual, highly optimized vectorization and compression.
> VectorWise offers quick integration with the YellowFin UI for instant visualization of ad-hoc data analysis.
6. Other column stores and MPP databases : directly leverage parallel database technology (Teradata, Vertica, etc.) or parallel queries on columnar analytic datastores (InfiniDB / InfoBright).
We had already experienced the flexibility of columnar stores when we created a prototype using LucidDB. It is very convenient for bulk loading (streaming ETL in at one end of the data pipe and analytics on selected columns at the other end).
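The appeal is easy to demonstrate: in a columnar layout, an aggregate over one attribute scans a single contiguous array instead of touching every full record. A toy sketch (real columnar stores add compression and indexing on top):

```python
# Row layout: each record stored together.
rows = [
    {"region": "EU", "qty": 4, "price": 2.5},
    {"region": "US", "qty": 1, "price": 9.0},
    {"region": "EU", "qty": 2, "price": 2.5},
]

# Column layout: one contiguous array per attribute. An aggregate over a
# single column reads only that array, not every full record.
columns = {
    "region": [r["region"] for r in rows],
    "qty":    [r["qty"] for r in rows],
    "price":  [r["price"] for r in rows],
}

row_total = sum(r["qty"] for r in rows)   # scans whole records
col_total = sum(columns["qty"])           # scans just the qty column
print(row_total, col_total)  # 7 7
```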
While these columnar and (in some cases) map/reduce-capable databases are extremely efficient at conventional analytics, they still suffer from the bottlenecks of database access patterns (memory-to-filesystem latency and queries involving multi-dimensional references).
7. Tableau Data Engine
Architecture-aware design: the core Data Engine structure is a column-based representation using compression that supports execution of queries without decompression. Leveraging novel approaches from computer graphics, the algorithms were carefully designed for full utilization of modern processors, with near-optimal usage of the L1 and L2 caches, minimal intermediate results, and breakthrough techniques for streaming data from disk without loss of throughput. This avoids the common limitation of requiring data sets to be completely memory-resident before analysis can begin.
Copyright: Tableau Software
8. EdgeSpring : one of the greatest innovations in the space of arbitrary data exploration. Its heavily indexed and compressed filesystem-based key/value stores also leverage the fast L1, L2, and L3 CPU caches to offer blazing-fast analytics!
At its core is a revolutionary data store, the EdgeMart™ that combines three key innovations – massive compression for scalability and portability, a search–based approach for free–form discovery of the most granular data, and a schema–free representation for unprecedented flexibility and adaptability to change. On top of the EdgeMart foundation, we deliver a highly dynamic, interactive visual Lens Framework that enables end users to immediately experience the speed, power, and flexibility of the EdgeMart using standard web technology.
Through this tightly integrated approach, EdgeSpring leapfrogs the arcane world of ETL, Data Warehousing, Data Marts, and BI, to facilitate Business Insight within days rather than months and years.
9. Ayasdi : its patented Topological Data Analysis discovers insights within data without requiring users to perform any queries.
This is also a multi-tenant SaaS analytics platform that offers instant ad-hoc analysis on complex big data (running against an HBase cluster).
10. TokuDB : introduced innovative fractal-tree indexing to improve MySQL query performance (fast ad-hoc queries on a write-optimized indexing structure) with maximum utilization of SSDs. http://www.tokutek.com/2012/12/fractal-tree-indexing-overview/
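A drastically simplified sketch of the write-optimized idea: inserts accumulate in a buffer and are merged into the sorted leaf level in batches, amortizing many inserts per merge. A real fractal tree keeps a buffer at every internal node and answers point queries by checking the buffers along the root-to-leaf path:

```python
import bisect

class BufferedTree:
    """Toy write-optimized index (not TokuDB's actual structure): inserts
    land in a small in-memory buffer and are merged into the sorted leaf
    level in batches, so each expensive merge amortizes many inserts."""

    def __init__(self, buffer_size=4):
        self.buffer = []
        self.leaves = []          # sorted keys, standing in for leaf pages
        self.buffer_size = buffer_size
        self.flushes = 0          # counts batch merges ('disk' writes)

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.buffer_size:
            self._flush()

    def _flush(self):
        self.flushes += 1
        for k in self.buffer:
            bisect.insort(self.leaves, k)
        self.buffer = []

    def contains(self, key):
        # Check the pending buffer first, then the sorted leaf level.
        if key in self.buffer:
            return True
        i = bisect.bisect_left(self.leaves, key)
        return i < len(self.leaves) and self.leaves[i] == key

t = BufferedTree()
for k in [9, 3, 7, 1, 5]:
    t.insert(k)
print(t.contains(7), t.flushes)  # True 1
```

Five inserts cost only one batch merge here; queries still see all data because the buffer is consulted before the leaves.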
There are some other open-source projects worth taking a look at:
(a) Katta : searching distributed Lucene indices or HDFS data.
(b) Google BigQuery (big-data analytics as a service) is a cheaper way to offload analytics processing to Google data centers. http://www.toolsjournal.com/cloud-articles/item/568-google-bigquery-offers-big-data-analytics-as-a-service
There are also some very costly, enterprise-focused MPP analytical datastores packaged inside appliances. They may not offer cost-effective solutions for blazing-fast ad-hoc queries on real-time data, but they are still worth mentioning for reference:
1. SAS Analytics with its UI running on the Teradata big-data crunching appliance. http://www.teradata.com/partners/SAS/SAS-High-Performance-Analytics-for-Teradata-Advantage-Program/
2. SAP in-memory Big Data Analytics computing. http://www.sap.com/services-and-support/in-memory/hana.epx
3. IBM Big Data Appliance and Visualization – Hadoop management + Netezza + Vivisimo. http://thinking.netezza.com/press_release/ibm-advances-big-data-analytics-acquisition-vivisimo
Reference : http://www.quora.com/Big-Data/best_questions