The BI world is going through a huge paradigm shift toward ‘real-time ad-hoc queries on arbitrarily large data sets from disparate data sources’!
The world already knows very well how to collect, store and analyze structured and unstructured big data using a combination of NoSQL, NewSQL, Hadoop and MPP datastores.
So the next major focus is how to run real-time dynamic queries at the speed of thought!
1. New-age databases leveraging multi-core processors and high-speed CPU caches: It is well known that a single-bus multi-core CPU offers simultaneous multi-threading, which significantly reduces latency for certain types of algorithms and data structures. So DBMSs are being re-architected from the ground up to leverage a ‘CPU-bound partitioning-phase hash join’ as opposed to a ‘memory-bound hash join’!
It is the ever-increasing speed of CPU caches and TLBs that allows blazing-fast computation and retrieval of hashed results. It is also noteworthy that modern multi-core CPUs and GPGPUs make compression schemes cheap, at virtually no CPU cost. Meanwhile, as we know, memory access keeps getting slower relative to the ever-galloping processor clock speed!
ElastiCube from SiSense leverages ‘query plans optimized for fast response time and parallel execution based on multi-cores’ and continuous ‘instruction recycling’ to reuse pre-computed results.
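To make the partitioning idea concrete, here is a minimal sketch of a partitioned hash join. It is not how any particular product implements it; the function name, the `num_partitions` parameter and the sample rows are assumptions made up for illustration. The point is that both inputs are first split into small partitions, so each per-partition hash table stays small (in a real engine, small enough to fit the CPU cache).

```python
from collections import defaultdict

def partitioned_hash_join(left, right, num_partitions=8):
    """Join two lists of (key, value) rows on key, one partition at a time."""
    # Partitioning phase: rows with equal keys always land in the same partition.
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    results = []
    for lpart, rpart in zip(left_parts, right_parts):
        # Build phase: a small hash table over one left partition only.
        table = defaultdict(list)
        for key, value in lpart:
            table[key].append(value)
        # Probe phase: look up each right row in that small table.
        for key, rvalue in rpart:
            for lvalue in table.get(key, ()):
                results.append((key, lvalue, rvalue))
    return results

# Hypothetical sample data
customers = [(1, "Ann"), (2, "Bob"), (3, "Cid")]
orders = [(1, "book"), (2, "pen"), (1, "lamp")]
joined = partitioned_hash_join(customers, orders, num_partitions=4)
```

Each partition pair is independent of the others, which is also what makes this shape easy to spread across cores.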
2. Parallel vectorization of compressed data through SIMD (Single Instruction, Multiple Data): VectorWise makes efficient use of vectorization, CPU-efficient compression and ‘using the CPU cache as execution memory’. This is also a core technology behind many leading analytic column stores.
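Python cannot express SIMD instructions directly, so the following is only a sketch of the vector-at-a-time execution shape: operators pass small batches of column values to each other instead of one row at a time. The batch size of 1024 and all function names are assumptions for illustration, not VectorWise internals.

```python
from array import array

VECTOR_SIZE = 1024  # assumed batch size, small enough to stay cache-resident

def scan(column, vector_size=VECTOR_SIZE):
    """Yield the column as a sequence of fixed-size vectors (batches)."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]

def select_greater(vectors, threshold):
    """Filter operator: works on a whole vector per call, not per row."""
    for vec in vectors:
        yield array(vec.typecode, (v for v in vec if v > threshold))

def total(vectors):
    """Aggregate operator: sums each vector, then sums the partial sums."""
    return sum(sum(vec) for vec in vectors)

prices = array("l", range(10_000))
result = total(select_greater(scan(prices), 9_000))
```

In a real engine the inner loops over a vector are tight, branch-free loops that the compiler can turn into SIMD instructions; here they only show where those loops would sit.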
3. Positional Delta Tree (PDT): A PDT keeps both the position and the delta of each update in memory and merges them with the stored data during optimized query execution. Ad-hoc query in most cases is about identifying the ‘difference’. VectorWise makes effective use of PDTs. More can be found in its white paper.
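The merge-at-scan-time idea can be sketched as follows. This is a deliberately simplified stand-in: it keeps the deltas in a plain dictionary keyed by position rather than in a tree, and the entry format ("ins", "del", "mod") is an assumption made up for the example.

```python
def merge_scan(stable, deltas):
    """Yield the current table image by merging positional deltas into a scan.

    stable: the unchanged on-disk column, as a list.
    deltas: maps a position in `stable` to a list of entries:
            ("ins", value) inserts a row before that position,
            ("del",) removes the row at that position,
            ("mod", value) replaces the value at that position.
    """
    for pos, value in enumerate(stable):
        keep = True
        for entry in deltas.get(pos, ()):
            if entry[0] == "ins":
                yield entry[1]
            elif entry[0] == "del":
                keep = False
            elif entry[0] == "mod":
                value = entry[1]
        if keep:
            yield value
    # Inserts positioned past the last stable row are appended at the end.
    for entry in deltas.get(len(stable), ()):
        if entry[0] == "ins":
            yield entry[1]

stable_column = [10, 20, 30, 40]
pending = {1: [("ins", 15)], 2: [("del",)], 3: [("mod", 45)], 4: [("ins", 50)]}
current = list(merge_scan(stable_column, pending))
```

The stable data is never rewritten; updates are looked up by position while scanning, so no key comparison is needed during the merge.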
4. Directly querying compressed data residing in heavily indexed columnar files: SSDs and flash storage will get cheaper (which means cheaper cloud services) thanks to innovative compression and de-duplication at the file-system level. All the analytics datastores are gearing up to make the most of this.
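As a small illustration of querying compressed data without decompressing it, the sketch below uses run-length encoding, which is one common columnar compression scheme and is chosen here only as an assumption. The helper names are hypothetical.

```python
def rle_encode(values):
    """Compress a column into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def count_equal(runs, target):
    """COUNT(*) WHERE col = target, computed on the runs directly."""
    return sum(length for value, length in runs if value == target)

def rle_sum(runs):
    """SUM(col), computed as value * run_length per run."""
    return sum(value * length for value, length in runs)

column = [5, 5, 5, 7, 7, 5, 9, 9, 9, 9]
runs = rle_encode(column)
```

Both queries touch one entry per run instead of one entry per row, so the more compressible the column, the less work the query does.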
5. Usage of Fractal Tree indexes: Local processing is the key! Keep enough buffers and pivots in each tree node to avoid frequent, costly round trips down the tree for individual items. That means keep filling up the local buffer and then flush it in bulk! New-age drives love bulk updates (more changes per write), which avoid fragmentation!
TokuDB replaced MySQL's B-tree implementation with Fractal Tree indexes and achieved a massive performance gain.
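The buffer-then-bulk-flush idea can be sketched with a single buffered node sitting above a set of leaves. This is a toy model, not TokuDB's data structure: the class name, the `buffer_limit` parameter and the one-level layout are assumptions for illustration.

```python
import bisect

class BufferedNode:
    """One internal node with pivots, a message buffer and leaf children."""

    def __init__(self, pivots, buffer_limit=4):
        self.pivots = pivots  # sorted keys separating the children
        self.children = [dict() for _ in range(len(pivots) + 1)]  # leaves
        self.buffer = []  # pending (key, value) messages
        self.buffer_limit = buffer_limit
        self.flushes = 0  # number of bulk writes to the leaves

    def _child_for(self, key):
        return self.children[bisect.bisect_right(self.pivots, key)]

    def insert(self, key, value):
        # Inserts only touch the local buffer until it fills up.
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # One bulk pass pushes every buffered message down to its leaf.
        for key, value in self.buffer:
            self._child_for(key)[key] = value
        self.buffer.clear()
        self.flushes += 1

    def get(self, key):
        # The newest pending message wins over what is stored in the leaf.
        for k, v in reversed(self.buffer):
            if k == key:
                return v
        return self._child_for(key).get(key)

node = BufferedNode(pivots=[10, 20], buffer_limit=4)
for k in [3, 12, 25, 7, 18]:
    node.insert(k, k * k)
```

Five inserts cause only one write to the leaves here; in a real tree every level has such buffers, so the cost of reaching a leaf is shared by many changes.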
6. Distributed shared-memory abstraction provides a radical performance improvement over disk-based MR! A partial DAG (Directed Acyclic Graph) execution model describes parallel processing for in-memory computation (aggregate result sets that fit in memory, e.g. intermediate Map outputs).
Shark, built on Spark's ‘Resilient Distributed Dataset’ architecture, converts a query into an ‘operator tree’! Shark keeps re-optimizing a running query after running the first few stages of the task DAG, thereby selecting a better join strategy and the right degree of parallelism. Shark also offers ‘co-partitioning multiple tables based on common key’ for faster join queries!
It leverages SSDs, CPU cores and main memory to the fullest extent!
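The core of the ‘Resilient Distributed Dataset’ idea is that a dataset remembers how it was derived (its lineage) instead of storing every intermediate result, so any single partition can be recomputed on demand. Here is a single-process sketch of that idea; the `LazyDataset` class and its methods are hypothetical and are not the Spark API.

```python
class LazyDataset:
    """A partitioned dataset that records transformations instead of running them."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions  # source data, one list per partition
        self.lineage = lineage        # tuple of (kind, function) steps

    def map(self, fn):
        return LazyDataset(self.partitions, self.lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.partitions, self.lineage + (("filter", fn),))

    def compute_partition(self, index):
        # Replay the lineage for one partition only; this is also how a
        # lost partition could be rebuilt without touching the others.
        rows = iter(self.partitions[index])
        for kind, fn in self.lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self):
        return [row
                for i in range(len(self.partitions))
                for row in self.compute_partition(i)]

base = LazyDataset([[1, 2, 3], [4, 5, 6]])
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
```

Nothing runs until `collect()` or `compute_partition()` is called, and the intermediate filtered rows are streamed through memory rather than written out between steps.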
7. Topological Data Analysis is a giant leap forward! It treats the data model as a topology of nodes and discovers patterns and results by measuring similarity. Ayasdi has pioneered this idea to build the first ‘query-free exploratory analytics tool’! It's a true example of ‘analytics based on unsupervised learning without requiring a priori algebraic model’.
Ayasdi Iris is a mind-boggling insight discovery tool !
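A very rough way to see ‘patterns from similarity, with no query’ is to link every pair of points that lie close together and read off the connected groups. This sketch shows only that simplest building block and is in no way Ayasdi's method; the function name, the `radius` parameter and the sample points are assumptions.

```python
from itertools import combinations
from math import dist

def similarity_components(points, radius):
    """Group points into connected components of the 'closer than radius' graph."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Connect every pair of sufficiently similar points.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= radius:
            parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)
    return sorted(groups.values())

samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
shapes = similarity_components(samples, radius=1.5)
```

No model or query describes the two groups in advance; they fall out of the similarity measure alone.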
8. Build a knowledge base by pre-creating and continuously updating metadata about the data, data access patterns, queries and aggregate results. This innovative ‘dynamic introspective’ approach helps columnar storage avoid the need for indexing and costly subselects, and lets it decompress only the required data!
InfoBright offers fast analytics based on such ‘knowledge-based columnar data discovery’.
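One simple form such metadata can take is a min/max summary per block of column values, which lets a query skip or answer whole blocks without opening them. The sketch below assumes that form; the function names and the `pack_size` parameter are hypothetical and not InfoBright's API.

```python
def build_metadata(column, pack_size):
    """Split a column into packs and record (min, max) for each pack."""
    packs = [column[i:i + pack_size] for i in range(0, len(column), pack_size)]
    meta = [(min(p), max(p)) for p in packs]
    return packs, meta

def count_greater(packs, meta, threshold):
    """COUNT(*) WHERE col > threshold, opening only the packs that need it."""
    count = 0
    opened = 0
    for pack, (lo, hi) in zip(packs, meta):
        if hi <= threshold:
            continue                 # no row can match: skip the pack
        if lo > threshold:
            count += len(pack)       # every row matches: metadata is enough
            continue
        opened += 1                  # mixed pack: this one must be read
        count += sum(1 for v in pack if v > threshold)
    return count, opened

packs, meta = build_metadata(list(range(100)), pack_size=10)
matches, packs_opened = count_greater(packs, meta, threshold=54)
```

Out of ten packs only one has to be read (and, in a compressed store, decompressed) to answer the query.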
9. Parallel array computation is a very simple yet powerful mathematical approach for embedding Big Math functions directly inside the database engine. SciDB has mastered this concept by embedding statistical computation using distributed, multidimensional arrays.
10. Associative Memory Base: This is one of the most exciting technologies fueling innovation at Saffron. Saffron Memory Base (SMB) embeds ‘machine learning capabilities’, ‘semantic correlation’ and ‘unstructured text analysis’ directly into the ‘data processing engine’ just by ‘thinking like a brain’! A brain does not record who / what / when / how things caused an event at a certain point in time; rather, it depends on connections and counts, i.e. it correlates similar entities, spaces and events to derive a result. As opposed to pure statistical calculation, SMB, like the brain, mixes connections with counts, i.e. semantics with statistics! It discovers the natural contexts and ranks thousands of attributes in big data without requiring a massive ontology or base model to be built!
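The ‘connections and counts’ idea can be sketched as a co-occurrence memory: every observation strengthens the links between the entities seen together, and a lookup ranks associates by count. This is an illustrative toy, not Saffron's implementation; the class name and the sample observations are made up.

```python
from collections import Counter, defaultdict
from itertools import permutations

class AssociativeMemory:
    """Stores which entities were seen together and how often."""

    def __init__(self):
        self.links = defaultdict(Counter)  # entity -> Counter of co-seen entities

    def observe(self, entities):
        # Every ordered pair of distinct entities in one observation
        # gets its connection count increased by one.
        for a, b in permutations(set(entities), 2):
            self.links[a][b] += 1

    def associated(self, entity, top=3):
        # Rank the entity's connections by how often they co-occurred.
        return self.links[entity].most_common(top)

memory = AssociativeMemory()
memory.observe(["alice", "paris", "conference"])
memory.observe(["alice", "paris", "hotel"])
memory.observe(["bob", "paris", "conference"])
```

Calling `memory.associated("alice", top=1)` ranks "paris" first because that connection has the highest count; no schema or ontology was defined up front.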
11. Bulk Synchronous Parallel manages synchronization and communication in a middle layer, as opposed to a file-system-based parallel random access pattern! It uses the K-Means Clustering algorithm to . Apache Hama provides a stable reference implementation for analyzing streaming events or big data with a graph/network structure by implementing a deadlock-free ‘message passing interface’ and ‘barrier synchronisation’ (which significantly reduces network overheads).
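The superstep structure of BSP (local computation, message passing, then a barrier) can be simulated in a single process. The sketch below propagates the maximum value through a small graph; it is a sequential simulation with made-up names and data, not the Apache Hama API.

```python
def bsp_max(graph, initial):
    """Spread the largest value to every vertex using BSP-style supersteps.

    graph: vertex -> list of neighbour vertices.
    initial: vertex -> starting number.
    """
    values = dict(initial)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            inbox[n].append(values[v])

    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            # Local computation: only messages from the previous superstep are seen.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                # Communication: tell the neighbours about the new value.
                for n in graph[v]:
                    outbox[n].append(values[v])
        # Barrier: sent messages become visible only in the next superstep.
        inbox = outbox
    return values

chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final = bsp_max(chain, {"a": 3, "b": 6, "c": 2, "d": 1})
```

Because no vertex reads a message before the barrier, there is nothing to lock and no way to deadlock; the run ends when a superstep produces no messages.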
12. Semi-computation and instant approximation: ‘fast response to ad-hoc query’ through continuous learning from experience and instant approximation, as opposed to waiting for processing to end and the final result to be computed. A bunch of innovative products with built-in ‘data science capabilities’ are coming to market, for example H2O from 0xdata.
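One way to picture instant approximation is a running aggregate that reports an estimate, with an error margin, long before all the data has been read. The sketch below does this for an average using Welford's running-variance update; it is a generic illustration with hypothetical names and data, not how H2O works.

```python
import math
import random

def progressive_mean(stream, report_every):
    """Yield (rows_seen, estimated_mean, margin) while the data is still arriving."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        if count % report_every == 0:
            if count > 1:
                stderr = math.sqrt(m2 / (count - 1) / count)
            else:
                stderr = float("inf")
            # 1.96 standard errors: roughly a 95% interval if rows arrive in random order.
            yield count, mean, 1.96 * stderr

data = list(range(1, 1001))
random.Random(0).shuffle(data)  # assumption: rows arrive in random order
estimates = list(progressive_mean(data, report_every=250))
```

The first estimate is available after a quarter of the rows, and each later report tightens the margin; the user can stop as soon as the answer is good enough.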
References:
http://pages.cs.wisc.edu/~jignesh/publ/hashjoin.pdf
http://www.sisense.com/documentation/prism-elasticube-manager/introduction-to-elasticube-manager
http://kowshik.github.com/JPregel/pregel_paper.pdf
http://en.wikipedia.org/wiki/Topological_sorting
http://fastreporting.files.wordpress.com/2011/03/vectorwise-whitepaper.pdf
http://www.slideshare.net/paulhofmann/big-data-and-saffron
http://www.paradigm4.com/2013/01/terabyte-scale-parallel-processing-with-r-and-scidb/
http://en.wikipedia.org/wiki/Associative_Memory_Base
http://www.staff.science.uu.nl/~bisse101/Book/PSC/psc1_2.pdf
http://calab.kaist.ac.kr/~swseo/papers/IEEE_CLOUDCOM2010_HAMA.pdf
http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
http://gigaom.com/2012/07/05/want-to-ditch-your-data-scientists-heres-are-7-startups-that-can-help/
http://pages.sisense.com/elasticube-whitepaper.html?src=bottom
http://support.infobright.com/Support/Resource-Library/Whitepapers/