Finally Datawarehouse is set free from the bondage of slow colloquial relational databases which do not scale quite well with growing number of multiple tenants and multiple dimensions and with rising demand of trending historical data.
RDBMS not suitable for Analytics :
- Huge Overhead of – Record Level Locking
- Lacks notion of implicit ordering
- Overhead of buffer pool is too big
- Overhead of latching threads
- Write-ahead Logging
- No support for ROLAP
- No notion of Shared Nothing Grid Based Architecture
select count(source), stage, dayOfWeek, owner group by source, stage, dayOfWeek, owner. The SQL statement only access 4 columns.
In a Row Store, the database has to read the ENTIRE table from disk (say 5 GB).
In a Column Store, the database only reads the columns it needs from disk (100 MB).
There is no dynamic rules for sharding and partitioning. Its all done once through schema creation process. http://www.nicholasgoodman.com/bt/blog/2011/01/31/column-store-101/
CREATE TABLE Sale (
tenantId int NOT NULL,
orderId int NOT NULL,
… omitting for …
) PARTITION BY LIST(tenantId) (
PARTITION t1 VALUES IN (1) ENGINE=InnoDB,
PARTITION t2 VALUES IN (2) ENGINE=InnoDB,
PARTITION t3 VALUES IN (3) ENGINE=InnoDB,
We can completely get rid of Fact-2-Fact Joins , Snow Flake Schema, Monolithic Views (which are major roadblocks of performance in RDBMS based Datawarehouse). ColumnStores is all about partitioning data sets into logical time-series in pre-aggregated tables. For real-time query and historical trending ColumnStore prove to be fast and scalable through add-on layers.
Actian is leading the edge through its award-winning ‘Real-time analytics modules/add-ons top of its fasted Columnar DB’
Lets see the innovations Columnar DB thrive on :
Very fast query on highly-selective multi-dimensional data through >>
- Powerful Vectorisation
- Hash join/Aggregation
- Star Join Optimization
- Fast Intersect/ Union/Inner/ Merge – joins.
- Dimension history tracking,
- More cpu-bound & less io-bound,
- Point-in-time Query.
- Fast bulk loading in parallel,
- Data-streaming, page multi-versioning,
- Hot/incremental backup, upsert/merge,
- User-defined transformations
Of course, only few Analytics Datastores (Vectorwise, ParAccel, InfoBright) can boast of innovative feature-set. There ares still few open areas like correlated and paralleled queries, implicit support for timezones and encryption.
This blogpost summarizes the analytics offerings based on market segments : http://blogs.actian.com/ketankaria/2012/06/23/vectorwise-trends/
Note : 08/24/2015
Spark offers great integration with columnar structures (cardinality distribution tables, parquet formats )