Data Analysis Strategy for Large Volumes of Data

Initially we define the schema for the Time dimensions, Junk dimensions, and Calculated-Field dimensions, and attach the dimension references to the Fact table.
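As a minimal sketch of that layout, the snippet below builds a tiny star schema in an in-memory SQLite database: a time dimension, a junk dimension, and a fact table that references both by surrogate key. All table and column names other than those appearing in the query later in this post (`dim_local_date`, `fact_deal`, `int_date`, etc.) are hypothetical.

```python
import sqlite3

# Hypothetical star schema: one time dimension, one junk dimension,
# and a fact table carrying foreign keys to both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_local_date (
    dim_local_date_id INTEGER PRIMARY KEY,
    cust_id           INTEGER,
    int_date          INTEGER      -- e.g. 20240131
);
CREATE TABLE dim_deal_junk (
    dim_deal_junk_id  INTEGER PRIMARY KEY,
    status            TEXT,
    channel           TEXT
);
CREATE TABLE fact_deal (
    fact_deal_id        INTEGER PRIMARY KEY,
    entity_biz_key      TEXT,      -- business key shared by all versions of a deal
    cust_id             INTEGER,
    amount              REAL,
    close_local_date_id INTEGER REFERENCES dim_local_date,
    dim_deal_junk_id    INTEGER REFERENCES dim_deal_junk
);
""")
```

Each new dimension added to the fact table means one more join at query time, which is what drives the performance problem described next.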

As the volume of data grows and the number of dimensions increases, performance degrades.

Imagine the following SQL to calculate the SUM of deal amounts, assuming all historical versions of a deal entity are preserved:


SELECT SUM(fact1.amount)
FROM   fact_deal fact1,
       (SELECT fact2.entity_biz_key    AS min_biz_key,
               MIN(fact2.fact_deal_id) AS min_id
        FROM   fact_deal fact2
               INNER JOIN dim_local_date dim1
                       ON ( dim1.cust_id = fact2.cust_id
                            AND dim1.dim_local_date_id = fact2.close_local_date_id )
               INNER JOIN ...  fact2.created_local_date_id
               INNER JOIN ...  fact2.effective_from_local_date_id
               INNER JOIN ...  fact2.effective_to_local_date_id
        WHERE  fact2.cust_id = <Customer ID>
               AND dim2.int_date BETWEEN <StartDate> AND <EndDate>
               AND ...
        GROUP  BY fact2.entity_biz_key) view
WHERE  fact1.entity_biz_key = view.min_biz_key
       AND fact1.fact_deal_id = view.min_id;
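The core of this query is the inner SELECT: because every historical version of a deal is preserved, it picks exactly one row per `entity_biz_key` (here, the one with the minimum `fact_deal_id`) before summing. The runnable sketch below demonstrates that deduplicate-then-aggregate pattern on a stripped-down `fact_deal` table in SQLite, with invented sample data and the date-dimension joins omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_deal (
    fact_deal_id   INTEGER PRIMARY KEY,
    entity_biz_key TEXT,
    cust_id        INTEGER,
    amount         REAL)""")

# Two historical versions of deal 'D1' and one version of 'D2', customer 7.
conn.executemany("INSERT INTO fact_deal VALUES (?,?,?,?)", [
    (1, "D1", 7, 100.0),   # earliest version of D1 -> selected
    (2, "D1", 7, 150.0),   # later version of D1    -> excluded
    (3, "D2", 7, 40.0),
])

# One row per business key (MIN fact_deal_id), then SUM over those rows.
total = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_deal f
         INNER JOIN (SELECT entity_biz_key,
                            MIN(fact_deal_id) AS min_id
                     FROM   fact_deal
                     WHERE  cust_id = 7
                     GROUP  BY entity_biz_key) v
                 ON f.entity_biz_key = v.entity_biz_key
                AND f.fact_deal_id = v.min_id
""").fetchone()[0]
print(total)  # 140.0 (100.0 from D1's first version + 40.0 from D2)
```

With the full set of date-dimension joins added back in, this self-join over a large fact table is exactly where the cost blows up.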

To optimize such a data-warehouse query, a common approach for time-series aggregation is to first compute a base data set with a simple SQL query, and then apply dynamic or pre-defined filters to that base set.
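A minimal sketch of that two-step approach, assuming the base data set has already been materialized as one deduplicated row per deal (all field names and values are illustrative):

```python
# Step 1: the base data set, produced once by a simple SQL query.
base = [
    {"biz_key": "D1", "close_date": 20240110, "amount": 100.0},
    {"biz_key": "D2", "close_date": 20240215, "amount": 40.0},
    {"biz_key": "D3", "close_date": 20240320, "amount": 75.0},
]

# Step 2: dynamic filters applied to the base set, not to the raw fact table.
def filter_date_range(rows, start, end):
    return [r for r in rows if start <= r["close_date"] <= end]

jan_feb = filter_date_range(base, 20240101, 20240229)
total = sum(r["amount"] for r in jan_feb)
print(total)  # 140.0
```

The join-heavy SQL runs once; each new filter combination only re-scans the much smaller base set.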

Next, we can execute the SQL-MapReduce query in parallel for multiple time points, and apply histogram statistics and aggregation functions to the results.
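The fan-out over time points can be sketched with a thread pool: each worker aggregates the shared base set as of one time point, and the per-point results are collected at the end. This is only an in-process illustration of the parallel pattern, not an SQL-MapReduce implementation; the data and the `total_up_to` helper are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pre-computed base data set (one row per deal).
base = [
    {"biz_key": "D1", "close_date": 20240110, "amount": 100.0},
    {"biz_key": "D2", "close_date": 20240215, "amount": 40.0},
    {"biz_key": "D3", "close_date": 20240320, "amount": 75.0},
]

def total_up_to(time_point):
    # Aggregate the base set as of a single time point.
    amount = sum(r["amount"] for r in base if r["close_date"] <= time_point)
    return time_point, amount

# One task per time point; each task is independent, so they run in parallel.
time_points = [20240131, 20240229, 20240331]
with ThreadPoolExecutor() as ex:
    results = dict(ex.map(total_up_to, time_points))
print(results)  # {20240131: 100.0, 20240229: 140.0, 20240331: 215.0}
```

In a real deployment the same shape applies, with each time point dispatched as an independent SQL-MapReduce task instead of a thread.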