Data Sources: Data begins its life in external sources such as social networks, retail/finance/CRM channels, website logs, network logs, sensors, and thousands of other feeds.
[ Salesforce.com, Siebel CRM, NetSuite ERP, Hoover's, SAP ERP, Facebook profiles, Marketo marketing, Eloqua marketing, Hyperion, Intuit finance ]
Enterprise Data Integration: Integration agents capture the raw data from sources through hundreds of connectors and apply rules to cleanse, normalize, and enrich it.
Data Collection System: The matured, contextual data is then delivered to collection agents, which store it in large, immutable, denormalized staging stores.
Data Staging: Data can be staged in MySQL using a copy-on-write strategy: every original entity record is stored as a new version along with a timestamp index, and a staging metadata table records the batch ID, the time window, and the list of entity IDs and entity types, to ensure data consistency.
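The staging scheme above can be sketched as follows. This is a minimal illustration using SQLite in place of MySQL; the table and column names are assumptions, not a prescribed schema.

```python
import sqlite3
import time

# Copy-on-write staging sketch: every write appends a new timestamped
# version of the entity; nothing is updated in place.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging_entity (
    entity_id   TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    payload     TEXT NOT NULL,
    ts          REAL NOT NULL          -- timestamp of this version
);
CREATE INDEX idx_staging_ts ON staging_entity (entity_id, ts);
CREATE TABLE staging_metadata (
    batch_id     TEXT PRIMARY KEY,
    window_start REAL,
    window_end   REAL,
    entity_ids   TEXT,                 -- ids captured in this batch
    entity_types TEXT
);
""")

def stage(batch_id, records):
    """Append one batch of (entity_id, entity_type, payload) records
    and record the batch in the metadata table."""
    now = time.time()
    for eid, etype, payload in records:
        conn.execute("INSERT INTO staging_entity VALUES (?,?,?,?)",
                     (eid, etype, payload, now))
    conn.execute("INSERT INTO staging_metadata VALUES (?,?,?,?,?)",
                 (batch_id, now, now,
                  ",".join(r[0] for r in records),
                  ",".join(sorted({r[1] for r in records}))))
    conn.commit()

stage("batch-001", [("c1", "customer", '{"name":"Ada"}')])
stage("batch-002", [("c1", "customer", '{"name":"Ada L."}')])  # old row kept

# Latest version of an entity, via the timestamp index:
row = conn.execute("""SELECT payload FROM staging_entity
                      WHERE entity_id='c1'
                      ORDER BY ts DESC, rowid DESC LIMIT 1""").fetchone()
```

Because rows are never overwritten, any earlier batch can be replayed or audited simply by filtering on the timestamp window recorded in the metadata table.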
ETL: ETL jobs then kick in to split the denormalized data into a more structured schema: a star or snowflake schema per Kimball's DW approach (facts/dimensions), column families / mirror tables per the Bigtable model (Cassandra/HBase), columnar structures (MonetDB, Parquet, Spark RDDs), MongoDB documents, or Teradata Aster.
Columnar storage reduces I/O because only the values of the columns referenced by a query are read. It also parallelizes the SELECT clause, essentially converting the query into a map-reduce function. Dictionary compression and run-length encoding are great strengths of a column store; InfiniDB, Infobright, and MonetDB each offer their own way of achieving high-performance data analysis.
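A toy illustration of why those two encodings work so well on columnar data: a low-cardinality column encodes into a small dictionary plus a handful of run-length pairs.

```python
from itertools import groupby

def dictionary_encode(column):
    """Map each distinct value to a small integer code (dictionary compression)."""
    codes = {}
    encoded = []
    for v in column:
        encoded.append(codes.setdefault(v, len(codes)))
    return codes, encoded

def run_length_encode(encoded):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(encoded)]

# A sorted, low-cardinality column, as commonly found in fact tables:
country = ["US", "US", "US", "UK", "UK", "DE", "DE", "DE", "DE"]
dictionary, encoded = dictionary_encode(country)
rle = run_length_encode(encoded)
# 9 string values shrink to a 3-entry dictionary plus 3 (code, count) pairs
```

Real column stores apply the same idea per column chunk, which is why sorting on a low-cardinality column before loading often improves compression dramatically.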
Query Engine: The data analysis engine is smart enough to either leverage powerful built-in aggregation queries or apply custom MR functions to parallelize data processing. One can define pre-canned queries or construct ad hoc queries for real-time analysis and reporting.
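The two query paths described above can be sketched side by side: a built-in aggregation (SUM ... GROUP BY) versus a user-supplied map/reduce pair run over the same rows. The function names here are illustrative, not any particular engine's API.

```python
from collections import defaultdict
from functools import reduce

rows = [("books", 12.0), ("books", 8.0), ("games", 30.0)]

# Path 1: a built-in aggregation, equivalent to SUM(amount) GROUP BY category
def sum_by_key(rows):
    totals = defaultdict(float)
    for k, v in rows:
        totals[k] += v
    return dict(totals)

# Path 2: custom MR functions supplied by the user
def mapper(row):
    """Emit a (key, value) pair for each input row."""
    return (row[0], row[1])

def reducer(a, b):
    """Combine two partial values for the same key."""
    return a + b

def map_reduce(rows, mapper, reducer):
    grouped = defaultdict(list)
    for row in rows:
        k, v = mapper(row)
        grouped[k].append(v)
    return {k: reduce(reducer, vs) for k, vs in grouped.items()}
```

Both paths return the same totals; the MR path simply exposes the hook points (mapper, reducer) that an engine can fan out across workers.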
BI Platform: The analytics platform usually allows users to create custom dashboards, define KPI metrics, and save all resources in a workspace. The KPI charts allow multi-level drill-downs and custom filtering.
Data Collection Process
- Synchronous batch pulls from heterogeneous sources
- Asynchronous requests
- Import csv files
- Continuous streams
- Flat files such as CSV and TSV
- Excel files
- JDBC connection
- Flat files from HDFS
- Our open API for data imports via streaming over HTTP REST
- Internal database systems (columnar or row based, on-disk or in-memory, open source/commercial or homegrown)
- Flat files into HDFS
- Leverage map-reduce framework for big data processing
- Leverage map-reduce framework for different types of analytics like predictive analytics
- Leverage inexpensive scalable and reliable architecture of the Hadoop infrastructure
- Possibly replace our internal database systems by going HDFS-to-HDFS and using a JDBC connector over HDFS, e.g. from Cloudera Impala, Hive, or similar
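For the flat-file channels in the list above (CSV and TSV), one common first step is normalizing every source into a single record shape before handing it to the collection agents. A minimal sketch, with field names chosen purely for illustration:

```python
import csv
import io

def read_flat_file(text, delimiter=","):
    """Parse CSV/TSV text into a list of dict records keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Two heterogeneous flat-file sources normalized into one record stream:
csv_data = "id,amount\n1,10\n2,20\n"
tsv_data = "id\tamount\n3\t30\n"
records = read_flat_file(csv_data) + read_flat_file(tsv_data, delimiter="\t")
```

The same dict-record shape can then feed any of the downstream paths (internal databases, HDFS, or the streaming import API) without each consumer knowing the original file format.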
- There are quite a few enterprise ETL tools, such as Pentaho Kettle, Talend, CloverETL, and Informatica.
- If one needs complete control over the ETL flow, Spring Batch is one of the best options:
- Batch processing capability that can be scheduled or triggered (on-demand)
- Transaction concept for rollback
- Partial processing to allow best effort instead of fail fast
- Parallel processing via multi-threading as well as across multiple nodes
- Partitioning of ETL jobs and data for parallel processing
- Business logic that can be coded in a common programming language for use in both ETL logic and UDF logic
- Pluggable logic component to allow the concept of trigger point execution
- Workflow control and management of ETL jobs
- Messaging capability via JMS or similar for remote management
- Auditing and logging capabilities
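Several of the capabilities listed above (chunked transactions, partial processing instead of fail-fast, a skip limit) can be sketched in a language-agnostic way. This is written in Python for brevity, not Spring Batch's actual Java API; the function names are assumptions.

```python
def run_chunk_step(items, process, write, chunk_size=2, skip_limit=2):
    """Chunk-oriented step: read -> process -> write in chunks, skipping
    bad records (partial processing) until skip_limit is exceeded."""
    skipped, buffer = [], []
    for item in items:
        try:
            buffer.append(process(item))
        except Exception:
            skipped.append(item)          # best effort: record and move on
            if len(skipped) > skip_limit:
                raise                     # too many failures: abort the job
        if len(buffer) >= chunk_size:
            write(buffer)                 # one "transaction" per chunk
            buffer = []
    if buffer:
        write(buffer)                     # flush the final partial chunk
    return skipped

out = []
skipped = run_chunk_step(
    ["1", "2", "x", "3"],
    process=int,                          # raises ValueError on "x"
    write=out.extend,
)
```

In a real batch framework each `write(buffer)` would run inside a transaction that can be rolled back, and the skip limit, restart state, and partitioning would be declarative configuration rather than code.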
[ETL TOOL] (Inputs) EXTRACT ==> TRANSFORM ==> LOAD (Outputs)
Load Phase :
- Support Logical Data Models
- Star Schema
- create dimension (regular/junk)
- create fact (measure/aggregate)
- slowly changing dimensions – SCD Type 1 (update in place) / Type 2 (effective from/to versioning) for business columns
- identify natural key for SCD
- manage time dimension for Drill Down
- snapshot capability via SCD
- SnowFlake Schema
- soft foreign key references (used in join queries but not in constraint checking)
- Adhoc Schema
- custom SQL to form denormalized structure ( row-based / columnar / column-families )
- auto-lookup against references
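The SCD Type 2 item in the list above can be sketched concretely: when a business column changes, close the current dimension row by setting its effective-to date and insert a new open-ended version, matching on the natural key. Column names here are assumptions for illustration.

```python
from datetime import date

def apply_scd2(dimension, natural_key, new_row, today):
    """SCD Type 2 merge: dimension is a list of dict rows carrying
    'effective_from' / 'effective_to' alongside the business columns."""
    for row in dimension:
        if row[natural_key] == new_row[natural_key] and row["effective_to"] is None:
            if all(row[k] == v for k, v in new_row.items()):
                return                        # no change: nothing to do
            row["effective_to"] = today       # close the current version
            break
    dimension.append({**new_row,
                      "effective_from": today,
                      "effective_to": None})  # open the new version

dim = []
apply_scd2(dim, "customer_id",
           {"customer_id": "c1", "city": "Austin"}, date(2024, 1, 1))
apply_scd2(dim, "customer_id",
           {"customer_id": "c1", "city": "Boston"}, date(2024, 6, 1))
# dim now holds two rows: the closed Austin version and the open Boston one
```

Keeping every closed version is what gives the snapshot capability mentioned above: a fact row joined on a date between `effective_from` and `effective_to` sees the dimension exactly as it was at that time.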
Validation Phase :
- Business rules validate the input data streams, as part of constraint checking or via custom rules.
Transformation Phase :
- User-defined business logic converts requirements into dataset definitions through a configurable UI.
Data Analysis Platform
Commercial tools like Datameer and Platfora no doubt simplify the process of analyzing data through innovative BI visualizations!
But a homegrown tool offers the advantage of defining its own protocol between the underlying data model and the visualization platform.
The first step towards creating such a platform is to build a model of the BI artifacts: DataSource, DataSet, Project, Dashboard, Report, and Metric.
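The artifact model named above could be sketched as a set of plain data classes; the fields and relationships here are assumptions about one reasonable shape, not a fixed design.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataSource:
    name: str
    kind: str                    # e.g. "jdbc", "hdfs", "rest"

@dataclass
class DataSet:
    name: str
    source: DataSource

@dataclass
class Metric:
    name: str
    expression: str              # e.g. "SUM(revenue)"

@dataclass
class Report:
    name: str
    dataset: DataSet
    metrics: List[Metric] = field(default_factory=list)

@dataclass
class Dashboard:
    name: str
    reports: List[Report] = field(default_factory=list)

@dataclass
class Project:                   # the workspace that owns everything
    name: str
    dashboards: List[Dashboard] = field(default_factory=list)

src = DataSource("sales_db", "jdbc")
ds = DataSet("orders", src)
rpt = Report("monthly_revenue", ds, [Metric("revenue", "SUM(revenue)")])
proj = Project("demo", [Dashboard("exec", [rpt])])
```

Once the artifacts are explicit objects like this, the "protocol" between the data model and the visualization layer is just serialization of this graph, which is exactly the leverage a homegrown tool has over a closed commercial one.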
Computation Engine :
- Create and configure the BI artifacts in a ReportExecution context.
- Then execute the following steps:
- Parser (logical plan)
- creates an object model
- Optimizer (read from cache first)
- Compiler (logical plan -> physical plan)
- analyze what types of measures/aggregations need to be calculated
- depending on the type of data source, generate a physical plan using a script like Pig Latin for Hadoop
- Query Execution
- Function library
- UDF library
- Data processing layers (real-time SQL / Pig-like batch MR query / Spark RDD query)
- if needed, apply MR functions (user-defined or built-in)
- scrollable, cacheable result
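The steps above can be strung together in a toy end-to-end pipeline: parse a tiny query into a logical plan, let the optimizer consult a result cache first, compile the logical plan into a "physical" plan (here just a Python callable), then execute it. The query syntax and function names are assumptions for illustration only.

```python
_cache = {}

def parse(query):
    """Parser: turn 'SUM(col)' / 'COUNT(col)' text into a logical plan
    represented as an object model, here a (FUNC, column) tuple."""
    func, col = query.split("(")
    return (func.strip().upper(), col.rstrip(")").strip())

def compile_plan(plan):
    """Compiler: logical plan -> physical plan (a concrete callable)."""
    func, col = plan
    if func == "SUM":
        return lambda rows: sum(r[col] for r in rows)
    if func == "COUNT":
        return lambda rows: sum(1 for r in rows)
    raise ValueError(f"unsupported function: {func}")

def execute(query, rows):
    """Query execution, with the optimizer's read-from-cache-first step."""
    plan = parse(query)
    if plan in _cache:                    # optimizer: cache hit, skip execution
        return _cache[plan]
    result = compile_plan(plan)(rows)
    _cache[plan] = result                 # cacheable result
    return result

rows = [{"revenue": 10}, {"revenue": 15}]
total = execute("SUM(revenue)", rows)
```

A real engine would emit a Pig Latin script or a Spark job at the compile step instead of a lambda, but the staging of parse / optimize / compile / execute is the same.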
Reference Products: Datameer, YellowFin, Pentaho, Indicee, GoodData