Machine Learning Pipelines and Models need to be managed through a robust system.
DURABLE data pipelines need to be -- Discoverable -- Understandable -- Repeatable -- Accurate -- Bottoms-Up -- Lineage-Aware
DURABLE models are not just a combination of linear coefficients.
— Models need to be stored, searchable, well-documented and continuously reviewed.
— Usually a DSL (domain-specific language) over models helps access the list of features, score calibrations, link functions, and feature transformations
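
As a rough sketch of what such a model DSL could expose, the Python below wraps a model in a small descriptor object that can be queried for its features and can score a row by applying feature transformations, the link function, and a calibration step. The class name `ModelSpec`, its fields, and the example model name and features are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ModelSpec:
    """Hypothetical, queryable description of a stored model."""
    name: str
    coefficients: Dict[str, float]                  # feature name -> weight
    transforms: Dict[str, Callable[[float], float]] = field(default_factory=dict)
    link: Callable[[float], float] = lambda z: z    # identity link by default
    calibrate: Callable[[float], float] = lambda p: p
    intercept: float = 0.0

    def features(self):
        """List the features the model depends on."""
        return sorted(self.coefficients)

    def score(self, row: Dict[str, float]) -> float:
        """Transform features, combine linearly, apply link, then calibration."""
        z = self.intercept
        for feat, weight in self.coefficients.items():
            transform = self.transforms.get(feat, lambda x: x)
            z += weight * transform(row[feat])
        return self.calibrate(self.link(z))


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative model: names and weights are made up for the example.
spec = ModelSpec(
    name="churn_v1",
    coefficients={"visits": 0.5, "spend": -0.25},
    transforms={"spend": math.log1p},
    link=logistic,
)
print(spec.features())
print(spec.score({"visits": 2.0, "spend": 0.0}))
```

Keeping transformations, link, and calibration as named parts of the descriptor (rather than baking them into one opaque function) is what makes the stored model searchable and reviewable.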
Feature Selection Automation:
— Data maintenance (versioning and Dockerized containers): DOMINO
— Versioned data and results (encapsulated in a Docker Container)
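
One simple way to picture versioned data and results is content-based versioning: derive an id from the data and the parameters, and file the results under that id. The sketch below is an assumption about how this could be done in plain Python; it is not DOMINO's API, and the function names, the in-memory `registry`, and the placeholder metric are hypothetical.

```python
import hashlib
import json


def content_version(payload) -> str:
    """Derive a short, deterministic version id from JSON-serializable content."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def record_run(registry: dict, data, params: dict, results: dict) -> str:
    """Store results under an id that pins both the data and the parameters."""
    data_version = content_version(data)
    run_id = content_version({"data": data_version, "params": params})
    registry[run_id] = {
        "data_version": data_version,
        "params": params,
        "results": results,
    }
    return run_id


registry = {}
rows = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]
placeholder = {"metric": None}  # stand-in for real results

run_a = record_run(registry, rows, {"alpha": 0.1}, placeholder)
run_b = record_run(registry, rows, {"alpha": 0.1}, placeholder)
run_c = record_run(registry, rows + [{"x": 3, "y": 1}], {"alpha": 0.1}, placeholder)
print(run_a == run_b, run_a == run_c)
```

The same data and parameters always map to the same id, while any change to the data yields a new one, which is the property that makes a run repeatable.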
Automation encourages Experimentation:
— easy baseline
— try different variations
— regular retraining
– Periodically check whether model performance is dropping due to changes in the underlying domain data
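
The periodic check above can be sketched as a comparison between a baseline window of metric values and a recent window. This is a minimal illustration, assuming a metric where higher is better; the function names, the `tolerance` threshold, and the sample numbers are hypothetical.

```python
from statistics import mean


def performance_dropped(baseline_scores, recent_scores, tolerance=0.05):
    """Return True when the recent mean metric falls more than
    `tolerance` below the baseline mean (higher metric = better)."""
    if not baseline_scores or not recent_scores:
        raise ValueError("both score windows must be non-empty")
    return mean(baseline_scores) - mean(recent_scores) > tolerance


def maybe_retrain(baseline_scores, recent_scores, retrain, tolerance=0.05):
    """Call the supplied `retrain` callback only when a drop is detected."""
    if performance_dropped(baseline_scores, recent_scores, tolerance):
        return retrain()
    return None


# Made-up metric values, for illustration only.
baseline = [0.80, 0.82, 0.81]
stable = [0.80, 0.79, 0.81]
degraded = [0.70, 0.72, 0.71]

print(performance_dropped(baseline, stable))
print(performance_dropped(baseline, degraded))
```

A scheduled job could call `maybe_retrain` with the real training routine as the callback, so retraining happens only when the drop exceeds the tolerance.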
A/B Testing: match offline metrics with online metrics.
(A) ALATION speeds up Data Preparation by offering a Data Catalog management system.
(B) TENSORFLOW helps distribute parameters across servers and maintain data parallelism, since models, data, and parameters can reside on separate servers.
It also offers a serving mechanism for continuously trained pipelines.
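
To make the idea of parameters living apart from the data more concrete, here is a toy simulation in plain Python. It does not use the TensorFlow API: the `ParameterServer` class, the `worker_gradient` function, and the toy data are hypothetical, and the workers run one after another here, whereas in a real cluster they would run concurrently on separate machines.

```python
class ParameterServer:
    """Holds the shared parameter; workers only read it and send gradients."""

    def __init__(self, w=0.0, lr=0.05):
        self.w = w
        self.lr = lr

    def apply(self, gradients):
        # Average the per-worker gradients, then take one descent step.
        self.w -= self.lr * sum(gradients) / len(gradients)


def worker_gradient(w, shard):
    """Gradient of mean squared error for y ~ w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


# Toy data generated from y = 3x, split into two shards (one per worker).
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

server = ParameterServer()
for _ in range(50):
    # Each worker sees only its own shard but the same shared parameter.
    grads = [worker_gradient(server.w, shard) for shard in shards]
    server.apply(grads)

print(round(server.w, 3))
```

Each worker touches only its own shard, and the only thing exchanged with the server is a gradient, which is the essence of data parallelism with separately hosted parameters.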
(D) Netflix has developed Meson, a system for managing and running machine learning pipelines on Mesos: http://techblog.netflix.com/2016/05/meson_31.html
A heterogeneous workload of Spark, Python, R and Scala tasks runs thousands of computations concurrently on an elastic Mesos cluster of hundreds of nodes.
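
The general pattern behind such a system, running pipeline steps in dependency order while letting independent steps execute concurrently, can be sketched with the Python standard library. This is not Netflix's implementation or API; the `run_pipeline` function and the step names are hypothetical, and a thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(dependencies, tasks, max_workers=4):
    """Run callables in dependency order, executing independent steps concurrently.

    dependencies: step name -> set of step names it depends on
    tasks:        step name -> zero-argument callable
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()
            # Every step in `ready` has all of its predecessors finished.
            futures = {name: pool.submit(tasks[name]) for name in ready}
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results


log = []


def step(name):
    """Build a stand-in task that just records that it ran."""
    def run():
        log.append(name)
        return name
    return run


deps = {
    "spark_features": set(),
    "python_features": set(),
    "train": {"spark_features", "python_features"},
    "score": {"train"},
}
results = run_pipeline(deps, {name: step(name) for name in deps})
print(log)
```

The two feature steps have no dependencies, so they are submitted together; `train` starts only after both finish, and `score` runs last.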