Key Lessons learnt from Data Science Conference


Emerging Platforms and Libraries for machine learning on big data

Ibis is a great initiative that lets data scientists connect effortlessly to Impala (a SQL query engine for distributed data processing jobs running on Hadoop).

orders = con.table('global_orders')

top_orders = (orders.group_by('order_cust_key').size().sort_by(('count', False)).limit(5))
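
To make the query above concrete without an Impala cluster, here is a minimal plain-Python sketch of what it appears to compute: the number of orders per customer key, sorted by count in descending order (which is how the `False` flag is read here), limited to the top five. The sample rows and the helper name `top_customers` are made up for illustration; only the column name `order_cust_key` comes from the snippet above.

```python
from collections import Counter


def top_customers(rows, key="order_cust_key", n=5):
    """Count rows per customer key and return the n largest groups."""
    counts = Counter(row[key] for row in rows)
    # most_common() sorts by count, largest first, and truncates to n entries.
    return counts.most_common(n)


# Hypothetical sample data standing in for the global_orders table.
sample_orders = [
    {"order_cust_key": "c1"}, {"order_cust_key": "c2"},
    {"order_cust_key": "c1"}, {"order_cust_key": "c3"},
    {"order_cust_key": "c1"}, {"order_cust_key": "c2"},
]

print(top_customers(sample_orders))
```

The point of Ibis is that the same group, count, sort, and limit pipeline is pushed down to the query engine instead of being executed in local memory as it is here.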



Creating Intelligent Applications using GraphLab 

  • Scalable DataFrame to create the largest NumPy arrays using GraphLab Create
  • SFrame helps get around the memory constraints imposed by scikit-learn or pandas
  • SGraph, backed by SFrame, stores all tables, images, and texts
  • Continuous offline evaluation (on historically labelled data) and online evaluation (stream a portion of the incoming data (B) to evaluate new deployments and keep the rest as the control group (A)). It is extremely critical not just to make things work (by mainstream data analysis) but also to measure the deviations.
  • Classification: accuracy, precision-recall, log-loss
  • Ranking: DCG / NDCG
  • Regression: RMSE, max error
  • Online: loss, model fitting score
  • Continuously match Business Metrics and ‘Discovered Features’
  • Fast deep learning
  • IPython examples
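
The evaluation metrics listed above are easy to state in a few lines of code. The following is a minimal standard-library sketch, not GraphLab's implementation; the function names are chosen for illustration, and the DCG variant shown uses the linear gain `rel / log2(position + 1)` (other formulations, such as `2^rel - 1` gains, are also common).

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def max_error(y_true, y_pred):
    """Largest absolute deviation between target and prediction."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def dcg(relevances):
    """Discounted cumulative gain; positions are 1-based, so the discount is log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Computing the same functions on both the control group (A) and the new deployment (B) gives a simple way to measure the deviation between them.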

Streaming Data Science
Practical examples of Spark Streaming

Examples from Microsoft slide on tweet analysis

  • Named Entity Recognition
  • Link parsing
  • Topic categorization
  • Sentiment classification
  • Location inference
  • Spam detection
  • Adult content detection
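
As a rough illustration of the simplest step in the list above, link parsing, here is a small regex-based sketch that pulls URLs, mentions, and hashtags out of a tweet. It is not the pipeline shown in the talk; the patterns, the `parse_tweet` helper, and the sample tweet are all assumptions made for this example, and real tweets need more careful handling (for instance, e-mail addresses would be misread as mentions).

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def parse_tweet(text):
    """Extract URLs, @mentions, and #hashtags from a tweet's text."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so a '#fragment' inside a link is not read as a hashtag.
    stripped = URL_RE.sub(" ", text)
    return {
        "urls": urls,
        "mentions": MENTION_RE.findall(stripped),
        "hashtags": HASHTAG_RE.findall(stripped),
    }


# Hypothetical tweet used only to exercise the parser.
tweet = "Great talk by @alice on #spark streaming https://example.com/talk#slides"
print(parse_tweet(tweet))
```

The extracted fields can then feed the heavier steps such as entity recognition or topic categorization, which need trained models rather than patterns.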

Anomaly Detection

Location Intelligence 

Demand Prediction

Spammer Detection

Network Analysis

Social network analysis with NetworkX

Web Personalization

Product Extraction :

  • Build a network of products by considering substitutes (purchase instead) and complements (purchase extra)
    • Female red belt ==> substitute with other belts and complement with female bags, to add more product recommendations
  • Apply topic models and discover micro-categories
  • Generate explanations of why certain products are preferred (given a pair of products, predict whether they are related: substitute / complement) [Link prediction ==> review topics]
  • Learn multiple relationships simultaneously (why users view X but buy Y) => p(x flows to y | x & y related)
    • So associate each node in the category tree with a small number of topics
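
The product network described above can be sketched as a small typed graph. This is only an illustration of the data structure, assuming hand-entered, symmetric relations; in the approach from the talk the edges would be predicted from review topics rather than added manually. The class name `ProductGraph` and the product names are hypothetical.

```python
from collections import defaultdict


class ProductGraph:
    """Products as nodes, with 'substitute' and 'complement' edges between them."""

    RELATIONS = ("substitute", "complement")

    def __init__(self):
        # edges[product][relation] -> set of related products
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_relation(self, a, b, relation):
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        # Assumption: relations are stored symmetrically in this sketch.
        self.edges[a][relation].add(b)
        self.edges[b][relation].add(a)

    def recommend(self, product):
        """Return 'purchase instead' and 'purchase extra' suggestions."""
        rels = self.edges.get(product, {})
        return {
            "purchase_instead": sorted(rels.get("substitute", ())),
            "purchase_extra": sorted(rels.get("complement", ())),
        }


graph = ProductGraph()
graph.add_relation("red belt", "black belt", "substitute")
graph.add_relation("red belt", "red handbag", "complement")
print(graph.recommend("red belt"))
```

Keeping the two relation types separate is what lets a recommender explain its suggestion as either an alternative or an add-on.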

Sales Forecasting :

  • Most influential deals and opportunity scoring

Music Recommendation :

  • Interesting use case with ensemble learning for close content matching between different songs of the same genre
  • Decisions driven by both classification and rules

New approach to training on terabytes of data

Tools to accelerate Data Analytics

  • Add these 8 Python weapons to the data analyst's armory
  • Apache Zeppelin – Web Frontend for Spark
  • Big data tools

Large scale streaming analytics

Apache Flink (Akka-based distributed computing engine, exactly-once delivery)

Recommendations by Data Scientists

  • Cross-validation techniques estimate the quality of the fitting process, not the quality of the final model
  • The testing process directly measures the performance of the actual model
  • Sometimes more data can be bad for a fixed procedure (a random forest of shallow trees can lose tree diversity when there are duplicate or near-duplicate variables)
  • Instead of assuming that more variables automatically add mutual information, filter out as many noisy, collinear, and near-constant variables as possible. Not all variables help determine performance on future instances.
  • Consider useful inductive biases, regularization terms, and schemes to reduce variance.
  • Avoid PCA
  • Pre-processing on dependent variables (word2vec, partial least squares)
  • If we do not employ cross-validation early in the ML phase, our model may not be accurate on new instances
  • Some interesting notes on model accuracy
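
The advice above about filtering near-constant and collinear variables can be sketched with the standard library alone. This is a minimal illustration, not a recommended production procedure: the thresholds `min_variance` and `max_abs_corr`, the helper names, and the sample feature table are all assumptions, and the greedy pass keeps whichever of two correlated columns appears first.

```python
import math


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def filter_variables(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Return the column names kept after dropping near-constant and collinear ones."""
    kept = []
    for name, values in columns.items():
        if variance(values) < min_variance:
            continue  # near-constant: carries almost no information
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # near-duplicate of a column that was already kept
        kept.append(name)
    return kept


# Hypothetical feature table: column name -> values.
features = {
    "age": [23, 35, 41, 52, 60],
    "age_months": [276, 420, 492, 624, 720],  # exact multiple of age
    "constant_flag": [1, 1, 1, 1, 1],
    "visits": [5, 1, 4, 2, 3],
}
print(filter_variables(features))
```

Removing duplicated columns before fitting also addresses the tree-diversity concern noted for random forests, since near-identical variables no longer crowd the candidate splits.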

Visual Analysis of Data

Democratization of Data

New Generation Data Science-driven Smart Business

Deep Learning

Common-sense reasoning and human belief models (consider deep NN layers)

H2O – Distributed Machine Learning

Make Tech Human

Human-powered data enrichment platform

Guide visually impaired people

Read a Greek menu

Topic Modeling 

Best ML Tech Talks, Top 10 ML API, ML Demystified

Top 10 ML Videos, ML CheatSheets, Free ML Books

ML CaseStudy, Data Science Summit Videos, Data Blog – Distributed ML
