Key Lessons learnt from Data Science Conference


Emerging Platforms and Libraries for machine learning on big data

Ibis is a great initiative that lets data scientists connect effortlessly to Impala (a SQL query engine for distributed data processing jobs running on Hadoop).

# con is assumed to be an existing Ibis connection to Impala
orders = con.table('global_orders')

top_orders = (orders.group_by('order_cust_key').size().sort_by(('count', False)).limit(5))

top_orders

Ref : http://docs.ibis-project.org/generated-notebooks/5.html , http://www.slideshare.net/wesm/ibis-scaling-the-python-data-experience

Creating Intelligent Applications using GraphLab 

  • Scalable DataFrame to create largest NumPy array using GraphLab Create ( https://twitter.com/datoinc )
  • SFrame helps get around the memory constraints imposed by scikit-learn or pandas
  • SGraph backed by SFrame to store all tables, images, texts
  • Continuous offline evaluation (on historically labelled data) and online evaluation (stream a portion of incoming data (B) to the new deployment and keep the rest as a control group (A)). It is extremely critical not just to make things work (through mainstream data analysis) but also to measure the deviations.
  • Classification accuracy, precision-recall, log-loss
  • Ranking parameters DCG / NDCG
  • Regression RMSE, max error
  • Online loss , model fitting score
  • Continuously match Business Metrics and ‘Discovered Features’
  • Fast deep learning
  • IPython Examples  , https://dato.com/learn/userguide/  , https://s3.amazonaws.com/static.dato.com/dato_stratasj2015_training.zip
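
The evaluation metrics listed above (precision-recall, log-loss, DCG / NDCG, RMSE) can be sketched in a few lines of plain Python. This is a minimal illustration using the standard textbook definitions, not GraphLab Create's own implementation; all function names here are hypothetical and binary labels are assumed to be 0 or 1.

```python
import math

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

def dcg(relevances):
    """Discounted cumulative gain of relevance scores given in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

def rmse(y_true, y_pred):
    """Root mean squared error for regression outputs."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

if __name__ == "__main__":
    print(precision_recall([1, 0, 1, 1], [1, 1, 0, 1]))
    print(log_loss([1, 0], [0.9, 0.2]))
    print(ndcg([1, 3, 2]))
    print(rmse([1.0, 2.0], [1.5, 2.5]))
```

Tracking the same functions on both the control group (A) and the new deployment (B) is one simple way to measure the deviations mentioned above.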

Streaming Data Science
Practical examples of Spark Streaming

Examples from Microsoft slide on tweet analysis

  • Named Entity Recognition
  • Link parsing
  • Topic categorization
  • Sentiment classification
  • Location inference
  • Spam detection
  • Adult content detection

Anomaly Detection 

http://cloudacademy.com/blog/bigml-machine-learning/

http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest
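
The isolation forest idea from the paper linked above is that anomalies are easier to separate with random splits, so they end up with shorter average path lengths in random trees. The following is a simplified pure-Python sketch of that idea, not the authors' reference code. The function names, default parameters, and dictionary-based tree layout are assumptions made for illustration, and at least two input points are assumed.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on this dimension
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, point, depth=0):
    """Depth at which the point lands, adjusted for unsplit leaf size."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    child = tree["left"] if point[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, point, depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Return a score in (0, 1] per point; higher means more anomalous."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    scores = []
    for p in points:
        mean_path = sum(path_length(t, p) for t in trees) / n_trees
        scores.append(2 ** (-mean_path / norm))
    return scores

if __name__ == "__main__":
    gen = random.Random(1)
    data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(60)] + [(10.0, 10.0)]
    scores = anomaly_scores(data)
    print("most anomalous point:", data[scores.index(max(scores))])
```

For real workloads a maintained implementation is the safer choice; this sketch only illustrates why isolated points receive high scores.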

Location Intelligence 

Demand Prediction

Spammer Detection

http://www.cs.umd.edu/~shobeir/papers/fakhraei_kdd_2015.pdf

Network Analysis

Social network analysis with NetworkX

Web Personalization

http://ijcai-15.org/downloads/tutorials/T11-WebPersonalization.pdf

Product Extraction :  http://snap.stanford.edu/proj/sceptre/

http://cs.stanford.edu/people/jure/pubs/prodgraph-kdd15.pdf

  • Build a network of products by considering substitutes (purchased instead) and complements (purchased in addition)
    • e.g. a women's red belt: other belts are substitutes and women's bags are complements, which gives more product recommendations
  • Apply topic models to discover micro-categories
  • Generate explanations of why certain products are preferred (given a pair of products, predict whether they are related, as substitute or complement) [ Link Prediction ==> review topics ]
  • Learn multiple relationships simultaneously (why users view X but buy Y) => p ( x flows to y | x & y related )
    • so associate each node in the category tree with a small number of topics

Sales Forecasting : http://www.slideshare.net/AndyTwigg1/data-science-at-insidesalescom

  • most influencing deals and opportunity scoring

Music Recommendation : http://courses.cs.washington.edu/courses/csep521/07wi/prj/michael.pdf

  • Interesting use case of ensemble learning for close content matching between different songs in the same genre
  • Decisions driven by both classification and rules

New approach to training on terabytes of data

Tools to accelerate Data Analytics

  • Add these 8 Python weapons to the data analyst's armory
  • Apache Zeppelin – web frontend for Spark
  • Big data tools : http://www.datanami.com/2015/06/12/8-new-big-data-projects-to-watch/

Large scale streaming analytics

Apache Flink (Akka-based distributed computing engine with exactly-once processing guarantees)

http://dataartisans.github.io/flink-training/index.html

Recommendations by Data Scientists

  • Cross-validation techniques estimate the quality of the fitting process, not the quality of the final model
  • The testing process directly measures the performance of the actual model
  • Sometimes more data can be bad for a fixed procedure (a random forest over shallow trees can lose tree diversity when there are duplicate or near-duplicate variables)
  • Instead of assuming that more variables always add mutual information, filter out as many noisy, collinear, and near-constant variables as possible. Not all variables help performance on future instances.
  • Consider a useful inductive bias: regularization terms and schemes that reduce variance.
  • Avoid PCA
  • Pre-processing on dependent variables (word2vec, partial least squares)
  • If we do not employ cross-validation early in the ML phase, our model may not be accurate on new instances
  • some interesting notes on model accuracy :  http://winvector.github.io/ ,  https://github.com/WinVector/zmPDSwR
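
The advice above on filtering near-constant and collinear variables can be sketched as a simple pre-processing pass. This is a minimal illustration in plain Python, not the method from the linked Win-Vector material. The function names and the thresholds `min_variance` and `max_abs_corr` are hypothetical choices, and columns are assumed to be equal-length lists of numbers.

```python
import math

def variance(col):
    """Population variance of a numeric column."""
    mean = sum(col) / len(col)
    return sum((x - mean) ** 2 for x in col) / len(col)

def pearson(a, b):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                     sum((y - mean_b) ** 2 for y in b))
    return cov / norm if norm else 0.0

def filter_columns(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Keep columns that are neither near-constant nor highly correlated
    with a column that was already kept (first column wins)."""
    kept = {}
    for name, values in columns.items():
        if variance(values) < min_variance:
            continue  # near-constant: carries almost no signal
        if any(abs(pearson(values, other)) > max_abs_corr
               for other in kept.values()):
            continue  # collinear with an already-kept column
        kept[name] = values
    return kept

if __name__ == "__main__":
    table = {
        "a": [1, 2, 3, 4],
        "b": [2, 4, 6, 8.1],   # nearly a multiple of "a"
        "c": [5, 5, 5, 5],     # constant
        "d": [4, 1, 3, 2],
    }
    print(list(filter_columns(table)))
```

To keep the test estimate honest, thresholds like these are best chosen on training data only and then applied unchanged to the held-out test set.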

Visual Analysis of Data 

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

https://twitter.com/vibrantdata

Democratization of Data

New Generation Data Science-driven Smart Business

Deep Learning

common sense reasoning and human belief models (consider deep NN layers)

https://dato.com/events/training/2015_pydata_seattle.html

H2O – Distributed Machine Learning

http://library.fora.tv/conference/h2o_world_2015/buy_programs

Make Tech Human

http://www.wired.com/category/maketechhuman

Human powered data enrichment platform ( people-powered data enrichment platform )

Guide visually impaired people : http://visualqa.org/ , visualqa.org/visualize

Read Greek menu — http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html

Topic Modeling 

Best ML Tech Talks , Top 10 ML API , ML Demystified ,

TOP 10 ML Videos , ML CheatSheets , Free ML Books ,

ML CaseStudy , Data Science Summit Videos , Data Blog – Distributed ML

More Usecases and References :

http://conf.dato.com/speakers/prof-dhruv-batra/  -> http://visualqa.org/visualize/

http://www.slideshare.net/dato-inc/strata-london-deep-learning-052015

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39129

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38709

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38518

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38774

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38511

 
