Predictive Analytics Tools

Humans have made tools for millennia. One of the earliest examples is the Clovis knife from 13 millennia ago. It was probably the state of the art in its time, but would not have passed muster in the age of Homer. It is very human to want to create new tools and utilities, and it is an admirable and endearing trait of our species. In a team-oriented culture, or in anything requiring collaboration, however, this drive to invent can become an obstacle to progress. In The Lord of the Rings, the fellowship presumably divided its tasks according to each member's strengths, an ideal outcome. That ideal outcome, however, did require

  1. A common goal bigger than any individual
  2. Diversity in skills and sizes
  3. High levels of competence

All three ingredients need to be present for teamwork to succeed. In the world of data science, these loosely translate to

  1. Business problem you're trying to solve
  2. Separate toolsets for exploration and deployment
  3. Solid knowledge of the fundamentals to make sense of outputs

Also implicit is that three or four people is the ideal team size for a data science project. Some projects will have hundreds of individuals working on them, but the core will have no more than three or four data scientists. Since things are often not cut and dried in data science, the team must listen well to one another and be able to move forward while staying aware of important assumptions based on intuition or weak precedent. Still, every member should spend about two-thirds of their time making tangible progress on their own. Evangelizing in generalities is a very real danger, because the uncertainty is great and one cannot know the answer in advance with any accuracy.
Computer Programming

Q: Which programming language is best?

A: The one you know best

Does that appear facile? Under pressure, an individual will turn to his or her tool of choice, and that is almost always the tool one knows best. That needs to be borne in mind, for the subtext is this: better to learn one tool in detail than ten cursorily. The advice is still valid today, when making a new computer language is nowhere near as hard as it was when C was developed.

We can conceptualize to some extent, however. For high-level review of data, we often need counts and frequencies. If the data sits in a database, a SQL query is most appropriate. If the data sits in HDFS, a higher level of abstraction than Map/Reduce is probably in order.
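
As an illustration of the first case, here is a minimal sketch in Python using the standard sqlite3 module. The database file, the events table and its status column are assumptions made for illustration only; the point is that a single GROUP BY query returns the counts for every distinct value in one pass.

    import sqlite3

    # Connect to a hypothetical SQLite database; the file, table
    # and column names are assumptions for illustration.
    conn = sqlite3.connect("events.db")

    # One GROUP BY query yields the count of every distinct
    # value of the column, from most to least frequent.
    query = """
        SELECT status, COUNT(*) AS n
        FROM events
        GROUP BY status
        ORDER BY n DESC;
    """

    for status, n in conn.execute(query):
        print(f"{status:>12s}  {n:8d}")

    conn.close()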

The tools and software that are handy for exploring data and trying out combinations, loved by statisticians and data scientists alike, are often not the most suitable for deploying the end result, where other considerations of software engineering come to the fore.

Data Manipulation

Phase        Description                                        Programming languages
Access       Frequencies, summaries, unique values              Perl (ASCII data), SQL (traditional databases)
Exploration  Frequency slices, time trends, variable discovery  R, Matlab (Octave/Scilab), SAS; Python stack
Discovery    Data mining algorithms                             R, Matlab, SAS
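
To make the Exploration row concrete, here is a minimal sketch using the Python stack (pandas). The file name extract.csv and the columns date, channel and amount are hypothetical, chosen only to show frequency slices and time trends.

    import pandas as pd

    # Load a hypothetical extract; the file and column names
    # (date, channel, amount) are assumptions for illustration.
    df = pd.read_csv("extract.csv", parse_dates=["date"])

    # Frequency slices: how often each value of a categorical
    # variable appears.
    print(df["channel"].value_counts())

    # Time trends: monthly totals of a numeric variable.
    monthly = df.set_index("date")["amount"].resample("M").sum()
    print(monthly)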

(c) Copyright Sandeep Rajput 2014. All rights reserved.