While it was cold and dreary outside, it was warm and cozy inside Uber’s headquarters, which was full of good food and good people for a presentation by Josh Wills, Director of Data Science at Cloudera. Josh’s presentation focused on the problems facing data scientists in a rapidly changing environment and the steps to find the best algorithm for any unique data set and statistical problem.
Starting the talk with his definition of a “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician,” Josh went into the basics of data science, machine learning and the differences between the academic and industrial pursuits of data scientists. He stated that the goal for most academic data scientists is having their paper accepted at the prestigious NIPS conference, while the goal of those working in industry is to make money. He used his experience working with Google’s advertising division to show the balance between pleasing the client and making money while not upsetting the users. He emphasized the motto that done is better than perfect and the notion that whoever gets the most data the fastest wins.
Josh later described four ways data scientists test data:
- Collaborative Filtering: Find hidden treasure.
- Random Forests: Using a big data set, sample many times, build a bunch of prediction models and combine the results of all of them. Ensemble models work best.
- K-means Clustering: Don’t use it for the clustering; use it to find things that do not cluster well. This may lead to fraud detection, among other things.
- Non-random Sampling: Often pushed aside, but if done correctly it can work very well.
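To make the k-means point concrete, here is a minimal Python sketch of the idea of looking for points that do not cluster well. It is not code from Josh's talk: the function names `kmeans` and `poorly_clustered`, the toy points, and the choice to score each point by its distance to the nearest centroid are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        centroids = new_centroids
    return centroids


def poorly_clustered(points, centroids, n=1):
    """Return the n points farthest from their nearest centroid."""
    def score(p):
        return min(math.dist(p, c) for c in centroids)
    return sorted(points, key=score, reverse=True)[:n]


# Hypothetical toy data: two tight groups plus one point that fits neither.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (4, 5),
]
centroids = kmeans(points, k=2)
suspects = poorly_clustered(points, centroids, n=1)
print(suspects)
```

On this toy data the stray point `(4, 5)` is the one flagged. In a real fraud-detection setting the features, the number of clusters and the cutoff for what counts as "far" would all need to be chosen with care.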
Josh stressed that in the end each model has its strengths and weaknesses; to find the right solution to complex problems you must use many techniques. He concluded his presentation by talking about his most recent work.