There’s a lot of hype about big data, cloud computing, machine learning, and data science. From my experience, things get lost in translation when it comes to what is said by the non-technical stakeholder and what’s actually understood. In this piece, I explore the historical roots of these popular concepts and highlight that, while most of the techniques and foundations have been established for a while, the primary differentiating factor is perhaps just terminology.
A few weeks ago, in between flights, I decided to check out the Santa Monica Public Library. While perusing the mathematics and statistics section, I found a book called “Introduction to Statistical Analysis” published in 1951. Intrigued by the title (how was statistical analysis done in that time?), I flipped through the text and deduced that most of the ideas from the book are still the foundation of many techniques we use to analyze data today.
Of course, a lot has changed in the six decades since the theoretical foundations of the techniques in the book were developed. The biggest change since it was published is the widespread use of computers: advanced computations are no longer done by hand. Today’s computers are vastly faster and cheaper than anything available in the 1950s. With cheap computation as a result of Moore’s Law, more computationally intensive approaches (like advanced optimizations, neural networks, and deep learning) are now possible.
Since then, we’ve weathered at least two artificial intelligence winters. We’ve gone through several hype cycles prophesying magical applications of computers, from automatic speech translation to expert systems designed to run our entire defense systems. These winters are usually brought on by a chain reaction that begins with pessimism in the AI research community, followed by pessimism in the press, followed by heavy funding cuts, followed by the end of serious research.
In the last decade, data storage has gotten cheaper and big data has evolved into a buzzword (all the cool kids are talking about it). Big consulting firms have written meticulously detailed reports about how data will transform every industry known to man. It is not only the new crude oil, but also the next “frontier for innovation, competition and productivity” and “gold rush”.
On top of all that, we’ve seen associated fads come and go. Fuzzy logic got all the attention in the mid-2000s, and, just like “nanotechnology”, it became passé as 2010 rolled in. Terms like “machine learning”, “deep learning”, and “neural networks” are on fleek today. Good old terms like “statistical analysis”, “data mining”, or even “KDD” (Knowledge Discovery in Databases) are not quite as “in”. Recently, an IEEE study identified 26 words that are synonymous with data mining alone!
Is everyone doing “big data” and “machine learning” without me?
A few years ago, while my startup was trying to raise money, I spent some time consulting where I got to experience the height of the “big data” hype firsthand. One of my marketing-focused clients brought me in to help them get in on “big data” to remain competitive. They wanted to “machine learn” things about the “big data” from their customers’ social media presence to stand out in the marketplace.
I’ve attended several conferences where people introduced themselves to me as a “Big Data CEO”, and I have even been approached to build a system that would predict global-scale political events just by analyzing streams of Twitter data.
Everywhere, companies and individuals are trying to get in on the hype without having a clear understanding of what “big data” and “machine learning” really mean.
One thing is clear: If you lack an understanding of how these terms work together, it’s hard to know what you stand to gain from them.
But, machine learning does not need big data!
It’s easy to assume that, because everyone started talking about machine learning right after they started talking about big data, machine learning has to be about training computers to do something intelligent with the large amount of data that’s coming in at high volume (with high velocity and variety).
Machine learning does not actually have anything to do with computers. The machine in machine learning is actually a hypothetical statistical/mathematical machine you train to do something for you. In fact, until very recently, statistical inference was the common vernacular to refer to what is called machine learning now. Most of the techniques used in machine learning have their roots in classical mathematics and statistics, which have been around for decades (centuries in some cases). Another fact: These have been used successfully since before computers were invented.
A popular statistician even “joked”:
“To paraphrase provocatively, ‘machine learning is statistics minus any checking of models and assumptions’.”
In The Imitation Game, Alan Turing’s primary motivation for developing a calculating machine was so that he could run optimization simulations while trying to break the cryptography used by the Axis powers during World War II, which (arguably) is a machine learning problem.
You likely don’t have or need big data, but you can science up on what data you have.
In reality, you can train a machine learning model from “small data”, and, as long as your dataset is a representative sample of the population, your model is likely to perform reasonably well.
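To make the “small data” point concrete, here is a minimal sketch of training a model on just six observations. It uses ordinary least squares, a technique that long predates computers, and nothing beyond the Python standard library. The function names (`fit_line`, `predict`) and the ad-spend numbers are hypothetical and made up purely for illustration; they are not from any real dataset.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical "small data": ad spend (in $1,000s) vs. weekly signups.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
signups = [12.0, 19.0, 33.0, 41.0, 48.0, 62.0]

model = fit_line(ad_spend, signups)
print("slope, intercept:", model)
print("predicted signups at 7.0:", predict(model, 7.0))
```

The “machine” here is just the pair of numbers returned by `fit_line`. Whether its predictions can be trusted depends on how well those six points represent the situations you care about, not on how many terabytes you have.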
Okay, so who is going to do science-y things to my data for me?
As the hype around big data grows, a new term has been coined for the nerds who work with data and are taking a computational approach: data scientist. Right now, it’s being boldly referred to as “The sexiest job of the 21st century” by a popular business publication (demonstrated in the photo below). Almost every company riding the “big data train” has hired a few of these people over the last few years.
It’s hard to describe exactly what a data scientist is today. Back in the 1960s, the term was synonymous with computer scientist. In fact, when I first learned the definition of a computer in kindergarten, it was defined as “a device that converted data into information” (which, by the way, still doesn’t sit well with me, but that’s a different story). Arguably, good old computer programmers have been doing data science for a while, since they’re almost always operating on data. From storing data efficiently and moving it fast to transforming it accurately and displaying it contextually, we’ve always used computers to operate on data.
So what’s so different about what a data scientist does?
Just like the varied interpretations of what big data really means, what a data scientist is and does depends a lot on the company plus the problem they’re trying to solve. At the lowest common denominator, the work they do is usually thought to be the intersection of statistics, computer programming, and domain knowledge. However, there’s a lot of debate around how they’re different from statisticians, how they’re trained, and how they’re different from developers. Is this its own field, or is it an interdisciplinary mix of computer science, mathematics + statistics, and domain knowledge? Or is it really just what statisticians have been doing all along? Also, because data science (at scale) involves large-scale computation, what is the relationship between data science and computational science?
Before the big data boom, John Tukey’s seminal Exploratory Data Analysis (1977) described this kind of work as follows (quoted here via Bruce Ratner’s book):
‘… detective work — numerical detective work — or counting detective work — or graphical detective work. … [It is] about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights.’ — Bruce Ratner, Statistical and Machine-Learning Data Mining
At an applied math conference a few months ago, I attended a session where a lively panel of really smart people discussed the current and future status of data science.
What interested me most were these claims from the panel:
· Data science is not statistics
· Data science should be taught by computer scientists
· In five years, every domain of science and engineering will center on data science
· In ten years, all of data science will be applying machine learning
· Data science is not new — it’s just the other side of the computational science and engineering “coin”
The typical data scientist’s job description goes like this: Make discoveries while swimming in data. Possess an intense curiosity. Bring structure to formless data and make analysis possible while maintaining a feel for business issues combined with an empathy for customers. Advise stakeholders across functions on how to use this information to make better products.
The particular skills most useful when solving data science problems include computation, mathematical + statistical modeling, and some understanding of how to build mathematical models of the real world. That’s why you see data scientists coming from fields as broad as computer science, astronomy, physics, and applied mathematics; they come from many different backgrounds.
My final words (and meme)
Much as with computer science in the early 1980s, communities don’t quite have consensus around what exactly a data scientist is (or does). What we do know, however, is that it’s going to evolve into something that’s functionally very symbiotic with the computational sciences and statistical analysis.
My final thoughts on its future: Data science will be centered on the ability to take data and understand it. A good data scientist will process that information, extract value from it, visualize and communicate with it, and continue to do so at a larger scale as the world’s computational capability increases.
My final meme: