SNAPSHOT: Treating data as a business asset and taking advantage of it to improve performance or provide better services can be daunting. Below, you’ll find 8 steps to make things easier, less intimidating, and more comprehensive.
Understand the Problem
Over the past couple of decades, businesses and public entities have made expensive investments in computing infrastructure, which has vastly improved our ability to gather and keep data throughout the enterprise. Every aspect of a business or government agency, no matter how big or small, generates data at a faster pace than ever before.
When faced with a business problem or civic issue, it is important to be able to assess how data can improve performance or help you better provide services.
Data science and data mining are critical to a competitive business strategy, so it is important for your entire organization to have a solid understanding of both.
Having a firm grasp of the fundamental concepts, and frameworks in place to organize data-analytic thinking, will allow you to interact with other businesses and agencies competitively. Most importantly, it will enable you to make better decisions informed by data and recognize your data-oriented competition.
Grasp the Fundamentals
So, what is data science? At a high level, data science is a set of fundamental principles that guide the extraction of patterns and knowledge from a large amount of data, using methods at the intersection of statistics, machine learning, computer programming, artificial intelligence, database systems, business intelligence and software engineering.
It is often said that data science is the intersection of three fields of specialization: Computer Science, Math and Statistics, and Domain Knowledge.
How can it help? Simply put, when you combine these three elements, you’ll be able to build the right datasets and analyses on top of them.
Build Your Framework
One. Understand the problem that data can solve
This may seem obvious, but data analysis projects seldom come pre-packaged as a clear, unambiguous data science/data mining problem. Framing the problem is usually an iterative process of discovery, and creativity plays a major role in breaking down a business problem into one or more data mining problems.
The decisions that benefit from being informed by data analysis usually fall into two major categories:
- Decisions that require new discoveries within the data.
- Recurring decisions, made frequently. These decisions can benefit from even small increases in decision-making accuracy based on data analysis.
This is where domain knowledge of the industry and the business is immensely valuable. Knowing which questions to ask guides the entire process. Which datasets to gather, what kind of analyses to perform on the data, and how to communicate the results can all be answered iteratively as the problem you’re trying to solve with data becomes clearer.
Two. Understand the data
Once the problem has been defined, the data is the raw material used to build the solution. There rarely is an exact match between the data and the problem. A lot of existing data will often have been collected for other purposes unrelated to the current problem you’re trying to solve. Data may also have widely different acquisition costs — some datasets are free while others will be expensive in terms of effort or monetary cost.
As you begin to understand what data you have and can acquire, a cost-benefit analysis of each dataset can help you understand what kind of investment to make. Access to datasets will also have a significant influence on solution paths and engineering approaches.
For example, if your data is not too large and is mostly tabular with relationships between items, then a relational database might suffice. However, if you have a stream of geo-referenced textual data, the engineering challenges of storing it, manipulating it, and analyzing it are quite different.
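As a rough illustration of the tabular case, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names (agencies, service_requests) and the rows are hypothetical, invented only to show how related items can be stored and joined.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE agencies (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE service_requests (
            id INTEGER PRIMARY KEY,
            agency_id INTEGER NOT NULL REFERENCES agencies(id),
            category TEXT NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO agencies (id, name) VALUES (?, ?)",
        [(1, "Parks"), (2, "Transit")],
    )
    conn.executemany(
        "INSERT INTO service_requests (agency_id, category) VALUES (?, ?)",
        [(1, "maintenance"), (1, "permit"), (2, "complaint")],
    )
    conn.commit()
    return conn


def requests_per_agency(conn):
    """Join the two tables and count the requests for each agency."""
    rows = conn.execute("""
        SELECT a.name, COUNT(r.id)
        FROM agencies AS a
        LEFT JOIN service_requests AS r ON r.agency_id = a.id
        GROUP BY a.id, a.name
        ORDER BY a.name
    """).fetchall()
    return dict(rows)


conn = build_demo_db()
counts = requests_per_agency(conn)
print(counts)
```

A stream of geo-referenced text would likely call for different storage and indexing choices, which is outside the scope of this sketch.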
Three. Acquire, clean and normalize the data
Data is usually acquired from a variety of sources and by a variety of methods: relational and non-relational databases, the web (including scraped pages), digitized non-electronic records, and many others. Data also comes in various forms: tabular, spatial, temporal, text, images, and sound.
Properly tagging these datasets with accurate metadata and storing them in some kind of a data warehousing solution is only the start of the process.
Most of these datasets will need to be cleaned up and normalized before they can be of value to your organization, which means manipulating the data in a variety of ways before analyzing it. Re-sorting the data, picking a more representative sub-sample, computing summary statistics as an initial sanity check, re-shaping the data, and merging it with other datasets are all tasks a data analyst will do often in preparation for the actual analysis.
For reference: http://research.microsoft.com/en-us/projects/datacleaning/
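A few of these preparation tasks can be sketched with Python's standard library alone. The CSV content, column names, and cleaning rules below are assumptions made for illustration (for instance, that rows differing only in casing or whitespace are duplicates); real datasets will need their own rules.

```python
import csv
import io
import statistics

# Hypothetical raw export with inconsistent casing, stray whitespace,
# a missing value, and a duplicate row.
RAW_CSV = """city,monthly_requests
 Springfield ,120
springfield,120
SHELBYVILLE,95
Ogdenville,
Capital City,310
"""


def clean_rows(raw_text):
    """Trim and normalize text, drop rows with missing counts, remove duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        city = row["city"].strip().title()
        count_text = (row["monthly_requests"] or "").strip()
        if not count_text:
            continue  # skip rows with no usable value
        record = (city, int(count_text))
        if record in seen:
            continue  # skip exact duplicates after normalization
        seen.add(record)
        cleaned.append({"city": city, "monthly_requests": record[1]})
    return cleaned


def summarize(rows):
    """Return basic summary statistics as an initial sanity check."""
    values = [r["monthly_requests"] for r in rows]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


rows = clean_rows(RAW_CSV)
summary = summarize(rows)
print(rows)
print(summary)
```

If the summary shows an implausible minimum, maximum, or row count, that is a signal to revisit the cleaning rules before any deeper analysis.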
Four. Perform some exploratory data analysis
Once your data has been acquired, cleaned up and normalized, exploring the data to understand its intricacies and subtleties can help you get a much more detailed understanding of what your data can tell you.
This process often involves plotting your data in graphs and visualizations and slicing and grouping it in creative ways to see if the pictures tell a story or bring out a pattern.
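The slicing and grouping described above can be sketched without any plotting library. The records and field layout here are hypothetical, and the text bars are only a stand-in for a proper chart.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cleaned records: (category, month, response time in days).
RECORDS = [
    ("pothole", "Jan", 4), ("pothole", "Feb", 6), ("pothole", "Mar", 11),
    ("streetlight", "Jan", 2), ("streetlight", "Feb", 3), ("streetlight", "Mar", 2),
]


def group_mean(records, key_index, value_index=2):
    """Group records by one field and average another field within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[value_index])
    return {key: mean(values) for key, values in groups.items()}


def text_bar_chart(averages, width=20):
    """Render group averages as rows of '#' so a pattern is visible at a glance."""
    largest = max(averages.values())
    lines = []
    for key, value in sorted(averages.items()):
        bar = "#" * round(width * value / largest)
        lines.append(f"{key:<12} {bar} {value:.1f}")
    return "\n".join(lines)


by_category = group_mean(RECORDS, key_index=0)
by_month = group_mean(RECORDS, key_index=1)
print(text_bar_chart(by_category))
print(text_bar_chart(by_month))
```

Grouping the same records two different ways (by category, then by month) is the point: each slice may surface a different pattern worth a closer look.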
Five. Evaluate and stick to a data mining process
Once you’ve acquired data and run some exploratory analyses on it, you should evaluate and drive the analysis process through a well-defined data mining process that works for your team and the problem. A couple of the best known processes are CRISP-DM and KDD.
A good process should not only guide the analysis, but also offer a well-researched set of methods for data modeling, model evaluation and model deployment.
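To make "model evaluation" a little more concrete, here is a minimal sketch of one common evaluation step: holding out part of the data and measuring accuracy on it. This is not CRISP-DM or KDD itself. The data, the single-threshold "model", and all function names are assumptions chosen to keep the example small.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(rows, threshold):
    """Fraction of rows where 'x >= threshold' matches the true label."""
    correct = sum((x >= threshold) == label for x, label in rows)
    return correct / len(rows)


def fit_threshold(train):
    """Pick the feature threshold that gives the best accuracy on the training rows."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(train, t))


# Hypothetical data: (days since last contact, whether the customer churned).
DATA = [(d, d >= 30) for d in range(0, 60, 3)]

train, test = train_test_split(DATA)
threshold = fit_threshold(train)
holdout_accuracy = accuracy(test, threshold)
print(f"threshold={threshold}, holdout accuracy={holdout_accuracy:.2f}")
```

The model is fitted only on the training rows and scored only on the held-out rows, so the reported accuracy is not inflated by data the model has already seen.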
Six. Think about provenance, privacy, ethics and governance
Be cognizant of laws that might govern the data. For example, certain financial data might require you to become PCI compliant, health data might require a certain anonymization algorithm that falls under HIPAA standards, and certain student and educational data might be governed by FERPA regulations. Staying compliant with these is not just an engineering challenge: you should think about what kind of organization-wide policies regarding access to these datasets you might need to implement from a compliance, privacy or ethical standpoint.
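One small engineering piece of this is removing or masking direct identifiers before a dataset is shared. The sketch below pseudonymizes an ID with a keyed hash (HMAC-SHA-256) and drops a name field. The field names, record, and key are hypothetical, and this technique alone should not be assumed to satisfy HIPAA, FERPA, PCI, or any other regulation.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identifiers(record, secret_key, id_field="student_id", drop_fields=("name",)):
    """Return a copy with direct identifiers dropped and the ID pseudonymized."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    cleaned[id_field] = pseudonymize(str(record[id_field]), secret_key)
    return cleaned


# Hypothetical record and key; a real key would be stored and rotated securely.
SECRET_KEY = b"replace-with-a-real-secret"
record = {"student_id": 1042, "name": "Jane Doe", "grade": "B+"}
safe = strip_identifiers(record, SECRET_KEY)
print(safe)
```

Because the hash is keyed, the same ID always maps to the same pseudonym (so records can still be linked), while someone without the key cannot simply hash candidate IDs to reverse it.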
Seven. Think about data infrastructure: How are you protecting yourself from data loss?
Are you backing up often enough? What is your process of making sure your backups are valid? Are you auditing your data recovery process as often as you should be? What is your recovery plan if your data center floods or catches on fire?
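One simple way to check that a backup is valid is to compare checksums of the original and the copy. The sketch below does this with SHA-256 on temporary files. The file names are hypothetical, and a matching checksum is only one part of a recovery audit (it does not prove you can restore a working system).

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_valid(original, backup):
    """A backup counts as valid here if it exists and its checksum matches."""
    backup = Path(backup)
    return backup.is_file() and sha256_of(original) == sha256_of(backup)


# Demonstration with temporary, hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.csv"
    copy = Path(tmp) / "data.csv.bak"
    source.write_text("id,value\n1,10\n")
    shutil.copy2(source, copy)
    valid_before = backup_is_valid(source, copy)
    copy.write_text("id,value\n1,99\n")  # simulate a corrupted backup
    valid_after = backup_is_valid(source, copy)

print(valid_before, valid_after)
```

Running a check like this on a schedule, and alerting when it fails, is one way to catch silent corruption before you need the backup.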
If you’re going to be sharing the data or analyses based on your data, it’s a good idea to evaluate whether you need two or more separate data stores. For example, if you’re publishing some of your data on a web page on your open data portal, consider keeping a copy of your data in “cold storage” in a data warehouse, separate from the production database that’s serving the content.
Eight. Share your data and your analyses
If you’re a public agency wanting to share your data for external consumption, there are plenty of open source projects that make it easy to stand up your own open data portal and share your data. CKAN and Open Data Catalog make this really easy: you just need a domain to serve it from.
If you want to share it internally and it’s mostly visualizations, consider using Plotly or Google Charts, which make interactive, graphically appealing data visualizations and allow you to share them with fine control.
Make Better Decisions
The framework outlined above paints a very high-level picture of how you can create interesting datasets and extract useful insight from data to improve decision-making. Success in today’s business environment requires an understanding of how to apply these techniques, processes and methods to develop solutions for particular business problems, and to be able to think about data as a business asset.
The ability to direct the application of data-driven analytics throughout an organization and eventually the community leads to success — regardless of your role, focus, or expertise.