Fun Fact: Over 90% of all the data we have in the world today was generated in the last two years.
In fact, almost everything we do nowadays generates data. Data comes from more sources, in greater volume, velocity and variety, than ever before. We’re generating data from cell phones, satellites, cameras, credit card transactions, user-generated social media content, law enforcement records and more. Most of that data is generated and transmitted in real time. Collectively, we generate more than 2.5 quintillion bytes of data every single day!
In 2012, a quote emerged in the big data world, claiming,
“It’s cheaper to keep data than to delete it.”
While initially viewed with some skepticism, it’s now generally agreed that this is true. With the decreasing cost of “elastic infrastructure” hosted entirely in the cloud, the cost of increasing the storage capacity you’ve rented is usually much lower than the cost of analyzing your data to decide what to delete.
This presents an interesting set of dilemmas for an organization:
- If it’s indeed cheaper to keep data than to delete it, does that necessarily mean that you should keep all of it?
- How do you control who has access to that data?
- If you do decide to analyze your data and delete some of it, how soon should you do it?
- Are there laws in the jurisdictions you’ve acquired parts of your data from that require you to handle its storage, retention, and access differently?
These are important questions for the management of every organization that uses data, and ideally they get answered before the organization falls victim to an incident that well-thought-out policies on data handling and retention could easily have prevented. Data doesn’t necessarily show up on your balance sheet like other business assets, but improper handling of it can sink your entire business. The 2013 compromise of sensitive government data and the recent Ashley Madison hack illustrate how important it is to pay attention to data security and integrity throughout the organization.
These potential pitfalls are precisely why every organization that deals with data needs a Data Governance Policy, which outlines the processes and procedures put in place to ensure that data is captured, cleaned and stored in a way that maintains its validity and is auditable. This is in addition to any technical data security or backup schemes that your organization may have in place.
As you’re building a data governance policy for your organization, here are a few things to think about:
One: Are any of the datasets you have (or want to have) regulated by specific laws or industry rules?
As I explained in an earlier post, there are a few rules and regulations that you have to comply with when handling certain types of datasets. For example, if you’re storing or operating on financial data, you’ll want to look at the Sarbanes-Oxley Act to make sure that you’re compliant. Or, if you’re dealing with credit card and other payment information, you’ll want to make sure your procedures comply with the Payment Card Industry Data Security Standard (PCI DSS). Educational data is regulated by the Family Educational Rights and Privacy Act (FERPA), and health data is regulated by the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
Most of these laws and standards are very prescriptive about how data should be handled, including how long you can keep it, who in your company can access it, what kinds of pre-processing need to be done before performing analyses on the data, and how to enforce compliance procedures.
If you’re thinking of starting a new product or line of business that might use potentially sensitive datasets, it’s a wise idea to get legal guidance on how you can ensure that your organization is fully compliant with the specific laws that apply to your data. Typically, it’s a fine balance between using data to deliver business value and taking appropriate precautions for the privacy and security of personal information.
Two: Are your cross-functional and external needs aligned?
Different business units and teams within your organization will likely have different needs from the data. Implementing an organization-wide data governance policy requires knowledge, understanding and buy-in. Keep in mind, you’ll not only need these things from your data science or data engineering professionals, but also from your customer support, sales and marketing professionals.
Keeping cross-functional needs in alignment with your compliance, security and privacy needs is paramount to successfully implementing a process that works efficiently for your organization.
Depending on the sensitivity of the datasets, it’s also useful to think about a process for logging when, and for how long, someone had access to a particular dataset, so that you can audit access internally.
Also, make sure you set proper expectations across the board about processes that determine who can access what data, how long they have access to it, and what they do with it after they’re done using it.
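To make the idea of an access log concrete, here is a minimal sketch of time-limited access grants that double as an audit trail. Everything in it (the `AccessGrant` and `AccessLog` names, the in-memory list, the example users and datasets) is hypothetical and only meant for illustration; a real system would persist these records and tie them to your actual authentication setup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AccessGrant:
    """One row of the audit trail: who could see which dataset, and when."""
    user: str
    dataset: str
    granted_at: datetime
    expires_at: datetime
    revoked_at: Optional[datetime] = None

    def is_active(self, now: datetime) -> bool:
        # Active only if not revoked and inside the granted window.
        return self.revoked_at is None and self.granted_at <= now < self.expires_at

    def duration_held(self, now: datetime) -> timedelta:
        # Access ends at the earliest of: revocation, expiry, or "now".
        end = min(self.revoked_at or now, self.expires_at, now)
        return max(end - self.granted_at, timedelta(0))


class AccessLog:
    """Hypothetical in-memory log; grants are never removed, only expired or revoked."""

    def __init__(self) -> None:
        self.grants: List[AccessGrant] = []

    def grant(self, user: str, dataset: str, now: datetime, days: int) -> AccessGrant:
        entry = AccessGrant(user, dataset, now, now + timedelta(days=days))
        self.grants.append(entry)
        return entry

    def can_access(self, user: str, dataset: str, now: datetime) -> bool:
        return any(
            g.user == user and g.dataset == dataset and g.is_active(now)
            for g in self.grants
        )

    def history(self, dataset: str) -> List[AccessGrant]:
        # Everything an auditor needs for one dataset, including expired grants.
        return [g for g in self.grants if g.dataset == dataset]
```

Because grants expire on their own and are kept after they end, the same records answer both “who can see this right now?” and “who could see this last quarter, and for how long?”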
Three: How long is your data valid and valuable? What happens after?
Right now, there is an abundance of data and storing it is cheap, so data is often collected and kept just in case it’s needed. Deleting data too soon can cause trouble. At the same time, your data may become outdated after a certain amount of time has passed. UPS ran into this dilemma when it deleted customer data for individuals who hadn’t logged in for 7 months, leading to some pretty upset customers.
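One way to make this decision explicit is to write the retention windows down and check records against them. The sketch below is only an illustration: the dataset names, the window lengths in `RETENTION`, and the record layout are all made-up assumptions, not recommendations. Note that it flags records for review rather than deleting them outright, which leaves room to warn customers first.

```python
from datetime import datetime, timedelta

# Hypothetical retention windows; real values depend on your business and legal needs.
RETENTION = {
    "web_logs": timedelta(days=90),
    "customer_accounts": timedelta(days=730),
}


def review_candidates(records, now):
    """Return the ids of records inactive for longer than their dataset's window."""
    due = []
    for record in records:
        window = RETENTION.get(record["dataset"])
        if window is None:
            continue  # no policy defined yet: leave it for a manual decision
        if now - record["last_active"] > window:
            due.append(record["id"])
    return due
```

For example, with `now` set to January 1, 2016, a `web_logs` record last active on June 1, 2015 would be flagged, while a `customer_accounts` record with the same date would not.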
Four: How long can you keep your data “hot” without it becoming too dangerous?
Your data becomes dangerous when it isn’t protected properly and falls into the wrong hands. There are always bad guys out there who are trying to get into your infrastructure and steal your data.
That’s why it’s important to remember that just because you can keep data forever doesn’t mean you should keep it in your application’s database. The Ashley Madison case shows that if your users have requested (or, in this case, paid for) their data to be deleted, it probably shouldn’t just be tagged as invisible while still being kept in your application database.
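The difference between tagging data as invisible and actually deleting it is easy to see in a small example. This sketch uses Python’s built-in `sqlite3` module with a made-up `users` table and a hypothetical `hidden` column; it is not how any particular site implemented deletion.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, hidden INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?)",
        [(1, "a@example.com"), (2, "b@example.com")],
    )
    conn.commit()
    return conn


def soft_delete(conn, user_id):
    # The row is only flagged; anyone who copies the database still gets it.
    conn.execute("UPDATE users SET hidden = 1 WHERE id = ?", (user_id,))
    conn.commit()


def hard_delete(conn, user_id):
    # The row is removed from the application database.
    conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
    conn.commit()


def stored_emails(conn):
    """What an attacker with a full copy of the table would see."""
    return [row[0] for row in conn.execute("SELECT email FROM users ORDER BY id")]
```

After `soft_delete(conn, 1)`, `stored_emails(conn)` still returns both addresses; only `hard_delete(conn, 1)` takes the first one out. Even then, a real deletion process would also have to cover backups, logs and replicas, which this sketch ignores.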
Consider moving your data into “cold storage”, like a data warehouse. If you’re on the cloud, look into what longer-term data storage options your cloud provider has for backup and disaster recovery. Google and Amazon both offer really good long-term storage that you can leverage for your application.
This way, even if your application or database servers are compromised and your data is stolen, the bad guys don’t have all of the data that has ever flowed through your application. You’re potentially saving yourself some embarrassment and grief, while putting an additional barrier in front of the hackers trying to steal your data.
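As a rough sketch of what moving data out of the hot database can look like, the function below copies old rows into a compressed archive file and then removes them from the application database. The `events` table, its columns, and the local gzip file are all assumptions made for illustration; in practice the archive would be shipped to whatever long-term storage your cloud provider offers, ideally with separate credentials.

```python
import gzip
import json
import sqlite3


def archive_old_events(conn, cutoff, archive_path):
    """Move events created before `cutoff` (an ISO date string) into a gzip archive.

    Returns the number of rows moved. Rows are written to the archive first and
    only deleted from the hot database afterwards.
    """
    rows = conn.execute(
        "SELECT id, created_at, payload FROM events WHERE created_at < ? ORDER BY id",
        (cutoff,),
    ).fetchall()
    if not rows:
        return 0

    # Append one JSON object per line to the compressed archive.
    with gzip.open(archive_path, "at", encoding="utf-8") as archive:
        for event_id, created_at, payload in rows:
            line = {"id": event_id, "created_at": created_at, "payload": payload}
            archive.write(json.dumps(line) + "\n")

    conn.executemany("DELETE FROM events WHERE id = ?", [(row[0],) for row in rows])
    conn.commit()
    return len(rows)
```

Run on a schedule, something like this keeps the application database limited to recent data, so a compromise of that database exposes less.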
Need help with your organization’s data?
Let us know.