Addressing the Data Governance Dilemma in the Age of Big Data

Fun Fact: Over 90% of all the data we have in the world today was generated in the last two years.

In fact, almost every activity we do nowadays generates data. Data comes from more sources, in greater volume, velocity and variety, than ever before. We’re generating data from cell phones, satellites, cameras, credit card transactions, user-generated social media content, law enforcement data, etc. Most of that data is generated and transmitted in real time. Collectively, we generate more than 2.5 quintillion bytes of data every single day!

In 2012, a claim emerged in the big data world:

“It’s cheaper to keep data than to delete it.”

While initially viewed with some skepticism, it’s now generally agreed that this is true. With the decreasing cost of an “elastic infrastructure” hosted entirely in the cloud, the cost of adding storage capacity to the equipment you’ve rented is usually much lower than the cost of analyzing your data with the intention of deciding what to delete.

This presents an interesting set of dilemmas for an organization:

  1. If it’s indeed cheaper to keep data than to delete it, does that necessarily mean that you should keep all of it?
  2. How do you control who has access to that data?
  3. If you do decide to analyze your data and delete some of it, how soon should you do it?
  4. Are there laws in the jurisdictions you’ve acquired parts of your data from that require you to handle the data, its retention, and access to it differently?

These are important questions for the management of every organization that uses data, and they’re best answered before the organization falls victim to an incident that well-thought-out policies on proper handling and retention of data could have easily prevented. Data doesn’t necessarily show up on your balance sheet like other business assets, but improper handling of it can sink your entire business. The 2013 compromise of sensitive government data and the recent Ashley Madison hack illustrate how important it is to always pay attention to data security and integrity throughout the organization.

These potential pitfalls are precisely why every organization that deals with data needs a Data Governance Policy, which outlines the processes and procedures put in place to ensure that data is captured, cleaned and stored in a way that maintains its validity and is auditable. This is in addition to any technical data security or backup schemes that your organization may have in place.

As you’re building a data governance policy for your organization, here are a few things to think about:

One: Are any of the datasets you have (or want to have) regulated by specific laws or industry rules?

As I explained in an earlier post, there are a few rules and regulations that you have to comply with when handling certain types of datasets. For example, if you’re storing or operating on financial data, you’ll want to look at the Sarbanes-Oxley Act to make sure that you’re compliant. Or, if you’re dealing with credit card and other payment information, you’ll want to make sure your procedures are compliant with the Payment Card Industry Data Security Standard (PCI DSS). Educational data is regulated by the Family Educational Rights and Privacy Act (FERPA), and health data by the Health Insurance Portability and Accountability Act of 1996 (HIPAA).

Most of these laws are very prescriptive about how data should be handled, including how long you can keep it, who in your company can access it, what kinds of pre-processing needs to be done before performing analyses on the data, and how to enforce compliance procedures.

If you’re thinking of starting a new product or line of business that might use potentially sensitive datasets, it’s a wise idea to get legal guidance on how you can ensure that your organization is fully compliant with the specific laws that apply to your data. Typically, it’s a fine balance between using data to deliver business value and taking appropriate precautions for the privacy and security of personal information.

Two: Are your cross-functional and external needs aligned?

Different business units and teams within your organization will likely have different needs from the data. Implementing an organization-wide data governance policy requires knowledge, understanding and buy-in. Keep in mind, you’ll need these things not only from your data science or data engineering professionals, but also from your customer support, sales and marketing professionals.

Keeping cross-functional needs in alignment with your compliance, security and privacy needs is paramount to successfully implementing a process that works efficiently for your organization.

Depending on the sensitivity of the datasets, it’s also useful to think about a process for logging when, and for how long, someone had access to a particular dataset, for internal auditing purposes.

Also, make sure you set proper expectations across the board about processes that determine who can access what data, how long they have access to it, and what they do with it after they’re done using it.
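
To make that concrete, here is a minimal sketch of what such an access log might look like. The table layout, field names and SQLite backend are illustrative assumptions on my part rather than a prescription; a real implementation would hook into whatever authentication and storage layers you already have.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical audit table: one row every time someone is granted access to a dataset.
conn = sqlite3.connect("access_log.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dataset_access_log (
        user_id     TEXT NOT NULL,   -- who accessed the data
        dataset     TEXT NOT NULL,   -- which dataset they touched
        purpose     TEXT,            -- why they needed it (ties back to policy)
        accessed_at TEXT NOT NULL,   -- when access was granted (UTC, ISO 8601)
        released_at TEXT             -- when access ended, if known
    )
""")

def record_access(user_id: str, dataset: str, purpose: str) -> None:
    """Append an audit row so internal reviews can see who touched what, and when."""
    conn.execute(
        "INSERT INTO dataset_access_log (user_id, dataset, purpose, accessed_at) "
        "VALUES (?, ?, ?, ?)",
        (user_id, dataset, purpose, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

record_access("jdoe", "payments_2015_q2", "chargeback analysis")
```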

Three: How long is your data valid and valuable? What happens after?

Right now, data is abundant and storing it is cheap, so data is often collected and kept just as a precaution, in case it’s needed later. Deleting data too soon can cause trouble; at the same time, your data may become outdated after a certain amount of time has passed. UPS ran into this dilemma when it deleted customer data for individuals who hadn’t logged in for seven months, leading to some pretty upset customers.
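
One way to keep that trade-off from being decided ad hoc is to write the retention rules down explicitly and check records against them. Here is a rough sketch of the idea; the dataset categories and retention periods below are made-up example values, not recommendations, and anything governed by regulation should come from your legal counsel rather than a config file.

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention windows per dataset category (example values only).
RETENTION = {
    "web_analytics": timedelta(days=365),
    "inactive_accounts": timedelta(days=730),
    "support_tickets": timedelta(days=1095),
}

def is_deletion_candidate(category: str, last_used: datetime) -> bool:
    """Return True if a record is past its retention window and should be reviewed for deletion."""
    return datetime.now(timezone.utc) - last_used > RETENTION[category]

# Example: an account last active on 2014-01-01 checked against the 730-day window above.
print(is_deletion_candidate("inactive_accounts", datetime(2014, 1, 1, tzinfo=timezone.utc)))
```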

Four: How long can you keep your data “hot” without it becoming too dangerous?

Your data becomes dangerous when it isn’t protected properly and falls into the wrong hands. There are always bad guys out there that are trying to get into your infrastructure and steal your data.

Which is why it’s important to remember that just because you have data you could keep forever doesn’t mean you should keep it in the application’s database. The Ashley Madison case shows that if your users have requested (or, in that case, paid) for their data to be deleted, it probably shouldn’t just be tagged as invisible while still being kept in your application database.

Consider moving your data into “cold storage”, like a data warehouse. If you’re in the cloud, look into the longer-term storage options your cloud provider offers for backup and disaster recovery. Google and Amazon both offer really good long-term storage that you can leverage for your application.

This way, even if your application or database servers are compromised and your data is stolen, the bad guys don’t get all of the data that has ever flowed through your application. You’re potentially saving yourself some embarrassment and grief, while putting up an additional barrier to the hackers trying to steal your data.
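
As one illustration, if your application data already lives in Amazon S3, a lifecycle rule can move older objects out of the “hot” path automatically; Google Cloud Storage offers comparable lifecycle rules. The bucket name, prefix and 90-day cutoff below are placeholder assumptions, not the right values for every setup.

```python
import boto3

# Sketch: transition objects under the "archive/" prefix to Glacier after 90 days.
# Bucket name, prefix and cutoff are placeholders for illustration.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-app-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cold-storage-after-90-days",
                "Filter": {"Prefix": "archive/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```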

Need help with your organization’s data?
Let us know.

The next chapter

Yesterday was my last day full-time at EyeVerify. As I reflect on the ride I've had with the company and the team since the technology went commercial in 2012, I'm humbled and impressed by the experience.

We started in 2012 with a clunky prototype that used the back-facing camera of a smartphone and required the user to hold the phone 6-8 inches away, nearly blinding most of our users with the bright flash as they gazed from the left and then to the right. It was beautiful, and we were far away from being "soccer mom friendly".

Over the next two and a half years, we brought the technology to the front-facing camera, and you no longer have to gaze! Verification happens in well under a second; we generate cryptographic keys, added chaff-based enterprise server support, run as a service on Android phones, integrated with several financial institutions, and enabled liveness support, with several new innovative, yet unannounced, features and deals under way.

What a ride! We have a mature product!

I'm excited to announce that I'll be starting as the Chief Data Officer at MindMixer. It's an exciting opportunity to build a new product, and engage with a very different set of stakeholders to solve a very exciting, yet challenging set of problems.

Given my history and strong connection with EyeVerify, I will still stay involved at some level, including working on my dissertation work - which is strongly tied to the core technology at EyeVerify. However, starting tomorrow, I'm tackling a whole new set of problems and exploring a radically different set of approaches to solve them, and I'm pumped.

How could biometrics be used for the better delivery of health?

According to a 2006 report from the Institute of Medicine, 1.5 million people are harmed by medication errors each year; similar studies indicate that 400,000 injuries occur yearly in hospitals as a direct result of medication errors. A large majority of these errors are pharmacy misfills.

The most common prescription errors were due to the pharmacist giving someone else’s medicine to the customer, mostly because of mis-identification of the patient. Other common errors involved the pharmacist putting the wrong prescription medicine in the pill bottle, putting the right medicine in the wrong dosage in the bottle, or mislabeling the medicine with the wrong instructions.

Beyond damaging a patient’s health, or even causing death, such errors also impose a societal cost: medicine that makes the recipient sicker rather than healthier drives up health care costs for everyone.

While a 2002 FDA rule requiring bar code scanning for all hospital patients has reduced medication errors by 86% in the nine years since it took effect, these kinds of incidents are still too common to ignore, especially in situations where the patient does not have a bar code. This is often the case when getting refills for medications; recent reports indicate that pharmacists estimate a 10% error rate in medication refills.

One of the major issues hindering the healthcare industry from delivering more accurate care is the lack of a seamless method of identifying patients that works consistently and reliably. Ideally, such a method of biometric identification would be relatively inexpensive to deploy at scale, so that cost would never prevent patients from being able to identify themselves, even when they are not admitted as a patient at a particular hospital.

Healthcare institutions have evaluated a wide variety of technologies to help solve this problem, and biometric technologies for identification have been one of the main areas of active research in hospital and pharmacy management. Such experiments have shown that traditional biometric technologies do not work effectively in such situations.

We believe that modern biometric technologies could be applied in this domain to significantly reduce errors that arise from patient mis-identification.

The latex gloves worn by many healthcare professionals have made the widespread application of fingerprint enrollment and authentication quite challenging. It’s also been shown that frequent hand washing causes dry skin, which prevents many commercial fingerprint sensors from authenticating reliably and consistently.

Other biometric technologies, like retina scanning and iris scanning, are still really expensive and inconvenient, which limits their applications mostly to blocking insecure transactions instead of facilitating secure, authentic ones.

Over the last five years, smartphones have also become more commonplace in healthcare. Medical professionals often use text messages and similar instant messaging media to communicate with each other about patient status. Even tablets are becoming more prevalent: medical schools now provide students with tablets to use as textbooks and to round on patients. With this increase in the use of mobile technology comes an increased risk of HIPAA compliance issues.

As smartphones outsell feature phones in the United States, it’s not just healthcare workers who are using them: more and more patients are coming to hospitals and pharmacies with a smartphone or tablet in their pocket, and pharmacies and hospitals are getting on the smartphone bandwagon by building customized apps as well.

This increased prevalence of smartphones among patients and healthcare professionals, along with the healthcare industry’s dire need for a simple yet secure and reliable way of identifying a patient receiving a particular course of treatment or medication, presents unique opportunities. The next phase of healthcare information technology would benefit greatly from tying these two trends together to provide a cohesive solution.

The big question is, is there already a way to tie these two recent trends together to provide a security and identification solution that is scalable, inexpensive and significantly better than all current technologies? Learn more in my next post.

10 steps to getting the most out of external development

Written for Silicon Prairie News (http://www.siliconprairienews.com/2013/12/10-steps-to-getting-the-most-out-of-external-development)

For more than two years, I’ve consulted with more than 35 different organizations around the country on various software projects. These organizations have had very different development needs. Some were solo entrepreneurs trying hard to get an idea off the ground without any funding at all, while others were Fortune 500 companies with a lot of experience working with outside developers and agencies.

Through my participation in these projects, I’ve learned a lot about how to effectively manage a development project when you’re contracting with a developer or a development team that’s not internal, and more importantly, how not to manage such a relationship.

Here’s what I’ve learned:

1. Have a clear idea of what you want

The first step before engaging with any developer or development agency is to develop a clear idea in your mind about the product or service you’re trying to develop. It’s important to be able to present a very quick summary of what you’re working on, yet at the same time being prepared to delve into the nitty-gritty of any individual section.

2. Think through special/edge cases

Solutions to whatever problem you’re solving are not usually a straight line. Have you walked through all possible scenarios and figured out how the system will react? Have you thought about invalid situations, when the user provides incorrect input or misses a step?

3. Develop a comprehensive requirements document

This is an important step a lot of people seem to ignore. Being able to articulate your concept and present it in written form is not only essential to the developer(s), but also helpful to understand your own product as well. This can be in one or more of many forms: user stories, “system shalls,” wireframes, etc.

In any event, it’s very important to ensure this document covers the purpose and scope of your product or service, identifies the user demographics, elaborately presents functional and non-functional requirements, and identifies all assumptions and constraints. An ideal requirements document also includes a proposed timeline with appropriate milestones, a testing plan and anything else you might think is helpful for the team to be aware of.

This is the single-most valuable resource for your development team, internal or external.

It’s also essential to remember this is continuously a work in progress, and will be revised as additional information becomes available. It’s not a big deal to change it as long as proper procedures are followed, but the earlier the changes happen, the better and cheaper it is.

4. Research several companies

Identify their strengths and weaknesses. Ask about their core areas of expertise and request to see work samples. Learn about how they approach software development. Are they fans of a particular software-development process methodology? Have they used their recommended process on previous projects? What were some of the biggest challenges? Figure out what kind of software consulting firm they are.

5. Know your expectations and learn about theirs

Before signing a contract with a company, state your expectations about the project precisely. What are your criteria for a successful project? How much wiggle room is there with the timeline? How often do you expect status reports? Who will be your point of contact? How soon can you expect a response to an e-mail sent to this point of contact? Do you expect them to comment the code in a certain manner or maintain a development log documenting technical decisions?

It’s also important that you ask what their expectations are of you. Do they expect you to be at bi-weekly scrum meetings? What kinds of deliverables are required from you to keep a smooth flow going? Are they developing a throw-away prototype for an MVP in an effort to get it done quickly, or is this going into the final product?

6. Define the scope of each phase of the project well

It’s important to have the scope of each phase of the overall project defined precisely, and this is often dictated by other factors. What would the stakeholders (potential/existing customers) most want to see earliest? Sometimes the marketing plan dictates this to some extent as well. Also, sometimes two or more different entities work on a project. Maybe there’s a creative/digital agency developing the user experience and a back-end development firm working on the business logic. In such instances, it’s important for the client to identify in precise terms the scope of each of those sub-modules of the project.

7. For existing projects, make sure the team knows all the assumptions made by the previous developer(s)

Although starting a project from scratch is every developer’s dream, that’s often not the case. In such instances, it’s important for the new development team to be aware of the technical assumptions made by the previous team(s). This reduces ramp-up time for the new team and greatly diminishes the risks associated with the project.

8. Find a technical mentor not associated with the development company

If you’re not very technical yourself, it’s very advantageous for you to have a technical mentor you can bounce ideas off of. Sometimes, when the development team presents you with a few technical options that will achieve your desired goal, it’s a good idea to sit down with your mentor and discuss these so you have a different and perhaps more educated perspective guiding your decision.

9. R&D is more expensive than engineering

If your product or service requires original research and development, it’s going to take longer than an engineering solution that relies on existing research. R&D involves multiple cycles of hypothesis and exploration; design, development and testing; implementation and improvisation; and dealing with a lot more uncertainty.

10. Notify of changes to the requirements, scope or timeline immediately

From a developer’s perspective, it’s always better to know of any changes to the project sooner rather than later. The sooner a change is identified, the lower the cost of dealing with it.

Programming In Academia Vs Industry

I’ve programmed both in an academic setting and in industry. In college, I’ve worked on research projects and directed readings with professors, developed apps and an autograder, and had a great experience doing so. I’ve also programmed in industry and worked on products that have actually shipped to users. Most of my work in industry has been either embedded systems programming or enterprise application programming.

A lot of people seem to think that programming in academia isn’t real-world programming. While I agree that four years of programming in academia solving theoretical problems in C or Matlab isn’t going to make you a rockstar at implementing RESTful web services in Java or .NET, I do think the core programming skills and principles carry over nonetheless.

In this post, I would like to highlight some of the differences I’ve noticed in my experiences with both so far.

One of the major differences in my experience between the two domains has been guidance. If I’m stuck on something in college, even if it’s as trivial as figuring out best practices, I can bounce ideas off my professors, and they understand that I’m learning by doing and are usually accommodating. An elegant solution is preferred over a hacked-together solution, and you’re encouraged to learn best practices along the way.

Now, I’m not talking about a programming assignment for class here. I’m talking about projects done under the guidance of a faculty advisor.

My experience in industry tells me that the “some product is better than no product” rule takes precedence over everything else. If you cannot figure out the most elegant way to do something, hack something together to have a working solution. If time and budget permit, you may have a chance to go back and fix it later, although sometimes you may not. You may have a mentor assigned to you, but he’s busy doing his own work and will usually just point you at something; if that doesn’t help, you’re on your own to figure it out. If you ask TOO MUCH, you’re probably a bad hire.

Another difference I’ve noticed is documentation (comments in code). While programming in academia, if you don’t document your code, someone is going to get upset, either because they don’t know the language you’re using and can’t figure out the algorithm, or just because you’re not following what has been taught in your courses. In industry, regrettably, no one has the time for extensive source code documentation. Maybe it’s just that I’m working under experienced professionals who know what the code does by looking at ten lines of a 500-line class implementation, but such has been my experience. After all, were you hired to produce products and services or to write pre- and post-conditions for your methods?

I’m sure I’m not the only person who’s wished that this would change. For a developer, comments in source code are more helpful than external documentation any day.

Academic programming also tends to have evolving designs and architectures. Trial and error is better than something set in stone. A professor would rather work on an interesting problem than have meetings to discuss an architecture document. After all, wasn’t I supposed to have taken a software engineering class?

Programming in industry will always have written architecture, design and other documentation. If not, it’s probably not a real project.

I will admit that I’ve programmed more in an academic setting than in industry, but so far, I think there are pros and cons to both, and both would do themselves a favor by learning from each other.

 
--- 
UPDATE: I also posted this question on the stackexchange programmers community, and I received some very interesting answers. Here are some of them:
---
In a traditional undergraduate computer science program, you learn just programming. But industry doesn’t want people who are just programmers; industry wants real software engineers. I know many job descriptions don’t seem to know the difference, which only confuses the matter, but in the real world you need to be able to:

Gather and analyze requirements, when they aren’t directly given to you
Design and analyze architecture, with near endless possibilities
Create test plans and act on them, to evaluate and improve the quality of a system
Work collaboratively on a team, of people with different backgrounds and experience levels
Estimate and plan work, even if you don’t know exactly what to build
Communicate effectively with stakeholders, who have different needs that don’t necessarily align
Negotiate schedule, budget, quality, and features, without disappointing stakeholders

Oh yeah, and you also have to be able to write code too, but that’s, on average, only 40 – 60% of a software engineer’s time.

So, it’s not that freshly minted computer science undergrads don’t know how to program (many are in fact, very good programmers) – it’s that many of them don’t know how to do anything else!
---
University

(I call this scenario "university" because programming as an actual computer scientist is also different from what you do while studying.)
Your teacher gives you:

A well defined, isolated problem, the solution of which can be provided within a short and well defined time span and will be discarded afterward
A well defined set of tools that you were introduced to prior to assignment
A well defined measure for the quality of your solution, with which you can easily determine whether your solution is good enough or not

“Real World”

In this scenario:

The problem is blurry, complex and embedded in context. It’s a set of contradictory requirements that change over time and your solution must be flexible and robust enough for you to react to those changes in an acceptable time.
The tools must be picked by you. Maybe there’s already something usable in your team’s 10-year-old codebase, maybe there’s an open source project or a commercial library, or maybe you’ll have to write it on your own.
To determine whether the current iteration of your software is an improvement (because you’re almost never actually done in a software project), you need to do regression testing and usability testing, the latter of which usually means that the blurry, complex, contradictory, context-embedded requirements shift once again.

Conclusion

These things are inherently different, to the point where there’s actually very little overlap. The two work at completely different levels of granularity. CS will prepare you for “real world” software development like athletics training would prepare an army for battle.
---
Academia is mainly focused on the “science of programming”: studying how to make a particular algorithm efficient, or developing languages tailored to make certain paradigms more expressive. Industry is mainly focused on producing things that have to be sold. It has to rely on “tools” that are not only languages and algorithms, but also libraries, frameworks, etc.

This difference in “focus” is what makes an academic master of C practically unable to write a Windows application (since the Windows API is not part of the C99 standard!), and thus feel “unable to program.” But, in fact, he has all the capabilities to teach himself what he’s missing, something that, without proper academic study (not necessarily done in academia), is quite hard to come by.