Data’s Credibility Problem
Artwork: Chad Hagen, Nonsensical Infographic No. 7, 2009, digital
As a rising product-management executive prepares for an important presentation to her firm’s senior team, she notices that something looks off in the market share numbers. She immediately asks her assistant to verify the figures. He digs in and finds an error in the data supplied by the market research department, and the executive makes the necessary corrections. Disaster averted! The presentation goes very well, and the executive is so delighted that she makes an on-the-spot award to her assistant. She concludes, “You know, we should make it a policy to double-check these numbers every time.” No one thinks to inform the people in Market Research of the error, much less work with the group to make sure that the proper data is supplied the next time.
I’ve seen such vignettes play out in dozens of companies in my career as a data doctor. In telecommunications, the maintenance department might have to correct bad addresses inputted by Customer Service; in financial services, Risk Management might have to accommodate incorrect loan-origination details; in health care, physicians must work to improve patient outcomes in the face of incomplete clinical data. Indeed, data quality problems plague every department, in every industry, at every level, and for every type of information.
Much like our rising executive, employees routinely work around or correct the vast majority of these errors as they go about their daily work. But the costs are enormous. Studies show that knowledge workers waste up to 50% of their time hunting for data, identifying and correcting errors, and seeking confirmatory sources for data they do not trust.
And consider the impact of the many errors that do leak through: An incorrect laboratory measurement in a hospital can kill a patient. An unclear product spec can add millions of dollars in manufacturing costs. An inaccurate financial report can turn even the best investment sour. The reputational consequences of such errors can be severe—witness the firestorm that erupted over problems with Apple Maps in the fall of 2012.
When crude oil is thick, one of the major costs of working an oil field is steam-heating the crude in the ground to make the oil easier to pump. To figure out how much steam is needed, field technicians point an infrared gun at the flow line, take a reading, and send the data to the reservoir engineer. On the basis of those data, the engineer determines the right amount of steam and instructs field technicians to make any adjustments.
But the flow line can get dirty, which insulates the line and causes readings to be as much as 20°C lower than the true temperature. A dirty flow line means dirty data. This was a big problem at one oil company, whose field technicians had no idea how inaccurate their readings were—or that bad readings routinely caused reservoir engineers to use more steam than necessary, jacking up operational expenses by tens of millions of dollars.
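To see how such a bias compounds into real money, consider a minimal sketch. The linear steam model, the target temperature, and the steam-per-degree figure below are illustrative assumptions, not the oil company's actual engineering; the point is simply that a reading biased 20°C low systematically inflates every steam estimate built on it.

```python
# Illustrative sketch only: a hypothetical linear model of steam required
# to heat crude from its measured temperature up to a target temperature.
# All constants are assumptions chosen for the example.

TARGET_TEMP_C = 200.0     # hypothetical target crude temperature
STEAM_PER_DEGREE = 50.0   # hypothetical barrels of steam per degree C of heating

def steam_needed(measured_temp_c: float) -> float:
    """Estimate steam required to close the gap between measured and target temperature."""
    return max(TARGET_TEMP_C - measured_temp_c, 0.0) * STEAM_PER_DEGREE

true_temp = 150.0                       # actual crude temperature at the flow line
dirty_line_reading = true_temp - 20.0   # a dirty line reads up to 20 degrees C low

correct = steam_needed(true_temp)
inflated = steam_needed(dirty_line_reading)

print(f"Steam with an accurate reading:   {correct:.0f} bbl")
print(f"Steam with the dirty-line reading: {inflated:.0f} bbl")
print(f"Wasted steam from the bias:        {inflated - correct:.0f} bbl")
```

Repeated across every reading, every well, and every day, that per-reading waste is how a measurement error no one notices turns into tens of millions of dollars in unnecessary operating expense.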
This story is all too typical of data quality problems that plague every industry. Yet the solution is usually quite simple: Make sure that the employees involved in creating the data understand the problem. Once managers at the oil company specified that employees had to clean the flow lines, the errors stopped.
When data are unreliable, managers quickly lose faith in them and fall back on their intuition to make decisions, steer their companies, and implement strategy. They are, for example, much more apt to reject important, counterintuitive implications that emerge from big data analyses.
Fifty years after the expression “garbage in, garbage out” was coined, we still struggle with data quality. But I believe that fixing the problem is not as hard as many might think. The solution is not better technology: It’s better communication between the creators of data and the data users; a focus on looking forward; and, above all, a shift in responsibility for data quality away from IT folks, who don’t own the business processes that create the data, and into the hands of managers, who are highly invested in getting the data right.
Connect Data Creators with Data Customers
From a quality perspective, only two moments matter in a piece of data’s lifetime: the moment it is created and the moment it is used. The quality of data is fixed at the moment of creation. But we don’t actually judge that quality until the moment of use. If the quality is deemed to be poor, people typically react by working around the data or correcting errors themselves.
But improving data quality isn’t about heroically fixing someone else’s bad data. It is about getting the creators of data to partner with the users—their “customers”—so that they can identify the root causes of errors and come up with ways to improve quality going forward. Recall our rising executive. By correcting the error herself rather than informing Market Research, she left others to be victimized by the same bad data coming from that department. She also took it upon herself to adjust the numbers even though she was far less qualified to do so than the creators of the data.