In the previous blog post, I provided an overview of data quality and its dimensions. To recap, its dimensions are:
This second part will focus on data integrity and data accuracy.
Danette MacGilvray, in Executing Data Quality Projects, describes data integrity as:
“The Data Integrity Fundamentals dimension of quality is a measure of the existence, validity, structure, content, and other basic characteristics of data. It includes essential measures such as completeness/fill rate, validity, lists of values and frequency distributions, patterns, ranges, maximum and minimum values, and referential integrity.”
To better understand data integrity, it is important to profile the data. Data profiling is a type of reporting that can shine a light on all the topics mentioned above. Essentially, data profiling provide a highly detailed view of what the data looks like:
- How complete is the data in this column? Does every row have a value or are some blank?
- What is the smallest value in a column? The largest?
- What columns have just a limited number of values? What are they? Are there any outliers (values that appear very infrequently relative to the others?)
- If there are “parent” or “master” lists of values stored somewhere, then are all the “children” consistent? That is, if a “parent” table has 10 values, do “child” records respect that? Or, are there 15 distinct values in the children, some of which aren’t in the parent?
A data analyst can use SQL and reporting tools to answer these kinds of questions. However, when the problem increases to hundreds or thousands of tables and columns, more robust data-profiling tools are needed. Blazent includes robust functionality to profile data and then report exactly these kinds of integrity issues.
Data accuracy might seem similar to integrity, but it is a different question. Data can have integrity and still be inaccurate, and inaccurate data might appear to have integrity (at least, a data-profiling tool won’t catch it). How can this be?
Let’s look at an example.
Suppose you say that in the IT Asset system a given device’s vendor must appear in the master list of vendors (perhaps drawn from your Vendor Contracts or Accounts Payable system). If you don’t enforce this programmatically, then your system will likely start to have vendor names appearing that are not seen in the master list. This is an integrity issue.
Suppose, however, some device has “Dell Inc.” as the vendor name. “Dell Inc.” is, in fact, in the master list, but it’s wrong. The device is made by HP. This is a data accuracy issue.
Data profiling can catch some of these issues, for example, a profiling tool can identify a pattern that Dell serial numbers all start with “D,” but HP serial numbers all start with “H.” In many cases, however, inaccurate data surfaces through an audit. The individual record has to be checked by a human being against some other source of truth (such as the asset invoice), and the inaccuracy is then caught. Given the volumes most companies manage, it is rare that all data can be audited; usually, only small samples are subject to analysis. If a sample identifies a discrepancy, however, two actions should be taken:
1. The discrepancy should be fixed.
2. The reason (root cause) for the discrepancy should also be identified and fixed, if possible.
In this way, a continuous improvement process will increase data accuracy during time.
It’s important to have some quantified understanding of what the data accuracy is worth. Bad data drives rework and some level of risk, both of which can be quantified. Often, accuracy of more than 95% is not worth pursuing; some level of fine-tuning and cross-checking of data is expected during the course of operational processes. If large dollar amounts are at stake, however, then more aggressive countermeasures may be needed. Formal enterprise risk management provides the tools and language to identify when such actions might be needed (e.g. instituting increased auditing or integrity checking as a form of risk control).
Part 3 of this series will look at data completeness and duplication, which are two sides of the same coin