The Dimensions of Data Quality
Data quality is a broad topic, and the data management community has worked hard to dissect and define it so that it’s accessible and usable. In the next few blogs, I’ll provide a deeper look at what is meant by data quality, and examine its component parts.
What is meant by data or information quality? Danette McGilvray, the author of Executing Data Quality Projects, defines it as “the degree to which information and data can be a trusted source for any and/or all required uses.” This is a good definition, as it focuses on the outcome of data quality as a practice. It would be counter-intuitive to trust data that was known to be inaccurate, outdated or redundant. Conversely, why would you NOT trust data that has been shown to be reliable and accurate?
Data quality is based on a number of dimensions, which represent different ways to manage and understand the quality of data. These dimensions include:
- Integrity
- Accuracy
- Completeness
- Duplication
- Currency
- Consistency
Data integrity is the most fundamental dimension and the one on which all the others rest. It determines whether, on the whole, the data “makes sense” given what is known of the business and its requirements. Data integrity practices include profiling to identify unusual or outlying values, understanding expected distributions, and establishing and enforcing domains of value (e.g., which vendors of hardware or software are valid).
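To make this concrete, here is a minimal sketch of what a domain-of-value check and a simple range profile might look like in Python; the field names, vendor list and CPU range are invented for illustration.

```python
# A minimal sketch of two basic integrity checks: enforcing a domain of valid
# values and flagging values outside an expected range. The record layout and
# field names (vendor, cpu_count) are illustrative assumptions.
VALID_VENDORS = {"Dell", "HPE", "Lenovo", "Cisco"}   # the agreed domain of value
EXPECTED_CPU_RANGE = (1, 128)                        # from profiling typical servers

records = [
    {"asset_id": "SRV-001", "vendor": "Dell",   "cpu_count": 16},
    {"asset_id": "SRV-002", "vendor": "HPe",    "cpu_count": 8},    # vendor outside the domain
    {"asset_id": "SRV-003", "vendor": "Lenovo", "cpu_count": 512},  # value outside expectations
]

domain_violations = [r["asset_id"] for r in records
                     if r["vendor"] not in VALID_VENDORS]
out_of_range = [r["asset_id"] for r in records
                if not EXPECTED_CPU_RANGE[0] <= r["cpu_count"] <= EXPECTED_CPU_RANGE[1]]

print("Domain violations:", domain_violations)   # ['SRV-002']
print("Out of range:", out_of_range)             # ['SRV-003']
```

Neither check proves the surviving records are right; it only flags the ones that cannot be right as stated.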
Data accuracy is a different question from data integrity. A given record may satisfy every integrity expectation and still simply be wrong: the server is not at location L340; it’s at M345. Data profiling and exception reporting won’t uncover this error; you have to audit the data against reality.
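Once the audit findings are captured, checking them against the repository is straightforward. The sketch below flags records where the two disagree; the field names and data layout are assumptions for illustration.

```python
# A minimal sketch of an accuracy audit: compare what the repository says
# against what a physical audit actually found. Field names and the audit
# data structure are illustrative assumptions.
repository     = {"SRV-001": {"location": "L340"}, "SRV-002": {"location": "M120"}}
audit_findings = {"SRV-001": {"location": "M345"}, "SRV-002": {"location": "M120"}}

for asset_id, repo_rec in repository.items():
    audited = audit_findings.get(asset_id)
    if audited and audited["location"] != repo_rec["location"]:
        print(f"{asset_id}: repository says {repo_rec['location']}, "
              f"audit found {audited['location']}")
```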
Data completeness is self-explanatory. Is all the expected data there? Are servers or software sneaking into operational status with no repository record? Integrity checks can help, but you may still need some form of audit, testing the data against reality through a formal inventory process. Reconciliation of multiple data sets becomes an essential tool for this kind of initiative.
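A reconciliation pass can be as simple as comparing the identifiers known to each system. The sketch below is a minimal, illustrative example; the source names and IDs are invented.

```python
# A minimal sketch of reconciling two data sets to spot completeness gaps:
# assets seen by monitoring but missing from the repository, and vice versa.
repository_ids = {"SRV-001", "SRV-002", "SRV-003"}
monitoring_ids = {"SRV-002", "SRV-003", "SRV-004"}   # e.g., discovered on the network

missing_from_repository = monitoring_ids - repository_ids   # in service, never recorded
missing_from_monitoring = repository_ids - monitoring_ids   # recorded, but never seen

print("No repository record:", sorted(missing_from_repository))   # ['SRV-004']
print("Never observed:", sorted(missing_from_monitoring))          # ['SRV-001']
```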
Data duplication is the flip side of completeness. If a customer has two accounts, then he or she may be able to exceed his or her credit line. If a server appears twice, the capital budget will be inaccurate. Simple duplication, where two records are identical, can be found relatively easily through basic reporting. The real challenge comes when duplicated records are just different enough to slip past simple reporting. Powerful algorithms such as those used by Blazent can identify potential duplicates based on fuzzy or incomplete matches.
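Blazent’s matching algorithms are proprietary, but the general idea can be sketched with Python’s standard difflib: normalize the values, score their similarity, and flag pairs above a threshold. The names and the 0.85 threshold below are illustrative assumptions.

```python
# A minimal sketch of fuzzy duplicate detection using difflib; this only
# illustrates the general idea, not any particular vendor's algorithm.
from difflib import SequenceMatcher
from itertools import combinations

servers = ["web-prod-01", "WEB-PROD-01", "db-prod-02", "db-prod-2"]

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] after normalising case and punctuation."""
    norm = lambda s: s.lower().replace("-", "").replace(".", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

for a, b in combinations(servers, 2):
    score = similarity(a, b)
    if score >= 0.85:  # tune the threshold to your tolerance for false positives
        print(f"Possible duplicate: {a!r} ~ {b!r} (score {score:.2f})")
```

The threshold is a trade-off: set it too low and reviewers drown in false positives, too high and near-duplicates slip through.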
Data currency asks whether the data is still up to date, and whether the processes needed to keep it current are in place. The data may once have been correct, but not anymore: the server moved, or was decommissioned and sent to salvage. Data profiling over time can identify the “data spoilage” rate, that is, the expected half-life of a data set, and therefore how often the maintenance processes need to run to keep the data current. Data quality analytics is an important leading indicator for currency issues: the IT Asset system may say Server X is still in production, but the event monitoring system no longer sees a feed from it – so what happened?
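As a rough illustration, assuming changes accumulate at a steady (roughly exponential) rate, a spoilage figure from one audit cycle can be turned into an estimated half-life and a re-audit cutoff. The 20% / 90-day figures and field names below are invented for the example.

```python
# A minimal sketch of estimating a "data spoilage" half-life and flagging
# records that are due for re-verification. All figures are illustrative.
from datetime import date, timedelta
import math

# Suppose a periodic audit found that 20% of records had changed after 90 days.
changed_fraction, window_days = 0.20, 90

# Assuming roughly exponential decay, estimate the data set's half-life.
half_life_days = window_days * math.log(2) / -math.log(1 - changed_fraction)
print(f"Estimated half-life: {half_life_days:.0f} days")   # ~280 days

# Flag records not verified within one half-life as candidates for re-audit.
today = date.today()
records = [
    {"asset_id": "SRV-001", "last_verified": today - timedelta(days=400)},
    {"asset_id": "SRV-002", "last_verified": today - timedelta(days=30)},
]
cutoff = today - timedelta(days=round(half_life_days))
stale = [r["asset_id"] for r in records if r["last_verified"] < cutoff]
print("Stale records:", stale)   # ['SRV-001']
```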
Data consistency is needed because data gets replicated for good, or at least unavoidable, reasons; ideally all data would live in one place, but in practice it rarely does. If your IT Asset repository is distinct from your CMDB, how do you keep them in sync? What about your Fixed Asset and monitoring systems? You might see server data in any of these, with subtle (or dramatic) differences. Reconciling this diversity into a common view is essential, yet challenging, and detailed business rules may be required to deliver a consistent, integrated resource.
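One common pattern for such business rules is source precedence: for each attribute, rank the systems by how much you trust them and take the first value available. The sketch below illustrates the idea; the source names, attributes and precedence order are assumptions, not a prescription.

```python
# A minimal sketch of rule-based reconciliation across sources: for each
# attribute, a precedence rule decides which system wins when values disagree.
asset_repo = {"SRV-001": {"location": "L340", "status": "active"}}
cmdb       = {"SRV-001": {"location": "M345", "status": "active"}}
monitoring = {"SRV-001": {"status": "unreachable"}}

# Business rule: monitoring is most trusted for status, the asset repository
# for location; the CMDB fills in anything still missing.
precedence = {"status":   [monitoring, cmdb, asset_repo],
              "location": [asset_repo, cmdb, monitoring]}

def consolidated(asset_id: str) -> dict:
    """Build a single 'golden' record for asset_id from the ranked sources."""
    record = {}
    for field, sources in precedence.items():
        for source in sources:
            value = source.get(asset_id, {}).get(field)
            if value is not None:
                record[field] = value
                break
    return record

print(consolidated("SRV-001"))   # {'status': 'unreachable', 'location': 'L340'}
```

The hard part is rarely the code; it is agreeing, system by system and attribute by attribute, on who is authoritative for what.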