The Dimensions of Data Quality

Data quality is a broad topic, and the data management community has worked hard to dissect and define it so that it’s accessible and usable. In the next few blogs, I’ll provide a deeper look at what is meant by data quality, and examine its component parts.

What is meant by data or information quality? Danette McGilvray, the author of Executing Data Quality Projects, defines it as “the degree to which information and data can be a trusted source for any and/or all required uses.” This is a good definition, as it focuses on the outcome of data quality as a practice. It would be counter-intuitive to trust data that was known to be inaccurate, outdated or redundant. Conversely, why would you NOT trust data that has been shown to be reliable and accurate?

Data quality is based on a number of dimensions, which represent different ways to manage and understand the quality of data. These dimensions include:

  • Integrity
  • Accuracy
  • Completeness
  • Duplication
  • Currency
  • Consistency

Data integrity is the most fundamental dimension and the one on which all other dimensions are based. It determines whether, on the whole, the data “makes sense” given what is known of the business and its requirements. Data integrity practices include profiling to identify unusual or outlying values, understanding expected distributions, and establishing and enforcing domains of value (e.g., which hardware or software vendors are valid).
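
To make this concrete, here is a minimal Python sketch of profiling a vendor field and enforcing a domain of valid values. The records, field names, valid-vendor list, and outlier rule are invented for illustration and are not tied to any particular tool.

```python
from collections import Counter
from statistics import median

# Hypothetical asset records; the field names and values are illustrative only.
assets = [
    {"name": "srv-01", "vendor": "Dell", "cpu_count": 8},
    {"name": "srv-02", "vendor": "HP",   "cpu_count": 16},
    {"name": "srv-03", "vendor": "Del",  "cpu_count": 8},     # typo outside the valid domain
    {"name": "srv-04", "vendor": "Dell", "cpu_count": 4096},  # implausible outlier
]

VALID_VENDORS = {"Dell", "HP", "IBM", "Cisco"}  # the enforced domain of value

# Profile the vendor distribution to spot unexpected values.
print(Counter(a["vendor"] for a in assets))

# Flag records whose vendor falls outside the agreed domain.
domain_violations = [a["name"] for a in assets if a["vendor"] not in VALID_VENDORS]

# Crude outlier check: flag CPU counts far above the median for the data set.
med = median(a["cpu_count"] for a in assets)
outliers = [a["name"] for a in assets if a["cpu_count"] > 10 * med]

print("Domain violations:", domain_violations)
print("Outliers:", outliers)
```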

Data accuracy is a different question than data integrity. A given record may satisfy integrity expectations and still simply be wrong: the server is not at location L340; it’s at M345. Data profiling and exception reporting won’t uncover this error; you have to audit the data against reality.
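
A simple way to picture the difference is to compare what the repository claims against the results of a physical audit. The server names and locations below are hypothetical; this is only a sketch of the comparison step.

```python
# What the repository says versus what a physical walk-through actually found.
repository     = {"srv-01": "L340", "srv-02": "M345", "srv-03": "L112"}
physical_audit = {"srv-01": "M345", "srv-02": "M345", "srv-03": "L112"}

for server, recorded in repository.items():
    actual = physical_audit.get(server)
    if actual is not None and actual != recorded:
        # The record passed integrity checks but is still wrong in the real world.
        print(f"{server}: repository says {recorded}, audit found {actual}")
```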

Data completeness is self-explanatory: is all the expected data there? Are servers or software sneaking into operational status with no repository record? Integrity checks can help, but you may still need some form of audit that tests the data against reality through a formal inventory process. Reconciliation of multiple data sets becomes an essential tool for supporting this type of initiative.
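
A first-pass reconciliation can be as simple as set arithmetic between two sources. The server names below are made up; the point is only to show the shape of the check.

```python
# Hypothetical lists of server names from two sources.
repository_servers = {"srv-01", "srv-02", "srv-03"}
monitored_servers  = {"srv-01", "srv-02", "srv-03", "srv-99"}  # srv-99 was never registered

# Servers running in production with no repository record (a completeness gap).
missing_from_repository = monitored_servers - repository_servers

# Repository records that monitoring has never seen (possibly stale or wrong).
missing_from_monitoring = repository_servers - monitored_servers

print("In production but not in the repository:", sorted(missing_from_repository))
print("In the repository but never monitored:", sorted(missing_from_monitoring))
```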

Data duplication is the flip side of completeness. If a customer has two accounts, then he or she may be able to exceed his or her credit line. If a server appears twice, then the capital budget will be inaccurate. Simple duplication, where two records are identical, can be found relatively easily through basic reporting. The real challenge arises when the duplicated records are just different enough to evade simple reporting. Powerful algorithms, such as those used by Blazent, can identify potential duplicates based on fuzzy or incomplete matches.
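
To illustrate the general idea of fuzzy matching (not the specific algorithms any product uses), here is a small Python sketch that scores record pairs on a blend of name similarity and serial-number equality. The records, weights, and threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records that are "just different enough" to evade exact matching.
records = [
    {"id": 1, "name": "Acme Corp",        "serial": "X1-4432"},
    {"id": 2, "name": "ACME Corporation", "serial": "X1-4432"},
    {"id": 3, "name": "Globex",           "serial": "Z9-0001"},
]

def similarity(a, b):
    """Blend a fuzzy name comparison with an exact serial comparison."""
    name_score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    serial_score = 1.0 if a["serial"] == b["serial"] else 0.0
    return 0.6 * name_score + 0.4 * serial_score

# Any pair scoring above the (arbitrary) threshold is a potential duplicate.
potential_duplicates = [
    (a["id"], b["id"], round(similarity(a, b), 2))
    for a, b in combinations(records, 2)
    if similarity(a, b) > 0.7
]
print(potential_duplicates)  # e.g. [(1, 2, 0.83)]
```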

Data currency is about keeping data up to date through the necessary ongoing processes. The data may once have been correct, but is no longer: the server moved, or was decommissioned and sent to salvage. Data profiling over time can identify the “data spoilage” rate, that is, the expected half-life of a data set, and how often the processes will need to run to maintain currency. Data quality analytics is an important leading indicator for currency issues: the IT Asset system may be behind the times, but the event monitoring system no longer sees a feed from Server X, so what happened to it?
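
As a sketch of how spoilage can be measured, assume each record carries a last-verified timestamp (a hypothetical field) and that anything unverified for 90 days counts as stale; both the field and the threshold are illustrative.

```python
from datetime import datetime, timedelta

now = datetime(2024, 6, 1)
STALE_AFTER = timedelta(days=90)  # illustrative freshness threshold

records = [
    {"name": "srv-01", "last_verified": datetime(2024, 5, 20)},
    {"name": "srv-02", "last_verified": datetime(2023, 11, 2)},  # moved? decommissioned?
    {"name": "srv-03", "last_verified": datetime(2024, 1, 5)},
]

# Records that have gone unverified longer than the threshold.
stale = [r["name"] for r in records if now - r["last_verified"] > STALE_AFTER]

# Tracking this proportion over time approximates the "data spoilage" rate.
spoilage_rate = len(stale) / len(records)

print("Stale records:", stale)
print(f"Spoilage: {spoilage_rate:.0%} of records exceed the freshness threshold")
```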

Data consistency addresses the fact that, although it is ideal to have all data in one place, data is replicated for good, or at least unavoidable, reasons. If your IT Asset repository is distinct from your CMDB, then how do you keep them in sync? What about your Fixed Asset and monitoring systems? You might see server data in any of these, with subtle (or dramatic) differences. Reconciling this diversity into a common view is essential, yet challenging. Detailed business rules may be required to deliver such a consistent, integrated resource.
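
One common pattern for such business rules is attribute-level precedence across sources. The source names, attributes, and rules below are made up solely to show the shape of the approach.

```python
# Hypothetical views of the same server held in three systems.
sources = {
    "cmdb":        {"srv-01": {"location": "L340", "owner": "Ops"}},
    "asset_repo":  {"srv-01": {"location": "M345", "owner": "Ops"}},
    "fixed_asset": {"srv-01": {"location": None,   "owner": "Finance"}},
}

# Business rule: trust the asset repository first for location, the CMDB first
# for owner, and fall back through the remaining sources for missing values.
PRECEDENCE = {
    "location": ["asset_repo", "cmdb", "fixed_asset"],
    "owner":    ["cmdb", "asset_repo", "fixed_asset"],
}

def reconcile(server):
    """Build a single consistent view of one server from the ranked sources."""
    golden = {}
    for attribute, ordered_sources in PRECEDENCE.items():
        for source in ordered_sources:
            value = sources[source].get(server, {}).get(attribute)
            if value:  # take the first non-empty value in precedence order
                golden[attribute] = value
                break
    return golden

print(reconcile("srv-01"))  # {'location': 'M345', 'owner': 'Ops'}
```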

 

Managing the Data Lifecycle

In business, we rely on applications, services, and data to improve the efficiency and effectiveness of the organization. One side effect of having such sophisticated capabilities is assuming we can now analyze all the data that enters our IT systems; the problem that surfaces consistently is losing track of key aspects of the data and what it reveals about our organization. Real insight comes from having a complete view of the data from its creation to its retirement. This data lifecycle should be our focus, since it helps us better understand the data’s actual value and impact on operations, as well as how it can be used to ensure the organization can grow and prosper.

The Data Lifecycle

There are three main areas to understand with respect to the lifecycle of data once it is created:

  • Maintenance
  • Entitlement
  • Retirement

Each area is important to the support of operational activities, which ultimately contribute to positive business outcomes. Understanding the entire data lifecycle can provide valuable insight into how to run the organization better.

Maintenance

Maintenance of data is a vast area that demands many resources. It involves understanding what should be remediated versus what could be eliminated.

The key to any data maintenance plan is a data quality improvement strategy that spans the entire organization, so that every available type of data source can be leveraged. The more frequently these maintenance activities are performed, the more likely it is that the organization can put its data to beneficial use.
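
A maintenance pass often starts as a simple triage of records into “fix” and “drop” buckets. The rules and field names below are purely illustrative assumptions.

```python
# Hypothetical records with a status and a staleness measure.
records = [
    {"name": "srv-01", "status": "active",  "days_since_verified": 30},
    {"name": "srv-02", "status": "active",  "days_since_verified": 400},
    {"name": "old-fs", "status": "retired", "days_since_verified": 900},
]

remediate, eliminate = [], []
for record in records:
    if record["status"] == "retired":
        eliminate.append(record["name"])        # no longer worth maintaining
    elif record["days_since_verified"] > 180:
        remediate.append(record["name"])        # still in use, but needs re-verification

print("Remediate:", remediate)  # ['srv-02']
print("Eliminate:", eliminate)  # ['old-fs']
```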

Entitlement

Entitlement is a different type of effort. It focuses on making sure the right people have access to the right data.

Only the appropriate, authorized people will fully understand the meaning of the data they view or consume, because they have the contextual knowledge needed to interpret what it means. For example, someone with a strong security background might detect data modification patterns that suggest a possible breach, or a systemic issue in how the data is being maintained. Data at the discrete or elemental level might not convey this message unless individuals with knowledge of the entire lifecycle can understand and translate it, so others can take appropriate action.
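
At its simplest, entitlement reduces to an explicit mapping of who may see which data sets, checked before anything is served. The roles and data set names here are hypothetical.

```python
# Hypothetical mapping of roles to the data sets they are entitled to see.
ENTITLEMENTS = {
    "security_analyst": {"audit_log", "asset_inventory"},
    "finance_analyst":  {"fixed_assets"},
}

def can_access(role, data_set):
    """Allow access only if the role is explicitly entitled to the data set."""
    return data_set in ENTITLEMENTS.get(role, set())

print(can_access("security_analyst", "audit_log"))  # True
print(can_access("finance_analyst", "audit_log"))   # False
```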

Retirement

Last is the retirement of data, which in some cases is the most vital step.

When we recognize that data or a data source is no longer adding value to the organization, it must be disposed of or eliminated from further consideration to avoid negatively skewing or influencing decisions. In addition, the timely retirement of data sources provides valuable information about the data itself, as well as the usefulness and longevity of the associated data sources. Failing to retire such data may also explain why decisions based on it have not matched expectations.
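
One simple way to surface retirement candidates is to track when a data source was last updated and last consumed, and flag anything idle on both counts for longer than an agreed period. The sources, fields, and one-year threshold below are assumptions for illustration.

```python
from datetime import datetime, timedelta

now = datetime(2024, 6, 1)
RETIRE_AFTER = timedelta(days=365)  # illustrative idle threshold

data_sources = [
    {"name": "legacy_cmdb", "last_updated": datetime(2022, 1, 15), "last_queried": datetime(2022, 3, 1)},
    {"name": "asset_repo",  "last_updated": datetime(2024, 5, 30), "last_queried": datetime(2024, 5, 31)},
]

# Flag sources that no feed has updated and no consumer has queried within the threshold.
to_retire = [
    s["name"] for s in data_sources
    if now - s["last_updated"] > RETIRE_AFTER and now - s["last_queried"] > RETIRE_AFTER
]
print("Candidates for retirement:", to_retire)  # ['legacy_cmdb']
```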

As everyone in IT knows, there is no shortage of data or of tools to interpret it, but it is critical to have a systematic process for managing the data lifecycle if data is to be fully leveraged. Without such tools and processes, data just becomes more noise in our environment and prevents us from reaching successful outcomes. Outcomes will only begin to change when we look at data in a more comprehensive and holistic manner that factors in the entire lifecycle; only then will we have all the information needed to help the organization make the best possible decisions.