The dimensions of data quality: completeness and duplication
A previous blog post provided an overview of the operational dimensions that are normally associated with data quality. To recap, these are:
- Integrity
- Accuracy
- Completeness
- Duplication
- Currency
- Consistency
This blog post will focus on data completeness and duplication.
Data completeness (or coverage)
Data completeness is intuitive. If a server is on the data center floor, but not in the Asset database, then the required referential data is incomplete. Danette McGilvray, in Executing Data Quality Projects, defines data coverage as “A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.” Data may also be incomplete if, for example, the Asset ID is included, but not the serial number or the vendor name. As the Data Management Body of Knowledge states, “One expectation of completeness indicates that certain attributes always have assigned values in a data set.”
Data completeness in IT asset management is challenged in various ways. If multiple processes are used to procure assets, then they all need to update the asset data, but this may be easier said than done, especially in organizations with some degree of federation (e.g., multiple P&Ls). Even with one consistent process, that process must be followed correctly – process non-compliance will often result in data gaps (data gaps, conversely, are often evidence that processes are not being followed).
Data completeness can be evaluated through various means:
- Exception reporting
- Reconciliation with other sources
- Inventory activities
Exception reporting can easily identify missing attributes: a report simply asks, “show me all the records where this field is empty.” It is more difficult for exception reporting on a single data store to identify missing rows, because a record that was never created leaves nothing to query; this is where reconciliation comes in. By comparing (for example) an IT asset database with a monitoring system, we can ask, “What systems are in one or the other, but not both?” Those systems can then drive an investigation process.
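To make the two checks concrete, here is a minimal Python sketch. The field names (asset_id, hostname, serial_number, vendor), the sample records, and the function names are all assumptions made up for illustration; a real asset database and monitoring system would have their own schemas and would be queried rather than hard-coded.

```python
# Hypothetical sample data; the field names are assumptions for illustration.
assets = [
    {"asset_id": "A-001", "hostname": "web01", "serial_number": "SN123", "vendor": "Acme"},
    {"asset_id": "A-002", "hostname": "db01", "serial_number": "", "vendor": "Acme"},
    {"asset_id": "A-003", "hostname": "app01", "serial_number": "SN789", "vendor": None},
]
monitored_hosts = ["web01", "db01", "cache01"]

REQUIRED_FIELDS = ("asset_id", "serial_number", "vendor")


def missing_attributes(records, required=REQUIRED_FIELDS):
    """Exception report: map each asset ID to its empty required fields."""
    report = {}
    for record in records:
        # Treat None, "" and whitespace-only values as empty.
        empty = [f for f in required if not str(record.get(f) or "").strip()]
        if empty:
            report[record.get("asset_id") or "<no id>"] = empty
    return report


def reconcile(asset_hosts, monitoring_hosts):
    """Reconciliation: hosts that appear in one source but not the other."""
    in_assets, in_monitoring = set(asset_hosts), set(monitoring_hosts)
    return {
        "only_in_assets": sorted(in_assets - in_monitoring),
        "only_in_monitoring": sorted(in_monitoring - in_assets),
    }


field_gaps = missing_attributes(assets)
row_gaps = reconcile([a["hostname"] for a in assets], monitored_hosts)
print(field_gaps)
print(row_gaps)
```

The first function can only find gaps in records that exist. The second is what surfaces a host such as cache01, which monitoring can see but the asset database has never heard of.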
Finally, physical inventories remain an important tool, especially for data centers that support multiple organizations, processes and/or technologies. While time-consuming, they nevertheless provide an effective “ground truth,” at least at a point in time, for data completeness.
Data duplication
Data duplication is also a challenging topic for IT asset managers. There are four primary categories of IT asset data sources:
- The IT asset database itself
- The Configuration Management database or system (if separate from the ITAM system)
- The corporate Fixed Asset system
- Monitoring and element management systems of various kinds
As data is consolidated from various sources, it is easy for duplication to occur. On certain levels computers remain remarkably limited, and if a serial number or another key attribute is slightly different, two records may be created for the same device. This is not unusual; for example, a serial number created from an invoice may include leading zeroes where the same serial number reported by a discovery tool does not have them – or vice versa.
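One common way to reduce this kind of duplication is to normalize key attributes before comparing them. The sketch below is only an illustration: the normalization rules (uppercase, drop punctuation, strip leading zeros) and the function names are assumptions, not a standard, and stripping zeros can wrongly merge serial numbers in which the zeros are significant. The output is therefore best treated as a list of candidates for review rather than confirmed duplicates.

```python
from collections import defaultdict


def normalize_serial(serial):
    """Normalize a serial number for matching (assumed rules, not a standard)."""
    cleaned = "".join(ch for ch in serial.upper() if ch.isalnum())
    # Strip leading zeros, but keep an all-zero value rather than emptying it.
    return cleaned.lstrip("0") or cleaned


def find_duplicate_candidates(records):
    """Group (source, serial) pairs that share a normalized serial number."""
    groups = defaultdict(list)
    for source, serial in records:
        key = normalize_serial(serial)
        if key:  # skip blank serials; those are a completeness problem
            groups[key].append((source, serial))
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical records from two sources.
records = [
    ("invoice", "00012345"),
    ("discovery", "12345"),
    ("discovery", "ab-778"),
    ("invoice", "AB778"),
    ("invoice", "99"),
]
candidates = find_duplicate_candidates(records)
print(candidates)
```

An exact-match comparison would treat all five of these records as distinct devices; after normalization, two pairs are flagged for a closer look.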
Defining logical configuration items (IT assets and services) that require some alignment is particularly challenging. If one team calls the application “Quadrex,” but another team refers to it as “QDX Billing,” both may appear in inventory consolidation attempts. This is a good example of why having a firm handle on application and service naming is so essential. Exception reporting can show such duplication in some cases; for example, if Quadrex and QDX Billing are both shown to run on the same infrastructure, perhaps they should be investigated.
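The "same infrastructure" check can be sketched as a comparison of the host sets recorded for each application name. Everything here is an assumption for illustration: the host names, the function name, and the choice of Jaccard overlap (shared hosts divided by all hosts) with a 0.8 threshold. Two different applications can legitimately share servers, so a high score is a prompt to investigate, not proof of duplication.

```python
from itertools import combinations


def shared_infrastructure_candidates(app_hosts, min_overlap=0.8):
    """Flag pairs of application names whose host sets overlap heavily."""
    flagged = []
    for name_a, name_b in combinations(sorted(app_hosts), 2):
        hosts_a, hosts_b = app_hosts[name_a], app_hosts[name_b]
        union = hosts_a | hosts_b
        if not union:
            continue  # neither application has any recorded hosts
        overlap = len(hosts_a & hosts_b) / len(union)
        if overlap >= min_overlap:
            flagged.append((name_a, name_b, overlap))
    return flagged


# Hypothetical application-to-host mapping from a consolidated inventory.
app_hosts = {
    "Quadrex": {"srv01", "srv02"},
    "QDX Billing": {"srv01", "srv02"},
    "Payroll": {"srv09"},
}
flagged = shared_infrastructure_candidates(app_hosts)
print(flagged)
```

With this sample data only the Quadrex and QDX Billing pair is flagged, since the two names map to exactly the same servers.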
Data completeness and data duplication are two of the most obvious dimensions of data quality. Problems in either dimension are typically quite evident to users and can erode trust in the IT asset repository. It’s therefore important to develop a systematic approach to managing them, including appropriate automation.
While simple “exact match” checks can result in reconciliation failures and duplication, Blazent uses more advanced approaches proven during billions of reconciliations to identify unwanted redundancy and critical data gaps. When you combine this level of automation with solid processes that not only fix inaccurate data, but also fix the causes of its inaccuracy, you have a solid and valuable IT asset management capability.