Everything you need to know about Oracle and Java licensing

What You Don’t Know about Oracle and Java Licenses Could Cost You Millions

A breaking story at the end of 2016, as Gavin Clarke reported in The Register, is Oracle’s increasingly aggressive enforcement of Java software licenses. It has increased staff in its License Management Services, and longtime Java users are receiving bills requiring hefty per-user and per-processor payments.

It’s an open secret in the IT industry that there is big money in licensing true-ups (that is, making sure you pay for what you actually use): tens and sometimes hundreds of millions of dollars are often at stake. The worst situations emerge from misunderstandings:

  • Thinking software is free when it isn’t
  • Changing your usage style (e.g., moving the license from a physical to a virtual machine) and assuming there is no change in your licensing obligation

Java is a classic example of the first misunderstanding. Because it can be easily downloaded and there seems to be no mechanism in place for billing, people assume that it’s free. If you are a small organization, then you may well be able to sail beneath the radar for the foreseeable future. Larger scale use, however, is more difficult to conceal, especially when people are hired to discover whether you’ve paid for your Java licenses.

There are multiple products typically downloaded with Java, some of them free, some not. As Clarke stated, “There’s no way to separate the paid Java SE sub-products from the free Java SE umbrella at download as Oracle doesn’t offer separate installation software.”

More broadly, different parts of Java may or may not be free, but at any point that is Oracle’s decision: it owns Java and can do what it likes, including changing the terms and conditions for future versions. Oracle has now apparently put in place effective audit methods and created a hit list of big users; if your use of Java is significant, you could receive an audit notice from Oracle any day.

This is not a new issue. During the past five years, reports have continually surfaced of companies paying massive audit penalties to IBM, Oracle, Adobe and others (just attend a SAM summit if you want to hear the stories). Companies moving from physical to virtual computing have incurred some of the biggest charges. For example, a large enterprise may have a long relationship with a major software vendor, who provides a critical software product used widely for many purposes. The price for this product is based on the power of the computer running it. A license costs less for a computer with 4 cores and 1 gigabyte of RAM than it would for a computer with 16 cores and 8 gigabytes of RAM. The largest computers naturally require the most expensive licenses.

During a three-year period, the enterprise virtualizes thousands of formerly physical computers, each of which had been running the vendor’s software; however, the physical computers were, in general, smaller machines. The new virtual farms were clusters of 16 of the most powerful computers available on the market. After a review, the vendor insisted that EACH of the thousands of instances of its software running on the virtual machines was liable for the FULL licensing fee applicable to the most powerful machine!

Even though each of the virtual machines could not possibly have been using the full capacity of the virtual farm, the vendor insisted that the contract did not account for that, and it was impossible to know whether any given VM had been using the full capacity of the farm at some point.
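The arithmetic behind such a claim can be sketched roughly as follows. Every figure and name here (the fee per core, the instance count, the core counts) is hypothetical and chosen only for illustration, and the pricing is simplified to cores alone; real contracts define their own metrics.

```python
# Hypothetical illustration of a capacity-based true-up; all numbers are invented.

def license_cost(instances: int, cores_charged: int, fee_per_core: float) -> float:
    """Total fee when every instance is charged for the given number of cores."""
    return instances * cores_charged * fee_per_core

INSTANCES = 2000        # assumed number of instances of the vendor's software
FEE_PER_CORE = 100.0    # assumed fee per core
OLD_SERVER_CORES = 4    # assumed size of each retired physical server
HOST_CORES = 16         # assumed size of the most powerful host in the farm

# What the enterprise expected to owe, based on the old physical servers.
expected = license_cost(INSTANCES, OLD_SERVER_CORES, FEE_PER_CORE)

# The vendor's reading: each instance is liable for the full fee of the largest host.
claimed = license_cost(INSTANCES, HOST_CORES, FEE_PER_CORE)

multiplier = claimed / expected
print(f"expected {expected:,.0f}, claimed {claimed:,.0f}, {multiplier:.0f}x higher")
```

Even with these modest invented figures the bill quadruples, which is why the licensing metric matters as much as the instance count.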

“True-up” charges in the millions of dollars have resulted. Major software vendors have earned massive amounts in such charges and continue to audit aggressively for these kinds of scenarios.

This is why managing your IT asset data is so important. You must genuinely understand what products you are using, and their associated licensing models. This knowledge starts with the data in your provisioning, configuration management and discovery tools, and it is no small feat to get all this information right. Motivated software auditors will find those redundancies, omissions and inaccuracies, and you must know about them before they do.

The difference between data quality and data integrity

Data Quality and Data Integrity Are Not the Same

The difference between data quality and integrity is important to understand if you want to improve the overall efficiency and effectiveness of your organization, since both are highly dependent on the data leveraged for day-to-day business decisions.

Data Quality

When the term “quality” is used in reference to data, it conveys a clear statement to the individual consuming it. This largely reflects the context in which it will be used, and therefore its intention and meaning must be clear. People tend to use terms like complete, relevant and consistent when describing data quality. The result of poor data quality, e.g. wrong and inconsistent data, is poor investments and excessively expensive operations.

Data Integrity

Integrity defines the accuracy and consistency of data, but it has additional definitions to distinguish it from quality. Data integrity relates to the validity of data for the period of time during which it is relevant from that source. When data is described as having integrity, it’s viewed as being genuine and resilient during a period of time and hence reliable for future use. To have integrity requires assurances that there are mechanisms in place to prevent accidental and/or intentional unauthorized modification of the data.

You can see how these two terms can be confused and intermingled since, at a single point in time, they are essentially identical in meaning. It’s during a period of time that they might diverge and integrity could be lost while quality is still maintained. If your data quality is bad, then you can never have data integrity for that source. This is because data integrity builds on the foundation that data quality provides; the resulting data integrity is what enables organizations to grow and deliver positive business outcomes.

Your organization cannot expect to improve business outcomes through data cleansing efforts if you don’t understand the difference between quality and integrity. Those types of efforts are:

Data Quality:

  • Data Dictionary – Create and maintain for most vital data
  • Data Cleaning – Ensure types and formats are as per the Data Dictionary
  • Data Completion – Identify missing data and update with correct values
  • Originating Sources – Use wherever possible and try to avoid secondary or tertiary sources
  • Reviews and Audits – Periodic and ad-hoc to identify discrepancies

Data Integrity:

  • Scientific Analysis – Execute statistical/mathematical analysis of data
  • Systems Analysis – Analyze how systems process data
  • Code Analysis – Changing implementation methods could introduce new patterns and trends
  • Architectural Design – Reliability of data transference
  • Organizational Structure – Credentials and authorizations across groups

As you can see from the sample of efforts listed, quality and integrity intersect in a variety of ways. It is your understanding of them that will enable you to implement improvement efforts that lead to a more efficient and effective organization. Your organization must make informed, accurate decisions, and it can only do so with data that is of a known quality and integrity.

Implications of poor data quality in healthcare

Organizations are constantly challenged to maintain the right level of data quality. This is especially true in a risk-averse industry such as healthcare, where decisions could literally mean the difference between life and death. In addition, ensuring the privacy of patient data and compliance with various regulations from HIPAA, HITECH, PSQIA and others is not only mandatory and complex, but also could be costly in fines and fees. Noncompliance is not a viable alternative when someone’s life could be at risk.

Whether it is a physician accessing patient records via a tablet at a bedside through a cloud service or administrators accessing data during normal hospital facility operations in a Data Center, the regulators require that the data be accurate and maintained with the proper level of governance as dictated by the law.

The metrics for complying with relevant regulations are specified in great detail. The collection and maintenance procedures for the data must ensure that healthcare professionals and organizations can deliver these metrics without additional scrutiny. This need is putting pressure on organizations to leverage supplemental tools to process data and improve their quality before decisions are made or regulations compromised. To achieve these regulatory goals and ensure high data quality for the healthcare industry, organizations must employ strong data cleansing routines.

Data cleansing is the identification and correction of corrupted, duplicate, missing or inaccurate data. Such data, especially when related to healthcare, cannot be wrong, inaccurate, incomplete or unrecognizable to the operations and processes that consume it. The ramifications of inaccurate data could affect patient safety, accurate reimbursement for services, and many other aspects of healthcare delivery. Data cleansing also identifies duplicate data, which directly affects the organization’s efficiency and effectiveness. For the organization to operate efficiently and to make accurate decisions that lead to positive outcomes, these activities and processes must be ingrained in daily operations.

Regulators are well aware of the challenges that organizations face with poor data quality, but their focus is on preventing what could happen when poor data quality exists and is used without knowledge of its accuracy. As demanding as it might be to abide by the HIPAA, HITECH and PSQIA requirements, these regulations are needed and organizational leaders must ensure they are fully compliant. One of the primary ways to do this is by defining and implementing a robust data cleansing strategy that not only addresses regulations, but also ensures data accuracy and privacy. Only then will healthcare organizations minimize the implications of poor data quality in their industry.

The interacting lifecycles: Service, Asset, and Technology Products

We have previously discussed the Asset lifecycle and its architecture. However, Asset is only one of four IT lifecycles:

  • Application services
  • Infrastructure services
  • Assets
  • Technology products

An Application Service is a business or market-facing product, consumed by people whose primary activities are not defined by an interest in Information Technology (IT). Examples of such users include a bank customer looking up her account balance, or an Accounts Receivable systems operator checking payment status; examples of the services themselves would include an Online Banking system or a Payroll system.

An Infrastructure Service is, by contrast, a digital or IT service primarily of interest to other digital or IT services/products. Its lifecycle is like that of the application service, except that the user is some other IT service. An example would be a private cloud, or a Storage Area Network system managed as a service, or the integrated networking system required for connectivity in a data center. Platform-as-a-Service and Infrastructure-as-a-Service offerings are also tracked here.

The Service Lifecycle is the end-to-end existence of either application or infrastructure service systems, from idea to retirement. Services imply ongoing support; they are live and operational.

An Asset is a valuable, tangible investment of organizational resources that is tracked against loss or misuse and optimized for value over time, which can sit unused and still have some value. Examples would include a physical server or other device, or a commercial software license. Whether assets can be virtual is a subject of debate and specific to the organization’s management objectives, but given the licensing implications of virtual servers, treating them as assets is not uncommon.

Finally, a Technology Product is a class of Assets, the “type” to the Asset “instance.” For example, the enterprise might select the Oracle relational database as a standard Technology Product. It might then purchase 10 licenses, which are Assets. A particular class of server (e.g. HPE ProLiant DL380) is another kind of technology product. Technology products are not services; an organization can acquire technology products by buying assets and not running them, in which case they are not services. Conversely, a service may use multiple technology products (e.g. a private cloud service based on VMware, Cisco, and Dell hardware and software assets).

Technology products can be complicated to explain, and many people who are starting out on their CMDB (Configuration Management Database) journey simply don’t understand why you need all these concepts. But trying to reduce technology products to services or assets always leads to confusion. The correct way to think of it is that services are composed of assets, which are instances of technology products. Services are supported – someone is “on call” – while technology products are passive and only made available through asset instances that support operational services.
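One way to picture the relationship is as a small data model. This is a minimal sketch and not a CMDB schema; the class names, the "Billing" service, the "dba-team" on-call group and the asset IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TechnologyProduct:
    """The "type": a class of asset, such as a database product or server model."""
    name: str


@dataclass
class Asset:
    """The "instance": one tracked license or device of a given technology product."""
    asset_id: str
    product: TechnologyProduct


@dataclass
class Service:
    """A live, supported offering composed of assets."""
    name: str
    on_call: str
    assets: list[Asset] = field(default_factory=list)

    def products_used(self) -> set[str]:
        """Names of the technology products this service depends on."""
        return {asset.product.name for asset in self.assets}


oracle_db = TechnologyProduct("Oracle relational database")
server_model = TechnologyProduct("HPE ProLiant DL380")

# Ten purchased licenses are ten assets of one technology product.
licenses = [Asset(f"LIC-{i:03d}", oracle_db) for i in range(1, 11)]
server = Asset("SRV-001", server_model)

# A service uses some of those assets; the unused licenses remain assets only.
billing = Service("Billing", on_call="dba-team", assets=[licenses[0], server])
print(billing.products_used())
```

Keeping the three concepts separate lets you ask lifecycle questions at the right level, such as which services are affected when a technology product goes off support.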

The lifecycles drive much activity because they often don’t line up. Services use multiple kinds of technology products, represented by assets.


One of the reasons that IT management can be so painful is the time and effort spent “lining up the lifecycles.” As an example, assume a database is no longer supported for version 9, and so a decision to start deployment of version 10 is taken, but version 10 only runs on 64-bit architectures, so it’s not just a matter of upgrading the database license. The entire server strategy needs to be considered, and there will also be ripple effects up into the infrastructure services (e.g. internal cloud) and ultimately the application layer. Before you know it you’re in a discussion about building out a whole new water-cooled, DC-powered wing of the data center, all because the database version went off support.

Another example might be an application vendor requiring a patch, but again, that patch has only been released for versions of the application that are certified on more modern infrastructure and so the entire technical stack gets driven into change from the top-down.

One thing is clear: having solid IT asset data is essential to navigating these commonplace complexities. If you have a complete inventory of your assets, you can understand the interacting lifecycles and make appropriate plans by combining your asset data with lifecycle data; you can comfortably build long-horizon forecasts and strategic technical plans. But this planning all starts with having data that is complete, accurate and clean, which is why CMDBs are so critical to a smoothly operating IT ecosystem.

The Dimensions of Data Quality: Currency and Consistency

4th of a 4-part series

Previous posts have provided an overview of data quality drivers and their associated dimensions. To recap, the dimensions covered include:

  • Integrity
  • Accuracy
  • Completeness
  • Duplication
  • Currency
  • Consistency

This post presents more detail on what is meant by data currency and consistency.

Data currency

Data currency is not a financial reference; it’s a temporal reference: “the degree to which the data is current with the world it models,” as the Data Management Body of Knowledge (DMBOK) suggests. You may have once had the right information about a server or other IT asset in your database, but then it was renamed, moved or re-purposed. The data is not current and must be updated.

Updates can be manual or automatic. They can take place on an as-needed basis or they can be scheduled periodically; it all depends on your business requirements. The business rules that define your approach to this are called “data currency” rules.
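A data currency rule can be as simple as a maximum age for the last verification of a record. The sketch below assumes a hypothetical `last_verified` field and an arbitrary 90-day threshold; your own rules would set both.

```python
from datetime import date, timedelta


def stale_records(records, today, max_age_days):
    """Return the IDs of records not verified within max_age_days of today."""
    cutoff = today - timedelta(days=max_age_days)
    return [r["asset_id"] for r in records if r["last_verified"] < cutoff]


# Invented sample data for illustration only.
records = [
    {"asset_id": "SRV-001", "last_verified": date(2017, 1, 10)},
    {"asset_id": "SRV-002", "last_verified": date(2016, 6, 1)},
]

stale = stale_records(records, today=date(2017, 2, 1), max_age_days=90)
print(stale)  # records that need to be re-verified
```

The same function could feed a scheduled report or trigger a discovery run, depending on whether your updates are manual or automatic.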

Data currency is a familiar issue for IT asset repositories and configuration management databases (CMDBs). Too often, such repositories are funded as one-time capital projects, which means it’s a challenge to obtain support for ongoing operations. Without steady-state operational processes in place to maintain it, the data will inevitably decay, and the entire capital investment is then at risk as the repository loses credibility.

Guarding against this requires strong executive support for, and a continuous improvement approach to, data quality issues. The value of keeping the repository up to date should be assessed, and conversely the costs and risks of data inaccuracy must also be factored for long-term success. 100% data accuracy and currency is neither affordable nor required – but what is an acceptable level? 95%? 98%? Only you can answer this question for your organization, in terms of such factors as:

  • Prolonged system outages
  • Redundant researching of the same IT data
  • Increased security and financial risk due to lack of awareness
  • Other risks, many of which may not be obvious

Data consistency

Data consistency is related to both data integrity and data currency. It applies whenever data is maintained in two places; DMBOK summarizes it as “ensuring that data values in one data set are consistent with values in another data set.” If you designate one system, such as the CMDB/ITAM (IT Asset Management) repository as the system of record for servers, and another system (for example, a monitoring system) as a downstream replica, then you should run periodic consistency checks. Are the servers in the monitoring system still recorded in the CMDB? Are there disposed servers in the CMDB that the monitoring system still records? Are there servers in the monitoring system that have never been entered into the CMDB?
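The three questions above translate naturally into set comparisons. This is only a sketch with invented host names and a hypothetical "active"/"disposed" status field; a real check would read from the CMDB and the monitoring system.

```python
# System of record: host name -> lifecycle status (hypothetical values).
cmdb = {"srv-a": "active", "srv-b": "disposed", "srv-c": "active"}

# Downstream replica: hosts the monitoring system currently reports.
monitoring = {"srv-b", "srv-c", "srv-d"}

# Servers disposed of in the CMDB that the monitoring system still records.
disposed_but_monitored = sorted(h for h in monitoring if cmdb.get(h) == "disposed")

# Servers in the monitoring system that were never entered into the CMDB.
unknown_to_cmdb = sorted(monitoring - set(cmdb))

# Active servers in the CMDB that the monitoring system does not see.
active_but_unmonitored = sorted(
    h for h, status in cmdb.items() if status == "active" and h not in monitoring
)

print(disposed_but_monitored, unknown_to_cmdb, active_but_unmonitored)
```

Each non-empty list is an exception to investigate, not an automatic correction.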

Data can be assessed for consistency in terms of its relation to data in other systems, or even within the same overall system. For example, a high-performance system that may cache frequently accessed data also runs the risk of potential consistency issues.

All such issues, whether between or within systems, should be flagged through exception reporting and investigated, then fixed. The process gaps that caused them to occur must also be investigated and fixed; perhaps discovery needs to be run more frequently, or different approaches to identifying exceptions are needed. Finding and fixing such problems may require a few rounds of experimentation, but these are baseline requirements for any efficiently run IT infrastructure.


Data quality is not a mysterious art, but rather a defined set of practices that are well established in the data management professional community. This set of articles has looked at the six dimensions of data quality:

  • Integrity
  • Accuracy
  • Completeness
  • Duplication
  • Currency
  • Consistency

By understanding their definitions, and developing clear methods for measuring and improving them, you can add significant value to your CMDB and IT Asset repositories, the IT service management processes they support, and ultimately your business, with its increasing dependence on digital technology. Companies that learn this important lesson and stay one step ahead of their IT requirements are those that thrive. At its core, this is dependent on the quality of the underlying data.

The dimensions of data quality: completeness and duplication

A preceding blog provided an overview of the operational dimensions that are normally associated with data quality. To recap, these are:

  • Integrity
  • Accuracy
  • Completeness
  • Duplication
  • Currency
  • Consistency

This blog post will focus on data completeness and duplication.

Data completeness (or coverage)

Data completeness is intuitive. If a server is on the data center floor, but not in the Asset database, then the required referential data is incomplete. Danette McGilvray, in Executing Data Quality Projects, defines data coverage as “A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.” Data may also be incomplete if, for example, the Asset ID is included, but not the serial number or the vendor name. As the Data Management Body of Knowledge states, “One expectation of completeness indicates that certain attributes always have assigned values in a data set.”

Data completeness in IT asset management is challenged in various ways. If multiple processes are used to procure assets, then they all need to update the asset data, but this may be easier said than done, especially in organizations with some degree of federation (e.g., multiple P&Ls). Even with one consistent process, that process must be followed correctly – process non-compliance will often result in data gaps (data gaps, conversely, are often evidence that processes are not being followed).

Data completeness can be evaluated through various means:

  • Exception reporting
  • Reconciliation with other sources
  • Inventory activities

Exception reporting can easily identify missing attributes with a written report stating “show me all the data where this field is empty.” It is more difficult for exception reporting on a single data store to identify missing rows; this is where reconciliation is added. By comparing (for example) an IT asset database with a monitoring system, we can ask, “What systems are in one or the other, but not both?” Those systems can then drive an investigation process.
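Both checks are easy to sketch. The records, field names and IDs below are invented for illustration; the first function answers “show me all the data where this field is empty,” and the symmetric difference answers “what systems are in one or the other, but not both?”

```python
# Invented sample rows from a hypothetical IT asset database.
assets = [
    {"asset_id": "A1", "serial": "SN100", "vendor": "HP"},
    {"asset_id": "A2", "serial": "", "vendor": "Dell Inc."},
    {"asset_id": "A3", "serial": "SN300", "vendor": None},
]

# IDs reported by a hypothetical monitoring system.
monitored_ids = {"A1", "A3", "A4"}


def missing_field(rows, field_name):
    """Exception report: IDs of rows where the given field is empty or absent."""
    return [r["asset_id"] for r in rows if not r.get(field_name)]


# Reconciliation: IDs present in exactly one of the two sources.
asset_ids = {r["asset_id"] for r in assets}
in_one_but_not_both = sorted(asset_ids ^ monitored_ids)

print(missing_field(assets, "serial"), missing_field(assets, "vendor"))
print(in_one_but_not_both)
```

The exception report finds missing attributes within one store, while the reconciliation surfaces missing rows that a single store cannot reveal about itself.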

Finally, physical inventories remain an important tool, especially for data centers that support multiple organizations, processes and/or technologies. While time-consuming, they nevertheless provide an effective “ground truth,” at least at a point in time, for data completeness.

Data duplication

Data duplication is also a challenging topic for IT asset managers. There are four primary categories of IT asset data sources:

  • The IT asset database itself
  • The Configuration Management database or system (if separate from the ITAM system)
  • The corporate Fixed Asset system
  • Monitoring and element management systems of various kinds

As data is consolidated from various sources, it is easy for duplication to occur. On certain levels computers remain remarkably limited, and if a serial number or another key attribute is slightly different, two records may be created for the same device. This is not unusual; for example, a serial number created from an invoice may include leading zeroes where the same serial number reported by a discovery tool does not have them – or vice versa.
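A small normalization step before matching can catch the leading-zero case. The rule below (trim, uppercase, strip leading zeroes) is an assumed matching rule for illustration, not a universal one, and the serial numbers are invented.

```python
def normalize_serial(serial: str) -> str:
    """Trim whitespace, uppercase, and drop leading zeroes (an assumed rule)."""
    return serial.strip().upper().lstrip("0")


def group_by_serial(records):
    """Map each normalized serial number to the sources that reported it."""
    groups = {}
    for source, serial in records:
        groups.setdefault(normalize_serial(serial), []).append(source)
    return groups


# Invented sample data: (source, serial number as recorded).
records = [
    ("invoice", "00123ABC"),
    ("discovery", "123abc"),
    ("discovery", "987XYZ"),
]

groups = group_by_serial(records)

# Serial numbers reported more than once are candidate duplicates.
duplicates = {serial: sources for serial, sources in groups.items() if len(sources) > 1}
print(duplicates)
```

Because stripping zeroes could in principle merge two genuinely different serial numbers, matches found this way are best treated as candidates for review.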

Defining logical configuration items (IT assets and services) that require some alignment is particularly challenging. If one team calls the application “Quadrex,” but another team refers to it as “QDX Billing,” both may appear in inventory consolidation attempts. This is a good example of why having a firm handle on application and service naming is so essential. Exception reporting can show such duplication in some cases; for example, if Quadrex and QDX Billing both are shown to run on the same infrastructure, perhaps they should be investigated.
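An exception report of that kind might look like the following sketch, which flags application names that map to exactly the same set of hosts. The host names and the "Payroll" entry are invented, and identical infrastructure is only a hint, not proof, of duplication.

```python
from collections import defaultdict


def apps_sharing_hosts(app_hosts):
    """Group application names that run on exactly the same set of hosts."""
    by_hosts = defaultdict(list)
    for app, hosts in app_hosts.items():
        by_hosts[frozenset(hosts)].append(app)
    return [sorted(apps) for apps in by_hosts.values() if len(apps) > 1]


# Invented sample data: application name -> hosts it was found on.
app_hosts = {
    "Quadrex": ["srv-01", "srv-02"],
    "QDX Billing": ["srv-02", "srv-01"],
    "Payroll": ["srv-09"],
}

candidates = apps_sharing_hosts(app_hosts)
print(candidates)  # name groups worth investigating as possible duplicates
```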

Data completeness and data duplication are two of the most obvious dimensions of data quality. Problems in either dimension are typically quite evident to users and can erode trust in the IT asset repository. It’s therefore important to develop a systematic approach to managing them, including appropriate automation.

While simple “exact match” checks can result in reconciliation failures and duplication, Blazent uses more advanced approaches proven during billions of reconciliations to identify unwanted redundancy and critical data gaps. When you combine this level of automation with solid processes that not only fix inaccurate data, but also fix the causes of its inaccuracy, you have a solid and valuable IT asset management capability.

Things to ask IT during a merger or acquisition

Key topics to ask IT during an M&A

As a follow-up to a previous post regarding IT’s role in a Merger & Acquisition (M&A) process, it’s valuable to look deeper into what IT departments should specifically address to help M&As succeed. Many of the decisions that leaders make in this process require IT to seek out and provide high-quality information during the due diligence phase. More accurate and complete information will lead to more informed and presumably more profitable decisions, which should result in a higher likelihood that the merged organization’s ultimate corporate objectives will be reached. To ensure success, IT will need to provide leadership with information on these key topics:

Inventory & Capital Cost of Hardware Devices: Every reasonably sized organization has millions, if not tens or hundreds of millions of dollars invested in hardware infrastructure. These are vital financial assets in addition to being crucial to running operations. Corporate leaders need a precise understanding of what they are acquiring; they must know how many devices and of what types (from a financial perspective), even before IT Operations assesses them in terms of running the merged entity. Together these devices represent substantial capital value, and the figures reported for them must be accurate.

Inventory & Licensing of Technologies & Applications in Use: Similar in nature to the hardware inventory, but somewhat more difficult to gather, is the list of applications and software technologies actively in use. These software packages and licensing arrangements often overlap or conflict with existing license agreements, and they must also be correlated with the hardware that supports them. This requires an ability to aggregate and normalize data from various management systems such that a consolidated view of the software environment is available for leadership.

Lower IT Infrastructure Costs: IT needs to provide information related to the possible consolidation of IT resources, technologies and facilities while still delivering the requisite ongoing services. This can only be done with a complete accounting of the devices and software technologies discussed above. Once gathered, IT must leverage all of that data to analyze the current and projected services before providing management with recommendations for potential cost savings.

Resource & Operational Efficiencies: There are areas of IT Operations that can more easily be consolidated which ultimately enable the potential for some headcount reduction in the new joint entity. This requires that comprehensive operational data be gathered that demonstrates measurable operational efficiency and effectiveness gains. The IT personnel involved in due diligence must carefully generate reports on operational metrics, such as call volumes, Service Level Agreements, incident response details, as well as change volumes and failure rates for analysis.

Increase Procurement Power: If the two organizations eventually merge, there is the potential for benefits from increased leverage over vendor and service provider contracts. For example, in cases where volume thresholds are crossed or enterprise licenses might be available, it might be possible to lower future discrete costs. This requires IT personnel to recognize that data about technologies, vendors and licenses may be similar or even identical in the two organizations. Often the items are the same but are defined and described differently within the two organizations. Missed opportunities to normalize the two independent companies’ data could cost millions over the long term.

The ultimate goal for those in leadership positions executing a merger or acquisition is to ensure that the deal makes financial sense and achieves corporate objectives. That can only happen when company leaders have all the necessary data to describe the environment, which then allows them to make informed decisions. This is only possible when data quality is high; hence the vital role of IT efforts during an M&A. IT is the key provider of quality information, which directly affects the likelihood of success.

The role of IT in mergers and acquisitions

When organizations acquire or merge with another company, it is critical that they make informed decisions that will determine the level of success or failure of the venture. Information technology departments have always played an important role in M&A activity. The IT department of the acquiring company must assist in the assessment of the other company’s IT Assets, evaluating its IT operations and the accuracy of financial reporting relating to IT.

The company being acquired will have devices running in operations that must continue to deliver their services throughout the acquisition process. All devices must be fully integrated into the post-acquisition operational environment and maintained by the existing staff. It’s important to account fully for these devices’ details and ensure they are understood from both operational and inventory perspectives. Acquired technologies that are in conflict with the strategic direction of the current infrastructure might need to be replaced or factored into a revised strategic direction and should be considered as part of the due diligence process.

In parallel to evaluating and inventorying the Assets, IT should also evaluate the details of the operations at the company being acquired. Support structures, operating metrics, service level agreements, monitoring thresholds and security vulnerabilities, as well as all other decisions, tasks and actions that occur daily must be carefully documented and reviewed. The operational procedures then must be compared to the existing ones to determine if they should be adopted, replaced or coalesced.

Generally, companies acquire or merge for financial growth, but that’s not always the end result. The realization of growth is dependent upon the ability to report fully all of the financial aspects of the company being acquired. That requires IT to produce accurate reports on the cost of IT operations as well as other areas of the target company. Infrastructure or IT Asset data must be accurate and complete before being assessed. The assessments must be comprehensive and up-to-date to be useful to mitigate risks and avoid surprises as a result of bad information. The collected data must instill a high level of confidence for leadership to make decisions. Otherwise, the M&A will fail or force leadership to reconsider its decision.

IT is a vital player in the successful execution of the acquisition of another company. It must produce all of the necessary data and provide insight into how that organization truly operates. Understanding this is important during due diligence because it helps to avoid risk and contributes to the overall valuation.

There are notable risks with any M&A if the information needed for the due diligence process is not accurate or not accessible. An accurate valuation and successful execution depend on these findings, which will be a major factor in how, or whether, the company proceeds with the acquisition.

The role of IT is to collect, consolidate and present the gathered facts to leadership, so it has a comprehensive overview of the target company to minimize risk and to make an informed decision on the proposed acquisition.

The dimensions of data quality: integrity and accuracy

In the previous blog post, I provided an overview of data quality and its dimensions. To recap, its dimensions are:

  • Integrity
  • Accuracy
  • Completeness
  • Duplication
  • Currency
  • Consistency

This second part will focus on data integrity and data accuracy.

Data integrity

Danette McGilvray, in Executing Data Quality Projects, describes data integrity as:

“The Data Integrity Fundamentals dimension of quality is a measure of the existence, validity, structure, content, and other basic characteristics of data. It includes essential measures such as completeness/fill rate, validity, lists of values and frequency distributions, patterns, ranges, maximum and minimum values, and referential integrity.”

To better understand data integrity, it is important to profile the data. Data profiling is a type of reporting that can shine a light on all the topics mentioned above. Essentially, data profiling provides a highly detailed view of what the data looks like:

  • How complete is the data in this column? Does every row have a value or are some blank?
  • What is the smallest value in a column? The largest?
  • Which columns have just a limited number of values? What are they? Are there any outliers (values that appear very infrequently relative to the others)?
  • If there are “parent” or “master” lists of values stored somewhere, then are all the “children” consistent? That is, if a “parent” table has 10 values, do “child” records respect that? Or, are there 15 distinct values in the children, some of which aren’t in the parent?

A data analyst can use SQL and reporting tools to answer these kinds of questions. However, when the problem increases to hundreds or thousands of tables and columns, more robust data-profiling tools are needed. Blazent includes robust functionality to profile data and then report exactly these kinds of integrity issues.
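For a single small table, the profiling questions above can be sketched in a few lines of Python. This is only an illustrative sketch, not how Blazent or any other profiling product works; the function names, the column names and the sample records are all assumptions invented for the example.

```python
from collections import Counter

def profile_column(rows, column):
    """Summarize one column: fill rate, min, max, and value frequencies."""
    values = [row.get(column) for row in rows]
    # Treat None and the empty string as "blank".
    filled = [v for v in values if v not in (None, "")]
    return {
        "fill_rate": len(filled) / len(values) if values else 0.0,
        "min": min(filled) if filled else None,
        "max": max(filled) if filled else None,
        "frequencies": Counter(filled),
    }

def orphaned_values(rows, column, master_values):
    """Return child values that do not appear in the parent (master) list."""
    return {row[column] for row in rows if row.get(column)} - set(master_values)

# Hypothetical asset records, invented for illustration only.
assets = [
    {"serial": "D1001", "vendor": "Dell Inc.", "ram_gb": 64},
    {"serial": "H2002", "vendor": "HP", "ram_gb": 128},
    {"serial": "D1003", "vendor": "", "ram_gb": 32},
    {"serial": "X9999", "vendor": "Dell Incorporated", "ram_gb": 64},
]
master_vendors = ["Dell Inc.", "HP"]

vendor_profile = profile_column(assets, "vendor")   # completeness and frequencies
ram_profile = profile_column(assets, "ram_gb")      # smallest and largest values
orphans = orphaned_values(assets, "vendor", master_vendors)  # parent/child check
```

The frequency counts double as a crude outlier report: a vendor name that appears once among thousands of records is worth a second look.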

Data accuracy

Data accuracy might seem similar to integrity, but it is a different question. Data can have integrity and still be inaccurate, and inaccurate data might appear to have integrity (at least, a data-profiling tool won’t catch it). How can this be?

Let’s look at an example.

Suppose you say that in the IT Asset system a given device’s vendor must appear in the master list of vendors (perhaps drawn from your Vendor Contracts or Accounts Payable system). If you don’t enforce this programmatically, then your system will likely start to have vendor names appearing that are not seen in the master list. This is an integrity issue.

Suppose, however, some device has “Dell Inc.” as the vendor name. “Dell Inc.” is, in fact, in the master list, but it’s wrong. The device is made by HP. This is a data accuracy issue.

Data profiling can catch some of these issues. For example, a profiling tool can identify a pattern in which Dell serial numbers all start with “D,” but HP serial numbers all start with “H.” In many cases, however, inaccurate data surfaces only through an audit. The individual record has to be checked by a human being against some other source of truth (such as the asset invoice), and the inaccuracy is then caught. Given the volumes most companies manage, it is rare that all data can be audited; usually, only small samples are subject to analysis. If a sample identifies a discrepancy, however, two actions should be taken:

1. The discrepancy should be fixed.

2. The reason (root cause) for the discrepancy should also be identified and fixed, if possible.

In this way, a continuous improvement process will increase data accuracy over time.
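The difference between the two checks can be made concrete with a minimal Python sketch. It assumes a hypothetical invoice lookup as the “source of truth”; the function names and records are invented for illustration, and a real audit involves a person checking documents rather than a dictionary.

```python
import random

def integrity_violations(records, master_vendors):
    """Integrity check: vendors that are missing from the master list."""
    return [r for r in records if r["vendor"] not in master_vendors]

def audit_sample(records, source_of_truth, sample_size, seed=0):
    """Accuracy check: compare a random sample against a trusted source."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    return [r for r in sample if source_of_truth.get(r["serial"]) != r["vendor"]]

# Hypothetical data: H2002 is recorded as Dell, but the invoice says HP.
master_vendors = {"Dell Inc.", "HP"}
records = [
    {"serial": "D1001", "vendor": "Dell Inc."},
    {"serial": "H2002", "vendor": "Dell Inc."},
    {"serial": "H2003", "vendor": "HP"},
]
invoices = {"D1001": "Dell Inc.", "H2002": "HP", "H2003": "HP"}

violations = integrity_violations(records, master_vendors)
# The sample covers every record here only because the example is tiny.
discrepancies = audit_sample(records, invoices, sample_size=len(records))
```

Every record passes the integrity check, because “Dell Inc.” is a valid vendor, yet the audit still flags H2002. Each flagged record then feeds the two actions above: fix the record, and look for the root cause.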

It’s important to have some quantified understanding of what the data accuracy is worth. Bad data drives rework and some level of risk, both of which can be quantified. Often, accuracy of more than 95% is not worth pursuing; some level of fine-tuning and cross-checking of data is expected during the course of operational processes. If large dollar amounts are at stake, however, then more aggressive countermeasures may be needed. Formal enterprise risk management provides the tools and language to identify when such actions might be needed (e.g. instituting increased auditing or integrity checking as a form of risk control).

Part 3 of this series will look at data completeness and duplication, which are two sides of the same coin.

The Dimensions of Data Quality

Data quality is a broad topic, and the data management community has worked hard to dissect and define it so that it’s accessible and usable. In the next few blogs, I’ll provide a deeper look at what is meant by data quality, and examine its component parts.

What is meant by data or information quality? Danette McGilvray, the author of Executing Data Quality Projects, defines it as “the degree to which information and data can be a trusted source for any and/or all required uses.” This is a good definition, as it focuses on the outcome of data quality as a practice. It would be counter-intuitive to trust data that was known to be inaccurate, outdated or redundant. Conversely, why would you NOT trust data that has been shown to be reliable and accurate?

Data quality is based on a number of dimensions, which represent different ways to manage and understand the quality of data. These dimensions include:

  • Integrity
  • Accuracy
  • Completeness
  • Duplication
  • Currency
  • Consistency

Data integrity is the most fundamental dimension and the one on which all other dimensions are based. Data integrity is the determinant of whether, on the whole, the data “makes sense” given what is known of the business and its requirements. Data integrity practices include profiling to identify unusual or outlying values, understanding expected distributions, and establishing and enforcing domains of value (i.e., what are the valid vendors of hardware or software).

Data accuracy is a different question from data integrity. A given record may satisfy integrity expectations and still simply be wrong: for example, the server is not at location L340; it’s at M345. Data profiling and exception reporting won’t uncover this error; you have to audit the data against reality.

Data completeness is self-explanatory. Is all the expected data there? Are servers or software sneaking into operational status with no repository record? Integrity checks can help, but you may still need some form of audit by testing the data against its reality through a formal inventory process. Reconciliation of multiple data sets becomes an essential tool for supporting this type of initiative.
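As a rough sketch of such a reconciliation, the Python below compares the identifiers held in a repository with those observed by some independent source, such as a physical inventory. The function name and the server identifiers are hypothetical.

```python
def reconcile(repository_ids, observed_ids):
    """Compare repository records with what was actually observed."""
    repo, seen = set(repository_ids), set(observed_ids)
    return {
        # Observed in reality but never recorded: a completeness gap.
        "missing_from_repository": sorted(seen - repo),
        # Recorded but not observed: possibly decommissioned or moved.
        "not_observed": sorted(repo - seen),
        "matched": sorted(repo & seen),
    }

# Hypothetical identifiers, invented for illustration only.
result = reconcile(
    repository_ids=["srv-01", "srv-02", "srv-03"],
    observed_ids=["srv-02", "srv-03", "srv-04"],
)
```

This only works when both data sets share a reliable key; when they do not, the matching problem becomes one of duplicate detection.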

Data duplication is the flip side of completeness. If a customer has two accounts, then he or she may be able to exceed his or her credit line. If a server appears twice, then the capital budget will be inaccurate. Simple duplication, where two records are identical, can be found relatively easily through basic reporting. The real challenge arises when the duplicated records are just different enough to evade simple reporting. Powerful algorithms such as those used by Blazent can identify potential duplicates, based on fuzzy or incomplete matches.
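To give a feel for fuzzy matching, here is a minimal Python sketch built on the standard library’s difflib. It is not Blazent’s algorithm, which is not described here; the normalization rule, the 0.85 threshold and the server names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(name):
    """Lowercase and drop punctuation so trivial differences disappear."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def potential_duplicates(names, threshold=0.85):
    """Return pairs of names whose similarity meets the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs

# Hypothetical server names; "0O1" contains a letter O typed in place of a zero.
names = ["SRV-NYC-001", "srv_nyc_001", "SRV-NYC-0O1", "SRV-LON-014"]
dupes = potential_duplicates(names)
```

Comparing every pair grows quadratically with the number of records, so a sketch like this suits a few thousand names at most. The output is a list of candidates for human review, not a list of confirmed duplicates.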

Data currency is the use of the necessary processes to keep the data current. The data may have once been correct, but not anymore. The server moved, or was decommissioned and sent to salvage. Data profiling over time can identify the “data spoilage” rate, that is, the expected half-life of a data set, and how often the processes will need to run to maintain data currency. Data quality analytics is an important leading indicator for currency issues. The IT Asset system may be behind the times, but the event monitoring system is no longer seeing a feed from Server X – so what happened to it?
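One way to put a number on the spoilage rate is to assume that records go stale at a constant proportional rate (exponential decay) and to estimate the half-life from a single audit. That decay model is an assumption made for this sketch, as are the function names; real data sets may spoil in bursts, for example during a data center move.

```python
import math

def half_life_days(stale_fraction, elapsed_days):
    """Estimate half-life from the fraction found stale after elapsed_days."""
    if not 0 < stale_fraction < 1:
        raise ValueError("stale_fraction must be between 0 and 1 (exclusive)")
    # Surviving fraction = 0.5 ** (elapsed_days / half_life), solved for half_life.
    return elapsed_days * math.log(0.5) / math.log(1 - stale_fraction)

def days_until_accuracy(target_accuracy, half_life):
    """Days after a refresh until the surviving fraction falls to the target."""
    if not 0 < target_accuracy < 1:
        raise ValueError("target_accuracy must be between 0 and 1 (exclusive)")
    return half_life * math.log(target_accuracy) / math.log(0.5)

# Hypothetical audit: 10% of records were found stale 30 days after a refresh.
estimated_half_life = half_life_days(stale_fraction=0.10, elapsed_days=30)
refresh_interval = days_until_accuracy(0.95, estimated_half_life)
```

The second function gives a rough refresh interval: how often the update process must run to keep the data above a chosen level of currency.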

Data consistency matters because data is often replicated for good, or at least unavoidable, reasons, even though it would be ideal to have all data in one place. If your IT Asset repository is distinct from your CMDB, then how do you keep them in sync? What about your Fixed Asset and monitoring systems? You might see server data in any of these, with subtle (or dramatic) differences. Reconciling this diversity into a common view is essential, and yet challenging. Detailed business rules may be required to deliver such a consistent, integrated resource.
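A simple form of such a business rule is a precedence order: when systems disagree about a field, trust them in a stated sequence. The Python below is a sketch under that assumption; the system names, fields and precedence order are hypothetical, and real rules are usually defined per field and are far more detailed.

```python
def field_conflicts(systems, fields):
    """Find fields on which systems disagree about the same asset.

    systems maps a system name to a dict of {asset_id: record}.
    """
    conflicts = {}
    asset_ids = set().union(*(records.keys() for records in systems.values()))
    for asset_id in sorted(asset_ids):
        for field in fields:
            seen = {
                name: records[asset_id].get(field)
                for name, records in systems.items()
                if asset_id in records
            }
            if len(set(seen.values())) > 1:
                conflicts[(asset_id, field)] = seen
    return conflicts

def resolve(conflict, precedence):
    """Pick the value from the most trusted system that holds the record."""
    for system in precedence:
        if system in conflict:
            return conflict[system]
    return None

# Hypothetical systems and records, invented for illustration only.
systems = {
    "asset_repo": {"srv-01": {"location": "L340", "owner": "Finance"}},
    "cmdb": {
        "srv-01": {"location": "M345", "owner": "Finance"},
        "srv-02": {"location": "L100", "owner": "HR"},
    },
}
conflicts = field_conflicts(systems, ["location", "owner"])
location = resolve(conflicts[("srv-01", "location")], ["cmdb", "asset_repo"])
```

Here the two systems agree on the owner of srv-01 but not on its location, and the rule sides with the CMDB. Whether that is the right choice is a business decision, not a technical one.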