By Ronald G. Ross
Founder, BRS
Business has a fundamental problem with data quality. In some places it’s merely painful; in others it’s near catastrophic. Why is the problem so pervasive? Why does it never seem to get fixed? Perhaps we’ve been thinking about the problem wrong. Time to take a fresh look.
Extracted from Business Knowledge Blueprints: Enabling Your Data to Speak the Language of the Business, 2nd ed., by Ronald G. Ross, 2020, 288 pp, https://www.brsolutions.com/business-knowledge-blueprints.html
The central flaw in the long-running discussion over data quality is literally its focus on ‘data’. Stored data is merely the system or database residue of things that have already happened in the business, a memory of past events.
Truly fixing ‘data quality’ problems requires a business perspective: a shift in focus from data design or data cleansing to the activity and knowledge of the business itself. Our sights should be set squarely on the context that produces the data.
Creating Data: More To It Than You Think
Consider what you’re doing when you create a piece of data. In some ways what you’re doing is quite profound. Think about it this way: The act of creating data is the act of creating a message to people in the future.
Recipients of the message might be only just milliseconds away – but they also might be weeks, months or even years away. Data isn’t just data; it’s an effort to communicate.
Normally we think of communication as either direct conversation or (in the spirit of the times) a flurry of text messages exchanged more or less in real time with people we know. In either case there’s usually a shared context within which the meaning of the messages can be interpreted, as well as more or less real-time exchange of clarifications.
What’s distinct about creating data is that you’re almost certainly not going to be face-to-face with the recipients of the message or connected live with them via an interactive network. That fact rules out body language (e.g., raised eyebrows or emoticons) and dialog (including grunts and groans – or more emoticons) to clarify what you mean. In that sense the communication is blind.
As a consequence, the data a worker creates literally needs to speak for itself. The emphasis needs to be on the effectiveness of communication – that is, on semantic quality.
Semantic quality focuses on whether the meaning of a message is clear. Just formatting data correctly and applying a few data constraints doesn’t get you there. If the meaning isn’t clear, a business communication won’t be properly understood. In other words, you need clarity for the concepts communicated – not just data quality.
The Role of Data/System Architectures and What Data Quality is Really About
Data Quality Nonsense: It’s entirely possible to assess your data quality as high even though the business communications that produced the data were confusing, contradictory, unintelligible, or otherwise ineffective. Rating data quality high when communication is poor is nonsense!
The quality of data in a data store can never be any better than the quality of the business communications that produced it. A systematic means to manage data at rest simply does not guarantee the vitality – the semantic health – of the business communications it supports.
Let’s take a step back. Data is literally blind communications to people in the future. That’s what it is; that’s what it’s for. Full stop. Because of that delay a secure, well-organized holding area is needed. So, IT professionals, hopefully guided by knowledgeable data architects, create data stores.
Unfortunately, typical data quality measures focus on the health of the data store content, rather than on the semantic quality of the intended business communications. Perhaps that serves a purpose for data management in isolation, but it misses the mark almost entirely in clarifying what practices produce good business communications in the first place. Typical data quality dimensions (e.g., completeness, uniqueness, timeliness, etc.) are:
- Retroactive rather than proactive
- Quantitative rather than qualitative
- Systemic rather than semantic
Worst of all, typical data quality dimensions implicitly take responsibility off the shoulders of those who create the data.
Forming High-Quality Business Communications
Want High-Quality Data? Rather than retroactively focusing on data already formed, you need proactive measures to form high-quality messages in the first place – no matter whether structured data or written business communication (‘unstructured data’).
What should the recipients of blind messages expect? They have the right to expect:
- High-quality evidence about what the content means.
- No need for any significant assumptions, whether unconscious or deliberate, to supplement that evidence.
- The content representing exactly the reality the evidence suggests.
What form does evidence available to recipients take?
- names, codes and words
- definitions
- business vocabulary
- business rules
The four dimensions of BRS Semantic Quality in the graphic arise directly from these four kinds of evidence, respectively. They provide the context for blind communications. They apply equally to structured data and to written business communication (‘unstructured data’).
The Readability Dimension of BRS Semantic Quality. Remember that data is a message, a communication, to people in the future. A readable message is one that is not encoded or cryptic, one whose meaning is not obscured by choice of signifiers (names, codes, or words).
Cryptic names and codes are rampant in IT systems; they are encouraged by programming languages, software platforms, and legacy computer tradecraft. Examples:
- PID-RAD2-TYPE. Who but programmers might know what that name represents?
- A coding scheme for the values of a field where ‘0’ stands for ‘no’ and ‘1’ stands for ‘yes’. Why?!
- The abbreviation ‘PT’. Without adequate evidence, this abbreviation could stand for many things, including the following (acks @StevenSarsfield):
  - PT Emp → part-time employee
  - PTCRSR → Personal Transportation Cruiser
  - Blk pt chassis → black platinum chassis
  - 24pt bk → manual published in 24-point type
  - 2 pt asbl → two-part assembly
  - 1 pt → one pint
  - LIS PT → Lisbon, Portugal
How you name things should always be based on natural-language ways of communicating about the things. Inadequate or misleading names, or ones that could easily be misconstrued, should be carefully avoided.
In subject matter of any complexity – which is to say virtually all business subject matter – word choice can make a huge difference in the ultimate effectiveness of a communication. There is simply no name like exactly the right name.
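To make the readability point concrete, here is a minimal Python sketch of replacing a cryptic ‘0’/‘1’ coding scheme with values that speak for themselves. The field name `PT_FLG`, the readable name `is_part_time_employee`, and the helper `decode_legacy_flag` are all hypothetical, invented purely for illustration; they are not taken from any real system.

```python
from enum import Enum


class YesNo(Enum):
    """Readable replacement for a hypothetical '0'/'1' coding scheme."""
    NO = "no"
    YES = "yes"


# Hypothetical legacy encoding, kept in one place so that readers of the
# data never have to interpret raw codes themselves.
LEGACY_CODES = {"0": YesNo.NO, "1": YesNo.YES}


def decode_legacy_flag(raw: str) -> YesNo:
    """Translate a cryptic legacy code into a self-describing value."""
    try:
        return LEGACY_CODES[raw.strip()]
    except KeyError:
        raise ValueError(f"unrecognized legacy code: {raw!r}") from None


# Hypothetical record with a cryptic field name and a coded value.
legacy_record = {"PT_FLG": "1"}

# The same fact, restated with a natural-language name and value.
readable_record = {
    "is_part_time_employee": decode_legacy_flag(legacy_record["PT_FLG"]).value
}
print(readable_record)
```

The translation itself is trivial. What matters is that a future reader of `readable_record` needs no lookup table and no programmer to understand the message.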
The Understandability Dimension of BRS Semantic Quality. Remember that data is a message, a communication, to other people. An understandable message uses only terms with solid business definitions. Example: Suppose in immunology someone calls something a site. A definition is missing. Does site refer to a location where a vaccination took place (e.g., a doctor’s office), or to an anatomical location where a vaccination was injected?
Miscommunication can easily result where definitions for terms are absent, unclear, imprecise, incomplete, and/or un-business-like. Defining things accurately is a central skill for professionals of all stripes.
The Precision Dimension of BRS Semantic Quality. Data is a message, a communication, to people in the future. A precise message is one that uses shared terms from a business vocabulary correctly.
Sometimes the choice of word for some concept in a message is simply wrong. Such usage can be highly misleading. Example:
Using extension to mean an offering of a product given to a prospect when the prospect clicks on an ad, rather than the official meaning, an additional period of time given to a prospect to accept an offer. (Yes, that’s a real example from one of our clients arising from social-media marketing vs. traditional marketing.)
Perhaps even worse is being inconsistent in usage – e.g., sometimes a word means one thing, and sometimes another. Such cases are called homonyms (one word or word phrase, but multiple meanings).
Other times a word can span a broad gray band of meaning. Example:
Using customer to mean anything from active customer to any party that has ever expressed even the slightest interest in the company’s products or services.
Terms (including synonyms) should always refer to only a single concept in a given context. For that you need a solid business vocabulary, which in turn requires a robust concept model (business ontology).
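One way to picture the ‘single concept per term per context’ guideline is a simple check over a vocabulary list. The sketch below is only an illustration under assumed inputs: the tuple layout, the context names, and the wording of the ‘customer’ entry are hypothetical, and a real business vocabulary would rest on a concept model rather than a flat list.

```python
from collections import defaultdict

# Hypothetical vocabulary entries: (context, term, concept the term stands for).
entries = [
    ("marketing", "extension",
     "additional period of time given to a prospect to accept an offer"),
    ("marketing", "extension",
     "offering of a product given to a prospect who clicks on an ad"),
    ("sales", "customer", "party that has placed at least one order"),
]


def find_homonyms(entries):
    """Return terms that stand for more than one concept within one context."""
    concepts = defaultdict(set)
    for context, term, concept in entries:
        concepts[(context, term.lower())].add(concept)
    return {key: sorted(found) for key, found in concepts.items() if len(found) > 1}


for (context, term), meanings in find_homonyms(entries).items():
    print(f"'{term}' has {len(meanings)} meanings in context '{context}'")
```

A check like this can only flag a conflict. Deciding which meaning is the official one remains a business decision.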
The Reliability Dimension of BRS Semantic Quality. Data is a message, a communication, to people in the future. A reliable message is one that complies with all relevant business rules.
Much confusion arises over business rules. Professionals who work with data/system architectures often have a technical view of them. That’s off-target. Business rules are not data rules or system rules. A true business rule is a criterion for running the business. Business rules are about business knowledge and business activity, not data – at least not directly.
I recently read the following statement about data quality: “Business rules capture accurate data content values.” No! Business rules are about running the business correctly.
If the business is run correctly, its business communications should be formed correctly. If its business communications are formed correctly, then the content of its data stores should be correct. So yes, business rules result in correct data, but more importantly correct data arises because business activity is conducted correctly in the first place.
In other words, data quality isn’t really about the quality of your data; it’s more about the quality of your business rules.
There is a fundamental difference between communicating in business terms vs. communicating in data-speak. Which do you think your business partners will prefer?! Two simple examples using business rules illustrate. Each example is first expressed by a clear textual business rule statement (using RuleSpeak), then as a corresponding data constraint.
- Business rule: A customer must have an assigned agent if the customer has placed an order. Expressed as a corresponding data constraint: A valid agent id is required in the assigned-agent field of a customer record if any order records are listed for that customer record.
- Business rule: The payee of a claim payment for a claim must be a party who made the claim. Expressed as a corresponding data constraint: The payee number, if any, listed in the payee field of a claim-payment record must be for one of the parties listed as having made the claim.
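The two rules above can also be sketched as checks written in business terms rather than in terms of fields and records. The following Python is a minimal illustration, not RuleSpeak tooling or any BRS product: the class names, attribute names, and function names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Customer:
    name: str
    assigned_agent: Optional[str] = None
    orders: List[str] = field(default_factory=list)


@dataclass
class Claim:
    claimants: Set[str]  # parties who made the claim


@dataclass
class ClaimPayment:
    claim: Claim
    payee: Optional[str] = None


def violates_agent_rule(customer: Customer) -> bool:
    """A customer must have an assigned agent if the customer has placed an order."""
    return bool(customer.orders) and customer.assigned_agent is None


def violates_payee_rule(payment: ClaimPayment) -> bool:
    """The payee of a claim payment for a claim must be a party who made the claim."""
    return payment.payee is not None and payment.payee not in payment.claim.claimants


customer = Customer("Acme", orders=["order-1"])  # has an order but no agent
claim = Claim(claimants={"party-1", "party-2"})
payment = ClaimPayment(claim, payee="party-9")   # payee did not make the claim
print(violates_agent_rule(customer), violates_payee_rule(payment))
```

Note that each docstring is the business rule statement itself. Each check also spans more than one business concept (customer and order, payment and claim), which is what separates such rules from single-field constraints.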
Sad to say, most discussions of data quality have been complicit in a vast oversimplification of ‘business rules’. Don’t be fooled by trivial examples. Samples:
- Data in a field is invalid because it violates some data type constraint – for example, social security numbers are found in a field for a person’s surname.
- Data in a field is invalid because it violates some minimum or maximum threshold – for example, a number greater than 99 is found in a percentile field.
Obviously, you do need constraints like these, but they barely scratch the surface of true business rules. They just happen to be easy to talk about because they involve values of only a single field. They’ll never get you even close to meaningful data quality.
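For contrast, here is roughly what the two trivial samples look like as code. This is a hedged sketch: the function names are hypothetical, the social security number pattern assumes the common US nine-digit format with optional hyphens, and the lower bound of 0 for a percentile is an assumption (only the upper bound of 99 comes from the sample above).

```python
import re

# Assumed US social security number shape: 3-2-4 digits, hyphens optional.
SSN_PATTERN = re.compile(r"\d{3}-?\d{2}-?\d{4}")


def surname_looks_like_ssn(surname: str) -> bool:
    """Flag a surname field whose entire content is shaped like an SSN."""
    return SSN_PATTERN.fullmatch(surname.strip()) is not None


def percentile_in_range(value: float) -> bool:
    """Check a percentile against assumed bounds of 0 through 99."""
    return 0 <= value <= 99


print(surname_looks_like_ssn("123-45-6789"), percentile_in_range(100))
```

Each function inspects exactly one value. Neither can tell whether a customer with orders has an agent, or whether a payee actually made the claim.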
Bottom Line
The four dimensions of BRS Semantic Quality get to root causes of ‘data quality’ problems, as well as of miscommunication in written or other business communications. Communicating about difficult subject matter is hard to begin with. Blind communication to people you can’t converse or interact with directly is the hardest of all. It requires order-of-magnitude sophistication in the techniques used to form the messages.
Find out what BRS can help you do about semantic quality: https://www.brsolutions.com