Data Culture Issues and How to Fix Them

A discussion of three major causes of data culture issues and potential mitigation techniques, as published in The Data Administration Newsletter.

If your organization uses data—and even if it doesn’t—you have a data culture. Think about the ways that you and your colleagues interact with and discuss data. Are people afraid of it? Do they trust it? Is it spoken about as a driver of business and competitive edge or just the exhaust of your existing operations?

Data culture is the organizational processes and social norms surrounding the production, use, and consumption of data. A poor data culture can lead to confusing communication, inconsistent decision-making, and non-actionable insights, while a good one promotes robust, actionable, and data-driven insights.

What should you look out for when assessing your data culture?

I have found three key indicators of poor data culture: fear of data, inconsistent use of vocabulary and metrics, and mistrust of data. In this article, I discuss the consequences of these issues and some solutions I have found effective in correcting them.

These data culture issues are largely consequences of having democratized analytical data. Democratization means having data available to the people who need it and who have the skill sets to derive meaningful insights from it. The debate between centralization and democratization of data is an entire discussion in its own right, but I have found strong benefits in allowing people to access and understand data for the issues they encounter in their daily work. In this article, I will also explore some of the danger zones of democratized data and their possible solutions.

Data Culture Issues

Fear of Data

Organizations of all kinds are moving towards data-driven insights. Data is a powerful tool that allows us to look at past performance and see aggregate trends that indicate future performance and direct decision-making. If people in an organization are afraid of data, however, they will not be inclined to use it.

Many people feel uncomfortable around data. They may not know how to use it, feel overwhelmed by it, or think of it as an unintuitive black box. They may be afraid of potentially breaking something. Even people who do understand data concepts may be afraid to venture into unfamiliar datasets. This hesitation means the data will not be used to its fullest extent: people who want to make decisions with data will avoid doing so because they are intimidated.

To tackle this issue, we must give people the appropriate tools to feel comfortable with the data. Up-front training and discussions are essential. Instead of handing licenses to new analysts and assuming their skills have prepared them to deal with your company’s data, provide short education sessions on not only the tool, but also the company’s data and best practices.

I tell everyone who takes this kind of training that the goal is not for them to remember everything I say, but to know when and how to ask questions. Data can be tricky, and there are often specific ways it can and (more importantly) cannot be used. Instead of telling them to memorize each scenario, I give them the tools to identify these situations and to sniff out when something seems amiss, then provide them with a variety of resources to understand and rectify the issue.

Up-front data education provides context and resources before new analysts have a chance to develop bad habits. They then feel empowered with their data, and they know whom they can approach with various questions. These training sessions make them feel part of the analyst community and help them feel comfortable discussing data issues with other analysts across the organization.

Inconsistent Use of Vocabulary and Metrics

Inconsistent use of vocabulary and metrics can easily lead to confusion in meetings. Here is the kind of scenario I have experienced many times:

People gather in a meeting to discuss last month’s sales and ensure that quarterly numbers are on track. Last month’s sales goal was $1.5M.

Director of Operations: “Last month’s sales were $1.7M! We’re definitely on track to make our quarterly numbers. Let’s stay the course for the rest of the quarter since it seems to be working!”

Director of Marketing: “Last month’s sales were $1.4M. We’re not very far off, but if we keep this pace we will miss our quarterly numbers. I recommend an increase in the marketing budget.”

Confusion ensues as the Director of Operations and the Director of Marketing try to justify their numbers. Finally, the meeting is put on hold while they each return to their analysts to understand the math. It comes to light that the Operations sales total did not take refunds into account, but the Marketing sales total did. No one is sure how refunds are factored into the monthly sales goal.

If this kind of interaction sounds familiar, it is indicative of a data culture that is inconsistent in its use of vocabulary and metrics. Time spent figuring out why numbers don’t add up or metrics are misaligned is time not spent making and executing decisions. Additionally, even if you are able to make a decision, you might not be making the right one. Without knowing how the monthly sales goal is calculated, it is impossible to know whether the organization should stay the course or invest more in marketing.
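To make the ambiguity concrete, here is a minimal sketch in Python (using pandas) of how both directors’ numbers could come from the same data. The transactions table, column names, and figures are invented for illustration; the point is that identical data yields $1.7M or $1.4M depending on whether the metric definition includes refunds.

    import pandas as pd

    # Hypothetical month of transactions (names and figures invented).
    # The last row is a refund, recorded as a negative amount.
    transactions = pd.DataFrame({
        "order_id": [101, 102, 103, 104],
        "amount": [900_000, 500_000, 300_000, -300_000],
    })

    # Operations' definition: gross sales, ignoring refunds.
    gross_sales = transactions.loc[transactions["amount"] > 0, "amount"].sum()

    # Marketing's definition: net sales, refunds included.
    net_sales = transactions["amount"].sum()

    print(f"Operations reports: ${gross_sales / 1_000_000:.1f}M")  # $1.7M
    print(f"Marketing reports:  ${net_sales / 1_000_000:.1f}M")    # $1.4M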

Standardizing vocabulary and metrics requires everyone to be looking at the same thing. That means first moving data as far upstream as possible, then consolidating datasets, and finally creating universal dashboards. Let me comment on each of these in turn.

Moving Data Upstream

We find there are four major layers in which data can be produced or manipulated: at the source; during Extract, Transform, Load (ETL); in the Business Intelligence (BI) Tool; and during analysis.

Data produced at the source

Data that comes directly from the application or software. It relies solely on the integrity of the input data and the software producing it. This data will be consistent across the entire organization—both production and analytical data.

Data manipulated in the ETL layer

Data that is taken from production and manipulated, often to make it easier to analyze. This data will be consistent across the analytical data but will not appear in the production data.

Data manipulated in the BI Tool

Data that is post-ETL, accessed in the BI tool. Most BI tools have the ability to curate data to surface only the relevant fields and to create calculations that can then be accessed across the BI tool. Curating data in the BI tool allows you to control what dashboard creators are accessing while still giving them the freedom to create their own visualizations. This data will be uniform within the BI tool, but not across all the analytical data or across different BI tools.

Data manipulated at time of analysis

Data that is manipulated by each analyst and dashboard creator as they work with it, often in the form of calculations. Allowing individual analysts to calculate and manipulate data is beneficial because a central body cannot curate for every analytical need; however, be aware of whether people are using the same assumptions in their analyses. If they are not, there will be different numbers floating around to answer the same question. If many people are using similar calculations, it is worth moving those calculations into the ETL or BI Tool layer, as sketched below.
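As a sketch of what moving a calculation upstream can look like, the following Python (pandas) example contrasts analyst-layer calculations with the same logic defined once in a shared ETL-style transformation. The table, column, and function names here are hypothetical.

    import pandas as pd

    raw_orders = pd.DataFrame({
        "order_id": [1, 2, 3],
        "gross_amount": [100.0, 250.0, 80.0],
        "refund_amount": [0.0, 50.0, 80.0],
    })

    # Analysis layer: each analyst derives net sales independently,
    # and each may make different assumptions about what "net" means.
    analyst_a = raw_orders["gross_amount"] - raw_orders["refund_amount"]
    analyst_b = raw_orders["gross_amount"]  # ignores refunds entirely

    # ETL layer: the same calculation defined once, in a shared,
    # documented transformation that all downstream analyses consume.
    def transform_orders(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["net_amount"] = out["gross_amount"] - out["refund_amount"]
        return out

    curated_orders = transform_orders(raw_orders)

Once the definition lives in one upstream place, every dashboard built on the curated output inherits it, and a change to the definition propagates everywhere at once.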

The closer data is to the source, the more likely the organization will be consistent in its data analysis. Moving data toward the source will expedite communication and ensure that everyone trusts the numbers presented.

Consolidating Datasets

Once the data is in the optimal places, it is important to have an easy way to access it. This goes beyond just providing the right permissions to the data; it is about structuring it in such a way that it can be understood.

The first step is to create a few wider datasets that incorporate tables often joined together. Different organizations choose to split their data differently: by theme (e.g., financial data), by most common use, by level of detail, and in many other ways.

There are some drawbacks to creating wider datasets: less flexibility and performance issues. Wider datasets mean that the joins are already pre-dictated for analysis. Depending on the complexity of the data model, pre-dictated joins may not be an issue, or they may make some analyses more difficult. One example is duplicated rows: in pre-dictated joins, some tables may end up multiplying to match the level of detail of the dataset, so care must be taken to ensure that numbers add up properly, as illustrated below. Additionally, wider datasets are less performant than skinnier ones, which can make them difficult to use for analyses.
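To illustrate the row-duplication caveat, here is a small Python (pandas) sketch with invented tables. Joining an order-level total onto line items multiplies that total to the line-item level of detail, so a naive sum overstates revenue.

    import pandas as pd

    orders = pd.DataFrame({"order_id": [1, 2], "order_total": [100.0, 200.0]})
    line_items = pd.DataFrame({
        "order_id": [1, 1, 2],  # order 1 has two line items
        "item": ["a", "b", "c"],
    })

    # Pre-dictated join: order_total is repeated for every line item.
    wide = line_items.merge(orders, on="order_id")

    naive_total = wide["order_total"].sum()  # 400.0 -- overstated
    correct_total = wide.drop_duplicates("order_id")["order_total"].sum()  # 300.0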

However, wider datasets can be beneficial in reducing confusion, wasted time, and the possibility of error. If joined well, wider datasets ensure that people do not need to be familiar with the data model to create analyses. They do not have to understand how foreign keys map to each other or how to use lookup tables to decode IDs. Searching for the correct tables to join is also time-consuming; the more joins that can be eliminated, the easier and faster analyses can be done. The more democratized you want your data to be, the more important it is to pre-dictate joins, making the data more accessible.

Mistrust of Data

Who among us has not had an experience with bad data? Analysts sometimes spend days on an analysis only to realize their base data was incorrect, and they end up feeling betrayed and burned. Each time this happens, they become increasingly distrustful of data. Cautious analysts will see one small error in the data (perhaps a delayed refresh or a strange outlier value) and extrapolate to the entire dataset, dismissing it as useless. Without trust, analysts become hesitant to use the data, giving wide margins of error on reports and analyses to senior management.

The key to increasing data trust is communication. If analysts are not aware of known issues and fixes in the works, every new problem they encounter becomes an unknown-unknown. They assume small errors are indicative of larger ones and spend their time checking the data instead of doing analyses.

Instead of leaving analysts to imagine all the possible error scenarios, it is essential to provide them with full information about what is going wrong in the data at all times. It may seem counterproductive to air issues no one has complained about, but keeping analysts fully informed shows good faith. Data errors are often not all-encompassing; data that did not refresh overnight is still usable from the previous refresh. Knowing the specifics of an error tells analysts which parts of the data are danger zones and which are safe to use, and it limits the catastrophizing they can do when they find errors in the data.

It is also important to keep people up to date on changes and improvements being made in the data. Any updates or improvements in the data should be communicated through release notes at a regular cadence.

Implementing a ticketing system, or another consistent way for data users to communicate their issues and requests, is also helpful. It not only reduces the number of ad-hoc requests that the data team receives through email, IM, in person, etc., but also provides a more consistent view into the updates being made. Ticket submitters can go to one location and see their requests (and, depending on the ticketing system, where those requests are in the process). This pulls back the curtain on what often feels like the black box of data production.

Create a Culture of Discussing Data

These steps provide guideposts and guard rails around data use. Although they are important, the core element of healthier and more responsible data use is the data culture itself. Changes need to come from the ground up; they cannot be dictated by senior management. Don’t expect any change to happen merely because the CIO comes by and says, “We need to think about data differently.”

Culture change comes from the people who use the data and talk about it every day. It is essential to facilitate casual conversations between staff about data and data use. If people speak openly about data and how they are using it, they will be better able to help each other. Additionally, their lexicons will begin to merge: in communicating with each other, they begin to use the same words to refer to the same concepts. If these conversations can be facilitated across departments, it will go a long way towards changing the culture and making people less afraid and more trusting of the data.

Highly effective techniques in this regard are discussion hours and office hours.

Discussion hours are regular (weekly) blocks of time in which anyone who handles data in the organization can come together to discuss a topic or listen to a presentation. Typically the topic is set in advance, but it can be influenced by what is going on in the workplace or by frequently asked questions. It can be about how to use data tools, a review of a specific dataset, field definitions, or any of the various data-related issues that are relevant to you.

Office hours are times when an expert in a certain type of data or tool makes themselves completely available to their co-workers for questions. While most of us field ad-hoc emails and people ‘swinging by’ the desk, office hours are beneficial because people can feel comfortable coming by with a question. They do not need to feel that they are bothering you or disrupting your work, because this time is specifically designated for answering questions.

Office hours provide the additional benefit of letting you lock down other times on your calendar without having to answer questions then. Setting a regular and frequent cadence for office hours is essential so people know when they can find you and that they will not have to wait too long to ask their question. If office hours are set too infrequently, people may end up trying to answer questions for themselves instead of coming to the expert, which may create mismatches in other parts of the data.

Office hours and discussion hours can help address all three data culture issues: people who discuss data more often become more comfortable with it, evolve toward a common language around it, and are more likely to discuss the issues they find with it.

Conclusion

Data culture issues are easy to discuss, but it takes a village to fix them. One cannot merely decree that data trust concerns are no longer an issue or that everyone should use consistent vocabulary. Culture change comes from those who use and discuss the data. Data is something that everyone relies on, but few people have full visibility into. Keeping it in a black box only increases its mystery. Implementing some of the steps I’ve discussed will peel back the mystery and allow more people to be involved in the conversation. The more people talk about data, the more it will embed itself in the zeitgeist of the company culture, creating organic cultural change.