Imagine losing prospects or money because of the data set you used. What?! Yes, if your data set is not managed well, it can be detrimental to your organisation. Poor data quality even costs the US economy trillions of dollars a year. Without further ado, here’s a brief guide to data quality, so that you know what the big deal is about.
What is data quality?
Data quality describes how fit your data is for its intended use, and it’s commonly measured along these dimensions:
- Accuracy: Your data needs to be correct and precise. If not, it’s misleading.
- Reliability: If your data set is unclear or inconsistent, you can’t rely on it.
- Completeness: If information is missing from your data set, it’s harder to make well-founded decisions based on it.
- Uniqueness: Duplicates will just make the data set messier and harder to look at.
- Granularity: The right level of detail in your data set supports the right level of decision-making.
- Availability: The data needs to be available and accessible to the people in your organisation who use the data to work. A data catalog, which will be explained in the next article, can help with finding and identifying the data you need.
- Relevance: The data set you’re storing and using should be necessary and purposeful for your organisation. Otherwise, it’s a waste of resources.
- Timeliness: Your data set should be up-to-date and relevant for the time of use.
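As a rough illustration, some of these dimensions can even be checked programmatically. The sketch below is a minimal example in Python; the records and field names are hypothetical, not from any real system. It measures completeness, uniqueness and timeliness:

```python
from datetime import date

# Hypothetical donor records, for illustration only.
records = [
    {"id": 1, "email": "a@example.com", "updated": date(2023, 5, 1)},
    {"id": 2, "email": None,            "updated": date(2023, 5, 2)},
    {"id": 1, "email": "a@example.com", "updated": date(2023, 5, 1)},  # duplicate
]

def completeness(records, field):
    """Completeness: share of records where the field is present."""
    return sum(1 for r in records if r.get(field) is not None) / len(records)

def duplicate_ids(records):
    """Uniqueness: IDs that appear more than once."""
    seen, dupes = set(), set()
    for r in records:
        if r["id"] in seen:
            dupes.add(r["id"])
        seen.add(r["id"])
    return dupes

def stale(records, cutoff):
    """Timeliness: records not updated since the cutoff date."""
    return [r for r in records if r["updated"] < cutoff]

print(completeness(records, "email"))         # 2 of 3 records have an email
print(duplicate_ids(records))                 # {1}
print(len(stale(records, date(2023, 5, 2))))  # 2
```

Real data quality tooling goes far beyond this, of course, but even small checks like these surface problems early.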
Ensuring data quality along these dimensions is so important because it avoids or minimises the problems caused by bad data.
“Poor data quality costs the US economy around $3.1 trillion a year.” — IBM
“The cost of bad data is an astonishing 15% to 25% of revenue for most companies.” — MIT Sloan Management Review
Bad data is so expensive because of the time, effort and money wasted on dealing with the errors.
On top of that, tight deadlines discourage many people from tackling the root causes of bad data with the data creators; instead, they make their own best-guess corrections just to meet the deadlines.
Surely, many business owners can relate to the losses that result from leaving bad data unchecked, including:
- Added expenses when products are shipped to the wrong customers.
- Lost sales opportunities due to incomplete or incorrect customer records.
- Fines for regulatory non-compliance.
Hence, good data can avoid or minimise these problems since it frees up more time, effort and money for more productive pursuits.
More importantly, good data also increases the accuracy and usefulness of data analytics projects, leading to better business strategies.
Data quality management (DQM)
DQM is the practice of improving and maintaining the quality of data, and is considered a core component of the overall data governance process.
Data quality can be improved and maintained by following the best practices of DQM:
- Start investigations with a root cause analysis.
- Establish master data, which acts as a single source of truth.
- Create a set of data quality rules and conduct data quality assessments based on business requirements for data.
- Use data quality tools including profiling, standardisation, cleansing, and matching.
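For instance, the data quality rules in the third practice can be expressed as simple predicates and evaluated against the data in a quality assessment. This is a minimal sketch; the rules and records below are hypothetical, for illustration only:

```python
# A rule is a (name, predicate) pair; a record that fails the predicate
# violates the rule. Both rules and records are made up for this example.
rules = [
    ("id_present", lambda r: r.get("id") is not None),
    ("amount_positive", lambda r: isinstance(r.get("amount"), (int, float))
                                  and r["amount"] > 0),
]

records = [
    {"id": 1,    "amount": 25.0},
    {"id": None, "amount": 10.0},   # violates id_present
    {"id": 3,    "amount": -5.0},   # violates amount_positive
]

def assess(records, rules):
    """Return, per rule, the fraction of records that pass it."""
    return {name: sum(1 for r in records if pred(r)) / len(records)
            for name, pred in rules}

print(assess(records, rules))  # both rules pass for 2 of 3 records
```

Scores like these make it easy to track whether quality is improving over time, which is exactly what a recurring assessment is for.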
“Data quality tools are the processes and technologies for identifying, understanding and correcting flaws in data that support effective information governance across operational business processes and decision making.” — Gartner
Data quality tools:
- Data cleansing (also known as data cleaning): “Process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.” Sounds like something that would appeal to the neat freaks.
However, care must be taken to ensure that the data cleansing process doesn’t itself lose data through human error: for example, a developer misunderstanding a field and dropping it, or using the wrong format (say, UTC vs GMT timestamps).
That’s why it’s useful to keep the raw data in a separate file, so that the cleansing can be redone if such mistakes are made.
- Data profiling: “Process of reviewing source data, understanding structure, content and interrelationships, and identifying potential for data projects.”
- Data matching (also known as record or data linkage, entity resolution, object identification, or field matching): “Task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database.”
- Data standardisation: “Workflow that converts the structure of disparate datasets into a Common Data Format.”
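To make the first two tools concrete, here’s a minimal sketch in plain Python; the records and field names are hypothetical, not from any real charity or vendor. It profiles a field, then cleanses the records while working on a copy, so the raw data survives and the cleansing can be redone:

```python
import copy
from collections import Counter

# Hypothetical donor records, for illustration only.
raw = [
    {"name": " Alice ", "email": "ALICE@EXAMPLE.COM"},
    {"name": "Bob",     "email": None},
    {"name": "Alice",   "email": "alice@example.com"},  # duplicate of the first
]

def profile(records, field):
    """Data profiling: summarise a field's nulls and distinct values."""
    values = [r.get(field) for r in records]
    non_null = [v for v in values if v is not None]
    return {"nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
            "frequencies": Counter(non_null)}

def cleanse(records):
    """Data cleansing: trim whitespace, normalise case, drop incomplete
    rows and duplicates. Works on a deep copy so the raw data survives."""
    cleaned, seen = [], set()
    for r in copy.deepcopy(records):
        if r["email"] is None:       # incomplete record
            continue
        r["name"] = r["name"].strip()
        r["email"] = r["email"].lower()
        key = (r["name"], r["email"])
        if key in seen:              # exact duplicate after normalisation
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned

print(profile(raw, "email"))  # 1 null, 2 distinct raw spellings
print(cleanse(raw))           # a single, normalised Alice record
print(len(raw))               # still 3: the raw data is untouched
```

Profiling before cleansing is the natural order: you can’t fix flaws you haven’t found yet.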
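Data matching and standardisation can be sketched the same way. Below, fuzzy matching with Python’s standard-library SequenceMatcher flags near-identical donor names, and dates in disparate source formats are converted to ISO 8601 as the common format. The similarity threshold and the list of source formats are illustrative assumptions, not recommendations:

```python
from datetime import datetime
from difflib import SequenceMatcher

def is_match(a, b, threshold=0.85):
    """Data matching: treat two names as the same entity when their
    string similarity exceeds a (hypothetical) threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Source formats assumed for illustration; real data needs many more.
SOURCE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]

def standardise_date(value):
    """Data standardisation: convert any known source format to ISO 8601."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")

print(is_match("Jane Smith", "Jane  Smith"))  # True: likely the same donor
print(is_match("Jane Smith", "John Doe"))     # False
print(standardise_date("01/05/2023"))         # 2023-05-01
print(standardise_date("1 May 2023"))         # 2023-05-01
```

Production-grade matching uses far more sophisticated techniques (blocking, phonetic keys, machine learning), but the idea is the same: records that are not byte-for-byte identical can still describe the same entity.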
A real-life application of data quality
Not only is data quality beneficial for businesses, it also benefits charity organisations like Save The Children UK (SCUK), which needed to:
- Fulfil the General Data Protection Regulation (GDPR) requirement for charities to be transparent about their data and fund management.
- Overcome “donor fatigue”, where donors are contacted by numerous charities.
- Speed up the cleansing and loading of the increasing volume of data, streamed from several data sources, into its customer relationship management (CRM) system.
- Avoid duplicate records, so that new records are not incorrectly created for data already in the system. Duplicate records, even with slight variations in the same donor’s name and address, can cost SCUK money and annoy the donor by sending them the same message more than once.
When SCUK started using Talend’s DQM software, the data cleansing and loading process sped up, cutting DQM time by 60%.
The software also improved the identification of duplicate records before loading, providing an accurate view of the donors.
This allows SCUK to improve their targeted marketing campaigns with relevant messages to their donors, increasing donation levels.
And with better quality data, SCUK can also be more transparent about their donation processes as stipulated by GDPR.
Basically, having high-quality data allows you to deliver the insights you need, while low-quality data poses a barrier to them. This is especially true when your DQM fulfils all the data quality dimensions: accuracy, reliability, completeness, uniqueness, granularity, availability, relevance and timeliness.
But the thing about ensuring data quality is that it’s not just something you can do once and leave alone.
It’s something that needs to be worked on continuously if you expect the results to keep improving.
Most importantly, let’s keep in mind that it is much easier to prevent bad data than to fix it, as the saying goes:
“Prevention is better than cure.”