Tips for Eliminating Poor Data
source link: https://dzone.com/articles/tips-for-eliminating-poor-data
The Best Approach To Handling Poor Data
There are many ways to evaluate poor data, but the following approach has proved effective and widely applicable in practice.
To weed out poor data, you need to:
- Clearly define criteria for poor data
- Perform data analysis against these criteria
- Find out the sources of this poor data
- Fix poor data
- Fix poor data sources
Criteria for poor data can cover whether the data matches an expected type or format, falls within a permitted range, is complete, and is free of duplicates, among others.
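As a rough illustration, criteria of this kind can be written down as small, explicit checks. The sketch below is only one possible shape: the record fields (`id`, `sold_on`, `amount`), the date format, and the amount range are hypothetical and would come from your own business rules.

```python
import re

# Hypothetical rules, chosen only for illustration.
DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")
REQUIRED_FIELDS = ("id", "sold_on", "amount")


def check_record(record):
    """Return a list of the criteria that a single record violates."""
    problems = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Type and range: amount must be a number between 0 and 100,000.
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and not isinstance(amount, bool):
        if not 0 <= amount <= 100_000:
            problems.append("amount out of range")
    elif amount not in (None, ""):
        problems.append("amount is not a number")

    # Format: sold_on must look like YYYY-MM-DD.
    sold_on = record.get("sold_on")
    if isinstance(sold_on, str) and sold_on and not DATE_FORMAT.match(sold_on):
        problems.append("sold_on is not YYYY-MM-DD")

    return problems


def find_duplicate_ids(records):
    """Absence of duplicates: return the ids that occur more than once."""
    seen, duplicates = set(), set()
    for record in records:
        key = record.get("id")
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates
```

Keeping each criterion as a separate, named check makes it easier to report which rule a record broke and to trace the problem back to its source.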
Next, you need to check all of the data, or a part of it, for compliance with these criteria.
If the amount of data is large, it makes sense to check only a sample at the initial stages, since most sources of errors can be identified and corrected even on a small sample.
Once these errors are corrected, the entire dataset can be checked.
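The sample-first approach can be sketched as follows. This is a minimal example, assuming the records fit in a list and that a hypothetical `is_valid` function encodes your criteria; the function name and parameters are illustrative only.

```python
import random


def violation_rate(records, is_valid, sample_size=None, seed=0):
    """Return the share of records failing is_valid.

    With sample_size set, only a random sample is checked;
    otherwise the full list is checked.
    """
    if sample_size is not None and sample_size < len(records):
        # A fixed seed keeps the sample reproducible between runs.
        records = random.Random(seed).sample(records, sample_size)
    if not records:
        return 0.0
    failures = sum(1 for record in records if not is_valid(record))
    return failures / len(records)
```

A first pass might call `violation_rate(records, is_valid, sample_size=1000)`, and the full check (without `sample_size`) would follow once the error sources found on the sample are fixed.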
The source of poor data can be a person who made a mistake while performing data input, such as a POS employee.
It can also be an external information system or some process performing internal calculations in your own information system.
After identifying poor data and its sources, there are two directions to work on:
- Fix already existing poor data
- Prevent the appearance of such data in the future
In the first direction, depending on the source, you need to manually correct the data, reload it from the external system, or rerun the calculations correctly in your information system.
In the second direction, it is usually necessary to correct the processes that cause the appearance of poor data.
In the case of staff errors, you can train the employees or add input validation.
If poor data comes from an external system, then you need to discuss the exchange format with counterparties.
If poor data is the result of internal calculations, then you need to correct the corresponding algorithms in your information system.
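Input validation, one of the preventive measures mentioned above, can be as simple as rejecting a value at entry time instead of storing it. The sketch below is an assumption-laden example: the `parse_quantity` function and its limit of 1000 are hypothetical and stand in for whatever rules apply to your input forms.

```python
def parse_quantity(raw, maximum=1000):
    """Convert operator input to a quantity, rejecting bad values at entry time."""
    text = raw.strip()
    # Accept plain ASCII digits only: no signs, decimals, or letters.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"quantity must be a whole number, got {raw!r}")
    quantity = int(text)
    # The upper limit is a hypothetical business rule.
    if not 1 <= quantity <= maximum:
        raise ValueError(f"quantity must be between 1 and {maximum}, got {quantity}")
    return quantity
```

The input form would catch the `ValueError` and ask the operator to re-enter the value, so the error never reaches the stored data.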
The Efficiency of This Technique
This technique is effective because its criteria are clearly defined and based on the needs of the business.
Clear criteria also allow you to automate the identification of poor data, so that the people responsible are notified quickly when it appears and can respond in a timely manner.
For example, you can organize an email notification with the results of data quality checks.
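A minimal sketch of such a notification, using Python's standard `email` package, might look like this. The shape of `results` (check name mapped to number of violations) and the function name are assumptions made for illustration.

```python
from email.message import EmailMessage


def build_quality_report(results, sender, recipient):
    """Build an email summarizing check results.

    results maps a check name to its number of violations
    (a hypothetical structure chosen for this sketch).
    """
    failed = {name: count for name, count in results.items() if count > 0}
    status = "FAILED" if failed else "OK"

    message = EmailMessage()
    message["Subject"] = f"Data quality checks: {status}"
    message["From"] = sender
    message["To"] = recipient

    # One line per check, sorted by name for a stable report layout.
    lines = [f"{name}: {count} violation(s)" for name, count in sorted(results.items())]
    message.set_content("\n".join(lines))
    return message
```

The resulting message can then be handed to `smtplib.SMTP(...).send_message(message)`; the mail server and addresses depend on your environment.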
Frequency of the Poor Data Evaluation Process
It is very important that this data evaluation process be run regularly.
This will allow you to correct errors in data sources in a timely manner and, as a result, avoid time-consuming manual corrections, as well as minimize business risks associated with the use of poor data.
How often the data evaluation process should run depends on the type of information system and on the data itself.
In an analytical system, part of the data may remain unchanged, and it is enough to check such data once and eliminate errors, then repeat the process only for new data.
In an online system, large amounts of data can change, so the entire dataset may need to be checked; in this case, you can continuously check part of the set and, only when errors are found, check the entire dataset for those specific errors.
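Checking only new data can be sketched with a simple watermark. This example assumes, purely for illustration, that each record carries an `id` that grows as records are added and that a hypothetical `is_valid` function encodes your criteria.

```python
def check_new_records(records, last_checked_id, is_valid):
    """Check only records added since the previous run.

    Returns the failing records and the new watermark, which the
    caller stores and passes in on the next run.
    """
    # Assumption: ids increase over time, so a larger id means a newer record.
    new_records = [r for r in records if r["id"] > last_checked_id]
    failures = [r for r in new_records if not is_valid(r)]
    new_watermark = max((r["id"] for r in new_records), default=last_checked_id)
    return failures, new_watermark
```

If a run reports failures, a follow-up pass over the entire dataset can be limited to the specific checks that failed.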
In concrete terms, the process can run anywhere from daily, for systems that are sensitive to any error, to once per month, if the business can tolerate a certain number of errors without a significant impact on its processes.
Regardless of the chosen launch frequency, the process of improving data quality itself must exist throughout the entire lifecycle of an information system.
What Can Lead to the Accumulation of Poor Data?
The absence of a process for handling poor data leads to an accumulation of errors.
Moreover, errors in the original data can lead to secondary errors resulting from working with poor data.
Eventually, all these factors become a persistent drag on the company's profit.
Summary
Building data quality management processes at the early stages of system deployment is very important.
Doing so lets you develop the evaluation criteria and the procedures for analyzing poor data, notifying users, and working with the sources of poor data while data volumes are still small and the effort required is lowest.
After all, it is better to prevent errors than to eliminate them.