3

Data Testing: The Missing Component of Data Quality

 1 year ago
source link: https://dzone.com/articles/data-testing-the-missing-component-of-data-quality
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Data Testing: The Missing Component of Data Quality

Data Quality in production is incomplete without proper data testing. In this article, you will learn the basics of data testing and how to get started.

Jul. 01, 23 · Analysis
Like (1)
1.68K Views

Data quality

Data Quality is crucial for systems such as data warehouses, MDM, CRM, and other data-centric projects. However, DQ is often ignored in development until the system is fully operational in production. This results in a large gap in data quality as there was minimal, or no testing done during the development phase of the project.

Data is like a product and the data system is like a factory that produces it. In a factory, quality is split into two components quality assurance and quality control. Let’s take a deeper dive and understand these concepts and how they apply to data quality.

Quality Assurance (QA) vs Quality Control (QC)

QA: The quality of the raw materials and the processing methods are verified by QA processes during product manufacturing to ensure minimum defects in the final product. 

What's in store for Kubernetes in the Enterprise in 2023

Our 2022 Kubernetes report features insights into how teams leverage Kubernetes, K8s in AI, advancements in cluster observability, and more.

thumbnail?fid=16388824&w=350

QC: Even after QA, there will still be some gaps in manufacturing that are not caught by the QA processes, and that is why there is a final process of QC that checks the final product.

Quality assuarance
Quality control
QUALITY ASSURANCE
QUALITY CONTROL
Focus on processing to prevent defects in the product. Identifying defects in the product before delivery.
  • Proactive 
  • Reactive 
  • Quality of the raw material input
  • Focus on the final product
  • Focus on processes  
  • Process Orchestration 

How do the QA and QC concepts from manufacturing apply to the data domain? What are the shortcomings of the traditional data quality methods where data testing is often ignored?  

Limitation of Legacy Data Quality Approach

The above comparison shows that DQ in the postproduction phase is QC, which is just 50% of the DQ process. The other 50% is Quality Assurance or Data Testing in development is missing. Here are a few shortcomings of traditional data quality:

  1. ETL Testing: Data quality tools are designed to test the data and not for ETL testing.
  2. Too Little, Too Late: Data Quality processes are only applied when the final data is delivered in production. By this time the defective data system is already in production.
  3. Garbage In, Garbage Out: The raw data that was used during development was never tested. So, the developer was not aware of all the permutations and combinations of data that were possible. 
  4. Bad Process Results in Bad Data: The data processes that transform the data itself might be incorrect and that will result in poor-quality data.
  5. Incorrect Orchestration of Data Processes: Usually, data processing is the execution of various data processes in a particular order and time. This is required to integrate data from multiple sources into a single unified view. 
  6. Incorrect Data Schema: If the data model has issues such as incorrect data types, wrong data type lengths, precisions, or missing constraints then many data issues will later pop up in production data.

Remember, these issues can only be dealt with during the development phase of the project by implementing proper QA or data testing.

Understanding Data Testing

Data Testing is a method to test and certify the data and the data processes before deploying the code in production. It consists of raw data testing, ETL testing, and process orchestration.

 

Understanding Data Testing

Data testing consists of the following activities.

17036470-file-validation-icedq.png
Testing raw data to ensure the data is as expected.
17036472-file-to-database-reconciliation-icedq.png
Testing the ETL process to ensure it is coded correctly and transforming the data as per the requirements.
17036474-table-validation-icedq.png
Business validation of the output of the data generated by the ETL process.

Getting Started With Data Testing

A major part of data testing involves testing ETL processes. In iceDQ, this is done by implementing the following steps:

1. Identify the Process for Testing

  • Identify the data process that is tested. In this case, we are taking the “LOAD_EMP_DATA” process.
  • Find the source database. Here, it is “ADVENTURE WORKS DB” and the schema is “HR” 
  • Find the destination database “EDW_DB”.

2. Find the Source and Target Tables Used by the Data Process.

  • The source table is “EMPLOYEE”.
  • The destination table is “EMPLOYEE_DIM”.
2. Find the Source and Target Tables Used by the Data Process.

3. Connecting to Databases.

  • Create a connection to the source database from where the data is read by the process.
  • Create another connection for the target database where the processed data is loaded.
Connecting to Databases

4. Creating & Execute Data Testing Rules.

  • In this case, most of the data transformation is happening for the employee’s name.
  • The source has a “first name” and “last name” as columns.
  • However, the destination has a complete name. 
  • The data transformation combines the first name and last name to provide the complete name. 
  • Hence, a check is created to reconcile the data transformation.
Creating & Execute Data Testing Rules

5. Reviewing the Output of Data

  • Once the test is executed, iceDQ will generate data exceptions that show the defects in data transformations as highlighted below.
  • Based on the defects, the data process is certified or rejected.
Reviewing the Output of Data

Conclusion

Data quality

Data quality is not just a production concept, but equal or more efforts are put into data testing during development. A combined approach of Data Testing during development and Data Monitoring in production will provide the best outcome for data quality and that too in a very effective way.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK