Validating data quality with AWS Glue DataBrew

In a previous post, I showed how AWS Glue DataBrew can help you handle PII data. That post is written in Thai, but the screenshots speak for themselves, so English readers can easily follow along.

According to a recent AWS announcement, AWS Glue DataBrew users can now define data quality (DQ) rules: customized validation checks that encode the business requirements for specific data. As a result, data practitioners who do not want to pay for an expensive DQ product license, or who do not want to adopt an open-source framework that requires coding knowledge, can define their own quality rules and surface the results in a data quality dashboard and validation report. This lets them quickly review rule outcomes and decide whether their data is fit for use. As a builder, I find that these two new DataBrew features, DQ rules and PII handling, help a lot when designing a DQ process at scale, because there are no compute servers to maintain. In this post I will walk you through how to validate your data quality with AWS Glue DataBrew.

What is AWS Glue DataBrew?
What is a data quality rule?

Architecture Diagram

Pre-requisites

AWS Account❗
Download Data here
Unzip the archive and upload only patient.csv to Amazon S3.
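If you prefer to script the upload rather than use the S3 console, the following is a minimal sketch using boto3. The bucket name, the `databrew/input` prefix, and the helper names are assumptions for illustration and are not part of the original walkthrough.

```python
from pathlib import Path


def s3_key_for(local_path: str, prefix: str = "databrew/input") -> str:
    """Build the object key for a local file under a (hypothetical) prefix."""
    return f"{prefix.strip('/')}/{Path(local_path).name}"


def upload_file_to_s3(local_path: str, bucket: str, s3_client=None) -> str:
    """Upload the file and return the key it was stored under."""
    if s3_client is None:
        # Imported lazily so the key helper can be used without boto3 installed.
        import boto3
        s3_client = boto3.client("s3")
    key = s3_key_for(local_path)
    # upload_file(Filename, Bucket, Key) is the standard boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return key
```

Calling `upload_file_to_s3("patient.csv", "my-example-bucket")` would store the file under `databrew/input/patient.csv`, assuming your credentials allow writing to that bucket.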

Preparing dataset

Go to the AWS Glue DataBrew console here

In the left navigation pane, click Datasets to create a dataset for Glue DataBrew

Click Create new dataset

Choose the patient.csv file that you just uploaded. You can also source data from the Glue Data Catalog, Amazon Redshift, Amazon AppFlow, or Snowflake. Click Create dataset
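The console steps above can also be approximated in code. This is a sketch, assuming a boto3 DataBrew client (`boto3.client("databrew")`) and a CSV file with a header row; the dataset name, bucket, and key you pass in are your own choices.

```python
def build_dataset_request(name: str, bucket: str, key: str) -> dict:
    """Return keyword arguments for the DataBrew create_dataset call."""
    return {
        "Name": name,
        "Format": "CSV",
        # Assumes a comma-delimited file whose first row holds column names.
        "FormatOptions": {"Csv": {"Delimiter": ",", "HeaderRow": True}},
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


def create_dataset(client, name: str, bucket: str, key: str) -> str:
    """Create the dataset and return its name as reported by the service."""
    response = client.create_dataset(**build_dataset_request(name, bucket, key))
    return response["Name"]
```

Keeping the request in a plain dictionary makes it easy to review or version the dataset definition before anything is sent to AWS.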

Run data profile

Under Datasets, choose the Patient dataset and click "Run data profile"

Wait a few minutes for the data profile job to populate the Data profile overview, Column statistics, Data quality rules suggestions, and Data lineage. Once it is done, click "View data profile"
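For readers who want to automate the profile run, here is a rough sketch of creating a profile job and waiting for it to finish. It assumes a boto3 DataBrew client, an IAM role that DataBrew can assume, and an output prefix of my own choosing; the polling interval is arbitrary.

```python
import time

# Job run states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def build_profile_job_request(job_name, dataset_name, role_arn, bucket,
                              prefix="databrew/profile-output/"):
    """Return keyword arguments for the DataBrew create_profile_job call."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        "RoleArn": role_arn,
        # The profile report is written as JSON under this (hypothetical) prefix.
        "OutputLocation": {"Bucket": bucket, "Key": prefix},
    }


def run_and_wait(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a job run and poll until it reaches a terminal state."""
    run_id = client.start_job_run(Name=job_name)["RunId"]
    while True:
        state = client.describe_job_run(Name=job_name, RunId=run_id)["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

A typical call sequence would be `client.create_profile_job(**build_profile_job_request(...))` followed by `run_and_wait(client, job_name)`. The `sleep` parameter exists only so the polling loop can be exercised without real waiting.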

You should see the statistics that help you understand your dataset, as follows

Dataset preview

Data profile overview

Potential PII detection

Attribute Correlations

Individual column-level statistics

Summary statistics for all columns, such as min, max, distribution, column type, and unique values

In the all-columns statistics view, you can also drill down into the columns identified as PII

You can also see the data lineage!

Enough data exploration. This post is about data quality, so let's take a look at the Data quality tab. You will see nothing there, because you don't have any rules yet! However, after the data profile job has run, DataBrew suggests rules on the right-hand side, based on standard DQ dimensions such as uniqueness and completeness

Create DQ ruleset

Pick the suggestions that make sense for your business criteria. Again, zero code is required! I chose to check uniqueness for my Id and SSN columns, the length of SSN, and so on

Again, these are recommendations from AWS Glue DataBrew; you can create your own DQ rules later. Once I am done, I click "Create ruleset"

I can add, remove, adjust, or review the DQ rules I have just added

After I click "Create ruleset", I can see my DQ rulesets here

Click the DQ ruleset name, in my case "DQ Rulesets for Patient data", to review all of its DQ rules
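Although the console needs no code, a ruleset can also be defined programmatically, which helps when the same rules must be recreated across accounts. The sketch below builds a request for the boto3 `create_ruleset` call with uniqueness rules for the Id and SSN columns. The check expression string is my assumption of the DataBrew rule syntax and should be verified against the DataBrew documentation; the dataset ARN is a placeholder you supply.

```python
def unique_rule(column: str) -> dict:
    """Rule that expects zero duplicate values in the given column."""
    return {
        "Name": f"{column} has no duplicates",
        # Assumed expression syntax; verify it against the DataBrew docs.
        "CheckExpression": "AGG(DUPLICATE_VALUES_COUNT) == :val1",
        "SubstitutionMap": {":val1": "0"},
        "ColumnSelectors": [{"Name": column}],
    }


def build_ruleset_request(name: str, dataset_arn: str, unique_columns) -> dict:
    """Return keyword arguments for the DataBrew create_ruleset call."""
    return {
        "Name": name,
        "TargetArn": dataset_arn,  # ARN of the dataset the rules apply to
        "Rules": [unique_rule(column) for column in unique_columns],
    }
```

With a DataBrew client, `client.create_ruleset(**build_ruleset_request("DQ Rulesets for Patient data", dataset_arn, ["Id", "SSN"]))` would submit the request.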

Next, associate the DQ ruleset with the data profile job. Click here to see all your data profile jobs

In my case it is the Patients - Data Profile job. Select it, then click Actions and Edit

Search for "Data quality rules" and click "Apply data quality ruleset"

Choose your DQ ruleset, click Apply selected rulesets, and then click Save
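The same association can be sketched with the boto3 `update_profile_job` call, which accepts a list of validation configurations. The job name, role ARN, bucket, prefix, and ruleset ARN below are placeholders, and `CHECK_ALL` is the validation mode I am assuming here.

```python
def build_validation_update(job_name, role_arn, bucket, ruleset_arn,
                            prefix="databrew/profile-output/"):
    """Return keyword arguments for update_profile_job that attach a ruleset."""
    return {
        "Name": job_name,
        "RoleArn": role_arn,
        # update_profile_job expects the output location to be restated.
        "OutputLocation": {"Bucket": bucket, "Key": prefix},
        "ValidationConfigurations": [
            {"RulesetArn": ruleset_arn, "ValidationMode": "CHECK_ALL"}
        ],
    }


def attach_ruleset(client, **params) -> str:
    """Apply the update and return the job name reported by the service."""
    response = client.update_profile_job(**build_validation_update(**params))
    return response["Name"]
```

Once the ruleset is attached, every subsequent run of the profile job evaluates the rules alongside the usual statistics.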

Re-run data profile with DQ ruleset

Choose your profile job and click "Run job"

View DQ dashboard

After a few minutes, you should see the results of the data profile job, including the DQ results based on your DQ rules!

My dataset has 4 passed rules and 6 failed rules. Obviously, I have to fix my data before any downstream process consumes it!
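In an automated pipeline, a result like this would typically stop downstream processing. The following is a minimal sketch of such a gate. It assumes you have already extracted the rule outcomes into a plain mapping of rule name to pass/fail; that mapping is a hypothetical structure of my own, not the format of the DataBrew validation report.

```python
def summarize(outcomes: dict) -> tuple:
    """Return (passed, failed) counts for a mapping of rule name -> bool."""
    passed = sum(1 for ok in outcomes.values() if ok)
    return passed, len(outcomes) - passed


def assert_fit_for_use(outcomes: dict) -> None:
    """Raise if any rule failed, so a pipeline step can stop early."""
    passed, failed = summarize(outcomes)
    if failed:
        failing = sorted(name for name, ok in outcomes.items() if not ok)
        raise ValueError(
            f"{failed} of {len(outcomes)} DQ rules failed: {', '.join(failing)}"
        )
```

Raising an exception is one simple design choice: most orchestrators treat an uncaught error as a failed task, which keeps downstream tasks from running on bad data.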

Conclusion

AWS Glue DataBrew is a visual data preparation tool that makes it easy to clean and normalize data using over 250 pre-built transformations, all without writing any code. With the new DQ and PII handling features, you can add your own data quality rules and PII tokenization to your automated data pipelines (Amazon Managed Workflows for Apache Airflow or AWS Step Functions) to make sure your data is clean and protected.

You can think about how to add PII obfuscation and DQ zones like this

