
SAP BTP Data & Analytics Showcase – How to improve and manage data quality using SAP Data Intelligence

source link: https://blogs.sap.com/2021/07/20/sap-btp-data-analytics-showcase-how-to-improve-and-manage-data-quality-using-sap-data-intelligence/

Introduction

This is the second of five blog posts in the “SAP BTP Data & Analytics Showcase” series. We recommend reading the overview blog post first to gain a better understanding of the end-to-end scenario and context, which involves multiple SAP HANA Database & Analytics components in the cloud.

In this blog post, we are going to demonstrate how to improve and manage data quality with the help of data quality rules in SAP Data Intelligence. Through the user interfaces in SAP Data Intelligence, business users without a technical background can manage and monitor the quality of their datasets. We highlight the data-quality activities in the data preparation phase, since high-quality datasets help build confidence in data modelling and generate accurate business insights.

Following the sections of this blog post, you will experience how SAP Data Intelligence can help you improve and manage your data quality in an end-to-end process. You can easily follow this process in five steps:

  • Create basic/advanced quality rules under different categories
  • Manage quality rules and bind rules to datasets
  • Evaluate data quality for selected datasets
  • Utilise self-service data preparation for correction
  • Monitor and track overall quality status


End-to-end process to improve and manage data quality using SAP Data Intelligence

End-to-end demo video related to this blog

1. Create basic/advanced quality rules

In this section, we are going to show how to define basic and advanced quality rules targeting multiple quality issues. Let’s go to the user interface “Define and categorize rules” under Metadata Explorer together.


Scenario 1: Define basic quality rules

We will demonstrate how to define a basic quality rule that addresses the following data quality issue: inaccurate post codes in Germany, i.e. entries whose post code is empty or does not consist of five digits. Let’s look at the rule below, which was created under the category “Accuracy” with the corresponding conditions.


Basic rule for inaccurate post codes under category Accuracy
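The exact conditions are configured in the rule editor shown above, but the underlying check is easy to express in code. Below is a minimal Python/pandas sketch of the same logic, purely for illustration; the column name post_code and the sample values are assumptions, not part of the actual rule definition.

```python
import pandas as pd

# Hypothetical station master sample; column names are assumptions for illustration.
stations = pd.DataFrame({
    "station_id": [1, 2, 3, 4],
    "post_code": ["10115", "", "9044", "80331"],
})

# Accuracy rule: the post code must be present and consist of exactly five digits.
valid_post_code = stations["post_code"].str.fullmatch(r"\d{5}").fillna(False)

# Rows that would fail the rule (empty or four-digit post codes).
print(stations[~valid_post_code])
```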

In our demo, we have created another two basic rules under the category “Completeness”, which detect missing gasoline brand and house number entries in the station master data. We will not repeat the similar definitions here.


Basic rules for incomplete brand and house number under category Completeness

Scenario 2: Define advanced quality rules

In the second scenario, we are going to demonstrate how advanced quality rules can identify duplicated records – stations with the same gasoline brand and geographical information (latitude and longitude). The condition script provided by advanced quality rules offers more operators, which enable you to develop more complex rule logic. You can learn how to use the Rule Script Language to write condition scripts from the provided templates.

Let’s dive into this advanced rule, which was defined under the category “Uniqueness”. The newly added operator is_unique is applied in this case.
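Conceptually, the rule flags every combination of brand, latitude and longitude that occurs more than once. The following pandas sketch illustrates that logic with made-up column names and sample values; it is not the Rule Script Language syntax used in SAP Data Intelligence.

```python
import pandas as pd

# Hypothetical station records; two entries share brand and coordinates.
stations = pd.DataFrame({
    "station_id": [1, 2, 3],
    "brand": ["ARAL", "ARAL", "Shell"],
    "latitude": [52.52, 52.52, 48.14],
    "longitude": [13.40, 13.40, 11.58],
})

# Uniqueness rule: the (brand, latitude, longitude) combination must not repeat.
duplicated = stations.duplicated(subset=["brand", "latitude", "longitude"], keep=False)

# Stations 1 and 2 would be reported as duplicates.
print(stations[duplicated])
```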


Scenario 3: Test quality rules

After creating quality rules, you can use the “Test Rule” feature to check whether the newly created rules meet your expectations. Here we checked our advanced quality rule for duplicated station entries.
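You can script a similar sanity check outside the product as well. The sketch below replays the two completeness rules against a few synthetic records and asserts the expected outcome; the column names brand and house_number and the test data are assumptions for illustration.

```python
import pandas as pd

# Synthetic test records; column names and values are made up for this check.
sample = pd.DataFrame({
    "station_id": [201, 202, 203],
    "brand": ["ARAL", None, "Shell"],
    "house_number": ["12a", "7", None],
})

# Completeness rules: brand and house number must both be filled.
complete_brand = sample["brand"].notna() & (sample["brand"].astype(str).str.strip() != "")
complete_house = sample["house_number"].notna() & (sample["house_number"].astype(str).str.strip() != "")

# We expect exactly one record to fail each rule.
assert (~complete_brand).sum() == 1
assert (~complete_house).sum() == 1
print("Both completeness rules behave as expected on the test records.")
```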


2. Manage quality rules in Rulebook

In this section, we will illustrate how to import quality rules and bind them to the datasets you would like to evaluate via a rulebook. Let’s go to Metadata Explorer and select the “Create Rulebook” feature there. In our case, we have created a rulebook called “StationMasterDataRulebook”, as shown below.


Scenario 1: Import quality rules into Rulebook

Firstly, we reference the basic and advanced rules that we created in the first part in the rulebook. You can easily do this in the following three steps.


Scenario 2: Bind quality rules with datasets

Secondly, we choose the datasets whose quality we’d like to evaluate. This procedure can be completed in three steps: click the “arrows” button – select the dataset with the F4 help – map the attributes.


After a successful setup, you will see the following binding information in your rulebook. In our case, we have chosen the imported CSV file in the DI internal data lake that stores the station master data to be evaluated.
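Conceptually, a rulebook is a collection of rule references plus, for each rule, a mapping from the rule’s input parameters to the columns of the bound dataset. The following Python snippet is only a hypothetical illustration of that structure; none of the identifiers or the file path come from the actual SAP Data Intelligence API.

```python
# Conceptual, hypothetical representation of the rulebook bindings -- not a DI API.
rulebook = {
    "name": "StationMasterDataRulebook",
    "dataset": "/shared/stations_data_quality.csv",  # CSV file in the DI internal data lake (path assumed)
    "bindings": [
        {"rule": "Accuracy/ValidGermanPostCode", "mapping": {"POST_CODE": "post_code"}},
        {"rule": "Completeness/BrandPresent", "mapping": {"BRAND": "brand"}},
        {"rule": "Completeness/HouseNumberPresent", "mapping": {"HOUSE_NUMBER": "house_number"}},
        {"rule": "Uniqueness/NoDuplicateStations",
         "mapping": {"BRAND": "brand", "LAT": "latitude", "LON": "longitude"}},
    ],
}

# Print each rule together with its parameter-to-column mapping.
for binding in rulebook["bindings"]:
    print(f'{binding["rule"]} -> {binding["mapping"]}')
```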


3. Evaluate data quality for selected datasets

Now we would like to show how to use the imported rules to evaluate our selected dataset and get a first impression of its quality status.

Scenario 1: Configure threshold for evaluation results

Firstly, let’s configure the thresholds for the maximum share of failed records and the minimum share of passing records. This step can easily be done in the rulebook as follows.
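The thresholds simply partition the pass rate of a rule run into pass, warning and fail bands. Here is a small Python sketch of how such a classification works; the threshold values of 80% and 60% are placeholders and not the settings used in the demo.

```python
def quality_status(passed: int, total: int,
                   pass_threshold: float = 80.0, warn_threshold: float = 60.0) -> str:
    """Classify a rule evaluation result; threshold values are placeholders."""
    pass_pct = 100.0 * passed / total
    if pass_pct >= pass_threshold:
        return "Pass (green)"
    if pass_pct >= warn_threshold:
        return "Warning (orange)"
    return "Fail (red)"

# With the figures reported later in this blog (65.39% of 15444 records pass),
# the overall status would land in the warning band.
print(quality_status(passed=10099, total=15444))
```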


Scenario 2: Run all rules and view results in Rulebook

Now we have completed all the setup in the rulebook. Let’s run the evaluation with the imported rules and check the results.


SAP Data Intelligence recently released a new feature that enables users to save all the failed records in a database table in SAP HANA Cloud (only the connection type “HANA_DB” works here). Using this feature, you can easily correct failed records and manage data quality.
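Within SAP Data Intelligence this is just a configuration option on the rulebook run. If you wanted to reproduce the idea outside the product, a rough sketch using the SAP HANA Python client (hdbcli) might look like the following; the connection details, table definition and sample rows are placeholders.

```python
from hdbcli import dbapi  # SAP HANA Python client

# Placeholder connection details for an SAP HANA Cloud instance.
conn = dbapi.connect(address="<hana-cloud-host>", port=443,
                     user="<user>", password="<password>", encrypt=True)
cursor = conn.cursor()

# Hypothetical target table for failed records.
cursor.execute("""
    CREATE TABLE FAILED_STATION_RECORDS (
        STATION_ID INTEGER,
        RULE_NAME  NVARCHAR(100),
        POST_CODE  NVARCHAR(10)
    )
""")

# Example failed records from the post code rule (made-up values).
failed_rows = [(2, "ValidGermanPostCode", ""), (3, "ValidGermanPostCode", "9044")]
cursor.executemany("INSERT INTO FAILED_STATION_RECORDS VALUES (?, ?, ?)", failed_rows)

conn.commit()
conn.close()
```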


Scenario 3: Deep dive into evaluation results in Rulebook

After clicking the “View Results” button, you’ll enter the following analytical page, where you can gain statistical insight into the quality status and the identified quality issues of your datasets. Let’s look at the results together.

  1. Overall quality status: Warning (65.39% of the 15444 station master entries pass all four rules)
  2. Evaluated dataset: station master data from the CSV file in the DI internal data lake
  3. Detailed quality information for an individual rule, “Incomplete gasoline brand entries”: 544 stations (3.52%) have an empty gasoline brand (see the quick check after this list)
  4. Sample data violating the rule: 5 records are sampled to illustrate the quality issue
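As a quick cross-check of these figures:

```python
total = 15444
passed_pct = 65.39     # overall pass rate reported in the rulebook
missing_brand = 544    # stations with an empty gasoline brand

print(round(total * passed_pct / 100))         # about 10099 records pass all four rules
print(round(100 * missing_brand / total, 2))   # 3.52 -> 3.52% of stations lack a brand
```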


4. Utilise self-service data preparation for correction

SAP Data Intelligence offers powerful self-service data preparation to help business users and data scientists correct the detected quality issues (from the evaluation results shown in Part 3). It is worth mentioning that the data preparation actions can be saved in a recipe. You can access the data preparation functionality easily via Metadata Explorer. The documentation published in the SAP Help Portal provides more product information and user guidelines for this feature. In our demo video, two examples for the data preparation part are presented.


Scenario 1: Implement data preparation actions and store corrected data

In our case, we developed a data preparation template including all the relevant actions to correct and standardise the CSV file “stations_data_quality”, and created a new CSV file called “stations_master_improved” to store the corrected data. Later, we can import this newly prepared CSV file into the HANA Cloud database using data pipelines.
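The recipe itself is built interactively in the preparation UI. The equivalent corrections could be sketched in pandas roughly as follows; the file paths, column names and concrete correction choices are assumptions for illustration, not the exact actions recorded in the demo recipe.

```python
import pandas as pd

# Read the original file; keep post codes as strings to preserve leading zeros.
stations = pd.read_csv("stations_data_quality.csv", dtype={"post_code": str})

# Completeness: fill missing gasoline brands with an explicit placeholder value.
stations["brand"] = stations["brand"].fillna("UNKNOWN")

# Accuracy: pad post codes that lost a leading zero back to five digits.
stations["post_code"] = stations["post_code"].str.zfill(5)

# Uniqueness: drop duplicates of brand + coordinates, keeping the first entry.
stations = stations.drop_duplicates(subset=["brand", "latitude", "longitude"], keep="first")

# Store the corrected data in a new file.
stations.to_csv("stations_master_improved.csv", index=False)
```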


Scenario 2: Create rule bindings to newly created datasets

We have fixed the detected data quality issues in the station master data. What is the overall data quality status now? Let’s check the evaluation results again. For this purpose, we need to bind our four quality rules to the newly created CSV file (stations_master_improved) via the rulebook, as shown below.


5. Monitor and track data quality

Now, let’s click the “Run All” button and look at the re-evaluation results together! You can see that the overall data quality status has improved from 65.4% to 85.42%, changing from the “Warning” (orange) status to the “Healthy” (green) status. Furthermore, you can also build your own Rule Dashboard and keep tracking the data quality of your datasets. For further information regarding the Rule Dashboard, please refer to this document.


Conclusion

Congratulations! You have finished this session on managing data quality in the data preparation phase. We hope you are able to improve and manage your data quality through these five simple steps. Moreover, we’d like to emphasise that high-quality data helps you generate accurate business insights and accelerates your end-to-end process. As a next step, you can create data pipelines via the Modeler to import this high-quality dataset into your HANA Cloud database and start your data modelling activities!

We highly appreciate your feedback and comments! Enjoy!

