Data Yoga: It is all about finding the right balance

A big lifestyle topic these days is finding the right balance, whether that is your work-life balance or, with yoga and mindfulness, your inner balance. In data science this is not always so easy. Data is quite often naturally imbalanced, so we deal with imbalanced data sets all the time.

Let’s start at the beginning: what does imbalanced mean for data sets? We often have data sets where we want to predict a certain category, for example whether our customers churn or not, and the categories do not appear equally often. Instead, one class has significantly more observations than the other. From a business point of view this is a good thing: we don’t want 50% of our customers to churn, we would rather it be only 10%. The same goes for predictive quality: we don’t want our factories to produce 50% bad quality, we would rather it be less than 2%. So here we are happy that the results are imbalanced, and we don’t pursue balance. The problem appears from a statistical point of view, and the predictive quality example shows it best: when the cases we care about are rare, it is harder to find patterns in them. This also makes it more challenging to predict such events with machine learning. But we still want to use machine learning to reduce our churn rate or the share of bad quality even further, so what can we do?

The first impulse is to add more data. Good idea, but what can we do if we have already used all the data available and the machine learning model still doesn’t give us sufficient results? Here over- and undersampling can help.

For oversampling, we randomly select cases from the rare class (in our examples the churned customers or the bad quality) and add copies of them to our data set. For undersampling, we randomly select cases from the majority class (our loyal customers or the good quality) and remove them from the data set. In both cases we end up with a less imbalanced data set. If we keep going long enough, we can even generate a completely balanced data set, but in my experience this is not necessary most of the time.
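To make the idea concrete, here is a minimal sketch of plain random over- and undersampling in pure Python. It is not the hana_ml implementation; the function names and the `target_ratio` parameter (desired minority count divided by majority count) are my own assumptions for illustration.

```python
import math
import random


def random_oversample(rows, labels, minority, target_ratio, seed=0):
    """Add random copies of minority rows until minority/majority >= target_ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    # Number of duplicates needed to reach the requested ratio.
    n_needed = max(0, math.ceil(target_ratio * n_majority) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_needed)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority, target_ratio, seed=0):
    """Drop random majority rows until minority/majority >= target_ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Number of majority rows we can keep at the requested ratio.
    n_keep = min(len(majority_idx), int(len(minority_idx) / target_ratio))
    keep = sorted(minority_idx + rng.sample(majority_idx, n_keep))
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 10 churned customers, 90 loyal ones (made up for this example).
rows = [[i] for i in range(100)]
labels = ["churn"] * 10 + ["loyal"] * 90

_, over_labels = random_oversample(rows, labels, "churn", target_ratio=0.5)
_, under_labels = random_undersample(rows, labels, "churn", target_ratio=0.5)
print(over_labels.count("churn"), over_labels.count("loyal"))
print(under_labels.count("churn"), under_labels.count("loyal"))
```

A ratio of 0.5 stops well short of full balance, which matches the point that a completely balanced data set is usually not needed.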

So the bottom line is that over- and undersampling are pretty cool and sometimes necessary, as you can see in the visualization (I mean, it is hard enough to hold that pose without balancing the unbalanced bowls). Therefore, I am very glad that the native machine learning library in SAP HANA, called hana_ml, offers not only cool algorithms to train your predictive models but also functions for over- and undersampling.

For oversampling, the Synthetic Minority Over-sampling Technique (SMOTE) is used. In standard oversampling you would simply copy points from the less frequent category and add them to your data set, which creates a lot of duplicates. The idea behind SMOTE is not to generate duplicates but to create synthetic data points that are slightly different from the original ones. The technical documentation is pretty good, so if you want to try it on your own right away:

hana_ml.algorithms.pal package — hana-ml 2.13.220715 documentation (sap.com)
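If you first want a feeling for what SMOTE does under the hood, the following is a rough sketch of its core step in pure Python: pick a minority point, pick one of its k nearest minority neighbors, and place a new point somewhere on the line between the two. This is an illustration only and not the hana_ml API; the function name `smote_sketch` and its parameters are hypothetical.

```python
import math
import random


def smote_sketch(minority_points, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating between neighbors."""
    if len(minority_points) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority_points) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base_i = rng.randrange(len(minority_points))
        base = minority_points[base_i]
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for j, p in enumerate(minority_points) if j != base_i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Four made-up minority points on the corners of a unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for point in smote_sketch(minority, n_synthetic=3, k=2):
    print(point)
```

Because every new point sits between two existing minority points, the synthetic data stays inside the region the minority class already occupies instead of stacking exact duplicates on top of each other.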

For undersampling, the Tomek links method is used. The idea behind it is to detect pairs of points that are each other’s nearest neighbors but belong to different classes, so-called Tomek links. These points are then removed: you can choose to remove both points or only one of them (traditionally the one belonging to the majority class). My suggestion is to try what works best on your data set. The technical documentation for this can be found here:

hana_ml.algorithms.pal package — hana-ml 2.13.220722 documentation (sap.com)
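For intuition, a Tomek link can be found with a few lines of pure Python: compute each point's nearest neighbor and keep the pairs that point at each other and carry different labels. The sketch below is my own illustration, not the hana_ml API; the function names and the `remove_both` switch are hypothetical, and the brute-force neighbor search is only meant for small toy data.

```python
import math


def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    # A Tomek link: mutual nearest neighbors with different class labels.
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_tomek_links(points, labels, majority, remove_both=False):
    """Drop the majority member of each link, or both members if requested."""
    drop = set()
    for i, j in find_tomek_links(points, labels):
        for idx in (i, j):
            if remove_both or labels[idx] == majority:
                drop.add(idx)
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Made-up 1-D quality measurements: the points 5.0 and 5.2 form a Tomek link.
points = [(0.0,), (1.0,), (2.5,), (5.0,), (5.2,), (9.0,)]
labels = ["good", "good", "good", "good", "bad", "bad"]

print(find_tomek_links(points, labels))
print(remove_tomek_links(points, labels, majority="good"))
```

Removing only the majority member cleans up the class boundary without losing any of the rare cases, which is why it is the traditional choice.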

Of course, you can combine both methods. Luckily there is already such a combined procedure prepared for you in the hana_ml library:

hana_ml.algorithms.pal package — hana-ml 2.13.220722 documentation (sap.com)

I hope that after this quick overview of over- and undersampling with hana_ml you are now super motivated to try it right away. My colleague Yannick Schaper wrote an amazing blog post on how to get started with training your first machine learning model in HANA. His use case is predictive quality and, as usual for such a case, the data set is highly imbalanced. Hence, my suggestion is to build on Yannick’s blog post and use case and challenge yourself to get better results using over- or undersampling. Have fun.
