Predict The Type in Pokemon GO World!

Recently, Niantic Labs launched an event that gives everyone the chance to catch the legendary Pokemon Lugia and Articuno. So far, the company has released only the 151 Pokemon introduced in the first generation, yet there are already so many types that I get confused. That made me curious: is there a way to predict the type of a Pokemon from its given attributes?

The profile of an Articuno, picked at random from the internet

The few-feature problem

After surveying the Pokemon GO data available on the internet, I found a suitable dataset on Kaggle, and I use it for the rest of this work.

In the first generation, there are 17 different types. Before checking the available features, I expected weight, height, stardust cost, or candy to be good attributes. However, only two meaningful attributes are recorded: Max CP and Max HP. That made me nervous: can so few features capture the complex differences between the classes?

Structure selection

In the vast machine learning world, two models are among the most popular: random forest and XGBoost. To keep things simple, I use only random forest for this task. In addition, I also use a DNN (deep neural network) to predict the class.

Precision judgement

The image of Bulbasaur and Pikachu

I use the random forest classifier from sklearn to do the work. However, there is a critical problem: the label of each row can hold only one class. For example, Pikachu has only one type, Electric, while Bulbasaur has two: Grass and Poison. How should this be handled?

I split such a Pokemon into two rows. With this strategy each row has exactly one class, at the cost of generating more rows: a Pokemon with two types occupies two rows, one per type. The format of the data is shown below.

To reshape the data table, the melt function of pandas is useful. Its id_vars parameter takes the list of columns that should stay as they are; all the other columns are unpivoted into rows. The usage is shown below.
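As a minimal sketch of this reshaping step: the column names (`Name`, `Max CP`, `Max HP`, `Type 1`, `Type 2`) and the numbers are assumptions made up for illustration, and the real Kaggle headers may differ.

```python
import pandas as pd

# Hypothetical column names and illustrative numbers (not taken from the dataset).
wide_df = pd.DataFrame({
    "Name": ["Pikachu", "Bulbasaur"],
    "Max CP": [900, 1100],
    "Max HP": [70, 90],
    "Type 1": ["Electric", "Grass"],
    "Type 2": [None, "Poison"],
})

# id_vars are kept as they are; the two type columns are unpivoted into rows.
long_df = pd.melt(
    wide_df,
    id_vars=["Name", "Max CP", "Max HP"],
    value_vars=["Type 1", "Type 2"],
    var_name="Slot",
    value_name="Type",
)

# Drop the empty second type of single-type Pokemon and the helper column.
long_df = (
    long_df.dropna(subset=["Type"])
    .drop(columns="Slot")
    .reset_index(drop=True)
)

print(long_df)
```

After this step Bulbasaur occupies two rows with identical CP and HP, one labelled Grass and one labelled Poison, while Pikachu keeps a single row.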

The DNN, on the other hand, does not have this problem, since its output can represent multiple types at once. For this reason the usual softmax output is not used here; instead, I use a sigmoid function on each output unit to approximate a target vector that may contain more than one 1 (a multi-hot vector).
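To make the difference concrete, here is a small NumPy sketch of a multi-hot target and of sigmoid versus softmax on the same raw outputs. The shortened type list, the logits, and the 0.5 threshold are all assumptions for illustration, not values from my network.

```python
import numpy as np

# Shortened, hypothetical type list; the real task has 17 types.
TYPES = ["Electric", "Grass", "Poison"]


def multi_hot(types, vocabulary=TYPES):
    """Encode one or two types as a vector with a 1 for each type."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    for t in types:
        vec[vocabulary.index(t)] = 1.0
    return vec


def sigmoid(z):
    """Independent probability per output unit."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Probabilities that compete with each other and sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


target = multi_hot(["Grass", "Poison"])   # Bulbasaur -> [0, 1, 1]
logits = np.array([-3.0, 2.5, 2.0])       # made-up raw network outputs

probs_sigmoid = sigmoid(logits)
probs_softmax = softmax(logits)

# With sigmoid, several types can pass the threshold at the same time.
predicted = [t for t, p in zip(TYPES, probs_sigmoid) if p >= 0.5]
print(predicted)
```

Softmax forces the outputs to share a total probability of 1, so it cannot push both Grass and Poison close to 1; the per-unit sigmoid can.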

Lack of feature & training data

In my original DNN, I use 4 layers to predict the classes. However, the performance is very low. My guess is that the two features simply cannot represent the differences between the types.

I use two tricks to improve the result. The first is raising the dimension of the input. From previous experience, I found that an unsuitable input dimension hurts the final performance, so I add an auto-encoder to the network. First, the auto-encoder is trained to fit the original data distribution. Next, the whole classification DNN is trained. The encoded layer has a length of 128. The image illustrates the final structure.

Data shortage is a common problem in machine learning. Two common ways to cope with it are leave-one-out cross-validation and bootstrapping. Random forest already uses the bootstrapping idea. In the end I did not use cross-validation; instead, as my second trick, I chose to generate new data.

Here is the piece of code that generates the new data. The CP and HP are reduced slightly while the type stays the same. Finally, the new rows are appended to the original pandas DataFrame.
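For readers who have not met these two ideas, the following NumPy sketch shows only the index bookkeeping behind them. The function names are hypothetical helpers written for this illustration; this is not code from my project, which did not use cross-validation.

```python
import numpy as np


def leave_one_out_splits(n_rows):
    """Yield (train_indices, test_indices), holding out one row each time."""
    indices = np.arange(n_rows)
    for held_out in indices:
        yield np.delete(indices, held_out), np.array([held_out])


def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement, so some rows repeat."""
    return rng.integers(0, n_rows, size=n_rows)


splits = list(leave_one_out_splits(4))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)

rng = np.random.default_rng(0)
sample = bootstrap_sample(4, rng)
print("bootstrap sample:", sample)
```

Leave-one-out trains one model per row, which is affordable on a small table like this one; bootstrapping is what each tree of a random forest does when it draws its own training sample.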
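The idea can be sketched roughly as follows. The function name `augment`, the column names, the number of copies, and the 5% maximum reduction are assumptions for this illustration rather than the exact values of my script.

```python
import numpy as np
import pandas as pd


def augment(df, copies=3, max_reduction=0.05, seed=0):
    """Append copies of df whose CP and HP are slightly reduced.

    The type label of every generated row stays unchanged.
    """
    rng = np.random.default_rng(seed)
    frames = [df]
    for _ in range(copies):
        fake = df.copy()
        # One random shrink factor per row, between 0.95 and 1.0 by default.
        scale = 1.0 - rng.uniform(0.0, max_reduction, size=len(df))
        fake["Max CP"] = (fake["Max CP"] * scale).round().astype(int)
        fake["Max HP"] = (fake["Max HP"] * scale).round().astype(int)
        frames.append(fake)
    return pd.concat(frames, ignore_index=True)


# Hypothetical columns and illustrative numbers (not taken from the dataset).
original = pd.DataFrame({
    "Max CP": [900, 1100],
    "Max HP": [70, 90],
    "Type": ["Electric", "Grass"],
})

augmented = augment(original)
print(augmented)
```

Because the generated rows are close copies of real ones, they should be created from the training rows only; otherwise near-duplicates of a training row can end up in the test set and inflate the score.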

Final evaluation

Generating new data is not a rigorous way to improve the result, but the performance improved a lot with these two tricks. I use the match rate to evaluate the final model: the fraction of test rows whose class is predicted correctly.

In fact, the two models are evaluated slightly differently. The random forest scores only if the predicted class equals the row's label, whereas the DNN scores if its prediction matches either of the Pokemon's two labels. Because of this difference, the DNN should get a higher score in theory.

After training, the random forest sometimes reaches a match rate of about 0.7. The DNN seldom reaches a match rate of 0.69 on the training data. To my surprise, it reaches 0.88 on the testing data! The table above shows 6 different DNN results.
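The two scoring rules can be written down as a short sketch. The function names and the toy labels are hypothetical, and I assume here that the DNN's prediction is its single highest-scoring type.

```python
def strict_match_rate(y_true, y_pred):
    """Random-forest style: a row scores only if its one label is predicted."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def any_match_rate(true_type_sets, y_pred):
    """DNN style: a Pokemon scores if the prediction is any of its types."""
    hits = sum(p in types for types, p in zip(true_type_sets, y_pred))
    return hits / len(y_pred)


# Toy example. After splitting, Bulbasaur has two rows with identical features,
# so a deterministic classifier returns the same answer for both of them.
rf_true = ["Grass", "Poison", "Electric"]
rf_pred = ["Grass", "Grass", "Electric"]
strict = strict_match_rate(rf_true, rf_pred)

dnn_true = [{"Grass", "Poison"}, {"Electric"}]
dnn_pred = ["Grass", "Electric"]
relaxed = any_match_rate(dnn_true, dnn_pred)

print(strict, relaxed)
```

In this toy case the same underlying answers give 2 out of 3 under the strict rule and 2 out of 2 under the relaxed one, which is why the two match rates are not directly comparable.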

Test it yourself!

I put the whole demonstration here. The single script has also been uploaded to Kaggle. You can train the model first and then test it yourself. When testing, follow the input format shown in the readme there.

Finally, I picked two random examples to test: Dragonite and Pikachu. As you can see, the prediction for Dragonite is correct, but the one for Pikachu is wrong.

The prediction of Dragonite that I found on the internet randomly

The prediction of Pikachu that I found on the internet randomly

We can look at the result from another angle. The following is a scatter plot of the whole dataset, with CP on the x axis and HP on the y axis. As you can see, the regions inside the red rectangles are crowded, with several different types mixed together. If we tried to separate the classes cleanly there, over-fitting might occur. In short, it is not a good idea to determine the class from so few features.

The distribution of the 17 classes

We can use this graph to explain the two examples above. The CP of Dragonite is relatively high. In the graph it sits in the middle region, where the surrounding points are sparse. Pikachu, however, falls inside the lower red rectangle, so it is reasonable that it cannot be classified correctly.

In conclusion, the result might not be the best even though I tried my best to improve the model. It probably needs more features to build a more accurate prediction model.
