Predict The Type in Pokemon GO World!

source link: https://medium.com/@sunnerli/predict-the-type-in-pokemen-go-world-450be8c05529

Recently, Niantic Labs launched an event that gives everyone the chance to catch the legendary pokemon Lugia and Articuno. So far, the company has released only the 151 pokemon introduced in the first generation, and they come in so many different types that I get confused. A question came to mind: is there a way to predict the type of a pokemon from its given attributes?

The profile of Articuno that I found at random on the internet

The few-features problem

After surveying the Pokemon GO data available on the internet, I found an appropriate dataset on Kaggle. I use this dataset for the rest of the work.

In the first generation, there are 17 different types. Before checking the available features, I thought weight, height, stardust cost or candy might be good attributes. However, only two meaningful attributes are recorded: Max CP and Max HP. This makes me nervous: are so few features enough to represent the complex differences between the classes?

Structure selection

In the vast machine learning world, two of the most popular structures are random forest and XGBoost. To keep the work simple, I use only random forest here. In addition, I also use a DNN structure to predict the class.

Precision judgement

The image of Bulbasaur and Pikachu

I use the sklearn random forest object to do the work. However, there is a critical problem: the label of each row can hold only one class. For example, Pikachu has only one type, electric, but Bulbasaur has two types: grass and poison. How do we deal with this situation?

I split such a pokemon into two rows. With this strategy, each row has only one class, at the cost of generating more rows. In the resulting format, a pokemon with two types occupies two rows; Bulbasaur, for example, appears once as grass and once as poison, with the same Max CP and Max HP in both rows.

To reshape the data table, the melt function of pandas is useful! The id_vars parameter accepts the list of columns that should not be reshaped; all the other columns are unpivoted into rows.
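As a minimal sketch of that reshaping step: the column names ("Type 1", "Type 2", "Slot") and the numbers below are assumptions made up for illustration, not values copied from the Kaggle dataset.

```python
import pandas as pd

# Toy rows; column names and numbers are illustrative assumptions.
wide = pd.DataFrame({
    "Name": ["Bulbasaur", "Pikachu"],
    "Max CP": [1000, 800],
    "Max HP": [90, 70],
    "Type 1": ["Grass", "Electric"],
    "Type 2": ["Poison", None],
})

# Columns listed in id_vars are kept as they are; the columns in value_vars
# are unpivoted, producing one row per (pokemon, type slot) pair.
long_df = wide.melt(
    id_vars=["Name", "Max CP", "Max HP"],
    value_vars=["Type 1", "Type 2"],
    var_name="Slot",
    value_name="Type",
)

# Single-type pokemon have no second type, so drop those empty rows.
long_df = long_df.dropna(subset=["Type"]).reset_index(drop=True)
print(long_df)
```

Bulbasaur ends up in two rows (Grass and Poison) with identical Max CP and Max HP, while Pikachu keeps a single row.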

The DNN structure, on the other hand, does not have this problem, because its output can represent multiple types at once. To make this work, the common softmax output is not used in this approach. Instead, I apply a sigmoid function to each output so that the network can approximate a multi-hot vector, that is, a one-hot style vector that may contain more than one 1.
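To make the difference between the two output functions concrete, here is a small numpy sketch. The shortened type list and the logits are made up for illustration; the real model has 17 outputs, and this is not the original network code.

```python
import numpy as np

# Hypothetical, shortened type list; the article uses 17 types.
TYPES = ["Electric", "Grass", "Poison", "Dragon", "Flying"]

def multi_hot(types):
    """Encode a list of type names as a 0/1 vector with one slot per type."""
    vec = np.zeros(len(TYPES), dtype=np.float32)
    for t in types:
        vec[TYPES.index(t)] = 1.0
    return vec

def sigmoid(z):
    """Element-wise logistic function; every output is scored independently."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Softmax for comparison; the outputs compete and always sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

target = multi_hot(["Grass", "Poison"])           # Bulbasaur: two ones
logits = np.array([-3.0, 4.0, 3.5, -2.0, -4.0])   # made-up network outputs

print(target)
print(sigmoid(logits).round(2))   # Grass and Poison can both be close to 1
print(softmax(logits).round(2))   # the probability mass is split between them
```

With sigmoid outputs, each type is judged on its own, so a two-type target can be fitted without the two correct classes competing against each other.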

Lack of feature & training data

In my original DNN structure, I use 4 layers to approximate the multiple classes. However, the performance is very poor. I suspect the features simply cannot capture the differences between the classes.

I use two tricks to improve the result. The first one is raising the dimension. From previous experience, I found that an improper dimension hurts the final performance, so an auto-encoder is added to my structure. First, the auto-encoder is trained to fit the original data distribution. Next, the whole classification DNN is trained. The length of the encoded layer is 128. The image illustrates the final structure.

The second trick addresses the lack of data. Data shortage is a common problem in the machine learning world, and there are two usual ways to cope with it. The first is leave-one-out cross validation. The other is bootstrapping, a concept that the random forest structure already uses internally. In the end, however, I didn't use cross validation. Instead, I chose to generate new data.
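For readers who have not met the two resampling ideas, the sketch below shows how each one picks row indices. It is a generic illustration with hypothetical helper names, not code from this project.

```python
import numpy as np

def leave_one_out_splits(n_rows):
    """Yield (train_idx, test_idx) pairs; each row is the test set exactly once."""
    all_idx = np.arange(n_rows)
    for i in range(n_rows):
        yield np.delete(all_idx, i), np.array([i])

def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement; rows never drawn are out-of-bag."""
    picked = rng.integers(0, n_rows, size=n_rows)
    out_of_bag = np.setdiff1d(np.arange(n_rows), picked)
    return picked, out_of_bag

rng = np.random.default_rng(0)
splits = list(leave_one_out_splits(5))
picked, oob = bootstrap_sample(5, rng)

print(len(splits), splits[0])   # 5 splits; the first one holds out row 0
print(picked, oob)              # resampled rows and the rows left out
```

Leave-one-out reuses every row for both training and testing, while bootstrapping builds many slightly different training sets from the same rows.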

The new data is generated as follows: the CP and HP are reduced slightly while the corresponding type stays the same. Finally, the new rows are appended to the original pandas DataFrame object.
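A minimal sketch of this kind of augmentation is given here. The function name augment, the column names, the number of copies and the 5% maximum reduction are all assumptions chosen for illustration; the original script may use different values.

```python
import numpy as np
import pandas as pd

def augment(df, n_copies=3, max_drop=0.05, seed=0):
    """Return df followed by n_copies slightly reduced copies of every row.

    Each copy scales Max CP and Max HP by a random factor in
    [1 - max_drop, 1) and keeps the Type label unchanged.
    """
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        new = df.copy()
        factor = rng.uniform(1.0 - max_drop, 1.0, size=len(df))
        new["Max CP"] = (df["Max CP"] * factor).round().astype(int)
        new["Max HP"] = (df["Max HP"] * factor).round().astype(int)
        copies.append(new)
    # Append the generated rows to the original frame.
    return pd.concat([df, *copies], ignore_index=True)

# Toy rows with made-up numbers, for illustration only.
toy = pd.DataFrame({
    "Name": ["Pikachu", "Bulbasaur", "Bulbasaur"],
    "Max CP": [800, 1000, 1000],
    "Max HP": [70, 90, 90],
    "Type": ["Electric", "Grass", "Poison"],
})
bigger = augment(toy)
print(len(toy), "->", len(bigger))
```

Because the label is copied unchanged, each class keeps its share of the data while the model sees more points around every original (CP, HP) pair.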

Final evaluation

Generating new data is not a formal method for improving the result, but the performance improved a lot with these two tricks. I use the match rate to evaluate the final model. The match rate is the fraction of test rows whose class is predicted correctly.

In fact, the two models are evaluated slightly differently. The random forest scores only if the predicted class is exactly the class of the row. The DNN structure, however, scores if its prediction matches either of the two labels. Because of this difference, the DNN structure should get a higher score in theory.
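The two scoring rules can be sketched as follows. The function names and the predictions are hypothetical and made up for illustration; they are not results from the trained models.

```python
import numpy as np

def strict_match_rate(pred, truth):
    """Random forest style: a row scores only if its single label is predicted."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.mean(pred == truth))

def any_match_rate(pred, truth_sets):
    """DNN style: a row scores if the prediction is any of its true types."""
    hits = [p in types for p, types in zip(pred, truth_sets)]
    return float(np.mean(hits))

# Made-up predictions for Pikachu and Bulbasaur.
# Bulbasaur fills two rows in the split table but one row in the DNN table.
rf_pred = ["Electric", "Grass", "Grass"]
rf_truth = ["Electric", "Grass", "Poison"]
dnn_pred = ["Electric", "Grass"]
dnn_truth = [{"Electric"}, {"Grass", "Poison"}]

print(strict_match_rate(rf_pred, rf_truth))   # 2 of 3 rows match
print(any_match_rate(dnn_pred, dnn_truth))    # 2 of 2 rows match
```

In this toy case the two Bulbasaur rows have identical features, so a single-label model can get at most one of them right, which is one reason the strict score tends to be lower.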

After training, the random forest sometimes reaches a match rate of about 0.7. The DNN structure, however, seldom reaches a match rate of 0.69 on the training data. To my surprise, it gets a match rate of 0.88 on the testing data! The table above shows six different DNN results.

Now it's your turn to test!

I put the whole demonstration here. The single script has also been uploaded to this Kaggle page. You can train the model first and then test it yourself. During testing, you should follow the input format shown in the readme on that page.

Finally, I pick two examples at random and test them. The first one is Dragonite and the other is Pikachu. As you can see, the prediction for Dragonite is correct, but the prediction for Pikachu is wrong.

The prediction for the Dragonite that I found at random on the internet

The prediction for the Pikachu that I found at random on the internet

We can look at the result from another angle. The following is the scatter plot of the whole data. The x axis is CP and the y axis is HP. As you can see, the regions marked by red rectangles are very crowded, with several different classes mixed together. If we tried to separate the classes cleanly there, over-fitting might occur. Put simply, it is not a good idea to determine the class from so few features.

The distribution of the 17 classes

We can use this graph to explain the two examples above. The CP of Dragonite is relatively high. In the graph, it lies in the middle region, where the surrounding points are not tightly packed. Pikachu, however, lies in the lower red rectangle region. As a result, it is reasonable that its type cannot be determined correctly.

In conclusion, the result might not be the best, even though I tried my best to improve the structure. It probably needs more features to build a more accurate prediction model.

