This project was done in November 2017. The methods used may be outdated.


# Abstract

In this project, I use the Pokemon dataset from Kaggle and focus on the attribute **Legendary**. I build models to predict whether a given Pokemon is Legendary using five classification methods from machine learning.

# Experiment Design

I use Principal Component Analysis (PCA) to reduce the dimensionality of the data. On the transformed dataset, I train *Decision Tree*, *K-nearest Neighbors*, *Random Forest*, *AdaBoost* and *Support Vector Machine* classifiers. I also tune the model parameters with cross validation and grid search. Finally, I compare all the models and identify the ones with the best performance.

# Data

```
import pandas as pd
import numpy as np

# load the dataset (adjust the path to your local copy)
pkm = pd.read_csv('/Users/.../Desktop/Pokemon.csv')
pkm.head()
```

This dataset contains 721 Pokemon with their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, Speed, plus Generation and Legendary. I use all other attributes to classify the attribute Legendary.

# Variable Selection

To prepare for PCA, I first drop the categorical attributes (Type 1 and Type 2) and the attributes with no predictive value here (the id column `#`, Name and Generation).

```
pkm1 = pkm.drop(['#', 'Name', 'Generation'], axis=1)
pkm_attributes = pkm1.drop(['Legendary', 'Type 1', 'Type 2'], axis=1)
```

Then I apply PCA.

```
from sklearn.decomposition import PCA

# keep all components for now so we can inspect the explained variance
pca = PCA(n_components=pkm_attributes.shape[1])
fit = pca.fit(pkm_attributes).transform(pkm_attributes)
```

Then I visualize the cumulative explained variance to decide how many components to keep.

```
import matplotlib.pyplot as plt

# cumulative explained variance in percent
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4) * 100)
var1 = np.insert(var1, 0, 0)
plt.plot(var1)
axes = plt.gca()
axes.set_ylim([40, 110])
plt.show()
```

Based on the elbow in the curve, I choose to set **n=3**.
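As a check on the elbow choice, scikit-learn can also pick the number of components automatically for a target variance fraction. A minimal sketch on synthetic data (the array `X` here is a stand-in, not the Pokemon stats):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in with one dominant-variance feature
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 9))
X[:, 0] *= 10

# a float n_components asks for the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
pca.fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```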

```
pca = PCA(n_components=3)
fit = pca.fit(pkm_attributes).transform(pkm_attributes)
fit1 = pd.DataFrame(fit, columns=['c1', 'c2', 'c3'])
df = pd.concat([fit1, pkm1['Legendary']], axis=1)
df.head()
```

`df` now holds the transformed attributes used for classification.

# Classification Model

## Split into train and test

In the following part, I recode **Legendary** so that **False = 1, True = 0**.

```
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# flip the boolean labels: True (Legendary) -> 0, False -> 1
leg = abs(df['Legendary'].values - 1)
X_train, X_test, y_train, y_test = train_test_split(fit, leg, test_size=0.33, random_state=42)
```
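The `abs(... - 1)` trick above flips boolean labels; a tiny sketch of what it does to an example array:

```python
import numpy as np

legendary = np.array([True, False, False, True])
# subtracting 1 turns True (1) into 0 and False (0) into -1; abs flips the sign
flipped = abs(legendary.astype(int) - 1)
print(flipped.tolist())  # [0, 1, 1, 0]
```

An equivalent and arguably clearer form is `(~legendary).astype(int)`.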

## Decision Tree

I use cross validation on the train set to find the best value of k for the *maximum depth* of the tree.

```
from sklearn import tree
from sklearn.model_selection import cross_val_score
# candidate values for max_depth
myList = list(range(10, 30))
# empty list that will hold cross validation scores
cv_scores = []
# perform 10-fold cross validation
for k in myList:
    clf = tree.DecisionTreeClassifier(max_depth=k)
    scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
# convert to misclassification error
MSE = [1 - x for x in cv_scores]
```
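The same tuning loop can be written more compactly with `GridSearchCV`, which the SVM section below also uses. A sketch with the iris data as a stand-in for `X_train, y_train`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for X_train, y_train

# 10-fold CV over every candidate max_depth in one call
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': list(range(1, 10))},
                    cv=10, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_['max_depth'], round(grid.best_score_, 3))
```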

I pick the k with the lowest misclassification error and visualize the curve.

```
# determining best k
optimal_k = myList[MSE.index(min(MSE))]
print("The optimal max depth is %d" % optimal_k)
# plot misclassification error vs k
plt.plot(myList, MSE)
plt.xlabel('Max Depth k')
plt.ylabel('Misclassification Error')
plt.show()
```

The optimal max depth is 14.

Then I build the best Decision Tree model and print the confusion matrix.

```
from sklearn.metrics import confusion_matrix
clf = tree.DecisionTreeClassifier(max_depth=14)
clf.fit(X_train, y_train)
y_predict_dt = clf.predict(X_test)
c_df = confusion_matrix(y_test, y_predict_dt)
c_df
# definition is here
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
```

Confusion Matrix

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 12          | 4           |
| Actual 1 | 6           | 242         |

## K-nearest Neighbor

In this model, I search for the best k, the number of neighbors.

```
from sklearn.neighbors import KNeighborsClassifier
# creating odd list of K for KNN
myList = list(range(1,50))
# subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))
# empty list that will hold cross validation scores
cv_scores = []
# perform 10-fold cross validation
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
# convert to misclassification error
MSE = [1 - x for x in cv_scores]
```

Then I print the best k and visualize the misclassification error as before.

```
# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
```

The optimal number of neighbors is 13

Then I build the best KNN model and print the confusion matrix.

```
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)
y_predict_knn = knn.predict(X_test)
confusion_matrix(y_test, y_predict_knn)
# http://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html
```

Confusion Matrix

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 14          | 2           |
| Actual 1 | 10          | 238         |

## Random Forest

In this model I tune `n_estimators`, the number of trees in the forest.

```
from sklearn.ensemble import RandomForestClassifier
# candidate values for n_estimators
myList = list(range(1, 50, 2))
# empty list that will hold cross validation scores
cv_scores = []
# perform 10-fold cross validation
for k in myList:
    clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=k)
    scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
# convert to misclassification error
MSE = [1 - x for x in cv_scores]
```

Then print the best k and visualize the results.

```
# determining best k
optimal_k = myList[MSE.index(min(MSE))]
print("The optimal value of n_estimators is %d" % optimal_k)
# plot misclassification error vs k
plt.plot(myList, MSE)
plt.xlabel('Number of Estimators')
plt.ylabel('Misclassification Error')
plt.show()
```

The optimal value of n_estimators is 5.

Then I build the best Random Forest model and print the confusion matrix.

```
from sklearn.ensemble import RandomForestClassifier
#http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=5)
clf.fit(X_train, y_train)
y_predict_rf = clf.predict(X_test)
confusion_matrix(y_test, y_predict_rf)
```

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 8           | 8           |
| Actual 1 | 8           | 240         |
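As a side note, random forests also offer an out-of-bag error estimate that can replace cross validation when tuning `n_estimators`. A sketch on synthetic data (the dataset is a stand-in, not the Pokemon data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
# each tree's bootstrap sample leaves ~37% of rows out;
# those held-out rows form a free validation set, scored via oob_score_
clf = RandomForestClassifier(n_estimators=50, max_depth=2,
                             oob_score=True, random_state=0)
clf.fit(X, y)
print(round(clf.oob_score_, 3))
```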

## AdaBoost

In this model I search over the number of estimators, i.e. the maximum number of rounds at which boosting is terminated.

```
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# candidate values for n_estimators
myList = list(range(100, 1000, 100))
# empty list that will hold cross validation scores
cv_scores = []
# perform 10-fold cross validation
for k in myList:
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             algorithm="SAMME",
                             n_estimators=k)
    scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
# convert to misclassification error
MSE = [1 - x for x in cv_scores]
```

Then determine the best parameter and visualize it.

```
# determining best k
optimal_k = myList[MSE.index(min(MSE))]
print("The optimal value of n_estimators is %d" % optimal_k)
# plot misclassification error vs k
plt.plot(myList, MSE)
plt.xlabel('Number of Estimators')
plt.ylabel('Misclassification Error')
plt.show()
```

The optimal value of n_estimators is 200.

At last, I build the optimal AdaBoost model and print the confusion matrix.

```
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
bdt.fit(X_train, y_train)
y_predict_ab = bdt.predict(X_test)
c_ada = confusion_matrix(y_test, y_predict_ab)
c_ada
```

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 13          | 3           |
| Actual 1 | 7           | 241         |
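Refitting AdaBoost for every candidate `n_estimators` is what makes it slow here. AdaBoost exposes `staged_score`, which evaluates every intermediate ensemble from a single fit. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)
bdt.fit(X_tr, y_tr)
# staged_score yields the test accuracy after 1, 2, ... boosting rounds
scores = list(bdt.staged_score(X_te, y_te))
best_rounds = scores.index(max(scores)) + 1
print(best_rounds, round(max(scores), 3))
```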

## SVM

Here I use grid search to tune the SVM, so there is no curve to visualize as in the previous models.

```
from sklearn import svm
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(svm.SVC(),
                    param_grid={"C": [0.001, 0.1, 1, 10],
                                "gamma": [1, 0.1, 0.01, 0.001]},
                    cv=4)
grid.fit(X_train, y_train)
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
```

The best parameters are {'C': 1, 'gamma': 0.001} with a score of 0.95.

Then build the best model of SVM and print the Confusion Matrix.

```
clf = svm.SVC(C=1, gamma=0.001)
clf.fit(X_train, y_train)
y_predict_svm = clf.predict(X_test)
c_svm = confusion_matrix(y_test, y_predict_svm)
c_svm
```

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 7           | 9           |
| Actual 1 | 6           | 242         |
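One caveat worth noting: RBF-kernel SVMs are sensitive to feature scale, and the PCA components have very different variances. A scaler can be folded into the grid search with a pipeline; a sketch on synthetic data (the parameter grid here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
# the scaler is refit inside every CV split, avoiding leakage
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe,
                    param_grid={'svc__C': [0.1, 1, 10],
                                'svc__gamma': [1, 0.1, 0.01]},
                    cv=4)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```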

# Discussion

In this project, I implement different classification algorithms on three variables generated by PCA. Since Legendary Pokemon make up only a very small portion of the dataset, the performances do not differ greatly. Among all algorithms, decision tree and AdaBoost achieve the highest accuracy. The drawback of AdaBoost is that it runs much more slowly than all the other algorithms.

Among all models, Random Forest has the worst performance; the reason might lie in the structure of the dataset.

Tuning helps a lot: every model performs better with tuned parameters than with the default values.

The Jupyter notebook file can be found on my GitHub.

# Conclusion

Based on the confusion matrices, I find that decision tree and AdaBoost have the best accuracy among all these models.

**Accuracy**: Both models achieve an accuracy of **96.21%**.

**Specificity**: The specificity for decision tree is **75%**, for AdaBoost **81.25%**.

**Precision**: The precision for decision tree is **98.37%**, for AdaBoost **99.18%**.

**Sensitivity**: The sensitivity for decision tree is **97.58%**, for AdaBoost **97.18%**.

While they have the same accuracy, AdaBoost beats decision tree on two of the other three metrics, so I conclude that **AdaBoost is the best model in this project**.
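These metrics come straight from the confusion matrix entries; a quick sketch for the decision tree, with class 1 ("not Legendary") treated as the positive class:

```python
import numpy as np

# decision tree confusion matrix from above (rows = actual, cols = predicted)
cm = np.array([[12, 4],
               [6, 242]])
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
specificity = tn / (tn + fp)   # 12 / 16
precision = tp / (tp + fp)     # 242 / 246
sensitivity = tp / (tp + fn)   # 242 / 248
print(round(accuracy, 4), specificity,
      round(precision, 4), round(sensitivity, 4))
```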