FIFA21 EPL analysis
Supervised learning - predicting FIFA21 players' positions
- Import python modules
- Web-scrape FIFA21 EPL players' stats
- Data cleaning
- Top players of FIFA21
- Feature engineering
- Model selection
- Evaluate models with the test set
Objective of this notebook:
- Write my own web scraper
- Visualise FIFA21 player data
- Predict players' positions with supervised learning
Notebook summary:
- Create a python script (BeautifulSoup) to scrape FIFA21 EPL player data from https://sofifa.com/
- Clean the data
- Extract insights from the data
- Simplify positions
- Train/test split via stratified sampling
- Predict position => ~90% accuracy and F1 score on the test set via SVM
- Midfielders are the hardest position to predict
Inspired by this analysis from Kaggle: https://www.kaggle.com/younessennadj/fifa19-analytics-predict-position#4.Divide-the-data-to-train-and-test-datasets
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as bs
import time
import re
from random import randrange
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
First, check whether scraping is allowed: sofifa.com's robots.txt indicates that scraping is permitted.
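A quick way to sanity-check a robots.txt policy is `urllib.robotparser` from the standard library. The rules below are a made-up example for illustration, not sofifa.com's actual policy (check https://sofifa.com/robots.txt yourself):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules -- replace with the site's real policy
sample_rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(sample_rules)

# Player pages are allowed; the (made-up) /admin/ path is not
print(rp.can_fetch("*", "https://sofifa.com/players"))       # True
print(rp.can_fetch("*", "https://sofifa.com/admin/secret"))  # False
```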
Scraping procedure:
- Search for Premier League players in FIFA21
- Results are stored in a table, 60 players per page
- Open each page and grab the links to the individual player pages
Note: scrape responsibly. Add a random 2-5 second delay between page requests.
links = []
for x in range(1):  # set to range(11) to cover all ~644 players (60 per page)
    # make a search and scrape the webpage -------------------------------------------------
    base_url = 'https://sofifa.com/players?type=all&lg%5B%5D=13'
    if x > 0:
        add_url = '&offset=' + str(x*60)
    else:
        add_url = ''
    r = requests.get(base_url + add_url)
    # parse the html with BeautifulSoup ----------------------------------------------------
    webpage = bs(r.content, 'html.parser')
    table = webpage.find('table')
    rows = table.find_all('tr')
    for row in rows[1:]:
        link = row.find_all("a", {"class": "tooltip"})[0]
        links.append(link['href'])
    # print progress on scraping links
    print("Page", x, "done")
    # be sure to pause between requests
    time.sleep(randrange(2, 6))
Number of players in EPL:
len(links)
The top 10 players, sorted by overall rating:
links[:10]
OK, we have 644 players registered in the EPL in FIFA21. That is about 32 players per club, which sounds about right.
Note: I have already scraped all of them using the same functions below. For this demo, I will only scrape the first 10 players.
links3 = links[:10]
Players to scrape (first 5 shown):
links3[:5]
Helper functions to scrape each player page:
list_stats_top10 = []

def player_name(weblink):
    return weblink.find('h1').text

def nat(weblink):
    return weblink.find('div', {"class": "meta bp3-text-overflow-ellipsis"}).a['title']

def dob_wh(weblink):
    stuff = weblink.find('div', {"class": "meta bp3-text-overflow-ellipsis"}).text
    temp = stuff.split("(")[1]
    dob = temp.split(")")[0]
    temp2 = temp.split(")")[1].split(" ")
    height = temp2[1]
    weight = temp2[2]
    return dob, height, weight

def club_info(weblink):
    club = weblink.find(text=re.compile('Player Specialities')).parent.findNext('h5').text
    jersey = weblink.find(text=re.compile('Jersey Number')).next
    c_valid = weblink.find(text=re.compile('Contract Valid Until')).next
    c_value = weblink.find('section', {"class": "card spacing"}).find(text=re.compile('Value')).previous.previous
    wage = weblink.find('section', {"class": "card spacing"}).find(text=re.compile('Wage')).previous.previous
    return club, jersey, c_valid, c_value, wage

def player_stats(weblink):
    best_pos = weblink.find(text=re.compile('Best Position')).next.text
    best_rating = weblink.find(text=re.compile('Best Overall Rating')).next.text
    return best_pos, best_rating

def player_stats_detail(weblink):
    # attacking stats
    temp = weblink.find(text=re.compile('Attacking')).parent.parent.find_all("li")
    keys = ['Crossing', 'Finishing', 'Heading Accuracy', 'Short Passing', 'Volleys']
    for index, attrs in enumerate(temp):
        temp2 = attrs.find_all('span')
        stats[keys[index]] = temp2[0].text
    # skill stats
    temp = weblink.find(text=re.compile('Attacking')).parent.parent.parent.find_next_sibling("div").find_all("li")
    keys = ['Dribbling', 'Curve', 'FK Accuracy', 'Long Passing', 'Ball Control']
    for index, attrs in enumerate(temp):
        temp2 = attrs.find_all('span')
        stats[keys[index]] = temp2[0].text
    # movement stats ('Spring Speed' typo is kept here on purpose; the column is renamed during cleaning)
    temp = weblink.find(text=re.compile('Movement')).parent.parent.find_all("li")
    keys = ['Acceleration', 'Spring Speed', 'Agility', 'Reactions', 'Balance']
    for index, attrs in enumerate(temp):
        temp2 = attrs.find_all('span')
        stats[keys[index]] = temp2[0].text
    # power stats
    temp = weblink.find(text=re.compile('Power')).parent.parent.find_all("li")
    keys = ['Shot Power', 'Jumping', 'Stamina', 'Strength', 'Long Shots']
    for index, attrs in enumerate(temp):
        temp2 = attrs.find_all('span')
        stats[keys[index]] = temp2[0].text
    # mentality stats
    temp = weblink.find(text=re.compile('Mentality')).parent.parent.find_all("li")
    keys = ['Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure']
    for index, attrs in enumerate(temp):
        temp2 = attrs.find_all('span')
        stats[keys[index]] = temp2[0].text
    # defending stats
    temp = weblink.find(text=re.compile('Defending')).parent.parent.find_all("li")
    keys = ['Defensive Awareness', 'Standing Tackle', 'Sliding Tackle']
    for index, attrs in enumerate(temp):
        temp2 = attrs.find_all('span')
        stats[keys[index]] = temp2[0].text
    # goalkeeping stats
    temp = weblink.find(text=re.compile('Goalkeeping')).parent.parent.find_all("li")
    keys = ['GK Diving', 'GK Handling', 'GK Kicking', 'GK Positioning', 'GK Reflexes']
    for index, attrs in enumerate(temp):
        temp2 = attrs.find_all('span')
        stats[keys[index]] = temp2[0].text
    # traits (not every player has them)
    try:
        temp = weblink.find(text=re.compile('Traits')).parent.parent.find_all("li")
        for attrs in temp:
            if 'Traits' in stats:
                stats['Traits'].append(attrs.text)
            else:
                stats['Traits'] = [attrs.text]
    except AttributeError:  # no Traits section on the page
        stats['Traits'] = None
for index, link in enumerate(links3):
    # fetch and parse the player page ------------------------------------------------------
    base_url = 'https://sofifa.com/'
    add_url = link
    r = requests.get(base_url + add_url)
    weblink = bs(r.content, 'html.parser')
    stats = {}
    # get player name
    name = player_name(weblink)
    stats['Player_name'] = name
    # get nationality
    nationality = nat(weblink)
    stats['Nationality'] = nationality
    # get dob, height, weight
    dob, height, weight = dob_wh(weblink)
    stats['dob'] = dob
    stats['height'] = height
    stats['weight'] = weight
    # get club info
    club, jersey, c_valid, c_value, wage = club_info(weblink)
    stats['club'] = club
    stats['jersey'] = jersey
    stats['c_valid'] = c_valid
    stats['c_value'] = c_value
    stats['wage'] = wage
    # add general player stats
    pos, rating = player_stats(weblink)
    stats['pos'] = pos
    stats['rating'] = rating
    # add detailed player stats (fills the `stats` dict)
    player_stats_detail(weblink)
    # record and print progress ------------------------------------------------------------
    list_stats_top10.append(stats)
    if index % 10 == 0:
        print(index)
    # be sure to pause between page requests
    time.sleep(randrange(2, 6))
import json

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_data(title):
    with open(title, encoding='utf-8') as f:
        return json.load(f)
#save_data('FIFA21_EPL_top10.json', list_stats_top10)
import os
os.chdir("C:/Users/Riyan Aditya/Desktop/ML_learning/Project6_EPL_20192020")
FIFA21_data = load_data('FIFA21_EPL.json')
Let's look at the first record, Kevin De Bruyne:
FIFA21_data[0]
df = pd.DataFrame(FIFA21_data)
Few things to do:
- Everything is a string. Convert to numeric where needed, especially the individual stats
- Convert DOB to datetime
- Convert height and weight to SI units
- Convert value and wage to plain numbers (no "M" or "K" suffixes)
- What to do with player traits? Perhaps ignore them for now
First, let's look at the traits:
- How many unique traits are there?
- Proportion of players that have traits?
- Is it worth keeping?
Players with no traits:
df['Traits'].isnull().values.ravel().sum()
199 of 644 players (~30%) do not have any traits
Top 5 players that have no traits:
df[df['Traits'].isnull()][:5].Player_name
Wow, Kanté doesn't have any traits? This could be a mistake in the FIFA21 database.
Unique traits:
df.Traits.explode().unique()
print("number of unique traits :",len(df.Traits.explode().unique()))
26 unique traits. One-hot encoding them would add 26 mostly-empty columns.
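For reference, this is roughly what the expansion would look like on toy data (hypothetical trait names); with the 26 real traits it would add 26 mostly-zero columns:

```python
import pandas as pd

# Toy frame mirroring the scraped column: a list of traits per player, or None
toy = pd.DataFrame({
    "Player_name": ["A", "B", "C"],
    "Traits": [["Leadership", "Flair"], None, ["Flair"]],
})

# explode() yields one row per trait; get_dummies + groupby-max folds them
# back down to one row per player
dummies = pd.get_dummies(toy.explode("Traits")["Traits"]).groupby(level=0).max()
ohe = toy.join(dummies)
print(ohe[["Player_name", "Flair", "Leadership"]])
```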
Is it worth keeping?
Probably not. Remove them for now
df2 = df.copy()
df2 = df2.drop(labels='Traits', axis=1)
df2.columns
# fix mistake with column name
df2.rename(columns={'Spring Speed':'Sprint Speed'}, inplace=True)
DOB to datetime
df2['dob'] = pd.to_datetime(df2['dob'])
df2.dob.dtype
Height to numeric & SI
def height_conversion(ht):
    # current format is x'x"
    ht2 = ht.split("'")
    ft = float(ht2[0])
    inc = float(ht2[1].replace("\"", ""))  # " is a special character
    return round(((12 * ft) + inc) * 2.54, 0)  # feet/inches -> cm
df2['height'] = df2['height'].apply(lambda x:height_conversion(x))
df2.height.dtype
Weight to numeric & SI
def weight_conversion(wt):
    # current format is xxxlbs
    wt2 = wt.split('lbs')
    return round(float(wt2[0]) * 0.453592, 0)  # lbs -> kg
df2['weight'] = df2['weight'].apply(lambda x:weight_conversion(x))
df2.weight.dtype
Contract value and wage to numeric
value_dict = {"K": 1000, "M": 1000000}

def money_conversion(money):
    # current format is €xxxK or €xx.xM
    money = money.replace('€', '')
    if money[-1] in value_dict:
        num, values = money[:-1], money[-1]
        return float(num) * value_dict[values]
    return float(money)  # no suffix, e.g. '€0'
df2['c_value'] = df2['c_value'].apply(lambda x:money_conversion(x))
df2['wage'] = df2['wage'].apply(lambda x:money_conversion(x))
df2.c_value.dtype, df2.wage.dtype
Convert jersey, rating and individual stats attributes to numeric
df2['jersey'] = pd.to_numeric(df2['jersey'], errors='coerce')
df2['rating'] = pd.to_numeric(df2['rating'], errors='coerce')
for col in df2.columns[12:]:
    df2[col] = pd.to_numeric(df2[col], errors='coerce')
df3 = df2.copy()
Top rated players from each club
Note that there can be multiple top rated players per club
idx = df3.groupby(['club'])['rating'].transform(max) == df3['rating']
top_rated = df3[idx][['club','Player_name','pos','rating']]
top_rated.sort_values('club')
Top rated players per position
These positions are based on what FIFA21 recommends as the "Best Position"
idx = df3.groupby(['pos'])['rating'].transform(max) == df3['rating']
top_rated = df3[idx][['pos','club','Player_name','rating']]
# create custom sort so this makes positional sense
custom_dict = {'GK':0, 'CB':1, 'LB':2, 'RB':3, 'LWB':4, 'RWB':5, 'CDM':6, 'CM':7, 'CAM':8, 'RM':9, 'LM':10,
'RW':11, 'LW':12, 'CF':13, 'ST':14}
top_rated['rank'] = top_rated['pos'].map(custom_dict)
top_pos = top_rated.sort_values('rank')
top_pos.drop(labels=['rank'], axis=1)
Some weird results here. Right forward maybe, but I would not put Rashford down as a right midfielder.
Top rated players per country of origin
Unique nationalities of Prem players
df3.Nationality.unique().shape
List of top players per nationality of origin:
idx = df3.groupby(['Nationality'])['rating'].transform(max) == df3['rating']
top_rated = df3[idx][['Nationality','club','Player_name','pos','rating']]
top_rated.sort_values('Nationality')
Top players based on individual stats
Best crosser
df3[['Player_name','club','pos','Crossing']].loc[df3['Crossing'].idxmax()]
Best short passer
df3[['Player_name','club','pos','Short Passing']].loc[df3['Short Passing'].idxmax()]
Best long passer
df3[['Player_name','club','pos','Long Passing']].loc[df3['Long Passing'].idxmax()]
Best header
df3[['Player_name','club','pos','Heading Accuracy']].loc[df3['Heading Accuracy'].idxmax()]
Best finisher
df3[['Player_name','club','pos','Finishing']].loc[df3['Finishing'].idxmax()]
Best FK taker
df3[['Player_name','club','pos','FK Accuracy']].loc[df3['FK Accuracy'].idxmax()]
Best PK taker
df3[['Player_name','club','pos','Penalties']].loc[df3['Penalties'].idxmax()]
Best volleyer
df3[['Player_name','club','pos','Volleys']].loc[df3['Volleys'].idxmax()]
Highest shot power
df3[['Player_name','club','pos','Shot Power']].loc[df3['Shot Power'].idxmax()]
Fastest player (sprint speed and acceleration)
df3[['Player_name','club','pos','Sprint Speed']].loc[df3['Sprint Speed'].idxmax()]
df3[['Player_name','club','pos','Acceleration']].loc[df3['Acceleration'].idxmax()]
Best dribbler and ball control
df3[['Player_name','club','pos','Dribbling']].loc[df3['Dribbling'].idxmax()]
df3[['Player_name','club','pos','Ball Control']].loc[df3['Ball Control'].idxmax()]
Best stamina
df3[['Player_name','club','pos','Stamina']].loc[df3['Stamina'].idxmax()]
Best strength
df3[['Player_name','club','pos','Strength']].loc[df3['Strength'].idxmax()]
Best positioning
df3[['Player_name','club','pos','Positioning']].loc[df3['Positioning'].idxmax()]
Best vision
df3[['Player_name','club','pos','Vision']].loc[df3['Vision'].idxmax()]
Best composure
df3[['Player_name','club','pos','Composure']].loc[df3['Composure'].idxmax()]
Best defensive awareness
df3[['Player_name','club','pos','Defensive Awareness']].loc[df3['Defensive Awareness'].idxmax()]
Interceptions
df3[['Player_name','club','pos','Interceptions']].loc[df3['Interceptions'].idxmax()]
Best sliding tackle
df3[['Player_name','club','pos','Sliding Tackle']].loc[df3['Sliding Tackle'].idxmax()]
Best GK reflexes
df3[['Player_name','club','pos','GK Reflexes']].loc[df3['GK Reflexes'].idxmax()]
Best GK kicking
df3[['Player_name','club','pos','GK Kicking']].loc[df3['GK Kicking'].idxmax()]
Tallest player (in cm)
df3[['Player_name','club','pos','height']].loc[df3['height'].idxmax()]
Heaviest player (kg)
df3[['Player_name','club','pos','weight']].loc[df3['weight'].idxmax()]
Top players by wages and contract value
Highest wages
df3[['Player_name','club','pos','wage']].loc[df3['wage'].idxmax()]
Highest contract values
df3[['Player_name','club','pos','c_value']].loc[df3['c_value'].idxmax()]
Top earner per club
idx = df3.groupby(['club'])['wage'].transform(max) == df3['wage']
top_rated = df3[idx][['club','Player_name','pos','rating','wage']]
top_rated.sort_values('club')
Total wages per club
grouped = df3.groupby('club')['wage'].sum().reset_index()
grouped.sort_values('wage', ascending=False)
I am not sure the wages in FIFA21 reflect these players' real-world wages
We have 16 different positions; we need to simplify this.
Let's group them into the 4 traditional positions: Goalkeeper, Defender, Midfielder, Forward. For simplicity:
- GK: GK
- DF: CB, LB, RB, LWB, RWB
- MF: CDM, CM, CAM, RM, LM,
- FW: RW, LW, CF, ST
def repos(pos):
    if pos == 'GK':
        return 'GK'
    elif pos[-1] == 'B':
        return 'DF'
    elif pos[-1] == 'M':
        return 'MF'
    else:
        return 'FW'
df3['pos2'] = df3.apply(lambda x: repos(x['pos']), axis=1)
pos_count = df3['pos2'].value_counts()
pos_count
import plotly.express as px
from IPython.display import HTML
cat_order = ['GK','DF','MF','FW']
fig = px.bar(pos_count.reindex(cat_order))
fig.update_layout(yaxis_title="Count")
fig.update_layout(xaxis_title="Position")
fig.update_layout(showlegend=False)
#fig.show()
HTML(fig.to_html(include_plotlyjs='cdn'))
Wage and rating per position
fig2 = px.scatter(df3, x="rating", y="wage", color="pos2", hover_data=['Player_name'])
HTML(fig2.to_html(include_plotlyjs='cdn'))
Interesting trend here. Forwards tend to be the most expensive, while midfielders come relatively cheap.
Rating vs position
df3['pos2'] = pd.Categorical(df3['pos2'], ['GK','DF','MF','FW'])
fig3 = px.box(df3.sort_values("pos2"), x="pos2", y="rating", color = 'pos2', points="all", hover_data=['Player_name'])
fig3.update_layout(xaxis_title="Position")
fig3.update_layout(showlegend=False)
HTML(fig3.to_html(include_plotlyjs='cdn'))
Every position has players ranging from the mid-50s to the 90s in rating
Wage vs position
fig4 = px.box(df3.sort_values("pos2"), x="pos2", y="wage", color = 'pos2', points="all", hover_data=['Player_name'])
fig4.update_layout(xaxis_title="Position")
fig4.update_layout(showlegend=False)
HTML(fig4.to_html(include_plotlyjs='cdn'))
KDB is an outlier in terms of his FIFA21 salary
pd.pivot_table(df3, index = 'pos2', values = 'wage')
Forwards get higher wages. Makes sense: goalscoring ability comes at a premium.
df4 = df3.copy()
df4 = df4[['pos2','height', 'weight','c_value', 'wage', 'rating', 'Crossing',
'Finishing', 'Heading Accuracy', 'Short Passing', 'Volleys',
'Dribbling', 'Curve', 'FK Accuracy', 'Long Passing', 'Ball Control',
'Acceleration', 'Sprint Speed', 'Agility', 'Reactions', 'Balance',
'Shot Power', 'Jumping', 'Stamina', 'Strength', 'Long Shots',
'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties',
'Composure', 'Defensive Awareness', 'Standing Tackle', 'Sliding Tackle',
'GK Diving', 'GK Handling', 'GK Kicking', 'GK Positioning',
'GK Reflexes']]
df4.shape
import seaborn as sns
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
corr = df4.corr()
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
Some insights:
- Contract value and wages are strongly correlated with attacking attributes => attackers tend to earn higher wages
- Interesting how height and weight correlate with the goalkeeping stats. Perhaps goalkeepers tend to be taller and heavier
- Many of the individual stat attributes are correlated with each other
Perhaps, instead of predicting positions from 30+ different attributes, we can reduce them to the six attributes that FIFA Ultimate Team uses, plus a goalkeeping aggregate. These stats are the following:
- Pace
- Shooting
- Passing
- Dribbling
- Defending
- Physical
- Goalkeeping
Based on: https://www.fifauteam.com/fifa-20-attributes-guide/
Each is a combination of several attributes. For example, shooting is made up of: finishing, long shots, penalties, positioning, shot power, volleys. The real stats are weighted, but for simplicity I will just average them.
Note that keepers usually have their own attribute set, but here I simply define a goalkeeping aggregate as the average of the five GK attributes.
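For comparison, a weighted aggregate would look like the sketch below. The weights here are illustrative placeholders, not EA's actual values:

```python
import pandas as pd

def weighted_stat(df, weights):
    """Weighted average over the given columns; `weights` maps column -> weight (summing to 1)."""
    w = pd.Series(weights)
    return (df[list(weights)] * w).sum(axis=1)

# Hypothetical weights for 'shooting' -- the notebook uses a plain mean instead
shooting_weights = {"Finishing": 0.45, "Long Shots": 0.20, "Shot Power": 0.20,
                    "Positioning": 0.05, "Penalties": 0.05, "Volleys": 0.05}

demo = pd.DataFrame({"Finishing": [90], "Long Shots": [80], "Shot Power": [85],
                     "Positioning": [88], "Penalties": [70], "Volleys": [75]})
print(weighted_stat(demo, shooting_weights))
```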
df4['pace'] = df4.loc[:,['Acceleration','Sprint Speed'] ].mean(axis=1)
df4['shooting'] = df4.loc[:,['Finishing','Long Shots','Penalties','Positioning','Shot Power','Volleys']].mean(axis=1)
df4['passing'] = df4.loc[:,['Crossing','Curve','FK Accuracy','Long Passing','Short Passing','Vision']].mean(axis=1)
df4['dribbling'] = df4.loc[:,['Agility','Balance','Ball Control','Composure','Dribbling','Reactions']].mean(axis=1)
df4['defending'] = df4.loc[:,['Heading Accuracy','Interceptions','Defensive Awareness','Sliding Tackle','Standing Tackle']].mean(axis=1)
df4['physical'] = df4.loc[:,['Aggression','Jumping','Stamina','Strength']].mean(axis=1)
df4['goalkeeping'] = df4.loc[:,['GK Diving', 'GK Handling', 'GK Kicking', 'GK Positioning','GK Reflexes']].mean(axis=1)
And again, remove the unnecessary columns
df5 = df4[['pos2', 'height', 'weight', 'c_value', 'wage', 'rating',
'pace', 'shooting', 'passing', 'dribbling', 'defending','physical', 'goalkeeping']]
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
corr = df5.corr()
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
corr['rating'].sort_values(ascending=False)
fig5 = px.scatter(df5, x="shooting", y="passing", color="pos2", width = 600, height = 600)
HTML(fig5.to_html(include_plotlyjs='cdn'))
This looks promising for predicting positions
Remove NaNs in wages
There are 7 players with no wage info; they will be removed
df5 = df5[df5['wage'].notna()]
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
df5['pos2'] = encoder.fit_transform(df5['pos2'])
positions2 = encoder.inverse_transform([0,1,2,3])
positions2
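Note that `LabelEncoder` assigns integer codes alphabetically, so the numeric order does not match the pitch order. A self-contained check:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit(["GK", "DF", "MF", "FW"])

# classes_ is sorted alphabetically: DF -> 0, FW -> 1, GK -> 2, MF -> 3
print(dict(zip(enc.classes_, enc.transform(enc.classes_))))
```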
Split into training and test sets using stratified sampling
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(df5, df5.loc[:, 'pos2']):
    strat_train_set = df5.iloc[train_index]
    strat_test_set = df5.iloc[test_index]
strat_train_set['pos2'].value_counts()/len(strat_train_set)
strat_test_set['pos2'].value_counts()/len(strat_test_set)
strat_train_set.shape, strat_test_set.shape
Split into X_train, y_train, X_test, y_test
X_train = strat_train_set.copy().drop('pos2', axis=1)
y_train = strat_train_set['pos2']
X_test = strat_test_set.copy().drop('pos2', axis=1)
y_test = strat_test_set['pos2']
Numerical pipeline to standardise all features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer
num_pipeline = Pipeline([('std_scaler',StandardScaler())])
X_train_tr = num_pipeline.fit_transform(X_train)
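One detail worth remembering: the scaler learns each feature's mean and standard deviation during `fit`, and those same training statistics must be reused (via `transform`, not a fresh `fit_transform`) on any later data such as the test set. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [3.0], [5.0]])  # mean 3.0
new_data = np.array([[3.0]])

scaler = StandardScaler().fit(train)  # statistics come from the training data only
scaled = scaler.transform(new_data)   # (3.0 - 3.0) / std -> 0.0
print(scaler.mean_, scaled)
```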
from sklearn.linear_model import LogisticRegressionCV
#scores
from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,roc_auc_score,auc,f1_score
from sklearn.model_selection import cross_val_score,learning_curve,GridSearchCV,validation_curve
LR = LogisticRegressionCV(cv=3,random_state=20, solver='liblinear', max_iter=1000)
clf_lr = LR.fit(X_train_tr,y_train)
y_pred_train_lr = clf_lr.predict(X_train_tr)
positions2
fig = plt.figure(figsize=(10,7))
sns.set(font_scale=1.4)
cf = confusion_matrix(y_train, y_pred_train_lr)
df_cm_lr = pd.DataFrame(cf,index=positions2, columns=positions2)
heatmap = sns.heatmap(df_cm_lr, annot=True, fmt="d", annot_kws={"size": 16})
plt.title('Logistic Regression')
plt.ylabel('True label')
plt.xlabel('Predicted label')
print(" Accuracy: ",accuracy_score(y_train, y_pred_train_lr))
print(" F1 score: ",f1_score(y_train, y_pred_train_lr,average='weighted'))
Wow, this is much better than the k-means model
plt.figure()
plt.title('Learning curve Logistic regression')
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(clf_lr, X_train_tr, y_train)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier()
param_grid = {'n_neighbors':np.arange(1,15)}
KNN = GridSearchCV(knn_model, param_grid, cv=3)
best_clf_knn = KNN.fit(X_train_tr,y_train)
def clf_performance(classifier, model_name):
    print(model_name)
    print('Best_score: ' + str(classifier.best_score_))
    print('Best_parameters: ' + str(classifier.best_params_))
clf_performance(best_clf_knn,'K Nearest Neighbors')
y_pred_train_knn = best_clf_knn.predict(X_train_tr)
fig = plt.figure(figsize=(10,7))
sns.set(font_scale=1.4)
cf = confusion_matrix(y_train, y_pred_train_knn)
df_cm_knn = pd.DataFrame(cf,index=positions2, columns=positions2)
heatmap = sns.heatmap(df_cm_knn, annot=True, fmt="d", annot_kws={"size": 16})
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('K nearest neighbour')
print(" Accuracy: ",accuracy_score(y_train, y_pred_train_knn))
print(" F1 score: ",f1_score(y_train, y_pred_train_knn,average='weighted'))
plt.figure()
plt.title('Learning curve K nearest neighbour')
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(best_clf_knn, X_train_tr, y_train)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
from sklearn.ensemble import RandomForestClassifier
gridsearch_forest = RandomForestClassifier()
params = {
    "n_estimators": [1, 10, 100],
    "max_depth": [5, 8, 15],  # earlier grids: [2,3,5] -> 0.85, [5,8,10] -> 0.88, [5,8,15] -> 0.89 CV accuracy
    "min_samples_leaf": [1, 2, 4]}
RF = GridSearchCV(gridsearch_forest, param_grid=params, cv=3 )
best_clf_rf = RF.fit(X_train_tr,y_train)
clf_performance(best_clf_rf,'Random Forest')
y_pred_train_rf = best_clf_rf.predict(X_train_tr)
fig = plt.figure(figsize=(10,7))
sns.set(font_scale=1.4)
cf = confusion_matrix(y_train, y_pred_train_rf)
df_cm_rf = pd.DataFrame(cf,index=positions2, columns=positions2)
heatmap = sns.heatmap(df_cm_rf, annot=True, fmt="d", annot_kws={"size": 16})
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Random forest')
print(" Accuracy: ",accuracy_score(y_train, y_pred_train_rf))
print(" F1 score: ",f1_score(y_train, y_pred_train_rf,average='weighted'))
plt.figure()
plt.title('Learning curve Random Forest')
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(best_clf_rf, X_train_tr, y_train)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
from sklearn.svm import SVC
SVM = SVC(kernel='linear', C=1)
clf_svm = SVM.fit(X_train_tr,y_train)
y_pred_train_svm = clf_svm.predict(X_train_tr)
fig = plt.figure(figsize=(10,7))
sns.set(font_scale=1.4)
cf = confusion_matrix(y_train, y_pred_train_svm)
df_cm_svm = pd.DataFrame(cf,index=positions2, columns=positions2)
heatmap = sns.heatmap(df_cm_svm, annot=True, fmt="d", annot_kws={"size": 16})
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Support Vector Machine')
print(" Accuracy: ",accuracy_score(y_train, y_pred_train_svm))
print(" F1 score: ",f1_score(y_train, y_pred_train_svm,average='weighted'))
plt.figure()
plt.title('Learning curve SVM')
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(clf_svm, X_train_tr, y_train)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
data_matrix = [["Model","accuracy","F1-score"],
['Logistic regression',accuracy_score(y_train, y_pred_train_lr).round(3),f1_score(y_train, y_pred_train_lr,average='weighted').round(3)],
['K nearest neighbour',accuracy_score(y_train, y_pred_train_knn).round(3),f1_score(y_train, y_pred_train_knn,average='weighted').round(3)],
['Random forest',accuracy_score(y_train, y_pred_train_rf).round(3),f1_score(y_train, y_pred_train_rf,average='weighted').round(3)],
['Support vector machine',accuracy_score(y_train, y_pred_train_svm).round(3),f1_score(y_train, y_pred_train_svm,average='weighted').round(3)]
]
data_matrix
It seems random forest is the best model on the training set
fig,axn = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(12,12))
fig.suptitle('Confusion matrix for training dataset')
ax = plt.subplot(2, 2, 1)
sns.heatmap(df_cm_lr, annot=True, fmt="d", annot_kws={"size": 16}, ax = ax, cbar=False)
ax.set_title('Logistic regression')
ax.set_aspect('equal')
plt.ylabel('True label')
ax = plt.subplot(2, 2, 2)
sns.heatmap(df_cm_knn, annot=True, fmt="d", annot_kws={"size": 16}, ax = ax, cbar=False)
ax.set_title('K nearest neighbour')
ax.set_aspect('equal')
ax = plt.subplot(2, 2, 3)
sns.heatmap(df_cm_rf, annot=True, fmt="d", annot_kws={"size": 16}, ax = ax, cbar=False)
ax.set_title('Random Forest')
ax.set_aspect('equal')
plt.ylabel('True label')
plt.xlabel('Predicted label')
ax = plt.subplot(2, 2, 4)
sns.heatmap(df_cm_svm, annot=True, fmt="d", annot_kws={"size": 16}, ax = ax, cbar=False)
ax.set_title('Support Vector Machine')
ax.set_aspect('equal')
plt.xlabel('Predicted label')
plt.show()
Insights:
- All models predict goalkeepers correctly. That is expected
- Midfielders are harder to predict. Not surprising, since many midfielders share skill profiles with defenders or forwards
Prepare X_test
X_test_tr = num_pipeline.transform(X_test)  # transform only: reuse the scaling statistics fitted on the training set
Make prediction
y_pred_test_lr = clf_lr.predict(X_test_tr)
y_pred_test_knn = best_clf_knn.predict(X_test_tr)
y_pred_test_rf = best_clf_rf.predict(X_test_tr)
y_pred_test_svm = clf_svm.predict(X_test_tr)
data_matrix = [["Model","accuracy","F1-score"],
['Logistic regression',accuracy_score(y_test, y_pred_test_lr).round(3),f1_score(y_test, y_pred_test_lr,average='weighted').round(3)],
['K nearest neighbour',accuracy_score(y_test, y_pred_test_knn).round(3),f1_score(y_test, y_pred_test_knn,average='weighted').round(3)],
['Random forest',accuracy_score(y_test, y_pred_test_rf).round(3),f1_score(y_test, y_pred_test_rf,average='weighted').round(3)],
['Support vector machine',accuracy_score(y_test, y_pred_test_svm).round(3),f1_score(y_test, y_pred_test_svm,average='weighted').round(3)]
]
data_matrix
For the test set, SVM is the best model (it was RF on the training set)
cf = confusion_matrix(y_test, y_pred_test_lr)
df_cm_lr_t = pd.DataFrame(cf,index=positions2, columns=positions2)
cf = confusion_matrix(y_test, y_pred_test_knn)
df_cm_knn_t = pd.DataFrame(cf,index=positions2, columns=positions2)
cf = confusion_matrix(y_test, y_pred_test_rf)
df_cm_rf_t = pd.DataFrame(cf,index=positions2, columns=positions2)
cf = confusion_matrix(y_test, y_pred_test_svm)
df_cm_svm_t = pd.DataFrame(cf,index=positions2, columns=positions2)
fig,axn = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(12,12))
fig.suptitle('Confusion matrix for test dataset')
ax = plt.subplot(2, 2, 1)
sns.heatmap(df_cm_lr_t, annot=True, fmt="d", annot_kws={"size": 16}, ax = ax, cbar=False)
ax.set_title('Logistic regression')
ax.set_aspect('equal')
plt.ylabel('True label')
ax = plt.subplot(2, 2, 2)
sns.heatmap(df_cm_knn_t, annot=True, fmt="d", annot_kws={"size": 16}, ax = ax, cbar=False)
ax.set_title('K nearest neighbour')
ax.set_aspect('equal')
ax = plt.subplot(2, 2, 3)
sns.heatmap(df_cm_rf_t, annot=True, fmt="d", annot_kws={"size": 16}, ax = ax, cbar=False)
ax.set_title('Random Forest')
ax.set_aspect('equal')
plt.ylabel('True label')
plt.xlabel('Predicted label')
ax = plt.subplot(2, 2, 4)
sns.heatmap(df_cm_svm_t, annot=True, fmt="d", annot_kws={"size": 16}, ax = ax, cbar=False)
ax.set_title('Support Vector Machine')
ax.set_aspect('equal')
plt.xlabel('Predicted label')
plt.show()