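Unbelievable! Predicting the 2022 World Cup with Machine Learning: Accurate in the Group Stage, but Wrong on the Champion, Runner-up, and Third Place ⛵ - ShowMeAI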

 1 year ago
source link: https://www.cnblogs.com/showmeai/p/16994743.html
6815bac640f07e853d2c532965737541.png

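💡 Author: 韩信子@ShowMeAI
📘 Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
📘 Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
📘 Article URL: https://www.showmeai.tech/article-detail/400
📢 Notice: All rights reserved. To republish, contact the platform and the author and credit the source.
📢 Bookmark ShowMeAI for more great content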

💡 Post-Tournament Note from the Author

2936216e831960d2a9767a4e281225d1.png

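The FIFA 2022 World Cup is over, and the debate over which team would lift the trophy has a definite answer. Congratulations to Messi! Congratulations to Argentina! Before the tournament, ShowMeAI used data science and machine learning to build a model on historical data and predict the results of the FIFA 2022 World Cup matches. Now that the dust has settled, let's see how far the machine learning predictions ended up from the actual results.

ShowMeAI visualized the model's predictions and compared them with the results summary published on the official site, shown below.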

86a610c42e5db162913911e696476e34.png

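As you can see, from the group stage through the quarter-finals the model's predictions were fairly accurate. From the semi-finals onward, however, accuracy collapsed: both the predicted participants and the predicted winners dropped to zero correct, and none of the champion, runner-up, or third-place predictions were right.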

8514c504380cbf35f019d2fba6902f6a.png

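But this is exactly the charm of football. The uncertainty inherent in competitive sport is what lets us feel the meaning of struggle, courage, heroes, and dreams so deeply. (What follows is the complete modeling process, written before the tournament.)

💡 Data Sources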

3051b45aa0f709b5cd787bb01f51fd28.png

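We start by preparing data for modeling: we need data that reflects how each team has performed. This project uses two FIFA-related datasets: 🏆 historical international match results from 1872 to 2022 and 🏆 FIFA ranking data. Both can be downloaded directly from Kaggle or from ShowMeAI's Baidu Netdisk.

🏆 Dataset download (Baidu Netdisk): reply 『实战』 to the WeChat official account 『ShowMeAI研究中心』, or click here, to get the dataset for this article: [35] 基于机器学习的2022世界杯预测实战 (FIFA 2022 dataset)

ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub

💡 Building the Dataset

Which features affect the outcome of a football match? This open question spans many dimensions, from the players selected to the temperature on the pitch that day. We keep things simple and build the dataset only from the past statistics of each team involved in a match, prioritizing quantifiable statistics that are easy to collect, such as goals scored, average ranking, and points won. All of these can be derived from the two datasets mentioned above.

In addition, we only analyze matches played after the 2018 World Cup (the code filters for dates from August 1, 2018 onward), so that we focus on how teams changed during the years of preparation for this World Cup. The code that builds the dataset is as follows: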

import pandas as pd
import re

df = pd.read_csv("results.csv")  # games between national teams
df["date"] = pd.to_datetime(df["date"])
df = df[(df["date"] >= "2018-8-1")].reset_index(drop=True)  # games in the 2022 World Cup cycle
df_wc = df  # pre-World Cup outcomes

rank = pd.read_csv("fifa_ranking-2022-10-06.csv")  # rankings
rank["rank_date"] = pd.to_datetime(rank["rank_date"])
rank = rank[(rank["rank_date"] >= "2018-8-1")].reset_index(drop=True)  # rankings from the 2022 World Cup cycle
# adjust some team names so they match the names used in results.csv
rank["country_full"] = rank["country_full"].str.replace("IR Iran", "Iran").str.replace("Korea Republic", "South Korea").str.replace("USA", "United States")
# resample to daily frequency and forward-fill so that every match date has a ranking row
# (.ffill() is equivalent to the deprecated fillna(method='ffill'))
rank = rank.set_index(['rank_date']).groupby(['country_full'], group_keys=False).resample('D').first().ffill().reset_index()
rank_wc = rank  # dataframe with rankings

# Making the merge: attach home-team rankings, then away-team rankings
df_wc_ranked = df_wc.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)
df_wc_ranked = df_wc_ranked.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "away_team"], right_on=["rank_date", "country_full"], suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)
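
The merge above works because the rankings are first expanded to one row per day. As a hedged alternative sketch (not the author's method), `pandas.merge_asof` can attach the most recent ranking on or before each match date without materializing daily rows. The helper name `attach_latest_rank` and the toy frames below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def attach_latest_rank(matches, rankings, team_col):
    """Attach the latest ranking row on or before each match date for team_col."""
    # merge_asof requires both frames to be sorted by their time keys
    left = matches.sort_values("date")
    right = rankings.sort_values("rank_date")
    return pd.merge_asof(
        left, right,
        left_on="date", right_on="rank_date",
        left_by=team_col, right_by="country_full",
        direction="backward",  # use the most recent ranking, never a future one
    )

# Toy data (hypothetical values, for illustration only)
matches = pd.DataFrame({
    "date": pd.to_datetime(["2022-03-10", "2022-04-05"]),
    "home_team": ["Brazil", "Brazil"],
})
rankings = pd.DataFrame({
    "rank_date": pd.to_datetime(["2022-02-10", "2022-03-31"]),
    "country_full": ["Brazil", "Brazil"],
    "rank": [2, 1],
})

merged = attach_latest_rank(matches, rankings, "home_team")
print(merged[["date", "home_team", "rank"]])
```

The first toy match picks up the February ranking and the second picks up the late-March ranking, which is the same effect the forward-fill achieves.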

The resulting dataset looks like this:

0e06da591e9564e5e27dfe8dbfb2eb36.png

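💡 Feature Engineering

If you are interested in the details of feature engineering, ShowMeAI's in-depth article covers both the theory and hands-on methods:

📘 Machine Learning in Practice | A Complete Guide to Feature Engineering for Machine Learning

With the data ready, we can move on to feature engineering. The goal is to build predictive features from the raw data. We use the following features:

  • Average goals scored over the World Cup cycle and over the last 5 matches.
  • Average goals conceded over the World Cup cycle and over the last 5 matches.
  • Difference in FIFA ranking between the two teams.
  • Average FIFA ranking of the opponents each team faced over the World Cup cycle and over the last 5 matches.
  • Change in FIFA ranking points from the first match of the cycle to now.
  • Change in FIFA ranking points between 5 matches ago and now.
  • Average game points (3 for a win, 1 for a draw, 0 for a loss) over the World Cup cycle and over the last 5 matches.
  • Average game points weighted by the ranking of the opponent faced, over the World Cup cycle and over the last 5 matches.
  • A categorical variable indicating whether the match was a friendly.

We chose these features for the following reasons:

  • The first two features quantify a team's attacking and defensive strength;
  • The difference in FIFA ranking positions quantifies the gap in strength between the two teams as calculated by FIFA;
  • The average ranking faced measures the strength of the opponents a team has played;
  • The change in FIFA ranking points captures how a team's ability has changed over the World Cup cycle and over the last 5 matches;
  • Average game points per match quantify a team's results, while the rank-weighted version weights those results by the ranking position of the opponents faced, for a more accurate picture of performance.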

df = df_wc_ranked

def result_finder(home, away):
    # returns [result, home_team_points, away_team_points]
    # result: 0 = home win, 1 = away win, 2 = draw
    if home > away:
        return pd.Series([0, 3, 0])
    if home < away:
        return pd.Series([1, 0, 3])
    else:
        return pd.Series([2, 1, 1])

results = df.apply(lambda x: result_finder(x["home_score"], x["away_score"]), axis=1)
df[["result", "home_team_points", "away_team_points"]] = results

df["rank_dif"] = df["rank_home"] - df["rank_away"]
df["sg"] = df["home_score"] - df["away_score"]
df["points_home_by_rank"] = df["home_team_points"]/df["rank_away"]
df["points_away_by_rank"] = df["away_team_points"]/df["rank_home"]

home_team = df[["date", "home_team", "home_score", "away_score", "rank_home", "rank_away","rank_change_home", "total_points_home", "result", "rank_dif", "points_home_by_rank", "home_team_points"]]
away_team = df[["date", "away_team", "away_score", "home_score", "rank_away", "rank_home","rank_change_away", "total_points_away", "result", "rank_dif", "points_away_by_rank", "away_team_points"]]

# rename columns from each team's own point of view; "suf" marks the opponent's values
home_team.columns = [h.replace("home_", "").replace("_home", "").replace("away_", "suf_").replace("_away", "_suf") for h in home_team.columns]
away_team.columns = [a.replace("away_", "").replace("_away", "").replace("home_", "suf_").replace("_home", "_suf") for a in away_team.columns]

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
team_stats = pd.concat([home_team, away_team])

team_stats_raw = team_stats.copy()
stats_val = []

for index, row in team_stats.iterrows():
    team = row["team"]
    date = row["date"]
    # all earlier games of this team, most recent first
    past_games = team_stats.loc[(team_stats["team"] == team) & (team_stats["date"] < date)].sort_values(by=['date'], ascending=False)
    last5 = past_games.head(5)

    goals = past_games["score"].mean()
    goals_l5 = last5["score"].mean()

    goals_suf = past_games["suf_score"].mean()
    goals_suf_l5 = last5["suf_score"].mean()

    rank = past_games["rank_suf"].mean()
    rank_l5 = last5["rank_suf"].mean()

    if len(last5) > 0:
        points = past_games["total_points"].values[0] - past_games["total_points"].values[-1]  # ranking points gained
        points_l5 = last5["total_points"].values[0] - last5["total_points"].values[-1]
    else:
        points = 0
        points_l5 = 0

    gp = past_games["team_points"].mean()
    gp_l5 = last5["team_points"].mean()

    gp_rank = past_games["points_by_rank"].mean()
    gp_rank_l5 = last5["points_by_rank"].mean()

    stats_val.append([goals, goals_l5, goals_suf, goals_suf_l5, rank, rank_l5, points, points_l5, gp, gp_l5, gp_rank, gp_rank_l5])

stats_cols = ["goals_mean", "goals_mean_l5", "goals_suf_mean", "goals_suf_mean_l5", "rank_mean", "rank_mean_l5", "points_mean", "points_mean_l5", "game_points_mean", "game_points_mean_l5", "game_points_rank_mean", "game_points_rank_mean_l5"]

stats_df = pd.DataFrame(stats_val, columns=stats_cols)

full_df = pd.concat([team_stats.reset_index(drop=True), stats_df], axis=1, ignore_index=False)

# the first half of the rows are home-team records, the second half away-team records
home_team_stats = full_df.iloc[:int(full_df.shape[0]/2),:]
away_team_stats = full_df.iloc[int(full_df.shape[0]/2):,:]

home_team_stats = home_team_stats[home_team_stats.columns[-12:]]
away_team_stats = away_team_stats[away_team_stats.columns[-12:]]

home_team_stats.columns = ['home_'+str(col) for col in home_team_stats.columns]
away_team_stats.columns = ['away_'+str(col) for col in away_team_stats.columns]

match_stats = pd.concat([home_team_stats, away_team_stats.reset_index(drop=True)], axis=1, ignore_index=False)

full_df = pd.concat([df, match_stats.reset_index(drop=True)], axis=1, ignore_index=False)

def find_friendly(x):
    if x == "Friendly":
        return 1
    else:
        return 0

full_df["is_friendly"] = full_df["tournament"].apply(lambda x: find_friendly(x))
full_df = pd.get_dummies(full_df, columns=["is_friendly"])

base_df = full_df[["date", "home_team", "away_team", "rank_home", "rank_away","home_score", "away_score","result", "rank_dif", "rank_change_home", "rank_change_away", 'home_goals_mean', 'home_goals_mean_l5', 'home_goals_suf_mean', 'home_goals_suf_mean_l5', 'home_rank_mean', 'home_rank_mean_l5', 'home_points_mean', 'home_points_mean_l5', 'away_goals_mean', 'away_goals_mean_l5', 'away_goals_suf_mean', 'away_goals_suf_mean_l5', 'away_rank_mean', 'away_rank_mean_l5', 'away_points_mean', 'away_points_mean_l5','home_game_points_mean', 'home_game_points_mean_l5', 'home_game_points_rank_mean', 'home_game_points_rank_mean_l5','away_game_points_mean', 'away_game_points_mean_l5', 'away_game_points_rank_mean', 'away_game_points_rank_mean_l5', 'is_friendly_0', 'is_friendly_1']]

base_df.tail()
dd87cf46ff3e447495a9d0d02112001a.png
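
The loop above recomputes each team's history for every row. As a rough sketch of the same idea in vectorized form, the "all prior games" and "last 5 prior games" averages can be expressed with `groupby`, `shift`, `expanding`, and `rolling`. This is an illustration, not the author's code: the helper name `add_prior_means` and the toy data are assumptions, and it presumes each team plays at most one game per date.

```python
import pandas as pd

def add_prior_means(team_stats, value_col, window=5):
    """Add the mean of value_col over all prior games and over the prior `window` games."""
    out = team_stats.sort_values(["team", "date"]).copy()
    grouped = out.groupby("team")[value_col]
    # shift() excludes the current game, so only earlier games enter the averages
    out[value_col + "_mean"] = grouped.transform(lambda s: s.shift().expanding().mean())
    out[value_col + "_mean_l5"] = grouped.transform(
        lambda s: s.shift().rolling(window, min_periods=1).mean()
    )
    return out

# Toy data (hypothetical): one team, seven games on consecutive days
toy = pd.DataFrame({
    "team": ["A"] * 7,
    "date": pd.date_range("2022-01-01", periods=7, freq="D"),
    "score": [1, 2, 3, 0, 4, 2, 6],
})

result = add_prior_means(toy, "score")
print(result)
```

A team's first game has no history, so its averages are NaN, which matches the rows later removed with `dropna()`.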

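💡 Data Analysis

Before modeling, we analyze the data a little. A match has three possible outcomes: win, draw, or loss. Modeling this as a 3-class classification problem runs into severe class imbalance and makes evaluation more awkward, so we merge the outcomes into two classes: "home team wins" and "home team draws or loses".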
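
To make the encoding concrete, here is a small sketch of the merge from three classes to two. It follows the result codes used by this article's code (0 = home win, 1 = away win, 2 = draw); the sample values are made up for illustration.

```python
import pandas as pd

# Result codes as in the article's code: 0 = home win, 1 = away win, 2 = draw
results = pd.Series([0, 1, 2, 0, 2, 0, 1, 0])

# Merge draws (2) into class 1, so 1 means "home team did not win"
target = results.replace({2: 1})

print(target.tolist())
# Share of each class, useful for checking class balance after the merge
print(target.value_counts(normalize=True).sort_index())
```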

030b7d6996fa254d665fd5da2416d8a9.png

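For detailed tutorials on data analysis and visualization, see ShowMeAI's data analysis tutorial series and articles.

We analyze the distribution of each feature by outcome (home win vs. draw/loss), using violin plots.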

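import matplotlib.pyplot as plt
import seaborn as sns  # needed for the plots below; missing from the original listing

base_df_no_fg = base_df.dropna()
df = base_df_no_fg.copy()  # copy to avoid pandas' SettingWithCopyWarning

def no_draw(x):
    # merge draws (2) into class 1: 0 = home win, 1 = draw or away win
    if x == 2:
        return 1
    else:
        return x

df["target"] = df["result"].apply(lambda x: no_draw(x))

data1 = df[list(df.columns[8:20].values) + ["target"]]
# z-score each feature (note: the original slices off the last row with [:-1])
scaled = (data1[:-1] - data1[:-1].mean()) / data1[:-1].std()
scaled["target"] = data1["target"]
violin1 = pd.melt(scaled, id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin1, split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()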
04dbfaf7c629fc8c36fbbf3babc309c6.png
data2 = df[df.columns[20:]]
scaled = (data2[:-1] - data2[:-1].mean()) / data2[:-1].std()
scaled["target"] = data2["target"]
violin2 = pd.melt(scaled, id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin2, split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
df3e0d59e930bb3258e20c52103d8f12.png
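
The plotting code above does two things before drawing: it standardizes each feature so they share one axis, and it reshapes the table to long format so seaborn can group by feature. A minimal sketch of just that preparation step, on made-up numbers, might look like this. The helper name `to_long_zscores` is an assumption, and unlike the original it standardizes all rows rather than all but the last.

```python
import pandas as pd

def to_long_zscores(features, target):
    """Z-score each column, then reshape to long format (one row per feature value)."""
    scaled = (features - features.mean()) / features.std()
    scaled["target"] = target
    return pd.melt(scaled, id_vars="target", var_name="features", value_name="value")

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "rank_dif": [-30.0, 12.0, 5.0, -8.0],
    "goals_dif": [1.2, -0.4, 0.1, 0.9],
})
toy_target = pd.Series([0, 1, 1, 0])

long_df = to_long_zscores(toy, toy_target)
print(long_df)
```

The resulting frame has the `target`, `features`, and `value` columns that `sns.violinplot` is given in the article's code.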

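For the first set of features, only rank_dif (the difference between the two teams' rankings) has a visible effect on the target classes. Difference features therefore look like strong signals, so we create more of them:

  • Difference in goals scored.
  • Difference in goals conceded.
  • Difference between one team's goals scored and its opponent's goals conceded.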

dif = df.copy()
dif.loc[:, "goals_dif"] = dif["home_goals_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_mean_l5"]
dif.loc[:, "goals_suf_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_suf_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_made_suf_dif"] = dif["home_goals_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_made_suf_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_suf_made_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_suf_made_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_mean_l5"]
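
Most of these features follow one pattern: a home-team statistic minus the matching away-team statistic. A small helper can express that pattern once. This is only a sketch; the function name `add_home_away_diff` and the toy values are assumptions for illustration.

```python
import pandas as pd

def add_home_away_diff(df, stat, new_name):
    """Return a copy of df with new_name = home_<stat> - away_<stat>."""
    out = df.copy()
    out[new_name] = out["home_" + stat] - out["away_" + stat]
    return out

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "home_goals_mean": [1.8, 0.9],
    "away_goals_mean": [1.1, 1.4],
})

toy = add_home_away_diff(toy, "goals_mean", "goals_dif")
print(toy)
```

A positive `goals_dif` means the home team has scored more on average than the away team over the period considered.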

We use violin plots again to analyze the new features.

data_difs = dif.iloc[:, -8:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled, id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="target", data=violin, split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
0d1c996e5f57091bd0a7373b1285f4f5.png

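The goals-scored difference and goals-conceded difference features separate the target classes well. The features that compare one team's goals scored with the other's goals conceded, however, show no effect. So we keep:

  • Ranking difference.
  • Goals-scored difference over the World Cup cycle and over the last 5 matches.
  • Goals-conceded difference over the World Cup cycle and over the last 5 matches.

We can also compute the difference in game points, the difference in the average ranking of opponents faced, and the difference in game points weighted by opponent ranking. And to account for the level of the opponents, we can consider the differences in goals scored and conceded, each divided by the average ranking of the opponents faced.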

dif.loc[:, "dif_points"] = dif["home_game_points_mean"] - dif["away_game_points_mean"]
dif.loc[:, "dif_points_l5"] = dif["home_game_points_mean_l5"] - dif["away_game_points_mean_l5"]
dif.loc[:, "dif_points_rank"] = dif["home_game_points_rank_mean"] - dif["away_game_points_rank_mean"]
dif.loc[:, "dif_points_rank_l5"] = dif["home_game_points_rank_mean_l5"] - dif["away_game_points_rank_mean_l5"]

dif.loc[:, "dif_rank_agst"] = dif["home_rank_mean"] - dif["away_rank_mean"]
dif.loc[:, "dif_rank_agst_l5"] = dif["home_rank_mean_l5"] - dif["away_rank_mean_l5"]

dif.loc[:, "goals_per_ranking_dif"] = (dif["home_goals_mean"] / dif["home_rank_mean"]) - (dif["away_goals_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif"] = (dif["home_goals_suf_mean"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_dif_l5"] = (dif["home_goals_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_mean_l5"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif_l5"] = (dif["home_goals_suf_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean_l5"] / dif["away_rank_mean"])

We analyze these features with violin plots and box plots:

data_difs = dif.iloc[:, -10:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled, id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin, split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
0b5a1ed4b46951a04d789872a0db276c.png
plt.figure(figsize=(15,10))
sns.boxplot(x="features", y="value", hue="target", data=violin)
plt.xticks(rotation=90)
plt.show()
a9cd38a8d26e2f8a6d29a44e10a1a1ed.png

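The game points difference, the goals-per-ranking difference, and the rank-weighted game points difference are good features. However, some features are highly correlated with one another, so we examine their joint distributions with jointplot: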

sns.jointplot(data = data_difs, x = 'dif_rank_agst', y = 'dif_rank_agst_l5', kind="reg")
plt.show()
28205d3ff51e20ca4dca7233f2f85993.png
sns.jointplot(data = data_difs, x = 'goals_per_ranking_dif', y = 'goals_per_ranking_dif_l5', kind="reg")
plt.show()
b938358ed457a79d9a5c497357cca637.png
sns.jointplot(data = data_difs, x = 'dif_points_rank', y = 'dif_points_rank_l5', kind="reg")
plt.show()
f3fad30ad80e4bb4ada41e1ef5586d27.png
sns.jointplot(data = data_difs, x = 'dif_points', y = 'dif_points_l5', kind="reg")
plt.show()
1220f0434e2bba9bd5a84ab013b8fd30.png
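
The joint plots can be complemented by a numeric check. The sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold; the 0.9 cutoff, the helper name `highly_correlated_pairs`, and the synthetic data are all assumptions for illustration, not values from the original analysis.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |corr|) for every column pair at or above threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip self-pairs
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Synthetic data: the second column is a noisy copy of the first
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "dif_points": base,
    "dif_points_l5": base + rng.normal(scale=0.1, size=200),
    "rank_dif": rng.normal(size=200),  # unrelated feature
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

When a pair shows up in this list, keeping only one member of the pair is usually enough, which is the reasoning applied to the feature selection that follows.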

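The correlation analysis shows that, where the full-cycle and last-5 versions of a feature are strongly correlated, one of the two is enough; in those cases we keep the full-cycle version. The features we finally keep are:

  • Difference in team rankings (rank_dif)
  • Difference in average goals scored over the World Cup cycle and over the last 5 matches (goals_dif / goals_dif_l5)
  • Difference in average goals conceded over the World Cup cycle and over the last 5 matches (goals_suf_dif / goals_suf_dif_l5)
  • Difference in the average ranking of opponents faced over the World Cup cycle and over the last 5 matches (dif_rank_agst / dif_rank_agst_l5)
  • Difference in goals scored per ranking faced over the World Cup cycle (goals_per_ranking_dif)
  • Difference in rank-weighted average game points over the World Cup cycle and over the last 5 matches (dif_points_rank / dif_points_rank_l5)
  • A categorical variable indicating whether the match is a friendly (is_friendly)

This gives us the final dataset below, containing every feature the machine learning models need.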

def create_db(df):
    columns = ["home_team", "away_team", "target", "rank_dif", "home_goals_mean", "home_rank_mean", "away_goals_mean", "away_rank_mean", "home_rank_mean_l5", "away_rank_mean_l5", "home_goals_suf_mean", "away_goals_suf_mean", "home_goals_mean_l5", "away_goals_mean_l5", "home_goals_suf_mean_l5", "away_goals_suf_mean_l5", "home_game_points_rank_mean", "home_game_points_rank_mean_l5", "away_game_points_rank_mean", "away_game_points_rank_mean_l5","is_friendly_0", "is_friendly_1"]

    base = df.loc[:, columns]
    base.loc[:, "goals_dif"] = base["home_goals_mean"] - base["away_goals_mean"]
    base.loc[:, "goals_dif_l5"] = base["home_goals_mean_l5"] - base["away_goals_mean_l5"]
    base.loc[:, "goals_suf_dif"] = base["home_goals_suf_mean"] - base["away_goals_suf_mean"]
    base.loc[:, "goals_suf_dif_l5"] = base["home_goals_suf_mean_l5"] - base["away_goals_suf_mean_l5"]
    base.loc[:, "goals_per_ranking_dif"] = (base["home_goals_mean"] / base["home_rank_mean"]) - (base["away_goals_mean"] / base["away_rank_mean"])
    base.loc[:, "dif_rank_agst"] = base["home_rank_mean"] - base["away_rank_mean"]
    base.loc[:, "dif_rank_agst_l5"] = base["home_rank_mean_l5"] - base["away_rank_mean_l5"]
    base.loc[:, "dif_points_rank"] = base["home_game_points_rank_mean"] - base["away_game_points_rank_mean"]
    base.loc[:, "dif_points_rank_l5"] = base["home_game_points_rank_mean_l5"] - base["away_game_points_rank_mean_l5"]

    model_df = base[["home_team", "away_team", "target", "rank_dif", "goals_dif", "goals_dif_l5", "goals_suf_dif", "goals_suf_dif_l5", "goals_per_ranking_dif", "dif_rank_agst", "dif_rank_agst_l5", "dif_points_rank", "dif_points_rank_l5", "is_friendly_0", "is_friendly_1"]]
    return model_df

model_db = create_db(df)
model_db
68690f6cde74df035c9a2a86c5c1cee3.png

💡 Modeling and Tuning

7deec17f8b23a0854a77e1e799c432fe.png

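For background and hands-on methods for machine learning modeling and tuning, see ShowMeAI's tutorial series and articles:

📘 Machine Learning in Practice: A Hands-on Machine Learning Series

📘 AI Vertical-Domain Tool Library Cheat Sheets | Scikit-Learn Cheat Sheet

Now we can start modeling. We train two models, Random Forest and Gradient Boosting, and compare their performance. For hyperparameter tuning, we use scikit-learn's 📘GridSearchCV to search the parameter grid and pick the best model.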

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# separating the target from the features
X = model_db.iloc[:, 3:]
y = model_db[["target"]]

# dividing the database into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=1)

gb = GradientBoostingClassifier(random_state=5)
params = {"learning_rate": [0.01, 0.1, 0.5],
          "min_samples_split": [5, 10],
          "min_samples_leaf": [3, 5],
          "max_depth":[3,5,10],
          "max_features":["sqrt"],
          "n_estimators":[100, 200]
         }

gb_cv = GridSearchCV(gb, params, cv = 3, n_jobs = -1, verbose = False)
gb_cv.fit(X_train.values, np.ravel(y_train))

# getting the best model
gb = gb_cv.best_estimator_

We tune and optimize the Random Forest in the same way:

params_rf = {"max_depth": [20],
             "min_samples_split": [5, 10],
             "max_leaf_nodes": [175, 200],
             "min_samples_leaf": [5, 10],
             "n_estimators": [250],
             "max_features": ["sqrt"],
            }

rf = RandomForestClassifier(random_state=1)
rf_cv = GridSearchCV(rf, params_rf, cv = 3, n_jobs = -1, verbose = False)
rf_cv.fit(X_train.values, np.ravel(y_train))

rf = rf_cv.best_estimator_

Output:

GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=1), n_jobs=-1, param_grid={'max_depth': [20], 'max_features': ['sqrt'], 'max_leaf_nodes': [175, 200], 'min_samples_leaf': [5, 10], 'min_samples_split': [5, 10], 'n_estimators': [250]}, verbose=False)
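
For readers who want to try the tuning step without the football data, here is a minimal, self-contained sketch of the same `GridSearchCV` workflow on a synthetic dataset. The dataset from `make_classification` and the small grid are assumptions chosen to run quickly; they are not the article's data or grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the match features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# A deliberately small grid: 2 x 2 x 1 = 4 candidates, each scored with 3-fold CV
param_grid = {"learning_rate": [0.01, 0.1], "max_depth": [2, 3], "n_estimators": [50]}
search = GridSearchCV(GradientBoostingClassifier(random_state=5), param_grid, cv=3)
search.fit(X_train, y_train)

best_model = search.best_estimator_  # refit on the full training set by default
print(search.best_params_)
print(round(search.best_score_, 3))                 # mean cross-validated accuracy
print(round(best_model.score(X_test, y_test), 3))   # accuracy on held-out data
```

`best_score_` is measured on the cross-validation folds, so the held-out test score is the fairer estimate of how the selected model generalizes.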

We analyze the models with a confusion matrix and ROC-AUC curves. The results are:

from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

def analyze(model):
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test.values)[:,1])  # test AUC
    plt.figure(figsize=(15,10))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label="test")

    fpr_train, tpr_train, _ = roc_curve(y_train, model.predict_proba(X_train.values)[:,1])  # train AUC
    plt.plot(fpr_train, tpr_train, label="train")
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test.values)[:,1])
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train.values)[:,1])
    plt.legend()
    plt.title('AUC score is %.2f on test and %.2f on training'%(auc_test, auc_train))
    plt.show()

    plt.figure(figsize=(15, 10))
    cm = confusion_matrix(y_test, model.predict(X_test.values))
    sns.heatmap(cm, annot=True, fmt="d")

analyze(gb)
cdaac824dc1955a2380622b06fce8f1c.png
6deba5a3658509530a1b3b0f2951fa6d.png

The same analysis for the Random Forest:

analyze(rf)
f58bac38c5f97d83eeab8a572b2a3a30.png
1ba1ee9f312c751ce440206e70de901c.png

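The Random Forest model performs slightly better but appears to overfit a little. Judging by its AUC-ROC curves, the Gradient Boosting model carries less risk, so we choose it as the final model.

💡 Applying the Model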
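
One simple way to put a number on "a little overfitting" is the gap between training AUC and test AUC. The sketch below is an illustration on synthetic data, not the article's result; the helper name `auc_gap` and the dataset parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_gap(model, X_train, y_train, X_test, y_test):
    """Return (train AUC, test AUC, train minus test) for a fitted classifier."""
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return auc_train, auc_test, auc_train - auc_test

# Synthetic noisy data: flip_y randomly flips labels, which makes memorization easy to spot
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_train, y_train)
auc_train, auc_test, gap = auc_gap(model, X_train, y_train, X_test, y_test)
print("train AUC %.2f, test AUC %.2f, gap %.2f" % (auc_train, auc_test, gap))
```

A large gap suggests the model has memorized the training set; when two models score similarly on the test set, the one with the smaller gap is usually the safer choice.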

792ff386ec69d8c4cd95dc30337b8aff.png

Next we use this model to predict the World Cup results. First, we use 📘 pandas' read_html to fetch the list of teams taking part in the World Cup.

dfs = pd.read_html(r"https://en.wikipedia.org/wiki/2022_FIFA_World_Cup#Teams")

from collections.abc import Iterable

# locate the tables that hold the group standings and the group-stage matches
for i in range(len(dfs)):
    df = dfs[i]
    cols = list(df.columns.values)

    if isinstance(cols[0], Iterable):
        if any("Tie-breaking criteria" in c for c in cols):
            start_pos = i+1

        if any("Match 46" in c for c in cols):
            end_pos = i+1

matches = []
groups = ["A", "B", "C", "D", "E", "F", "G", "H"]
group_count = 0

table = {}
# table: group -> [team, points, win probabilities (used as the tie-breaker)]
table[groups[group_count]] = [[a.split(" ")[0], 0, []] for a in list(dfs[start_pos].iloc[:, 1].values)]

for i in range(start_pos+1, end_pos, 1):
    if len(dfs[i].columns) == 3:
        # a three-column table is a single match: team_1, score, team_2
        team_1 = dfs[i].columns.values[0]
        team_2 = dfs[i].columns.values[-1]

        matches.append((groups[group_count], team_1, team_2))
    else:
        # any other table is the standings table of the next group
        group_count+=1
        table[groups[group_count]] = [[a, 0, []] for a in list(dfs[i].iloc[:, 1].values)]

table
1e3f561f5ef48528968fe0a44b88dd98.png
matches[:10]
a24a2839f60e642cde7ba94df1d17abd.png

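Our model classifies a match as either "home team wins" or "away team wins or draw". So how do we identify draws? Our approach is to predict each match in both orientations:

  • Team A vs. Team B (simulation 1)
  • Team B vs. Team A (simulation 2)

If both predictions favor the same team, that team is declared the winner. If one simulation favors Team A and the other favors Team B, the match is declared a draw. The code below simulates the matches one by one and tallies the points.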
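
The rule just described can be isolated in a small function. This is a sketch under stated assumptions: the name `decide_match` is hypothetical, the inputs stand for the "home team wins" probability from each simulation, and exact 50/50 ties are simply treated as not favoring Team A.

```python
def decide_match(team_a, team_b, p_a_wins_as_home, p_b_wins_as_home):
    """Combine two mirrored binary predictions into a winner or a draw.

    p_a_wins_as_home: P(home wins) from simulation 1, where Team A is the home side
    p_b_wins_as_home: P(home wins) from simulation 2, where Team B is the home side
    """
    # In each simulation, the complement is the "home team does not win" class
    a_prob_g1, b_prob_g1 = p_a_wins_as_home, 1 - p_a_wins_as_home
    b_prob_g2, a_prob_g2 = p_b_wins_as_home, 1 - p_b_wins_as_home

    a_favored_g1 = a_prob_g1 > b_prob_g1
    a_favored_g2 = a_prob_g2 > b_prob_g2
    if a_favored_g1 != a_favored_g2:
        return "draw", None  # the two simulations disagree

    # The simulations agree: report the winner with its averaged probability
    a_prob = (a_prob_g1 + a_prob_g2) / 2
    b_prob = (b_prob_g1 + b_prob_g2) / 2
    if a_prob > b_prob:
        return team_a, a_prob
    return team_b, b_prob

print(decide_match("A", "B", 0.7, 0.2))  # both simulations favor A
print(decide_match("A", "B", 0.7, 0.6))  # simulations disagree
print(decide_match("A", "B", 0.3, 0.8))  # both simulations favor B
```

Averaging the two simulations also reduces the effect of the "home" label, which matters little at a tournament where most teams play on neutral ground.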

from operator import itemgetter  # needed for sorting the group tables; missing from the original listing

def find_stats(team_1):
    # e.g. team_1 = "Qatar"
    past_games = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date")
    last5 = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date").tail(5)

    team_1_rank = past_games["rank"].values[-1]
    team_1_goals = past_games.score.mean()
    team_1_goals_l5 = last5.score.mean()
    team_1_goals_suf = past_games.suf_score.mean()
    team_1_goals_suf_l5 = last5.suf_score.mean()
    team_1_rank_suf = past_games.rank_suf.mean()
    team_1_rank_suf_l5 = last5.rank_suf.mean()
    team_1_gp_rank = past_games.points_by_rank.mean()
    team_1_gp_rank_l5 = last5.points_by_rank.mean()

    return [team_1_rank, team_1_goals, team_1_goals_l5, team_1_goals_suf, team_1_goals_suf_l5, team_1_rank_suf, team_1_rank_suf_l5, team_1_gp_rank, team_1_gp_rank_l5]

def find_features(team_1, team_2):
    rank_dif = team_1[0] - team_2[0]
    goals_dif = team_1[1] - team_2[1]
    goals_dif_l5 = team_1[2] - team_2[2]
    goals_suf_dif = team_1[3] - team_2[3]
    goals_suf_dif_l5 = team_1[4] - team_2[4]
    goals_per_ranking_dif = (team_1[1]/team_1[5]) - (team_2[1]/team_2[5])
    dif_rank_agst = team_1[5] - team_2[5]
    dif_rank_agst_l5 = team_1[6] - team_2[6]
    dif_gp_rank = team_1[7] - team_2[7]
    dif_gp_rank_l5 = team_1[8] - team_2[8]

    # the trailing 1, 0 are is_friendly_0 and is_friendly_1: World Cup matches are not friendlies
    return [rank_dif, goals_dif, goals_dif_l5, goals_suf_dif, goals_suf_dif_l5, goals_per_ranking_dif, dif_rank_agst, dif_rank_agst_l5, dif_gp_rank, dif_gp_rank_l5, 1, 0]

advanced_group = []
last_group = ""

# reset points and probabilities in every group table
for k in table.keys():
    for t in table[k]:
        t[1] = 0
        t[2] = []

for teams in matches:
    draw = False
    team_1 = find_stats(teams[1])
    team_2 = find_stats(teams[2])

    features_g1 = find_features(team_1, team_2)  # simulation 1: team 1 as home
    features_g2 = find_features(team_2, team_1)  # simulation 2: team 2 as home

    probs_g1 = gb.predict_proba([features_g1])
    probs_g2 = gb.predict_proba([features_g2])

    team_1_prob_g1 = probs_g1[0][0]
    team_1_prob_g2 = probs_g2[0][1]
    team_2_prob_g1 = probs_g1[0][1]
    team_2_prob_g2 = probs_g2[0][0]

    team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
    team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2

    # the two simulations disagree: draw, one point each
    if ((team_1_prob_g1 > team_2_prob_g1) & (team_2_prob_g2 > team_1_prob_g2)) | ((team_1_prob_g1 < team_2_prob_g1) & (team_2_prob_g2 < team_1_prob_g2)):
        draw=True
        for i in table[teams[0]]:
            if i[0] == teams[1] or i[0] == teams[2]:
                i[1] += 1

    elif team_1_prob > team_2_prob:
        winner = teams[1]
        winner_proba = team_1_prob
        for i in table[teams[0]]:
            if i[0] == teams[1]:
                i[1] += 3

    elif team_2_prob > team_1_prob:
        winner = teams[2]
        winner_proba = team_2_prob
        for i in table[teams[0]]:
            if i[0] == teams[2]:
                i[1] += 3

    for i in table[teams[0]]:  # store per-game probabilities as the tie-breaker
        if i[0] == teams[1]:
            i[2].append(team_1_prob)
        if i[0] == teams[2]:
            i[2].append(team_2_prob)

    if last_group != teams[0]:
        if last_group != "":
            print("\n")
            print("Group %s advanced: "%(last_group))

            for i in table[last_group]:  # tie-breaker: mean win probability
                i[2] = np.mean(i[2])

            final_points = table[last_group]
            final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
            advanced_group.append([final_table[0][0], final_table[1][0]])
            for i in final_table:
                print("%s -------- %d"%(i[0], i[1]))
        print("\n")
        print("-"*10+" Starting Analysis for Group %s "%(teams[0])+"-"*10)

    if draw == False:
        print("Group %s - %s vs. %s: Winner %s with %.2f probability"%(teams[0], teams[1], teams[2], winner, winner_proba))
    else:
        print("Group %s - %s vs. %s: Draw"%(teams[0], teams[1], teams[2]))
    last_group = teams[0]

# close out the last group (H), which the loop above never finalizes
print("\n")
print("Group %s advanced: "%(last_group))

for i in table[last_group]:  # tie-breaker: mean win probability
    i[2] = np.mean(i[2])

final_points = table[last_group]
final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
advanced_group.append([final_table[0][0], final_table[1][0]])
for i in final_table:
    print("%s -------- %d"%(i[0], i[1]))

---------- Starting Analysis for Group A ----------
Group A - Qatar vs. Ecuador: Winner Ecuador with 0.62 probability
Group A - Senegal vs. Netherlands: Winner Netherlands with 0.62 probability
Group A - Qatar vs. Senegal: Winner Senegal with 0.60 probability
Group A - Netherlands vs. Ecuador: Winner Netherlands with 0.73 probability
Group A - Ecuador vs. Senegal: Draw
Group A - Netherlands vs. Qatar: Winner Netherlands with 0.78 probability

Group A advanced:
Netherlands -------- 9
Senegal -------- 4
Ecuador -------- 4
Qatar -------- 0

---------- Starting Analysis for Group B ----------
Group B - England vs. Iran: Winner England with 0.62 probability
Group B - United States vs. Wales: Draw
Group B - Wales vs. Iran: Draw
Group B - England vs. United States: Winner England with 0.61 probability
Group B - Wales vs. England: Winner England with 0.64 probability
Group B - Iran vs. United States: Winner United States with 0.58 probability

Group B advanced:
England -------- 9
United States -------- 4
Wales -------- 2
Iran -------- 1

---------- Starting Analysis for Group C ----------
Group C - Argentina vs. Saudi Arabia: Winner Argentina with 0.79 probability
Group C - Mexico vs. Poland: Draw
Group C - Poland vs. Saudi Arabia: Winner Poland with 0.70 probability
Group C - Argentina vs. Mexico: Winner Argentina with 0.67 probability
Group C - Poland vs. Argentina: Winner Argentina with 0.71 probability
Group C - Saudi Arabia vs. Mexico: Winner Mexico with 0.71 probability

Group C advanced:
Argentina -------- 9
Poland -------- 4
Mexico -------- 4
Saudi Arabia -------- 0

---------- Starting Analysis for Group D ----------
Group D - Denmark vs. Tunisia: Winner Denmark with 0.68 probability
Group D - France vs. Australia: Winner France with 0.71 probability
Group D - Tunisia vs. Australia: Draw
Group D - France vs. Denmark: Draw
Group D - Australia vs. Denmark: Winner Denmark with 0.71 probability
Group D - Tunisia vs. France: Winner France with 0.69 probability

Group D advanced:
France -------- 7
Denmark -------- 7
Tunisia -------- 1
Australia -------- 1

---------- Starting Analysis for Group E ----------
Group E - Germany vs. Japan: Winner Germany with 0.62 probability
Group E - Spain vs. Costa Rica: Winner Spain with 0.76 probability
Group E - Japan vs. Costa Rica: Winner Japan with 0.63 probability
Group E - Spain vs. Germany: Draw
Group E - Japan vs. Spain: Winner Spain with 0.67 probability
Group E - Costa Rica vs. Germany: Winner Germany with 0.65 probability

Group E advanced:
Spain -------- 7
Germany -------- 7
Japan -------- 3
Costa Rica -------- 0

---------- Starting Analysis for Group F ----------
Group F - Morocco vs. Croatia: Winner Croatia with 0.58 probability
Group F - Belgium vs. Canada: Winner Belgium with 0.75 probability
Group F - Belgium vs. Morocco: Winner Belgium with 0.67 probability
Group F - Croatia vs. Canada: Winner Croatia with 0.64 probability
Group F - Croatia vs. Belgium: Winner Belgium with 0.64 probability
Group F - Canada vs. Morocco: Draw

Group F advanced:
Belgium -------- 9
Croatia -------- 6
Morocco -------- 1
Canada -------- 1

---------- Starting Analysis for Group G ----------
Group G - Switzerland vs. Cameroon: Winner Switzerland with 0.69 probability
Group G - Brazil vs. Serbia: Winner Brazil with 0.72 probability
Group G - Cameroon vs. Serbia: Winner Serbia with 0.66 probability
Group G - Brazil vs. Switzerland: Draw
Group G - Serbia vs. Switzerland: Winner Switzerland with 0.57 probability
Group G - Cameroon vs. Brazil: Winner Brazil with 0.81 probability

Group G advanced:
Brazil -------- 7
Switzerland -------- 7
Serbia -------- 3
Cameroon -------- 0

---------- Starting Analysis for Group H ----------
Group H - Uruguay vs. South Korea: Winner Uruguay with 0.62 probability
Group H - Portugal vs. Ghana: Winner Portugal with 0.81 probability
Group H - South Korea vs. Ghana: Winner South Korea with 0.76 probability
Group H - Portugal vs. Uruguay: Winner Portugal with 0.60 probability
Group H - Ghana vs. Uruguay: Winner Uruguay with 0.77 probability
Group H - South Korea vs. Portugal: Winner Portugal with 0.67 probability

Group H advanced:
Portugal -------- 9
Uruguay -------- 6
South Korea -------- 3
Ghana -------- 0
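
Several groups above end with two teams level on points (for example Senegal and Ecuador on 4). The simulation breaks such ties by each team's mean win probability, using a two-key sort. The sketch below isolates that step; the points are taken from the Group A output above, while the probability values are made-up placeholders.

```python
from operator import itemgetter

# Each row is [team, points, mean win probability].
# Points come from the Group A output; the probabilities are hypothetical.
group_a = [
    ["Qatar", 0, 0.31],
    ["Ecuador", 4, 0.45],
    ["Senegal", 4, 0.48],
    ["Netherlands", 9, 0.71],
]

# Sort by points first, then by mean win probability, both descending
final_table = sorted(group_a, key=itemgetter(1, 2), reverse=True)
advancing = [final_table[0][0], final_table[1][0]]

for team, points, prob in final_table:
    print("%s -------- %d (%.2f)" % (team, points, prob))
print(advancing)
```

With these placeholder probabilities, Senegal edges out Ecuador for second place, which is consistent with the order shown in the output above.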

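Some of the model's results are interesting, such as the draws between Brazil and Switzerland and between Denmark and France.

In the knockout stage, the idea is the same: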

advanced = advanced_group

playoffs = {"Round of 16": [], "Quarter-Final": [], "Semi-Final": [], "Final": []}

for p in playoffs.keys():
    playoffs[p] = []

actual_round = ""
next_rounds = []

for p in playoffs.keys():
    if p == "Round of 16":
        # build the bracket order: A1, B2, C1, D2, ... then A2, B1, C2, D1, ...
        control = []
        for a in range(0, len(advanced*2), 1):
            if a < len(advanced):
                if a % 2 == 0:
                    control.append((advanced*2)[a][0])
                else:
                    control.append((advanced*2)[a][1])
            else:
                if a % 2 == 0:
                    control.append((advanced*2)[a][1])
                else:
                    control.append((advanced*2)[a][0])

        playoffs[p] = [[control[c], control[c+1]] for c in range(0, len(control)-1, 1) if c%2 == 0]

        for i in range(0, len(playoffs[p]), 1):
            game = playoffs[p][i]

            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)

            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)

            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])

            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2

            if actual_round != p:
                print("-"*10)
                print("Starting simulation of %s"%(p))
                print("-"*10)
                print("\n")

            # no draws in the knockout stage: the higher averaged probability advances
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, home, team_1_prob))
                next_rounds.append(home)

            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p

    else:
        # later rounds: pair the winners of the previous round in bracket order
        playoffs[p] = [[next_rounds[c], next_rounds[c+1]] for c in range(0, len(next_rounds)-1, 1) if c%2 == 0]
        next_rounds = []
        for i in range(0, len(playoffs[p])):
            game = playoffs[p][i]
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)

            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)

            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])

            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2

            if actual_round != p:
                print("-"*10)
                print("Starting simulation of %s"%(p))
                print("-"*10)
                print("\n")

            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, home, team_1_prob))
                next_rounds.append(home)

            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p

The results are as follows:

----------
Starting simulation of Round of 16
----------

Netherlands vs. United States: Netherlands advances with prob 0.54
Argentina vs. Denmark: Argentina advances with prob 0.59
Spain vs. Croatia: Spain advances with prob 0.61
Brazil vs. Uruguay: Brazil advances with prob 0.64
Senegal vs. England: England advances with prob 0.64
Poland vs. France: France advances with prob 0.67
Germany vs. Belgium: Belgium advances with prob 0.53
Switzerland vs. Portugal: Portugal advances with prob 0.57
----------
Starting simulation of Quarter-Final
----------

Netherlands vs. Argentina: Netherlands advances with prob 0.51
Spain vs. Brazil: Brazil advances with prob 0.54
England vs. France: England advances with prob 0.51
Belgium vs. Portugal: Portugal advances with prob 0.52
----------
Starting simulation of Semi-Final
----------

Netherlands vs. Brazil: Brazil advances with prob 0.55
England vs. Portugal: England advances with prob 0.51
----------
Starting simulation of Final
----------

Brazil vs. England: Brazil advances with prob 0.56
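
The bracket construction in the knockout code is easier to follow when written as two small helpers. The sketch below reproduces the same pairing order (group winner against the neighboring group's runner-up, then the mirrored pairs) and checks it against the simulated output above. The helper names `round_of_16_pairs` and `pair_winners` are assumptions introduced for illustration.

```python
def round_of_16_pairs(advanced):
    """advanced: [winner, runner_up] per group, in group order A..H."""
    # A1 vs B2, C1 vs D2, E1 vs F2, G1 vs H2
    top_half = [[advanced[i][0], advanced[i + 1][1]] for i in range(0, len(advanced), 2)]
    # A2 vs B1, C2 vs D1, E2 vs F1, G2 vs H1
    bottom_half = [[advanced[i][1], advanced[i + 1][0]] for i in range(0, len(advanced), 2)]
    return top_half + bottom_half

def pair_winners(winners):
    """Pair consecutive winners to form the next round."""
    return [[winners[i], winners[i + 1]] for i in range(0, len(winners) - 1, 2)]

# Group winners and runners-up as predicted in the group-stage output
advanced = [
    ["Netherlands", "Senegal"], ["England", "United States"],
    ["Argentina", "Poland"], ["France", "Denmark"],
    ["Spain", "Germany"], ["Belgium", "Croatia"],
    ["Brazil", "Switzerland"], ["Portugal", "Uruguay"],
]

pairs = round_of_16_pairs(advanced)
print(pairs)

# Round-of-16 winners, in order, from the simulated output above
r16_winners = ["Netherlands", "Argentina", "Spain", "Brazil",
               "England", "France", "Belgium", "Portugal"]
quarter_finals = pair_winners(r16_winners)
print(quarter_finals)
```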

We present the results as a bracket diagram.

import networkx as nx
from networkx.drawing.nx_pydot import graphviz_layout

plt.figure(figsize=(15, 10))
G = nx.balanced_tree(2, 3)  # binary tree with 15 nodes: 8 + 4 + 2 + 1 matches

labels = []

for p in playoffs.keys():
    for game in playoffs[p]:
        label = f"{game[0]}({round(game[2][0], 2)}) \n {game[1]}({round(game[2][1], 2)})"
        labels.append(label)

labels_dict = {}
labels_rev = list(reversed(labels))  # node 0 is the root, so the final comes first

for l in range(len(list(G.nodes))):
    labels_dict[l] = labels_rev[l]

pos = graphviz_layout(G, prog='twopi')
labels_pos = {n: (k[0], k[1]-0.08*k[1]) for n,k in pos.items()}
center = pd.DataFrame(pos).mean(axis=1).mean()

nx.draw(G, pos = pos, with_labels=False, node_color=range(15), edge_color="#bbf5bb", width=10, font_weight='bold',cmap=plt.cm.Greens, node_size=5000)
nx.draw_networkx_labels(G, pos = labels_pos, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="black", lw=.5, alpha=1), labels=labels_dict)
texts = ["Round \nof 16", "Quarter \n Final", "Semi \n Final", "Final\n"]
pos_y = pos[0][1] + 55
for text in reversed(texts):
    pos_x = center
    pos_y -= 75
    plt.text(pos_y, pos_x, text, fontsize = 18)

plt.axis('equal')
plt.show()

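The simulated World Cup is shown below. Our model predicts Brazil as champion, beating England in the final with a probability of 56%. The biggest upsets in the predictions are Belgium beating Germany, and England reaching the final after knocking out France in the quarter-finals. It is also interesting to see some matches decided by very narrow probability margins, such as Netherlands vs. Argentina (0.51).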

a36d80a937673797a352310046978583.png
c4f23a7c21d5d3cc9f8ac0c03c8d4968.png

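In this article, ShowMeAI applied machine learning to analyze and model the World Cup teams, then simulated and predicted the tournament results. The walkthrough covers data preprocessing, data analysis, feature engineering, model building and hyperparameter tuning, and applying the model and visualizing its output. Of course, part of what makes the World Cup so interesting is that anything can happen on the pitch, so let's follow the tournament together and enjoy every match.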

e9190f41b8de4af38c8a1a0c96f0513b~tplv-k3u1fbpfcp-zoom-1.image
