Notes on matplotlib and seaborn charts (python)

My current workplace is a python shop. I actually didn’t use pandas/numpy for most of my prior academic projects, but I really like pandas for data manipulation now that I know it better. I’m using python objects (lists, dictionaries, sets) inside of data frames quite a bit to do some tricky data manipulations.

I do however really miss using ggplot to make graphs. So here are my notes on using python tools to make plots, specifically the matplotlib and seaborn libraries. Here is the data/code to follow along on your own.

some setup

First I am going to redo the data analysis for predictive recidivism I did in a prior blog post. One change is that I noticed the default random forest implementation in scikit-learn was prone to overfitting the data – so one simple regularization is to limit the depth of the trees, the number of samples needed to split, or the total number of samples in a final leaf. (I noticed this when I developed a simulated example where xgboost did well with the defaults, but random forests did not. It happened to be because the xgboost defaults allow a much smaller number of potential splits; when using similar defaults the two were pretty much the same.)

Here I just up the minimum samples per leaf to 100.

#########################################################
#set up for libraries and data I need
import pandas as pd
import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

my_dir = r'C:\Users\andre\Dropbox\Documents\BLOG\matplotlib_seaborn'
os.chdir(my_dir)

#Modelling recidivism using random forests, see below for background 
#https://andrewpwheeler.com/2020/01/05/balancing-false-positives/

recid = pd.read_csv('PreppedCompas.csv')
#Preparing the variables I want (copy so the later column assignments do not warn)
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']].copy()
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
recid_prep['race'] = recid['race']

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1].copy()
recid_test = recid_prep[recid_prep['Train'] == 0].copy()

#Now estimating the model
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis'] #no race in model
dep_var = 'Recid30'
rf_mod = RandomForestClassifier(n_estimators=500, random_state=10, min_samples_leaf=100)
rf_mod.fit(X = recid_train[ind_vars], y = recid_train[dep_var])

#Now applying out of sample
pred_prob = rf_mod.predict_proba(recid_test[ind_vars] )
recid_test['prob'] = pred_prob[:,1]
#########################################################

matplotlib themes

One thing you can do is easily update the base template for matplotlib. Here are example settings I typically use, in particular making the default font sizes much larger. I also like using a drop shadow for legends – although many consider drop shadows on data charts to be chart junk, they actually help distinguish the legend from the background plot (a trick I learned from cartographic maps).

#########################################################
#Settings for matplotlib base

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}
 
print( matplotlib.rcParams )
#matplotlib.rcParams.update(andy_theme)

#print(plt.style.available)
#plt.style.use('classic')
#########################################################

I have it commented out here, but once you define your dictionary of particular style changes, you can just run matplotlib.rcParams.update(your_dictionary) to update the base plots. You can also see the full set of options by printing matplotlib.rcParams, and there are several pre-built styles available to view as well (plt.style.available).
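
matplotlib also has a context manager for temporary rc settings if you only want the theme applied to particular charts instead of globally. A minimal sketch, using the andy_theme dictionary defined above:

#update the defaults globally
matplotlib.rcParams.update(andy_theme)

#or only apply the settings to plots drawn inside the with block
with matplotlib.rc_context(rc=andy_theme):
    fig, ax = plt.subplots(figsize=(6,4))
    ax.plot([1, 2, 3], [2, 5, 3], marker='o')
    plt.show()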

creating a lift-calibration line plot

Now I am going to create a plot that I have seen several names used for – I am going to call it a calibration lift-plot. Calibration is basically asking "if my model predicts something will happen 5% of the time, does it actually happen 5% of the time?" I used to always do calibration charts where I binned the data, put the predicted probability on the X axis, and the observed proportion on the Y (see this example). But DataRobot has an alternative plot where you superimpose those two lines over the bins, and it has been growing on me.

#########################################################
#Creating a calibration lift-plot for entire test set

bin_n = 30
recid_test['Bin'] = pd.qcut(recid_test['prob'], bin_n, range(bin_n) ).astype(int) + 1
recid_test['Count'] = 1

agg_bins = recid_test.groupby('Bin', as_index=False)[['Recid30','prob','Count']].sum()
agg_bins['Predicted'] = agg_bins['prob']/agg_bins['Count']
agg_bins['Actual'] = agg_bins['Recid30']/agg_bins['Count']

#Now can make a nice matplotlib plot
fig, ax = plt.subplots(figsize=(6,4))
ax.plot(agg_bins['Bin'], agg_bins['Predicted'], marker='+', label='Predicted')
ax.plot(agg_bins['Bin'], agg_bins['Actual'], marker='o', markeredgecolor='w', label='Actual')
ax.set_ylabel('Probability')
ax.legend(loc='upper left')
plt.savefig('Default_mpl.png', dpi=500, bbox_inches='tight')
plt.show()
#########################################################

[Figure: Default_mpl.png – calibration lift-plot with the default matplotlib style]

You can see that the model is fairly well calibrated in the test set, and that the predictions range from around 10% to 75%. It is noisy and snakes high and low, but that is expected as we don’t have a real giant test sample here (around a total of 100 observations per bin).

So this is the default matplotlib style. Here is the slight update using my above specific theme.

matplotlib.rcParams.update(andy_theme)
fig, ax = plt.subplots(figsize=(6,4))
ax.plot(agg_bins['Bin'], agg_bins['Predicted'], marker='+', label='Predicted')
ax.plot(agg_bins['Bin'], agg_bins['Actual'], marker='o', markeredgecolor='w', label='Actual')
ax.set_ylabel('Probability')
ax.legend(loc='upper left')
plt.savefig('Mytheme_mpl.png', dpi=500, bbox_inches='tight')
plt.show()

[Figure: Mytheme_mpl.png – the same calibration lift-plot with the custom theme applied]

Not too different from the default, but I only have to call matplotlib.rcParams.update(andy_theme) one time and it applies to all of my subsequent charts. So I don't have to continually set the legend shadow, grid lines, etc.

making a lineplot in seaborn

matplotlib is basically like base graphics in R, where if you want to superimpose a bunch of stuff you make the base plot and then add in lines() or points() etc. on top of the base. This is ok for only a few items, but if you have your data in long format, where a certain category distinguishes groups in the data, it is not very convenient.
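
For example, with data in long format where a column (here Type) distinguishes the groups – like the agg_long frame that gets melted in the next snippet – plain matplotlib makes you loop over the groups yourself. A quick sketch:

#plain matplotlib: loop over each group in the long data and add a line
fig, ax = plt.subplots(figsize=(6,4))
for typ, dat in agg_long.groupby('Type'):
    ax.plot(dat['Bin'], dat['Probability'], marker='o', label=typ)
ax.set_ylabel('Probability')
ax.legend(loc='upper left')
plt.show()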

The seaborn library provides some functions to get closer to the ggplot idea of mapping aesthetics using long data, so here is the same lineplot example. seaborn builds stuff on top of matplotlib, so it inherits the style I defined earlier. In this code snippet, first I melt the agg_bins data to long format. Then it is a similarish plot call to draw the graph.

#########################################################
#Now making the same chart in seaborn
#Easier to melt the data to long format

agg_long = pd.melt(agg_bins, id_vars=['Bin'], value_vars=['Predicted','Actual'], var_name='Type', value_name='Probability')

plt.figure(figsize=(6,4))
sns.lineplot(x='Bin', y='Probability', hue='Type', style='Type', data=agg_long, dashes=False,
             markers=True, markeredgecolor='w')
plt.xlabel(None)
plt.savefig('sns_lift.png', dpi=500, bbox_inches='tight')   
#########################################################

[Figure: sns_lift.png – seaborn lineplot version of the calibration lift-plot]

By default seaborn adds in a legend title – although it is not stuffed into the actual legend title slot. (This is because seaborn can handle multiple sub-aesthetics more gracefully that way, e.g. mapping color to one attribute and dash types to another.) But here I just want to get rid of it. (Similar to maps, there is no need to give a legend the title "legend" – it should be obvious.) Also the legend did not inherit the white marker edge colors, so I set that as well.

#Now lets edit the legend
plt.figure(figsize=(6,4))
ax = sns.lineplot(x='Bin', y='Probability', hue='Type', style='Type', data=agg_long, dashes=False,
             markers=True, markeredgecolor='w')
plt.xlabel(None)
handles, labels = ax.get_legend_handles_labels()
for i in handles:
    i.set_markeredgecolor('w')
legend = ax.legend(handles=handles[1:], labels=labels[1:])
plt.savefig('sns_lift_edited_leg.png', dpi=500, bbox_inches='tight')

[Figure: sns_lift_edited_leg.png – the seaborn lineplot with the edited legend]

making a small multiple plot

Another nicety of seaborn is that it can make small multiple plots for you. So here I conduct the calibration analysis among subsets of the data for different racial categories. First I collapse the smaller racial categories into an 'Other' category, then I do the same qcut, but within the different groupings. To figure that out, I do what all good programmers do: google it and adapt a StackOverflow example.

#########################################################
#replace everyone not black/white as other
print( recid_test['race'].value_counts() )
other_group = ['Hispanic','Other','Asian','Native American']
recid_test['RaceComb'] = recid_test['race'].replace(other_group, 'Other')
print(recid_test['RaceComb'].value_counts() )

#qcut by group
bin_sub = 20
recid_test['BinRace'] = (recid_test.groupby('RaceComb')['prob']
                         .transform(lambda x: pd.qcut(x, bin_sub, labels=range(bin_sub)))
                         .astype(int) + 1)

#Now aggregate two categories, and then melt
race_bins = recid_test.groupby(['BinRace','RaceComb'], as_index=False)[['Recid30','prob','Count']].sum()
race_bins['Predicted'] = race_bins['prob']/race_bins['Count']
race_bins['Actual'] = race_bins['Recid30']/race_bins['Count']
race_long = pd.melt(race_bins, id_vars=['BinRace','RaceComb'], value_vars=['Predicted','Actual'], var_name='Type', value_name='Probability')

#Now making the small multiple plot
d = {'marker': ['o','X']}
ax = sns.FacetGrid(data=race_long, col='RaceComb', hue='Type', hue_kws=d,
                   col_wrap=2, despine=False, height=4)
ax.map(plt.plot, 'BinRace', 'Probability', markeredgecolor="w")
ax.set_titles("{col_name}")
ax.set_xlabels("")
#plt.legend(loc="upper left")
plt.legend(bbox_to_anchor=(1.9,0.8))
plt.savefig('sns_smallmult_niceleg.png', dpi=500, bbox_inches='tight')
#########################################################

[Figure: sns_smallmult_niceleg.png – small multiple calibration plots by racial category]

And you can see that the model is fairly well calibrated for each racial subset of the data. The Other category is more volatile, but it has a smaller number of observations as well. Overall it does not look too bad. (If you take out the minimum of 100 samples per leaf though, all of these calibration plots look really bad!)

I am having a hellishly hard time mapping sns.lineplot onto the sub-charts, but you can just use normal matplotlib plots. When you set the legend, it gets attached to the last subplot that was drawn, so one way to place it where you want is to use bbox_to_anchor and just test to see where it lands in a good spot (you can pass negative values to push it further to the left). Seaborn has nice functions to set the grid panel titles using formatted string substitution. And the post is long enough – you can play around yourself to see how the other options change the look of the plot.
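
For what it is worth, one route that may work for mapping sns.lineplot onto the grid (a sketch I have not battle-tested here) is FacetGrid.map_dataframe, which hands each subset to the plotting function via a data= keyword, so x/y can be passed as named arguments. Note the hue needs to go in the FacetGrid constructor for add_legend to know about it:

#possible alternative: map sns.lineplot onto the facets with map_dataframe
g = sns.FacetGrid(data=race_long, col='RaceComb', hue='Type',
                  col_wrap=2, despine=False, height=4)
g.map_dataframe(sns.lineplot, x='BinRace', y='Probability',
                marker='o', markeredgecolor='w')
g.set_titles("{col_name}")
g.set_xlabels("")
g.add_legend()
plt.savefig('sns_smallmult_mapdf.png', dpi=500, bbox_inches='tight')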

A few notes on various gotchas I've encountered so far:

  • For sns.FacetGrid, you need to set the size of the plots in that call, not by instantiating plt.figure(figsize=(6,4)). (This is because it draws multiple figures.)
  • When drawing elements in the chart, even for named arguments the order often matters. You often need to set things like color before the other aesthetic arguments. (I believe my problem mapping sns.lineplot onto my small multiple is an inheritance issue, where plt.plot does not take named arguments for x/y but sns.lineplot does.)
  • To edit the legend in a FacetGrid set of charts, the object returned is the whole grid, not a single subplot. Since each facet inherits the same legend though, you can do handles, labels = ax.axes.flat[0].get_legend_handles_labels() to grab the legend handles to edit if you want (a sketch follows this list).
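
As a rough sketch of that last point, using the FacetGrid object from the small multiple example above (which I named ax):

#grab the handles/labels each facet inherits, and edit them
handles, labels = ax.axes.flat[0].get_legend_handles_labels()
for h in handles:
    h.set_markeredgecolor('w')
#then place a single figure level legend, nudging it with bbox_to_anchor
ax.fig.legend(handles=handles, labels=labels, bbox_to_anchor=(1.15, 0.8))
plt.savefig('sns_smallmult_editleg.png', dpi=500, bbox_inches='tight')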
