Arabica Coffee Beans - Origin and Variety
source link: https://datacrayon.com/posts/statistics/data-is-beautiful/arabica-coffee-beans-origin-and-variety/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Made with Chord Pro
You can create beautiful interactive visualisations like this one with Chord Pro. Learn how to make beautiful visualisations with the book, Data is Beautiful.
Chord Diagram
Preamble¶
import itertools import pandas as pd # for DataFrames from chord import Chord
Introduction¶
In this section, we're going to be pointing our beautifully colourful lens towards the warm and aromatic world of coffee. In particular, we're going to be visualising the co-occurrence of coffee bean variety and origin in over a thousand coffee reviews.
This section uses the Chord Pro software to create a visualisation. Grab a copy to produce the same output!
The Dataset¶
We're going to use the popular Coffee Quality Institute Database which I have forked on GitHub for posterity. The file arabatica_data.csv
contains the data we'll be using throughout this section, and the first thing we'll want to do is to load the data and output some samples for a sanity check.
data_url = "https://datacrayon.com/datasets/arabica_data.csv" data = pd.read_csv(data_url) data.head()
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters 0 1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0 1 2 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0 2 3 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m ... NaN 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.0 1800.0 1700.0 3 4 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 ... Green 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.0 2200.0 2000.0 4 5 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
5 rows × 44 columns
Data Wrangling¶
By viewing the CSV directly we can see our desired columns are named Country.of.Origin
and Variety
. Let's print out the columns to make sure they exist in the data we've loaded.
data.columns
Index(['Unnamed: 0', 'Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number', 'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer', 'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year', 'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity', 'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects', 'Quakers', 'Color', 'Category.Two.Defects', 'Expiration', 'Certification.Body', 'Certification.Address', 'Certification.Contact', 'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'], dtype='object')
Great! We can see both of these columns exist in our DataFrame.
Now let's take a peek at the unique values in both of these columns to see if any obvious issues stand out. We'll start with the Country.of.Origin
column.
data["Country.of.Origin"].unique()
array(['Ethiopia', 'Guatemala', 'Brazil', 'Peru', 'United States', 'United States (Hawaii)', 'Indonesia', 'China', 'Costa Rica', 'Mexico', 'Uganda', 'Honduras', 'Taiwan', 'Nicaragua', 'Tanzania, United Republic Of', 'Kenya', 'Thailand', 'Colombia', 'Panama', 'Papua New Guinea', 'El Salvador', 'Japan', 'Ecuador', 'United States (Puerto Rico)', 'Haiti', 'Burundi', 'Vietnam', 'Philippines', 'Rwanda', 'Malawi', 'Laos', 'Zambia', 'Myanmar', 'Mauritius', 'Cote d?Ivoire', nan, 'India'], dtype=object)
We can see some points that may cause issues when it comes to our visualisation.
There appears to be at least one nan
value in Country.of.Origin
. We're only interested in coffee bean reviews which aren't missing this data, so let's remove any samples where nan
exists.
data = data[data["Country.of.Origin"].notna()]
Also, the entries in Country.of.Origin
will be used as labels on our visualisation. Ideally, we don't want these to be longer than they need to be. So let's shorten some of the longer names.
data["Country.of.Origin"] = data["Country.of.Origin"].replace( "United States (Hawaii)", "Hawaii" ) data["Country.of.Origin"] = data["Country.of.Origin"].replace( "Tanzania, United Republic Of", "Tanzania" ) data["Country.of.Origin"] = data["Country.of.Origin"].replace( "United States (Puerto Rico)", "Puerto Rico" )
Now let's take a peek at the unique Variety
column.
data["Variety"].unique()
array([nan, 'Other', 'Bourbon', 'Catimor', 'Ethiopian Yirgacheffe', 'Caturra', 'SL14', 'Sumatra', 'SL34', 'Hawaiian Kona', 'Yellow Bourbon', 'SL28', 'Gesha', 'Catuai', 'Pacamara', 'Typica', 'Sumatra Lintong', 'Mundo Novo', 'Java', 'Peaberry', 'Pacas', 'Mandheling', 'Ruiru 11', 'Arusha', 'Ethiopian Heirlooms', 'Moka Peaberry', 'Sulawesi', 'Blue Mountain', 'Marigojipe', 'Pache Comun'], dtype=object)
We can see this column also has at least one nan
entry, so let's remove these too.
data = data[data["Variety"].notna()]
Also, there appears to be at least one entry of Other
for the Variety
. For this visualisation, we're not interested in Other
, so let's remove them too.
data = data[data["Variety"] != "Other"]
From previous Chord diagram visualisations we know that they can become too crowded with too many different categories. With this in mind, let's choose to visualise only the top 121212 most frequently occurring Country.of.Origin
and Variety
.
data = data[ data["Country.of.Origin"].isin( list(data["Country.of.Origin"].value_counts()[:12].index) ) ] data = data[data["Variety"].isin(list(data["Variety"].value_counts()[:12].index))]
As we're creating a bipartite chord diagram, let's define what labels will be going on the left and the right.
On the left, we'll have all of our countries of origin.
left = list(data["Country.of.Origin"].value_counts().index)[::-1] pd.DataFrame(left)
0 0 China 1 El Salvador 2 Uganda 3 Kenya 4 Hawaii 5 Costa Rica 6 Honduras 7 Taiwan 8 Brazil 9 Colombia 10 Guatemala 11 Mexico
And on the right, we'll have all of our varieties.
right = list(data["Variety"].value_counts().index) pd.DataFrame(right)
0 0 Caturra 1 Bourbon 2 Typica 3 Catuai 4 Hawaiian Kona 5 Yellow Bourbon 6 Mundo Novo 7 SL14 8 SL28 9 Pacas 10 Catimor 11 SL34
We're good to go! So let's select just these two columns and work with a DataFrame containing only them as we move forward.
origin_variety = pd.DataFrame(data[["Country.of.Origin", "Variety"]].values) origin_variety
0 1 0 Guatemala Bourbon 1 China Catimor 2 Costa Rica Caturra 3 Brazil Bourbon 4 Uganda SL14 ... ... ... 884 Honduras Catuai 885 Honduras Catuai 886 Mexico Bourbon 887 Guatemala Catuai 888 Honduras Caturra
889 rows × 2 columns
Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.
We can build this list of names by adding together the labels for the left and right side of our bipartite diagram.
names = left + right pd.DataFrame(names)
0 0 China 1 El Salvador 2 Uganda 3 Kenya 4 Hawaii 5 Costa Rica 6 Honduras 7 Taiwan 8 Brazil 9 Colombia 10 Guatemala 11 Mexico 12 Caturra 13 Bourbon 14 Typica 15 Catuai 16 Hawaiian Kona 17 Yellow Bourbon 18 Mundo Novo 19 SL14 20 SL28 21 Pacas 22 Catimor 23 SL34
Now we can create our empty co-occurrence matrix using these type names for the row and column indeces.
matrix = pd.DataFrame(0, index=names, columns=names) matrix
China El Salvador Uganda Kenya Hawaii Costa Rica Honduras Taiwan Brazil Colombia ... Typica Catuai Hawaiian Kona Yellow Bourbon Mundo Novo SL14 SL28 Pacas Catimor SL34 China 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 El Salvador 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Uganda 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Kenya 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Hawaii 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Costa Rica 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Honduras 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Taiwan 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Brazil 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Colombia 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Guatemala 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Mexico 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Caturra 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Bourbon 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Typica 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Catuai 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Hawaiian Kona 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Yellow Bourbon 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Mundo Novo 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 SL14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 SL28 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Pacas 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 Catimor 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 SL34 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
24 rows × 24 columns
We can populate a co-occurrence matrix with the following approach. We'll start by creating a list with every type pairing in its original and reversed form.
origin_variety = list( itertools.chain.from_iterable((i, i[::-1]) for i in origin_variety.values) )
Which we can now use to create the matrix.
for pairing in origin_variety: matrix.at[pairing[0], pairing[1]] += 1 matrix = matrix.values.tolist()
We can list the DataFrame
for better presentation.
pd.DataFrame(matrix)
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23 0 0 0 0 0 0 0 0 0 0 0 ... 2 0 0 0 0 0 0 0 12 0 1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 17 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 14 0 0 8 4 0 0 0 0 0 0 0 0 0 0 ... 0 0 44 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 ... 0 15 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 ... 0 21 0 0 0 0 0 4 1 0 7 0 0 0 0 0 0 0 0 0 0 ... 59 0 0 3 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 ... 0 19 0 32 20 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 ... 3 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 ... 0 9 0 0 0 0 0 6 0 0 11 0 0 0 0 0 0 0 0 0 0 ... 137 4 0 0 12 0 0 1 0 0 12 0 0 0 0 0 28 23 2 0 129 ... 0 0 0 0 0 0 0 0 0 0 13 0 13 3 0 0 1 0 2 41 0 ... 0 0 0 0 0 0 0 0 0 0 14 2 0 0 0 0 0 0 59 0 3 ... 0 0 0 0 0 0 0 0 0 0 15 0 0 0 0 0 15 21 0 19 0 ... 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 44 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 17 0 0 0 0 0 0 0 3 32 0 ... 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 20 0 ... 0 0 0 0 0 0 0 0 0 0 19 0 0 17 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 20 0 0 0 14 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 21 0 2 0 0 0 0 4 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 22 12 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 23 0 0 0 8 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
24 rows × 24 columns
Chord Diagram¶
Time to visualise the co-occurrence of items using a chord diagram. We are going to use a list of custom colours that represent the items.
Let's specify some colours for the left and right sides.
colors = [ "#ff575c","#ff914d","#ffca38","#f2fa00","#C3F500","#94f000", "#00fa68","#00C1A2","#0087db","#0054f0","#5d00e0","#2F06EB", "#6f1d1b","#955939","#A87748","#bb9457","#7f5e38","#432818", "#6e4021","#99582a","#cc9f69","#755939","#BAA070","#ffe6a7",]
Chord.user = "email here" Chord.key = "license key here"
And then we invoke the Chord function passing in our desired customisation arguments.
Chord( matrix, names, colors=colors, wrap_labels=False, width=910, margin=40, padding=0.05, font_size="12px", font_size_large="18px", noun="coffee bean reviews", title="Coffee Bean Reviews - Variety and Origin", divide=True, divide_idx=len(left), divide_size=0.6, allow_download=True, ).show()
Chord Diagram
Conclusion¶
In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a bipartite chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!
Made with Chord Pro
You can create beautiful interactive visualisations like this one with Chord Pro. Learn how to make beautiful visualisations with the book, Data is Beautiful.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK