Scraping National Healthcare Expenditure Data with Python: Which Country Spends the Most and Which the Least?

 4 years ago
source link: http://mp.weixin.qq.com/s?__biz=MzI2NjkyNDQ3Mw%3D%3D&%3Bmid=2247493827&%3Bidx=1&%3Bsn=342fd198b587cfd4332837c38c90339e

Full text: 3,326 characters; estimated reading time: 25 minutes


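The whole world is struggling with the pandemic. Different countries have adopted different response strategies, with different results. That is what prompted this article: I decided to look at what countries spend on healthcare infrastructure and to visualize healthcare expenditure across countries.

Because I could not find a reliable source for the most recent year, this article uses data from 2016. The data shows clearly which country spends the most and which spends the least. I had long wanted to try web scraping and data visualization in Python, and this makes a good project. Typing the data into Excel by hand would certainly be much faster, but then I would miss a valuable chance to practice some skills.

Data science is about using a variety of toolkits to solve problems, and web scraping and regular expressions are two areas I need to study. The result is short but intricate, and the project shows how to combine three techniques to solve a data science problem.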

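Requirements

Web scraping consists of two main parts:

·        Fetching the data by making an HTTP request

·        Extracting the important data by parsing the HTML DOM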
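The two parts listed above can be sketched as follows. This is a minimal sketch rather than the article's exact code: `fetch_html` and `extract_items` are hypothetical helper names, the sample markup is made up so that the parsing step can be tried without a network connection, and the 30-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Part 1: make an HTTP GET request and return the response body."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_items(html):
    """Part 2: parse the HTML and pull out the text of each matching element."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="item")]


# Hypothetical markup, used here so the parsing step can be tried offline.
sample_html = """
<div class="item">Burundi</div>
<div class="item">Norway</div>
<div class="other">ignored</div>
"""
items = extract_items(sample_html)
print(items)
```

Keeping the two parts in separate functions makes it possible to test the parsing logic against saved HTML without hitting the site on every run.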

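Libraries and Tools

·        Requests makes it very simple to send HTTP requests.

·        Pandas is a Python package that provides fast, flexible, and expressive data structures.

·        Web Scraper helps scrape dynamic websites without setting up a browser automation tool.

·        Beautiful Soup is a Python library for pulling data out of HTML and XML files.

·        matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.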

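Setup

Setup is simple: create a folder and install Beautiful Soup, Requests, matplotlib, and pandas. This assumes Python 3.x is already installed. Run the following commands to create the folder and install the libraries.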

mkdir scraper
pip install beautifulsoup4
pip install requests
pip install matplotlib
pip install pandas

Now create a file with any name in that folder; I named mine scraping.py. Then import pandas, Beautiful Soup, matplotlib, and requests in the file, as shown below:

import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import requests

What we will scrape: the country name and the per-capita expenditure.


Web Scraping

Now that the scraper setup is ready, we send a GET request to the target URL to get the raw HTML data.

r = requests.get("https://api.scrapingdog.com/scrape?api_key=<YOUR_API_KEY>&url=https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD?most_recent_value_desc=false&dynamic=true").text
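The target URL carries its own query string, so pasting it unencoded into another URL leaves it ambiguous which `?` and `&` belong to which request. One way around this is to let Requests encode the parameters. The sketch below assumes that `api_key`, `url`, and `dynamic` are all parameters of the scraping API, as the request line above suggests; `build_scrape_request` is a hypothetical helper, and the request is only prepared, not sent.

```python
import requests

API_ENDPOINT = "https://api.scrapingdog.com/scrape"
TARGET_URL = (
    "https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD"
    "?most_recent_value_desc=false"
)


def build_scrape_request(api_key):
    """Prepare (but do not send) the GET request, letting requests encode the query."""
    # Assumption: "dynamic" belongs to the scraping API, not to the World Bank URL.
    params = {"api_key": api_key, "url": TARGET_URL, "dynamic": "true"}
    return requests.Request("GET", API_ENDPOINT, params=params).prepare()


prepared = build_scrape_request("<YOUR_API_KEY>")
print(prepared.url)  # the nested URL is percent-encoded inside the "url" parameter

# To actually send it (requires a valid key and network access):
# with requests.Session() as session:
#     r = session.send(prepared, timeout=60).text
```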

This returns the HTML of the target URL, which we then parse with Beautiful Soup.

soup = BeautifulSoup(r, 'html.parser')
country = list()
expense = list()

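I use two empty lists to store the country names and each country's per-capita expenditure per day. On the page, each country sits in a tag with the class "item", so we collect all of those tags in a list.

try:
    # Each country row on the page is a <div class="item">
    Countries = soup.find_all("div", {"class": "item"})
except Exception:
    Countries = None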
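One detail worth checking here: as far as I know, `find_all` does not raise when nothing matches. It returns an empty result instead, so an explicit emptiness check may be more useful than the `except` branch. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup that contains no "item" elements at all.
soup = BeautifulSoup("<p>no items here</p>", "html.parser")

found = soup.find_all("div", {"class": "item"})
if not found:
    # An empty result usually means the page layout changed or the request failed.
    print("No 'item' elements found; check the HTML that came back.")
```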

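The scraped list holds 190 entries (the last one is the aggregate "World" row), so we run a for loop over them to collect each entry's healthcare expenditure:

for i in range(0, 190):
    # Countries[0] is skipped (hence i + 1); the class-less child divs hold the cell values
    cells = Countries[i + 1].find_all("div", {"class": None})
    country.append(cells[0].text.strip())
    # Drop thousands separators, round to whole dollars, then convert annual to daily
    expense.append(round(float(cells[2].text.strip().replace(',', ''))) / 365)
Data = {'country': country, 'expense': expense}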
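The loop above hard-codes 190 rows and fixed cell positions, which breaks as soon as the page changes. A more defensive variant might look like the sketch below. It is not the article's code: the markup is hypothetical (three direct child divs per item: name, year, value), `parse_rows` is a made-up helper, and the two sample figures are the whole-dollar values implied by the article's output.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the layout the loop relies on:
# each "item" div holds three child divs (name, year, value).
SAMPLE_HTML = """
<div class="item"><div>Country</div><div>Most Recent Year</div><div>Most Recent Value</div></div>
<div class="item"><div><a href="#">Burundi</a></div><div>2016</div><div>18</div></div>
<div class="item"><div><a href="#">United States</a></div><div>2016</div><div>9,870</div></div>
<div class="item"><div><a href="#">No Data Land</a></div><div></div><div></div></div>
"""


def parse_rows(html):
    """Return (country, dollars_per_day) pairs, skipping rows without a numeric value."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="item"):
        cells = item.find_all("div", recursive=False)  # direct children only
        if len(cells) < 3:
            continue
        name = cells[0].get_text(strip=True)
        raw_value = cells[2].get_text(strip=True).replace(",", "")
        try:
            annual = float(raw_value)
        except ValueError:
            continue  # header row or missing value
        rows.append((name, annual / 365))
    return rows


rows = parse_rows(SAMPLE_HTML)
for name, per_day in rows:
    print(f"{name}: {per_day:.3f}")
```

Iterating over whatever rows are present, and skipping those that do not parse, avoids an IndexError when the number of entries is not exactly what was expected.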

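Because I want to see how much these countries spend per day, I divide each figure by 365. It might have been easier to divide the given data by 365 directly, but then there would be nothing to learn. `Data` now looks like this: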

{'country': ['Central African Republic', 'Burundi', 'Mozambique', 'Congo, Dem. Rep.', 'Gambia, The', 'Niger', 'Madagascar', 'Ethiopia', 'Malawi', 'Mali',
'Eritrea', 'Benin', 'Chad', 'Bangladesh', 'Tanzania', 'Guinea', 'Uganda', 'Haiti', 'Togo', 'Guinea-Bissau',
'Pakistan', 'Burkina Faso', 'Nepal', 'Mauritania', 'Rwanda', 'Senegal', 'Papua New Guinea', 'Lao PDR', 'Tajikistan', 'Zambia',
'Afghanistan', 'Comoros', 'Myanmar', 'India', 'Cameroon', 'Syrian Arab Republic', 'Kenya', 'Ghana', "Cote d'Ivoire", 'Liberia',
'Djibouti', 'Congo, Rep.', 'Yemen, Rep.', 'Kyrgyz Republic', 'Cambodia', 'Nigeria', 'Timor-Leste', 'Lesotho', 'Sierra Leone', 'Bhutan',
'Zimbabwe', 'Angola', 'Sao Tome and Principe', 'Solomon Islands', 'Vanuatu', 'Indonesia', 'Vietnam', 'Philippines', 'Egypt, Arab Rep.', 'Uzbekistan',
'Mongolia', 'Ukraine', 'Sudan', 'Iraq', 'Sri Lanka', 'Cabo Verde', 'Moldova', 'Morocco', 'Fiji', 'Kiribati',
'Nicaragua', 'Guyana', 'Honduras', 'Tonga', 'Bolivia', 'Gabon', 'Eswatini', 'Thailand', 'Jordan', 'Samoa',
'Guatemala', 'St. Vincent and the Grenadines', 'Tunisia', 'Algeria', 'Kazakhstan', 'Azerbaijan', 'Albania', 'Equatorial Guinea', 'El Salvador', 'Jamaica',
'Belize', 'Georgia', 'Libya', 'Peru', 'Belarus', 'Paraguay', 'North Macedonia', 'Colombia', 'Suriname', 'Armenia',
'Malaysia', 'Botswana', 'Micronesia, Fed. Sts.', 'China', 'Namibia', 'Dominican Republic', 'Iran, Islamic Rep.', 'Dominica', 'Turkmenistan', 'South Africa',
'Bosnia and Herzegovina', 'Mexico', 'Turkey', 'Russian Federation', 'Romania', 'St. Lucia', 'Serbia', 'Ecuador', 'Tuvalu', 'Grenada',
'Montenegro', 'Mauritius', 'Seychelles', 'Bulgaria', 'Antigua and Barbuda', 'Brunei Darussalam', 'Oman', 'Lebanon', 'Poland', 'Marshall Islands',
'Latvia', 'Croatia', 'Costa Rica', 'St. Kitts and Nevis', 'Hungary', 'Argentina', 'Cuba', 'Lithuania', 'Nauru', 'Brazil',
'Panama', 'Maldives', 'Trinidad and Tobago', 'Kuwait', 'Bahrain', 'Saudi Arabia', 'Barbados', 'Slovak Republic', 'Estonia', 'Chile',
'Czech Republic', 'United Arab Emirates', 'Uruguay', 'Greece', 'Venezuela, RB', 'Cyprus', 'Palau', 'Portugal', 'Qatar', 'Slovenia',
'Bahamas, The', 'Korea, Rep.', 'Malta', 'Spain', 'Singapore', 'Italy', 'Israel', 'Monaco', 'San Marino', 'New Zealand',
'Andorra', 'United Kingdom', 'Finland', 'Belgium', 'Japan', 'France', 'Canada', 'Austria', 'Germany', 'Netherlands',
'Ireland', 'Australia', 'Iceland', 'Denmark', 'Sweden', 'Luxembourg', 'Norway', 'Switzerland', 'United States', 'World'],
'expense': [0.043835616438356165, 0.049315068493150684, 0.052054794520547946, 0.057534246575342465, 0.057534246575342465, 0.06301369863013699, 0.06575342465753424, 0.07671232876712329, 0.0821917808219178, 0.0821917808219178,
0.0821917808219178, 0.0821917808219178, 0.08767123287671233, 0.09315068493150686, 0.09863013698630137, 0.10136986301369863, 0.10410958904109589, 0.10410958904109589, 0.10684931506849316, 0.10684931506849316,
0.1095890410958904, 0.11232876712328767, 0.1232876712328767, 0.12876712328767123, 0.13150684931506848, 0.14520547945205478, 0.1506849315068493, 0.1506849315068493, 0.15342465753424658, 0.15616438356164383,
0.15616438356164383, 0.16164383561643836, 0.16986301369863013, 0.1726027397260274, 0.17534246575342466, 0.18082191780821918, 0.18082191780821918, 0.1863013698630137, 0.1863013698630137, 0.1863013698630137,
0.1917808219178082, 0.1917808219178082, 0.19726027397260273, 0.2, 0.2136986301369863, 0.21643835616438356, 0.2191780821917808, 0.2356164383561644, 0.2356164383561644, 0.2493150684931507,
0.25753424657534246, 0.2602739726027397, 0.2876712328767123, 0.29041095890410956, 0.3013698630136986, 0.30684931506849317, 0.336986301369863, 0.35342465753424657, 0.3589041095890411, 0.3698630136986301,
0.3863013698630137, 0.3863013698630137, 0.41643835616438357, 0.4191780821917808, 0.4191780821917808, 0.43561643835616437, 0.4684931506849315, 0.4684931506849315, 0.4931506849315068, 0.5150684931506849,
0.5150684931506849, 0.5260273972602739, 0.547945205479452, 0.5561643835616439, 0.5835616438356165, 0.6027397260273972, 0.6054794520547945, 0.6082191780821918, 0.6136986301369863, 0.6219178082191781,
0.6602739726027397, 0.684931506849315, 0.7013698630136986, 0.7123287671232876, 0.7178082191780822, 0.7342465753424657, 0.7452054794520548, 0.7698630136986301, 0.8054794520547945, 0.810958904109589,
0.8328767123287671, 0.8438356164383561, 0.8575342465753425, 0.8657534246575342, 0.8712328767123287, 0.8958904109589041, 0.8986301369863013, 0.9315068493150684, 0.9753424657534246, 0.9835616438356164,
0.9917808219178083, 1.0410958904109588, 1.0602739726027397, 1.0904109589041096, 1.104109589041096, 1.1342465753424658, 1.1369863013698631, 1.1479452054794521, 1.158904109589041, 1.1726027397260275,
1.2164383561643837, 1.2657534246575342, 1.284931506849315, 1.284931506849315, 1.3041095890410959, 1.3424657534246576, 1.3534246575342466, 1.3835616438356164, 1.389041095890411, 1.4136986301369863,
1.4575342465753425, 1.515068493150685, 1.6356164383561644, 1.6767123287671233, 1.7068493150684931, 1.7287671232876711, 1.7753424657534247, 1.8136986301369864, 2.2164383561643834, 2.3315068493150686,
2.3945205479452056, 2.421917808219178, 2.4356164383561643, 2.5506849315068494, 2.5835616438356164, 2.6164383561643834, 2.66027397260274, 2.706849315068493, 2.7726027397260276, 2.7835616438356166,
2.852054794520548, 2.871232876712329, 2.915068493150685, 2.926027397260274, 3.010958904109589, 3.1424657534246574, 3.1890410958904107, 3.23013698630137, 3.2465753424657535, 3.263013698630137,
3.621917808219178, 3.6246575342465754, 3.778082191780822, 4.13972602739726, 4.323287671232877, 4.476712328767123, 4.586301369863014, 4.934246575342466, 5.005479452054795, 5.024657534246575,
5.027397260273973, 5.6, 6.3780821917808215, 6.5479452054794525, 6.745205479452054, 7.504109589041096, 7.772602739726027, 8.054794520547945, 8.254794520547945, 10.26027397260274,
10.506849315068493, 10.843835616438357, 11.27945205479452, 11.367123287671232, 11.597260273972603, 11.67945205479452, 12.213698630136987, 12.843835616438357, 12.915068493150685, 12.991780821917809,
13.038356164383561, 13.704109589041096, 13.873972602739727, 15.24931506849315, 15.646575342465754, 17.18082191780822, 20.487671232876714, 26.947945205479453, 27.041095890410958, 2.8109589041095893]}
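The conversion behind these numbers can be checked by hand: the annual per-capita figure is rounded to whole dollars and then divided by 365. A minimal sketch, where `annual_to_daily` is a hypothetical helper and the input strings are the whole-dollar figures implied by the first country and by the United States in the output above:

```python
def annual_to_daily(raw_value):
    """Convert a formatted annual figure such as '9,870' to dollars per day."""
    annual = float(raw_value.replace(",", ""))  # drop thousands separators
    return round(annual) / 365  # round to whole dollars first, as the loop does


print(annual_to_daily("16"))     # Central African Republic, the first entry in the list
print(annual_to_daily("9,870"))  # United States, the last country before "World"
```

Note that this uses a fixed 365-day year, even though 2016 was a leap year.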

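DataFrame

Before plotting, we need to prepare a DataFrame with Pandas. First, what is a DataFrame? It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Creating one is straightforward:

df = pd.DataFrame(Data, columns=['country', 'expense'])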
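With the data in a DataFrame, the question in the title can be answered directly. The sketch below uses only a five-row subset of the values shown earlier, and it assumes the aggregate "World" row should be left out of the ranking; the variable names are illustrative.

```python
import pandas as pd

# A small subset of the scraped values (dollars per person per day).
data = {
    "country": ["Central African Republic", "Burundi", "Switzerland",
                "United States", "World"],
    "expense": [0.043835616438356165, 0.049315068493150684,
                26.947945205479453, 27.041095890410958, 2.8109589041095893],
}
df = pd.DataFrame(data, columns=["country", "expense"])

# "World" is an aggregate row rather than a country, so exclude it from the ranking.
countries = df[df["country"] != "World"]

highest = countries.loc[countries["expense"].idxmax(), "country"]
lowest = countries.loc[countries["expense"].idxmin(), "country"]
print("highest:", highest)
print("lowest:", lowest)
```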

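Visualization

Most of our time went into collecting and formatting the data; now it is time to plot. You can use matplotlib and seaborn to visualize the data. If appearance is not a big concern, the built-in DataFrame plotting method shows the results quickly:

df.plot(kind='bar', x='country', y='expense')
plt.show()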
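With 190 bars on one axis, the country labels are hard to read. One option is to plot only the top few spenders as horizontal bars. This is a sketch under stated assumptions, not the article's chart: the six rows are a subset of the values shown earlier, the cut-off of five and the output file name are arbitrary, and the Agg backend is selected only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Subset of the scraped values (dollars per person per day).
df = pd.DataFrame({
    "country": ["Denmark", "Sweden", "Luxembourg", "Norway",
                "Switzerland", "United States"],
    "expense": [15.24931506849315, 15.646575342465754, 17.18082191780822,
                20.487671232876714, 26.947945205479453, 27.041095890410958],
})

top = df.sort_values("expense", ascending=False).head(5)  # arbitrary cut-off

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(top["country"], top["expense"])
ax.invert_yaxis()  # put the largest value at the top
ax.set_xlabel("Health expenditure per capita (US$ per day, 2016)")
ax.set_title("Highest daily health spending")
fig.tight_layout()
fig.savefig("top_spenders.png")  # hypothetical output file name
```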

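Now the conclusion is in: in many countries, health spending is below one dollar per person per day. Most of these countries are in Asia and Africa, so it seems the World Health Organization should pay more attention to them.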
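The "below one dollar a day" claim can be checked with a simple filter. The sketch below runs on a four-entry sample of the values shown earlier, so the share it prints describes the sample only, not the full data set; `THRESHOLD` and the variable names are illustrative.

```python
# A handful of (country, dollars per person per day) pairs from the scraped data.
sample = [
    ("Central African Republic", 0.043835616438356165),
    ("Burundi", 0.049315068493150684),
    ("Norway", 20.487671232876714),
    ("United States", 27.041095890410958),
]

THRESHOLD = 1.0  # one US dollar per person per day

below = [name for name, per_day in sample if per_day < THRESHOLD]
share = len(below) / len(sample)

print("below threshold:", below)
print(f"share of sample: {share:.0%}")
```

Running the same filter over the full `expense` list gives the real count.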


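This is not necessarily a publication-worthy chart, but it is a good way to wrap up a small project.

The most effective way to learn technical skills is hands-on practice, and the learning process matters more than the final result. This project showed how to use three key data science skills:

·        Web scraping: retrieving data from the web

·        BeautifulSoup: parsing the data to extract information

·        Visualization: showing off all the hard work

More important than any technique is finding a project that interests you. A project does not have to change the world to be worthwhile, so look for interesting projects in everyday life.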


Compiled by: 刘奕琳, 高雪窈

Related link:

https://dzone.com/articles/data-visualization-of-healthcare-expenses-by-count
