Getting data out of PDFs with the pdftools package
source link: http://shujuren.org/article/818.html?amp%3Butm_medium=referral
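Data is often locked up inside PDFs, but thankfully there are several ways to extract it. One very nice package for this task is pdftools (Github link), and this post describes some of its basic functions.

First, let's find some PDFs that contain interesting data. For this post I use the diabetes country profiles compiled by the World Health Organization. You can find them here. If you open one of these PDFs, the part I am interested in is the table in the middle of the page, "Prevalence of diabetes and related risk factors".

I want to get this table for several countries, put the data into a tidy data frame and make a simple plot. Let's start by loading the required packages: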
library("pdftools")
library("glue")
library("tidyverse")
## ── Attaching packages ──────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks glue::collapse()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
library("ggthemes")

country <- c("lux", "fra", "deu", "usa", "prt", "gbr")

url <- "http://www.who.int/diabetes/country-profiles/{country}_en.pdf?ua=1"
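The four library() calls load the packages needed for this exercise: pdftools is the package I described at the beginning of the post, and glue offers a nicer alternative to paste() and paste0(). Look closely at the url: you will see that I wrote {country}. This is not part of the original link; the original looks like this (taking the USA as an example):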
"http://www.who.int/diabetes/country-profiles/usa_en.pdf?ua=1"
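Because I am interested in several countries, I created a vector with the codes of the countries I want. Now, using the glue() function, something magical happens: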
(urls <- glue(url))
## http://www.who.int/diabetes/country-profiles/lux_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/fra_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/deu_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/usa_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/prt_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/gbr_en.pdf?ua=1
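This statement creates a vector containing all the links, with {country} replaced in turn by each of the codes stored in the variable country.

I use the same approach to create the file names for the PDFs I am about to download: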
pdf_names <- glue("report_{country}.pdf")
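The vectorised interpolation that glue() performs can be checked without touching the network. The following is a minimal sketch; the codes and the example.org template are made up for illustration and are not part of the original post.

```r
library(glue)

# Hypothetical codes and URL template, for illustration only
code <- c("aaa", "bbb")
template <- "https://example.org/{code}_en.pdf"

# glue() evaluates {code} once per element, so each call returns
# one string per code
links <- glue(template)
files <- glue("report_{code}.pdf")

print(links)
print(files)
```

glue() looks up {code} in the environment it is called from, which is why the vector has to exist before the template is evaluated.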
Now I can download them:
walk2(urls, pdf_names, download.file, mode = "wb")
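To see what a call of this shape does without downloading anything, here is a small offline sketch. It substitutes writeLines() for download.file(); the file names and contents are hypothetical.

```r
library(purrr)

# Hypothetical paths and contents, written to a temporary directory
paths <- file.path(tempdir(), c("a.txt", "b.txt"))
contents <- c("first file", "second file")

# walk2() calls writeLines(contents[i], paths[i]) for each pair and
# returns its first argument invisibly
returned <- walk2(contents, paths, writeLines)

# map2() makes the same calls but collects the return values (NULL here)
collected <- map2(contents, paths, writeLines)

print(file.exists(paths))
print(identical(returned, contents))
print(collected)
```

Any further arguments, such as mode = "wb" in the download call, are passed unchanged to the function on every iteration.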
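walk2() is purrr's counterpart to map2(). You could use map2() here, but walk2() is cleaner, because download.file() is called for its side effect (writing a file to disk) rather than for its return value; map2() is meant for functions whose results you want to collect.

Now I can finally use the pdf_text() function from the pdftools package to extract the text from the PDFs: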
raw_text <- map(pdf_names, pdf_text)
raw_text is a list whose elements hold the text of each PDF. Let's take a look at its contents:
str(raw_text)
## List of 6
##  $ : chr "Luxembourg "| __truncated__
##  $ : chr "France "| __truncated__
##  $ : chr "Germany "| __truncated__
##  $ : chr "United States Of America "| __truncated__
##  $ : chr "Portugal "| __truncated__
##  $ : chr "United Kingdom "| __truncated__
Now look at one of these elements, which is nothing but one very long character string:
raw_text[[1]]
## [1] "Luxembourg Total population: 567 000\n Income group: High\nMortality\nNumber of diabetes deaths Number of deaths attributable to high blood glucose\n males females males females\nages 30–69 <100 <100 ages 30–69 <100 <100\nages 70+ <100 <100 ages 70+ <100 <100\nProportional mortality (% of total deaths, all ages) Trends in age-standardized prevalence of diabetes\n Communicable,\n maternal, perinatal Injuries 35%\n and nutritional 6% Cardiovascular\n conditions diseases\n 6% 33%\n 30%\n 25%\n % of population\n Other NCDs\n 16% 20%\n No data available 15% No data available\n Diabetes 10%\n 2%\n 5%\n Respiratory\n diseases\n 6% 0%\n Cancers\n 31%\n males females\nPrevalence of diabetes and related risk factors\n males females total\nDiabetes 8.3% 5.3% 6.8%\nOverweight 70.7% 51.5% 61.0%\nObesity 28.3% 21.3% 24.8%\nPhysical inactivity 28.2% 31.7% 30.0%\nNational response to diabetes\nPolicies, guidelines and monitoring\nOperational policy/strategy/action plan for diabetes ND\nOperational policy/strategy/action plan to reduce overweight and obesity ND\nOperational policy/strategy/action plan to reduce physical inactivity ND\nEvidence-based national diabetes guidelines/protocols/standards ND\nStandard criteria for referral of patients from primary care to higher level of care ND\nDiabetes registry ND\nRecent national risk factor survey in which blood glucose was measured ND\nAvailability of medicines, basic technologies and procedures in the public health sector\nMedicines in primary care facilities Basic technologies in primary care facilities\nInsulin ND Blood glucose measurement ND\nMetformin ND Oral glucose tolerance test ND\nSulphonylurea ND HbA1c test ND\nProcedures Dilated fundus examination ND\nRetinal photocoagulation ND Foot vibration perception by tuning fork ND\nRenal replacement therapy by dialysis ND Foot vascular status by Doppler ND\nRenal replacement therapy by transplantation ND Urine strips for glucose and ketone measurement ND\nND = country did 
not respond to country capacity survey\n〇 = not generally available ● = generally available\nWorld Health Organization – Diabetes country profiles, 2016.\n"
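As you can see, this is a very long character string containing many newline characters ("\n"). First, we need to split the string on "\n". Second, it may be hard to spot, but the table starts at the line containing "Prevalence of diabetes" and ends just before the line containing "National response to diabetes". We also need to extract the country name from the text and add it as a column. As you will see, a whole sequence of operations is needed, so I put everything that has to be done to raw_text inside a function.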
clean_table <- function(table){
  # Split the page into lines; simplify = TRUE returns a one-row matrix
  table <- str_split(table, "\n", simplify = TRUE)
  # The country name is whatever precedes " Total" on the first line
  country_name <- table[1, 1] %>%
    stringr::str_squish() %>%
    stringr::str_extract(".+?(?=\\sTotal)")
  # Keep only the lines between the two section headings
  table_start <- stringr::str_which(table, "Prevalence of diabetes")
  table_end <- stringr::str_which(table, "National response to diabetes")
  table <- table[1, (table_start + 1):(table_end - 1)]
  # Turn runs of whitespace into "|" and parse the result as delimited text
  table <- str_replace_all(table, "\\s{2,}", "|")
  text_con <- textConnection(table)
  data_table <- read.csv(text_con, sep = "|")
  colnames(data_table) <- c("Condition", "Males", "Females", "Total")
  dplyr::mutate(data_table, Country = country_name)
}
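The row-selection step inside clean_table() can be tried on a shortened page. This is only a sketch: the text below is a made-up miniature of a country profile, and it works on a plain character vector (taking the first element of the list that str_split() returns) rather than on the one-row matrix used in the function.

```r
library(stringr)

# Hypothetical, heavily shortened page text
page <- paste(
  "Testland   Total population: 1 000",
  "Prevalence of diabetes and related risk factors",
  "                males   females   total",
  "Diabetes        1.0%    2.0%      1.5%",
  "National response to diabetes",
  "Policies, guidelines and monitoring",
  sep = "\n"
)

# One element per line of the page
page_lines <- str_split(page, "\n")[[1]]

# Positions of the two headings that bracket the table
table_start <- str_which(page_lines, "Prevalence of diabetes")
table_end <- str_which(page_lines, "National response to diabetes")

# Everything strictly between the headings is the table itself
table_body <- page_lines[(table_start + 1):(table_end - 1)]
print(table_body)
```

If either heading were missing from a page, str_which() would return an empty vector and the indexing would fail, so a more defensive version would check both positions first.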
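I encourage you to go through each of these operations and understand what every step does. I will only describe some of them, though, such as this one: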
stringr::str_extract(".+?(?=\\sTotal)")
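This step can be run in isolation. Below is a minimal sketch that applies it to first lines taken from the raw text shown earlier; the amount of padding between the fields and the population figure in the second string are assumptions, since the exact values are not visible above.

```r
library(stringr)

# First lines of two pages; spacing and the second population figure are assumed
first_lines <- c(
  "Luxembourg                      Total population: 567 000",
  "United States Of America        Total population: 1 000"
)

# Collapse repeated whitespace, then lazily match up to " Total"
squished <- str_squish(first_lines)
country_names <- str_extract(squished, ".+?(?=\\sTotal)")
print(country_names)
```

Because the match is lazy and anchored by the lookahead, multi-word names are kept whole while the "Total population" part is left out.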
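It uses a rather esoteric regular expression: ".+?(?=\sTotal)" (the backslash is doubled inside an R string). It extracts everything that comes before a whitespace character followed by the string "Total": .+? matches as few characters as possible, and the lookahead (?=\sTotal) requires that " Total" comes next without including it in the match. This works because the first line, the one containing the country name, looks like this: "Luxembourg Total population: 567 000\n". So everything before the space and "Total" is the country name. Then there are these lines: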
table <- str_replace_all(table, "\\s{2,}", "|")
text_con <- textConnection(table)
data_table <- read.csv(text_con, sep = "|")
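These three lines can be exercised on a few rows copied from the Luxembourg text. This is a sketch under one assumption: the number of spaces between the columns is guessed, since only the collapsed text is visible above.

```r
library(stringr)

# Rows copied from the Luxembourg text; the column spacing is assumed
rows <- c(
  "                        males     females     total",
  "Diabetes                8.3%      5.3%        6.8%",
  "Physical inactivity     28.2%     31.7%       30.0%"
)

# Runs of two or more whitespace characters become the delimiter, so the
# single space inside "Physical inactivity" is left alone
piped <- str_replace_all(rows, "\\s{2,}", "|")
print(piped)

# Read the character vector as if it were a "|"-delimited file
text_con <- textConnection(piped)
parsed <- read.csv(text_con, sep = "|")
close(text_con)

colnames(parsed) <- c("Condition", "Males", "Females", "Total")
print(parsed)
```

The percentages are still text at this point (for example "8.3%"); converting them to numbers is a separate step.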
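The first line replaces every run of two or more whitespace characters ("\\s{2,}") with "|". I do this so that the table can be read back into R as a data frame, using "|" as the column separator. On the second line, I turn table into a text connection, so that read.csv() can read it as if it were a file. On the second-to-last line of the function I rename the columns, and on the last line I add a new "Country" column to the data frame.

Now I can map this function over the list of raw texts extracted from the PDFs: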
diabetes <- map_df(raw_text, clean_table) %>%
  gather(Sex, Share, Males, Females, Total) %>%
  mutate(Share = as.numeric(str_extract(Share, "\\d{1,}\\.\\d{1,}")))
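The reshaping and conversion steps of this pipeline can be followed on a tiny data frame. The sketch below builds its input by hand from two of the Luxembourg rows, so it stands in for the output of clean_table() rather than calling it.

```r
library(tidyr)
library(dplyr)
library(stringr)

# Hand-built stand-in for one country's cleaned table
wide <- data.frame(
  Condition = c("Diabetes", "Obesity"),
  Males = c("8.3%", "28.3%"),
  Females = c("5.3%", "21.3%"),
  Total = c("6.8%", "24.8%"),
  Country = "Luxembourg",
  stringsAsFactors = FALSE
)

long <- wide %>%
  # One row per Condition/Sex pair instead of one column per sex
  gather(Sex, Share, Males, Females, Total) %>%
  # Keep the digits around the decimal point and convert them to a number
  mutate(Share = as.numeric(str_extract(Share, "\\d{1,}\\.\\d{1,}")))

print(long)
```

Note that this pattern requires a decimal point, so a value written without one (such as "10%") would come out as NA; every share in these profiles has one decimal place.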
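I reshape the data with gather() (compare the data before and after the call to see what it does). I then convert the "Share" column to numeric: str_extract() pulls "12.3" out of "12.3%" and as.numeric() turns it into the number 12.3. After that the data can be plotted. First, a look at the data: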
diabetes
##              Condition                  Country     Sex Share
##  1            Diabetes               Luxembourg   Males   8.3
##  2          Overweight               Luxembourg   Males  70.7
##  3             Obesity               Luxembourg   Males  28.3
##  4 Physical inactivity               Luxembourg   Males  28.2
##  5            Diabetes                   France   Males   9.5
##  6          Overweight                   France   Males  69.9
##  7             Obesity                   France   Males  25.3
##  8 Physical inactivity                   France   Males  21.2
##  9            Diabetes                  Germany   Males   8.4
## 10          Overweight                  Germany   Males  67.0
## 11             Obesity                  Germany   Males  24.1
## 12 Physical inactivity                  Germany   Males  20.1
## 13            Diabetes United States Of America   Males   9.8
## 14          Overweight United States Of America   Males  74.1
## 15             Obesity United States Of America   Males  33.7
## 16 Physical inactivity United States Of America   Males  27.6
## 17            Diabetes                 Portugal   Males  10.7
## 18          Overweight                 Portugal   Males  65.0
## 19             Obesity                 Portugal   Males  21.4
## 20 Physical inactivity                 Portugal   Males  33.5
## 21            Diabetes           United Kingdom   Males   8.4
## 22          Overweight           United Kingdom   Males  71.1
## 23             Obesity           United Kingdom   Males  28.5
## 24 Physical inactivity           United Kingdom   Males  35.4
## 25            Diabetes               Luxembourg Females   5.3
## 26          Overweight               Luxembourg Females  51.5
## 27             Obesity               Luxembourg Females  21.3
## 28 Physical inactivity               Luxembourg Females  31.7
## 29            Diabetes                   France Females   6.6
## 30          Overweight                   France Females  58.6
## 31             Obesity                   France Females  26.1
## 32 Physical inactivity                   France Females  31.2
## 33            Diabetes                  Germany Females   6.4
## 34          Overweight                  Germany Females  52.7
## 35             Obesity                  Germany Females  21.4
## 36 Physical inactivity                  Germany Females  26.5
## 37            Diabetes United States Of America Females   8.3
## 38          Overweight United States Of America Females  65.3
## 39             Obesity United States Of America Females  36.3
## 40 Physical inactivity United States Of America Females  42.1
## 41            Diabetes                 Portugal Females   7.8
## 42          Overweight                 Portugal Females  55.0
## 43             Obesity                 Portugal Females  22.8
## 44 Physical inactivity                 Portugal Females  40.8
## 45            Diabetes           United Kingdom Females   6.9
## 46          Overweight           United Kingdom Females  62.4
## 47             Obesity           United Kingdom Females  31.1
## 48 Physical inactivity           United Kingdom Females  44.3
## 49            Diabetes               Luxembourg   Total   6.8
## 50          Overweight               Luxembourg   Total  61.0
## 51             Obesity               Luxembourg   Total  24.8
## 52 Physical inactivity               Luxembourg   Total  30.0
## 53            Diabetes                   France   Total   8.0
## 54          Overweight                   France   Total  64.1
## 55             Obesity                   France   Total  25.7
## 56 Physical inactivity                   France   Total  26.4
## 57            Diabetes                  Germany   Total   7.4
## 58          Overweight                  Germany   Total  59.7
## 59             Obesity                  Germany   Total  22.7
## 60 Physical inactivity                  Germany   Total  23.4
## 61            Diabetes United States Of America   Total   9.1
## 62          Overweight United States Of America   Total  69.6
## 63             Obesity United States Of America   Total  35.0
## 64 Physical inactivity United States Of America   Total  35.0
## 65            Diabetes                 Portugal   Total   9.2
## 66          Overweight                 Portugal   Total  59.8
## 67             Obesity                 Portugal   Total  22.1
## 68 Physical inactivity                 Portugal   Total  37.3
## 69            Diabetes           United Kingdom   Total   7.7
## 70          Overweight           United Kingdom   Total  66.7
## 71             Obesity           United Kingdom   Total  29.8
## 72 Physical inactivity           United Kingdom   Total  40.0
Now let's go for the plot:
ggplot(diabetes) +
  theme_fivethirtyeight() +
  scale_fill_hc() +
  geom_bar(aes(y = Share, x = Sex, fill = Country),
           stat = "identity", position = "dodge") +
  facet_wrap(~Condition)
That is all the work that goes into producing this plot!
Author: Bruno Rodrigues. Original article: https://www.brodrigues.co/blog/2018-06-10-scraping_pdfs/
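Copyright notice: the author reserves all rights. The article expresses the author's own views and does not represent the position of 数据人网 (shujuren.org). Modification is strictly prohibited; when reposting, please cite the original link: http://shujuren.org/article/818.html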