Getting data from pdfs using the pdftools package

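Data is often locked away inside pdfs, but thankfully there are ways to extract it from them. A very nice package for this task is pdftools (Github link), and this blog post describes some of its basic functions.

First, let's find some pdfs that contain interesting data. For this post, I'm using the diabetes country profiles compiled by the World Health Organization. You can find them here. If you open one of these pdfs, you will see this:

[Figure: screenshot of a WHO diabetes country profile]

I'm interested in the table in the middle of the page:

[Figure: screenshot of the table on prevalence of diabetes and related risk factors]

I want to get the data from this table for several countries, put it all into one tidy data frame and make a simple plot. Let's start by loading the needed packages: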

library("pdftools")
library("glue")
library("tidyverse")

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks glue::collapse()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()

library("ggthemes")
country <- c("lux", "fra", "deu", "usa", "prt", "gbr")
url <- "http://www.who.int/diabetes/country-profiles/{country}_en.pdf?ua=1"

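The four library() calls load the packages needed for this exercise: pdftools is the package I described at the beginning of the post, and glue is a nicer alternative to paste() and paste0(). Take a closer look at the url: you'll see that I wrote {country}. This is not in the original links; the original links look like this (for the USA, for example):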

"http://www.who.int/diabetes/country-profiles/usa_en.pdf?ua=1"

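Because I'm interested in several countries, I created a vector with the codes of the countries I want. Now, using the glue() function, something magical happens: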

(urls <- glue(url))

## http://www.who.int/diabetes/country-profiles/lux_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/fra_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/deu_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/usa_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/prt_en.pdf?ua=1
## http://www.who.int/diabetes/country-profiles/gbr_en.pdf?ua=1

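This created a vector with all the links, where {country} is replaced in turn by each of the codes in the variable country. I use the same approach to create the names of the pdfs that I will download: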

pdf_names <- glue("report_{country}.pdf")
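As a side note on why a single template is enough: glue() evaluates whatever sits inside the braces and is vectorised over it, so it returns one string per element. The sketch below is only an illustration and assumes made-up codes ("aaa", "bbb", "ccc") rather than the country codes used in this post; it also builds the same strings with paste0() for comparison.

```r
library(glue)

# Hypothetical codes, used only for this illustration
code <- c("aaa", "bbb", "ccc")

# glue() looks up `code` in the calling environment and returns
# one string per element of the vector
with_glue <- glue("report_{code}.pdf")

# The same strings built with paste0(), for comparison
with_paste <- paste0("report_", code, ".pdf")

length(with_glue)
all(as.character(with_glue) == with_paste)
```

The template version keeps the final shape of the string visible, which gets more useful as the number of pieces to paste together grows.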

Now I can download them:

walk2(urls, pdf_names, download.file, mode = "wb")
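For readers who want to try this pattern without network access, here is a minimal offline sketch. It assumes three made-up strings and temporary file paths (none of which come from the WHO data) and uses writeLines() in place of download.file(), since both are called for their side effect of creating a file.

```r
library(purrr)

# Hypothetical inputs: three short texts and three temporary file paths
contents <- c("first", "second", "third")
paths <- file.path(tempdir(), paste0("walk2_demo_", 1:3, ".txt"))

# writeLines(text, con) is called once per pair, purely for its side effect
walk2(contents, paths, writeLines)

# The files now exist and each holds the text it was paired with
file.exists(paths)
map_chr(paths, readLines)
```

Any extra arguments passed after the function, such as mode = "wb" in the call above, are forwarded unchanged to every call.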

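walk2() is a function from the purrr package that is similar to map2(). You could use map2() here, but walk2() is cleaner, because download.file() is a function with a side effect (it downloads files); map2() is meant for functions without side effects. Now, I can finally use the pdf_text() function from the pdftools package to extract the text from the pdfs: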

raw_text <- map(pdf_names, pdf_text)

raw_text is a list whose elements are the text extracted from the pdfs. Let's take a look at its structure:

str(raw_text)

## List of 6
##  $ : chr "Luxembourg                                                                                                     "| __truncated__
##  $ : chr "France                                                                                                         "| __truncated__
##  $ : chr "Germany                                                                                                        "| __truncated__
##  $ : chr "United States Of America                                                                                       "| __truncated__
##  $ : chr "Portugal                                                                                                       "| __truncated__
##  $ : chr "United Kingdom                                                                                                 "| __truncated__

Now let's look at one of these elements, which is nothing but a very long character string:

raw_text[[1]]

## [1] "Luxembourg                                                                                                                                          Total population: 567 000\n                                                                                                                                                         Income group: High\nMortality\nNumber of diabetes deaths                                                                     Number of deaths attributable to high blood glucose\n                                                                     males         females                                                            males       females\nages 30–69                                                           <100            <100     ages 30–69                                              <100          <100\nages 70+                                                             <100            <100     ages 70+                                                <100          <100\nProportional mortality (% of total deaths, all ages)                                          Trends in age-standardized prevalence of diabetes\n                    Communicable,\n                   maternal, perinatal              Injuries                                                    35%\n                    and nutritional                   6%                     Cardiovascular\n                      conditions                                               diseases\n                          6%                                                      33%\n                                                                                                                30%\n                                                                                                                25%\n                                                                                              % of population\n               Other NCDs\n                  16%                
                                                                           20%\n                                     No data available                                                          15%           No data available\n              Diabetes                                                                                          10%\n                 2%\n                                                                                                                5%\n                   Respiratory\n                    diseases\n                       6%                                                                                       0%\n                                                           Cancers\n                                                            31%\n                                                                                                                                  males     females\nPrevalence of diabetes and related risk factors\n                                                                                                                      males               females               total\nDiabetes                                                                                                              8.3%                 5.3%                 6.8%\nOverweight                                                                                                            70.7%               51.5%                61.0%\nObesity                                                                                                               28.3%               21.3%                24.8%\nPhysical inactivity                                                                                                   28.2%               31.7%                30.0%\nNational response to diabetes\nPolicies, guidelines and monitoring\nOperational policy/strategy/action plan for diabetes                                                             
                                   ND\nOperational policy/strategy/action plan to reduce overweight and obesity                                                                            ND\nOperational policy/strategy/action plan to reduce physical inactivity                                                                               ND\nEvidence-based national diabetes guidelines/protocols/standards                                                                                     ND\nStandard criteria for referral of patients from primary care to higher level of care                                                                ND\nDiabetes registry                                                                                                                                   ND\nRecent national risk factor survey in which blood glucose was measured                                                                              ND\nAvailability of medicines, basic technologies and procedures in the public health sector\nMedicines in primary care facilities                                                          Basic technologies in primary care facilities\nInsulin                                                                               ND      Blood glucose measurement                                             ND\nMetformin                                                                             ND      Oral glucose tolerance test                                           ND\nSulphonylurea                                                                         ND      HbA1c test                                                            ND\nProcedures                                                                                    Dilated fundus examination                                            ND\nRetinal photocoagulation                                                              ND      Foot vibration perception by tuning fork            
                  ND\nRenal replacement therapy by dialysis                                                 ND      Foot vascular status by Doppler                                       ND\nRenal replacement therapy by transplantation                                          ND      Urine strips for glucose and ketone measurement                       ND\nND = country did not respond to country capacity survey\n〇 = not generally available   ● = generally available\nWorld Health Organization – Diabetes country profiles, 2016.\n"

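As you can see, this is a very long character string with a lot of line breaks (the "\n" character). So first, we need to split this string on the "\n" character. Then, it might be difficult to see, but the table starts at the line containing the string "Prevalence of diabetes" and ends at the line containing "National response to diabetes". We also need to extract the name of the country from the text and add it as a column. As you can see, a whole lot of operations are needed, so I put all of the operations applied to raw_text inside a function: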

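clean_table <- function(table){
    # Split the page into lines (a 1-row matrix, one column per line)
    table <- str_split(table, "\n", simplify = TRUE)
    # The country name is everything on the first line before " Total"
    country_name <- table[1, 1] %>% 
        stringr::str_squish() %>% 
        stringr::str_extract(".+?(?=\\sTotal)")
    # Locate the lines that bracket the table
    table_start <- stringr::str_which(table, "Prevalence of diabetes")
    table_end <- stringr::str_which(table, "National response to diabetes")
    # Keep only the lines strictly between the two markers
    table <- table[1, (table_start + 1):(table_end - 1)]
    # Turn runs of whitespace into "|" so the lines can be parsed as delimited text
    table <- str_replace_all(table, "\\s{2,}", "|")
    text_con <- textConnection(table)
    data_table <- read.csv(text_con, sep = "|")
    colnames(data_table) <- c("Condition", "Males", "Females", "Total")
    dplyr::mutate(data_table, Country = country_name)
}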
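The splitting and slicing steps can be tried on a small hand-written string before running them on real pdf text. The sketch below assumes a made-up page ("Atlantis" and its numbers are invented, not WHO data) that imitates the layout returned by pdf_text(). Unlike the function above, it keeps the lines in a plain character vector instead of a one-row matrix, which is only a simplification for the illustration.

```r
library(stringr)

# Made-up text that imitates the layout pdf_text() returns (not real WHO data)
page <- paste(
    "Atlantis          Total population: 1 000",
    "Mortality",
    "Prevalence of diabetes and related risk factors",
    "                 males     females     total",
    "Diabetes         1.0%      2.0%        1.5%",
    "Obesity          3.0%      4.0%        3.5%",
    "National response to diabetes",
    sep = "\n"
)

# One element per line of text
lines <- str_split(page, "\n")[[1]]

# Country name: everything on the first line before " Total"
country_name <- str_extract(str_squish(lines[1]), ".+?(?=\\sTotal)")

# Positions of the lines that bracket the table
table_start <- str_which(lines, "Prevalence of diabetes")
table_end <- str_which(lines, "National response to diabetes")

# Keep only the lines strictly between the two markers
table_lines <- lines[(table_start + 1):(table_end - 1)]

country_name
table_lines
```

Here table_lines holds the header line and the two data rows, which is the input that the later parsing steps expect.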

I advise you to go through all of these operations and understand what each one does. However, I will only describe some of them, such as this one:

stringr::str_extract(".+?(?=\\sTotal)")

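This uses a rather exotic regular expression: ".+?(?=\\sTotal)". It extracts everything that comes before a whitespace character followed by the string "Total". This works because the first line, the one that contains the name of the country, looks like this: "Luxembourg Total population: 567 000\n". So everything before the space that precedes "Total" is the country name. Next come these lines: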

table <- str_replace_all(table, "\\s{2,}", "|")
text_con <- textConnection(table)
data_table <- read.csv(text_con, sep = "|")
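These lines can be tried in isolation on a few hand-written rows. The sketch below assumes made-up rows in the same fixed-width style (the numbers are invented, not WHO data) and passes the character vector to read.csv() through its text argument, which reads the value via a text connection, instead of creating the connection by hand.

```r
library(stringr)

# Made-up rows in the same fixed-width style (not real WHO data)
rows <- c(
    "                       males     females     total",
    "Diabetes               1.0%      2.0%        1.5%",
    "Physical inactivity    3.0%      4.0%        3.5%"
)

# Runs of two or more whitespace characters become a single "|";
# the single space inside "Physical inactivity" is left alone
delimited <- str_replace_all(rows, "\\s{2,}", "|")

# The first row starts with "|", so it yields an empty first header cell
# followed by the three column names
parsed <- read.csv(text = delimited, sep = "|")
colnames(parsed) <- c("Condition", "Males", "Females", "Total")
parsed
```

Requiring at least two whitespace characters is what keeps multi-word labels in one piece; a label separated from its first value by a single space would not be split correctly.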

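The first line replaces every run of two or more whitespace characters ("\\s{2,}") with "|". I do this so that the table can then be read back into R as a data frame, with "|" as the separator. On the second line, I turn table into a text connection, which read.csv() then reads on the third line. On the second to last line of the function, I change the column names, and on the last line I add a new column called "Country" to the data frame. Now I can apply this function to the list of raw text extracted from the pdfs: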

diabetes <- map_df(raw_text, clean_table) %>% 
    gather(Sex, Share, Males, Females, Total) %>% 
    mutate(Share = as.numeric(str_extract(Share, "\\d{1,}\\.\\d{1,}")))

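I use gather() to reshape the data from wide to long (compare the data before and after to see the difference). I then convert the "Share" column to numeric (this turns "12.3%" into 12.3), and the data is ready to be plotted. Let's first take a look at the data: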

diabetes

##              Condition                  Country     Sex Share
## 1             Diabetes               Luxembourg   Males   8.3
## 2           Overweight               Luxembourg   Males  70.7
## 3              Obesity               Luxembourg   Males  28.3
## 4  Physical inactivity               Luxembourg   Males  28.2
## 5             Diabetes                   France   Males   9.5
## 6           Overweight                   France   Males  69.9
## 7              Obesity                   France   Males  25.3
## 8  Physical inactivity                   France   Males  21.2
## 9             Diabetes                  Germany   Males   8.4
## 10          Overweight                  Germany   Males  67.0
## 11             Obesity                  Germany   Males  24.1
## 12 Physical inactivity                  Germany   Males  20.1
## 13            Diabetes United States Of America   Males   9.8
## 14          Overweight United States Of America   Males  74.1
## 15             Obesity United States Of America   Males  33.7
## 16 Physical inactivity United States Of America   Males  27.6
## 17            Diabetes                 Portugal   Males  10.7
## 18          Overweight                 Portugal   Males  65.0
## 19             Obesity                 Portugal   Males  21.4
## 20 Physical inactivity                 Portugal   Males  33.5
## 21            Diabetes           United Kingdom   Males   8.4
## 22          Overweight           United Kingdom   Males  71.1
## 23             Obesity           United Kingdom   Males  28.5
## 24 Physical inactivity           United Kingdom   Males  35.4
## 25            Diabetes               Luxembourg Females   5.3
## 26          Overweight               Luxembourg Females  51.5
## 27             Obesity               Luxembourg Females  21.3
## 28 Physical inactivity               Luxembourg Females  31.7
## 29            Diabetes                   France Females   6.6
## 30          Overweight                   France Females  58.6
## 31             Obesity                   France Females  26.1
## 32 Physical inactivity                   France Females  31.2
## 33            Diabetes                  Germany Females   6.4
## 34          Overweight                  Germany Females  52.7
## 35             Obesity                  Germany Females  21.4
## 36 Physical inactivity                  Germany Females  26.5
## 37            Diabetes United States Of America Females   8.3
## 38          Overweight United States Of America Females  65.3
## 39             Obesity United States Of America Females  36.3
## 40 Physical inactivity United States Of America Females  42.1
## 41            Diabetes                 Portugal Females   7.8
## 42          Overweight                 Portugal Females  55.0
## 43             Obesity                 Portugal Females  22.8
## 44 Physical inactivity                 Portugal Females  40.8
## 45            Diabetes           United Kingdom Females   6.9
## 46          Overweight           United Kingdom Females  62.4
## 47             Obesity           United Kingdom Females  31.1
## 48 Physical inactivity           United Kingdom Females  44.3
## 49            Diabetes               Luxembourg   Total   6.8
## 50          Overweight               Luxembourg   Total  61.0
## 51             Obesity               Luxembourg   Total  24.8
## 52 Physical inactivity               Luxembourg   Total  30.0
## 53            Diabetes                   France   Total   8.0
## 54          Overweight                   France   Total  64.1
## 55             Obesity                   France   Total  25.7
## 56 Physical inactivity                   France   Total  26.4
## 57            Diabetes                  Germany   Total   7.4
## 58          Overweight                  Germany   Total  59.7
## 59             Obesity                  Germany   Total  22.7
## 60 Physical inactivity                  Germany   Total  23.4
## 61            Diabetes United States Of America   Total   9.1
## 62          Overweight United States Of America   Total  69.6
## 63             Obesity United States Of America   Total  35.0
## 64 Physical inactivity United States Of America   Total  35.0
## 65            Diabetes                 Portugal   Total   9.2
## 66          Overweight                 Portugal   Total  59.8
## 67             Obesity                 Portugal   Total  22.1
## 68 Physical inactivity                 Portugal   Total  37.3
## 69            Diabetes           United Kingdom   Total   7.7
## 70          Overweight           United Kingdom   Total  66.7
## 71             Obesity           United Kingdom   Total  29.8
## 72 Physical inactivity           United Kingdom   Total  40.0
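The reshaping step shown earlier uses gather(), which newer versions of tidyr (1.0.0 and later) supersede with pivot_longer(). As a hedged sketch of the equivalent call, the example below assumes a small made-up wide table (not WHO data) and applies the same percentage-to-number conversion.

```r
library(tidyr)
library(dplyr)
library(stringr)

# Made-up wide table (not real WHO data)
wide <- data.frame(
    Condition = c("Diabetes", "Obesity"),
    Males = c("1.0%", "3.0%"),
    Females = c("2.0%", "4.0%"),
    Total = c("1.5%", "3.5%"),
    stringsAsFactors = FALSE
)

# One row per Condition/Sex pair; Share is parsed from "1.0%" to 1.0
long <- wide %>%
    pivot_longer(c(Males, Females, Total),
                 names_to = "Sex", values_to = "Share") %>%
    mutate(Share = as.numeric(str_extract(Share, "\\d{1,}\\.\\d{1,}")))

long
```

The rows come out grouped by Condition rather than by Sex, so the order differs from the gather() output above even though the content is the same. Note also that the regular expression requires a decimal point, so a value written without one would become NA.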

Now I can plot the data:

ggplot(diabetes) + theme_fivethirtyeight() + scale_fill_hc() +
    geom_bar(aes(y = Share, x = Sex, fill = Country), 
             stat = "identity", position = "dodge") +
    facet_wrap(~Condition)

[Figure: bar chart of Share by Sex, filled by Country, faceted by Condition]

That was quite a lot of work for one plot!

Author: Bruno Rodrigues. Original post: https://www.brodrigues.co/blog/2018-06-10-scraping_pdfs/
