
eCFR Parsing with BeautifulSoup and ElementTree

source link: https://www.tuicool.com/articles/fA3umm7

As a Deployment Strategist at Palantir I was given the task of parsing an eCFR document. I had never parsed an XML file before and started out using the ElementTree library before switching over to BeautifulSoup, which I had used once in my data science boot camp for HTML web scraping and felt more comfortable with. I had to prioritize speed of execution over learning because I had a real and tight deadline to meet, with plenty of subsequent steps to complete aside from parsing, so I stuck with what I knew, figuring I would face less of a learning curve.

As with any problem, I try to do a combination of narrowing it down to its simplest version and understanding it at its highest level. This process of down-selecting and generalizing allows for a comprehensive understanding of the problem, a sense of how to approach solving it, and faster iteration. So, here are some of the questions I asked myself:

  • What is a CFR? What is its structure/what does it look like irl?
  • How will this data be used once it is parsed? What would an end-user/SME expect to see?

My first and second questions were helpful because once I understood the structure and what an end-user might expect to see, I was able to narrow down my data collection. CFRs typically have titles, subtitles, chapters, subchapters, parts, subparts, subject groups, sections, and appendixes. Without understanding the structure or knowing which data would be used, I could have wasted a ton of time collecting data that wasn't useful. In my case we were only interested in chapters, subchapters, parts, and sections. I will add that it is always good practice to try to understand the data you're working with anyway. The additional context can help you make smarter decisions and save time, instead of diving straight into storing, manipulating, cleansing, and analyzing data you know nothing about.

After successfully completing the task, I thought about the code I wrote and the bugs I ran into along the way, and I knew the next time I had to parse XML I would not want to make the same mistakes. So I decided not only to revisit my code, but to teach myself how to solve the problem again using ElementTree. I then figured it would be great to share that code so that anyone else needing to do this could benefit from my logic. Below is all of my fully commented code that I used to parse the XML of eCFR Title 16, Commercial Practices, using either library.

Parsing with BeautifulSoup

[code screenshot: imports, request, and BeautifulSoup setup]

In the code snippet above I imported all of the libraries I'd need — Pandas to create a DataFrame, requests to send HTTP requests, re to work with regular expressions, and BeautifulSoup for parsing. I then sent a request to the URL to retrieve its data and stored the response content in the variable xml_bs. I then passed xml_bs through bs, which returned a BeautifulSoup object representing the XML document as a nested data structure.
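Since the original snippet was a screenshot, here is a minimal sketch of that setup. The govinfo bulk-data URL, the `bs` alias, and the sample markup are my assumptions; an inline sample stands in for the live response so the sketch runs offline:

```python
import re

import pandas as pd
from bs4 import BeautifulSoup as bs

# In the original, requests fetched the document and the raw bytes were
# stored in xml_bs (URL shown is an assumption):
#   import requests
#   url = 'https://www.govinfo.gov/bulkdata/ECFR/title-16/ECFR-title16.xml'
#   xml_bs = requests.get(url).content

# Inline sample standing in for the live response content.
xml_bs = b'''<DIV1 N="16" TYPE="TITLE">
  <DIV3 N="I" TYPE="CHAPTER"><HEAD>CHAPTER I</HEAD></DIV3>
</DIV1>'''

# Parse the bytes into a BeautifulSoup object (a nested, searchable tree).
# The 'xml' parser (backed by lxml) preserves the upper-case tag names.
soup = bs(xml_bs, 'xml')
```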

[code screenshot: nested loops collecting chapter, subchapter, part, and section data]

Although the code is fully commented, I'll provide a bit more context on what's going on in this next snippet. I created an empty list to which I iteratively appended all of the data I retrieved. I then found all DIV3 elements using .find_all() and stored them in the variable chapters_bs. From there I began looping through my soup object looking for attributes or tags associated with the DIV3 elements and saved them to their respective variables. I continued on like this, nesting my for loops and referencing each parent object from the level above, which allowed me to preserve the hierarchical structure of my XML document. Note that I appended all of my data at the section level.
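The nested loops described above might look like this sketch. The DIV4/DIV5/DIV8 tags and the N attribute are assumptions based on the eCFR schema (the article only confirms DIV3 = chapter), and the inline sample replaces the live document:

```python
from bs4 import BeautifulSoup as bs

# Inline sample mirroring the eCFR hierarchy (tag names are assumptions).
xml_bs = b'''<DIV1 N="16" TYPE="TITLE">
  <DIV3 N="I" TYPE="CHAPTER">
    <DIV4 N="A" TYPE="SUBCHAP">
      <DIV5 N="1" TYPE="PART">
        <DIV8 N="1.1" TYPE="SECTION"><HEAD>Sec. 1.1</HEAD><P>Scope.</P></DIV8>
      </DIV5>
    </DIV4>
  </DIV3>
</DIV1>'''
soup = bs(xml_bs, 'xml')

rows = []  # one dict per section, appended at the innermost loop
chapters_bs = soup.find_all('DIV3')
for chapter in chapters_bs:
    for subchapter in chapter.find_all('DIV4'):
        for part in subchapter.find_all('DIV5'):
            for section in part.find_all('DIV8'):
                # each row carries its ancestors' attributes,
                # preserving the hierarchical structure
                rows.append({
                    'chapter': chapter.get('N'),
                    'subchapter': subchapter.get('N'),
                    'part': part.get('N'),
                    'section': section.get('N'),
                    'section_text': section.get_text(' ', strip=True),
                })
```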

[code screenshot: DataFrame creation and text cleanup]

The next few steps above are pretty straightforward — I saved my data to a Pandas DataFrame, wrote a for loop to strip any leading or trailing white space, wrote a regular expression to remove leftover tags and non-printing ASCII characters from my section_text column, and displayed the first five rows of my data.
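A sketch of those cleanup steps, since the original was a screenshot; the sample row, column names, and exact regexes are my assumptions:

```python
import re

import pandas as pd

# Sample rows as produced by the loops above (padding and tag residue
# added for illustration).
rows = [{'chapter': ' I ', 'part': ' 1 ',
         'section_text': '<E T="03">Sec. 1.1</E>  Scope.\x0c'}]

df = pd.DataFrame(rows)

# Strip leading/trailing whitespace from every column.
for col in df.columns:
    df[col] = df[col].str.strip()

# Remove leftover tags, then anything outside printable ASCII.
df['section_text'] = df['section_text'].str.replace(r'<[^>]+>', '', regex=True)
df['section_text'] = df['section_text'].str.replace(r'[^\x20-\x7E]', '', regex=True)

df.head()
```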

Parsing with ElementTree

[code screenshot: imports, request, and ElementTree setup]

Again, in the code snippet above I imported all of the libraries I'd need — all the same as before, except ElementTree replaced BeautifulSoup for parsing. I sent a request to the URL to retrieve its data and stored it in the variable xml_et. I then passed xml_et through fromstring(), which parses the XML from my response content into an Element, the root element of the parsed tree.
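Reconstructed from the description above (the original was a screenshot), a minimal sketch; the URL is an assumption and an inline sample stands in for the live response:

```python
import xml.etree.ElementTree as ET

# In the original, the raw bytes came from requests (URL is an assumption):
#   import requests
#   url = 'https://www.govinfo.gov/bulkdata/ECFR/title-16/ECFR-title16.xml'
#   xml_et = requests.get(url).content
xml_et = b'<DIV1 N="16" TYPE="TITLE"><DIV3 N="I" TYPE="CHAPTER"/></DIV1>'

# fromstring() parses the content and returns the root Element of the tree.
root = ET.fromstring(xml_et)
```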

[code screenshot: nested loops over the parsed tree using .iter()]

Most of the code here remains unchanged, but I'll walk through everything again in case you skipped ahead. I created an empty list to which I iteratively appended all of the data I retrieved. I then found all DIV3 elements using .iter() and stored them in the variable chapters_et. From there I began looping through my root object looking for attributes or tags associated with the DIV3 elements and saved them to their respective variables. I continued on like this, nesting my for loops and referencing each parent object from the level above, which allowed me to preserve the hierarchical structure of my XML document. Note that I appended all of my data at the section level.
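The ElementTree version of the loops might look like this sketch. As before, the DIV4/DIV5/DIV8 tags and the N attribute are assumptions based on the eCFR schema, and an inline sample replaces the live document:

```python
import xml.etree.ElementTree as ET

# Inline sample mirroring the eCFR hierarchy (tag names are assumptions).
xml_et = b'''<DIV1 N="16" TYPE="TITLE">
  <DIV3 N="I" TYPE="CHAPTER">
    <DIV4 N="A" TYPE="SUBCHAP">
      <DIV5 N="1" TYPE="PART">
        <DIV8 N="1.1" TYPE="SECTION"><HEAD>Sec. 1.1</HEAD><P>Scope.</P></DIV8>
      </DIV5>
    </DIV4>
  </DIV3>
</DIV1>'''
root = ET.fromstring(xml_et)

rows = []
chapters_et = root.iter('DIV3')  # .iter() walks the subtree for matching tags
for chapter in chapters_et:
    for subchapter in chapter.iter('DIV4'):
        for part in subchapter.iter('DIV5'):
            for section in part.iter('DIV8'):
                rows.append({
                    'chapter': chapter.get('N'),
                    'subchapter': subchapter.get('N'),
                    'part': part.get('N'),
                    'section': section.get('N'),
                    # itertext() gathers all text nested under the section
                    'section_text': ' '.join(
                        t.strip() for t in section.itertext() if t.strip()),
                })
```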

[code screenshot: DataFrame creation and text cleanup, as before]

Again, the next few steps above are straightforward — I saved my data to a Pandas DataFrame, wrote a for loop to strip any leading or trailing white space, wrote a regular expression to remove leftover tags and non-printing ASCII characters from my section_text column, and displayed the first five rows of my data.

As always, thanks for reading.

