Resurrecting a Dead Library: Part Two - Stabilization
source link: https://mtlynch.io/resurrecting-2/
In this post, I demonstrate how to retrofit automated tests onto an untested legacy library.
This is part two of a three-part series about how I resurrected ingredient-phrase-tagger, a library that uses machine learning to parse cooking ingredients (e.g., “2 cups milk”) into structured data. Read part one for the full context, but the short version is that I discovered an abandoned library and brought it back to life so that it could power my SaaS business:
- Part One: Resuscitation - In which I nurse the code back to health so that it runs on any modern system
- Part Two: Stabilization (this post) - In which I prevent functionality from regressing while I restore the code
- Part Three: Rehabilitation - In which I begin refactoring the code
Running it in continuous integration
At the end of part one, I created a Docker image that allowed the library to run on any system. The next step was to run the library in continuous integration.
Continuous integration is the practice of using an independent, controlled environment to test software on each change to the code. My preferred continuous integration solution is Travis. Their configuration files are intuitive, and they offer unlimited free builds for open-source projects.
To integrate with Travis, I added my fork of ingredient-phrase-tagger on Travis' configuration page and then enabled builds:
Enabling Travis builds for ingredient-phrase-tagger library
Then, I created a file called .travis.yml, which told Travis how to build the library:
sudo: required
services: docker
script: docker build .
I pushed my commit to GitHub, created a pull request, and Travis built it successfully:
First successful build on Travis
Adding an end-to-end test
Travis was building my Docker image, but the build wasn’t meaningful yet. It only built the library’s dependencies — it didn’t exercise any of its behavior. I wanted a build that could alert me when I broke the library’s functionality. To do that, I needed an end-to-end test.
An end-to-end test verifies that a complete, real-world scenario works as expected. It generally matches the following structure:
- Supply pre-generated input and its expected output (also known as the “golden output”).
- Use automation tools to feed the input to the library.
- Compare the library’s output to the golden output.
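The three steps above can be sketched in a few lines of Python. This is only an illustration of the pattern, not code from the library: parse_stub is a hypothetical stand-in for whatever is under test, and compare_to_golden is a helper name invented here.

```python
import difflib


def compare_to_golden(actual, golden):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            golden.splitlines(),
            actual.splitlines(),
            fromfile="golden",
            tofile="actual",
            lineterm="",
        )
    )


def parse_stub(phrase):
    # Hypothetical stand-in for the library under test.
    return phrase.upper()


# Step 1: pre-generated input and its golden output.
test_input = "2 cups milk"
golden_output = "2 CUPS MILK"

# Step 2: feed the input to the code under test.
actual_output = parse_stub(test_input)

# Step 3: compare against the golden output.
diff_lines = compare_to_golden(actual_output, golden_output)
print("PASS" if not diff_lines else "\n".join(diff_lines))
```

An empty diff means the behavior is unchanged. Any non-empty diff shows exactly which lines drifted from the golden copy.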
The original repository contained a script called roundtrip.sh that resembled an end-to-end test. It provided pre-generated input to the library, used a portion of the input to train a new machine learning model, then used that model to parse other portions of the input. The only piece missing was that it never compared results to a known-good output.
A basic end-to-end test
In part one, I showed that the roundtrip.sh script’s final result was a set of summary statistics about the model’s performance:
Sentence-Level Stats:
correct: 1487
total: 1999
% correct: 74.3871935968
Word-Level Stats:
correct: 10391
total: 11450
% correct: 90.7510917031
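Each percentage in that summary is just correct divided by total. As a quick sanity check, the sketch below recomputes both figures from the counts shown. The percent_correct helper is a name invented here, and formatting to 12 significant digits is an assumption chosen to match the precision of the printed values.

```python
def percent_correct(correct, total):
    """Share of correct items, expressed as a percentage."""
    return 100.0 * correct / total


sentence_pct = percent_correct(1487, 1999)
word_pct = percent_correct(10391, 11450)

# Format to 12 significant digits, the precision shown in the summary.
print("sentence-level: %.12g" % sentence_pct)
print("word-level:     %.12g" % word_pct)
```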
That gave me enough for a simple end-to-end test. I re-ran the final step of the roundtrip.sh script, but redirected the console output to a new file called tests/golden/eval_output and added it to source control:
python bin/evaluate.py tmp/test_output > tests/golden/eval_output
Now that I had known-good output, I modified the end of roundtrip.sh so that it would compare all future outputs against this saved output:
python bin/evaluate.py tmp/test_output > tmp/eval_output
diff tests/golden/eval_output tmp/eval_output
Does my test know when code breaks?
An end-to-end test is only useful if it catches bugs, so my next step was to simulate a breaking change and check if my end-to-end test caught it.
In cli.py, there was a regular expression that matched sequences of digits (e.g., "83625"):
m3 = re.match('^\d+$', ss)
As an experiment, I tweaked the regular expression so that it would fail to recognize any number that included a 9:
m3 = re.match('^[0-8]+$', ss)
I then re-ran my modified roundtrip.sh script:
3c3
< correct: 1487
---
> correct: 1486
5c5
< % correct: 74.3871935968
---
> % correct: 74.3371685843
It worked!
When I told the code that 9 was no longer considered a number, the library’s accuracy fell, and the script terminated with a failing exit code.
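To see the effect of that one-character-class change in isolation, here is a small standalone comparison of the two patterns. The sample tokens are made up for illustration, and the constant names are not from cli.py.

```python
import re

ORIGINAL = re.compile(r"^\d+$")      # any run of digits
MODIFIED = re.compile(r"^[0-8]+$")   # digits, but a 9 anywhere breaks the match

for token in ["83625", "1999", "9", "milk"]:
    print(
        "%-6s original=%-5s modified=%s"
        % (token, bool(ORIGINAL.match(token)), bool(MODIFIED.match(token)))
    )
```

Tokens such as "1999" still look like numbers to the original pattern but not to the modified one, which is the kind of subtle behavior change an end-to-end test should surface.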
Want to parse ingredients without all this work?
I went through all of these steps, so you don’t have to. Check out Zestful, my managed service for ingredient parsing.
Expanding the end-to-end test
The basic end-to-end test above was useful, but roundtrip.sh executes a data pipeline with several stages. It would be convenient to know which particular stage broke, so I looked for more outputs to include in the end-to-end test.
In addition to printing output to the console, the script also wrote files to a subdirectory called tmp/:
$ file tmp/*
tmp/model_file: data
tmp/output.html: HTML document, ASCII text, with very long lines
tmp/test_file: ASCII text
tmp/test_output: ASCII text
tmp/train_file: ASCII text
test_file, test_output, and train_file were all plaintext files that looked a bit like this:
$ head -n 16 tmp/test_file
1 I1 L12 NoCAP NoPAREN B-QTY
boneless I2 L12 NoCAP NoPAREN I-COMMENT
pork I3 L12 NoCAP NoPAREN B-NAME
tenderloin I4 L12 NoCAP NoPAREN I-NAME
, I5 L12 NoCAP NoPAREN B-COMMENT
about I6 L12 NoCAP NoPAREN I-COMMENT
1 I7 L12 NoCAP NoPAREN B-QTY
pound I8 L12 NoCAP NoPAREN I-COMMENT

Salt I1 L8 YesCAP NoPAREN B-NAME
and I2 L8 NoCAP NoPAREN I-NAME
freshly I3 L8 NoCAP NoPAREN B-COMMENT
ground I4 L8 NoCAP NoPAREN I-COMMENT
black I5 L8 NoCAP NoPAREN B-NAME
pepper I6 L8 NoCAP NoPAREN I-NAME
I didn’t understand the file format yet, but I didn’t have to. All I needed was a way to detect when the files changed.
After copying these files to tests/golden, I saved them to source control as additional golden outputs. Then, I added diff commands to my build script to detect when these output files changed.
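Detecting a change does not require understanding the file format, only a byte-for-byte comparison. As a rough Python equivalent of running diff on each file, the sketch below uses the standard library's filecmp module. The changed_files helper, the temporary directories, and the sample file contents are all invented for this illustration.

```python
import filecmp
import tempfile
from pathlib import Path


def changed_files(golden_dir, actual_dir, names):
    """Return the names whose contents differ or that could not be compared."""
    _match, mismatch, errors = filecmp.cmpfiles(
        golden_dir, actual_dir, names, shallow=False
    )
    return sorted(mismatch + errors)


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp, "golden")
    actual = Path(tmp, "actual")
    golden.mkdir()
    actual.mkdir()

    names = ["test_file", "train_file"]
    for name in names:
        (golden / name).write_text("pork I3 L12 NoCAP NoPAREN B-NAME\n")
        (actual / name).write_text("pork I3 L12 NoCAP NoPAREN B-NAME\n")

    # Simulate a regression in a single pipeline stage.
    (actual / "train_file").write_text("pork I3 L12 NoCAP NoPAREN I-NAME\n")

    result = changed_files(golden, actual, names)

print("changed:", result)
```

Because each output corresponds to one pipeline stage, the name of the first file that differs points at the stage that broke.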
The complete build script
After all my modifications to roundtrip.sh, I saved it as a new file called build.sh, which looked like this:
#!/bin/bash

# Exit build script on first failure.
set -e
# Echo commands to stdout.
set -x

COUNT_TRAIN=20000
COUNT_TEST=2000

OUTPUT_DIR=$(mktemp -d)
ACTUAL_CRF_TRAINING_FILE="${OUTPUT_DIR}/training_data.crf"
ACTUAL_CRF_TESTING_FILE="${OUTPUT_DIR}/testing_data.crf"
ACTUAL_CRF_MODEL_FILE="${OUTPUT_DIR}/model.crfmodel"
ACTUAL_TESTING_OUTPUT_FILE="${OUTPUT_DIR}/testing_output"
ACTUAL_EVAL_OUTPUT_FILE="${OUTPUT_DIR}/eval_output"

# Generate training data from the first COUNT_TRAIN rows.
bin/generate_data \
  --data-path=nyt-ingredients-snapshot-2015.csv \
  --count=$COUNT_TRAIN \
  --offset=0 > "$ACTUAL_CRF_TRAINING_FILE"

# Generate testing data from the next COUNT_TEST rows.
bin/generate_data \
  --data-path=nyt-ingredients-snapshot-2015.csv \
  --count=$COUNT_TEST \
  --offset=$COUNT_TRAIN > "$ACTUAL_CRF_TESTING_FILE"

# Train a model, then use it to label the testing data.
crf_learn \
  template_file "$ACTUAL_CRF_TRAINING_FILE" "$ACTUAL_CRF_MODEL_FILE"
crf_test \
  -m "$ACTUAL_CRF_MODEL_FILE" \
  "$ACTUAL_CRF_TESTING_FILE" > "$ACTUAL_TESTING_OUTPUT_FILE"

python bin/evaluate.py "$ACTUAL_TESTING_OUTPUT_FILE" > "$ACTUAL_EVAL_OUTPUT_FILE"

# Check against golden output.
GOLDEN_DIR=tests/golden
GOLDEN_CRF_TRAINING_FILE="${GOLDEN_DIR}/training_data.crf"
GOLDEN_CRF_TESTING_FILE="${GOLDEN_DIR}/testing_data.crf"
GOLDEN_TESTING_OUTPUT_FILE="${GOLDEN_DIR}/testing_output"
GOLDEN_EVAL_OUTPUT_FILE="${GOLDEN_DIR}/eval_output"

diff --context=2 "$GOLDEN_CRF_TRAINING_FILE" "$ACTUAL_CRF_TRAINING_FILE"
diff --context=2 "$GOLDEN_CRF_TESTING_FILE" "$ACTUAL_CRF_TESTING_FILE"
diff --context=2 "$GOLDEN_TESTING_OUTPUT_FILE" "$ACTUAL_TESTING_OUTPUT_FILE"
diff "$GOLDEN_EVAL_OUTPUT_FILE" "$ACTUAL_EVAL_OUTPUT_FILE"
I then added a simple wrapper around that script called docker_build that ran the end-to-end test within the library’s custom Docker container:
#!/bin/bash

# Exit on first failing command.
set -e
# Echo commands to console.
set -x

IMAGE_NAME="ingredient-phrase-tagger-image"
CONTAINER_NAME="ingredient-phrase-tagger-container"

docker build \
  --tag "$IMAGE_NAME" \
  .
docker run \
  --tty \
  --detach \
  --name "$CONTAINER_NAME" \
  "$IMAGE_NAME"
docker exec "$CONTAINER_NAME" ./build.sh
With the docker_build script, my end-to-end test could run on any system that supported Docker. Naturally, I wanted to run it in my continuous integration environment.
Running my end-to-end tests in continuous integration
My earlier Travis configuration built the Docker image but didn’t exercise the library. Now that I had a thorough test script, I updated my .travis.yml file to run it:
sudo: required
services: docker
-script: docker build .
+script: ./docker_build
I pushed my changes, ready to witness the splendor of my brilliant test that could run consistently anywhere. Instead, it failed:
End-to-end test fails on Travis after passing on my local machine
I wasn’t happy to see a build break, but I was glad that my end-to-end test caught something. I just had to figure out what it was.
Debugging the discrepancy
The whole point of a Docker container is that the program should behave the same anywhere, so how could I run the same container in two places and see different outputs?
The Travis build log showed that the test failed on the diff of the testing_output file:
+ diff --context=2 tests/golden/testing_output /tmp/tmp.W5S3C5T4if/testing_output
*** tests/golden/testing_output Fri Jul 27 02:44:20 2018
--- /tmp/tmp.W5S3C5T4if/testing_output Fri Jul 27 03:03:56 2018
***************
*** 173,178 ****
1 I1 L8 NoCAP NoPAREN B-QTY B-QTY
tablespoon I2 L8 NoCAP NoPAREN B-UNIT B-UNIT
! dark I3 L8 NoCAP NoPAREN B-COMMENT B-COMMENT
! corn I4 L8 NoCAP NoPAREN B-NAME B-NAME
syrup I5 L8 NoCAP NoPAREN I-NAME I-NAME
--- 173,178 ----
1 I1 L8 NoCAP NoPAREN B-QTY B-QTY
tablespoon I2 L8 NoCAP NoPAREN B-UNIT B-UNIT
! dark I3 L8 NoCAP NoPAREN B-COMMENT B-NAME
! corn I4 L8 NoCAP NoPAREN B-NAME I-NAME
syrup I5 L8 NoCAP NoPAREN I-NAME I-NAME
The testing_output file was the result of these two commands in my build.sh script:
crf_learn \
template_file "$ACTUAL_CRF_TRAINING_FILE" "$ACTUAL_CRF_MODEL_FILE"
crf_test \
-m "$ACTUAL_CRF_MODEL_FILE" \
"$ACTUAL_CRF_TESTING_FILE" > "$ACTUAL_TESTING_OUTPUT_FILE"
crf_learn and crf_test were both command-line utilities for CRF++, the engine that powered ingredient-phrase-tagger’s machine learning logic. Without knowing much about these utilities, I could deduce from the syntax that crf_learn created a machine learning model and crf_test used that model to classify data.
The end-to-end test had verified that the contents of $ACTUAL_CRF_TRAINING_FILE and $ACTUAL_CRF_TESTING_FILE matched my golden versions. This meant that crf_learn and crf_test received identical inputs on my local system and in continuous integration, yet produced different outputs depending on the environment.
A deeper dive into CRF++
Was CRF++ non-deterministic? I tried running the test again locally. It passed. I re-ran the test on Travis, and it failed in the same way. This told me that CRF++ was consistent across executions in the same environment, but was inconsistent across environments.
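The same check can be automated: run a step several times, fingerprint each result, and see whether the fingerprints agree. The sketch below is a generic illustration, not something from the original build. The fingerprint and is_repeatable helpers and the fake_pipeline stand-in are hypothetical names.

```python
import hashlib


def fingerprint(data):
    """Short SHA-256 digest that is easy to compare across build logs."""
    return hashlib.sha256(data).hexdigest()[:12]


def is_repeatable(produce, runs=3):
    """True if every run of produce() yields byte-identical output."""
    return len({fingerprint(produce()) for _ in range(runs)}) == 1


def fake_pipeline():
    # Hypothetical stand-in for a pipeline stage that returns its output bytes.
    return b"syrup I5 L8 NoCAP NoPAREN I-NAME I-NAME\n"


print("fingerprint:", fingerprint(fake_pipeline()))
print("repeatable: ", is_repeatable(fake_pipeline))
```

Printing the fingerprint in both the local run and the CI log separates the two questions: repeated runs in one place test for non-determinism, while comparing the logs tests for differences between environments.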
I didn’t like where this was pointing. It suggested that CRF++’s behavior depended on the system’s underlying hardware. Maybe an Intel CPU yielded different results than an AMD CPU. That would be a pain because Travis doesn’t guarantee anything about its hardware environment. Furthermore, if different hardware yielded different results, that would defeat the purpose of a Docker container.
In desperation, I checked CRF++’s command-line documentation to look for anything that might hint about hardware dependencies:
$ crf_learn --help
...
-p, --thread=INT number of threads (default auto-detect)
The --thread flag looked interesting. I checked the full documentation for more details:
-p NUM:
If the PC has multiple CPUs, you can make the training faster by using multi-threading. NUM is the number of threads.
This sounded promising.
I compared the CRF++ output on Travis to the same output lines in my local environment:
crf_learn runs with two threads on Travis, but eight in my local environment
Ah ha!
Because I omitted the --thread flag, CRF++ set it automatically based on the number of CPU cores available. My Travis environment had two CPU cores, while my local machine had eight.
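The CRF++ documentation quoted above does not explain why the thread count would change the resulting model. One plausible explanation, offered here purely as an assumption and not something confirmed for CRF++, is that splitting work across threads changes the order in which floating-point values are added, and floating-point addition is not associative. The toy example below shows only that general effect.

```python
# Assumption for illustration: two different ways of grouping the same
# additions, as might happen when work is split across a different number
# of threads. This is NOT taken from CRF++'s implementation.
values = [0.1, 0.2, 0.3]

left_to_right = (values[0] + values[1]) + values[2]
regrouped = values[0] + (values[1] + values[2])

print(left_to_right, regrouped)
print("identical:", left_to_right == regrouped)
```

The two results differ only in the last bits, but a tiny numerical difference during training can be enough to tip a borderline prediction from one label to another.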
I tweaked my build.sh script to set the thread count explicitly:
-crf_learn template_file "$ACTUAL_CRF_TRAINING_FILE" "$ACTUAL_CRF_MODEL_FILE"
+crf_learn \
+ --thread=2 \
+ template_file "$ACTUAL_CRF_TRAINING_FILE" "$ACTUAL_CRF_MODEL_FILE"
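The general lesson is to pin any setting that a tool would otherwise derive from the host machine. As a small sketch of the same idea from Python, the helper below builds the crf_learn argument list with an explicit thread count. The crf_learn_command function is a hypothetical wrapper invented here, and the file names are placeholders modeled on the build script.

```python
import os


def crf_learn_command(template, training_file, model_file, threads=2):
    """Build a crf_learn argument list with the thread count pinned."""
    return [
        "crf_learn",
        "--thread=%d" % threads,
        template,
        training_file,
        model_file,
    ]


cmd = crf_learn_command("template_file", "training_data.crf", "model.crfmodel")
print(" ".join(cmd))

# The value auto-detection would have to work from varies by machine.
print("cores visible on this machine:", os.cpu_count())
```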
Then, I saved the newly generated output files as my golden copies. I pushed the changes to GitHub and was greeted with a pleasant sight as my end-to-end tests passed:
End-to-end test passing on Travis
The value of good tests
The end-to-end test proved its value very quickly. While it was tedious to dive into the documentation for one of the library’s dependencies, the test exposed that the library produced inconsistent results depending on its environment. This is something the library’s original authors likely never realized.
With the end-to-end test in place and continuous integration running, I had an authoritative environment that demonstrated the library’s expected functionality. The test provided a valuable safeguard in case I made any changes that unintentionally changed the library’s behavior.
What’s next?
With the confidence from my test, it was time for my favorite part of a software project: refactoring. I was free to make large-scale changes to the code because I knew the build would break loudly if I did anything too stupid.
Read on for part three of this series, where I describe how I:
- added unit tests
- applied style conventions to the code automatically
- integrated static analysis into the build
Cover illustration by Loraine Yow. My fork of the ingredient-phrase-tagger library is available on GitHub. I offer a managed service based on this library called Zestful.