How to Use Tesseract OCR to Convert PDFs to Text

This is a cross-post from my blog, Arcadian.Cloud. Go there to see the original post.

I have some PDFs that I need to turn into editable text. I decided to go with Tesseract OCR, as it seems to be the best tool for the job. Here are the steps to convert PDFs to text with Tesseract OCR.

Installation

First things first: get the Tesseract CLI installed. Follow the instructions here, which are linked to from the official Tesseract docs. The commands below are for Ubuntu.

sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get update
sudo apt install tesseract-ocr tesseract-ocr-eng


Note: the package didn’t properly place the eng.traineddata file for me. If you get an error about this, refer to the troubleshooting steps at the bottom of this article.
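A quick way to confirm the install worked is to ask Tesseract which languages it can see. The sketch below assumes that tesseract --list-langs prints one language code per line after a header line; has_lang is a hypothetical helper name of my own, not part of Tesseract.

```shell
# has_lang CODE: reads a language listing on stdin and succeeds
# if CODE appears as a complete line.
# (has_lang is a hypothetical helper, not a Tesseract command.)
has_lang() {
  grep -qxF -- "$1"
}

# Usage (2>&1 in case your version prints the list to stderr):
#   tesseract --list-langs 2>&1 | has_lang eng && echo "eng is available"
```

If the check fails for eng, the Missing Language Training Data section under Troubleshooting applies.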

Usage

In the CLI, cd into the directory with the images or PDFs you want to convert.

Remember, Tesseract cannot read PDFs directly, so we first convert the PDF to a .tiff file and then run Tesseract on the .tiff to get the text.

# Convert the PDF to a .tiff file; replace the file names at the end of this command with your own
# Note: if you get an error about a security policy, check the troubleshooting section below
convert -fill white -draw 'rectangle 10,10 20,20' -background white +matte -density 300 Loring-Lombard-Autobiogrphy-Pages1-10.pdf Loring-Lombard-Autobiogrphy-Pages1-10.tiff

# Tesseract appends .txt to the output name given as the second argument
tesseract Loring-Lombard-Autobiogrphy-Pages1-10.tiff Loring-Lombard-Autobiogrphy-Pages1-10


The conversion step printed some errors for me, but I was able to safely ignore them. Once the PDF to .tiff conversion finished, I ran the tesseract command to create the text file.


You should now have a text file. It really is that easy to use Tesseract OCR to convert PDFs to text files.
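If you have several PDFs, the two commands from this section can be wrapped in a small function. This is only a sketch: pdf_to_text is a hypothetical name, it reuses the convert options shown earlier in this article unchanged, and it assumes every input file name ends in .pdf.

```shell
# pdf_to_text FILE.pdf: hypothetical wrapper around the two steps above.
# Produces FILE.tiff and then FILE.txt next to the input.
pdf_to_text() {
  pdf=$1
  base=${pdf%.pdf}   # strip the trailing .pdf to get the base name
  # Step 1: PDF -> TIFF (same options as the command in this article)
  convert -fill white -draw 'rectangle 10,10 20,20' \
    -background white +matte -density 300 "$pdf" "$base.tiff" &&
    # Step 2: TIFF -> text; Tesseract appends .txt to "$base" itself
    tesseract "$base.tiff" "$base"
}

# Usage: convert every PDF in the current directory
#   for f in *.pdf; do pdf_to_text "$f"; done
```

Quoting the variables keeps file names with spaces intact, and the && skips the OCR step if the conversion fails.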

Troubleshooting

Missing Language Training Data

If you see something like the error message below, it means the English training data is not installed.

Error opening data file /usr/share/tesseract-ocr/5/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.


Simply install the tesseract-ocr-eng package with the command below:

sudo apt install tesseract-ocr-eng


If this doesn’t fix it then check out this GitHub issue for more troubleshooting steps.
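Before digging further, it can help to confirm whether the file named in the error actually exists. The sketch below is a plain file test; has_traineddata is a hypothetical helper name, and the directory in the usage line is simply the one from the error message above, so yours may differ.

```shell
# has_traineddata DIR LANG: succeeds if DIR/LANG.traineddata is a regular file.
# (has_traineddata is a hypothetical helper, not a Tesseract command.)
has_traineddata() {
  [ -f "$1/$2.traineddata" ]
}

# Usage, with the directory taken from the error message in this article:
#   has_traineddata /usr/share/tesseract-ocr/5/tessdata eng \
#     || echo "eng.traineddata is missing from that directory"
```

If the file exists somewhere else on your system, the error message itself suggests pointing the TESSDATA_PREFIX environment variable at that tessdata directory.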

Convert Tool Security Policy Error

convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/421.
convert-im6.q16: no images defined `converted-pdf.tiff' @ error/convert.c/ConvertImageCommand/3229.


To fix the above error you need to edit or get rid of the ImageMagick security policy. The simplest solution is to temporarily rename the security policy file, but this may be dangerous if you forget to put it back. Instead, I recommend editing the policy file and removing only the offending PDF rule.

sudo sed -i 's/^.*policy.*coder.*none.*PDF.*//' /etc/ImageMagick-6/policy.xml


Check out this Stack Overflow post for more details on working around this error.
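If you would rather keep a copy of the original policy file, here is a slightly more cautious variant of the sed edit. It is a sketch, not the command from this article: unblock_pdf is a hypothetical name, it uses the same pattern as the sed command above but deletes the matching line rather than blanking it, and it relies on sed's -i.bak form to save a backup beside the file.

```shell
# unblock_pdf FILE: delete the policy line that blocks the PDF coder,
# saving the original as FILE.bak first.
# (unblock_pdf is a hypothetical helper name.)
unblock_pdf() {
  sed -i.bak '/policy.*coder.*none.*PDF/d' "$1"
}

# Usage, with the path from this article (needs root):
#   sudo sh -c "$(command -v sed) -i.bak '/policy.*coder.*none.*PDF/d' /etc/ImageMagick-6/policy.xml"
# To undo, copy policy.xml.bak back over policy.xml.
```

The function form is mainly handy for trying the edit on a copy of policy.xml before touching the real one.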
