Whitelisting and Blacklisting Characters with Tesseract and Python

In our previous tutorial, you learned how to OCR only digits from an input image. But what if you wanted more fine-grained control over the character filtering process?

For example, when building an invoicing application, you may want to extract not only digits and letters but also special characters, such as dollar signs, decimal separators (i.e., periods), and commas. To obtain more fine-grained control, we can apply whitelisting and blacklisting, which is exactly the topic of this tutorial.

Learning Objectives

Inside this tutorial, you will learn:

The differences between whitelists and blacklists
How whitelists and blacklists can be used for OCR problems
How to apply whitelists and blacklists using Tesseract

To learn how to whitelist and blacklist while performing OCR, just keep reading.


Whitelisting and Blacklisting Characters for OCR

In the first part of this tutorial, we’ll discuss the differences between whitelists and blacklists, two common character filtering techniques when applying OCR with Tesseract. From there, we’ll review our project and implement a Python script that can be used for whitelist/blacklist filtering. We’ll then check the results of our character filtering work.

What Are Whitelists and Blacklists?

As an example of how whitelists and blacklists work, let’s consider a system administrator working for Google. Google is the most popular website globally — nearly everyone on the internet uses Google — but with its popularity comes nefarious users who may try to attack it, bring down its servers, or compromise user data. A system administrator will need to blacklist IP addresses acting nefariously while allowing all other valid incoming traffic.

Now, let’s suppose that this same system administrator needs to configure a development server for Google’s own internal use and testing. This system admin will need to block all incoming IP addresses except for the whitelisted IP addresses of Google’s developers.

The concept of whitelisting and blacklisting characters for OCR purposes is the same. A whitelist specifies the only characters that the OCR engine is allowed to recognize: if a character is not on the whitelist, it cannot be included in the output OCR results.

The opposite of a whitelist is a blacklist. A blacklist specifies the characters that, under no circumstances, can be included in the output.

In the rest of this tutorial, you will learn how to apply whitelisting and blacklisting with Tesseract.
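To make the two ideas concrete before bringing in Tesseract, here is a minimal pure-Python sketch of whitelist and blacklist filtering applied to a string. The filter_characters helper is a hypothetical name used only for illustration. It filters text after the fact, whereas Tesseract applies the lists during recognition, so the real engine may substitute a different character rather than simply dropping one.

```python
def filter_characters(text, whitelist="", blacklist=""):
    """Illustrative only: keep whitelisted characters (when a whitelist
    is given) and drop blacklisted ones. Whitespace is always kept."""
    allowed = set(whitelist)
    blocked = set(blacklist)
    kept = []
    for ch in text:
        if ch.isspace():
            # keep spaces and newlines so the layout survives
            kept.append(ch)
        elif ch in blocked:
            # blacklisted characters are never emitted
            continue
        elif allowed and ch not in allowed:
            # a non-empty whitelist rejects everything not listed
            continue
        else:
            kept.append(ch)
    return "".join(kept)


# blacklist example: drop the stray symbols
print(filter_characters("ZIW*4681", blacklist="*#"))  # ZIW4681

# whitelist example: keep only digits, periods, and dashes
print(filter_characters("Due Date 2020-05-08", whitelist="0123456789.-").strip())
```

An empty whitelist is treated as "no restriction", which mirrors how the script later in this tutorial only passes a whitelist to Tesseract when one was supplied.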

Project Structure

Let’s get started by reviewing our directory structure for this tutorial:

|-- invoice.png
|-- pa_license_plate.png
|-- whitelist_blacklist.py

This tutorial will implement the whitelist_blacklist.py Python script and use two images — an invoice and a license plate — for testing. Let’s dive into the code.

Configuring your development environment

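To follow this guide, you need to have the OpenCV library installed on your system. The script also imports pytesseract, a Python wrapper that calls the Tesseract OCR engine, so both of those need to be installed as well.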

Luckily, OpenCV is pip-installable:

$ pip install opencv-contrib-python

If you need help configuring your development environment for OpenCV, I highly recommend that you read my pip install OpenCV guide — it will have you up and running in a matter of minutes.


Whitelisting and Blacklisting Characters with Tesseract

We’re now going to learn how to whitelist and blacklist characters with the Tesseract OCR engine. Open the whitelist_blacklist.py file in your project directory structure and insert the following code:

# import the necessary packages
import pytesseract
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image to be OCR'd")
ap.add_argument("-w", "--whitelist", type=str, default="",
	help="list of characters to whitelist")
ap.add_argument("-b", "--blacklist", type=str, default="",
	help="list of characters to blacklist")
args = vars(ap.parse_args())

There’s nothing fancy happening with our imports: yet again, we’re using PyTesseract and OpenCV. The whitelisting and blacklisting functionality is built into Tesseract itself, and PyTesseract exposes it via string-based configuration options.

Our script accepts an input --image path. Additionally, it accepts two optional command line arguments to drive our whitelisting and blacklisting functionality directly from our terminal:

--whitelist: A string containing the only characters that are allowed to pass through to the results
--blacklist: Characters that must never be included in the results

Both the --whitelist and --blacklist arguments have default values of empty strings so that we can use one, both, or neither as part of our Tesseract OCR configuration.
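As a quick check of how those defaults behave, the sketch below rebuilds the same parser and feeds it an explicit argument list instead of reading from the terminal. The file name and whitelist value are just sample inputs.

```python
import argparse

# same three arguments as the tutorial script
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
                help="path to input image to be OCR'd")
ap.add_argument("-w", "--whitelist", type=str, default="",
                help="list of characters to whitelist")
ap.add_argument("-b", "--blacklist", type=str, default="",
                help="list of characters to blacklist")

# pass an explicit list so the example runs without a terminal
args = vars(ap.parse_args(
    ["--image", "invoice.png", "--whitelist", "0123456789.-"]))

# --blacklist was not supplied, so it falls back to the empty string
print(repr(args["whitelist"]))
print(repr(args["blacklist"]))
```

Because an omitted argument is an empty string rather than None, the script can test its length directly when deciding whether to add an option.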

Next, let’s load our image and build our Tesseract OCR options:

# load the input image, swap channel ordering, and initialize our
# Tesseract OCR options as an empty string
image = cv2.imread(args["image"])
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
options = ""

# check to see if a set of whitelist characters has been provided,
# and if so, update our options string
if len(args["whitelist"]) > 0:
	options += "-c tessedit_char_whitelist={} ".format(
		args["whitelist"])

# check to see if a set of blacklist characters has been provided,
# and if so, update our options string
if len(args["blacklist"]) > 0:
	options += "-c tessedit_char_blacklist={}".format(
		args["blacklist"])

Lines 18 and 19 load our --image and convert it from OpenCV’s BGR channel ordering to RGB. Our options variable is first initialized as an empty string (Line 20).

From there, if the --whitelist command line argument contains at least one character, those characters are appended to -c tessedit_char_whitelist= as part of our options (Lines 24-26), restricting the OCR output to them.

Similarly, if we are blacklisting any characters via the --blacklist argument, -c tessedit_char_blacklist= is appended to the options, followed by the characters that must never show up in our results (Lines 30-32).

Again, our options string could consist of one, both, or neither whitelist/blacklist characters.
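The option-building logic can also be pulled into a small helper, which makes the "one, both, or neither" behavior easy to verify without running OCR. This is a minimal sketch: build_tesseract_config is a hypothetical name, not part of PyTesseract, and it only assembles the same two -c settings used in the script.

```python
def build_tesseract_config(whitelist="", blacklist=""):
    """Return a Tesseract config string for the given character lists.

    Illustrative helper: an empty list contributes nothing, so the
    result may contain one, both, or neither of the two options.
    """
    parts = []
    if whitelist:
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    if blacklist:
        parts.append("-c tessedit_char_blacklist={}".format(blacklist))
    # join with a single space so there is no trailing whitespace
    return " ".join(parts)


print(repr(build_tesseract_config()))
print(repr(build_tesseract_config(whitelist="0123456789.-")))
print(repr(build_tesseract_config(whitelist="123456789.-", blacklist="0")))
```

The returned string can be passed as the config argument of image_to_string in the same way as the options variable. Character lists containing spaces or quotes would likely need extra care, which this sketch does not attempt.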

And finally, our call to PyTesseract’s image_to_string performs OCR:

# OCR the input image using Tesseract
text = pytesseract.image_to_string(rgb, config=options)
print(text)

The only parameter that is new in our call to image_to_string is the config parameter (Line 35). Notice how we pass the Tesseract options that we have concatenated. The result of whitelisting and blacklisting OCR characters is printed out via the script’s final line.

Whitelisting and Blacklisting with Tesseract Results

We are now ready to apply whitelisting and blacklisting with Tesseract. Open a terminal and execute the following command:

$ python whitelist_blacklist.py --image pa_license_plate.png
PENNSYLVANIA

ZIW*4681

visitPA.com

As the terminal output demonstrates, we have a Pennsylvania license plate (Figure 2) and everything has been correctly OCR’d except for the asterisk (*) in between the license plate numbers — this special symbol has been incorrectly OCR’d. Using a bit of domain knowledge, we know that license plates cannot contain a * as a character, so a simple solution to the problem is to blacklist the *:

$ python whitelist_blacklist.py --image pa_license_plate.png \
    --blacklist "*#"
PENNSYLVANIA

ZIW4681

visitPA.com

**Figure 2.** A license plate graphic for the state of Pennsylvania is used for testing whitelisting and blacklisting with Tesseract.

As indicated by the --blacklist command line argument, we have blacklisted two characters:

The * from above
The # symbol as well (once you blacklist the *, Tesseract will attempt to mark the special symbol as a #, hence we blacklist both)

By using a blacklist, our OCR results are now correct!

Let’s try another example, this one of an invoice, including the invoice number, issue date, and due date:

$ python whitelist_blacklist.py --image invoice.png
Invoice Number 1785439
Issue Date 2020-04-08
Due Date 2020-05-08

| DUE | $210.07

Tesseract has correctly OCR’d all fields of the invoice in Figure 3. Interestingly, it even interpreted the edges of the bottom boxes as vertical bars (|). That could be useful for multi-column data, but in this case it is simply an unintended coincidence.

**Figure 3.** A generic invoice to test OCR using whitelisting and blacklisting.

Let’s now suppose we want to keep only the price information (i.e., digits and periods), along with the invoice number and dates (digits and dashes). Note that the dollar sign is not part of the whitelist below, so it will be dropped from the output:

$ python whitelist_blacklist.py --image invoice.png \
    --whitelist "0123456789.-"
1785439
2020-04-08
2020-05-08

210.07

The results are just as we expect! We have now successfully used whitelists to extract the invoice number, issue date, due date, and price information while discarding the rest.
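Once the output has been reduced to digits, periods, and dashes, turning it into structured fields takes only a few lines. The sketch below is one possible approach, not part of the tutorial script: parse_invoice_fields is a hypothetical helper, and it assumes dates look like YYYY-MM-DD and the amount has exactly two decimal places, as in the output above.

```python
import re
from datetime import date
from decimal import Decimal

# the whitelisted OCR output shown above, copied in as a string
ocr_text = """1785439
2020-04-08
2020-05-08

210.07
"""


def parse_invoice_fields(text):
    """Classify each non-empty line by its shape (illustrative only)."""
    fields = {"invoice_number": None, "dates": [], "amount": None}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", line):
            # ISO-style date such as 2020-04-08
            fields["dates"].append(date.fromisoformat(line))
        elif re.fullmatch(r"\d+\.\d{2}", line):
            # monetary amount; Decimal avoids float rounding
            fields["amount"] = Decimal(line)
        elif re.fullmatch(r"\d+", line):
            # a bare run of digits is taken to be the invoice number
            fields["invoice_number"] = line
    return fields


fields = parse_invoice_fields(ocr_text)
print(fields["invoice_number"], fields["dates"], fields["amount"])
```

Real invoices vary far more than this, so the patterns here would need adjusting for any other layout.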

We can also combine whitelists and blacklists if needed:

$ python whitelist_blacklist.py --image invoice.png \
	--whitelist "123456789.-" --blacklist "0"
1785439
22-4-8
22-5-8

21.7

Here, we are whitelisting the digits 1-9, periods, and dashes, while at the same time blacklisting the digit 0. As our output shows, we still have the invoice number, issue date, due date, and price, but with every occurrence of 0 removed, since 0 is both blacklisted and absent from the whitelist.

When you have a priori knowledge of the images or document structure you’ll be OCR’ing, using whitelists and blacklists is a simple yet effective means of improving your output OCR results. They should be your first stop when attempting to improve OCR accuracy on your projects.


Summary

In this tutorial, you learned how to apply whitelist and blacklist character filtering using the Tesseract OCR engine.

A whitelist specifies the only characters that the OCR engine is allowed to recognize: if a character is not on the whitelist, it cannot be included in the output OCR results. The opposite of a whitelist is a blacklist. A blacklist specifies characters that under no circumstances can be included in the output.

Using whitelisting and blacklisting is a simple yet powerful technique that you can use in OCR applications. For whitelisting and blacklisting to work, you need a document or image with a reliable pattern or structure. For example, if you were building a basic receipt scanning software, you could write a whitelist that only allows digits, decimal points, commas, and dollar signs.

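If you were building an automatic license plate recognition (ALPR) system, you might notice that Tesseract gets “confused” and outputs special characters that are not in the image. Blacklisting those characters, as we did with * and # in the license plate example, removes them from the results.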

In our next tutorial, we’ll continue to build on our Tesseract OCR knowledge, this time turning our attention to detecting and correcting text orientation — an important pre-processing step in improving OCR accuracy.
