In our previous tutorial, you learned how to OCR only digits from an input image. But what if you wanted to obtain more fine-grained control on the character filtering process?
For example, when building an invoicing application, you may want to extract not only digits and letters but also special characters, such as dollar signs, decimal separators (i.e., periods), and commas. To obtain more fine-grained control, we can apply whitelisting and blacklisting, which is exactly the topic of this tutorial.
Learning Objectives
Inside this tutorial, you will learn:
- The differences between whitelists and blacklists
- How whitelists and blacklists can be used for OCR problems
- How to apply whitelists and blacklists using Tesseract
To learn how to whitelist and blacklist characters while performing OCR, keep reading.
Whitelisting and Blacklisting Characters for OCR
In the first part of this tutorial, we’ll discuss the differences between whitelists and blacklists, two common character filtering techniques when applying OCR with Tesseract. From there, we’ll review our project and implement a Python script that can be used for whitelist/blacklist filtering. We’ll then check the results of our character filtering work.
What Are Whitelists and Blacklists?
As an example of how whitelists and blacklists work, let’s consider a system administrator working for Google. Google is one of the most visited websites in the world, and with that popularity come nefarious users who may try to attack it, bring down its servers, or compromise user data. A system administrator will need to blacklist IP addresses acting nefariously while allowing all other valid incoming traffic.
Now, let’s suppose that this same system administrator needs to configure a development server for Google’s own internal use and testing. This system admin will need to block all incoming IP addresses except for the whitelisted IP addresses of Google’s developers.
The concept of whitelisting and blacklisting characters for OCR purposes is the same. A whitelist specifies the only characters that the OCR engine is allowed to recognize: if a character is not on the whitelist, it cannot be included in the output OCR results.
The opposite of a whitelist is a blacklist. A blacklist specifies the characters that, under no circumstances, can be included in the output.
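To make the two definitions concrete, here is a minimal pure-Python sketch that applies the same rules to a string that has already been recognized. The helper names keep_whitelisted and drop_blacklisted are hypothetical, and the sample string is made up for illustration. This post-hoc filtering only illustrates the semantics; Tesseract applies its lists during recognition, so its behavior on real images can differ.

```python
def keep_whitelisted(text, whitelist):
    """Keep only the characters that appear in the whitelist."""
    allowed = set(whitelist)
    return "".join(ch for ch in text if ch in allowed)


def drop_blacklisted(text, blacklist):
    """Remove every character that appears in the blacklist."""
    banned = set(blacklist)
    return "".join(ch for ch in text if ch not in banned)


# hypothetical sample text, not taken from a real OCR run
sample = "Total: $210.07"

# whitelist: everything except digits and periods is discarded
digits_only = keep_whitelisted(sample, "0123456789.")

# blacklist: only the listed symbols are discarded
no_symbols = drop_blacklisted(sample, "$:")

print(digits_only)
print(no_symbols)
```

Note the asymmetry: the whitelist has to name everything you want to keep, while the blacklist only names what you want gone.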
In the rest of this tutorial, you will learn how to apply whitelisting and blacklisting with Tesseract.
Project Structure
Let’s get started by reviewing our directory structure for this tutorial:
|-- invoice.png
|-- pa_license_plate.png
|-- whitelist_blacklist.py
This tutorial will implement the whitelist_blacklist.py Python script and use two images (an invoice and a license plate) for testing. Let’s dive into the code.
Configuring your development environment
To follow this guide, you need to have the OpenCV library installed on your system. The script also imports pytesseract, which requires the Tesseract OCR engine to be installed.
Luckily, OpenCV is pip-installable:
$ pip install opencv-contrib-python
If you need help configuring your development environment for OpenCV, I highly recommend that you read my pip install OpenCV guide — it will have you up and running in a matter of minutes.
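Before running the script, it can help to confirm that everything it depends on is visible from your Python environment. The sketch below is one hedged way to do that using only the standard library. The function name check_ocr_environment is hypothetical, and the check assumes the Tesseract executable is named tesseract and is on your PATH, which may not hold for every installation.

```python
import importlib.util
import shutil


def check_ocr_environment():
    """Report which pieces of the OCR toolchain can be found."""
    return {
        # Python packages imported by whitelist_blacklist.py
        "cv2": importlib.util.find_spec("cv2") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # assumption: the engine's executable is called "tesseract"
        # and lives somewhere on the PATH
        "tesseract binary": shutil.which("tesseract") is not None,
    }


for name, found in check_ocr_environment().items():
    print("{}: {}".format(name, "found" if found else "missing"))
```

Anything reported as missing needs to be installed before the commands later in this tutorial will run.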
Whitelisting and Blacklisting Characters with Tesseract
We’re now going to learn how to whitelist and blacklist characters with the Tesseract OCR engine. Open the whitelist_blacklist.py file in your project directory structure and insert the following code:
# import the necessary packages
import pytesseract
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
ap.add_argument("-w", "--whitelist", type=str, default="",
    help="list of characters to whitelist")
ap.add_argument("-b", "--blacklist", type=str, default="",
    help="list of characters to blacklist")
args = vars(ap.parse_args())
There’s nothing fancy happening with our imports. Yet again, we’re using PyTesseract and OpenCV. The whitelisting and blacklisting functionality is built into Tesseract itself and is reached from PyTesseract via string-based configuration options.
Our script accepts an input --image path. Additionally, it accepts two optional command line arguments to drive our whitelisting and blacklisting functionality directly from our terminal:
- --whitelist: A string of the only characters that can pass through to the results
- --blacklist: Characters that must never be included in the results
Both the --whitelist and --blacklist arguments have default values of empty strings so that we can use one, both, or neither as part of our Tesseract OCR configuration.
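To see those defaults in action without touching an image, the same parser can be fed an explicit argument list. This is a small sketch: build_parser is a hypothetical wrapper around the argument definitions from the script, and the argument lists are examples.

```python
import argparse


def build_parser():
    """Recreate the argument parser from whitelist_blacklist.py."""
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True,
        help="path to input image to be OCR'd")
    ap.add_argument("-w", "--whitelist", type=str, default="",
        help="list of characters to whitelist")
    ap.add_argument("-b", "--blacklist", type=str, default="",
        help="list of characters to blacklist")
    return ap


# only --image supplied: both filters fall back to empty strings
args = vars(build_parser().parse_args(["--image", "invoice.png"]))
print(args)

# whitelist supplied, blacklist left at its default
args_w = vars(build_parser().parse_args(
    ["--image", "invoice.png", "--whitelist", "0123456789.-"]))
print(args_w)
```

Because an empty string has length zero, a filter that was not supplied is simply skipped when the options string is built.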
Next, let’s load our image and build our Tesseract OCR options:
# load the input image, swap channel ordering, and initialize our
# Tesseract OCR options as an empty string
image = cv2.imread(args["image"])
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
options = ""

# check to see if a set of whitelist characters has been provided,
# and if so, update our options string
if len(args["whitelist"]) > 0:
    options += "-c tessedit_char_whitelist={} ".format(
        args["whitelist"])

# check to see if a set of blacklist characters has been provided,
# and if so, update our options string
if len(args["blacklist"]) > 0:
    options += "-c tessedit_char_blacklist={}".format(
        args["blacklist"])
Lines 18 and 19 load our --image and convert it from OpenCV’s BGR channel ordering to RGB. Our options variable is first initialized as an empty string (Line 20).
From there, if the --whitelist command line argument contains at least one character, those characters are appended to -c tessedit_char_whitelist= as part of our options (Lines 24-26).
Similarly, if we are blacklisting any characters via the --blacklist argument, the options are appended with -c tessedit_char_blacklist= followed by any characters that under no circumstances will show up in our results (Lines 30-32).
Again, our options string could contain a whitelist, a blacklist, both, or neither.
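The option-building logic can also be factored into a small function, which makes the four possible outcomes easy to inspect. This is a sketch rather than the tutorial’s code: build_tesseract_options is a hypothetical name, and it joins the pieces with a single space instead of appending a trailing space after the whitelist.

```python
def build_tesseract_options(whitelist="", blacklist=""):
    """Return a Tesseract config string for the given character lists."""
    parts = []

    # only add an option when its list is non-empty
    if whitelist:
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    if blacklist:
        parts.append("-c tessedit_char_blacklist={}".format(blacklist))

    # a single space between options, no trailing whitespace
    return " ".join(parts)


# neither, whitelist only, blacklist only, and both
print(repr(build_tesseract_options()))
print(repr(build_tesseract_options(whitelist="0123456789.-")))
print(repr(build_tesseract_options(blacklist="*#")))
print(repr(build_tesseract_options("123456789.-", "0")))
```

The returned string is what would be passed as the config argument in the OCR call.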
And finally, our call to PyTesseract’s image_to_string performs OCR:
# OCR the input image using Tesseract
text = pytesseract.image_to_string(rgb, config=options)
print(text)
The only parameter that is new in our call to image_to_string is the config parameter (Line 35). Notice how we pass the Tesseract options that we have concatenated. The result of whitelisting and blacklisting OCR characters is printed out via the script’s final line.
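One detail the script leaves implicit: cv2.imread returns None instead of raising an error when a file cannot be read, so a mistyped --image path only surfaces later as an error inside cv2.cvtColor. The sketch below shows one way to fail earlier with a clearer message. The load_rgb helper is hypothetical and not part of the tutorial’s script, and OpenCV is imported inside the function so the path check works on its own.

```python
import os


def load_rgb(path):
    """Load an image from disk and return it in RGB channel order."""
    # fail early with a clear message if the path is wrong
    if not os.path.isfile(path):
        raise FileNotFoundError("no such image: {}".format(path))

    # imported lazily so the path check above does not need OpenCV
    import cv2

    image = cv2.imread(path)

    # cv2.imread signals an unreadable file by returning None
    if image is None:
        raise ValueError("could not decode image: {}".format(path))

    # OpenCV loads images as BGR; swap to RGB as the script does
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
```

In the script, the two loading lines could then be replaced by a single call such as rgb = load_rgb(args["image"]).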
Whitelisting and Blacklisting with Tesseract Results
We are now ready to apply whitelisting and blacklisting with Tesseract. Open a terminal and execute the following command:
$ python whitelist_blacklist.py --image pa_license_plate.png
PENNSYLVANIA
ZIW*4681
visitPA.com
As the terminal output demonstrates, we have a Pennsylvania license plate (Figure 2), and everything has been correctly OCR’d except for the asterisk (*) in between the license plate numbers; this special symbol has been incorrectly OCR’d. Using a bit of domain knowledge, we know that license plates cannot contain a * as a character, so a simple solution to the problem is to blacklist the *:
$ python whitelist_blacklist.py --image pa_license_plate.png \
    --blacklist "*#"
PENNSYLVANIA
ZIW4681
visitPA.com
As indicated by the --blacklist command line argument, we have blacklisted two characters:
- The * from above
- The # symbol as well (once you blacklist the *, Tesseract will attempt to mark the special symbol as a #, hence we blacklist both)
By using a blacklist, our OCR results are now correct!
Let’s try another example, this one of an invoice, including the invoice number, issue date, and due date:
$ python whitelist_blacklist.py --image invoice.png
Invoice Number 1785439
Issue Date 2020-04-08
Due Date 2020-05-08
| DUE | $210.07
Tesseract has correctly OCR’d all fields of the invoice in Figure 3. Interestingly, it even interpreted the edges of the bottom boxes as vertical bars (|), which could be useful for multi-column data but is simply an unintended coincidence in this case.
Let’s now suppose we want to keep only the price information (i.e., digits and periods), along with the invoice number and dates (digits and dashes):
$ python whitelist_blacklist.py --image invoice.png \
    --whitelist "0123456789.-"
1785439
2020-04-08
2020-05-08
210.07
The results are just as we expect! We have now successfully used whitelists to extract the invoice number, issue date, due date, and price information while discarding the rest.
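Once the output has been reduced to digits, periods, and dashes, turning it into typed values is straightforward. The sketch below is one hedged way to do that. parse_invoice_fields is a hypothetical helper, and it assumes the four fields always arrive in the order shown above and are separated by whitespace, which is true for this invoice but not guaranteed for others.

```python
from datetime import date
from decimal import Decimal


def parse_invoice_fields(ocr_text):
    """Convert whitelisted OCR output into typed invoice fields."""
    # split on any whitespace so line breaks and spaces both work
    tokens = ocr_text.split()
    if len(tokens) != 4:
        raise ValueError("expected 4 fields, got {}".format(len(tokens)))

    # assumption: fields appear in the same order as on this invoice
    number, issued, due, total = tokens
    return {
        "invoice_number": int(number),
        "issue_date": date.fromisoformat(issued),
        "due_date": date.fromisoformat(due),
        # Decimal avoids floating point rounding on currency amounts
        "total": Decimal(total),
    }


# the whitelisted output from the command above
fields = parse_invoice_fields("1785439\n2020-04-08\n2020-05-08\n210.07\n")
print(fields)
```

If Tesseract misreads a character, the conversions raise a ValueError or a decimal.InvalidOperation, which is a useful signal that the OCR result needs a second look.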
We can also combine whitelists and blacklists if needed:
$ python whitelist_blacklist.py --image invoice.png \
    --whitelist "123456789.-" --blacklist "0"
1785439
22-4-8
22-5-8
21.7
Here, we are whitelisting the digits 1 through 9, periods, and dashes, while at the same time blacklisting the digit 0. As our output shows, we have the invoice number, issue date, due date, and price, but with all occurrences of 0 ignored, since 0 is blacklisted (and also absent from the whitelist).
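When both lists are in play, it is worth checking how they relate before running OCR. The sketch below uses a hypothetical helper, compare_filters, to flag characters that appear in both lists and blacklisted characters that the whitelist already leaves out. How Tesseract resolves a character present in both lists is not covered in this tutorial, so the sketch only reports the overlap rather than predicting the result.

```python
def compare_filters(whitelist, blacklist):
    """Describe how a whitelist and a blacklist relate to each other."""
    allowed, banned = set(whitelist), set(blacklist)
    return {
        # characters that appear in both lists
        "conflicting": sorted(allowed & banned),
        # blacklisted characters a non-empty whitelist already omits
        "already_excluded": sorted(banned - allowed) if allowed else [],
    }


# the lists from the combined command above
report = compare_filters("123456789.-", "0")
print(report)

# a hypothetical pair where the same digit is in both lists
clash = compare_filters("0123456789", "0")
print(clash)
```

For the combined command, the report shows that 0 is not on the whitelist to begin with, so going by the definition of a whitelist, the blacklist entry mainly makes the intent explicit.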
When you have a priori knowledge of the images or document structure you’ll be OCR’ing, using whitelists and blacklists is a simple yet effective means of improving your OCR results. They should be your first stop when attempting to improve OCR accuracy on your projects.
Summary
In this tutorial, you learned how to apply whitelist and blacklist character filtering using the Tesseract OCR engine.
A whitelist specifies the only characters that the OCR engine is allowed to recognize: if a character is not on the whitelist, it cannot be included in the output OCR results. The opposite of a whitelist is a blacklist. A blacklist specifies characters that under no circumstances can be included in the output.
Using whitelisting and blacklisting is a simple yet powerful technique that you can use in OCR applications. For whitelisting and blacklisting to work, you need a document or image with a reliable pattern or structure. For example, if you were building a basic receipt scanning software, you could write a whitelist that only allows digits, decimal points, commas, and dollar signs.
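If you had built an automatic license plate recognition (ALPR) system, you might have noticed that Tesseract was getting “confused” and outputting special characters that were not in the image. Blacklisting those characters, as we did with * and # on the Pennsylvania plate, keeps them out of the results.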
In our next tutorial, we’ll continue to build on our Tesseract OCR knowledge, this time turning our attention to detecting and correcting text orientation — an important pre-processing step in improving OCR accuracy.