In a previous tutorial, you learned how to use the textblob library and Tesseract to automatically OCR text and then translate it to a different language. This tutorial will also use textblob, but this time to improve OCR accuracy by automatically spellchecking OCR’d text.
To learn how to improve OCR results using spellchecking, just keep reading.
Using spellchecking to improve Tesseract OCR accuracy
It’s unrealistic to expect any OCR system, even state-of-the-art OCR engines, to be 100% accurate. That doesn’t happen in practice. Inevitably, noise in an input image, non-standard fonts that Tesseract wasn’t trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text.
When that happens, you need to create rules and heuristics that can be used to improve the output OCR quality. One of the first rules and heuristics you should look at is automatic spellchecking. For example, if you’re OCR’ing a book, you could use spellchecking as an attempt to automatically correct after the OCR process, thereby creating a better, more accurate version of the digitized text.
Learning Objectives
In this tutorial, you will:
- Learn how the textblob package can be used for spellchecking
- OCR a piece of text that contains incorrect spelling
- Automatically correct the spelling of the OCR’d text
OCR and Spellchecking
We’ll start this tutorial by reviewing our project directory structure. I’ll then show you how to implement a Python script that can automatically OCR a piece of text and then spellcheck it using the textblob library. Once our script is implemented, we’ll apply it to our example image. We’ll wrap up this tutorial with a discussion on the accuracy of our spellchecking, including some of the limitations and drawbacks associated with automatic spellchecking.
Configuring your development environment
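To follow this guide, you need to have the OpenCV library installed on your system, along with the Tesseract OCR engine and the pytesseract and textblob Python packages that our script imports.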
Luckily, OpenCV is pip-installable:
$ pip install opencv-contrib-python
If you need help configuring your development environment for OpenCV, I highly recommend that you read my pip install OpenCV guide — it will have you up and running in a matter of minutes.
Project Structure
The project directory structure for our OCR spellchecker is quite simple:
|-- comic_spelling.png
|-- ocr_and_spellcheck.py
We only have a single Python script here, ocr_and_spellcheck.py. This script does the following:
- Load comic_spelling.png from disk
- OCR the text in the image
- Apply spellchecking to it
By applying the spellcheck, we will ideally be able to improve the OCR accuracy of our script, regardless of whether:
- The input image has incorrect spellings in it
- Tesseract incorrectly OCR’d characters
Implementing Our OCR Spellchecking Script
Let’s start implementing our OCR and spellchecking script.
Open a new file, name it ocr_and_spellcheck.py, and insert the following code:
# import the necessary packages
from textblob import TextBlob
import pytesseract
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image to be OCR'd")
args = vars(ap.parse_args())
Lines 2-5 import our required Python packages. You should note the use of the textblob package, which we utilized in a previous lesson on translating OCR’d text from one language to another. We’ll be using textblob in this tutorial, but this time for its automatic spellchecking implementation.
Lines 8-11 then parse our command line arguments. We only need a single argument, --image, which is the path to our input image.
Next, we can load the image from disk and OCR it:

# load the input image and convert it from BGR to RGB channel
# ordering
image = cv2.imread(args["image"])
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# use Tesseract to OCR the image
text = pytesseract.image_to_string(rgb)

# show the text *before* ocr-spellchecking has been applied
print("BEFORE SPELLCHECK")
print("=================")
print(text)
print("\n")
Line 15 loads our input image from the disk using the supplied path. We then swap the color channel ordering from BGR (OpenCV’s default ordering) to RGB (which is what Tesseract and pytesseract expect).
Once the image is loaded, we make a call to image_to_string to OCR the image. We then display the OCR’d text before spellchecking on our screen (Lines 19-25).
However, the OCR’d text may contain misspellings, either because the user misspelled text when creating the image or because Tesseract incorrectly OCR’d one or more characters. To fix that, we need to utilize textblob:

# apply spell checking to the OCR'd text
tb = TextBlob(text)
corrected = tb.correct()

# show the text after ocr-spellchecking has been applied
print("AFTER SPELLCHECK")
print("================")
print(corrected)
Line 28 constructs a TextBlob from the OCR’d text. We then apply automatic spellcheck correction via the correct() method (Line 29). The corrected text (i.e., after spellchecking) is then displayed on the terminal (Lines 32-34).
OCR Spellchecking Results
We are now ready to apply OCR spellchecking to an example image.
Open a terminal and execute the following command:
$ python ocr_and_spellcheck.py --image comic_spelling.png
BEFORE SPELLCHECK
=================
Why can't yu spel corrctly?

AFTER SPELLCHECK
================
Why can't you spell correctly?
Figure 2 shows our example image (created via the Explosm comic generator), which includes words with misspellings. Using Tesseract, we can OCR the text with the original misspellings.
It’s important to note that these misspellings were purposely introduced — in your OCR applications, these misspellings may naturally exist in your input images or Tesseract may incorrectly OCR certain characters.
As our output shows, we are able to correct these misspellings using textblob, correcting the words “yu ⇒ you,” “spel ⇒ spell,” and “corrctly ⇒ correctly.”
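If you want to put a number on that improvement, one option is to compare the text before and after spellchecking against a known-good transcription. The snippet below is a minimal sketch that uses difflib from the Python standard library. The similarity helper and the reference string are assumptions made for illustration (here, the reference is the text the comic was meant to say), and they are not part of ocr_and_spellcheck.py.

```python
from difflib import SequenceMatcher


def similarity(candidate, reference):
    """Return a character-level similarity score between 0.0 and 1.0.

    A score of 1.0 means the two strings are identical.
    """
    return SequenceMatcher(None, candidate, reference).ratio()


# assumed ground truth: the text the image was intended to contain
reference = "Why can't you spell correctly?"

# the two strings printed by our script in the run above
before = "Why can't yu spel corrctly?"
after = "Why can't you spell correctly?"

print(f"before spellcheck: {similarity(before, reference):.3f}")
print(f"after spellcheck:  {similarity(after, reference):.3f}")
```

Because the corrected text matches the reference exactly, it scores 1.0, while the uncorrected text scores lower. On your own data, a drop in this score after spellchecking is a sign that automatic correction is doing more harm than good.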
Limitations and Drawbacks
One of the biggest problems with spellchecking algorithms is that most spellcheckers require some human intervention to be accurate. When we make a spelling mistake, our word processor automatically detects the error and proposes candidate fixes — often two or three words that the spellchecker thinks we meant to spell. Unless we atrociously misspelled a word, nine times out of 10, we can find the word we meant to use in the candidates proposed by the spellchecker.
If we remove that human intervention and instead let the spellchecker use the word it deems most probable based on its internal algorithm, we risk replacing words that have only minor misspellings with words that do not make sense in the original context of the sentence or paragraph. Therefore, you should be cautious when relying on totally automatic spellcheckers. There is a risk that an incorrect word (versus the correct word, but with minor spelling mistakes) is inserted in the output OCR’d text.
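One way to keep a human in the loop is to accept a correction only when the spellchecker is confident and to flag everything else for review. The sketch below illustrates that idea. The function names, the min_confidence parameter, and the 0.9 default are assumptions chosen for this example. The textblob_suggest helper relies on textblob’s Word.spellcheck() method, which returns a list of (word, confidence) tuples, and it imports textblob lazily so that correct_with_review can also be used with any other suggestion source.

```python
import re


def correct_with_review(text, suggest, min_confidence=0.9):
    """Replace a word only when the top suggestion is confident enough.

    `suggest` maps a word to a list of (candidate, confidence) pairs,
    best candidate first. Returns the corrected text together with a
    list of (word, candidates) pairs left for a human to review.
    """
    needs_review = []

    def fix(match):
        word = match.group(0)
        candidates = suggest(word)
        if not candidates:
            return word
        best, confidence = candidates[0]
        if best.lower() == word.lower():
            # the top suggestion is the word itself, so leave it alone
            return word
        if confidence >= min_confidence:
            return best
        # not confident enough: keep the original and flag it
        needs_review.append((word, candidates))
        return word

    # only runs of ASCII letters are checked; digits, punctuation, and
    # whitespace pass through unchanged
    corrected = re.sub(r"[A-Za-z]+", fix, text)
    return corrected, needs_review


def textblob_suggest(word):
    """Suggestion source backed by textblob (imported on first use)."""
    from textblob import Word
    return Word(word).spellcheck()
```

With textblob installed, calling correct_with_review(text, textblob_suggest) on the OCR’d text returns both the conservatively corrected string and the words that still need a decision. The right threshold depends on your documents, so treat 0.9 as a starting point rather than a recommendation.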
If you find that spellchecking is hurting your OCR accuracy, you may want to:
- Look into alternative spellchecking algorithms other than the generic one included in the textblob library
- Replace spellchecking with heuristic-based methods (e.g., regular expression matching)
- Allow misspellings to exist, keeping in mind that no OCR system is 100% accurate anyway
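As an illustration of the heuristic-based option, the sketch below uses a regular expression to repair letters that were OCR’d in place of similar-looking digits, but only inside tokens that already look numeric. The LOOKALIKES mapping, the NUMERIC_TOKEN pattern, and the fix_numeric_tokens function are hypothetical and written for this example; you would need to tune them for your own fonts and document types.

```python
import re

# Letters that are easily confused with digits. This mapping is an
# illustrative assumption, not an exhaustive or authoritative list.
LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# A token qualifies when its leading run of word characters contains at
# least one digit and the whole token consists only of digits, look-alike
# letters, and common numeric separators (. , / : -).
NUMERIC_TOKEN = re.compile(
    r"\b(?=\w*\d)[\dOolISB]+(?:[.,/:-][\dOolISB]+)*\b"
)


def fix_numeric_tokens(text):
    """Swap look-alike letters for digits inside numeric-looking tokens."""
    return NUMERIC_TOKEN.sub(
        lambda match: match.group(0).translate(LOOKALIKES), text
    )


print(fix_numeric_tokens("Date: 2O21-11-29"))
print(fix_numeric_tokens("Total: l5.OO"))
```

Ordinary words are left untouched because they contain no digits. The trade-off is that a genuine mixed token such as B2B would also be rewritten, so a rule like this works best on fields where you know a number is expected.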
Summary
In this tutorial, you learned how to improve OCR results by applying automatic spellchecking. While our method worked well in our particular example, it may not work well in other situations! Keep in mind that spellchecking algorithms typically require a small amount of human intervention. Most spellcheckers automatically check a document for spelling mistakes and then propose a list of candidate corrections to the human user. It’s up to the human to make the final spellcheck decision.
When we remove the human intervention component and instead allow the spellchecking algorithm to choose the correction it deems the best fit, words with only minor misspellings can be replaced with words that don’t make sense within the sentence’s original context. Use spellchecking, especially automatic spellchecking, cautiously in your own OCR applications. In some cases, it will help your OCR accuracy, but it can hurt accuracy in other situations.
Citation Information
Rosebrock, A. “Using spellchecking to improve Tesseract OCR accuracy,” PyImageSearch, 2021, https://hcl.pyimagesearch.com/2021/11/29/using-spellchecking-to-improve-tesseract-ocr-accuracy/
@article{Rosebrock_2021_Spellchecking,
  author = {Adrian Rosebrock},
  title = {Using spellchecking to improve {T}esseract {OCR} accuracy},
  journal = {PyImageSearch},
  year = {2021},
  note = {https://hcl.pyimagesearch.com/2021/11/29/using-spellchecking-to-improve-tesseract-ocr-accuracy/},
}