One of the most challenging aspects of applying optical character recognition (OCR) isn’t the OCR itself. Instead, it’s the process of pre-processing, denoising, and cleaning up images such that they can be OCR’d.
To learn how to denoise your images for better OCR, just keep reading.
Using Machine Learning to Denoise Images for Better OCR Accuracy
When working with documents generated by a computer, screenshots, or essentially any piece of text that has never touched a printer and then scanned, OCR becomes far easier. The text is clean and crisp. There is sufficient contrast between the background and foreground. And most of the time, the text doesn’t exist on a complex background.
That all changes once a piece of text is printed and scanned. From there, OCR becomes much more challenging.
- The printer could be low on toner or ink, resulting in the text appearing faded and hard to read.
- An old scanner could have been used when scanning the document, resulting in low image resolution and poor text contrast.
- A mobile phone scanner app may have been used under poor lighting conditions, making it incredibly challenging for human eyes to read the text, let alone a computer.
- And all too common are the clear signs that an actual human has handled the paper, including coffee mug stains on the corners, paper crinkling, rips, tears, etc.
For all the amazing things the human mind can do, it seems like we’re all just walking accidents waiting to happen when it comes to printed materials. Give us a piece of paper and enough time, and I guarantee that even the most organized of us will take that document from pristine condition to one with stains, rips, folds, and crinkles.
Inevitably, these problems will occur — and when they do, we need to utilize our computer vision, image processing, and OCR skills to pre-process and improve the quality of these damaged documents. From there, we’ll be able to obtain higher OCR accuracy.
In the remainder of this tutorial, you’ll learn how even simple machine learning algorithms constructed in a novel way can help you denoise images before applying OCR.
Learning Objectives
In this tutorial, you will:
- Gain experience working with a dataset of noisy, damaged documents
- Discover how machine learning is used to denoise these damaged documents
- Work with Kaggle’s Denoising Dirty Documents dataset
- Extract features from this dataset
- Train a random forest regressor (RFR) on the features we extracted
- Take the model and use it to denoise images in our test set (and then be able to denoise your datasets as well)
Image Denoising with Machine Learning
In the first part of this tutorial, we will review the dataset we will be using to denoise documents. From there, we’ll review our project structure, including the five separate Python scripts we’ll be utilizing:
- A configuration file to store variables used across multiple Python scripts
- A helper function used to blur and threshold our documents
- A script used to extract features and target values from our dataset
- Another script used to train an RFR
- And a final script used to apply our trained model to images in our test set
This is one of my longer tutorials, and while it’s straightforward and follows a linear progression, there are also many nuanced details here. Therefore, I suggest you review this tutorial twice, once at a high level to understand what we’re doing and then again at a low level to understand the implementation.
With that said, let’s get started!
Our Noisy Document Dataset
We’ll use Kaggle’s Denoising Dirty Documents dataset in this tutorial. The dataset is part of the UCI Machine Learning Repository but converted to a Kaggle competition. We will use three files from the Kaggle competition data: test.zip, train.zip, and train_cleaned.zip.
The dataset is relatively small, with only 144 training samples, making it easy to work with and use as an educational tool. However, don’t let the small dataset size fool you! What we’re going to do with this dataset is far from basic or introductory.
Figure 1 shows a sample of the dirty documents dataset. For the sample document, the top shows the document’s noisy version, including stains, crinkles, folds, etc. The bottom then shows the target, pristine version of the document that we wish to generate.
Our goal is to input the image on the top and train a machine learning model to produce a cleaned output on the bottom. It may seem impossible now, but once you see some of the tricks and techniques we’ll be using, it will be a lot more straightforward than you think.
The Denoising Document Algorithm
Our denoising algorithm hinges on training an RFR to accept a noisy image and automatically predict the output pixel values. This algorithm is inspired by a denoising technique introduced by Colin Priest.
The algorithm works by sliding a 5 x 5 window from left-to-right and top-to-bottom, one pixel at a time (Figure 2), across both the noisy image (i.e., the image we want to automatically pre-process and clean up) and the target output image (i.e., the “gold standard” of how the image should appear after cleaning).
At each sliding window stop, we extract:
- The 5 x 5 region of the noisy input image. We flatten this region into a 25-d list and treat it as a feature vector.
- The same 5 x 5 region of the cleaned image, but this time taking only the center (x, y)-coordinate, denoted by the location (2, 2).
Given the 25-d (dimensional) feature vector from the noisy input image, this single center pixel value is what we want our RFR to predict.
To make this example more concrete, again consider Figure 2, where we have the following 5 x 5 grid of pixel values from the noisy image:
[[247 227 242 253 237]
 [244 228 225 212 219]
 [223 218 252 222 221]
 [242 244 228 240 230]
 [217 233 237 243 252]]
We then flatten that into a single list of 5 x 5 = 25-d values:
[247 227 242 253 237 244 228 225 212 219 223 218 252 222 221 242 244 228 240 230 217 233 237 243 252]
This 25-d vector is our feature vector upon which our RFR will be trained.
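If you want to reproduce this flattening step yourself, here is a quick NumPy sketch using the same pixel values from Figure 2:

import numpy as np

# the 5x5 noisy window from Figure 2
roi = np.array([
	[247, 227, 242, 253, 237],
	[244, 228, 225, 212, 219],
	[223, 218, 252, 222, 221],
	[242, 244, 228, 240, 230],
	[217, 233, 237, 243, 252]])

# flatten the window row-by-row into a 25-d feature vector
features = roi.flatten()
print(features.shape)  # (25,)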
However, we still need to define the target output value of the RFR. Our regression model should accept the input 25-d vector and output the cleaned, denoised pixel value.
Now, let’s assume that we have the following 5 x 5 window from our gold standard/target image:
[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 1 1 1]
 [0 0 1 1 1]
 [0 0 0 1 1]]
We are only interested in the center of this 5 x 5 region, denoted as the location x = 2, y = 2. So, we extract this value of 1 (foreground, versus 0, which is background) and treat it as the target value that our RFR should predict.
Putting this entire example together, we can think of the following as a sample training data point:
trainX = [[247 227 242 253 237 244 228 225 212 219 223 218 252 222 221 242 244 228 240 230 217 233 237 243 252]]
trainY = [[1]]
Given our trainX variable (our raw pixel intensities), we want to predict the corresponding cleaned/denoised pixel value in trainY.
We will train our RFR in this manner, ultimately leading to a model that can accept a noisy document input and automatically denoise it by examining local 5 x 5 regions and then predicting the center (cleaned) pixel value.
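To make the training scheme concrete before we build the full pipeline, here is a minimal sketch that uses random toy data in place of real patches (the actual feature vectors and targets come from the sliding window procedure described above):

from sklearn.ensemble import RandomForestRegressor
import numpy as np

# toy stand-ins for our real data: 1,000 flattened 5x5 noisy patches
# and their corresponding cleaned center-pixel values
rng = np.random.RandomState(42)
trainX = rng.rand(1000, 25)
trainY = rng.rand(1000)

# train the regressor, then predict the cleaned center pixel for a
# new (random) 5x5 patch
model = RandomForestRegressor(n_estimators=10)
model.fit(trainX, trainY)
print(model.predict(rng.rand(1, 25)))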
Configuring your development environment
To follow this guide, you need to have the OpenCV library installed on your system.
Luckily, OpenCV is pip-installable:
$ pip install opencv-contrib-python
If you need help configuring your development environment for OpenCV, I highly recommend that you read my pip install OpenCV guide — it will have you up and running in a matter of minutes.
Project Structure
This tutorial’s project directory structure is a bit more complex than other tutorials as there are five Python scripts to review (three scripts, a helper function, and a configuration file).
Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.
Before we go any further, let’s familiarize ourselves with the files:
|-- pyimagesearch
|   |-- __init__.py
|   |-- denoising
|   |   |-- __init__.py
|   |   |-- helpers.py
|-- config
|   |-- __init__.py
|   |-- denoise_config.py
|-- build_features.py
|-- denoise_document.py
|-- denoiser.pickle
|-- denoising-dirty-documents
|   |-- test
|   |   |-- 1.png
|   |   |-- 10.png
|   |   |-- ...
|   |   |-- 94.png
|   |   |-- 97.png
|   |-- train
|   |   |-- 101.png
|   |   |-- 102.png
|   |   |-- ...
|   |   |-- 98.png
|   |   |-- 99.png
|   |-- train_cleaned
|   |   |-- 101.png
|   |   |-- 102.png
|   |   |-- ...
|   |   |-- 98.png
|   |   |-- 99.png
|-- train_denoiser.py
The denoising-dirty-documents directory contains all images from the Kaggle Denoising Dirty Documents dataset.
Inside the denoising submodule of pyimagesearch, there is a helpers.py file. This file contains a single function, blur_and_threshold, which, as the name suggests, applies a combination of smoothing and thresholding as a pre-processing step for our documents.
We then have the denoise_config.py file, which stores a few configurations specifying the training data file paths, the output features CSV file, and the final serialized RFR model.
There are three Python scripts that we’ll review in their entirety:
- build_features.py: Accepts our input dataset and creates a CSV file that we’ll use to train our RFR.
- train_denoiser.py: Trains the actual RFR model and serializes it to disk as denoiser.pickle.
- denoise_document.py: Accepts an input image from disk, loads the trained RFR, and then denoises the input image.
Implementing Our Configuration File
The first step in our denoising documents implementation is to create our configuration file. Open the denoise_config.py file in the config subdirectory of the project directory structure and insert the following code:
# import the necessary packages
import os

# initialize the base path to the input documents dataset
BASE_PATH = "denoising-dirty-documents"

# define the path to the training directories
TRAIN_PATH = os.path.sep.join([BASE_PATH, "train"])
CLEANED_PATH = os.path.sep.join([BASE_PATH, "train_cleaned"])
Line 5 defines the base path to our denoising-dirty-documents dataset. If you download this dataset from Kaggle, be sure to unzip all .zip files within this directory so that all images in the dataset are uncompressed and residing on disk.
We then define the paths to both the original noisy image directory and the corresponding cleaned image directory, respectively (Lines 8 and 9).
The TRAIN_PATH images contain the noisy documents, while the CLEANED_PATH images contain our “gold standard” of what, ideally, our output images should look like after applying document denoising via our trained model. We’ll construct our testing set inside our train_denoiser.py script.
Let’s continue defining our configuration file:
# define the path to our output features CSV file then initialize
# the sampling probability for a given row
FEATURES_PATH = "features.csv"
SAMPLE_PROB = 0.02

# define the path to our document denoiser model
MODEL_PATH = "denoiser.pickle"
Line 13 defines the path to our output features.csv file. Our features here will consist of:

- A local 5 x 5 region sampled via sliding window from the noisy input image
- The center of the 5 x 5 region, denoted as the (x, y)-coordinate (2, 2), of the corresponding cleaned image
However, if we wrote every feature/target combination to disk, we would end up with millions of rows and a CSV file many gigabytes in size. So, instead of exhaustively writing all sliding window and target combinations to disk, we’ll only write a given row with SAMPLE_PROB probability.
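To get a feel for the savings, here is a back-of-the-envelope estimate; the 258 x 540 image size below is an assumption for illustration only, as the actual documents vary in size:

# rough estimate of CSV rows with and without sampling; the
# 258 x 540 image size is assumed for illustration only
num_images = 144
stops_per_image = 258 * 540            # one sliding window stop per pixel
total_rows = num_images * stops_per_image
sampled_rows = int(total_rows * 0.02)  # SAMPLE_PROB

print(total_rows)    # 20062080 -- roughly 20 million candidate rows
print(sampled_rows)  # 401241 -- roughly 400,000 rows actually written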
Finally, Line 17 specifies MODEL_PATH, the path to our output serialized model.
Creating Our Blur and Threshold Helper Function
To help our RFR predict background (i.e., noisy) from foreground (i.e., text) pixels, we need to define a helper function that will pre-process our images before we train the model and make predictions with it.
The flow of our image processing operations can be seen in Figure 4. First, we take our input image, blur it (top-left), and then subtract the blurred image from the input image (top-right). We do this step to approximate the foreground of the image since, by nature, blurring will blur focused features and reveal more of the “structural” components of the image.
Next, we threshold the approximate foreground region by setting any pixel values greater than zero to zero (Figure 4, bottom-left).
The final step is to perform min-max scaling (bottom-right), which brings the pixel intensities back to the range [0, 1] (or [0, 255], depending on your data type). This final image will serve as the noisy input when we perform our sliding window sampling.
Now that we understand the general pre-processing steps, let’s implement them in Python code.
Open the helpers.py file in the denoising submodule of pyimagesearch, and let’s get to work defining our blur_and_threshold function:
# import the necessary packages
import numpy as np
import cv2

def blur_and_threshold(image, eps=1e-7):
	# apply a median blur to the image and then subtract the blurred
	# image from the original image to approximate the foreground
	blur = cv2.medianBlur(image, 5)
	foreground = image.astype("float") - blur

	# threshold the foreground image by setting any pixels with a
	# value greater than zero to zero
	foreground[foreground > 0] = 0
The blur_and_threshold function accepts two parameters:

- image: The input image that we’ll pre-process
- eps: An epsilon value used to prevent division by zero
We then apply a median blur to the image to reduce noise and subtract the blur from the original image, resulting in a foreground approximation (Lines 8 and 9).
From there, we threshold the foreground image by setting any pixel intensities greater than zero to zero (Line 13).
The final step here is to perform min-max scaling:
	# apply min/max scaling to bring the pixel intensities to the
	# range [0, 1]
	minVal = np.min(foreground)
	maxVal = np.max(foreground)
	foreground = (foreground - minVal) / (maxVal - minVal + eps)

	# return the foreground-approximated image
	return foreground
Here, we find the minimum and maximum values in the foreground image. We then use these values to scale the pixel intensities in the foreground image to the range [0, 1].
This foreground-approximated image is then returned to the calling function.
Implementing the Feature Extraction Script
With our blur_and_threshold
function defined, we can move on to our build_features.py
script.
As the name suggests, this script is responsible for creating our flattened 5 x 5 = 25-d feature vectors from the noisy images and then extracting the target (i.e., cleaned) pixel value from the corresponding gold standard image.
We’ll save these features to disk in CSV format and then train a Random Forest Regression model on them in the section on “Implementing Our Denoising Training Script.”
Let’s get started with our implementation now:
# import the necessary packages
from config import denoise_config as config
from pyimagesearch.denoising import blur_and_threshold
from imutils import paths
import progressbar
import random
import cv2
Line 2 imports our config to access our dataset file paths and the output CSV file path. Notice that we also import the blur_and_threshold function we implemented earlier (Line 3).
The following code block grabs the paths to all images in our TRAIN_PATH (noisy images) and CLEANED_PATH (the cleaned images that our RFR will learn to predict):
# grab the paths to our training images
trainPaths = sorted(list(paths.list_images(config.TRAIN_PATH)))
cleanedPaths = sorted(list(paths.list_images(config.CLEANED_PATH)))

# initialize the progress bar
widgets = ["Creating Features: ", progressbar.Percentage(), " ",
	progressbar.Bar(), " ", progressbar.ETA()]
pbar = progressbar.ProgressBar(maxval=len(trainPaths),
	widgets=widgets).start()
Note that trainPaths contains all our noisy images, while cleanedPaths contains the corresponding cleaned images.
Figure 5 shows an example. On the top is our input training image. On the bottom, we have the corresponding cleaned version of the image. We’ll take 5 x 5 regions from both the trainPaths and the cleanedPaths images; the goal is to use the noisy 5 x 5 regions to predict the cleaned versions.
Let’s start looping over these image combinations now:
# zip our training paths together, then open the output CSV file for
# writing
imagePaths = zip(trainPaths, cleanedPaths)
csv = open(config.FEATURES_PATH, "w")

# loop over the training images together
for (i, (trainPath, cleanedPath)) in enumerate(imagePaths):
	# load the noisy and corresponding gold-standard cleaned images
	# and convert them to grayscale
	trainImage = cv2.imread(trainPath)
	cleanImage = cv2.imread(cleanedPath)
	trainImage = cv2.cvtColor(trainImage, cv2.COLOR_BGR2GRAY)
	cleanImage = cv2.cvtColor(cleanImage, cv2.COLOR_BGR2GRAY)
On Line 21, we use Python’s zip function to combine the trainPaths and cleanedPaths. We then open our output csv file for writing on Line 22.
Line 25 starts a loop over our combinations of imagePaths. For each trainPath, we also have the corresponding cleanedPath.
We load our trainImage and cleanImage from disk and convert them to grayscale (Lines 28-31).
Next, we need to pad both trainImage and cleanImage with a 2-pixel border in every direction:
	# apply 2x2 padding to both images, replicating the pixels along
	# the border/boundary
	trainImage = cv2.copyMakeBorder(trainImage, 2, 2, 2, 2,
		cv2.BORDER_REPLICATE)
	cleanImage = cv2.copyMakeBorder(cleanImage, 2, 2, 2, 2,
		cv2.BORDER_REPLICATE)

	# blur and threshold the noisy image
	trainImage = blur_and_threshold(trainImage)

	# scale the pixel intensities in the cleaned image from the range
	# [0, 255] to [0, 1] (the noisy image is already in the range
	# [0, 1])
	cleanImage = cleanImage.astype("float") / 255.0
Why do we need to bother with the padding? We’re sliding a window from left-to-right and top-to-bottom of the input image and using the pixels inside the window to predict the output center pixel located at x = 2, y = 2, not unlike a convolution operation (only with convolution our filters are fixed and defined).

Like convolution, you need to pad your input images such that the output image is not smaller in size. Please refer to my guide on Convolutions with OpenCV and Python if you are unfamiliar with the concept.
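As a quick sanity check on why the padding matters, consider this small sketch (the toy image size is arbitrary):

import cv2
import numpy as np

# a toy 3x4 "image"
image = np.zeros((3, 4), dtype="uint8")

# replicate-pad 2 pixels on every side, just as we do above
padded = cv2.copyMakeBorder(image, 2, 2, 2, 2, cv2.BORDER_REPLICATE)
print(padded.shape)  # (7, 8) -- every original pixel now sits at the
                     # center of a full 5x5 window, so the output image
                     # keeps its original 3x4 size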
After padding is complete, we blur and threshold the trainImage and manually scale the cleanImage to the range [0, 1]. The trainImage is already scaled to the range [0, 1] due to the min-max scaling inside blur_and_threshold.
With our images pre-processed, we can now slide a 5 x 5 window across them:
	# slide a 5x5 window across the images
	for y in range(0, trainImage.shape[0]):
		for x in range(0, trainImage.shape[1]):
			# extract the window ROIs for both the train image and
			# clean image, then grab the spatial dimensions of the
			# ROI
			trainROI = trainImage[y:y + 5, x:x + 5]
			cleanROI = cleanImage[y:y + 5, x:x + 5]
			(rH, rW) = trainROI.shape[:2]

			# if the ROI is not 5x5, throw it out
			if rW != 5 or rH != 5:
				continue
Lines 49 and 50 slide a 5 x 5 window from left-to-right and top-to-bottom across the trainImage and cleanImage. At each sliding window stop, we extract the 5 x 5 ROI of the training image and clean image (Lines 54 and 55).
We grab the width and height of the trainROI on Line 56, and if either the width or height is not five pixels (due to us being on the borders of the image), we throw out the ROI (because we are only concerned with 5 x 5 regions).
Next, we construct our feature vectors and save the row to our CSV file:
			# our features will be the flattened 5x5=25 raw pixels
			# from the noisy ROI while the target prediction will
			# be the center pixel in the 5x5 window
			features = trainROI.flatten()
			target = cleanROI[2, 2]

			# if we wrote *every* feature/target combination to disk
			# we would end up with millions of rows -- let's only
			# write rows to disk with probability N, thereby reducing
			# the total number of rows in the file
			if random.random() <= config.SAMPLE_PROB:
				# write the target and features to our CSV file
				features = [str(x) for x in features]
				row = [str(target)] + features
				row = ",".join(row)
				csv.write("{}\n".format(row))

	# update the progress bar
	pbar.update(i)

# close the CSV file
pbar.finish()
csv.close()
Line 65 takes the 5 x 5 pixel region from the trainROI and flattens it into a 5 x 5 = 25-d list; this list serves as our feature vector. Line 66 then extracts the cleaned/gold-standard pixel value from the center of the cleanROI. This pixel value is what we want our RFR to predict.
At this point, we could write our combination of feature vector and target value to disk; however, if we were to write every feature/target combination to the CSV file, we would end up with a file many gigabytes in size that we would then have to process in the next step. To avoid such a massive CSV file, we instead only allow SAMPLE_PROB (in this case, 2%) of the rows to be written to disk (Line 72). Doing this sampling reduces the resulting CSV file size and makes it easier to manage.
Line 74 constructs our row of features and prepends the target pixel value. We then write the row to our CSV file. We repeat this process for all imagePaths.
Running the Feature Extraction Script
We are now ready to run our feature extractor. First, open a terminal and then execute the build_features.py script:
$ python build_features.py
Creating Features: 100% |#########################| Time: 0:01:05
The entire feature extraction process took just over one minute on my 3 GHz Intel Xeon W processor.
Inspecting my project directory structure, you can now see the resulting CSV file of features:
$ ls -l *.csv
adrianrosebrock  staff  273968497 Oct 23 06:21 features.csv
If you were to open the features.csv file on your system, you would see that each row contains 26 entries. The first entry in the row is the target output pixel. We will try to predict the output pixel value based on the contents of the remainder of the row, which are the 5 x 5 = 25 input ROI pixels.
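If you’d like to verify this yourself, a quick sanity check along these lines should do, assuming features.csv sits in your working directory:

# read the first row of the CSV and split it into its entries
with open("features.csv") as f:
	row = f.readline().strip().split(",")

print(len(row))  # 26
target = float(row[0])                # the cleaned target pixel
pixels = [float(x) for x in row[1:]]  # the 25 noisy ROI pixels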
The next section covers how to train an RFR model to do exactly that.
Implementing Our Denoising Training Script
Now that our features.csv file has been generated, we can move on to the training script. This script is responsible for loading the features.csv file and training an RFR to accept a 5 x 5 region of a noisy image and then predict the cleaned center pixel value.
Let’s get started reviewing the code:
# import the necessary packages
from config import denoise_config as config
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
import pickle
Lines 2-7 handle our required Python packages, including:
- config: Our project configuration holding our output file paths and training variables
- RandomForestRegressor: The scikit-learn implementation of the regression model we’ll use to predict pixel values
- mean_squared_error: Our error/loss function; the lower this value, the better job we are doing at denoising our images
- train_test_split: Used to create a training/testing split from our features.csv file
- pickle: Used to serialize our trained RFR to disk
Let’s move on to loading our CSV file from disk:
# initialize lists to hold our features and target predicted values
print("[INFO] loading dataset...")
features = []
targets = []

# loop over the rows in our features CSV file
for row in open(config.FEATURES_PATH):
	# parse the row and extract (1) the target pixel value to predict
	# along with (2) the 5x5=25 pixels which will serve as our feature
	# vector
	row = row.strip().split(",")
	row = [float(x) for x in row]
	target = row[0]
	pixels = row[1:]

	# update our features and targets lists, respectively
	features.append(pixels)
	targets.append(target)
Lines 11 and 12 initialize our features (the 5 x 5 pixel regions) and targets (the target output pixel values we want to predict) lists.
We start looping over all rows of our CSV file on Line 15. For each row, we extract both the target and the pixels (Lines 19-22). We then update our features and targets lists, respectively.
With the CSV file loaded into memory, we can construct our training and testing split:
# convert the features and targets to NumPy arrays
features = np.array(features, dtype="float")
target = np.array(targets, dtype="float")

# construct our training and testing split, using 75% of the data for
# training and the remaining 25% for testing
(trainX, testX, trainY, testY) = train_test_split(features,
	target, test_size=0.25, random_state=42)
Here, we use 75% of our data for training and mark the remaining 25% for testing. This type of split is fairly standard in the machine learning field.
Finally, we can train our RFR:
# train a random forest regressor on our data
print("[INFO] training model...")
model = RandomForestRegressor(n_estimators=10)
model.fit(trainX, trainY)

# compute the root mean squared error on the testing set
print("[INFO] evaluating model...")
preds = model.predict(testX)
rmse = np.sqrt(mean_squared_error(testY, preds))
print("[INFO] rmse: {}".format(rmse))

# serialize our random forest regressor to disk
f = open(config.MODEL_PATH, "wb")
f.write(pickle.dumps(model))
f.close()
Line 39 initializes our RandomForestRegressor, instructing it to train 10 separate regression trees. The model is then trained on Line 40.
After training is complete, we compute the root-mean-square error (RMSE) to measure how good a job we’ve done at predicting cleaned, denoised images. The lower the error value, the better the job we’ve done.
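For reference, RMSE is simply the square root of the mean squared error. Using the testY and preds arrays from the script above, the scikit-learn computation is equivalent to this NumPy expression:

import numpy as np

# equivalent to np.sqrt(mean_squared_error(testY, preds)) -- the
# square root of the average squared difference between the true
# and predicted pixel values
rmse = np.sqrt(np.mean((testY - preds) ** 2))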
Finally, we serialize our trained RFR model to disk such that we can use it to make predictions on our noisy images.
Training Our Document Denoising Model
With our train_denoiser.py script implemented, we are now ready to train our automatic image denoiser! First, open a shell and then execute the train_denoiser.py script:
$ time python train_denoiser.py
[INFO] loading dataset...
[INFO] training model...
[INFO] evaluating model...
[INFO] rmse: 0.04990744293857625

real	1m18.708s
user	1m19.361s
sys	0m0.894s
Training the model takes just over one minute, resulting in an RMSE of ≈0.05. This is a very low loss value, indicating that our model successfully accepts noisy input pixel ROIs and correctly predicts the target output values.
Inspecting our project directory structure, you’ll see that the RFR model has been serialized to disk as denoiser.pickle:
$ ls -l *.pickle
adrianrosebrock  staff  77733392 Oct 23 denoiser.pickle
We’ll load our trained denoiser.pickle model from disk in the next section and then use it to automatically clean and pre-process our input documents.
Creating the Document Denoiser Script
This project’s final step is to use our trained denoiser model to automatically clean our input images.
Open denoise_document.py now, and we’ll see how this process is done:
# import the necessary packages
from config import denoise_config as config
from pyimagesearch.denoising import blur_and_threshold
from imutils import paths
import argparse
import pickle
import random
import cv2
Lines 2-8 handle importing our required Python packages. We then move on to parsing our command line arguments:
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-t", "--testing", required=True,
	help="path to directory of testing images")
ap.add_argument("-s", "--sample", type=int, default=10,
	help="sample size for testing images")
args = vars(ap.parse_args())
Our denoise_document.py script accepts two command line arguments:

- --testing: The path to the directory containing the testing images from Kaggle’s Denoising Dirty Documents dataset
- --sample: The number of testing images to sample when applying our denoising model
Speaking of our denoising model, let’s load the serialized model from disk:
# load our document denoiser from disk
model = pickle.loads(open(config.MODEL_PATH, "rb").read())

# grab the paths to all images in the testing directory and then
# randomly sample them
imagePaths = list(paths.list_images(args["testing"]))
random.shuffle(imagePaths)
imagePaths = imagePaths[:args["sample"]]
We also grab the paths to all images that are part of the testing set (imagePaths), randomly shuffle them, and then select a total of --sample images on which to apply our automatic denoiser model.
Let’s loop over our sample of imagePaths:
# loop over the sampled image paths
for imagePath in imagePaths:
	# load the image, convert it to grayscale, and clone it
	print("[INFO] processing {}".format(imagePath))
	image = cv2.imread(imagePath)
	image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
	orig = image.copy()

	# pad the image followed by blurring/thresholding it
	image = cv2.copyMakeBorder(image, 2, 2, 2, 2,
		cv2.BORDER_REPLICATE)
	image = blur_and_threshold(image)
Here, we are performing the same pre-processing steps that we utilized during the training phase:
- We load the input image from disk
- Convert it to grayscale
- Pad the image with two pixels in every direction
- Apply the blur_and_threshold function
Now we need to loop over the processed image and extract every 5 x 5 pixel neighborhood:
	# initialize a list to store our ROI features (i.e., 5x5 pixel
	# neighborhoods)
	roiFeatures = []

	# slide a 5x5 window across the image
	for y in range(0, image.shape[0]):
		for x in range(0, image.shape[1]):
			# extract the window ROI and grab the spatial dimensions
			roi = image[y:y + 5, x:x + 5]
			(rH, rW) = roi.shape[:2]

			# if the ROI is not 5x5, throw it out
			if rW != 5 or rH != 5:
				continue

			# our features will be the flattened 5x5=25 pixels from
			# the training ROI
			features = roi.flatten()
			roiFeatures.append(features)
Line 42 initializes a list, roiFeatures, to store every 5 x 5 pixel neighborhood.
We then slide a 5 x 5 window from left-to-right and top-to-bottom across the image. At every step of the window, we extract the roi (Line 48), grab its spatial dimensions (Line 49), and throw it out if the ROI size is not 5 x 5 (Lines 52 and 53).
We then take our 5 x 5 pixel neighborhood, flatten it into a list of features, and update our roiFeatures list (Lines 57 and 58).
Outside of our sliding window for loops, we now have roiFeatures populated with every possible 5 x 5 pixel neighborhood.
We can then make predictions on these roiFeatures, resulting in the final cleaned image:
	# use the ROI features to predict the pixels of our new denoised
	# image
	pixels = model.predict(roiFeatures)

	# the pixels list is currently a 1D array so we need to reshape
	# it to a 2D array (based on the original input image dimensions)
	# and then scale the pixels from the range [0, 1] to [0, 255]
	pixels = pixels.reshape(orig.shape)
	output = (pixels * 255).astype("uint8")

	# show the original and output images
	cv2.imshow("Original", orig)
	cv2.imshow("Output", output)
	cv2.waitKey(0)
Line 62 calls the .predict method of our RFR, resulting in pixels, our foreground versus background predictions.
However, our pixels list is currently a 1D array, so we must take care to reshape the array into a 2D image and then scale the pixel intensities back to the range [0, 255] (Lines 67 and 68).
Finally, we can show both the original (noisy) image and the output (cleaned) image on our screen.
Running Our Document Denoiser
You made it! This has been a long tutorial, but we’re finally ready to apply our document denoiser to our test data.
To see our denoise_document.py script in action, open a terminal and execute the following command:
$ python denoise_document.py --testing denoising-dirty-documents/test
[INFO] processing denoising-dirty-documents/test/133.png
[INFO] processing denoising-dirty-documents/test/160.png
[INFO] processing denoising-dirty-documents/test/40.png
[INFO] processing denoising-dirty-documents/test/28.png
[INFO] processing denoising-dirty-documents/test/157.png
[INFO] processing denoising-dirty-documents/test/190.png
[INFO] processing denoising-dirty-documents/test/100.png
[INFO] processing denoising-dirty-documents/test/49.png
[INFO] processing denoising-dirty-documents/test/58.png
[INFO] processing denoising-dirty-documents/test/10.png
Our results can be seen in Figure 6. The left image for each sample shows the noisy input document, including stains, crinkles, folds, etc. The right then shows the output cleaned image as generated by our RFR.
As you can see, our RFR is doing a great job cleaning these images for us automatically!
Summary
In this tutorial, you learned how to denoise dirty documents using computer vision and machine learning.
Using this method, we could accept images of documents that had been “damaged,” including rips, tears, stains, crinkles, folds, etc. Then, by applying machine learning in a novel way, we could clean up these images to near pristine conditions, making it easier for OCR engines to detect the text, extract it, and OCR it correctly.
When you find yourself applying OCR to real-world images, especially scanned documents, you’ll inevitably run into documents that are of poor quality. Unfortunately, when that happens, your OCR accuracy will likely suffer.
Instead of throwing in the towel, consider how the techniques used in this tutorial may help. Is it possible to manually pre-process a subset of these images and then use them as training data? From there, you can train a model that can accept a noisy pixel ROI and then produce a pristine, cleaned output.
Typically, we don’t use raw pixels as inputs to machine learning models (the exception being a convolutional neural network, of course). Usually, we’ll quantify an input image using some feature detector or descriptor extractor. From there, the resulting feature vector is handed off to a machine learning model.
Rarely does one see standard machine learning models operating on raw pixel intensities. It’s a neat trick that doesn’t feel like it should work in practice. However, as you saw here, this method works!
I hope you can use this tutorial as a starting point when implementing your document denoising pipelines.
To go deeper, you could use denoising autoencoders to improve denoising quality. In this tutorial, we used a random forest regressor, an ensemble of decision trees. Another ensemble you may want to explore is extreme gradient boosting (XGBoost, for short), as sketched below.
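As a minimal sketch, swapping XGBoost into our training script would look something like the following, assuming you have the xgboost package installed (pip install xgboost) and reusing the trainX, trainY, and testX arrays from train_denoiser.py; the hyperparameter values are illustrative, not tuned:

from xgboost import XGBRegressor

# train a gradient boosted ensemble in place of the random forest;
# the rest of the pipeline (feature extraction, serialization, and
# prediction) stays exactly the same
model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(trainX, trainY)
preds = model.predict(testX)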
Citation Information
A. Rosebrock, “Using Machine Learning to Denoise Images for Better OCR Accuracy,” PyImageSearch, 2021, https://hcl.pyimagesearch.com/2021/10/20/using-machine-learning-to-denoise-images-for-better-ocr-accuracy/
@article{Rosebrock_2021_Denoise,
author = {Adrian Rosebrock},
title = {Using Machine Learning to Denoise Images for Better {OCR} Accuracy},
journal = {PyImageSearch},
year = {2021},
note = {https://hcl.pyimagesearch.com/2021/10/20/using-machine-learning-to-denoise-images-for-better-ocr-accuracy/},
}