In the past, we've worked with datasets that have been pre-compiled and labeled for us, but what if we wanted to create our own custom dataset and then train a CNN on it? In this tutorial, I'll present a complete deep learning case study that will give you an example of:
- Downloading a set of images.
- Labeling and annotating your images for training.
- Training a CNN on your custom dataset.
- Evaluating and testing the trained CNN.
The dataset of images we'll be downloading is a set of captcha images used to prevent bots from automatically registering or logging in to a given website (or worse, trying to brute force their way into someone's account).
Once we've downloaded a set of captcha images, we'll need to manually label each of the digits in the captcha. As we'll find out, obtaining and labeling a dataset can be half (if not more) the battle. Depending on how much data you need, how easy it is to obtain, and whether or not you need to label the data (i.e., assign a ground-truth label to the image), it can be a costly process, both in terms of time and/or finances (if you pay someone else to label the data).
Therefore, whenever possible we try to use traditional computer vision techniques to speed up the labeling process. If we were to use image processing software such as Photoshop or GIMP to manually extract digits in a captcha image to create our training set, it might take us days of non-stop work to complete the task.
However, by applying some basic computer vision techniques, we can download and label our training set in less than an hour. This is one of the many reasons why I encourage deep learning practitioners to also invest in their computer vision education.
To learn how to break captchas with deep learning, Keras, and TensorFlow, just keep reading.
I'd also like to mention that real-world datasets are not like the benchmark datasets such as MNIST, CIFAR-10, and ImageNet, where images are neatly labeled and organized and our goal is only to train a model on the data and evaluate it. These benchmark datasets may be challenging, but in the real world, the struggle is often obtaining the (labeled) data itself; in many instances, the labeled data is worth a lot more than the deep learning model obtained from training a network on your dataset.
For example, if you were running a company responsible for creating a custom Automatic License Plate Recognition (ALPR) system for the United States government, you might invest years building a robust, massive dataset, while at the same time evaluating various deep learning approaches to recognizing license plates. Accumulating such a massive labeled dataset would give you a competitive edge over other companies, and in this case, the data itself is worth more than the end product.
Your company would be more likely to be acquired simply because of the exclusive rights you have to the massive, labeled dataset. Building an amazing deep learning model to recognize license plates would only increase the value of your company, but again, labeled data is expensive to obtain and replicate, so if you own the keys to a dataset that is hard (if not impossible) to replicate, make no mistake: your company's primary asset is the data, not the deep learning.
Let's look at how we can obtain a dataset of images, label them, and then apply deep learning to break a captcha system.
Configuring your development environment
To follow this guide, you need to have the OpenCV library installed on your system.
Luckily, OpenCV is pip-installable:
$ pip install opencv-contrib-python
If you need help configuring your development environment for OpenCV, I highly recommend that you read my pip install OpenCV guide; it will have you up and running in a matter of minutes.
Having problems configuring your development environment?
All that said, are you:
- Short on time?
- Learning on your employer's administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab's ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Breaking Captchas with a CNN
Here's how to think about breaking captchas. Remember the concept of responsible disclosure, something you should always practice when computer security is involved.
The process starts when we create a Python script to automatically download a set of images that weāll be using for training and evaluation.
After downloading our images, we'll need to use a bit of computer vision to aid us in labeling the images, making the process much easier and substantially faster than simply cropping and labeling inside photo software like GIMP or Photoshop. Once we have labeled our data, we'll train the LeNet architecture; as we'll find out, we're able to break the captcha system and obtain 100% accuracy in less than 15 epochs.
A Note on Responsible Disclosure
Living in the northeastern/midwestern part of the United States, it's hard to travel on major highways without an E-ZPass. E-ZPass is an electronic toll collection system used on many bridges, interstates, and tunnels. Travelers simply purchase an E-ZPass transponder, place it on the windshield of their car, and enjoy the ability to quickly travel through tolls without stopping, as a credit card attached to their E-ZPass account is charged for any tolls.
E-ZPass has made tolls a much more "enjoyable" process (if there is such a thing). Instead of waiting in interminable lines where a physical transaction needs to take place (i.e., hand the cashier money, receive your change, get a printed receipt for reimbursement, etc.), you can simply blaze through in the fast lane without stopping; it saves a bunch of time when traveling and is much less of a hassle (you still have to pay the toll though).
I spend much of my time traveling between Maryland and Connecticut, two states along the I-95 corridor of the United States. The I-95 corridor, especially in New Jersey, contains a plethora of toll booths, so an E-ZPass pass was a no-brainer decision for me. About a year ago, the credit card I had attached to my E-ZPass account expired, and I needed to update it. I went to the E-ZPass New York website (the state I bought my E-ZPass in) to log in and update my credit card, but I stopped dead in my tracks (Figure 2).
Can you spot the flaw in this system? Their "captcha" is nothing more than four digits on a plain white background, which is a major security risk: someone with even basic computer vision or deep learning experience could develop a piece of software to break this system.
This is where the concept of responsible disclosure comes in. Responsible disclosure is a computer security term for describing how to disclose a vulnerability. Instead of posting it on the internet for everyone to see immediately after the threat is detected, you try to contact the stakeholders first to ensure they know there is an issue. The stakeholders can then attempt to patch the software and resolve the vulnerability.
Simply ignoring the vulnerability and hiding the issue provides only a false sense of security, something that should be avoided. In an ideal world, the vulnerability is resolved before it is publicly disclosed.
However, when stakeholders do not acknowledge the issue or do not fix the problem in a reasonable amount of time, it creates an ethical conundrum: do you hide the issue and pretend it doesn't exist? Or do you disclose it, bringing more attention to the problem in an effort to bring about a fix faster? Responsible disclosure states that you first bring the problem to the stakeholders (responsible); if it's not resolved, then you need to disclose the issue (disclosure).
To demonstrate how the E-ZPass NY system was at risk, I trained a deep learning model to recognize the digits in the captcha. I then wrote a second Python script to (1) auto-fill my login credentials and (2) break the captcha, allowing my script access to my account.
In this case, I was only auto-logging into my account. Using this "feature," I could auto-update a credit card, generate reports on my tolls, or even add a new car to my E-ZPass. But someone nefarious may use this as a method to brute force their way into a customer's account.
I contacted E-ZPass over email, phone, and Twitter regarding the issue one year before I wrote this. They acknowledged the receipt of my messages; however, nothing has been done to fix the issue, despite multiple contacts.
In the rest of this tutorial, I'll discuss how we can use the E-ZPass system to obtain a captcha dataset, which we'll then label and train a deep learning model on. I will not be sharing the Python code to automatically log in to an account; that is outside the boundaries of responsible disclosure, so please do not ask me for this code.
Keep in mind that with all knowledge comes responsibility. This knowledge, under no circumstance, should be used for nefarious or unethical reasons. This case study exists as a method to demonstrate how to obtain and label a custom dataset, followed by training a deep learning model on top of it.
I am required to say that I am not responsible for how this code is used; use this as an opportunity to learn, not an opportunity to be nefarious.
The Captcha Breaker Directory Structure
To build the captcha breaker system, we'll need to update the pyimagesearch.utils submodule and include a new file named captchahelper.py:
|--- pyimagesearch
|    |--- __init__.py
|    |--- datasets
|    |--- nn
|    |--- preprocessing
|    |--- utils
|    |    |--- __init__.py
|    |    |--- captchahelper.py
This file will store a utility function named preprocess to help us process digits before feeding them into our deep neural network.
We'll also create a second directory, this one named captcha_breaker, outside of our pyimagesearch module, and include the following files and subdirectories:
|--- captcha_breaker
|    |--- dataset/
|    |--- downloads/
|    |--- output/
|    |--- annotate.py
|    |--- download_images.py
|    |--- test_model.py
|    |--- train_model.py
The captcha_breaker directory is where all our project code for breaking image captchas will be stored. The dataset directory is where we will store our hand-labeled digits. I prefer to keep my datasets organized using the following directory structure template:
root_directory/class_name/image_filename.jpg
Therefore, our dataset directory will have the structure:
dataset/{1-9}/example.jpg
where dataset is the root directory, {1-9} are the possible digit names, and example.jpg will be an example of the given digit.
The downloads directory will store the raw captcha .jpg files downloaded from the E-ZPass website. Inside the output directory, we'll store our trained LeNet architecture.
The download_images.py script, as the name suggests, will be responsible for actually downloading the example captchas and saving them to disk. Once we've downloaded a set of captchas, we'll need to extract the digits from each image and hand-label every digit; this will be accomplished by annotate.py.
The train_model.py script will train LeNet on the labeled digits, while test_model.py will apply LeNet to captcha images themselves.
Automatically Downloading Example Images
The first step in building our captcha breaker is to download the example captcha images themselves.
If you copy and paste "https://www.e-zpassny.com/vector/jcaptcha.do" into your web browser and hit refresh multiple times, you'll notice that this is a dynamic program that generates a new captcha each time you refresh. Therefore, to obtain our example captcha images we need to request this URL a few hundred times and save the resulting images.
To automatically fetch new captcha images and save them to disk, we can use download_images.py:
# import the necessary packages
import argparse
import requests
import time
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-o", "--output", required=True,
    help="path to output directory of images")
ap.add_argument("-n", "--num-images", type=int,
    default=500, help="# of images to download")
args = vars(ap.parse_args())
Lines 2-5 import our required Python packages. The requests library makes working with HTTP connections easy and is heavily used in the Python ecosystem. If you do not already have requests installed on your system, you can install it via:
$ pip install requests
We then parse our command line arguments on Lines 8-13. We'll require a single command line argument, --output, which is the path to the output directory that will store our raw captcha images (we'll later hand-label each of the digits in the images).
A second optional switch, --num-images, controls the number of captcha images we're going to download. We'll default this value to 500 total images. Since there are four digits in each captcha, this value of 500 will give us 500×4 = 2,000 total digits that we can use for training our network.
Our next code block initializes the URL of the captcha image we are going to download along with the total number of images generated thus far:
# initialize the URL that contains the captcha images that we will
# be downloading along with the total number of images downloaded
# thus far
url = "https://www.e-zpassny.com/vector/jcaptcha.do"
total = 0
We are now ready to download the captcha images:
# loop over the number of images to download
for i in range(0, args["num_images"]):
    try:
        # try to grab a new captcha image
        r = requests.get(url, timeout=60)

        # save the image to disk
        p = os.path.sep.join([args["output"], "{}.jpg".format(
            str(total).zfill(5))])
        f = open(p, "wb")
        f.write(r.content)
        f.close()

        # update the counter
        print("[INFO] downloaded: {}".format(p))
        total += 1

    # handle if any exceptions are thrown during the download process
    except:
        print("[INFO] error downloading image...")

    # insert a small sleep to be courteous to the server
    time.sleep(0.1)
On Line 22, we start looping over the --num-images that we wish to download. A request is made on Line 25 to download the image. We then save the image to disk on Lines 28-32. If there was an error downloading the image, our try/except block on Lines 39 and 40 catches it and allows our script to continue. Finally, we insert a small sleep on Line 43 to be courteous to the web server we are requesting images from.
You can execute download_images.py using the following command:
$ python download_images.py --output downloads
This script will take a while to run since we are (1) making a network request to download each image and (2) inserting a 0.1-second pause after each download.
Once the program finishes executing, you'll see that your downloads directory is filled with images:
$ ls -l downloads/*.jpg | wc -l
500
However, these are just the raw captcha images; we need to extract and label each of the digits in the captchas to create our training set. To accomplish this, we'll use a bit of OpenCV and image processing techniques to make our life easier.
Annotating and Creating Our Dataset
So, how do we go about labeling and annotating each of our captcha images? Do we open Photoshop or GIMP and use the "select/marquee" tool to copy out a given digit, save it to disk, and then repeat ad nauseam? If we did, it might take us days of non-stop work to label each of the digits in the raw captcha images.
Instead, a better approach would be to use basic image processing techniques inside the OpenCV library to help us out. To see how we can label our dataset more efficiently, open a new file, name it annotate.py, and insert the following code:
# import the necessary packages
from imutils import paths
import argparse
import imutils
import cv2
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True,
    help="path to input directory of images")
ap.add_argument("-a", "--annot", required=True,
    help="path to output directory of annotations")
args = vars(ap.parse_args())
Lines 2-6 import our required Python packages, while Lines 9-14 parse our command line arguments. This script requires two arguments:
- --input: The input path to our raw captcha images (i.e., the downloads directory).
- --annot: The output path to where we'll be storing the labeled digits (i.e., the dataset directory).
Our next code block grabs the paths to all images in the --input directory and initializes a dictionary named counts that will store the total number of times a given digit (the key) has been labeled (the value):
# grab the image paths then initialize the dictionary of character
# counts
imagePaths = list(paths.list_images(args["input"]))
counts = {}
The actual annotation process starts below:
# loop over the image paths
for (i, imagePath) in enumerate(imagePaths):
    # display an update to the user
    print("[INFO] processing image {}/{}".format(i + 1,
        len(imagePaths)))

    try:
        # load the image and convert it to grayscale, then pad the
        # image to ensure digits caught on the border of the image
        # are retained
        image = cv2.imread(imagePath)
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        gray = cv2.copyMakeBorder(gray, 8, 8, 8, 8,
            cv2.BORDER_REPLICATE)
On Line 22, we start looping over each of the individual imagePaths. For each image, we load it from disk (Line 31), convert it to grayscale (Line 32), and pad the borders of the image with eight pixels in every direction (Lines 33 and 34). Figure 3 shows the difference between the original image (left) and the padded image (right).
We perform this padding just in case any of our digits are touching the border of the image. If the digits were touching the border, we wouldn't be able to extract them from the image. Thus, to prevent this situation, we purposely pad the input image so it's not possible for a given digit to touch the border.
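As a quick sanity check (this snippet is not part of annotate.py, and the 60×160 dimensions below are made up purely for illustration), you can confirm that the replicated border grows each dimension by 8 pixels on both sides:

import cv2
import numpy as np

# a dummy grayscale "captcha" used only to demonstrate the padding
gray = np.zeros((60, 160), dtype="uint8")
padded = cv2.copyMakeBorder(gray, 8, 8, 8, 8, cv2.BORDER_REPLICATE)

# each dimension grows by 16 pixels (8 on each side)
print(gray.shape)    # (60, 160)
print(padded.shape)  # (76, 176)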
We are now ready to binarize the input image via Otsu's thresholding method:
        # threshold the image to reveal the digits
        thresh = cv2.threshold(gray, 0, 255,
            cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
This function call automatically thresholds our image such that our image is now binary: black pixels represent the background while white pixels are our foreground, as shown in Figure 4.
Thresholding the image is a critical step in our image processing pipeline as we now need to find the outlines of each of the digits:
        # find contours in the image, keeping only the four largest
        # ones
        cnts = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL,
            cv2.CHAIN_APPROX_SIMPLE)
        # handle the differing return signatures of OpenCV 2.4/3 vs. 4
        cnts = imutils.grab_contours(cnts)
        cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:4]
Lines 42 and 43 find the contours (i.e., outlines) of each of the digits in the image. Just in case there is "noise" in the image, we sort the contours by their area, keeping only the four largest ones (i.e., our digits themselves).
Given our contours we can extract each of them by computing the bounding box:
        # loop over the contours
        for c in cnts:
            # compute the bounding box for the contour then extract
            # the digit
            (x, y, w, h) = cv2.boundingRect(c)
            roi = gray[y - 5:y + h + 5, x - 5:x + w + 5]

            # display the character, making it large enough for us
            # to see, then wait for a keypress
            cv2.imshow("ROI", imutils.resize(roi, width=28))
            key = cv2.waitKey(0)
On Line 48, we loop over each of the contours found in the thresholded image. We call cv2.boundingRect to compute the bounding box (x, y)-coordinates of the digit region. This region of interest (ROI) is then extracted from the grayscale image on Line 52. I have included a sample of example digits extracted from their raw captcha images as a montage in Figure 5.
Line 56 displays the digit ROI to our screen, resizing it to be large enough for us to see easily. Line 57 then waits for a keypress on your keyboard. Choose your keypress wisely: the key you press will be used as the label for the digit.
To see how the labeling process works via the cv2.waitKey call, take a look at the following code block:
            # if the '`' key is pressed, then ignore the character
            if key == ord("`"):
                print("[INFO] ignoring character")
                continue

            # grab the key that was pressed and construct the path
            # to the output directory
            key = chr(key).upper()
            dirPath = os.path.sep.join([args["annot"], key])

            # if the output directory does not exist, create it
            if not os.path.exists(dirPath):
                os.makedirs(dirPath)
If the backtick key "`" is pressed, we'll ignore the character (Lines 60 and 62). We may need to ignore a character if our script accidentally detects "noise" (i.e., anything but a digit) in the input image or if we are not sure what the digit is. Otherwise, we assume that the key pressed was the label for the digit (Line 66) and use the key to construct the directory path to our output label (Line 67).
For example, if I pressed the 7 key on my keyboard, the dirPath would be:
dataset/7
Therefore, all images containing the digit "7" will be stored in the dataset/7 subdirectory. Lines 70 and 71 make a check to see if the dirPath directory does not exist; if it doesn't, we create it.
Once we have ensured that dirPath properly exists, we simply have to write the example digit to file:
            # write the labeled character to file
            count = counts.get(key, 1)
            p = os.path.sep.join([dirPath, "{}.png".format(
                str(count).zfill(6))])
            cv2.imwrite(p, roi)

            # increment the count for the current key
            counts[key] = count + 1
Line 74 grabs the total number of examples written to disk thus far for the current digit. We then construct the output path to the example digit using the dirPath. After executing Lines 75 and 76, our output path p may look like:

dataset/7/000001.png
Again, notice how all example ROIs that contain the number seven will be stored in the dataset/7 subdirectory; this is an easy, convenient way to organize your datasets when labeling images.
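A nice side effect of this layout is that it is trivial to check your class balance while labeling. Here is a minimal sketch (not one of the project scripts) that tallies the number of labeled examples per digit, assuming the dataset directory described above:

import os
from imutils import paths

# count labeled examples per digit class in the dataset/ directory
counts = {}
for p in paths.list_images("dataset"):
    # the class name is the parent folder of the image
    label = p.split(os.path.sep)[-2]
    counts[label] = counts.get(label, 0) + 1

for label in sorted(counts):
    print("{}: {} examples".format(label, counts[label]))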
Our final code block handles the case where we want to control-c out of the script to exit, as well as the case where an error occurs while processing an image:
    # we are trying to control-c out of the script, so break from the
    # loop (you still need to press a key for the active window to
    # trigger this)
    except KeyboardInterrupt:
        print("[INFO] manually leaving script")
        break

    # an unknown error has occurred for this particular image
    except:
        print("[INFO] skipping image...")
If we wish to control-c and quit the script early, Line 85 detects this and allows our Python program to exit gracefully. Line 90 catches all other errors and simply ignores them, allowing us to continue with the labeling process.
The last thing you want when labeling a dataset is for a random error to occur due to an image encoding problem, causing your entire program to crash. If this happens, you'll have to restart the labeling process all over again. You can obviously build in extra logic to detect where you left off, as in the sketch below.
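For example, one way to make the labeling session resumable (this is optional and not implemented in annotate.py) is to seed the counts dictionary from whatever files already exist in the --annot directory, so a restarted run keeps appending new files instead of overwriting old ones:

# optional: instead of starting with counts = {}, seed it from files
# already written to the --annot directory on a previous run
counts = {}

if os.path.exists(args["annot"]):
    for label in os.listdir(args["annot"]):
        labelDir = os.path.sep.join([args["annot"], label])

        # only consider per-digit subdirectories
        if os.path.isdir(labelDir):
            # the next filename index is (number of existing files) + 1
            counts[label] = len(os.listdir(labelDir)) + 1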
To label the images you downloaded from the E-ZPass NY website, just execute the following command:
$ python annotate.py --input downloads --annot dataset
Here, you can see that the number 7 is displayed on my screen in Figure 6.
I then press the 7 key on my keyboard to label it, and the digit is then written to file in the dataset/7 subdirectory.
The annotate.py script then proceeds to the next digit for me to label. You can then proceed to label all of the digits in the raw captcha images. You'll quickly realize that labeling a dataset can be a very tedious, time-consuming process. Labeling all 2,000 digits should take you less than half an hour, but you'll likely become bored within the first five minutes.
Remember, actually obtaining your labeled dataset is half the battle. From there the actual work can start. Luckily, I have already labeled the digits for you! If you check the dataset directory included in the accompanying downloads of this tutorial, you'll find the entire dataset ready to go:
$ ls dataset/
1 2 3 4 5 6 7 8 9
$ ls -l dataset/1/*.png | wc -l
232
Here, you can see nine subdirectories, one for each of the digits that we wish to recognize. Inside each subdirectory, there are example images of the particular digit. Now that we have our labeled dataset, we can proceed to training our captcha breaker using the LeNet architecture.
Preprocessing the Digits
As we know, our Convolutional Neural Networks require an image with a fixed width and height to be passed in during training. However, our labeled digit images are of various sizes ā some are taller than they are wide, others are wider than they are tall. Therefore, we need a method to pad and resize our input images to a fixed size without distorting their aspect ratio.
We can resize and pad our images while preserving the aspect ratio by defining a preprocess function inside captchahelper.py:
# import the necessary packages
import imutils
import cv2

def preprocess(image, width, height):
    # grab the dimensions of the image, then initialize
    # the padding values
    (h, w) = image.shape[:2]

    # if the width is greater than the height then resize along
    # the width
    if w > h:
        image = imutils.resize(image, width=width)

    # otherwise, the height is greater than the width so resize
    # along the height
    else:
        image = imutils.resize(image, height=height)
Our preprocess function requires three parameters:
- image: The input image that we are going to pad and resize.
- width: The target output width of the image.
- height: The target output height of the image.
On Lines 12 and 13, we make a check to see if the width is greater than the height, and if so, we resize the image along the larger dimension (the width). Otherwise, if the height is greater than the width, we resize along the height (Lines 17 and 18). In either case, one dimension of the image (depending on the dimensions of the input) is now fixed at the target size.
However, the opposite dimension is smaller than it should be. To fix this issue, we can "pad" the image along the shorter dimension to obtain our fixed size:
    # determine the padding values for the width and height to
    # obtain the target dimensions
    padW = int((width - image.shape[1]) / 2.0)
    padH = int((height - image.shape[0]) / 2.0)

    # pad the image then apply one more resizing to handle any
    # rounding issues
    image = cv2.copyMakeBorder(image, padH, padH, padW, padW,
        cv2.BORDER_REPLICATE)
    image = cv2.resize(image, (width, height))

    # return the pre-processed image
    return image
Lines 22 and 23 compute the required amount of padding to reach the target width and height. Lines 27 and 28 apply the padding to the image. Applying this padding should bring our image to our target width and height; however, there may be cases where we are one pixel off in a given dimension. The easiest way to resolve this discrepancy is to simply call cv2.resize (Line 29) to ensure all images are the same width and height.
The reason we do not immediately call cv2.resize at the top of the function is that we first need to consider the aspect ratio of the input image and attempt to pad it correctly. If we do not maintain the image aspect ratio, then our digits will become distorted.
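A quick way to convince yourself that preprocess behaves as intended is to run it on a couple of dummy ROIs with different aspect ratios and confirm the output is always 28×28. This is just a sanity-check sketch (the array sizes below are arbitrary), not part of the project scripts:

import numpy as np
from pyimagesearch.utils.captchahelper import preprocess

# two dummy ROIs: one taller than it is wide, one wider than it is tall
tall = np.zeros((40, 15), dtype="uint8")
wide = np.zeros((12, 33), dtype="uint8")

# both should come out as 28x28, with the aspect ratio preserved via padding
print(preprocess(tall, 28, 28).shape)  # (28, 28)
print(preprocess(wide, 28, 28).shape)  # (28, 28)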
Training the Captcha Breaker
Now that our preprocess function is defined, we can move on to training LeNet on the image captcha dataset. Open the train_model.py file and insert the following code:
# import the necessary packages
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.optimizers import SGD
from pyimagesearch.nn.conv import LeNet
from pyimagesearch.utils.captchahelper import preprocess
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import argparse
import cv2
import os
Lines 2-14 import our required Python packages. Notice that we'll be using the SGD optimizer along with the LeNet architecture to train a model on the digits. We'll also be using our newly defined preprocess function on each digit before passing it through our network.
Next, let's review our command line arguments:
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
    help="path to input dataset")
ap.add_argument("-m", "--model", required=True,
    help="path to output model")
args = vars(ap.parse_args())
The train_model.py script requires two command line arguments:
- --dataset: The path to the input dataset of labeled captcha digits (i.e., the dataset directory on disk).
- --model: Here we supply the path to where our serialized LeNet weights will be saved after training.
We can now load our data and corresponding labels from disk:
# initialize the data and labels
data = []
labels = []

# loop over the input images
for imagePath in paths.list_images(args["dataset"]):
    # load the image, pre-process it, and store it in the data list
    image = cv2.imread(imagePath)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    image = preprocess(image, 28, 28)
    image = img_to_array(image)
    data.append(image)

    # extract the class label from the image path and update the
    # labels list
    label = imagePath.split(os.path.sep)[-2]
    labels.append(label)
On Lines 25 and 26, we initialize our data and labels lists, respectively. We then loop over every image in our labeled --dataset on Line 29. For each image in the dataset, we load it from disk, convert it to grayscale, and preprocess it such that it has a width of 28 pixels and a height of 28 pixels (Lines 31-33). The image is then converted to a Keras-compatible array and added to the data list (Lines 34 and 35).
One of the primary benefits of organizing your dataset directory structure in the format of:
root_directory/class_label/image_filename.jpg
is that you can easily extract the class label by grabbing the second-to-last component from the filename (Line 39). For example, given the input path dataset/7/000001.png, the label would be 7, which is then added to the labels list (Line 40).
Our next code block handles normalizing raw pixel intensity values to the range [0, 1], followed by constructing the training and testing splits, along with one-hot encoding the labels:
# scale the raw pixel intensities to the range [0, 1]
data = np.array(data, dtype="float") / 255.0
labels = np.array(labels)

# partition the data into training and testing splits using 75% of
# the data for training and the remaining 25% for testing
(trainX, testX, trainY, testY) = train_test_split(data,
    labels, test_size=0.25, random_state=42)

# convert the labels from integers to vectors
lb = LabelBinarizer().fit(trainY)
trainY = lb.transform(trainY)
testY = lb.transform(testY)
We can then initialize the LeNet model and SGD optimizer:
# initialize the model
print("[INFO] compiling model...")
model = LeNet.build(width=28, height=28, depth=1, classes=9)
opt = SGD(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=opt,
    metrics=["accuracy"])
Our input images will have a width of 28 pixels, a height of 28 pixels, and a single channel. There are a total of 9 digit classes we are recognizing (there is no 0 class).
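The LeNet class itself lives in the pyimagesearch.nn.conv module and is not reproduced in this tutorial. If you do not have that module handy, a minimal tf.keras implementation along the lines of LeNet.build might look like the sketch below; this is the classic CONV => RELU => POOL pattern and is not necessarily identical to the version shipped with the downloads for this tutorial:

# a minimal LeNet-style sketch (for reference only; the implementation in
# pyimagesearch.nn.conv may differ in its details)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Activation
from tensorflow.keras.layers import Flatten, Dense

class LeNet:
    @staticmethod
    def build(width, height, depth, classes):
        # initialize the model with a "channels last" input shape
        model = Sequential()
        inputShape = (height, width, depth)

        # first CONV => RELU => POOL layer set
        model.add(Conv2D(20, (5, 5), padding="same",
            input_shape=inputShape))
        model.add(Activation("relu"))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # second CONV => RELU => POOL layer set
        model.add(Conv2D(50, (5, 5), padding="same"))
        model.add(Activation("relu"))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # fully connected layer followed by the softmax classifier
        model.add(Flatten())
        model.add(Dense(500))
        model.add(Activation("relu"))
        model.add(Dense(classes))
        model.add(Activation("softmax"))

        return model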
Given the initialized model and optimizer we can train the network for 15 epochs, evaluate it, and serialize it to disk:
# train the network
print("[INFO] training network...")
H = model.fit(trainX, trainY, validation_data=(testX, testY),
    batch_size=32, epochs=15, verbose=1)

# evaluate the network
print("[INFO] evaluating network...")
predictions = model.predict(testX, batch_size=32)
print(classification_report(testY.argmax(axis=1),
    predictions.argmax(axis=1), target_names=lb.classes_))

# save the model to disk
print("[INFO] serializing network...")
model.save(args["model"])
Our last code block will handle plotting the accuracy and loss for both the training and testing sets over time:
# plot the training + testing loss and accuracy
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, 15), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, 15), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, 15), H.history["accuracy"], label="acc")
plt.plot(np.arange(0, 15), H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()
To train the LeNet architecture using the SGD optimizer on our custom captcha dataset, just execute the following command:
$ python train_model.py --dataset dataset --model output/lenet.hdf5
[INFO] compiling model...
[INFO] training network...
Train on 1509 samples, validate on 503 samples
Epoch 1/15
0s - loss: 2.1606 - acc: 0.1895 - val_loss: 2.1553 - val_acc: 0.2266
Epoch 2/15
0s - loss: 2.0877 - acc: 0.3565 - val_loss: 2.0874 - val_acc: 0.1769
Epoch 3/15
0s - loss: 1.9540 - acc: 0.5003 - val_loss: 1.8878 - val_acc: 0.3917
...
Epoch 15/15
0s - loss: 0.0152 - acc: 0.9993 - val_loss: 0.0261 - val_acc: 0.9980
[INFO] evaluating network...
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        45
          2       1.00      1.00      1.00        55
          3       1.00      1.00      1.00        63
          4       1.00      0.98      0.99        52
          5       0.98      1.00      0.99        51
          6       1.00      1.00      1.00        70
          7       1.00      1.00      1.00        50
          8       1.00      1.00      1.00        54
          9       1.00      1.00      1.00        63

avg / total       1.00      1.00      1.00       503

[INFO] serializing network...
As we can see, after only 15 epochs our network is obtaining 100% classification accuracy on both the training and validation sets. This is not a case of overfitting either: when we investigate the training and validation curves in Figure 7, we can see that by epoch 5 the validation and training loss/accuracy match each other.
If you check the output directory, you'll also see the serialized lenet.hdf5 file:
$ ls -l output/
total 9844
-rw-rw-r-- 1 adrian adrian 10076992 May  3 12:56 lenet.hdf5
We can then use this model on new input images.
Testing the Captcha Breaker
Now that our captcha breaker is trained, let's test it out on some example images. Open the test_model.py file and insert the following code:
# import the necessary packages
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.models import load_model
from pyimagesearch.utils.captchahelper import preprocess
from imutils import contours
from imutils import paths
import numpy as np
import argparse
import imutils
import cv2
As usual, our Python script starts with importing our Python packages. We'll again be using the preprocess function to prepare digits for classification.
Next, we'll parse our command line arguments:
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True,
    help="path to input directory of images")
ap.add_argument("-m", "--model", required=True,
    help="path to input model")
args = vars(ap.parse_args())
The --input switch controls the path to the input captcha images that we wish to break. We could download a new set of captchas from the E-ZPass NY website, but for simplicity, we'll sample images from our existing raw captcha files. The --model argument is simply the path to the serialized weights residing on disk.
We can now load our pre-trained CNN and randomly sample ten captcha images to classify:
# load the pre-trained network
print("[INFO] loading pre-trained network...")
model = load_model(args["model"])

# randomly sample a few of the input images
imagePaths = list(paths.list_images(args["input"]))
imagePaths = np.random.choice(imagePaths, size=(10,),
    replace=False)
Here comes the fun part, where we actually break the captcha:
# loop over the image paths
for imagePath in imagePaths:
    # load the image and convert it to grayscale, then pad the image
    # to ensure digits caught near the border of the image are
    # retained
    image = cv2.imread(imagePath)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.copyMakeBorder(gray, 20, 20, 20, 20,
        cv2.BORDER_REPLICATE)

    # threshold the image to reveal the digits
    thresh = cv2.threshold(gray, 0, 255,
        cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
On Line 30, we start looping over each of our sampled imagePaths. Just like in the annotate.py example, we need to extract each of the digits in the captcha. This extraction is accomplished by loading the image from disk, converting it to grayscale, and padding the border such that a digit cannot touch the boundary of the image (Lines 34-37). We add extra padding here so we have enough room to actually draw and visualize the correct prediction on the image.
Lines 40 and 41 threshold the image such that the digits appear as a white foreground against a black background.
We now need to find the contours of the digits in the thresh image:
    # find contours in the image, keeping only the four largest ones,
    # then sort them from left-to-right
    cnts = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE)
    # handle the differing return signatures of OpenCV 2.4/3 vs. 4
    cnts = imutils.grab_contours(cnts)
    cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:4]
    cnts = contours.sort_contours(cnts)[0]

    # initialize the output image as a "grayscale" image with 3
    # channels along with the output predictions
    output = cv2.merge([gray] * 3)
    predictions = []
We can find the digits by calling cv2.findContours on the thresh image. This function returns a list of contours, each of which is an array of (x, y)-coordinates that specify the outline of an individual digit.
We then perform two stages of sorting. The first stage sorts the contours by their size, keeping only the largest four outlines. We (correctly) assume that the four contours with the largest size are the digits we want to recognize. However, there is no guaranteed spatial ordering imposed on these contours; the third digit we wish to recognize may be first in the cnts list. Since we read digits from left-to-right, we need to sort the contours from left-to-right. This is accomplished via the sort_contours function (http://pyimg.co/sbm9p).
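If you would rather not rely on imutils for the spatial sort, the same left-to-right ordering can be obtained by sorting on the x-coordinate of each contour's bounding box; the one-liner below (placed inside the loop) is roughly what sort_contours does for its default "left-to-right" method:

# sort the four digit contours left-to-right by the x-coordinate of
# their bounding boxes (roughly equivalent to sort_contours here)
cnts = sorted(cnts, key=lambda c: cv2.boundingRect(c)[0])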
Line 53 takes our gray image and converts it to a three-channel image by replicating the grayscale channel three times (one for each Red, Green, and Blue channel). We then initialize the list of predictions made by the CNN on Line 54.
Given the contours of the digits in the captcha, we can now break it:
    # loop over the contours
    for c in cnts:
        # compute the bounding box for the contour then extract the
        # digit
        (x, y, w, h) = cv2.boundingRect(c)
        roi = gray[y - 5:y + h + 5, x - 5:x + w + 5]

        # pre-process the ROI and then classify it
        roi = preprocess(roi, 28, 28)
        roi = np.expand_dims(img_to_array(roi), axis=0) / 255.0
        pred = model.predict(roi).argmax(axis=1)[0] + 1
        predictions.append(str(pred))

        # draw the prediction on the output image
        cv2.rectangle(output, (x - 2, y - 2),
            (x + w + 4, y + h + 4), (0, 255, 0), 1)
        cv2.putText(output, str(pred), (x - 5, y - 5),
            cv2.FONT_HERSHEY_SIMPLEX, 0.55, (0, 255, 0), 2)
On Line 57, we loop over each of the outlines (which have been sorted from left-to-right) of the digits. We then extract the ROI of the digit on Lines 60 and 61 followed by preprocessing it on Lines 64 and 65.
Line 66 calls the .predict method of our model. The index with the largest probability returned by .predict will be our class label. We add 1 to this value since index values start at zero; however, there is no zero class, only classes for the digits 1-9. This prediction is then appended to the predictions list on Line 67.
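The + 1 offset works because LabelBinarizer sorts its classes, so for the string labels "1" through "9" the column at index i always corresponds to the digit i + 1. If you would rather not hard-code that assumption, one option (not part of the original scripts; the lb.pickle filename is just illustrative) is to pickle the fitted LabelBinarizer in train_model.py and look up the class name at test time:

import pickle

# in train_model.py, after fitting the LabelBinarizer:
with open("output/lb.pickle", "wb") as f:
    pickle.dump(lb, f)

# in test_model.py, load it back and map the argmax index to its label:
with open("output/lb.pickle", "rb") as f:
    lb = pickle.load(f)

pred = lb.classes_[model.predict(roi).argmax(axis=1)[0]]
predictions.append(str(pred))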
Lines 70 and 71 draw a bounding box surrounding the current digit, while Lines 72 and 73 draw the predicted digit on the output image itself.
Our last code block handles writing the broken captcha as a string to our terminal as well as displaying the output image:
    # show the output image
    print("[INFO] captcha: {}".format("".join(predictions)))
    cv2.imshow("Output", output)
    cv2.waitKey()
To see our captcha breaker in action, simply execute the following command:
$ python test_model.py --input downloads --model output/lenet.hdf5
Using TensorFlow backend.
[INFO] loading pre-trained network...
[INFO] captcha: 2696
[INFO] captcha: 2337
[INFO] captcha: 2571
[INFO] captcha: 8648
In Figure 8, I have included four samples generated from my run of test_model.py. In every case, we have correctly predicted the digit string and broken the image captcha using a simple network architecture trained on a small amount of training data.
Summary
In this tutorial, we learned how to:
- Gather a dataset of raw images.
- Label and annotate our images for training.
- Train a custom Convolutional Neural Network on our labeled dataset.
- Test and evaluate our model on example images.
To accomplish this, we scraped 500 example captcha images from the E-ZPass NY website. We then wrote a Python script that aids us in the labeling process, enabling us to quickly label the entire dataset and store the resulting images in an organized directory structure.
After our dataset was labeled, we trained the LeNet architecture on it using the SGD optimizer and categorical cross-entropy loss; the resulting model obtained 100% accuracy on the testing set with no overfitting. Finally, we visualized the results of the predicted digits to confirm that we have successfully devised a method to break the captcha.
Again, I want to remind you that this tutorial serves as only an example of how to obtain an image dataset and label it. Under no circumstances should you use this dataset or resulting model for nefarious reasons. If you are ever in a situation where you find that computer vision or deep learning can be used to exploit a vulnerability, be sure to practice responsible disclosure and attempt to report the issue to the proper stakeholders; failure to do so is unethical (as is misuse of this code, which, legally, I must say I cannot take responsibility for).
Secondly, this tutorial (like the next one on smile detection with deep learning) leveraged computer vision and the OpenCV library to facilitate building a complete application. If you are planning on becoming a serious deep learning practitioner, I highly recommend that you learn the fundamentals of image processing and the OpenCV library; having even a rudimentary understanding of these concepts will enable you to:
- Appreciate deep learning at a higher level.
- Develop more robust applications that use deep learning for image classification.
- Leverage image processing techniques to reach your goals more quickly.
A great example of using basic image processing techniques to our advantage can be found in the Annotating and Creating Our Dataset section above, where we were able to quickly annotate and label our dataset. Without using simple computer vision techniques, we would have been stuck manually cropping and saving the example digits to disk using image editing software such as Photoshop or GIMP. Instead, we were able to write a quick-and-dirty application that automatically extracted each digit from the captcha ā all we had to do was press the proper key on our keyboard to label the image.