In this tutorial, you will learn how to take any pre-trained deep learning image classifier and turn it into an object detector using Keras, TensorFlow, and OpenCV.
Today, we're starting a four-part series on deep learning and object detection:
- Part 1: Turning any deep learning image classifier into an object detector with Keras and TensorFlow (today's post)
- Part 2: OpenCV Selective Search for Object Detection
- Part 3: Region proposal for object detection with OpenCV, Keras, and TensorFlow
- Part 4: R-CNN object detection with Keras and TensorFlow
The goal of this series of posts is to obtain a deeper understanding of how deep learning-based object detectors work, and more specifically:
- How traditional computer vision object detection algorithms can be combined with deep learning
- What the motivations behind end-to-end trainable object detectors are, and the challenges associated with them
- And most importantly, how the seminal Faster R-CNN architecture came to be (we'll be building a variant of the R-CNN architecture throughout this series)
Today, we'll be starting with the fundamentals of object detection, including how to take a pre-trained image classifier and utilize image pyramids, sliding windows, and non-maxima suppression to build a basic object detector (think HOG + Linear SVM-inspired).
Over the coming weeks, we'll learn how to build an end-to-end trainable network from scratch.
But for today, let's start with the basics.
To learn how to take any Convolutional Neural Network image classifier and turn it into an object detector with Keras and TensorFlow, just keep reading.
Turning any CNN image classifier into an object detector with Keras, TensorFlow, and OpenCV
In the first part of this tutorial, we'll discuss the key differences between image classification and object detection tasks.
I'll then show you how you can take any Convolutional Neural Network trained for image classification and then turn it into an object detector, all in ~200 lines of code.
From there, we'll implement the code necessary to take an image classifier and turn it into an object detector using Keras, TensorFlow, and OpenCV.
Finally, we'll review the results of our work, noting some of the problems and limitations with our implementation, including how we can improve this method.
Image classification vs. object detection
When performing image classification, given an input image, we present it to our neural network, and we obtain a single class label and a probability associated with the class label prediction (Figure 1, left).
This class label is meant to characterize the contents of the entire image, or at least the most dominant, visible contents of the image.
We can thus think of image classification as:
- One image in
- One class label out
Object detection, on the other hand, not only tells us what is in the image (i.e., class label) but also where in the image the object is via bounding box (x, y)-coordinates (Figure 1, right).
Therefore, object detection algorithms allow us to:
- Input one image
- Obtain multiple bounding boxes and class labels as output
At its very core, any object detection algorithm, whether based on traditional computer vision or state-of-the-art deep learning, follows the same pattern:
1. Input: An image that we wish to apply object detection to
2. Output: Three values, including:
   - A list of bounding boxes, or the (x, y)-coordinates for each object in the image
   - The class label associated with each of the bounding boxes
   - The probability/confidence score associated with each bounding box and class label
Today, you'll see an example of this pattern in action.
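Before we go further, it may help to see this input/output contract sketched out as plain Python types. A minimal sketch follows; the `Box`, `Detection`, and `detect` names are purely hypothetical, just to make the pattern concrete:

```python
from typing import List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]      # (startX, startY, endX, endY)
Detection = Tuple[Box, str, float]   # bounding box, class label, confidence

def detect(image: np.ndarray) -> List[Detection]:
	# any object detector, whether traditional or deep learning-based,
	# fits this "one image in, multiple detections out" signature
	...
```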
How can we turn any deep learning image classifier into an object detector?
At this point, you're likely wondering:
Hey Adrian, if I have a Convolutional Neural Network trained for image classification, how in the world am I going to use it for object detection?
Based on your explanation above, it seems like image classification and object detection are fundamentally different, requiring two different types of network architectures.
And essentially, that is correct: object detection does require a specialized network architecture.
Anyone who has read papers on Faster R-CNN, Single Shot Detectors (SSDs), YOLO, RetinaNet, etc. knows that object detection networks are more complex and more involved, and take orders of magnitude more effort to implement compared to traditional image classification networks.
That said, there is a hack we can leverage to turn our CNN image classifier into an object detector, and the secret sauce lies in traditional computer vision algorithms.
Back before deep learning-based object detectors, the state-of-the-art was to use HOG + Linear SVM to detect objects in an image.
We'll be borrowing elements from HOG + Linear SVM to convert any deep neural network image classifier into an object detector.
The first key ingredient from HOG + Linear SVM is to use image pyramids.
An "image pyramid" is a multi-scale representation of an image:
Utilizing an image pyramid allows us to find objects at different scales (i.e., sizes) of an image (Figure 2).
At the bottom of the pyramid, we have the original image at its original size (in terms of width and height).
And at each subsequent layer, the image is resized (subsampled) and optionally smoothed (usually via Gaussian blurring).
The image is progressively subsampled until some stopping criterion is met, which is normally when a minimum size has been reached and no further subsampling needs to take place.
The second key ingredient we need is sliding windows:
As the name suggests, a sliding window is a fixed-size rectangle that slides from left-to-right and top-to-bottom within an image. (As Figure 3 demonstrates, our sliding window could be used to detect the face in the input image).
At each stop of the window we would:
- Extract the ROI
- Pass it through our image classifier (e.g., Linear SVM, CNN, etc.)
- Obtain the output predictions
Combined with image pyramids, sliding windows allow us to localize objects at different locations and multiple scales of the input image:
The final key ingredient we need is non-maxima suppression.
When performing object detection, our object detector will typically produce multiple, overlapping bounding boxes surrounding an object in an image.
This behavior is totally normal: it simply implies that as the sliding window approaches an object, our classifier component is returning larger and larger probabilities of a positive detection.
Of course, multiple bounding boxes pose a problem: there's only one object there, and we somehow need to collapse/remove the extraneous bounding boxes.
The solution to the problem is to apply non-maxima suppression (NMS), which collapses weak, overlapping bounding boxes in favor of the more confident ones:
On the left, we have multiple detections, while on the right, we have the output of non-maxima suppression, which collapses the multiple bounding boxes into a single detection.
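If you're curious about what NMS is doing under the hood, here is a minimal sketch of the greedy algorithm, written for illustration only (a simplification, not the exact imutils implementation we'll use later in this post):

```python
import numpy as np

def nms_sketch(boxes, probs, overlap_thresh=0.3):
	# boxes: (N, 4) array of (startX, startY, endX, endY); probs: (N,) scores
	idxs = np.argsort(probs)[::-1]   # most confident box first
	keep = []

	while len(idxs) > 0:
		# always keep the most confident remaining box
		i = idxs[0]
		keep.append(i)

		# compute the intersection of this box with all remaining boxes
		xx1 = np.maximum(boxes[i, 0], boxes[idxs[1:], 0])
		yy1 = np.maximum(boxes[i, 1], boxes[idxs[1:], 1])
		xx2 = np.minimum(boxes[i, 2], boxes[idxs[1:], 2])
		yy2 = np.minimum(boxes[i, 3], boxes[idxs[1:], 3])
		inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)

		# suppress any remaining box that overlaps the kept box too much
		areas = ((boxes[idxs[1:], 2] - boxes[idxs[1:], 0]) *
			(boxes[idxs[1:], 3] - boxes[idxs[1:], 1]))
		overlap = inter / np.maximum(areas, 1e-9)
		idxs = idxs[1:][overlap <= overlap_thresh]

	return boxes[keep]
```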
Combining traditional computer vision with deep learning to build an object detector
In order to take any Convolutional Neural Network trained for image classification and instead utilize it for object detection, we're going to rely on three key ingredients from traditional computer vision:
- Image pyramids: Localize objects at different scales/sizes.
- Sliding windows: Detect exactly where in the image a given object is.
- Non-maxima suppression: Collapse weak, overlapping bounding boxes.
The general flow of our algorithm will be:
- Step #1: Input an image
- Step #2: Construct an image pyramid
- Step #3: For each scale of the image pyramid, run a sliding window
- Step #3a: For each stop of the sliding window, extract the ROI
- Step #3b: Take the ROI and pass it through our CNN originally trained for image classification
- Step #3c: Examine the probability of the top class label of the CNN, and if it meets a minimum confidence, record (1) the class label and (2) the location of the sliding window
- Step #4: Apply class-wise non-maxima suppression to the bounding boxes
- Step #5: Return results to calling function
That may seem like a complicated process, but as you'll see in the remainder of this post, we can implement the entire object detection procedure in < 200 lines of code!
Configuring your development environment
To configure your system for this tutorial, I first recommend following either of these tutorials:
Either tutorial will help you configure your system with all the necessary software for this blog post in a convenient Python virtual environment.
Please note that PyImageSearch does not recommend or support Windows for CV/DL projects.
Project structure
Once you extract the .zip from the "Downloads" section of this blog post, your directory will be organized as follows:
```
.
├── images
│   ├── hummingbird.jpg
│   ├── lawn_mower.jpg
│   └── stingray.jpg
├── pyimagesearch
│   ├── __init__.py
│   └── detection_helpers.py
└── detect_with_classifier.py

2 directories, 6 files
```
Today's `pyimagesearch` module contains a Python file, `detection_helpers.py`, consisting of two helper functions:
- `image_pyramid`: Assists in generating copies of our image at different scales so that we can find objects of different sizes
- `sliding_window`: Helps us find where in the image an object is by sliding our classification window from left-to-right (column-wise) and top-to-bottom (row-wise)
Using these helper functions, our `detect_with_classifier.py` Python driver script accomplishes object detection by means of a classifier (using a sliding window and image pyramid approach). The classifier we're using is a pre-trained ResNet50 CNN trained on the ImageNet dataset, which consists of 1,000 classes of objects.
Three test images are provided in the `images/` directory. You should also test this script with images of your own; given that our classifier-based object detector can recognize 1,000 classes of objects, most everyday objects and animals can be recognized. Have fun with it!
Implementing our image pyramid and sliding window utility functions
In order to turn our CNN image classifier into an object detector, we must first implement helper utilities to construct sliding windows and image pyramids.
Let's implement these helper functions now: open up the `detection_helpers.py` file in the `pyimagesearch` module, and insert the following code:
```python
# import the necessary packages
import imutils

def sliding_window(image, step, ws):
	# slide a window across the image
	for y in range(0, image.shape[0] - ws[1], step):
		for x in range(0, image.shape[1] - ws[0], step):
			# yield the current window
			yield (x, y, image[y:y + ws[1], x:x + ws[0]])
```
We begin by importing my package of convenience functions, imutils.
From there, we dive right in by defining our `sliding_window` generator function. This function expects three parameters:
- `image`: The input image that we are going to loop over and generate windows from. This input image may come from the output of our image pyramid.
- `step`: Our step size, which indicates how many pixels we are going to "skip" in both the (x, y) directions. Normally, we would not want to loop over each and every pixel of the image (i.e., `step=1`), as this would be computationally prohibitive if we were applying an image classifier at each window. Instead, the step size is determined on a per-dataset basis and is tuned to give optimal performance based on your dataset of images. In practice, it's common to use a `step` of `4` to `8` pixels. Remember, the smaller your step size is, the more windows you'll need to examine.
- `ws`: The window size defines the width and height (in pixels) of the window we are going to extract from our `image`. If you scroll back to Figure 3, the window size is equivalent to the dimensions of the green box that is sliding across the image.
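To get a feel for how the step size affects the amount of work we do, here is a quick back-of-the-envelope calculation (the 600x400 image dimensions are hypothetical):

```python
# number of windows for a 600x400 image with step=16 and ws=(200, 150)
num_x = len(range(0, 600 - 200, 16))   # 25 x-positions
num_y = len(range(0, 400 - 150, 16))   # 16 y-positions
print(num_x * num_y)                   # 400 windows at this single scale
```

Halving the step to 8 roughly quadruples the number of windows, and therefore the number of classifier predictions we must make.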
The actual "sliding" of our window takes place on Lines 6-9 according to the following:
- Line 6 is our loop over our rows via determining a `range` of y-values.
- Line 7 is our loop over our columns (a `range` of x-values).
- Line 9 ultimately yields the window of our `image` (i.e., the ROI) according to the (x, y)-values, window size (`ws`), and `step` size.
The `yield` keyword is used in place of the `return` keyword because our `sliding_window` function is implemented as a Python generator.
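To make the generator behavior concrete, here is a quick, hypothetical sanity-check sketch; notice that no windows are computed until we actually iterate:

```python
import numpy as np
from pyimagesearch.detection_helpers import sliding_window

# a dummy 600x400 image, purely for illustration
image = np.zeros((400, 600, 3), dtype="uint8")

# each iteration lazily yields a single (x, y, window) tuple
for (x, y, window) in sliding_window(image, step=32, ws=(200, 150)):
	print(x, y, window.shape)   # e.g., "0 0 (150, 200, 3)"
```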
For more information on our sliding windows implementation, please refer to my previous Sliding Windows for Object Detection with Python and OpenCV article.
Now that we've successfully defined our sliding window routine, let's implement our `image_pyramid` generator used to construct a multi-scale representation of an input image:
```python
def image_pyramid(image, scale=1.5, minSize=(224, 224)):
	# yield the original image
	yield image

	# keep looping over the image pyramid
	while True:
		# compute the dimensions of the next image in the pyramid
		w = int(image.shape[1] / scale)
		image = imutils.resize(image, width=w)

		# if the resized image does not meet the supplied minimum
		# size, then stop constructing the pyramid
		if image.shape[0] < minSize[1] or image.shape[1] < minSize[0]:
			break

		# yield the next image in the pyramid
		yield image
```
Our `image_pyramid` function accepts three parameters as well:
- `image`: The input image for which we wish to generate multi-scale representations.
- `scale`: Our scale factor controls how much the image is resized at each layer. Smaller scale values yield more layers in the pyramid, and larger scale values yield fewer layers.
- `minSize`: Controls the minimum size of an output image (i.e., layer of our pyramid). This is important because we could effectively construct progressively smaller scaled representations of our input image infinitely. Without a `minSize` parameter, our `while` loop would continue forever (which is not what we want).
Now that we know the parameters that must be inputted to the function, letās dive into the internals of our image pyramid generator function.
Referring to Figure 2, notice that the largest representation of our image is the input image itself. Line 13 of our generator simply yields the original, unaltered `image` the first time our generator is asked to produce a layer of our pyramid.
Subsequent generated images are controlled by the infinite `while True` loop beginning on Line 16.
Inside the loop, we first compute the dimensions of the next image in the pyramid according to our `scale` and the original `image` dimensions (Line 18). In this case, we simply divide the width of the input image by the `scale` to determine our new width (`w`).
From there, we go ahead and `resize` the `image` down to that width while maintaining aspect ratio (Line 19). As you can see, we are using the aspect-aware resizing helper built into my imutils package.
While we are effectively done (we've resized our `image`, and now we can `yield` it), we need to implement an exit condition so that our generator knows to stop. As we learned when we defined our parameters to the `image_pyramid` function, the exit condition is determined by the `minSize` parameter. Therefore, the conditional on Lines 23 and 24 determines whether our resized image is too small (in height or width) and exits the loop accordingly.
Assuming our scaled output `image` passes our `minSize` threshold, Line 27 yields it to the caller.
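As with our sliding window, a quick hypothetical sanity check shows the generator in action, printing the shape of each successively smaller layer:

```python
import cv2
from pyimagesearch.detection_helpers import image_pyramid

# load any test image (the path here is just an example)
image = cv2.imread("images/stingray.jpg")

# each layer is 1.5x smaller than the last, stopping near 224x224
for layer in image_pyramid(image, scale=1.5, minSize=(224, 224)):
	print(layer.shape)
```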
For more details, please refer to my Image Pyramids with Python and OpenCV article, which also includes an alternative scikit-image image pyramid implementation that may be useful to you.
Using Keras and TensorFlow to turn a pre-trained image classifier into an object detector
With our `sliding_window` and `image_pyramid` functions implemented, let's now use them to take a deep neural network trained for image classification and turn it into an object detector.
Open up a new file, name it `detect_with_classifier.py`, and let's begin coding:
```python
# import the necessary packages
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.applications import imagenet_utils
from imutils.object_detection import non_max_suppression
from pyimagesearch.detection_helpers import sliding_window
from pyimagesearch.detection_helpers import image_pyramid
import numpy as np
import argparse
import imutils
import time
import cv2
```
This script begins with a selection of imports including:
- `ResNet50`: The popular ResNet Convolutional Neural Network (CNN) classifier by He et al. introduced in their 2015 paper, Deep Residual Learning for Image Recognition. We will load this CNN with pre-trained ImageNet weights.
- `non_max_suppression`: An implementation of NMS in my imutils package.
- `sliding_window`: Our sliding window generator function, as described in the previous section.
- `image_pyramid`: The image pyramid generator that we defined previously.
Now that our imports are taken care of, let's parse command line arguments:
```python
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to the input image")
ap.add_argument("-s", "--size", type=str, default="(200, 150)",
	help="ROI size (in pixels)")
ap.add_argument("-c", "--min-conf", type=float, default=0.9,
	help="minimum probability to filter weak detections")
ap.add_argument("-v", "--visualize", type=int, default=-1,
	help="whether or not to show extra visualizations for debugging")
args = vars(ap.parse_args())
```
The following arguments must be supplied to this Python script at runtime from your terminal:
- `--image`: The path to the input image for classification-based object detection.
- `--size`: A tuple describing the size of the sliding window. This tuple must be surrounded by quotes for our argument parser to grab it directly from the command line.
- `--min-conf`: The minimum probability threshold to filter weak detections.
- `--visualize`: A switch to determine whether to show additional visualizations for debugging.
We now have a handful of constants to define for our object detection procedures:
```python
# initialize variables used for the object detection procedure
WIDTH = 600
PYR_SCALE = 1.5
WIN_STEP = 16
ROI_SIZE = eval(args["size"])
INPUT_SIZE = (224, 224)
```
Our classifier-based object detection methodology constants include:
- `WIDTH`: Given that the images in `images/` provided for testing (refer to the "Project Structure" section) are all slightly different in size, we set a constant width here for later resizing purposes. By ensuring our images have a consistent starting width, we know that the image will fit on our screen.
- `PYR_SCALE`: Our image pyramid scale factor. This value controls how much the image is resized at each layer. Smaller scale values yield more layers in the pyramid, and larger scales yield fewer layers. The fewer layers you have, the faster the overall object detection system will operate, potentially at the expense of accuracy.
- `WIN_STEP`: Our sliding window step size, which indicates how many pixels we are going to "skip" in both the (x, y) directions. Remember, the smaller your step size is, the more windows you'll need to examine, which leads to a slower overall object detection execution time. In practice, I would recommend trying values of `4` and `8` to start with (depending on the dimensions of your input and your `minSize`).
- `ROI_SIZE`: Controls the aspect ratio of the objects we want to detect; if a mistake is made setting the aspect ratio, it will be nearly impossible to detect objects. Additionally, this value is related to the image pyramid `minSize` value, giving our image pyramid generator a means of exiting. As you can see, this value comes directly from our `--size` command line argument.
- `INPUT_SIZE`: The classification CNN input dimensions. Note that the tuple defined here on Line 32 heavily depends on the CNN you are using (in our case, it is ResNet50). If this notion doesn't resonate with you, I suggest you read this tutorial and, more specifically, the section entitled "Can I make the input dimensions [of a CNN] anything I want?"
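To make the interplay between these constants concrete, here is a small illustrative calculation of the pyramid layer widths they produce, assuming the input image stays tall enough at every layer:

```python
# back-of-the-envelope sketch, not part of the driver script
width, scale, min_w = 600, 1.5, 200   # WIDTH, PYR_SCALE, ROI_SIZE[0]

while width >= min_w:
	print(width)                # prints 600, 400, 266
	width = int(width / scale)
```

In other words, with these default values, our detector examines the image at three scales.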
Understanding what each of the above constants controls is crucial to your understanding of how to turn an image classifier into an object detector with Keras, TensorFlow, and OpenCV. Be sure to mentally distinguish each of these before moving on.
Let's load our ResNet classification CNN and input image:
```python
# load our network weights from disk
print("[INFO] loading network...")
model = ResNet50(weights="imagenet", include_top=True)

# load the input image from disk, resize it such that it has the
# supplied width, and then grab its dimensions
orig = cv2.imread(args["image"])
orig = imutils.resize(orig, width=WIDTH)
(H, W) = orig.shape[:2]
```
Line 36 loads ResNet pre-trained on ImageNet. If you choose to use a different pre-trained classifier, you can substitute one here for your particular project. To learn how to train your own classifier, I suggest you read Deep Learning for Computer Vision with Python.
We also load our input `--image`. Once it is loaded, we `resize` it (while maintaining aspect ratio according to our constant `WIDTH`) and grab the resulting image dimensions.
From here, we're ready to initialize our image pyramid generator object:
```python
# initialize the image pyramid
pyramid = image_pyramid(orig, scale=PYR_SCALE, minSize=ROI_SIZE)

# initialize two lists, one to hold the ROIs generated from the image
# pyramid and sliding window, and another list used to store the
# (x, y)-coordinates of where the ROI was in the original image
rois = []
locs = []

# time how long it takes to loop over the image pyramid layers and
# sliding window locations
start = time.time()
```
On Line 45, we supply the necessary parameters to our `image_pyramid` generator function. Given that `pyramid` is a generator object at this point, we can loop over the values it produces.
Before we do just that, Lines 50 and 51 initialize two lists:
- `rois`: Holds the regions of interest (ROIs) generated from the pyramid + sliding window output
- `locs`: Stores the (x, y)-coordinates of where the ROI was in the original image
And we also set a `start` timestamp so we can later determine how long our classification-based object detection method (given our parameters) took on the input image (Line 55).
Let's loop over each `image` our `pyramid` produces:
```python
# loop over the image pyramid
for image in pyramid:
	# determine the scale factor between the *original* image
	# dimensions and the *current* layer of the pyramid
	scale = W / float(image.shape[1])

	# for each layer of the image pyramid, loop over the sliding
	# window locations
	for (x, y, roiOrig) in sliding_window(image, WIN_STEP, ROI_SIZE):
		# scale the (x, y)-coordinates of the ROI with respect to the
		# *original* image dimensions
		x = int(x * scale)
		y = int(y * scale)
		w = int(ROI_SIZE[0] * scale)
		h = int(ROI_SIZE[1] * scale)

		# take the ROI and preprocess it so we can later classify
		# the region using Keras/TensorFlow
		roi = cv2.resize(roiOrig, INPUT_SIZE)
		roi = img_to_array(roi)
		roi = preprocess_input(roi)

		# update our list of ROIs and associated coordinates
		rois.append(roi)
		locs.append((x, y, x + w, y + h))
```
Looping over the layers of our image pyramid begins on Line 58.
Our first step in the loop is to compute the `scale` factor between the original image dimensions (`W`) and the current layer dimensions (`image.shape[1]`) of our pyramid (Line 61). We need this value to later upscale our object bounding boxes.
Now we'll cascade into our sliding window loop from this particular layer in our image pyramid. Our `sliding_window` generator allows us to look side-to-side and up-and-down in our image. For each ROI that it generates, we'll soon apply image classification.
Line 65 defines our loop over our sliding windows. Inside, we:
- Scale coordinates (Lines 68-71).
- Grab the ROI and preprocess it (Lines 75-77). Preprocessing includes resizing to the CNN's required `INPUT_SIZE`, converting the image to array format, and applying Keras' preprocessing convenience function, which converts the image from RGB to BGR channel ordering and zero-centers the color channels according to the ImageNet dataset.
- Update the list of `rois` and associated `locs` coordinates (Lines 80 and 81).
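As a concrete example of the coordinate scaling above (the numbers here are hypothetical): suppose the current pyramid layer is 400 pixels wide, the original image is 600 pixels wide, and the window sits at (x=32, y=16):

```python
# illustrative values only
scale = 600 / float(400)                       # W / image.shape[1] = 1.5
(x, y) = (int(32 * scale), int(16 * scale))    # (48, 24) in original coords
(w, h) = (int(200 * scale), int(150 * scale))  # ROI_SIZE scaled to (300, 225)
print((x, y, x + w, y + h))                    # (48, 24, 348, 249)
```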
We also handle optional visualization:
```python
		# check to see if we are visualizing each of the sliding
		# windows in the image pyramid
		if args["visualize"] > 0:
			# clone the original image and then draw a bounding box
			# surrounding the current region
			clone = orig.copy()
			cv2.rectangle(clone, (x, y), (x + w, y + h),
				(0, 255, 0), 2)

			# show the visualization and current ROI
			cv2.imshow("Visualization", clone)
			cv2.imshow("ROI", roiOrig)
			cv2.waitKey(0)
```
Here, we visualize both the original image with a green box indicating where we are "looking" and the resized ROI, which is ready for classification (Lines 85-95). As you can see, we only visualize when the `--visualize` flag is set via the command line.
Next, we'll (1) check our benchmark on the pyramid + sliding window process, (2) classify all of our `rois` in batch, and (3) decode the predictions:
```python
# show how long it took to loop over the image pyramid layers and
# sliding window locations
end = time.time()
print("[INFO] looping over pyramid/windows took {:.5f} seconds".format(
	end - start))

# convert the ROIs to a NumPy array
rois = np.array(rois, dtype="float32")

# classify each of the proposal ROIs using ResNet and then show how
# long the classifications took
print("[INFO] classifying ROIs...")
start = time.time()
preds = model.predict(rois)
end = time.time()
print("[INFO] classifying ROIs took {:.5f} seconds".format(
	end - start))

# decode the predictions and initialize a dictionary which maps class
# labels (keys) to any ROIs associated with that label (values)
preds = imagenet_utils.decode_predictions(preds, top=1)
labels = {}
```
First, we `end` our pyramid + sliding window timer and show how long the process took (Lines 99-101).
Then, we take the ROIs and pass them (in batch) through our pre-trained image classifier (i.e., ResNet) via `predict` (Lines 104-118). As you can see, we print out a benchmark for the inference process here too.
Finally, Line 117 decodes the predictions, grabbing only the top prediction for each ROI.
We'll need a means to map class labels (keys) to ROI locations associated with that label (values); the `labels` dictionary (Line 118) serves that purpose.
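For intuition, once the loop below runs, the `labels` dictionary might look something like this (the boxes and probabilities shown are hypothetical):

```python
# class label -> list of (bounding box, probability) tuples
labels = {
	"stingray": [
		((48, 24, 348, 249), 0.9621),
		((64, 24, 364, 249), 0.9445),
	]
}
```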
Let's go ahead and populate our `labels` dictionary now:
```python
# loop over the predictions
for (i, p) in enumerate(preds):
	# grab the prediction information for the current ROI
	(imagenetID, label, prob) = p[0]

	# filter out weak detections by ensuring the predicted probability
	# is greater than the minimum probability
	if prob >= args["min_conf"]:
		# grab the bounding box associated with the prediction and
		# convert the coordinates
		box = locs[i]

		# grab the list of predictions for the label and add the
		# bounding box and probability to the list
		L = labels.get(label, [])
		L.append((box, prob))
		labels[label] = L
```
Looping over the predictions beginning on Line 121, we first grab the prediction information for the current ROI, including the ImageNet ID, class `label`, and probability (Line 123).
From there, we check to see if the minimum confidence has been met (Line 127). Assuming so, we update the `labels` dictionary (Lines 130-136) with the bounding `box` and `prob` score tuple (value) associated with each class `label` (key).
As a recap, so far, we have:
- Generated scaled images with our image pyramid
- Generated ROIs using a sliding window approach for each layer (scaled image) of our image pyramid
- Performed classification on each ROI and placed the results in our `labels` dictionary
We're not quite done yet with turning our image classifier into an object detector with Keras, TensorFlow, and OpenCV. Now, we need to visualize the results.
This is the time where you would implement logic to do something useful with the results (`labels`), whereas in our case, we're simply going to annotate the objects. We will also have to handle our overlapping detections by means of non-maxima suppression (NMS).
Let's go ahead and loop over all the keys in our `labels` dictionary:
```python
# loop over the labels for each of the detected objects in the image
for label in labels.keys():
	# clone the original image so that we can draw on it
	print("[INFO] showing results for '{}'".format(label))
	clone = orig.copy()

	# loop over all bounding boxes for the current label
	for (box, prob) in labels[label]:
		# draw the bounding box on the image
		(startX, startY, endX, endY) = box
		cv2.rectangle(clone, (startX, startY), (endX, endY),
			(0, 255, 0), 2)

	# show the results *before* applying non-maxima suppression, then
	# clone the image again so we can display the results *after*
	# applying non-maxima suppression
	cv2.imshow("Before", clone)
	clone = orig.copy()
```
Our loop over the `labels` for each of the detected objects begins on Line 139.
We make a `copy` of the original input image so that we can annotate it (Line 142).
We then annotate all bounding boxes for the current `label` (Lines 145-149).
So that we can visualize the before/after of applying NMS, Line 154 displays the "before" image, and then we proceed to make another `copy` (Line 155).
Now, let's apply NMS and display our "after" NMS visualization:
```python
	# extract the bounding boxes and associated prediction
	# probabilities, then apply non-maxima suppression
	boxes = np.array([p[0] for p in labels[label]])
	proba = np.array([p[1] for p in labels[label]])
	boxes = non_max_suppression(boxes, proba)

	# loop over all bounding boxes that were kept after applying
	# non-maxima suppression
	for (startX, startY, endX, endY) in boxes:
		# draw the bounding box and label on the image
		cv2.rectangle(clone, (startX, startY), (endX, endY),
			(0, 255, 0), 2)
		y = startY - 10 if startY - 10 > 10 else startY + 10
		cv2.putText(clone, label, (startX, y),
			cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 255, 0), 2)

	# show the output after applying non-maxima suppression
	cv2.imshow("After", clone)
	cv2.waitKey(0)
```
To apply NMS, we first extract the bounding `boxes` and associated prediction probabilities (`proba`) on Lines 159 and 160. We then pass those results into my imutils implementation of NMS (Line 161). For more details on non-maxima suppression, be sure to refer to my blog post.
After NMS has been applied, Lines 165-171 annotate the bounding box rectangles and labels on the "after" image. Lines 174 and 175 display the results until a key is pressed, at which point all GUI windows close, and the script exits.
Great job! In the next section, we'll analyze the results of our method for using an image classifier for object detection purposes.
Image classifier to object detector results using Keras and TensorFlow
At this point, we are ready to see the results of our hard work.
Make sure you use the "Downloads" section of this tutorial to download the source code and example images from this blog post.
From there, open up a terminal, and execute the following command:
```
$ python detect_with_classifier.py --image images/stingray.jpg --size "(300, 150)"
[INFO] loading network...
[INFO] looping over pyramid/windows took 0.19142 seconds
[INFO] classifying ROIs...
[INFO] classifying ROIs took 9.67027 seconds
[INFO] showing results for 'stingray'
```
Here, you can see that I have inputted an example image containing a "stingray", which CNNs trained on ImageNet will be able to recognize (since ImageNet contains a "stingray" class).
Figure 7 (top) shows the original output from our object detection procedure.
Notice how there are multiple, overlapping bounding boxes surrounding the stingray.
Applying non-maxima suppression (Figure 7, bottom) collapses the bounding boxes into a single detection.
Let's try another image, this one of a hummingbird (again, which networks trained on ImageNet will be able to recognize):
```
$ python detect_with_classifier.py --image images/hummingbird.jpg --size "(250, 250)"
[INFO] loading network...
[INFO] looping over pyramid/windows took 0.07845 seconds
[INFO] classifying ROIs...
[INFO] classifying ROIs took 4.07912 seconds
[INFO] showing results for 'hummingbird'
```
Figure 8 (top) shows the original output of our detection procedure, while the bottom shows the output after applying non-maxima suppression.
Again, our āimage classifier turned object detectorā procedure performed well here.
But let's now try an example image where our object detection algorithm doesn't perform optimally:
```
$ python detect_with_classifier.py --image images/lawn_mower.jpg --size "(200, 200)"
[INFO] loading network...
[INFO] looping over pyramid/windows took 0.13851 seconds
[INFO] classifying ROIs...
[INFO] classifying ROIs took 7.00178 seconds
[INFO] showing results for 'lawn_mower'
[INFO] showing results for 'half_track'
```
At first glance, it appears this method worked perfectly: we were able to localize the "lawn mower" in the input image.
But there was actually a second detection for a "half-track" (a military vehicle that has regular wheels on the front and tank-like tracks on the back):
Clearly, there is not a half-track in this image, so how do we improve the results of our object detection procedure?
The answer is to increase our `--min-conf` to remove false-positive predictions:
```
$ python detect_with_classifier.py --image images/lawn_mower.jpg --size "(200, 200)" --min-conf 0.95
[INFO] loading network...
[INFO] looping over pyramid/windows took 0.13618 seconds
[INFO] classifying ROIs...
[INFO] classifying ROIs took 6.99953 seconds
[INFO] showing results for 'lawn_mower'
```
By increasing the minimum confidence to 95%, we have filtered out the less confident "half-track" prediction, leaving only the (correct) "lawn mower" object detection.
While our procedure for turning a pre-trained image classifier into an object detector isn't perfect, it can still be used for certain situations, specifically when images are captured in controlled environments.
In the rest of this series, we'll be learning how to improve upon our object detection results and build a more robust deep learning-based object detector.
Problems, limitations, and next steps
If you carefully inspect the results of our object detection procedure, you'll notice a few key takeaways:
- The actual object detector is slow. Constructing all the image pyramid layers and sliding window locations takes ~1/10th of a second, and that doesn't even include the time it takes for the network to make predictions on all the ROIs (4-9 seconds on a 3 GHz CPU)!
- Bounding box locations aren't necessarily accurate. The largest issue with this object detection algorithm is that the accuracy of our detections is dependent on our selection of image pyramid scale, sliding window step, and ROI size. If any one of these values is off, then our detector is going to perform suboptimally.
- The network is not end-to-end trainable. The reason deep learning-based object detectors such as Faster R-CNN, SSDs, YOLO, etc. perform so well is that they are end-to-end trainable, meaning that any error in bounding box predictions can be made more accurate through backpropagation and updating the weights of the network; since we're using a pre-trained image classifier with fixed weights, we cannot backpropagate error terms through the network.
Throughout this four-part series, we'll be examining how to resolve these issues and build an object detector similar to the R-CNN family of networks.
Summary
In this tutorial, you learned how to take any pre-trained deep learning image classifier and turn it into an object detector using Keras, TensorFlow, and OpenCV.
To accomplish this task, we combined deep learning with traditional computer vision algorithms:
- In order to detect objects at different scales (i.e., sizes), we utilized image pyramids, which take our input image and repeatedly downsample it.
- To detect objects at different locations, we used sliding windows, which slide a fixed-size window from left-to-right and top-to-bottom across the input image; at each stop of the window, we extract the ROI and pass it through our image classifier.
- It's natural for object detection algorithms to produce multiple, overlapping bounding boxes for objects in an image; in order to "collapse" these overlapping bounding boxes into a single detection, we applied non-maxima suppression.
The end results of our hacked together object detection routine were fairly reasonable, but there were two primary problems:
- The network is not end-to-end trainable. We're not actually "learning" to detect objects; we're instead just taking ROIs and classifying them using a CNN trained for image classification.
- The object detection results are incredibly slow. On my Intel Xeon W 3 GHz processor, applying object detection to a single image took ~4-9.5 seconds, depending on the input image resolution. Such an object detector could not be applied in real time.
In order to fix both of these problems, next week, we'll start exploring the algorithms necessary to build an object detector from the R-CNN, Fast R-CNN, and Faster R-CNN family.
This will be a great series of tutorials, so you won't want to miss them!
To download the source code to this post (and be notified when the next tutorial in this series publishes), simply enter your email address in the form below!