In this tutorial, you will learn how to perform semantic segmentation using OpenCV, deep learning, and the ENet architecture. After reading today’s guide, you will be able to apply semantic segmentation to images and video using OpenCV.
Deep learning has helped facilitate unprecedented accuracy in computer vision, including image classification, object detection, and now even segmentation.
Traditional segmentation involves partitioning an image into parts (Normalized Cuts, Graph Cuts, GrabCut, superpixels, etc.); however, these algorithms have no actual understanding of what the parts represent.
Semantic segmentation algorithms on the other hand attempt to:
- Partition the image into meaningful parts
- While at the same time, associate every pixel in an input image with a class label (i.e., person, road, car, bus, etc.)
Semantic segmentation algorithms are super powerful and have many use cases, including self-driving cars — and in today’s post, I’ll be showing you how to apply semantic segmentation to road-scene images/video!
To learn how to apply semantic segmentation using OpenCV and deep learning, just keep reading!
Semantic segmentation with OpenCV and deep learning
In the first part of today’s blog post, we will discuss the ENet deep learning architecture.
From there, I’ll demonstrate how to use ENet to apply semantic segmentation to both images and video streams.
Along the way, I’ll be sharing example outputs from the segmentation so you can get a feel for what to expect when applying semantic segmentation to your own projects.
The ENet semantic segmentation architecture
The semantic segmentation architecture we’re using for this tutorial is ENet, which is based on Paszke et al.’s 2016 publication, ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation.
One of the primary benefits of ENet is that it’s fast — up to 18x faster, with 79x fewer parameters and similar or better accuracy than larger models. The model size itself is only 3.2MB!
A single forward pass on a CPU took 0.2 seconds on my machine — if I were to use a GPU this segmentation network could run even faster. Paszke et al. trained their model on the Cityscapes dataset, a semantic, instance-wise, dense pixel annotation of 20-30 classes (depending on which model you’re using).
As the name suggests, the Cityscapes dataset includes examples of images that can be used for urban scene understanding, including for self-driving vehicles.
The particular model we’re using is trained on 20 classes, including:
- Unlabeled (i.e., background)
- Road
- Sidewalk
- Building
- Wall
- Fence
- Pole
- TrafficLight
- TrafficSign
- Vegetation
- Terrain
- Sky
- Person
- Rider
- Car
- Truck
- Bus
- Train
- Motorcycle
- Bicycle
In the rest of this blog post, you’ll learn how to apply semantic segmentation to extract a dense, pixel-wise map of each of these classes in both images and video streams.
If you’re interested in training your own ENet models for segmentation on your own custom datasets, be sure to refer to this page where the authors have provided a tutorial on how to do so.
Project structure
Today’s project can be obtained from the “Downloads” section of this blog post. Let’s take a look at our project structure using the tree command:
$ tree --dirsfirst
.
├── enet-cityscapes
│   ├── enet-classes.txt
│   ├── enet-colors.txt
│   └── enet-model.net
├── images
│   ├── example_01.png
│   ├── example_02.jpg
│   ├── example_03.jpg
│   └── example_04.png
├── videos
│   ├── massachusetts.mp4
│   └── toronto.mp4
├── output
├── segment.py
└── segment_video.py

4 directories, 11 files
Our project has four directories:

- enet-cityscapes/: Contains our pre-trained deep learning model, classes list, and color labels to correspond with the classes.
- images/: A selection of four sample images to test our image segmentation script.
- videos/: Includes two sample videos for testing our deep learning segmentation video script. Credits for these videos are listed in the “Video segmentation results” section.
- output/: For organizational purposes, I like to have my script save the processed videos to the output/ folder. I’m not including the output images/videos in the downloads as the file sizes are quite large. You’ll need to use today’s code to generate them on your own.

Today we’ll be reviewing two Python scripts:

- segment.py: Performs deep learning semantic segmentation on a single image. We’ll walk through this script to learn how segmentation works and then test it on single images before moving on to video.
- segment_video.py: As the name suggests, this script will perform semantic segmentation on video.
Semantic segmentation in images with OpenCV
Let’s go ahead and get started — open up the segment.py file and insert the following code:
# import the necessary packages
import numpy as np
import argparse
import imutils
import time
import cv2
We begin by importing necessary packages.
For this script, I recommend OpenCV 3.4.1 or higher. You can follow one of my installation tutorials — just be sure to specify which version of OpenCV you want to download and install as you follow the steps.
You’ll also need to install my package of OpenCV convenience functions, imutils — just use pip to install the package:
$ pip install --upgrade imutils
If you are using Python virtual environments, don’t forget to use the workon command before using pip to install imutils!
Moving on, let’s parse our command line arguments:
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", required=True,
    help="path to deep learning segmentation model")
ap.add_argument("-c", "--classes", required=True,
    help="path to .txt file containing class labels")
ap.add_argument("-i", "--image", required=True,
    help="path to input image")
ap.add_argument("-l", "--colors", type=str,
    help="path to .txt file containing colors for labels")
ap.add_argument("-w", "--width", type=int, default=500,
    help="desired width (in pixels) of input image")
args = vars(ap.parse_args())
This script has five command line arguments, two of which are optional:

- --model: The path to our deep learning semantic segmentation model.
- --classes: The path to a text file containing class labels.
- --image: Our input image file path.
- --colors: Optional path to a colors text file. If no file is specified, random colors will be assigned to each class.
- --width: Optional desired image width. By default the value is 500 pixels.
If you aren’t familiar with the concept of argparse and command line arguments, definitely review this blog post which covers command line arguments in-depth.
Let’s handle parsing our class labels file and colors next:
# load the class label names
CLASSES = open(args["classes"]).read().strip().split("\n")

# if a colors file was supplied, load it from disk
if args["colors"]:
    COLORS = open(args["colors"]).read().strip().split("\n")
    COLORS = [np.array(c.split(",")).astype("int") for c in COLORS]
    COLORS = np.array(COLORS, dtype="uint8")

# otherwise, we need to randomly generate RGB colors for each class
# label
else:
    # initialize a list of colors to represent each class label in
    # the mask (starting with 'black' for the background/unlabeled
    # regions)
    np.random.seed(42)
    COLORS = np.random.randint(0, 255, size=(len(CLASSES) - 1, 3),
        dtype="uint8")
    COLORS = np.vstack([[0, 0, 0], COLORS]).astype("uint8")
We load our CLASSES into memory from the supplied text file where the path is contained in the command line args dictionary (Line 23).

If a pre-specified set of COLORS for each class label is provided in a text file (one per line), we load them into memory (Lines 26-29). Otherwise, we randomly generate COLORS for each label (Lines 33-40).
For testing purposes (and since we have 20 classes), let’s create a pretty color lookup legend using OpenCV drawing functions:
# initialize the legend visualization
legend = np.zeros(((len(CLASSES) * 25) + 25, 300, 3), dtype="uint8")

# loop over the class names + colors
for (i, (className, color)) in enumerate(zip(CLASSES, COLORS)):
    # draw the class name + color on the legend
    color = [int(c) for c in color]
    cv2.putText(legend, className, (5, (i * 25) + 17),
        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
    cv2.rectangle(legend, (100, (i * 25)), (300, (i * 25) + 25),
        tuple(color), -1)
Here we generate a legend visualization so we can easily (and visually) associate a class label with a color. The legend consists of the class label and a colored rectangle next to it. It is quickly built by creating a canvas (Line 43) and dynamically adding each entry with a loop (Lines 46-52). Drawing basics are covered in this blog post.
Here’s the result:
The deep learning segmentation heavy lifting takes place in the next block:
# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNet(args["model"])

# load the input image, resize it, and construct a blob from it,
# but keep in mind that the original input image dimensions ENet
# was trained on were 1024x512
image = cv2.imread(args["image"])
image = imutils.resize(image, width=args["width"])
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (1024, 512), 0,
    swapRB=True, crop=False)

# perform a forward pass using the segmentation model
net.setInput(blob)
start = time.time()
output = net.forward()
end = time.time()

# show the amount of time inference took
print("[INFO] inference took {:.4f} seconds".format(end - start))
To perform deep learning semantic segmentation of an image with Python and OpenCV, we:

- Load the model (Line 56).
- Construct a blob (Lines 61-64). The ENet model we are using in this blog post was trained on input images with 1024×512 resolution — we’ll use the same here. You can learn more about how OpenCV’s blobFromImage works here (and see the rough manual equivalent right after this list).
- Set the blob as input to the network (Line 67) and perform a forward pass through the neural network (Line 69).
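If you want intuition for what that blobFromImage call is doing, here is a rough manual equivalent (a sketch only, assuming image is the BGR image loaded by cv2.imread above; the manual_blob name is just for illustration):

import cv2
import numpy as np

# roughly what blobFromImage does with the parameters used above:
# resize to 1024x512, swap BGR -> RGB, subtract the (zero) mean,
# scale by 1/255, then reorder to NCHW with a leading batch dimension
resized = cv2.resize(image, (1024, 512))
rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB).astype("float32")
manual_blob = (rgb - 0.0) * (1 / 255.0)
manual_blob = manual_blob.transpose(2, 0, 1)[np.newaxis, ...]
print(manual_blob.shape)  # (1, 3, 512, 1024), matching the blob fed to the network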
I surrounded the forward pass statement with timestamps. The elapsed time is printed to the terminal on Line 73.
Our work isn’t done yet — now it’s time to take steps to visualize our results. In the remaining lines of the script, we’ll be generating a color map to overlay on the original image. Each pixel has a corresponding class label index, enabling us to see the results of semantic segmentation on our screen visually.
To begin, we need to extract volume dimension information from our output, followed by calculating the class map and color mask:
# infer the total number of classes along with the spatial dimensions
# of the mask image via the shape of the output array
(numClasses, height, width) = output.shape[1:4]

# our output class ID map will be num_classes x height x width in
# size, so we take the argmax to find the class label with the
# largest probability for each and every (x, y)-coordinate in the
# image
classMap = np.argmax(output[0], axis=0)

# given the class ID map, we can map each of the class IDs to its
# corresponding color
mask = COLORS[classMap]
We determine the spatial dimensions of the output volume on Line 77.

Next, let’s find the class label index with the largest probability for each and every (x, y)-coordinate of the output volume (Line 83). This is known as our classMap and contains a class index for each pixel.

Given the class ID indexes, we can use NumPy array indexing to “magically” (and, not to mention, super efficiently) look up the corresponding visualization color for each pixel (Line 87). Our color mask will be overlaid transparently on the original image.
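If the “magic” of that NumPy lookup isn’t obvious, here is a tiny standalone example (toy values only) showing how indexing a color table with a 2D array of class IDs produces a 3-channel color mask in one shot:

import numpy as np

# toy color table: 3 classes, one BGR color per class
COLORS = np.array([[0, 0, 0], [0, 255, 0], [0, 0, 255]], dtype="uint8")

# a tiny 2x2 class map, one class ID per pixel (as produced by argmax)
classMap = np.array([[0, 1], [2, 1]])

# fancy indexing replaces each class ID with its 3-element color row
mask = COLORS[classMap]
print(mask.shape)  # (2, 2, 3) -- a color image the same size as classMap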
Let’s finish the script:
# resize the mask and class map such that its dimensions match the
# original size of the input image (we're not using the class map
# here for anything else but this is how you would resize it just in
# case you wanted to extract specific pixels/classes)
mask = cv2.resize(mask, (image.shape[1], image.shape[0]),
    interpolation=cv2.INTER_NEAREST)
classMap = cv2.resize(classMap, (image.shape[1], image.shape[0]),
    interpolation=cv2.INTER_NEAREST)

# perform a weighted combination of the input image with the mask to
# form an output visualization
output = ((0.4 * image) + (0.6 * mask)).astype("uint8")

# show the input and output images
cv2.imshow("Legend", legend)
cv2.imshow("Input", image)
cv2.imshow("Output", output)
cv2.waitKey(0)
We resize the mask and classMap such that they have the exact same dimensions as our input image (Lines 93-96). It is critical that we apply nearest neighbor interpolation rather than cubic, bicubic, etc. interpolation as we want to maintain the original class IDs/mask values.
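To see why nearest neighbor interpolation matters, here is a small sketch (toy class map with hypothetical values): cubic interpolation blends neighboring IDs and invents class labels that do not exist, while nearest neighbor preserves the original IDs:

import cv2
import numpy as np

# a tiny 2x2 "class map" containing only class IDs 0 and 7
toyMap = np.array([[0, 7], [7, 0]], dtype="uint8")

nearest = cv2.resize(toyMap, (8, 8), interpolation=cv2.INTER_NEAREST)
cubic = cv2.resize(toyMap, (8, 8), interpolation=cv2.INTER_CUBIC)

print(np.unique(nearest))  # [0 7] -- still valid class IDs
print(np.unique(cubic))    # blended values between 0 and 7 -- not real classes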
Now that sizing is correct, we create a “transparent color overlay” by overlaying the mask on our original image (Line 100). This enables us to easily visualize the output of the segmentation. More information on transparent overlays, and how to construct them, can be found in this post.
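As an aside, the same blend can be produced with cv2.addWeighted (assuming the image and mask variables from the script above, which are both uint8 and the same size after resizing); whether you use the manual formula or addWeighted is purely a matter of preference:

# equivalent transparent overlay: 40% original image, 60% color mask
output = cv2.addWeighted(image, 0.4, mask, 0.6, 0)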
Finally, the legend and original + output images are shown to the screen on Lines 103-105.
Single-image segmentation results
Be sure to grab the “Downloads” for this blog post before using the commands in this section. I’ve provided the model + associated files, images, and Python scripts in a zip file for your convenience.
The command line arguments that you supply in your terminal are important to replicate my results. Learn about command line arguments here if you are new to them.
When you’re ready, open up a terminal + navigate to the project, and execute the following command:
$ python segment.py --model enet-cityscapes/enet-model.net \
    --classes enet-cityscapes/enet-classes.txt \
    --colors enet-cityscapes/enet-colors.txt \
    --image images/example_01.png
[INFO] loading model...
[INFO] inference took 0.2100 seconds
Notice how accurate the segmentation is — it clearly segments classes and accurately identifies the person and bicycle (a safety issue for self-driving cars). The road, sidewalk, cars, and even foliage are identified.
Let’s try another example simply by changing the --image command line argument to be a different image:
$ python segment.py --model enet-cityscapes/enet-model.net \
    --classes enet-cityscapes/enet-classes.txt \
    --colors enet-cityscapes/enet-colors.txt \
    --image images/example_02.jpg
[INFO] loading model...
[INFO] inference took 0.1989 seconds
The result in Figure 4 demonstrates the accuracy and clarity of this semantic segmentation model. The cars, road, trees, and sky are clearly marked.
Here’s another example:
$ python segment.py --model enet-cityscapes/enet-model.net \
    --classes enet-cityscapes/enet-classes.txt \
    --colors enet-cityscapes/enet-colors.txt \
    --image images/example_03.png
[INFO] loading model...
[INFO] inference took 0.1992 seconds
The above figure shows a more complex scene, but ENet can still segment the people walking in front of the car. Unfortunately, the model incorrectly classifies the road as sidewalk, though this could be due to the fact that people are walking on it.
A final example:
$ python segment.py --model enet-cityscapes/enet-model.net \
    --classes enet-cityscapes/enet-classes.txt \
    --colors enet-cityscapes/enet-colors.txt \
    --image images/example_04.png
[INFO] loading model...
[INFO] inference took 0.1916 seconds
The final image that we’ve sent through ENet shows how the model can clearly segment a truck from a car among other scene classes such as road, sidewalk, foliage, person, etc.
Implementing semantic segmentation in video with OpenCV
Let’s continue on and apply semantic segmentation to video. Semantic segmentation in video follows the same concept as on a single image — this time we’ll loop over all frames in a video stream and process each one. I recommend a GPU if you need to process frames in real-time.
Open up the segment_video.py file and insert the following code:
# import the necessary packages
import numpy as np
import argparse
import imutils
import time
import cv2

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", required=True,
    help="path to deep learning segmentation model")
ap.add_argument("-c", "--classes", required=True,
    help="path to .txt file containing class labels")
ap.add_argument("-v", "--video", required=True,
    help="path to input video file")
ap.add_argument("-o", "--output", required=True,
    help="path to output video file")
ap.add_argument("-s", "--show", type=int, default=1,
    help="whether or not to display frame to screen")
ap.add_argument("-l", "--colors", type=str,
    help="path to .txt file containing colors for labels")
ap.add_argument("-w", "--width", type=int, default=500,
    help="desired width (in pixels) of input image")
args = vars(ap.parse_args())
Here we import our required packages and parse command line arguments with argparse. The imports are the same as the previous script. Aside from swapping --image out for the video-related arguments below, the remaining arguments (--model, --classes, --colors, and --width) are the same as well:

- --video: The path to the input video file.
- --output: The path to the output video file.
- --show: Whether or not to show the video on the screen while processing. You’ll achieve higher FPS throughput if you set this value to 0.
The following lines load our classes and associated colors data (or generate random colors). These lines are identical to the previous script:
# load the class label names
CLASSES = open(args["classes"]).read().strip().split("\n")

# if a colors file was supplied, load it from disk
if args["colors"]:
    COLORS = open(args["colors"]).read().strip().split("\n")
    COLORS = [np.array(c.split(",")).astype("int") for c in COLORS]
    COLORS = np.array(COLORS, dtype="uint8")

# otherwise, we need to randomly generate RGB colors for each class
# label
else:
    # initialize a list of colors to represent each class label in
    # the mask (starting with 'black' for the background/unlabeled
    # regions)
    np.random.seed(42)
    COLORS = np.random.randint(0, 255, size=(len(CLASSES) - 1, 3),
        dtype="uint8")
    COLORS = np.vstack([[0, 0, 0], COLORS]).astype("uint8")
After loading classes and associating a color with each class for visualization, we’ll load the model and initialize the video stream:
# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNet(args["model"])

# initialize the video stream and pointer to output video file
vs = cv2.VideoCapture(args["video"])
writer = None

# try to determine the total number of frames in the video file
try:
    prop = cv2.cv.CV_CAP_PROP_FRAME_COUNT if imutils.is_cv2() \
        else cv2.CAP_PROP_FRAME_COUNT
    total = int(vs.get(prop))
    print("[INFO] {} total frames in video".format(total))

# an error occurred while trying to determine the total
# number of frames in the video file
except:
    print("[INFO] could not determine # of frames in video")
    total = -1
Our model only needs to be loaded once on Line 48 — we’ll use that same model to process each and every frame.
From there we open a video stream pointer to the input video file and initialize our video writer object (Lines 51 and 52).
Lines 55-59 attempt to determine the total number of frames in the video; otherwise, a message is printed on Lines 63 and 64 indicating that the value could not be determined. The total value will be used later to calculate the approximate runtime of this video processing script.
Let’s begin looping over video frames:
# loop over frames from the video file stream
while True:
    # read the next frame from the file
    (grabbed, frame) = vs.read()

    # if the frame was not grabbed, then we have reached the end
    # of the stream
    if not grabbed:
        break

    # construct a blob from the frame and perform a forward pass
    # using the segmentation model
    frame = imutils.resize(frame, width=args["width"])
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (1024, 512), 0,
        swapRB=True, crop=False)
    net.setInput(blob)
    start = time.time()
    output = net.forward()
    end = time.time()
Our while loop begins on Line 68.

We grab a frame on Line 70 and subsequently check that it is valid on Line 74. If it was not grabbed properly, we’ve likely reached the end of the video, so we break out of the frame processing loop (Line 75).
The next set of lines mimic what we accomplished previously with a single image, but this time we are operating on a video frame. Inference occurs here, so don’t overlook these steps where we:

- Construct a blob from a resized frame (Lines 79-81). The ENet model we are using in this blog post was trained on input images with 1024×512 resolution — we’ll use the same here. Learn about how OpenCV’s blobFromImage works here.
- Set the blob as input (Line 82) and perform a forward pass through the neural network (Line 84).
Segmentation inference is now complete, but we want to post-process the data in order to visualize and output the results. The remainder of the loop handles this process over three code blocks:
    # infer the total number of classes along with the spatial
    # dimensions of the mask image via the shape of the output array
    (numClasses, height, width) = output.shape[1:4]

    # our output class ID map will be num_classes x height x width in
    # size, so we take the argmax to find the class label with the
    # largest probability for each and every (x, y)-coordinate in the
    # image
    classMap = np.argmax(output[0], axis=0)

    # given the class ID map, we can map each of the class IDs to its
    # corresponding color
    mask = COLORS[classMap]

    # resize the mask such that its dimensions match the original size
    # of the input frame
    mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]),
        interpolation=cv2.INTER_NEAREST)

    # perform a weighted combination of the input frame with the mask
    # to form an output visualization
    output = ((0.3 * frame) + (0.7 * mask)).astype("uint8")
Just as before:

- We extract the spatial dimensions of the output volume on Line 89.
- Generate our classMap by finding the class label index with the largest probability for each and every pixel of the output image array (Line 95).
- Compute our color mask from the COLORS associated with each class label index in the classMap (Line 99).
- Resize the mask to match the frame dimensions (Lines 103 and 104).
- And finally, overlay the mask on the frame transparently (Line 108).
Let’s write the output frames to disk:
    # check if the video writer is None
    if writer is None:
        # initialize our video writer
        fourcc = cv2.VideoWriter_fourcc(*"MJPG")
        writer = cv2.VideoWriter(args["output"], fourcc, 30,
            (output.shape[1], output.shape[0]), True)

        # some information on processing single frame
        if total > 0:
            elap = (end - start)
            print("[INFO] single frame took {:.4f} seconds".format(elap))
            print("[INFO] estimated total time: {:.4f}".format(
                elap * total))

    # write the output frame to disk
    writer.write(output)
The first time the loop runs, the writer is None, so we need to instantiate it on Lines 111-115. Learn more about writing video to disk with OpenCV.

Using the total video frame count, we can estimate how long it will take to process the video (Lines 118-122).

Finally, we actually write the output to disk on Line 125.
Let’s display the frame (if needed) and clean up:
    # check to see if we should display the output frame to our screen
    if args["show"] > 0:
        cv2.imshow("Frame", output)
        key = cv2.waitKey(1) & 0xFF

        # if the `q` key was pressed, break from the loop
        if key == ord("q"):
            break

# release the file pointers
print("[INFO] cleaning up...")
writer.release()
vs.release()
In the last block, we check to see if we should display the output frame and take action accordingly (Lines 128 and 129). While the frames are being shown in a window on the screen, pressing “q” will “quit” the frame processing loop (Lines 130-134). Finally, we clean up by releasing the file pointers.
Video segmentation results
To perform semantic segmentation in video, grab the “Downloads” for this blog post.
Then, open up a terminal and execute the following command:
$ python segment_video.py --model enet-cityscapes/enet-model.net \
    --classes enet-cityscapes/enet-classes.txt \
    --colors enet-cityscapes/enet-colors.txt \
    --video videos/massachusetts.mp4 \
    --output output/massachusetts_output.avi
[INFO] loading model...
[INFO] 4235 total frames in video
[INFO] single frame took 0.2491 seconds
[INFO] estimated total time: 1077.3574
[INFO] cleaning up...
I’ve included a sample of my output below:
Credits: Thank you to Davis King from dlib for putting together a dataset of front/rear views of vehicles. Davis included the videos in his dataset which I then used for this example. Thank you J Utah and Massachusetts Dash Cam for the example videos. Audio credit to BenSound.
What if I want to train my own segmentation networks?
At this point, if you reviewed both scripts, you learned that deep learning semantic segmentation with a pretrained model is quite easy for both images and video. Python and OpenCV make the process straightforward for us, but don’t be fooled by the low line count of the scripts — there are a ton of computations going on under the hood of the segmentation model.
Training a model isn’t as difficult as you’d imagine. If you would like to train your own segmentation networks on your own custom datasets, make sure you refer to the following tutorial provided by the ENet authors.
Please note that I have not trained a network from scratch using ENet but I wanted to provide it in this post as (1) a matter of completeness and (2) just in case you may want to give it a try.
Keep in mind though — labeling image data requires a ton of time and resources. The ENet authors were able to train their model thanks to the hard work of the Cityscapes team who graciously have made their efforts available for learning and research.
Note: The Cityscapes data is for non-commercial use (i.e. academic, research, and learning). Only use the ENet model accordingly.
Summary
In today’s blog post we learned how to apply semantic segmentation using OpenCV, deep learning, and the ENet architecture.
Using the pre-trained ENet model on the Cityscapes dataset, we were able to segment both images and video streams into 20 classes in the context of self-driving cars and road scene segmentation, including people (both walking and riding bicycles), vehicles (cars, trucks, buses, motorcycles, etc.), construction (building, walls, fences, etc.), as well as vegetation, terrain, and the ground itself.
If you enjoyed today’s blog post, be sure to share it!
Hi Adrian
Thanks for your good post
Could you please report the processing time on your CPU? Also, which CPU model did you use?
Model processing time on the CPU is already reported on this post. Be sure to refer to the terminal output for each of the respective commands where the throughput time is estimated.
Thanks for the great tutorial Adrian it was really helpful. I have some questions:
Do you know how fast this implementation runs? And if so, on which hardware?
Hey Darío — I have already included speed throughput information in the tutorial. I have included inference approximation on a CPU.
I noticed the inference time approximation keeps varying each time you run the code for the same image (example).
I am seeing much slower inference time (>1second) on a Nvidia TX1 (GPU) than the inference approximations in the blog post. What type of hardware (e.g. CPU) is used for the demo results posted? Thanks.
The demo results were gathered on a 3 GHz Intel Xeon W.
This would be great for background subtraction in motion detection for surveillance cameras, I guess.
Thanks, Adrian. Could you please help me create my own ENet model?
Refer to the “What if I want to train my own segmentation networks?” section of this post.
I gather the algorithm is starting fresh on each frame, independent of any previous frames. Of course, that’s not the way people do it: once you identify a car or a tree, you expect to see the same objects nearby a moment later, and would not expect an object to magically change into something else. But I suppose combining the segmentation model with object tracking on every moving object would be vastly more complex.
Great post, again.
I just wonder which framework Mr. Paszke used for training. Can you let me know? Thanks so much, Adrian.
It was Caffe. Refer to their GitHub (linked to in this post) for more information.
great post.
I am interested to know the major areas where I can apply semantic segmentation.
thanks.
There are many, but some of the hottest areas for semantic segmentation right now include road scene segmentation for self-driving cars and work in pathology, such as segmenting cellular structures.
Hey Adrian, Thanks for this article.
Is OpenCV’s dnn module a wrapper around Caffe?
No, OpenCV does not wrap around Caffe, Torch, Lua, etc. Instead, OpenCV provides methods to load these model formats without requiring those respective libraries to be installed.
Hi Adrian,
Loved your post. I was actually waiting for a segmentation post. The thing is, I need to understand how it works from scratch. I read a few posts covering the idea of upsampling and skip connections between deconv and maxpool layers. Though I understood the overview, I need to understand the fine details.
And also, can you explain the concept/requirement of a “blob”?
Thanks and cheers.
The blob concept is covered in detail in this post.
Hey Adrian,
Great tutorial. I am not able to understand what exactly the color map signifies.
The color map is just a visualization of the pixel-wise segmentation of the image. Each pixel in the image is associated with a class label. The color map visualizes this relationship.
Hi Adrian!
Thanks for the great tutorial.
I have one question:
Can I use it for segmenting car license plates? Just to get something like this:
https://imgur.com/a/UewmUiF
You would need to train a segmentation model explicitly on car license plates. For what it’s worth I cover how to perform Automatic License Plate Recognition inside the PyImageSearch Gurus course.
Hi Adrian, I’m super stoked for this tutorial, but I just gotta get over this bug I’m running into from the code:
AttributeError: module ‘cv2.dnn’ has no attribute ‘readNet’
When I googled around for this issue, it was said that I need to build OpenCV from source from the OpenCV master branch. I have installed OpenCV 3.4.1 using your instructions (Raspbian, Ubuntu, and all installs work great until this snag), and as far as I know, your install instructions do build from source, correct? But I notice that you use “wget” for the OpenCV zip folder, and not a “git clone” from the OpenCV repository, could this be the reason? Anyways, I’m about to embark on a re-install of OpenCV, but wondering if you have some insight on this issue. Thanks Adrian, you’re awesome!!!
Two possible solutions:
1. Make sure you are using OpenCV 3.4.1 or greater.
2. Change the function call to:
cv2.dnn.readNetFromTorch(args["model"])
Changing the function call to cv2.dnn.readNetFromTorch(args["model"]) worked for me, but I am curious: why did it work?
99% likely due to an OpenCV version difference. Which version of OpenCV are you using?
Currently I’m using OpenCV 3.2.0. Does it work? Or do I need to update it to the latest version?
You need to update to OpenCV 3.4 or OpenCV 4.
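If you aren’t sure what you have installed, a quick check is:

import cv2

# cv2.dnn.readNet requires a recent OpenCV; print the version to confirm
print(cv2.__version__)
print(hasattr(cv2.dnn, "readNet"))  # False means you need to upgrade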
Dear Adrian,
Thanks for this huge work!
Is it possible to reduce the number of classes analyzed by the model (20 -> 5, for example)? Directly in the .py script, or only in the ENet model?
It would be interesting to measure the impact on performance.
Can semantic segmentation be used for detection/tracking purposes like some of your other examples? Creation of a bounding box?
Best regards 🙂
You can filter the returned results like I do in this tutorial but you cannot directly modify the model to reduce classes from 20 to 5 without applying fine-tuning. I discuss fine-tuning inside Deep Learning for Computer Vision with Python.
I had the error:
AttributeError: module ‘cv2.dnn’ has no attribute ‘readNet’
Solved it by changing the line:
net = cv2.dnn.readNet(args["model"])
into:
cv2.dnn.readNetFromTorch(args["model"])
Make sure you are using OpenCV 3.4.1 or better as well.
Hi Adrian
Thank you very much. I did notice that the readNet() method was missing in my version of OpenCV (some others have mentioned this on the net – generally the answer is to re-install OpenCV from the master branch).
I however was able to apply the model using readNetFromTorch() instead. I seem to get the same result as you have, for example_04.
I just wonder if you are aware of anything I might lose if I use readNetFromTorch() instead of readNet()? Of course, I understand I will have to use a different method for models trained in Caffe, TensorFlow, etc.
regards
Sundar
Hi Adrian,
Regarding my earlier question, I noticed others asked the same this morning ( my page had not refreshed from last night) – sorry for the bother.
regards
Sundar
No problem at all, Sundar! I’m glad you have resolved the issue 🙂
Excellent article Adrian. I am currently researching the application of computer vision in malware classification (converting malware binaries to grayscale and then using image processing/ machine learning etc.). Do you think the methods described in your article have the potential to be applied to identifying malware?
How exactly are you converting the malware to a grayscale image? What process is being performed? Provided you can convert the binary to grayscale and have sufficient data, yes, I do believe you could apply image classification to the problem but I don’t know if it would be more accurate than simply applying an analysis on the binary data itself.
Hi Adrian,
There’s 2 ways that I am converting the binaries:
1) Byte-to-pixel mapping (by converting the binary to an 8-bit unsigned int numpy array and then saving it as a png). With this method, all the binary features are preserved.
2) Binary converted to Hilbert curves (haven’t tested this yet)
In method 1, I have experimented with the following feature descriptors:
LBP
HOG
Haralick GLCM
ORB
I’ve tested using the following classifiers:
Decision Trees
KNN
Naive-Bayes
Random Forest
SVM
I have converted both static binaries (for static analysis) and memory dumps (for dynamic behavioral analysis).
Best results
Static: LBP using KNN ~92% accuracy
Dynamic: HOG using KNN ~94% accuracy
More common methods of feature extraction for malware classification are either n-gram analysis or disassembling the binary and extracting API calls. Feature vectors can be calculated using either frequency of the features or sequences. These methods are generally noisy and are not robust against obfuscation techniques like encryption or compression. I was interested in also testing out deep learning (I have your book!) and when I saw your post, it really grabbed my attention! I guess my ultimate aim is to develop a system that can classify malware using both static and dynamic features.
I’m more familiar with the n-gram analysis technique, I hadn’t thought of converting the 8-bit unsigned integers to an actual image. But wouldn’t the malware representation be a single row of 8-bit integers? How are you converting that into a 2D image? Also, thank you so much for picking up a copy of my book 🙂
Hi Adrian,
Here’s the conversion code (courtesy of Lakshmanan Nataraj)
https://pastebin.com/pMDPeDxB
It reads the binary in as an array, reshapes it and then converts it to a uint8 array. You could also use numpy.fromfile and reshape (works too)
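In case you don’t want to follow the pastebin link, a minimal sketch of that numpy.fromfile approach would be something like this (the file name and width are just placeholders, and the original conversion script may differ):

import numpy as np
import cv2

# read the raw bytes of a (hypothetical) binary as unsigned 8-bit values
data = np.fromfile("sample.bin", dtype="uint8")

# reshape into a 2D grayscale image with a fixed width, truncating any
# leftover bytes that don't fill a complete row
width = 256
height = len(data) // width
img = data[:width * height].reshape((height, width))

cv2.imwrite("sample_binary.png", img)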
Here’s an interesting article (it gave me the idea of using Hilbert curves as an alternative):
https://corte.si/posts/visualisation/binvis/index.html
Really, really cool! Thank you for sharing Stephen.
No problem Adrian!
Hey Adrian,
Thanks for the great tutorial !
However, when I run segment.py with example_02.png, it gives an error:
'NoneType' object has no attribute 'shape'
Could you please help me resolve the error ?
Your path to the input image is not correct and “cv2.imread” is returning None. Refer to this tutorial to help you solve the problem.
Thanks for the very helpful tutorial, Adrian! Just want to ask if you’ve tested in OpenCV the pretrained Caffe models on Ade20k?
I’ve downloaded the caffemodel and prototxt files and am now starting to follow your Object Detection tutorial that uses the MobileNetV2 model, but I’m unsure if the same code would work on OpenCV 3.4.2 with this semantic segmentation model?
Would I just need to change the blob values? I’d greatly appreciate your help (or anybody here with the time and experience) regarding this. Thanks in advance!
Sorry, what is ADE20K? As long as you’re running OpenCV 3.4.2 the code for both tutorials should work.
Hi Adrian,
I have OpenCV 3.4.1. Do I need to upgrade it? Should I install PyImageSearch to run the code?
I would recommend OpenCV 3.4.2 or OpenCV 4 for this code.
Hi Adrian,
I got the output from the command prompt. I supplied the arguments from Anaconda on Windows. Finally, I learned something from your tutorials.
Thanks so much!
Awesome, glad to hear it! 🙂
Hi Dr. Adrian. ADE20K is a dataset for semantic segmentation. There are PyTorch, Caffe, and Torch7 implementations available on GitHub.
But I find your approach to be more aligned to what I’m currently working on.
Here’s a link to the pretrained caffemodel: http://sceneparsing.csail.mit.edu/model/caffe/
I’ve managed to get the detection results (class IDs), but I’m stuck at filtering these based on confidence results, and at also getting the startY/X and endY/X mask values of the ROI of each detected object (in relation to the input image).
Hope you can have a tutorial blog post about this some time soon. Thanks!
Hi Adrian,
Thank you for your tutorial.
Do you think this would be achievable on an RPi 3B+ and a Movidius stick to process a picamera stream in real time?
Thank you
In full real-time as in 20+ FPS? No, that’s unrealistic. As this tutorial shows you may be able to get up to 4-6 FPS but anything higher I believe is unrealistic.
Thank you for the amazing tutorial, once again. You always detail every step; it is just perfect!
I have a question: how do you tell OpenCV to select CPU or GPU usage? How can I tell it to switch from one to the other? I suppose it is not like TensorFlow with CUDA_VISIBLE_DEVICES, right?
Another comment: I also got the error about the missing dnn.readNet even though I use opencv-python 3.4.1.15,
BUT I’m on Windows. Maybe this version is not exactly the same as the Linux version?
Using readNetFromTorch() works perfectly, though.
I see there is also a readNetFromTensorflow, so we can now import TF models too? That’s very good!
So here’s the problem:
OpenCV is starting to include GPU support, including OpenCL support. CUDA + Python support is not yet released but there are PRs in their GitHub repo that are working on CUDA support. I’ll be doing a blog post dedicated to CUDA + Python support once it’s fully supported.
Hi, great article. Please write something on how to save CNN-extracted features in HDF5 and later feed them to an LSTM, e.g., for human action recognition.
Hello, when I run segment.py,
I get the error:
usage: [-h] -m MODEL -c CLASSES -i IMAGE [-l COLORS] [-w WIDTH]
: error: the following arguments are required: -m/–model, -c/–classes, -i/–image
An exception has occurred, use %tb to see the full traceback.
What’s my problem? Can someone help me out?
You need to supply the command line arguments to the script. If you are new to command line arguments make sure you read this post first.
I am trying to execute this program but it is giving the following error. Please help
module ‘cv2.dnn’ has no attribute ‘readNet’
My previous query is resolved thanks to your solution mention above. I am facing this error now.
…
(h, w) = image.shape[:2]
AttributeError: ‘NoneType’ object has no attribute ‘shape’
Your path to the input image is incorrect and “cv2.imread” is returning “None”. Double-check your path to the input image and make sure you read on on NoneType errors in this tutorial.
How can I use this to detect walls, ceilings etc in a room?
You would need to fine-tune this model on a dataset of walls, ceilings, etc. Do you have such a dataset? If not, you would want to research one.
thanks for the post. It was great
Thanks Utkarsh, I’m glad you liked it! 🙂
Hello Adrian,
The article is wonderful. Thanks.
Can I perform transfer learning on this model? Can you please refer me to some method?
I want to classify some more terrains with the help of this model.
Thanks.
I actually cover transfer learning inside Deep Learning for Computer Vision with Python. I’m also working on further semantic segmentation tutorials as well!
Hi, thank you for this article.
How can I make this model detect only fences? Is that possible using this technique?
This exact model won’t be able to segment fences; however, if you have a dataset of fence images you could train or fine-tune a model to detect fences.
I have trained a model using my fence images, but my images don’t contain only fences; they also include the background, so my model detects the background as well and considers it a fence.
It sounds like you may not have annotated your dataset correctly. How did you label your images? Did you create a mask for only the fence pixels in your dataset?
Can you please tell me what step I need to add to your code so that I get only the road mask?
It would be very helpful.
You would want to build a mask for your returned class IDs with the pixels of the road mask set to 1 (or 255) and all other values to zero. If you’re new to Python and OpenCV I would recommend reading up on bitwise masking and NumPy array indexing.
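As a rough sketch (assuming the classMap, CLASSES, and image variables from segment.py, with classMap already resized to the image dimensions), it might look something like this:

import numpy as np
import cv2

# look up the index of the "Road" label in the classes file
roadID = CLASSES.index("Road")

# binary mask: 255 where a pixel was classified as road, 0 everywhere else
roadMask = np.where(classMap == roadID, 255, 0).astype("uint8")

# keep only the road pixels of the original image
roadOnly = cv2.bitwise_and(image, image, mask=roadMask)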
Hi Adrian,
Will other models like VGG19, ResNet50, or U-Net work for the Cityscapes dataset, and how do I write a .net file as you have with Enet.net?
If you want to use a different backbone or base network you would need to train it yourself. Make sure you read the “What if I want to train my own segmentation networks?” section of this tutorial.
Hi Adrian,
Can we use this method to blur backgrounds so only the people or objects in the foreground are clear and everything else behind them is blurry? Or are there simpler methods to accomplish that?
Thank you for all the amazing stuff you share!
Yes, absolutely. You would want to:
1. Apply semantic segmentation
2. Grab the mask for the area you’re interested in
3. Copy the original image
4. Blur it
5. Bitwise AND it with your masked region
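A minimal sketch of those five steps, assuming the classMap, CLASSES, and image variables from segment.py (with classMap already resized to the image dimensions), might look like:

import cv2
import numpy as np

# build a binary mask for the "Person" class
personID = CLASSES.index("Person")
personMask = ((classMap == personID) * 255).astype("uint8")

# blur a copy of the entire image
blurred = cv2.GaussianBlur(image, (21, 21), 0)

# sharp foreground (person pixels) + blurred background (everything else)
foreground = cv2.bitwise_and(image, image, mask=personMask)
background = cv2.bitwise_and(blurred, blurred, mask=cv2.bitwise_not(personMask))
output = cv2.add(foreground, background)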
Can not load caffemodel?
Hi Adrian,
Are there any trained models for indoor applications?
Thank you for all the amazing stuff you share!
Do you mean segmentation of indoor scenes, such as walls, ceiling, floor, chair, etc.?
Adrian Hi.
Yes, this is exactly what I mean
Mark
I know I’ve seen pre-trained models for indoor scene understanding but I’m totally blanking on the name of the dataset or the model. I hope another PyImageSearch reader can help me out!
When changing to Caffe, why does processing go but not segmentation?
net = cv2.dnn.readNetFromCaffe (arga.prototxt, arga.caffemodel)
I’m not sure what you mean by “why does processing go but not segmentation” — could you elaborate?
hi Adrian
Really, your daily blog posts surprise me; the content and the way you write are easily understandable. But how do I use multiple .txt files for classes and multiple images/videos?
Thanks Kelemu, I’m glad you are enjoying the PyImageSearch blog!
As for your question, I’m not sure what you mean by using multiple .txt files. What is your end goal?
As usual, very high quality tutorials and blog!!! I like the way you present not only the practical part but also include the full references (research papers) and credits for those who want to know more detail. I really appreciate your effort in contributing knowledge to the community. Well done Adrian!! I am still excited and waiting for your new books to come out. 😉
Thank you Sai, I really appreciate your kind words 🙂
Hi Adrian,
Thank you for your helpful tutorial.
I have a question: how can I use the GPU for this project?
OpenCV doesn’t currently support CUDA GPUs very well for their “dnn” module. Support is coming but unless you have an Intel GPU you won’t be able to use this code with a GPU.
Hi Adrian
Thank you for this amazing post, it is impeccable.
I want to ask if there is a way to apply semantic segmentation using the ENet architecture to bills (like water bills), I mean whether there is a dataset like Cityscapes that you used, and whether you could give me a link to search for one.
Kind Regards
“Bills” as in the bills/invoices that we pay?
Hi adrian
Yes, exactly.
The aim of my project is to take a bill with a precise format (for example): date at the top left, name of the person at the top right of the bill, total to pay at the bottom …
and do a semantic segmentation of that bill, i.e., teach our algorithm where the fields of the bill are and what each one is.
Thank you
Semantic segmentation would be way overkill for such a project. I would suggest you instead look at image registration/document registration algorithms.
Thank you, Mr. Adrian.
How do I run this in Ubuntu?
This code will work in Ubuntu. Just use the “Downloads” section of the tutorial to download the code and model.
I am working on a crop weed segmentation problem. I want to apply semantic segmentation using the U-Net architecture. But the U-Net architecture is not clear to me. Can you please share any case study on the U-Net architecture? Thanks in advance.
I don’t have any tutorials on U-Net but I will consider it in the future. Thanks for the suggestion.
Hi Adrian,
Thank you for the great post. I would like to use this code on grayscale images but it didn’t work! :(
I’m new to this field.
Hi Adrian,
Really cool tutorial. I’ve tried running the model on some images I took with my iphone and the results are really poor compared to the examples. Any tips for possible pre-processing I should be doing?
It’s hard to say what the issue is without seeing your example images. Keep in mind that deep learning algorithms, while impressive, are not magic. They are only as good as the data they were trained on. In this case your input images may be significantly different than what the model was trained on.
Heya,
Implementing additional_code.py to view each individual class mask I’ve noticed that some pixels are being multi-classed. Example: If ran with the image “example_01.png” the sign in the top left corner is in both the “Person” class and the “TrafficSign” class.
In the final colour mapped output, the sign is correctly colour-coded but I’m not understanding why
1) it’s included in two masks as classMap = np.argmax(output[0], axis=0) shouldn’t allow for this
2) why the final colour map is correct but examining the individual class masks shows contradictions to this. At first I thought the “final” decision for a pixels-class would be whatever it was classed as last but this isn’t the case.
Any help you can give would be appreciated.
Hello Adrian, thank you for the awesome tutorial. I have a question: is this the idea behind using X-ray images to detect objects inside a bag?
Hi Adrian, great tutorial.
Can I fine-tune this model for semantic segmentation of MRI brain images?
I don’t have any tutorials on semantic segmentation on MRI images but I hope to cover it in the future.
Is it possible to design an automatic image segmentation tool?
You mean training your own custom Mask R-CNN segmentation network? If so yes, it’s absolutely possible — I cover how to do so inside Deep Learning for Computer Vision with Python.
Hello Sir,
This is an awesome tutorial as always.
I wanted to know: how can I crop each segmented area?
Thank you
You can use NumPy array slicing to extract the ROI and save it to disk. If you are new to using OpenCV and are unfamiliar with cropping ROIs, be sure to read Practical Python and OpenCV to first learn the basics.
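For example, a quick sketch (again assuming classMap, CLASSES, and image from segment.py, with classMap resized to the image dimensions) to crop the bounding box around all “Car” pixels:

import numpy as np
import cv2

# binary mask of the class you want to crop
carID = CLASSES.index("Car")
carMask = (classMap == carID).astype("uint8")

# coordinates of all non-zero (car) pixels; assumes at least one exists
(ys, xs) = np.nonzero(carMask)
(startY, endY) = (ys.min(), ys.max())
(startX, endX) = (xs.min(), xs.max())

# NumPy slicing to crop the ROI and write it to disk
roi = image[startY:endY + 1, startX:endX + 1]
cv2.imwrite("car_roi.png", roi)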
Very good job, as always.
I have a GPU installed with tensorflow, what commands would I have to add in order to use it with this code?
Thank you.
You need to compile and install OpenCV with GPU support.
Hi Adrian,
Superb article !!!
I was thinking about doing something on video segment search based on a user query. So for example, if the query is "red shirt", then I should get all the video segments with a person wearing a red shirt. Do you have any idea what this technology is called? Or do you have any tutorial for the same?
I couldn’t find any relevant stuff on Google.
I learnt a lot from this tutorial but I have a question: can we train ENet on a new dataset with new classes?
Hi, Adrian. Thanks for your awesome tutorial!
And I have a question: the trained ENet model in this blog is a “.net” file, but the trained ENet models in the section “What if I want to train my own segmentation networks?” are “.prototxt” and “.caffemodel” files. So how do I convert these two files to a .net file?
Hi, Adrian. Thanks for your awesome tutorial!
I want to share with you two questions:
I am working with FLIR2 thermal images taken by a drone. I want to distinguish the panels from everything else. Unsupervised analysis with k-means, DBSCAN, and mean shift has already been done. Now I want to perform the analysis with a supervised procedure.
Any advice for a novice enthusiast?
Do you know of any repo or site where I can download a dataset of images and annotations of photovoltaic farm objects, like solar panels?
Thank you in advance.
Sorry, I don’t have any image datasets of solar panels. If I come across any I’ll try to remember to update this comment with a link.
Hi Adrian, I trained a semantic segmentation model with a dataset taken from camera A with good results. Then I applied the trained model to images taken from camera B and the results were much worse. Visually, the image characteristics from camera A are different from camera B (illumination or color tone, such as the green color of trees, are slightly different, etc.). Is there any suggestion for processing the images from camera A for retraining in order to get good segmentation results for images from camera B? Do I need to do something like color normalization?