My day-to-day life includes training many deep learning models. Sometimes I am blessed with an architecture that is small yet capable of providing extraordinary results. Other times, I have to tread the difficult path of training huge architectures to fetch good results.
With the ever-increasing size of data-hungry deep learning models, we seldom talk about training a model with fewer than 10 million parameters. As a result, people with limited hardware access do not get a chance to train these models, and even if they do, the training time is so long that they cannot iterate over the process as quickly as they would like.
Fast Neural Network Training with Distributed Training and Google TPUs
In this article, I will provide some trade secrets that I have found especially useful to speed up my training process. We will talk about the different hardware used for Deep Learning and an efficient data pipeline that does not starve the hardware being used. This article will, in no time, make you and your training pipeline more efficient.
In the article, we will talk about:
- Different hardware used for Deep Learning
- Efficient data pipeline
- Distributing the training process
To learn how to perform distributed training with Google TPUs, just keep reading.
Looking for the source code to this post?
Jump Right To The Downloads Section
Configuring Your Development Environment
To follow this guide, you need to have the TensorFlow and TensorFlow Datasets libraries installed on your system.
Luckily, these packages are pip-installable:
```
$ pip install tensorflow
$ pip install tensorflow-datasets
```
Having Problems Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
Before we continue, let’s first review our project directory structure. Start by accessing the “Downloads” section of this guide to retrieve the source code and Python scripts.
You’ll then be presented with the following directory structure:
```
$ tree . --dirsfirst
.
├── outputs
│   ├── cpu.png
│   ├── gpu.png
│   └── tpu.png
├── pyimagesearch
│   ├── autoencoder.py
│   ├── config.py
│   ├── data.py
│   └── loss.py
├── train_cpu.py
├── train_gpu.py
└── train_tpu.py

2 directories, 10 files
```
Inside the pyimagesearch module, we have the following files:
- autoencoder.py: Defines the autoencoder model that needs to be trained
- config.py: Defines the configuration file that is needed for training
- data.py: Defines the data pipeline for the model training step
- loss.py: Defines the losses that will be used for training
Finally, we have three Python scripts:
- train_cpu.py: Trains the model on a CPU
- train_gpu.py: Trains the model on a GPU
- train_tpu.py: Trains the model on a TPU
The outputs directory consists of the inference images of the autoencoder trained on the different hardware.
Hardware
In Deep Learning, the most fundamental operation is matrix multiplication. The faster we multiply, the more speed we achieve in training. There is a brilliant lecture dedicated to hardware in deep learning from the University of Michigan that I recommend watching for an overview of how hardware has evolved over the years to suit Deep Learning. In this section, we will go over the types of hardware and try to figure out which one serves our purpose best.
CPUs
The Central Processing Unit (CPU) is a processor based on the von Neumann architecture. The architecture proposes an electronic computer with the following components:
- A processing unit: This processes data that is being fed to it
- A control unit: This holds the instructions along with a program counter to control the entire workflow
- Memory: For storage
In the von Neumann architecture, the instructions and the data are present in the memory. The processor accesses instructions and processes the data accordingly. It also uses memory to store the intermediary calculations and later accesses it to complete any computation.
This architecture is extremely flexible. We can essentially provide any instruction and data, and the processor will do the rest of the work. However, the flexibility comes with a tradeoff — speed.
The architecture relies on memory access and also on the control instructions for the next step. Memory access becomes what is known as the von Neumann bottleneck. Even if we are doing matrix multiplication all day long, there is no way for the CPU to guess the future operations; hence it needs to keep accessing the data and instructions.
A snippet from the Google guide on TPUs sheds light on the aforementioned problem.
Each CPU’s Arithmetic Logic Units (ALUs), which are the components that hold and control multipliers and adders, can execute only one calculation at a time. Each time, the CPU has to access memory, which limits the total throughput and consumes significant energy.
Figure 2 shows a simplified version of matrix multiplication in a CPU. The operation takes place sequentially with memory access at each step.
Let’s test the speed of our CPU performing matrix multiplication using TensorFlow. Open a Google Colab Notebook and paste the following code and see the results for yourself.
```python
# import the necessary packages
import tensorflow as tf
import time

# initialize the operands
w = tf.random.normal((1024, 512, 16))
x = tf.random.normal((1024, 16, 512))
b = tf.random.normal((1024, 512, 512))

# start timer
start = time.time()

# perform matrix multiplication
output = tf.matmul(w, x) + b

# end timer
end = time.time()

# print the time taken to perform the operation
print(f"time taken: {(end-start):.2f} sec")

>>> time taken: 0.79 sec
```
Let’s do a little timing test with our CPUs using the code above. Here, we simulate the multiplication and addition operation, the most common operation found in Deep Learning. We see that the operation takes 0.79 sec to complete.
GPUs
Graphical Processing Units (GPUs) try to increase the throughput of CPUs by incorporating thousands of Arithmetic Logical Units (ALUs) on a single processor. This way, GPUs achieve parallelism in the operations.
Matrix multiplication is a parallel operation, which makes Deep Learning calculations a good fit for GPUs. However, GPUs are not built specifically for matrix multiplication, which means they still need to access memory for the data and the control instructions of the next step — the von Neumann bottleneck. Even with the bottleneck, GPUs provide a major step up in the training process due to their parallel operations.
Figure 3 shows a simplified version of matrix multiplication on a GPU. Note how the increase in ALUs helps in achieving parallelism and faster computation.
The code below is the same as the CPU example; the only change is the hardware used to run it. We use a GPU here. As a result, the code took about ~99% less time than on the CPU. This shows how powerful GPUs are and how big a difference parallelism makes. I highly recommend running the following code in a Google Colab Notebook with a GPU runtime.
```python
# import the necessary packages
import tensorflow as tf
import time

# initialize the operands
w = tf.random.normal((1024, 512, 16))
x = tf.random.normal((1024, 16, 512))
b = tf.random.normal((1024, 512, 512))

# start timer
start = time.time()

# perform matrix multiplication
output = tf.matmul(w, x) + b

# end timer
end = time.time()

# print the time taken to perform the operation
print(f"time taken: {(end-start):.6f} sec")

>>> time taken: 0.000436 sec
```
TPUs
We can already decipher what makes Tensor Processing Units (TPUs) great at Deep Learning.
Here’s a snippet from the guide:
Cloud TPU is the custom-designed machine learning ASIC (Application Specific Integrated Chip) that powers Google products like Translate, Photos, Search, Assistant, and Gmail…. One benefit TPUs have over other devices is a major reduction of the von Neumann bottleneck. Because the primary task for this processor is matrix processing, hardware designers of the TPU knew every calculation step to perform that operation. So they were able to place thousands of multipliers and adders and connect them directly to form a large physical matrix of those operators. This is called a systolic array architecture.
With the help of the systolic array architecture, the TPUs load the parameters first and then process the data on the fly. The architecture makes it possible for the data to be multiplied and added systematically with no requirement for memory access to fetch instructions or store the intermediate results.
Figure 4 provides a great visualization of the TPU processing step:
With a small code change to use TPUs, we can hardly see any decrease in time, but we have to keep in mind that we are using a cluster of 8 TPU cores here. This distribution of operations has a huge impact on the entire training pipeline: when the work is divided across the replicas, the computation time here effectively decreases eightfold. You can easily test the results and modify the calculation in a Google Colab Notebook with a TPU runtime.
```python
# import the necessary packages
import tensorflow as tf
import time

# initialize the cluster of TPUs
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.TPUStrategy(tpu)

# initialize the operands
with strategy.scope():
    w = tf.random.normal((1024, 512, 16))
    x = tf.random.normal((1024, 16, 512))
    b = tf.random.normal((1024, 512, 512))

# perform matrix multiplication
with strategy.scope():
    start = time.time()
    output = tf.matmul(w, x) + b
    end = time.time()
    print(f"time taken: {(end-start):.2f} sec")

>>> time taken: 0.06 sec
```
Distribute the Training
Before we start working on TPUs, another important concept for us to understand is distributed training. The idea is very simple: if we want to speed things up, we delegate the work to multiple devices that operate in unison. This way, no single machine redundantly does the same task, and theoretically, the training time is divided by the number of devices to which we delegate the training process.
In this section, I will cover some basic strategies that I use to distribute my training. I highly recommend starting with this guide to distributed training to get a wider perspective of the topic.
Data
The most important thing to focus on when we want a performance upgrade is the data pipeline. It is quite obvious that an inefficient data pipeline will starve the hardware. Even with TPUs, if you provide data sequentially, the essence of data parallelism is defeated, and there will be no significant gains.
TensorFlow provides the tf.data
API that makes data pipelines more efficient. You can refer to our series on tf.data
to get a feel of the API. Using this API alone can provide a considerable boost to training time.
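To give you a taste, here is a minimal tf.data sketch; the dataset name, shuffle buffer, and batch size are placeholders and not the values used in this project's data.py:

```python
# import the necessary packages
import tensorflow as tf
import tensorflow_datasets as tfds

# placeholder values; substitute your own dataset and batch size
BATCH_SIZE = 128
AUTO = tf.data.AUTOTUNE

# build a pipeline that keeps the accelerator fed: cache the decoded
# examples, shuffle, batch, and prefetch the next batch while the
# current one is training
ds = tfds.load("mnist", split="train", as_supervised=True)
ds = (ds.cache()
    .shuffle(1024)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(AUTO))
```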
When we distribute training across multiple devices, the data pipeline becomes a huge concern. We now need to store the data efficiently so that it can be hosted at a very low cost, and we need a technique to transfer that data without much latency. The solution to this problem is to store the data in the TFRecords format.
The TFRecord format is a simple format for storing a sequence of binary records.
You can better understand the entire process of converting data into TFRecords by going through the official TensorFlow guide on the topic.
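As a rough illustration (the feature keys and output filename below are hypothetical, not the ones this project uses), serializing a dataset to TFRecords boils down to wrapping each sample in a tf.train.Example and writing it with a TFRecordWriter:

```python
# import the necessary packages
import tensorflow as tf
import tensorflow_datasets as tfds

def serialize_example(image, label):
    # wrap the serialized image tensor and the label in a
    # tf.train.Example protobuf
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(image).numpy()])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(
            value=[int(label)])),
    }
    example = tf.train.Example(
        features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# write a small sample of (image, label) pairs to a TFRecord file
ds = tfds.load("mnist", split="train", as_supervised=True)
with tf.io.TFRecordWriter("sample.tfrecord") as writer:
    for image, label in ds.take(100):
        writer.write(serialize_example(image, label))
```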
Now that we have our data converted into TFRecords, we need to host it somewhere so that we can access it (with low latency) for distributed training. There are several solutions for hosting your data (Google Cloud Storage, Amazon Web Services, and more), but I recommend using public Kaggle Datasets. This way, people who cannot afford a paid subscription to a hosting solution can easily host their data as open-access Kaggle Datasets and use them.
With TPUs, it becomes a necessity to have the data in a GCS bucket. Kaggle Datasets provides hosting in a GCS bucket, so data stored as Kaggle Datasets can be used with TPUs too.
Understanding the “tf.distribute” Function
Now that we know how to store the data efficiently, we move to our next section, the core concept of distributed training on TensorFlow. The tf.distribute
function provides an API that makes it very easy for us to distribute the code among multiple GPUs or TPUs.
Using this API, you can distribute your existing models and training code with minimal code changes.
I highly recommend starting with the official TensorFlow guide on distributed training for the curious mind. For an in-depth overview of distributed training, this tutorial beats all the resources out there (Figure 5).
I will dive straight into the two most used strategies for distributed training:
- MirroredStrategy: As the name suggests, each model parameter is mirrored across the cluster of devices in use. We split the dataset so that each device gets its own share of the data. Forward propagation takes place simultaneously on all the devices, and the gradients are accumulated and then used for the backpropagation step. This means the model is identical in each replica (a minimal setup sketch follows this list).
- TPUStrategy: This strategy is built specifically for TPUs. It works the same way as MirroredStrategy, only across the cores of a TPU.
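Here is a minimal sketch of setting up MirroredStrategy, assuming a machine with one or more GPUs visible to TensorFlow; the toy model is purely illustrative:

```python
# import the necessary packages
import tensorflow as tf

# mirror the model across every GPU visible on this machine
strategy = tf.distribute.MirroredStrategy()
print(f"[INFO] number of replicas: {strategy.num_replicas_in_sync}")

# the model, optimizer, and metrics must be created inside the scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="adam",
        loss="sparse_categorical_crossentropy")

# model.fit() is then called as usual with a tf.data dataset
```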
Losses
While using distributed training, an important takeaway is to scale the losses properly. When we train on a single machine, the loss is averaged over the entire batch. This means that with multi-device training, we need to scale the losses according to the global batch size.
Suppose we incorporate 4 machines in our distributed setup, and each machine gets a batch of 16 images. The global batch size then becomes 4 × 16 = 64. The losses therefore need to be aggregated and averaged over 64 instead of 16.
We scale the losses because with a Mirrored or TPU strategy, the per-device loss is accumulated at the end of the forward propagation, and the gradients are computed on the accumulated loss. We need to keep this point in mind. Otherwise, you will notice a sharp increase in the losses.
Another important thing to notice here is the reduction of the losses to a scalar. Most of the tf.keras.losses use a default reduction of SUM_OVER_BATCH_SIZE, which computes a reduced mean loss over all the dimensions of the batched loss tensor. When using a distribution strategy, we need to explicitly use the NONE or SUM type of reduction, where SUM does a reduced sum over all the dimensions of the batched loss, and NONE computes a reduced mean over the last dimension of the batched loss only. This becomes a vital point to notice: with a NONE reduction, it is quite easy to miss that the losses are not scalars but remain tensors.
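To make the reduction and scaling concrete, here is a small, hypothetical sketch (the replica count and batch sizes are placeholders) of reducing a NONE-reduction loss by hand against the global batch size:

```python
# import the necessary packages
import tensorflow as tf

# hypothetical setup: 4 replicas, each seeing 16 samples per step
PER_REPLICA_BATCH_SIZE = 16
NUM_REPLICAS = 4
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH_SIZE * NUM_REPLICAS  # 64

# NONE reduction: the loss returns one value per sample instead of a
# single scalar
mse = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)

def replica_loss(real, pred):
    # per-sample losses on this replica, shape (16,)
    perSampleLoss = mse(real, pred)

    # sum here, but divide by the GLOBAL batch size so that adding up
    # the replica losses yields the correct global mean
    return tf.reduce_sum(perSampleLoss) / GLOBAL_BATCH_SIZE
```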
A Tale of Comparison
In this section, we will train an autoencoder on the MNIST dataset. The architecture involves a series of convolutions as the encoder and transpose convolutions as the decoder.
The autoencoder architecture is defined in the autoencoder.py
file. Let’s take a look at the architecture in detail.
```python
# import the necessary packages
from tensorflow.keras import Model
from tensorflow.keras import Sequential
from tensorflow.keras.layers import InputLayer
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Conv2DTranspose

class AutoEncoder(Model):
    def __init__(self):
        super().__init__()

        # build the encoder
        self.encoder = Sequential([
            InputLayer((28, 28, 1)),
            Conv2D(16, (3, 3), activation='relu',
                padding='same', strides=2),
            Conv2D(8, (3, 3), activation='relu', padding='same', strides=2)])

        # build the decoder
        self.decoder = Sequential([
            Conv2DTranspose(8, kernel_size=3, strides=2,
                activation='relu', padding='same'),
            Conv2DTranspose(16, kernel_size=3, strides=2,
                activation='relu', padding='same'),
            Conv2D(1, kernel_size=(3, 3), activation='sigmoid',
                padding='same')])

    def call(self, x):
        # pass the input through the encoder and output of the encoder
        # through the decoder
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)

        # return the output from the decoder
        return decoded
```
Lines 2-6 import the necessary packages. The model is created with TensorFlow’s Keras package.
Lines 12-17 define a sequential model with two convolutional layers. Both layers use a stride of 2, which halves the spatial dimensions of the input each time. This is our encoder. The encoder is supposed to squeeze the input representation into a smaller dimension.
Lines 20-26 define a sequential model with two convolutional transpose layers and a convolutional layer. The transpose layers spatially upsample the input tensors. This inverse behavior gives us our decoder. The job of the decoder is to decode the encoded representation and produce an output similar to the input tensor.
Lines 28-35 define the way the model will be called. The data that is being used to call the model is first encoded by the encoder and then decoded by the decoder. The encoded representation is also known as the bottleneck.
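As a quick sanity check (not part of the project's scripts), we can pass a random MNIST-shaped batch through the model and confirm the bottleneck and output shapes:

```python
# import the necessary packages
import tensorflow as tf
from pyimagesearch.autoencoder import AutoEncoder

# build the model and push a random batch of 28x28 grayscale images
# through it
model = AutoEncoder()
images = tf.random.normal((8, 28, 28, 1))
reconstructions = model(images)

# the bottleneck is 7x7x8, and the output matches the input shape
print(model.encoder(images).shape)   # (8, 7, 7, 8)
print(reconstructions.shape)         # (8, 28, 28, 1)
```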
We will train this model on different hardware and compare the training time as we go. The MNIST dataset used is part of the tensorflow_datasets API. With the tensorflow_datasets API, we can also use the try_gcs parameter, which pulls the data from a GCS bucket if one is found. This helps with distributed training off the shelf.
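A minimal sketch of what such a loader could look like (the helper mirrors the get_data function described above, but the project's actual data.py may differ in detail):

```python
# import the necessary packages
import tensorflow as tf
import tensorflow_datasets as tfds

def get_data(dataName, split, batchSize, shuffleSize=None):
    # try_gcs=True streams the dataset straight from the public TFDS
    # GCS bucket, which is what makes off-the-shelf TPU training work
    ds = tfds.load(dataName, split=split, as_supervised=True,
        try_gcs=True)

    # scale pixels to [0, 1] and use the image as both input and
    # target (the autoencoder reconstructs its own input)
    def preprocess(image, label):
        image = tf.cast(image, tf.float32) / 255.0
        return (image, image)
    ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)

    # optionally shuffle, then batch and prefetch
    if shuffleSize is not None:
        ds = ds.shuffle(shuffleSize)
    return ds.batch(batchSize, drop_remainder=True).prefetch(
        tf.data.AUTOTUNE)
```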
CPU
Here, we take the dataset, compile the model with the optimizer and the losses required, and train the model.
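A hedged sketch of that flow (the config.BATCH_SIZE constant is an assumption on my part; the project's actual train_cpu.py may differ in detail):

```python
# import the necessary packages
from pyimagesearch import config
from pyimagesearch.data import get_data
from pyimagesearch.autoencoder import AutoEncoder
from pyimagesearch.loss import MSELoss

# load the batched training and validation datasets
trainDs = get_data(dataName=config.DATA_NAME, split=config.TRAIN_FLAG,
    shuffleSize=config.SHUFFLE_SIZE, batchSize=config.BATCH_SIZE)
valDs = get_data(dataName=config.DATA_NAME,
    split=config.VALIDATION_FLAG, batchSize=config.BATCH_SIZE)

# build, compile, and train the autoencoder (no distribution strategy
# is needed on a single CPU)
model = AutoEncoder()
model.compile(loss=MSELoss(scale=1), optimizer=config.OPTIMIZER)
model.fit(trainDs, epochs=config.EPOCHS, validation_data=valDs)
```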
```
$ python train_cpu.py
Epoch 1/5
468/468 [==============================] - 65s 133ms/step - loss: 0.0380 - val_loss: 0.0036
Epoch 2/5
468/468 [==============================] - 57s 122ms/step - loss: 0.0028 - val_loss: 0.0022
Epoch 3/5
468/468 [==============================] - 56s 120ms/step - loss: 0.0020 - val_loss: 0.0018
Epoch 4/5
468/468 [==============================] - 57s 122ms/step - loss: 0.0017 - val_loss: 0.0016
Epoch 5/5
468/468 [==============================] - 57s 121ms/step - loss: 0.0015 - val_loss: 0.0014
```
We see that it takes a CPU ~57 secs to train for an epoch.
GPU
The only change we need to make here is to train the model on a GPU; the code itself does not change at all. With the nvidia-smi command, we can look at the GPU that was used for training.
We have used a Tesla T4 for the experiment.
```
$ python train_gpu.py
Epoch 1/5
468/468 [==============================] - 24s 15ms/step - loss: 0.0378 - val_loss: 0.0036
Epoch 2/5
468/468 [==============================] - 4s 8ms/step - loss: 0.0028 - val_loss: 0.0022
Epoch 3/5
468/468 [==============================] - 4s 8ms/step - loss: 0.0020 - val_loss: 0.0018
Epoch 4/5
468/468 [==============================] - 4s 8ms/step - loss: 0.0017 - val_loss: 0.0016
Epoch 5/5
468/468 [==============================] - 4s 8ms/step - loss: 0.0015 - val_loss: 0.0015
```
Just by using GPUs and harnessing parallel computing, we see a drastic change in the training time. For a Tesla T4, it takes ~4 secs to train for an epoch.
TPU
The first thing here would be to harness the TPU and define the distribution strategy that we will use. Let’s take a look at train_tpu.py
and dive deep into the training process.
```python
# USAGE
# python train_tpu.py

# import tensorflow and fix the random seed for better reproducibility
import tensorflow as tf
tf.random.set_seed(42)

# import the necessary packages
from pyimagesearch import config
from pyimagesearch.data import get_data
from pyimagesearch.autoencoder import AutoEncoder
from pyimagesearch.loss import MSELoss
from tensorflow.distribute.cluster_resolver import TPUClusterResolver
from tensorflow.config import experimental_connect_to_cluster
from tensorflow.tpu.experimental import initialize_tpu_system
from tensorflow.distribute import TPUStrategy
from tensorflow.keras.preprocessing.image import array_to_img
import matplotlib.pyplot as plt
import os
```
On Line 5, we import tensorflow, and on Line 6, we set a seed for all the random operations in TensorFlow. Operations that are random (stochastic) in nature would otherwise produce different results on every run. When we set a random seed, the random operations repeat identically every time the experiment is run from the beginning, making the results reproducible.
```python
# initialize the TPU and TPU strategy
tpu = TPUClusterResolver()
experimental_connect_to_cluster(tpu)
initialize_tpu_system(tpu)
strategy = TPUStrategy(tpu)

# get the number of accelerators
numAcc = strategy.num_replicas_in_sync
print(f"[INFO] Number of accelerators: {numAcc}")

# get the training dataset
print("[INFO] loading the training and validation datasets...")
trainDs = get_data(dataName=config.DATA_NAME,
    split=config.TRAIN_FLAG, shuffleSize=config.SHUFFLE_SIZE,
    batchSize=config.TPU_BATCH_SIZE)

# get the validation dataset
valDs = get_data(dataName=config.DATA_NAME,
    split=config.VALIDATION_FLAG, batchSize=config.TPU_BATCH_SIZE)
```
Lines 22-25 show how to initialize the TPU cluster and define the distribution strategy.
Lines 27-28 retrieve and show the number of TPU devices that are initialized and used in training.
Lines 32-39 initialize the training data and the validation data. Here, with the use of the get_data
function, we retrieve the batched training and validation data.
```python
# train the model in the scope
with strategy.scope():
    # initialize the autoencoder model and compile it
    print("[INFO] initializing the model...")
    model = AutoEncoder()
    mseLoss = MSELoss(scale=1)
    model.compile(loss=mseLoss, optimizer=config.OPTIMIZER)

    # train the model
    print("[INFO] training the autoencoder...")
    model.fit(trainDs, epochs=config.EPOCHS,
        steps_per_epoch=config.TPU_STEPS_PER_EPOCH,
        validation_data=valDs,
        validation_steps=config.TPU_VALIDATION_STEPS)
```
Lines 42-54 define the training phase of the model. When we distribute the training under a defined Distribution Strategy, we need to initialize the model, optimizers, and the losses under the specified strategy. Here, on Line 42, we define the scope of the Distribution Strategy that we will use.
Line 45 initializes the autoencoder model that will be trained.
Line 46 initializes the mean squared error loss function that we will use to train the model.
Line 47 compiles the model with the defined loss function and the specific optimizer.
Lines 51-54 use the model.fit()
Keras API to train the model with the training dataset. In the fit
function, we define the validation dataset as well to monitor the training process better. With the validation dataset, we will check whether the model overfits or underfits the training data.
```python
# grab a batch of data from the test set and run inference
print("[INFO] evaluating the model...")
(testIm, _) = next(iter(valDs))
predIm = model.predict(testIm)

# create subplots
fig, axes = plt.subplots(nrows=8, ncols=2, figsize=(10, 40))

# iterate over the subplots and fill with test and predicted images
print("[INFO] displaying the predicted images...")
for ax, real, pred in zip(axes, testIm[:8], predIm[:8]):
    # plot the input image
    ax[0].imshow(array_to_img(real), cmap="gray")
    ax[0].set_title("Input Image")

    # plot the predicted image
    ax[1].imshow(array_to_img(pred), cmap="gray")
    ax[1].set_title("Predicted Image")

# check if the output image directory exists, if it does not, then
# create it
if not os.path.exists(config.BASE_IMG_PATH):
    os.makedirs(config.BASE_IMG_PATH)

# save the figure
print("[INFO] saving the predicted images...")
fig.savefig(config.TPU_IMG_PATH)
```
Lines 57-82 are dedicated to the inference phase. After our model is trained, we use it to run inference on unseen data and visualize the output. Lines 58 and 59 grab a batch from the validation dataset and run inference on it. Lines 62-73 create subplots and plot the input and predicted images. Lines 77-82 check whether the output image directory exists, create it if it does not, and save the inference figure to the defined path.
Training a model on the CPU, GPU, and the TPU does not need too many changes. The only change we need to introduce here is to scale the loss and define the distribution strategy. Now that we know about the distribution strategy, let’s jump into the loss.py
file to configure the loss functions.
```python
# import the necessary packages
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.losses import Reduction
from tensorflow import reduce_mean

class MSELoss():
    def __init__(self, scale):
        # accept the scalar by which the loss needs to be scaled
        self.scale = scale

    def __call__(self, real, pred):
        # initialize MeanSquaredError loss with no reduction
        MSE = MeanSquaredError(reduction=Reduction.NONE)

        # compute the loss
        loss = MSE(real, pred)

        # scale the loss
        loss = reduce_mean(loss) * (1. / self.scale)

        # return loss
        return loss
```
Lines 2-4 import the necessary packages.
Lines 6-22 define the loss function. Notice how the loss function is built as a class. A structure like this lets us give the object some properties during initialization and not worry about them in later function calls. For example, we want the mean squared error loss function to remember the scale by which it needs to be reduced. We use the 1. / self.scale term to scale the loss according to the number of replicas we have.
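For instance, the loss could be constructed inside the strategy scope with the replica count as the scale (illustrative only; the train_tpu.py shown above passes scale=1):

```python
# illustrative only: scale the reduced loss by the number of replicas
# (assumes the strategy, model, and config from train_tpu.py above)
with strategy.scope():
    mseLoss = MSELoss(scale=strategy.num_replicas_in_sync)
    model.compile(loss=mseLoss, optimizer=config.OPTIMIZER)
```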
```
$ python train_tpu.py
Epoch 1/5
58/58 [==============================] - 8s 60ms/step - loss: 0.1608 - val_loss: 0.0773
Epoch 2/5
58/58 [==============================] - 2s 31ms/step - loss: 0.0555 - val_loss: 0.0384
Epoch 3/5
58/58 [==============================] - 2s 30ms/step - loss: 0.0249 - val_loss: 0.0151
Epoch 4/5
58/58 [==============================] - 2s 27ms/step - loss: 0.0124 - val_loss: 0.0099
Epoch 5/5
58/58 [==============================] - 2s 27ms/step - loss: 0.0087 - val_loss: 0.0072
```
We train the model with the distribution strategy that we want. Here, we see that it takes ~2 sec to train for an epoch.
In Table 1, we see the timing comparison of the various hardware used. A subtle difference that can go unnoticed is the batch size used for the different hardware. With CPUs and GPUs, the batch size was set to 128, while with TPUs, the batch size went up to 1024. The larger batch size means fewer steps per epoch, which makes an epoch quicker to train with TPUs.
A problem with large batch sizes and fewer steps per epoch lies in the gradient updates. With larger batch sizes, the gradients are backpropagated fewer times per epoch, so the model needs many more epochs to train. This problem can be bypassed by using a repeated dataset and a fixed number of steps per epoch, constraining the model to backpropagate a fixed number of times in an epoch.
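A small sketch of that workaround (the step count is a placeholder based on the runs above, and trainDs and model come from the training script):

```python
# placeholder value: 468 gradient updates per epoch, matching the
# CPU/GPU runs above
STEPS_PER_EPOCH = 468

# repeat the dataset so it never runs out of data, and let
# steps_per_epoch decide when an epoch ends
trainDs = trainDs.repeat()
model.fit(trainDs, epochs=5, steps_per_epoch=STEPS_PER_EPOCH)
```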
We can better understand the problem with the help of the predictions that we get while training the MNIST autoencoder.
In Figure 8, we notice the blurred prediction from the TPU model. This proves our point. We indeed fit more data in the TPU. However, due to fewer iterations in an epoch, the model learns worse than with a GPU or a CPU.
What's next? I recommend PyImageSearch University.
30+ total classes • 39h 44m video • Last updated: 12/2021
★★★★★ 4.84 (128 Ratings) • 3,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 30+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 30+ Certificates of Completion
- ✓ 39h 44m on-demand video
- ✓ Brand new courses released every month, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 500+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
When you quickly iterate over a training process, you will eventually need GPUs and TPUs to train your model. The training time reported above should excite you to try porting some of your existing models to be compatible with TPUs or multiple GPUs. You can harness TPUs from Kaggle Notebooks or Google Colab Notebooks.
It would be great to see people coming in and posting their implementations with a multi-GPU or a TPU setup. Also, please remember to mention @pyimagesearch on Twitter when you share your work with the world.
Want free GPU credits to train models?
- We used Jarvislabs.ai, a GPU cloud, for all the experiments.
- We are proud to offer PyImageSearch University students $20 worth of Jarvislabs.ai GPU cloud credits. Join PyImageSearch University and claim your $20 credit here.
In Deep Learning, we need to train Neural Networks. These Neural Networks can be trained on a CPU but take a lot of time. Moreover, sometimes these networks do not even fit (run) on a CPU.
To overcome this problem, we use GPUs, which take your Neural Network and train it quickly. The problem is that GPUs are expensive and become outdated quickly, so you don’t want to buy one and use it only occasionally. Cloud GPUs let you use a GPU and pay only for the time you are running it. It’s a brilliant idea that saves you money.
JarvisLabs provides the best-in-class GPUs, and PyImageSearch University students get between 10 - 50 hours on a world-class GPU (time depends on the specific GPU you select).
This gives you a chance to test-drive a monstrously powerful GPU on any of our tutorials in a jiffy. So join PyImageSearch University today and try for yourself.
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!
Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!
Comment section
Hey, Adrian Rosebrock here, author and creator of PyImageSearch. While I love hearing from readers, a couple years ago I made the tough decision to no longer offer 1:1 help over blog post comments.
At the time I was receiving 200+ emails per day and another 100+ blog post comments. I simply did not have the time to moderate and respond to them all, and the sheer volume of requests was taking a toll on me.
Instead, my goal is to do the most good for the computer vision, deep learning, and OpenCV community at large by focusing my time on authoring high-quality blog posts, tutorials, and books/courses.
If you need help learning computer vision and deep learning, I suggest you refer to my full catalog of books and courses — they have helped tens of thousands of developers, students, and researchers just like yourself learn Computer Vision, Deep Learning, and OpenCV.
Click here to browse my full catalog.