In this post, we implement two GAN variants: Wasserstein GAN (WGAN) and Wasserstein GAN with Gradient Penalty (WGAN-GP), to address the training instability discussed in my previous post, GAN Training Challenges: DCGAN for Color Images. We will train the WGAN and WGAN-GP models to generate colorful 64×64
anime faces.
This is the fourth post of our GAN tutorial series:
- Intro to Generative Adversarial Networks (GANs)
- Get Started: DCGAN for Fashion-MNIST
- GAN Training Challenges: DCGAN for Color Images
- Anime Faces with WGAN and WGAN-GP (this tutorial)
We will first walk through a WGAN tutorial step-by-step focusing on the new concepts introduced by the WGAN paper. Then we discuss how to improve WGAN with a few changes to make WGAN-GP.
Wasserstein GAN
The Wasserstein GAN (WGAN) was introduced in the paper Wasserstein GAN. Its main contribution was to use the Wasserstein loss to address the GAN training instability issues, which was a major breakthrough for GAN training.
Recall in DCGAN, when the discriminator is too weak or too strong, it won’t give the generator useful feedback for making improvements. Training longer doesn’t necessarily make the DCGAN model better.
With WGAN, these training issues can be solved with the new Wasserstein loss: we no longer need a careful balance in the training of the discriminator and generator, or a careful design of the network architecture. The WGAN critic has linear gradients that are continuous and differentiable almost everywhere (Figure 1), which solves the vanishing gradient problem of regular GAN training.
Here are a few new concepts or key changes introduced in the WGAN paper:
- Wasserstein distance (or Earth mover’s distance): measures the effort needed to transform one distribution into another.
- Wasserstein loss: a new loss function that measures the Wasserstein distance.
- The discriminator is now called a critic in WGAN. Instead of training a discriminator (a binary classifier) to tell whether an image is real or fake (generated), we train a critic that outputs a number.
- The critic must meet the Lipschitz constraint for the Wasserstein loss to work.
- WGAN uses weight clipping to enforce the 1-Lipschitz constraint.
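To make the loss concrete before we get to the code, here is the form of the Wasserstein losses we will implement later in this tutorial, where D is the critic, G the generator, x a real image, z a latent vector, and E an expectation over a batch:

critic loss: L_critic = E[D(G(z))] - E[D(x)]
generator loss: L_generator = -E[D(G(z))]

Minimizing L_critic pushes the critic's scores up on real images and down on generated ones, while minimizing L_generator pushes the generator toward images the critic scores highly. These correspond to the d_wasserstein_loss and g_wasserstein_loss functions we define below.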
As we implement each new GAN architecture, I will highlight the changes compared with a previous GAN variant to help you learn the new concepts. Here are the key changes comparing a WGAN with DCGAN:
Table 1 summarizes the changes needed to update a DCGAN to a WGAN.
Now let’s walk through the code to implement these changes in WGAN with TensorFlow 2 / Keras. While following the tutorial below, please refer to the WGAN Colab notebook here for the complete code.
Setup
First, make sure to set the Colab runtime's hardware accelerator to GPU. Then we import all the libraries needed (e.g., TensorFlow 2, Keras, and Matplotlib).
Prepare the Data
We will train the WGAN with a dataset called Anime Face Dataset from Kaggle, which is a collection of anime faces scraped from www.getchu.com. There are 63,565 small color images, which we resize to 64×64 for training.
To download data from Kaggle, you will need to provide your Kaggle credentials. You could either upload the kaggle.json file to Colab or put your Kaggle username and key in the notebook. We chose the latter option.
os.environ['KAGGLE_USERNAME'] = "enter-your-own-user-name"
os.environ['KAGGLE_KEY'] = "enter-your-own-key"
Download and unzip the data into a directory called datasets.
!kaggle datasets download -d splcher/animefacedataset -p datasets
!unzip datasets/animefacedataset.zip -d datasets/
After downloading and unzipping the data, we set a directory where the images are.
anime_data_dir = "/content/datasets/images"
Then we use the Keras utility function image_dataset_from_directory to create a tf.data.Dataset from the images in the directory, which will be used for training the model later on. We specify an image size of 64×64 and a batch size of 256.
train_images = tf.keras.utils.image_dataset_from_directory(
    anime_data_dir, label_mode=None, image_size=(64, 64), batch_size=256)
Let’s visualize one random training image.
image_batch = next(iter(train_images))
random_index = np.random.choice(image_batch.shape[0])
random_image = image_batch[random_index].numpy().astype("int32")
plt.axis("off")
plt.imshow(random_image)
plt.show()
Here is what this random training image looks like (Figure 2).
Same as before, we normalize the images to the range of [-1, 1] because the generator's final layer activation uses tanh. Finally, we apply the normalization by using the map function of the tf.data.Dataset with a lambda function.
train_images = train_images.map(lambda x: (x - 127.5) / 127.5)
The Generator
There is no change to the WGAN generator architecture, which is the same as in DCGAN. We create the generator architecture with the Keras Sequential API in the build_generator function. Refer to my previous two DCGAN posts, DCGAN for Fashion-MNIST and DCGAN for Color Images, for the details of how to create the generator architecture.
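If you don't have those posts handy, here is a minimal sketch of what a DCGAN-style build_generator for 64×64 color images might look like. The exact filter counts, kernel sizes, and the LATENT_DIM value (assumed to be 100 here) are illustrative and may differ from the Colab notebook.

from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 100  # size of the latent noise vector (assumed value)

def build_generator():
    model = keras.Sequential(name="generator")
    # Project and reshape the latent vector into an 8x8 feature map
    model.add(layers.Dense(8 * 8 * 512, input_dim=LATENT_DIM))
    model.add(layers.BatchNormalization())
    model.add(layers.ReLU())
    model.add(layers.Reshape((8, 8, 512)))
    # Upsample 8x8 -> 16x16 -> 32x32 -> 64x64 with transposed convolutions
    for filters in (256, 128, 64):
        model.add(layers.Conv2DTranspose(filters, (4, 4), strides=(2, 2), padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.ReLU())
    # Final layer outputs a 64x64x3 image in [-1, 1] via tanh
    model.add(layers.Conv2D(3, (4, 4), padding="same", activation="tanh"))
    return model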
After defining the generator architecture in the build_generator() function, we build the generator model with generator = build_generator() and call generator.summary() to visualize the model architecture.
The Critic
In WGAN, we have a critic that assigns a score that measures Wasserstein distance instead of a discriminator for binary classification of real and fake images. Note the critic’s output is now a score instead of a probability. The critic is constrained with a 1-Lipschitz continuity condition.
There are quite a few changes here:
- Rename discriminator to critic
- Use weight clipping to enforce the 1-Lipschitz constraint on the critic
- Change the critic's activation function from sigmoid to linear
Rename discriminator to critic
If you start with the DCGAN code, you will need to rename discriminator to critic. You can use the “Find and replace” feature in Colab to make all the updates.
So now we have a function called build_critic instead of build_discriminator.
Weight clipping
WGAN enforces the 1-Lipschitz constraint by using weight clipping, which we implement by subclassing keras.constraints.Constraint. Refer to the Keras documentation on layer weight constraints for the details. Here is how we create the WeightClipping class:
class WeightClipping(tf.keras.constraints.Constraint):
    def __init__(self, clip_value):
        self.clip_value = clip_value

    def __call__(self, weights):
        return tf.clip_by_value(weights, -self.clip_value, self.clip_value)

    def get_config(self):
        return {'clip_value': self.clip_value}
Then in the build_critic function, we create a constraint that clips the weights to [-0.01, 0.01] with the WeightClipping class.
constraint = WeightClipping(0.01)
Now we add kernel_constraint=constraint to all the Conv2D layers of the critic. For example:
model.add(layers.Conv2D(64, (4, 4), padding="same", strides=(2, 2),
                        kernel_constraint=constraint, input_shape=input_shape))
Linear activation
In the last layer of the critic, we update the activation from sigmoid to linear.
model.add(layers.Dense(1, activation="linear"))
Please note that in Keras, the Dense layer has linear activation by default, so we could have omitted the activation="linear" part and written the code like this:
model.add(layers.Dense(1))
I left activation="linear" there to make it clear that we are changing from sigmoid to linear activation when updating a DCGAN to a WGAN.
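Putting the pieces above together, here is a sketch of what the complete build_critic function might look like. The layer sizes and the use of batch norm are illustrative (and assume the WeightClipping class defined earlier); refer to the Colab notebook for the exact architecture.

def build_critic(height, width, depth):
    input_shape = (height, width, depth)
    constraint = WeightClipping(0.01)  # clip the critic's weights to [-0.01, 0.01]

    model = keras.Sequential(name="critic")
    # Downsample 64x64 -> 32x32 -> 16x16 -> 8x8 with strided convolutions
    model.add(layers.Conv2D(64, (4, 4), padding="same", strides=(2, 2),
                            kernel_constraint=constraint, input_shape=input_shape))
    model.add(layers.LeakyReLU(0.2))
    for filters in (128, 256):
        model.add(layers.Conv2D(filters, (4, 4), padding="same", strides=(2, 2),
                                kernel_constraint=constraint))
        model.add(layers.BatchNormalization())
        model.add(layers.LeakyReLU(0.2))
    # Flatten and output a single unbounded score (linear activation)
    model.add(layers.Flatten())
    model.add(layers.Dense(1, activation="linear"))
    return model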
Now that we have defined the model architecture in the build_critic function, let's build the critic model with critic = build_critic(64, 64, 3) and call critic.summary() to visualize the critic model architecture.
The WGAN Model
We define the WGAN model architecture by subclassing keras.Model and overriding train_step to define custom training loops.
There are a few changes in this section for WGAN:
- Update the critic more frequently than the generator
- No more image labels for the critic
- Use Wasserstein loss instead of Binary Crossentropy (BCE) loss
Update the critic more often than the generator
Per the paper's recommendation, we update the critic five times for each generator update. To achieve this, we pass an additional argument called critic_extra_steps to __init__ of the WGAN class.
def __init__(self, critic, generator, latent_dim, critic_extra_steps):
    ...
    self.c_extra_steps = critic_extra_steps
    ...
Then in train_step(), we use a for loop to apply the extra training steps.
for i in range(self.c_extra_steps):
    # Step 1. Train the critic
    ...

# Step 2. Train the generator
Image labels
Depending on how we write the Wasserstein loss functions, we could either 1) assign ones as the labels of the real images and negative ones as the labels of the fake images, or 2) not assign any labels at all.
Here is a brief explanation of the two options. When using labels, the Wasserstein loss is calculated as tf.reduce_mean(y_true * y_pred). If we define the critic loss as the loss on real images plus the loss on fake images, and the generator loss on fake images only, this leads to tf.reduce_mean(1 * pred_real - 1 * pred_fake) for the critic (the quantity it tries to maximize) and -tf.reduce_mean(pred_fake) for the generator loss.
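For completeness, a label-based version (option 1) might look like the following sketch; the sign convention is chosen so that minimizing the loss matches the label-free functions we define below.

# Option 1 (not used in this tutorial): a single Wasserstein loss with labels,
# where real images are labeled +1 and fake images are labeled -1. Minimizing
# this loss pushes the critic's scores up on real images and down on fake ones.
def wasserstein_loss(y_true, y_pred):
    return -tf.reduce_mean(y_true * y_pred)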
Note that the critic's objective is not to assign a label of 1 or -1; instead, it tries to maximize the difference between its predictions on real images and its predictions on fake images. So in the case of the Wasserstein loss, the labels don't really matter much.
So we choose the latter option of not assigning labels, and you will see that all the code for real and fake labels has been removed.
Wasserstein loss
The Wasserstein loss functions for the critic and the generator get passed in through model.compile:
def compile(self, d_optimizer, g_optimizer, d_loss_fn, g_loss_fn):
    super(WGAN, self).compile()
    ...
    self.d_loss_fn = d_loss_fn
    self.g_loss_fn = g_loss_fn
Then in train_step, we use these functions to calculate the critic loss and the generator loss, respectively, during training.
def train_step(self, real_images):
    for i in range(self.c_extra_steps):
        # Step 1. Train the critic
        ...
        d_loss = self.d_loss_fn(pred_real, pred_fake)  # critic loss

    # Step 2. Train the generator
    ...
    g_loss = self.g_loss_fn(pred_fake)  # generator loss
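If it helps to see how these fragments fit together, here is a sketch of a complete train_step. It assumes the WGAN class stores the latent dimension as self.latent_dim and that compile() stores the optimizers as self.d_optimizer and self.g_optimizer; the Colab notebook may differ in the details.

def train_step(self, real_images):
    batch_size = tf.shape(real_images)[0]

    # Step 1. Train the critic c_extra_steps times per generator update
    for i in range(self.c_extra_steps):
        noise = tf.random.normal(shape=(batch_size, self.latent_dim))
        with tf.GradientTape() as tape:
            pred_real = self.critic(real_images, training=True)
            fake_images = self.generator(noise, training=True)
            pred_fake = self.critic(fake_images, training=True)
            d_loss = self.d_loss_fn(pred_real, pred_fake)  # critic loss
        # The WeightClipping constraint clips the critic's weights after each update
        d_grads = tape.gradient(d_loss, self.critic.trainable_variables)
        self.d_optimizer.apply_gradients(zip(d_grads, self.critic.trainable_variables))

    # Step 2. Train the generator once
    noise = tf.random.normal(shape=(batch_size, self.latent_dim))
    with tf.GradientTape() as tape:
        fake_images = self.generator(noise, training=True)
        pred_fake = self.critic(fake_images, training=True)
        g_loss = self.g_loss_fn(pred_fake)  # generator loss
    g_grads = tape.gradient(g_loss, self.generator.trainable_variables)
    self.g_optimizer.apply_gradients(zip(g_grads, self.generator.trainable_variables))

    return {"d_loss": d_loss, "g_loss": g_loss}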
Keras Callback for Training Monitoring
Same code as DCGAN with no change — override Keras Callback to monitor and visualize the generated images during training.
class GANMonitor(keras.callbacks.Callback):
    def __init__(self, num_img, latent_dim):
        ...

    def on_epoch_end(self, epoch, logs=None):
        ...

    def on_train_end(self, logs=None):
        ...
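As a rough sketch of what those methods might contain (assuming the WGAN model exposes its generator as self.generator, and that tf, keras, and matplotlib.pyplot as plt were imported in the setup section):

class GANMonitor(keras.callbacks.Callback):
    def __init__(self, num_img=16, latent_dim=100):
        super().__init__()
        self.num_img = num_img
        self.latent_dim = latent_dim
        # Fixed noise so we can watch the same latent points evolve across epochs
        self.seed = tf.random.normal([num_img, latent_dim])

    def on_epoch_end(self, epoch, logs=None):
        generated = self.model.generator(self.seed, training=False)
        generated = (generated * 127.5) + 127.5  # rescale from [-1, 1] to [0, 255]
        fig = plt.figure(figsize=(4, 4))
        for i in range(self.num_img):  # assumes num_img is 16 for a 4x4 grid
            plt.subplot(4, 4, i + 1)
            plt.imshow(generated[i].numpy().astype("uint8"))
            plt.axis("off")
        plt.savefig(f"generated_epoch_{epoch + 1:03d}.png")
        plt.close(fig)

    def on_train_end(self, logs=None):
        self.model.generator.save("generator.h5")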
Compile and Train WGAN
Putting together the WGAN model
We put together the wgan model with the WGAN class defined above. Note we need to set the extra training steps for the critic to 5, per the WGAN paper.
wgan = WGAN(critic=critic,
            generator=generator,
            latent_dim=LATENT_DIM,
            critic_extra_steps=5)  # UPDATE for WGAN
Wasserstein loss functions
As mentioned before, the main change in WGAN is the usage of Wasserstein loss. Here is how to calculate Wasserstein loss for the critic and the generator — by defining custom loss functions in Keras.
# Wasserstein loss for the critic
def d_wasserstein_loss(pred_real, pred_fake):
    real_loss = tf.reduce_mean(pred_real)
    fake_loss = tf.reduce_mean(pred_fake)
    return fake_loss - real_loss

# Wasserstein loss for the generator
def g_wasserstein_loss(pred_fake):
    return -tf.reduce_mean(pred_fake)
Compile WGAN
Now we compile the wgan model with the RMSprop optimizer and a learning rate of 0.00005, per the WGAN paper.
LR = 0.00005  # UPDATE for WGAN: learning rate per the WGAN paper

wgan.compile(
    d_optimizer = keras.optimizers.RMSprop(learning_rate=LR, clipvalue=1.0, decay=1e-8),  # UPDATE for WGAN: RMSprop instead of Adam
    g_optimizer = keras.optimizers.RMSprop(learning_rate=LR, clipvalue=1.0, decay=1e-8),  # UPDATE for WGAN: RMSprop instead of Adam
    d_loss_fn = d_wasserstein_loss,
    g_loss_fn = g_wasserstein_loss
)
Note that in DCGAN, we use keras.losses.BinaryCrossentropy(), while for WGAN, we use the custom wasserstein_loss functions defined above. These two wasserstein_loss functions get passed in through model.compile(). They are then used in the custom training loop, as discussed in the section on overriding train_step above.
Train the WGAN model
Now we simply call model.fit() to train the wgan model!
NUM_EPOCHS = 50  # number of epochs

wgan.fit(train_images, epochs=NUM_EPOCHS, callbacks=[GANMonitor(num_img=16, latent_dim=LATENT_DIM)])
Wasserstein GAN with Gradient Penalty
While WGAN improves training stability with the Wasserstein loss, even the paper itself admits that “weight clipping is a clearly terrible way to enforce a Lipschitz constraint.” A large clipping parameter can lead to slow training and prevent the critic from reaching optimality, while a clipping parameter that is too small can easily lead to vanishing gradients, the exact problem WGAN was proposed to solve.
The Wasserstein GAN with Gradient Penalty (WGAN-GP) was introduced in the paper Improved Training of Wasserstein GANs. It further improves WGAN by using a gradient penalty instead of weight clipping to enforce the 1-Lipschitz constraint on the critic.
We only need to make a few changes to update a WGAN to a WGAN-GP:
- Remove batch norm from the critic’s architecture.
- Use gradient penalty instead of weight clipping to enforce the Lipschitz constraint.
- Use Adam optimizer (α = 0.0002, β1 = 0.5, β2 = 0.9) instead of RMSProp.
Please refer to the WGAN-GP Colab notebook here for the complete code example. Here in this tutorial, we discuss only the incremental changes for updating a WGAN to a WGAN-GP.
Add Gradient Penalty
The gradient penalty penalizes the critic when the norm of its gradient (with respect to its input) deviates from 1. Here is how we calculate it in Keras:
def gradient_penalty(self, batch_size, real_images, fake_images):
    """Calculates the gradient penalty.

    The gradient penalty is calculated on an interpolated image
    and added to the critic's loss.
    """
    alpha = tf.random.normal([batch_size, 1, 1, 1], 0.0, 1.0)
    diff = fake_images - real_images
    # 1. Create the interpolated image
    interpolated = real_images + alpha * diff

    with tf.GradientTape() as gp_tape:
        gp_tape.watch(interpolated)
        # 2. Get the critic's output for the interpolated image
        pred = self.critic(interpolated, training=True)

    # 3. Calculate the gradients w.r.t. the interpolated image
    grads = gp_tape.gradient(pred, [interpolated])[0]
    # 4. Calculate the norm of the gradients
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]))
    # 5. Calculate the gradient penalty
    gradient_penalty = tf.reduce_mean((norm - 1.0) ** 2)
    return gradient_penalty
Then in train_step, we calculate the gradient penalty and add it to the original critic loss. Note that the penalty weight (the coefficient lambda λ) controls the magnitude of the penalty, and it is set to 10 per the WGAN-GP paper.
gp = self.gradient_penalty(batch_size, real_images, fake_images)
d_loss = self.d_loss_fn(pred_real, pred_fake) + gp * self.gp_weight
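To show where this fits, here is a sketch of the critic portion of the WGAN-GP train_step (same assumptions as the WGAN train_step sketch earlier, plus a self.gp_weight attribute set to 10). The key point is that the penalty is computed inside the critic's GradientTape so it contributes to the critic's gradients.

for i in range(self.c_extra_steps):
    noise = tf.random.normal(shape=(batch_size, self.latent_dim))
    with tf.GradientTape() as tape:
        pred_real = self.critic(real_images, training=True)
        fake_images = self.generator(noise, training=True)
        pred_fake = self.critic(fake_images, training=True)
        # Gradient penalty added to the Wasserstein critic loss
        gp = self.gradient_penalty(batch_size, real_images, fake_images)
        d_loss = self.d_loss_fn(pred_real, pred_fake) + gp * self.gp_weight
    d_grads = tape.gradient(d_loss, self.critic.trainable_variables)
    self.d_optimizer.apply_gradients(zip(d_grads, self.critic.trainable_variables))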
Remove batchnorm
While batch normalization helps stabilize GAN training, it doesn't work with the gradient penalty, because the penalty is applied to the norm of the critic's gradient with respect to each input independently and not to the entire batch. So we need to remove the batch norm code from the critic's model architecture.
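As a concrete (hypothetical) example, a downsampling block in the WGAN-GP critic would drop both the weight clipping constraint and the batch norm layer:

# WGAN critic block: weight clipping constraint + batch norm
# model.add(layers.Conv2D(128, (4, 4), padding="same", strides=(2, 2), kernel_constraint=constraint))
# model.add(layers.BatchNormalization())
# model.add(layers.LeakyReLU(0.2))

# WGAN-GP critic block: no weight clipping and no batch norm
model.add(layers.Conv2D(128, (4, 4), padding="same", strides=(2, 2)))
model.add(layers.LeakyReLU(0.2))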
Adam Optimizer instead of RMSProp
DCGAN uses the Adam optimizer, and for WGAN, we switch to the RMSProp optimizer. Now for WGAN-GP, we switch back to Adam optimizer with a learning rate of 0.0002 per the WGAN-GP paper recommendation.
LR = 0.0002  # WGAN-GP paper recommends lr of 0.0002

d_optimizer = keras.optimizers.Adam(learning_rate=LR, beta_1=0.5, beta_2=0.9)
g_optimizer = keras.optimizers.Adam(learning_rate=LR, beta_1=0.5, beta_2=0.9)
We compile and train the WGAN-GP model for 50 epochs, and we observe more stable training and better image quality generated by the model.
Figure 3 compares the real (training) images and images generated by WGAN and WGAN-GP, respectively.
Both WGAN and WGAN-GP have improved training stability. The tradeoff is that their training converges slower than DCGAN, and the image quality may be slightly worse; however, with the improved training stability, we can use much more complex generator network architectures, which result in improved image quality. Many later GAN variants adopted the Wasserstein loss and gradient penalty as default, for example, ProGAN and StyleGAN. Even the TF-GAN library uses the Wasserstein loss by default.
Summary
In this post, you learned how to use WGAN and WGAN-GP to improve GAN training stability. You learned about incremental changes moving from a DCGAN to WGAN, then from a WGAN to WGAN-GP with TensorFlow 2 / Keras. You learned how to generate anime faces with WGAN and WGAN-GP. In the next post, we will learn about conditional GAN and image-to-image translation with Pix2Pix.
Citation Information
Maynard-Reid, M. “Anime Faces with WGAN and WGAN-GP,” PyImageSearch, 2022, https://hcl.pyimagesearch.com/2022/02/07/anime-faces-with-wgan-and-wgan-gp/
@article{Maynard-Reid_2022_Anime_Faces,
  author = {Margaret Maynard-Reid},
  title = {Anime Faces with {WGAN} and {WGAN-GP}},
  journal = {PyImageSearch},
  year = {2022},
  note = {https://hcl.pyimagesearch.com/2022/02/07/anime-faces-with-wgan-and-wgan-gp/},
}