
Variational Autoencoders (VAEs) for Dummies — Step By Step Tutorial


Motivation

Why do we need to generate new data when there is already so much data? According to IDC, there are more than 18 zettabytes of data out there.

Most machine learning tasks require labeled data. Getting high-quality, labeled data is difficult. If we are generating that data ourselves, we can make as much of it as we like. New data can give us ideas and options.

How to generate unseen images?

In this article, you will learn what a Variational Autoencoder is and how to create your own to generate new, unseen images. We explain the underlying concepts and intuition without math.

Example of image and its reconstruction using our VAE code (self-created)

The Data

We will be using a subset of the well-known celebrity dataset to help us build a facial generative model. The dataset can be downloaded as described on the CelebA (CelebFaces Attributes) website. It provides a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations:

  • 10,177 identities,
  • 202,599 face images,
  • 5 landmark locations per image, and
  • 40 binary attribute annotations per image.

Below we pick random faces from the celebrity dataset and display their metadata (attributes). The images are 218 px high and 178 px wide, with 3 color channels.

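Picking and displaying such faces can be sketched as follows; the file locations (img_align_celeba/ and list_attr_celeba.csv) are assumptions and may differ in your download:

    # A sketch (not the article's exact code): show random CelebA faces
    # together with a few of their positive attribute annotations.
    import pandas as pd
    import matplotlib.pyplot as plt
    from PIL import Image

    # 40 binary attributes per image, encoded as -1/1 (path is an assumption)
    attrs = pd.read_csv("list_attr_celeba.csv", index_col=0)
    sample = attrs.sample(4)

    fig, axes = plt.subplots(1, 4, figsize=(12, 4))
    for ax, (fname, row) in zip(axes, sample.iterrows()):
        img = Image.open(f"img_align_celeba/{fname}")  # 218 x 178 RGB image
        ax.imshow(img)
        ax.set_title(", ".join(row[row == 1].index[:3]), fontsize=8)
        ax.axis("off")
    plt.show()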

What is an Autoencoder (AE)?

By looking at thousands of celebrity faces, a neural network can learn to generate faces of humans who do not exist.

Neural networks are one of many possible methods we can use to obtain a function approximation. A reason why they became popular is their ability to learn representations. Networks are able to learn the specific representations that matter when classifying an image as a dog or a cat, assuming we provide them with the correct labels. This is supervised learning.

In some cases, we do not have these labels. Nevertheless, we can train two networks, one that learns the representation, and one that reconstructs from the representation by minimizing the reconstruction loss function. This is an autoencoder. It gets that name because it automatically finds the best way to encode the input so that the decoded version is as close as possible to the input.

An Autoencoder is made of a pair of connected neural networks: an encoder model and a decoder model. Its goal is to find a way to encode the celebrity faces into a compressed form (latent space) in such a way that the reconstructed version is as close as possible to the input.

Working components of an autoencoder (self-created)

The encoder model turns the input x into a small dense representation z, similar to how a convolutional neural network uses filters to learn representations.

The decoder model can be seen as a generative model which is able to generate specific features x'.

Both encoder and decoder are usually trained as a whole. The loss function penalizes the network for creating output faces which differ from input faces.

Therefore the encoder learns to preserve as much of the relevant information as possible within the limits of the latent space, and to cleverly discard irrelevant parts, e.g. noise.

The decoder learns to take the compressed latent information and reconstruct it into a full celebrity face.
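To make the encoder/decoder pair concrete, here is a deliberately tiny autoencoder in Keras. The layer sizes are illustrative assumptions, not the architecture used later in this article:

    # A toy autoencoder: flatten the image, squeeze it through a small
    # latent layer (the encoder), then expand it back (the decoder).
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    inp = layers.Input(shape=(64, 64, 3))
    x = layers.Flatten()(inp)
    z = layers.Dense(16, activation="relu", name="latent")(x)  # encoder: x -> z
    x = layers.Dense(64 * 64 * 3, activation="sigmoid")(z)     # decoder: z -> x'
    out = layers.Reshape((64, 64, 3))(x)

    autoencoder = Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss

Training it on pairs (x, x) minimizes the reconstruction error between input and output, which is exactly the objective described above.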

An Autoencoder can also be useful for dimensionality reduction and denoising images, and can even be successful in unsupervised machine translation.

What is a Variational Autoencoder (VAE)?

Typically, the latent space z produced by the encoder is sparsely populated, meaning that it is difficult to predict the distribution of values in that space. Values are scattered, and the space will appear poorly utilized in a 2D representation.

This is a very good property for compression systems. However, for generating new celebrity images this sparsity is an issue, because finding a latent value for which the decoder will know how to produce a valid image is almost impossible.

Furthermore, if the space has gaps between clusters, and the decoder receives a variation from there, it will lack the knowledge to generate something useful.

A Variational Autoencoder works by making the latent space more predictable, more continuous, and less sparse. By forcing latent variables to become normally distributed, VAEs gain control over the latent space.


From AE to VAE using random variables (self-created)

Instead of forwarding the latent values to the decoder directly, VAEs use them to calculate a mean and a standard deviation. The input to the decoder is then sampled from the corresponding normal distribution.

During training, VAEs force this normal distribution to be as close as possible to the standard normal distribution by including the Kullback–Leibler divergence in the loss function. The VAE can then alter, or explore variations on, the faces, not just in a random way but in a desired, specific direction.
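A minimal sketch of these two ingredients in tf.keras (an illustrative implementation, not the article's exact code):

    # The "reparameterization trick": z = mu + sigma * eps with eps ~ N(0, I),
    # so sampling stays differentiable with respect to mu and log(sigma^2).
    import tensorflow as tf
    from tensorflow.keras import layers

    class Sampling(layers.Layer):
        def call(self, inputs):
            z_mean, z_log_var = inputs
            eps = tf.random.normal(tf.shape(z_mean))
            return z_mean + tf.exp(0.5 * z_log_var) * eps

    def kl_loss(z_mean, z_log_var):
        # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1)
        return -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))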

Conditional Variational Autoencoders allow us to model the input based on both the latent variable z and additional information such as metadata of the face (smile, glasses, skin color, etc.).

The Image Data Generator

Let’s build a (conditional) VAE that is able to learn from celebrity faces. We use a custom, memory-efficient Keras generator in order to deal with our large dataset (202,599 images, ca. 10 KB each). The idea is to load batches of images on the fly during the training process.
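One way to implement such a generator is with keras.utils.Sequence, as sketched below; the 64x64 resize and the aligned filenames/attributes arrays are assumptions:

    # A memory-efficient generator: loads and normalizes one batch of images
    # (plus their attribute vectors) at a time instead of the whole dataset.
    import numpy as np
    from PIL import Image
    from tensorflow import keras

    class CelebaSequence(keras.utils.Sequence):
        def __init__(self, filenames, attributes, batch_size=32, size=(64, 64)):
            self.filenames, self.attributes = filenames, attributes  # aligned
            self.batch_size, self.size = batch_size, size

        def __len__(self):
            return len(self.filenames) // self.batch_size

        def __getitem__(self, idx):
            batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
            imgs = np.stack([
                np.asarray(Image.open(f).resize(self.size), dtype=np.float32) / 255.0
                for f in self.filenames[batch]])
            conds = self.attributes[batch]  # 40 binary attributes per image
            return [imgs, conds], imgs      # inputs (x, c); target is x itself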

The VAE Network

We want the encoder to be a convolutional neural network that takes an image and outputs the parameters of the distribution Q(z|[x, c]) where x is the input image of the face, c is the conditioning variable (attributes of the face) and z is the latent variable. For the purpose of this article, we use a simple architecture made of two convolutional layers and a pooling layer.
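A sketch of such an encoder, reusing the tf and layers imports from the sampling sketch and assuming images resized to 64x64 as in the generator above; the filter counts are illustrative:

    # Conditional encoder Q(z|[x, c]): tile the 40 attributes over the image
    # grid, concatenate them as extra channels, then apply two convolutional
    # layers and a pooling layer before projecting to (z_mean, z_log_var).
    latent_dim, n_attrs = 16, 40

    x_in = layers.Input(shape=(64, 64, 3), name="image")
    c_in = layers.Input(shape=(n_attrs,), name="attributes")

    c_map = layers.Reshape((1, 1, n_attrs))(c_in)
    c_map = layers.UpSampling2D(size=(64, 64))(c_map)
    h = layers.Concatenate()([x_in, c_map])

    h = layers.Conv2D(32, 3, activation="relu", padding="same")(h)
    h = layers.MaxPooling2D()(h)
    h = layers.Conv2D(64, 3, activation="relu", padding="same")(h)
    h = layers.Flatten()(h)

    z_mean = layers.Dense(latent_dim, name="z_mean")(h)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(h)
    encoder = tf.keras.Model([x_in, c_in], [z_mean, z_log_var], name="encoder")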

The decoder is a convolutional neural network built the other way around. It is a generative network that outputs parameters of the likelihood distribution P(x|[z, c]).
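A matching decoder sketch, again with illustrative layer sizes, reusing latent_dim and n_attrs from the encoder block:

    # Conditional decoder P(x|[z, c]): concatenate z with the attributes,
    # expand to a feature map, and upsample back to a 64x64 RGB image.
    z_in = layers.Input(shape=(latent_dim,), name="z")
    c_dec = layers.Input(shape=(n_attrs,), name="attributes")

    h = layers.Concatenate()([z_in, c_dec])
    h = layers.Dense(32 * 32 * 64, activation="relu")(h)
    h = layers.Reshape((32, 32, 64))(h)
    h = layers.Conv2DTranspose(32, 3, strides=2, activation="relu", padding="same")(h)
    x_out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(h)  # pixels in [0, 1]
    decoder = tf.keras.Model([z_in, c_dec], x_out, name="decoder")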

The architecture of the whole VAE network is created as follows.
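One plausible wiring of the pieces above, sketched for tf.keras (TF 2.x); the MSE reconstruction loss is an assumption, as the article's exact loss may differ:

    # Full conditional VAE: encode, sample z with the reparameterization
    # trick, decode, and train on reconstruction (MSE) plus the KL term.
    x_in = layers.Input(shape=(64, 64, 3), name="face")
    c_in = layers.Input(shape=(40,), name="cond")

    z_mean, z_log_var = encoder([x_in, c_in])   # Q(z|[x, c])
    z = Sampling()([z_mean, z_log_var])         # reparameterization trick
    x_rec = decoder([z, c_in])                  # P(x|[z, c])

    vae = tf.keras.Model([x_in, c_in], x_rec, name="cvae")
    vae.add_loss(kl_loss(z_mean, z_log_var))    # KL term as auxiliary loss
    vae.compile(optimizer="adam", loss="mse")   # reconstruction term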

Training

The learning process for the VAE model on the images in the CelebA dataset is sketched below. The code ran for approximately 8 hours on an AWS instance using 1 GPU.
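A minimal training call might look like this; train_files and train_attrs stand for the aligned filename list and attribute matrix and are assumptions:

    # Stream batches from the memory-efficient generator defined earlier.
    train_gen = CelebaSequence(train_files, train_attrs, batch_size=64)
    vae.fit(train_gen, epochs=20)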


Visualizing Latent Representations

After training, we can now pick an image randomly from our dataset and use the trained encoder to create a latent representation of the image.


Using the latent representation, a vector of 16 real numbers, we can visualize how the decoder reconstructs the original image.

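The encode/decode round trip can be sketched like this, reusing the generator and the trained encoder/decoder from above:

    # Encode one face into its 16-dimensional latent vector, then decode it.
    import numpy as np

    (imgs, conds), _ = train_gen[0]          # one batch of images + attributes
    i = np.random.randint(len(imgs))
    z_mean, z_log_var = encoder.predict([imgs[i:i+1], conds[i:i+1]])
    reconstruction = decoder.predict([z_mean, conds[i:i+1]])  # decode the mean of Q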

Although the reconstructed image is blurry, we can see a strong similarity with the original image: gender, clothes color, hair, smile, skin color.

Generating Unseen Faces

Conditional VAEs are capable of altering the latent space in order to generate new data. Specifically, we can generate an arbitrary number of unseen images using the decoder, while conditioning it on given attributes.
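Generation can be sketched as follows; "Smiling" is a real CelebA attribute name, while the all-zeros base vector is a simplifying assumption (CelebA encodes attributes as -1/1):

    # Sample z from the standard normal prior and condition the decoder
    # on a chosen attribute vector to generate faces that never existed.
    import numpy as np

    n = 8
    z = np.random.normal(size=(n, 16))              # z ~ N(0, I)
    c = np.zeros((n, 40), dtype=np.float32)
    c[:, attrs.columns.get_loc("Smiling")] = 1.0    # e.g. ask for smiling faces
    unseen_faces = decoder.predict([z, c])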


Although our variational autoencoder produces blurry and non-photorealistic faces, we can recognize the gender, skin color, smile, glasses, and hair color of these humans who never existed.

Give a Smile to Faces

Conditional VAEs are able to interpolate between attributes, and to make a face smile or to add glasses where there were none before. Below, we choose a random celebrity face from the dataset and take advantage of alterations of the latent representation to morph it from a female face into a male face. We also alter the faces to display a smile that was originally not there.
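A sketch of the smile morph, reusing names from the reconstruction example; the interpolation range follows CelebA's -1/1 attribute encoding:

    # Keep the latent code of one face fixed and interpolate one attribute.
    import numpy as np

    steps = 5
    z_fixed = np.repeat(z_mean, steps, axis=0)      # same face in every row
    c_morph = np.repeat(conds[i:i+1], steps, axis=0).astype(np.float32)
    c_morph[:, attrs.columns.get_loc("Smiling")] = np.linspace(-1.0, 1.0, steps)
    morphs = decoder.predict([z_fixed, c_morph])    # gradual no-smile -> smile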


Conclusion

In this article, we introduced Conditional Variational Autoencoders and demonstrated how they are able to learn how to generate new labeled data.

We provided Python code for training VAEs on large celebrity image datasets. The approach and code can be extended to multiple other use cases.

Generative Adversarial Networks (GANs) tend to produce even nicer-looking images, because they learn to differentiate what is photorealistic to humans and what is not.

The ethical aspect of using VAE/GAN technologies to generate fake images, videos or news should be taken seriously, and they should be applied responsibly.


I recommend follow-up reading from Joseph Rocca: Understanding Variational Autoencoders (VAEs).

Many thanks to Vincent Casser for his awesome code, which takes a more advanced approach to implementing convolutional autoencoders for image manipulation, provided on his blog. I obtained explicit authorization from Vincent to adapt his VAE code for the purpose of this article. Building a working VAE from scratch is rather tricky. Credits go to Vincent Casser.

Thanks to Anne Bonner for her editorial notes.

Stay healthy and safe!

