
Background Matting: The World is Your Green Screen

Source: https://towardsdatascience.com/background-matting-the-world-is-your-green-screen-83a3c4f0f635?gi=6a2f4d32bd6f

Using deep learning and GANs to enable professional quality background replacement from your own home

Apr 9 · 7 min read


Do you wish that you could make professional quality videos without a full studio? Or that Zoom’s virtual background function worked better during your video conferences?

Our recently published paper [1] in CVPR 2020 provides a new and easy method to replace your background for a wide variety of applications. You can do this at home in everyday settings, using a fixed or handheld camera. Our method is also state-of-the-art and gives outputs comparable to professional results. In this article we walk through the motivation, technical details, and usage tips for our method.

You can also check out our project page and codebase.

What is Matting?

Matting is the process of separating an image into foreground and background so you can composite the foreground onto a new background. This is the key technique behind the green screen effect, and it is widely used in video production, graphics, and consumer apps. To model this problem, we represent every pixel in the captured image as a combination of foreground and background:

The matting equation: C = αF + (1 − α)B

Our problem is to solve for the foreground (F), background (B), and transparency (alpha) of every pixel given only the captured image (C). This is highly underdetermined: since images have three color channels, we must recover 7 unknowns per pixel (3 for F, 3 for B, and 1 for alpha) from only 3 observed values.
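As a quick illustration, here is a minimal NumPy sketch of the forward direction of that equation, compositing a known foreground onto a background; matting is the inverse problem of recovering F, B, and alpha from C alone. The array names and shapes are illustrative, not from our codebase:

```python
import numpy as np

def composite(fg, bg, alpha):
    """Apply the matting equation C = alpha * F + (1 - alpha) * B.

    fg, bg: float arrays of shape (H, W, 3), values in [0, 1]
    alpha:  float array of shape (H, W, 1), values in [0, 1]
    """
    return alpha * fg + (1.0 - alpha) * bg

# Matting goes the other way: given only C (3 numbers per pixel),
# recover F (3), B (3), and alpha (1) -- 7 unknowns per pixel.
```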

The Problem with Segmentation

One possible approach is to use segmentation to separate the foreground for compositing. Although segmentation has made huge strides in recent years, it does not solve the full matting equation: it assigns a binary (0, 1) label to each pixel to mark foreground or background instead of solving for a continuous alpha value. The effects of this simplification are visible in the following example:

[Figure: This example shows why segmentation does not solve the compositing problem. The segmentation was performed with DeepLab v3+ [2].]

The areas around the edge, particularly in the hair, have a true alpha value between 0 and 1. Therefore, the binary nature of segmentation creates a harsh boundary around the foreground, leaving visible artifacts. Solving for the partial transparency and foreground color allows much better compositing in the second frame.
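To make the difference concrete, here is a simplified NumPy comparison of compositing with a binary mask versus a continuous alpha matte (names and shapes are illustrative, not from our codebase):

```python
import numpy as np

def composite_with_mask(img, new_bg, mask):
    # Binary segmentation: each pixel is copied wholesale from either the
    # captured image or the new background, so hair gets a hard edge.
    hard = (mask > 0.5).astype(img.dtype)[..., None]
    return hard * img + (1 - hard) * new_bg

def composite_with_matte(fg, new_bg, alpha):
    # Alpha matting: partial transparency blends the recovered foreground
    # color with the new background, so fine structures look natural.
    a = alpha[..., None]
    return a * fg + (1 - a) * new_bg
```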

Using A Casually Captured Background

Because matting is a harder problem than segmentation, additional information is often needed to solve this underconstrained problem, even with deep learning.

Many existing methods [3][4][5] use a trimap: a hand-annotated map of known foreground, known background, and unknown regions. Although this is feasible for a single image, annotating every frame of a video is extremely time consuming, so trimaps are not a practical option for this problem.

We choose instead to use a captured background as an estimate of the true background. This makes it easier to solve for the foreground and alpha value. We call it a “casually captured” background because it can contain slight movements, color differences, slight shadows, or similar colors as the foreground.

[Figure: Our capture process. When the subject leaves the scene, we capture the background behind them to help the algorithm.]

The figure above shows how we can easily provide a rough estimate of the true background. As the person leaves the scene, we capture the background behind them. The figure below shows what this looks like:

[Figure: Example of captured input, captured background, and composite on a new background.]

Notice that this example is challenging because the foreground and background colors are very similar (particularly around the hair). It was also recorded with a handheld phone and contains slight background movements.


Tips for Capturing

Although our method tolerates some background perturbation, it works best when the background is constant, ideally indoors. It does not handle highly noticeable shadows cast by the subject, moving backgrounds (e.g. water, cars, trees), or large exposure changes.

[Figure: Failure case. The person was filmed in front of a moving fountain.]

We also recommend capturing the background by having the person leave the scene at the end of the video and pulling that frame from the continuous recording, rather than switching to photo mode: many phones change zoom and exposure settings between video and photo modes. You should also enable the auto-exposure lock when filming with a phone.
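For example, the background plate can be pulled from the end of the recording with a few lines of OpenCV; the filenames below are placeholders:

```python
import cv2

# Grab the final frame of the recording, after the subject has stepped out,
# and save it as the background plate. Filenames are placeholders.
cap = cv2.VideoCapture("capture.mp4")
last_index = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) - 1  # may be approximate for some codecs
cap.set(cv2.CAP_PROP_POS_FRAMES, last_index)
ok, background = cap.read()
if ok:
    cv2.imwrite("background.png", background)
cap.release()
```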

[Figure: The ideal capture scenario. The background is indoors, not moving, and the subject does not cast a shadow.]

A summary of the capture tips:

  1. Choose the most constant background you can find.
  2. Don’t stand too close to the background so you don’t cast a shadow.
  3. Enable auto-exposure and auto-focus locks on the phone.

Is This Method Like Background Subtraction?

Another natural question is whether this is just background subtraction. First, if it were easy to composite onto any background this way, the movie industry would not have spent thousands of dollars on green screens all these years.

[Figure: Background subtraction doesn’t work well with casually captured backgrounds.]

In addition, background subtraction does not solve for partial alpha values, giving the same hard edges as segmentation. It also fails when the foreground and background colors are similar or when there is any motion in the background.
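For comparison, classic background subtraction just thresholds the difference between the frame and the captured background, so the best it can produce is a hard binary mask (a simplified OpenCV sketch, not our method):

```python
import cv2
import numpy as np

def subtract_background(frame, background, threshold=30):
    # Per-pixel color difference between the frame and the captured background.
    diff = cv2.absdiff(frame, background)
    dist = np.linalg.norm(diff.astype(np.float32), axis=2)
    # Thresholding yields a hard 0/1 mask: no partial alpha, and pixels
    # whose foreground color matches the background are missed entirely.
    return (dist > threshold).astype(np.uint8)
```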

Network Details

The network consists of a supervised step followed by an unsupervised refinement. We’ll briefly summarize them here, but for full details you can always check out the paper.

Supervised Learning

To train the network initially, we use the Adobe Composition-1k dataset, which contains 450 carefully annotated ground-truth alpha mattes. We train the network in a fully supervised way, with a per-pixel loss on the output.
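In PyTorch terms, the supervised objective boils down to per-pixel regression losses on the predicted alpha matte and foreground color; the sketch below shows the idea, though it is not our exact loss:

```python
import torch
import torch.nn.functional as F

def supervised_loss(pred_alpha, pred_fg, gt_alpha, gt_fg):
    # Per-pixel L1 losses on the predicted alpha matte and foreground color,
    # compared against the ground-truth annotations.
    alpha_loss = F.l1_loss(pred_alpha, gt_alpha)
    fg_loss = F.l1_loss(pred_fg, gt_fg)
    return alpha_loss + fg_loss
```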

[Figure: The supervised portion of our network. We use several input cues and output an alpha matte and the predicted foreground color. We train on the Adobe Composition-1k dataset with ground truth provided.]

Notice that we take several input cues, including the image, the captured background, a soft segmentation, and temporal motion information. Our novel Context Switching Block keeps the network robust when some of these inputs are unreliable.
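Conceptually, the Context Switching Block learns how much to trust each auxiliary cue given the image itself. The toy module below is a heavily simplified stand-in for that idea (the real block operates on encoder features, and all names here are illustrative):

```python
import torch
import torch.nn as nn

class ToyCueSelector(nn.Module):
    """Heavily simplified stand-in for the Context Switching Block idea:
    predict a spatial weight for an auxiliary cue from the image features,
    so an unreliable cue (e.g. a bad background shot) can be down-weighted."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, image_feat, cue_feat):
        w = self.gate(torch.cat([image_feat, cue_feat], dim=1))
        return w * cue_feat  # cue is suppressed where the gate is near zero
```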

Unsupervised Refinement with GANs

The problem with supervised learning is that the Adobe dataset contains only 450 ground-truth mattes, which is not nearly enough to train a robust network. Obtaining more data is extremely difficult because it requires hand-annotating the alpha matte of each image.

To solve this problem, we use a GAN refinement step. We take the output alpha matte from the supervised network and composite it onto a new background. The discriminator then tries to tell whether the result is a real or fake image. In response, the generator learns to update the alpha matte so the resulting composite looks as real as possible and fools the discriminator.
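Schematically, the generator update looks like the following PyTorch sketch; the `discriminator` argument and tensor names are placeholders rather than our actual training code:

```python
import torch
import torch.nn.functional as F

def generator_step(alpha, fg, new_bg, discriminator):
    # Composite the predicted foreground onto a randomly chosen new background
    # using the matting equation.
    fake = alpha * fg + (1 - alpha) * new_bg
    # The generator is rewarded when the discriminator believes the
    # composite is a real photograph.
    score = discriminator(fake)
    return F.binary_cross_entropy_with_logits(score, torch.ones_like(score))
```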

[Figure: Our unsupervised GAN refinement step. We composite the foreground onto a new background, then a discriminator tries to tell whether the result is real or fake.]

The important part here is that we don’t need any labelled training data. The discriminator was trained with thousands of real images, which are very easy to obtain.

Training the GAN on Your Data

What’s also useful about the GAN is that you can fine-tune the generator on your own images to improve results at test time. Suppose you run the network and the output is not very good: you can update the weights of the generator on that exact data so that it better fools the discriminator. This overfits to your data, but it improves the results on the frames you provided.
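A rough sketch of this test-time adaptation loop is shown below; the model interfaces, optimizer settings, and names are illustrative assumptions rather than our released code:

```python
import torch
import torch.nn.functional as F

def adapt_to_video(generator, discriminator, frames, backgrounds, new_bgs, steps=100):
    # Fine-tune only the generator on the user's own footage; the
    # discriminator stays frozen and just scores the composites.
    opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    for _ in range(steps):
        for frame, bg, new_bg in zip(frames, backgrounds, new_bgs):
            alpha, fg = generator(frame, bg)          # hypothetical interface
            fake = alpha * fg + (1 - alpha) * new_bg  # matting equation
            score = discriminator(fake)
            loss = F.binary_cross_entropy_with_logits(score, torch.ones_like(score))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return generator
```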

Future Work

Although the results we see are quite good, we are continuing to make this method more accurate and easy to use.

In particular, we would like to make the method more robust to background motion, camera movement, shadows, and similar conditions. We are also looking at ways to make it run in real time with fewer computational resources, which could enable a wide variety of use cases in areas like video streaming and mobile apps.

If you have any questions, feel free to reach out to me, Vivek Jayaram, or Soumyadip Sengupta.

