Face Alignment: Deep multi-task learning


1. Introduction

Facial keypoint prediction: given a picture of a face, predict the locations of various facial features.

This problem falls under the umbrella of computer vision and comes with its own challenges.

A lot of work has been done in this space. In this article, we will dive deep into the unique approach proposed by this paper, and implement it from scratch using Keras.

[Image: Deep multi-task learning]

2. Why do we need to solve it? What are its applications?

Have you ever used Snapchat, specifically tried out its image filters? How does it do all of that magic? How does it replace your moustache with an artificial one so accurately?

To do that, it first needs to identify the part of your face that corresponds to the moustache. It then crops out that part (internally, of course) and replaces it with the artificial one.

And this is where facial keypoint detection comes into play: identifying the various parts of the face.

This is just one specific application, and there are tons like this. Check out this article for more.

3. Data Overview

The dataset for this project is provided by the authors of the paper themselves and can be found here.

The data comes with 12,295 images in total, of which 10,000 are training images and 2,295 are test images.

The data also comes with two txt files: training.txt and testing.txt. These two files hold the path of each image, the co-ordinate positions of the facial features, and 4 other facial attributes:

  • 1st attribute: Gender [M/F]
  • 2nd attribute: Smiling / Not smiling
  • 3rd attribute: With glasses / No glasses
  • 4th attribute: Pose variation

3.1 Loading and Cleaning the data

Let’s load the training.txt file and try to understand and analyse the data. If you read training.txt with pandas’ read_csv function using a space as the separator, it will not load correctly, because there is a space at the beginning of each line. So, we need to strip that out.

[Image: training.txt file]

The following code will do exactly that.

f = open('training.txt','r')
f2 = open('training_new.txt','w')
for i,line in enumerate(f.readlines()):
    if i==0:
        continue
    line = line.strip()
    
    f2.write(line)
    f2.write('\n')
f2.close()
f.close()

Now, we’ll use this newly created file training_new.txt in the project. Do the same for the testing.txt file (a small helper for that is sketched below).
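For the test split, the same cleaning can be wrapped in a small helper. This is a minimal sketch, assuming testing.txt has the same layout as training.txt; the output file name testing_new.txt is my own choice.

def strip_leading_spaces(src_path, dst_path):
    # drop the first line (as in the snippet above) and strip the
    # leading space from every remaining line
    with open(src_path, 'r') as f_in, open(dst_path, 'w') as f_out:
        for i, line in enumerate(f_in.readlines()):
            if i == 0:
                continue
            f_out.write(line.strip() + '\n')

strip_leading_spaces('testing.txt', 'testing_new.txt')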

Reading the cleaned training.txt file.

import pandas as pd

names = ['Path'] + list('BCDEFGHIJK') + ['Gender','Smile','Glasses','Pose']
train = pd.read_csv('training_new.txt', sep=' ', header=None, names=names)
train['Path'] = train['Path'].str.replace('\\','/')

Here is the meaning of each attribute in the training file.

  • Path: the path of the image (absolute path)
  • B: x co-ordinate of right eye centre
  • C: x co-ordinate of left eye centre
  • D: x co-ordinate of nose centre
  • E: x co-ordinate of extreme right point of mouth
  • F: x co-ordinate of extreme left point of mouth
  • G: y co-ordinate of right eye centre
  • H: y co-ordinate of left eye centre
  • I: y co-ordinate of nose centre
  • J: y co-ordinate of extreme right point of mouth
  • K: y co-ordinate of extreme left point of mouth
  • Gender: whether the person is male or female, 1: Male, 2: Female
  • Smile: whether the person is smiling or not, 1: Smile, 2: Not smile
  • Glasses: whether the person has glasses or not, 1: Glasses, 2: No glasses
  • Pose: pose of the face, 5 categories.

3.2 Visualize the data

Now, let’s visualize some of the images with the facial keypoints.

Code:

#visualising the dataset
import cv2
import numpy as np
import matplotlib.pyplot as plt

images = []
all_x = []
all_y = []
random_ints = np.random.randint(low=1, high=8000, size=(9,))
for i in random_ints:
    img = cv2.imread(train['Path'].iloc[i])
    x_pts = train[list('BCDEF')].iloc[i].values.tolist()
    y_pts = train[list('GHIJK')].iloc[i].values.tolist()
    all_x.append(x_pts)
    all_y.append(y_pts)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    images.append(img)

fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(14,10))
k = 0
for i in range(0,3):
    for j in range(0,3):
        axs[i,j].imshow(images[k])
        axs[i,j].scatter(all_x[k], all_y[k])
        k += 1
plt.show()

[Image: sample training images with the facial keypoints overlaid]

4. Deep Dive

Now we have an idea of what facial keypoint prediction is all about. Let’s dive deeper and understand the technical details.

The model takes an image as input and gives the co-ordinates of the facial features.

It’s a regression problem, as it predicts continuous values, i.e. the co-ordinates of the facial landmarks.

[Image: Face alignment, magic box?]

What’s that magic thing in the box which does all of this?

Let’s dive deeper and understand it.

As of now there are two broad approaches to this problem: one uses classical computer vision techniques (like Viola-Jones for face bounding-box detection), and the other is deep learning based, especially convolutional neural networks.

But what the heck is this convolutional neural network?

Simply put, it’s a technique used to extract and detect meaningful information from images. If you’re interested in learning more, head over here.
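Just as a quick illustration (not from the original post), here is what a single convolutional layer does to the shape of an input image; the filter count and kernel size below are arbitrary:

from keras.layers import Input, Conv2D
from keras.models import Model

inp = Input(shape=(160, 160, 3))                               # a 160x160 RGB image
feat = Conv2D(16, kernel_size=(5, 5), activation='relu')(inp)  # 16 learned 5x5 filters
print(Model(inp, feat).output_shape)                           # (None, 156, 156, 16): 16 feature maps

Each of the 16 feature maps responds to a different local pattern (edges, blobs, and so on), which is what makes convolutions useful for locating facial features.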

We’ll take the second route for this problem, that is deep learning based.

5. Literature overview

Tremendous work has been done in this space by various researchers. Most of the work poses it as a single-task problem, in which the facial landmark task is solved on its own. But this research paper comes up with an interesting idea: it poses it as a deep multi-task problem.

What is a multi-task problem?

Multi-task problem: instead of solving only one main problem, let’s solve multiple related problems together.

Ex: instead of solving only the facial landmark detection problem, let’s also solve related auxiliary problems, like whether the person in the image is smiling or not, what the gender of the person is, etc.

But why solve multiple tasks together?

The authors of the above-mentioned paper notice a crucial detail about facial landmark detection (the main task): the position of the facial landmarks is highly dependent on whether the person is smiling or not, the pose of the person in the image, and the other supporting attributes. So, they introduce the concept of deep multi-task learning, and found it effective for this problem.

6. Implementation

If you’re already familiar with deep learning, you will have noticed by now that this is a multi-output problem, because we’re trying to solve multiple tasks at the same time. As we’re going to use Keras for the implementation, a multi-output model has to be built with the Functional API, not the Sequential API.
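To illustrate the idea with a toy example (the layer sizes here are arbitrary and unrelated to the paper’s architecture): with the Functional API, several output layers can branch off a shared tensor, which a Sequential model cannot express.

from keras.layers import Input, Dense
from keras.models import Model

inp = Input(shape=(8,))
shared = Dense(16, activation='relu')(inp)       # representation shared by all tasks
out_a = Dense(1, activation='linear')(shared)    # e.g. a regression head
out_b = Dense(2, activation='softmax')(shared)   # e.g. a classification head
toy_model = Model(inp, [out_a, out_b])           # one input, two outputs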

As per the data, we have 5 tasks at hand, of which face alignment is the main one. So, we’re going to train the model on these 5 tasks together using a multi-output model.

We will train the main task(face alignment) with different auxiliary tasks to evaluate the effectiveness of deep multi-task learning.

1st Model: Face Alignment + all other auxiliary tasks(4)

2nd Model: Face Alignment + Gender + Smile + Glasses

3rd Model: Face Alignment + Pose estimation

4th Model: Face Alignment only

Network Architecture

We’re going to use four convolutional layers, 3 max-pooling layers, one dense layer, and separate output layers for each task. The network architecture is the same as the one implemented by the authors of the paper, except for the input shape of the images.

[Image: Network architecture. Source]

6.1 Implementation of the first model

The code is as follows:

from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Model

inp = Input(shape=(160,160,3))

#1st convolution pair
conv1 = Conv2D(16, kernel_size=(5,5), activation='relu')(inp)
mx1 = MaxPooling2D(pool_size=(2,2))(conv1)

#2nd convolution pair
conv2 = Conv2D(48, kernel_size=(3,3), activation='relu')(mx1)
mx2 = MaxPooling2D(pool_size=(2,2))(conv2)

#3rd convolution pair
conv3 = Conv2D(64, kernel_size=(3,3), activation='relu')(mx2)
mx3 = MaxPooling2D(pool_size=(2,2))(conv3)

#4th convolution pair
conv4 = Conv2D(64, kernel_size=(2,2), activation='relu')(mx3)

flt = Flatten()(conv4)
dense = Dense(100, activation='relu')(flt)

reg_op = Dense(10, activation='linear', name='key_point')(dense)
gndr_op = Dense(2, activation='sigmoid', name='gender')(dense)
smile_op = Dense(2, activation='sigmoid', name='smile')(dense)
glasses_op = Dense(2, activation='sigmoid', name='glasses')(dense)
pose_op = Dense(5, activation='softmax', name='pose')(dense)

model = Model(inp, [reg_op, gndr_op, smile_op, glasses_op, pose_op])
model.summary()

This will print out the following output:

[Image: model.summary() output]

Now, the next step is to specify the loss function for each output in Keras. This is pretty straightforward to figure out: we’re going to use mean squared error (MSE) for the facial keypoints, binary cross-entropy for the gender, smile, and glasses outputs, and categorical cross-entropy for the pose output.

loss_dic = {'key_point':'mse','gender':'binary_crossentropy','smile':'binary_crossentropy', 'glasses':'binary_crossentropy' , 'pose':'categorical_crossentropy'}

Internally, the total loss will be the sum of all individual losses. We can also explicitly set a weight for each loss in Keras, and the resulting loss will be the weighted sum of all individual losses.

Since the main task is keypoint detection, we’ll give it the highest weight.

loss_weights = {'key_point':7,'gender':2,'smile':4,'glasses':1,'pose':3}
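With these weights, the scalar loss that Keras minimizes is the weighted sum of the per-output losses, which for our outputs works out to:

total_loss = 7*MSE(key_point) + 2*BCE(gender) + 4*BCE(smile) + 1*BCE(glasses) + 3*CCE(pose)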

Metrics for each task will be:

metrics = {'key_point':'mse','gender':['binary_crossentropy','acc'],'smile':['binary_crossentropy','acc'], 'glasses':['binary_crossentropy','acc'] , 'pose':['categorical_crossentropy','acc']}
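The compile call isn’t shown above, but it follows the same pattern as the other models further down; a minimal sketch:

model.compile(optimizer='adam', loss=loss_dic, loss_weights=loss_weights, metrics=metrics)

The fit call below also assumes the training arrays have already been prepared: train_images (images resized to 160x160), train_keypoint_op (the 10 keypoint co-ordinates), and train_categorical_ops (one-hot versions of gender, smile, glasses, and pose), plus the corresponding validation arrays built the same way from a held-out part of the data. The exact preprocessing lives in the full source linked at the end; the sketch below is my own rough version of it, and the resizing, co-ordinate rescaling, and one-hot encoding are assumptions about reasonable preprocessing rather than a copy of the author’s code.

from keras.utils import to_categorical

def prepare_arrays(df, size=160):
    images, keypoints = [], []
    for _, row in df.iterrows():
        img = cv2.cvtColor(cv2.imread(row['Path']), cv2.COLOR_BGR2RGB)
        h, w = img.shape[:2]
        images.append(cv2.resize(img, (size, size)))
        # rescale the keypoints so they match the resized image
        xs = row[list('BCDEF')].values.astype('float32') * (size / w)
        ys = row[list('GHIJK')].values.astype('float32') * (size / h)
        keypoints.append(np.concatenate([xs, ys]))
    images = np.array(images, dtype='float32')
    keypoints = np.array(keypoints, dtype='float32')
    # labels in the txt files are 1-based (1/2, or 1..5 for pose):
    # shift to 0-based and one-hot encode
    cats = [to_categorical(df[c].values - 1) for c in ['Gender', 'Smile', 'Glasses', 'Pose']]
    return images, keypoints, cats

train_images, train_keypoint_op, train_categorical_ops = prepare_arrays(train)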

All set. Let’s train the network.

epochs = 35
bs = 64

H = model.fit(train_images, [train_keypoint_op,]+train_categorical_ops, epochs=epochs, batch_size=bs, validation_data=(val_images, [val_keypoint_op,]+val_categorical_ops), callbacks=[TrainValTensorBoard(log_dir='./log', write_graph=False)])
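TrainValTensorBoard is a custom callback defined in the full source code linked at the end (judging by its name, it logs training and validation metrics together for TensorBoard). If you don’t need that, Keras’s built-in TensorBoard callback is a reasonable drop-in for basic logging:

from keras.callbacks import TensorBoard
tb = TensorBoard(log_dir='./log', write_graph=False)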

Let’s evaluate the performance of the model.

from sklearn.metrics import mean_squared_error

train_pred = model.predict(train_images)
val_pred = model.predict(val_images)

print('MSE on train data: ', mean_squared_error(train_keypoint_op, train_pred[0]))
print('MSE on validation data: ', mean_squared_error(val_keypoint_op, val_pred[0]))

The above code snippet gives the following output:

MSE on train data:  2.0609966325423565
MSE on validation data:  29.55315040683187

Let’s visualize the result on the validation set (a plotting sketch follows).
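The plotting code isn’t shown in the original post; here is a minimal sketch of how such a figure could be produced, assuming (as in the preparation sketch above) that the first 5 values of each keypoint prediction are the x co-ordinates and the last 5 are the y co-ordinates:

fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(14,10))
for k in range(9):
    axs[k//3, k%3].imshow(val_images[k].astype('uint8'))            # show the validation image
    axs[k//3, k%3].scatter(val_pred[0][k][:5], val_pred[0][k][5:])  # overlay predicted keypoints
plt.show()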

[Image: Output Result: Model 1]

It works really well with just 35 epochs and a fairly simple model architecture.

6.2 Implementation of the second model

The code for this one is as follows:

inp = Input(shape=(160,160,3))

#1st convolution pair
conv1 = Conv2D(16, kernel_size=(5,5), activation='relu')(inp)
mx1 = MaxPooling2D(pool_size=(2,2))(conv1)

#2nd convolution pair
conv2 = Conv2D(48, kernel_size=(3,3), activation='relu')(mx1)
mx2 = MaxPooling2D(pool_size=(2,2))(conv2)

#3rd convolution pair
conv3 = Conv2D(64, kernel_size=(3,3), activation='relu')(mx2)
mx3 = MaxPooling2D(pool_size=(2,2))(conv3)

#4th convolution pair
conv4 = Conv2D(64, kernel_size=(2,2), activation='relu')(mx3)

flt = Flatten()(conv4)
dense = Dense(100, activation='relu')(flt)

reg_op = Dense(10, activation='linear', name='key_point')(dense)
gndr_op = Dense(2, activation='sigmoid', name='gender')(dense)
smile_op = Dense(2, activation='sigmoid', name='smile')(dense)
glasses_op = Dense(2, activation='sigmoid', name='glasses')(dense)

model = Model(inp, [reg_op, gndr_op, smile_op, glasses_op])
model.summary()

Compiling the model.

loss_dic = {'key_point':'mse','gender':'binary_crossentropy','smile':'binary_crossentropy', 'glasses':'binary_crossentropy'}
loss_weights = {'key_point':2,'gender':1,'smile':4,'glasses':1}
metrics = {'key_point':'mse','gender':['binary_crossentropy','acc'],'smile':['binary_crossentropy','acc'], 'glasses':['binary_crossentropy','acc']}

model.compile(optimizer='adam', loss=loss_dic, loss_weights=loss_weights, metrics=metrics)

All set. Let’s train the network.

H = model.fit(train_images, [train_keypoint_op,]+train_categorical_ops[:-1], epochs = epochs, batch_size=bs, validation_data=(val_images,[val_keypoint_op,]+val_categorical_ops[:-1]),callbacks=[TrainValTensorBoard(log_dir='./log3',write_graph=False)])

Let’s evaluate the performance of the model.

train_pred = model.predict(train_images)
val_pred = model.predict(val_images)

print('MSE on train data: ', mean_squared_error(train_keypoint_op, train_pred[0]))
print('MSE on validation data: ', mean_squared_error(val_keypoint_op, val_pred[0]))

The above code snippet gives the following output:

MSE on train data:  2.9205250961752722
MSE on validation data:  35.072992153148434

Let’s visualize the result on the validation set.

[Image: Output Result: Model 2]

6.3 Implementation of the third model

The code for this one is as follows:

inp = Input(shape=(160,160,3))

#1st convolution pair
conv1 = Conv2D(16, kernel_size=(5,5), activation='relu')(inp)
mx1 = MaxPooling2D(pool_size=(2,2))(conv1)

#2nd convolution pair
conv2 = Conv2D(48, kernel_size=(3,3), activation='relu')(mx1)
mx2 = MaxPooling2D(pool_size=(2,2))(conv2)

#3rd convolution pair
conv3 = Conv2D(64, kernel_size=(3,3), activation='relu')(mx2)
mx3 = MaxPooling2D(pool_size=(2,2))(conv3)

#4th convolution pair
conv4 = Conv2D(64, kernel_size=(2,2), activation='relu')(mx3)

flt = Flatten()(conv4)
dense = Dense(100, activation='relu')(flt)

reg_op = Dense(10, activation='linear', name='key_point')(dense)
pose_op = Dense(5, activation='softmax', name='pose')(dense)

model = Model(inp, [reg_op, pose_op])
model.summary()

Compiling the model.

loss_dic = {'key_point':'mse','pose':'categorical_crossentropy'}
loss_weights = {'key_point':4,'pose':11}
metrics = {'key_point':'mse', 'pose':['categorical_crossentropy','acc']}

model.compile(optimizer='adam', loss=loss_dic, loss_weights=loss_weights, metrics=metrics)

All set. Let’s train the network.

H = model.fit(train_images, [train_keypoint_op,train_categorical_ops[-1]], epochs = epochs, batch_size=bs, validation_data=(val_images,[val_keypoint_op,val_categorical_ops[-1]]),callbacks=[TrainValTensorBoard(log_dir='./log4',write_graph=False)])

Let’s evaluate the performance of the model.

train_pred = model.predict(train_images)
val_pred = model.predict(val_images)

print('MSE on train data: ', mean_squared_error(train_keypoint_op, train_pred[0]))
print('MSE on validation data: ', mean_squared_error(val_keypoint_op, val_pred[0]))

The above code snippet gives the following output:

MSE on train data:  2.825882283863525
MSE on validation data:  31.41507419233826

Let’s visualize the result on the validation set.

[Image: Output Result: Model 3]

6.4 Implementation of the fourth model

The code for this one is as follows:

inp = Input(shape=(160,160,3))

#1st convolution pair
conv1 = Conv2D(16, kernel_size=(5,5), activation='relu')(inp)
mx1 = MaxPooling2D(pool_size=(2,2))(conv1)

#2nd convolution pair
conv2 = Conv2D(48, kernel_size=(3,3), activation='relu')(mx1)
mx2 = MaxPooling2D(pool_size=(2,2))(conv2)

#3rd convolution pair
conv3 = Conv2D(64, kernel_size=(3,3), activation='relu')(mx2)
mx3 = MaxPooling2D(pool_size=(2,2))(conv3)

#4th convolution pair
conv4 = Conv2D(64, kernel_size=(2,2), activation='relu')(mx3)

flt = Flatten()(conv4)
dense = Dense(100, activation='relu')(flt)

reg_op = Dense(10, activation='linear', name='key_point')(dense)

model = Model(inp, reg_op)
model.summary()

Compiling the model.

loss_dic = {'key_point':'mse'}
metrics = {'key_point':['mse','mae']}

model.compile(optimizer='adam', loss=loss_dic, metrics=metrics)

All set. Let’s train the network.

H = model.fit(train_images, train_keypoint_op, epochs = epochs, batch_size=bs, validation_data=(val_images,val_keypoint_op),callbacks=[TrainValTensorBoard(log_dir='./log5',write_graph=False)])

Let’s evaluate the performance of the model.

train_pred = model.predict(train_images)
val_pred = model.predict(val_images)

print('MSE on train data: ', mean_squared_error(train_keypoint_op, train_pred))
print('MSE on validation data: ', mean_squared_error(val_keypoint_op, val_pred))

The above code snippet gives the following output:

MSE on train data:  2.822843715225789
MSE on validation data:  30.50257287238015

Let’s visualize the result on the validation set.

[Image: Output Result: Model 4]

Conclusions

As per the above experiments, it can be concluded that multi-task learning with all the auxiliary tasks (model 1) is more effective than solving the alignment problem on its own; the keypoint MSEs are summarized below.
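Model                                        Train MSE   Validation MSE
1: Alignment + all 4 auxiliary tasks            2.06          29.55
2: Alignment + gender + smile + glasses         2.92          35.07
3: Alignment + pose                             2.83          31.42
4: Alignment only                               2.82          30.50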

Sometimes, solving a multi-task problem is more helpful than solving the main problem alone, but note: only add auxiliary tasks if the main problem actually depends on them.

The full source code for this project can be found here.

I hope you’ve enjoyed this article. If you have learned anything new from it, you can show your love by sharing it with others and by following me for more such articles. It takes a lot of time to write such a comprehensive blog post; I hope my hard work helps some of you understand the details of this case study so that you can implement it at your end as well.

And feel free to connect with me on LinkedIn , follow me on Twitter and Quora as well.

