Human Pose Detection using PyTorch Keypoint RCNN

In this tutorial, we will learn how to carry out human pose detection using PyTorch and the Keypoint RCNN neural network.

We will use a pre-trained PyTorch Keypoint RCNN model with a ResNet-50 backbone to detect keypoints on human bodies. The following image will make it much clearer what we will be doing in this article.

Human keypoint and pose estimation from the COCO dataset.
Figure 1. Human keypoint and pose estimation from the COCO dataset (Source).

The above image (figure 1) is from the COCO 2018 Keypoint Detection Task dataset. It is one of the largest publicly available datasets for keypoint detection.

But before diving further into the code, let’s gain some more knowledge about human keypoint detection.

What will we be learning in this article?

  • What is human body keypoint detection or pose detection?
  • Different models and systems that are available for human pose detection.
  • How to detect human pose in images and videos using PyTorch Keypoint RCNN.

Pose Estimation / Detection in Deep Learning and Computer Vision

Strictly speaking, pose detection or estimation is not a deep learning problem; it is a computer vision problem. But in recent years, deep learning based computer vision techniques have helped the research community achieve great results.

So, what is pose estimation of the human body?

Human pose detection is detecting the important keypoints that can describe the orientation or movement of a person. You can also relate it to facial keypoint detection where we detect the interesting parts of a face and mark them. We can also do this in real-time.

Similarly, we can also detect the keypoints in a human body which describe the movement of the human body. This is known as human pose detection.

For example, take a look at the following image.

Pose estimation and skeletal structure example when a person is walking.
Figure 2. Pose estimation and skeletal structure example when a person is walking. We can use deep learning and computer vision to get such results.

Figure 2 shows 17 keypoints of a human body while the person is walking. You can see that each of the major joints below the face is numbered. Similarly, the important points on the head are also numbered.

I hope that the above image gives you a good idea of what we are trying to achieve here.

PyTorch Keypoint RCNN for Human Pose Detection

In this section, we will learn a bit more about the Keypoint RCNN deep learning model for pose detection.

But before that, let’s have a brief look at the different deep learning methods available for human pose detection. Although we will not go into the details, you may wish to explore these methods on your own.

OpenPose

OpenPose is one of the most famous human keypoint and pose detection systems.

It is a real-time multi-person keypoint detection library for body, face, hands, and foot estimation. In total, it can detect 135 keypoints on a human body.

The above are only some of the features. There are many more, and I recommend that you take a look at their GitHub repository, which is quite impressive.

Also, if you wish, you can also read the OpenPose paper. It is authored by Gines Hidalgo, Zhe Cao, Tomas Simon, Shih-En Wei, Hanbyul Joo, and Yaser Sheikh.

AlphaPose

AlphaPose is another real-time multi-person human pose detection system.

The GitHub repository is also quite actively maintained. Do take a look at it to get some more details.

The original paper was published under the name RMPE: Regional Multi-person Pose Estimation by Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. The updated and improved version has been renamed AlphaPose.

DeepCut

DeepCut is another such multi-person keypoint detection system. To fully do justice to the method, I am quoting the authors here.

We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other.

DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation

You can visit the official website to get the complete details. There, you will also find the link to their GitHub repository and pre-trained models as well.

Keypoint RCNN

Now, coming to the deep learning model and technique that we will use in this tutorial.

We will use one of the PyTorch pre-trained models for human pose and keypoint detection. It is the Keypoint RCNN deep learning model with a ResNet-50 base architecture.

This model has been pre-trained on the COCO Keypoint dataset. It outputs the keypoints for 17 human parts and body joints. They are: ‘nose’, ‘left_eye’, ‘right_eye’, ‘left_ear’, ‘right_ear’, ‘left_shoulder’, ‘right_shoulder’, ‘left_elbow’, ‘right_elbow’, ‘left_wrist’, ‘right_wrist’, ‘left_hip’, ‘right_hip’, ‘left_knee’, ‘right_knee’, ‘left_ankle’, ‘right_ankle’.

Do you remember the keypoints and joints that you saw in figure 2? The following is the complete image with the keypoints and pose shown on the human body.

Pose estimation and keypoint detection using PyTorch Keypoint RCNN neural network.
Figure 3. Example of pose estimation and keypoint detection using PyTorch Keypoint RCNN neural network.

Figure 3 shows the boy along with all the keypoints and the body pose. It is quite impressive what deep learning models are capable of achieving when trained on the right dataset.

Input and Output Format of Keypoint RCNN

The Keypoint RCNN model takes an image tensor as input during inference. It is of the format [batch_size x num_channels x height x width]. For inference, the batch size is mostly going to be one.

For the output, the model returns a list of dictionaries, which in turn contain the resulting tensors. The fields of the dictionary are as follows:

  • Boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H.
  • Labels (Int64Tensor[N]): the predicted labels for each detection.
  • Scores (Tensor[N]): the scores of each prediction.
  • Keypoints (FloatTensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.

And for figure 3, the actual output is the following.

[{'boxes': tensor([[617.7941, 152.0900, 943.1877, 775.0088]], device='cuda:0'), 'labels': tensor([1], device='cuda:0'), 'scores': tensor([0.9999], device='cuda:0'), 'keypoints': tensor([[[785.8252, 237.9547,   1.0000],
         [803.9619, 221.9550,   1.0000],
         [771.9560, 223.0217,   1.0000],
         [834.9009, 221.9550,   1.0000],
         [750.6187, 223.0217,   1.0000],
         [865.8401, 284.8869,   1.0000],
         [732.4820, 294.4867,   1.0000],
         [925.5844, 354.2186,   1.0000],
         [694.0748, 373.4182,   1.0000],
         [904.2471, 381.9513,   1.0000],
         [653.5340, 424.6169,   1.0000],
         [855.1714, 451.2830,   1.0000],
         [773.0228, 441.6832,   1.0000],
         [841.3022, 585.6799,   1.0000],
         [774.0897, 508.8817,   1.0000],
         [839.1684, 717.9435,   1.0000],
         [752.7524, 649.6784,   1.0000]]], device='cuda:0'), 'keypoints_scores': tensor([[16.1233, 17.6176, 16.5380, 16.0087, 14.1357, 10.5159,  9.5042, 10.7226,
         11.4684, 15.8394, 11.3504, 10.8490, 11.1282, 11.4942, 14.9927, 10.8179,
         10.6713]], device='cuda:0')}]

So, it outputs the bounding boxes, the labels, the scores, and the keypoints. All in all, it is a full-fledged deep learning object detection model with human pose detection capabilities on top.
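
To make the format concrete, the following is a minimal sketch, separate from the project scripts, that loads the pre-trained model, runs it on a dummy image tensor, and prints the shape of each output field. The dummy tensor and its size are placeholders for illustration only.

import torch
import torchvision

# minimal sketch: load the pre-trained Keypoint RCNN model in eval mode
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
model.eval()

# a dummy image tensor in [batch_size x num_channels x height x width] format
image = torch.rand(1, 3, 480, 640)

with torch.no_grad():
    outputs = model(image)

# `outputs` is a list with one dictionary per input image
print(outputs[0]['boxes'].shape)      # [N, 4] boxes in [x1, y1, x2, y2] format
print(outputs[0]['labels'].shape)     # [N] predicted labels for the detections
print(outputs[0]['scores'].shape)     # [N] confidence scores
print(outputs[0]['keypoints'].shape)  # [N, 17, 3] keypoints in [x, y, v] format

For a random input like this, N may well be zero, but the fields and their shapes follow the format described above.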

This is all the detail we need about the Keypoint RCNN model for now. We will get to know the rest of the details while coding.

Project Structure and PyTorch Version

By now you know that we will be using PyTorch in this tutorial. PyTorch already provides a pre-trained Keypoint RCNN model with ResNet50 base which has been trained on the COCO keypoint dataset.

To follow along smoothly, I recommend that you install the latest version of PyTorch (PyTorch 1.6 at the time of writing this). This will ensure that you can download all the pre-trained models without any hassle.

Now, coming to the project directory structure. We will follow the following structure.

├───input
│       image1.jpg
│       ...
│       video1.mp4
│       video2.mp4
│
├───outputs
└───src
    │   keypoint_rcnn_images.py
    │   keypoint_rcnn_videos.py
    │   utils.py

We have three folders.

  • The input folder contains the images and videos that we will use for inference.
  • The outputs folder will contain all the output images and videos that we will obtain after running them through the Keypoint RCNN model.
  • And the src folder contains three Python scripts. We will write the code for each of them shortly.

You can either use your own images and videos for inference and keypoint detection. Or, if you want to use the same data as in this tutorial, you can download the input files from the original article.

After downloading, you can unzip the file and keep them in the same project directory. All these images and videos are taken from Pixabay.

Human Pose and Keypoint Detection using PyTorch and Keypoint RCNN

From this section onward, we will get into the coding part of this tutorial. We have three Python scripts and we will fill them with code as per our requirement.

Writing Utility Code for Human Pose Detection

First, we will write some helper/utility code that will help us detect and draw human poses efficiently. This utility code is mostly repetitive code that we can re-use many times.

All of this code will go into the utils.py Python file inside the src folder.

Let’s start with the imports.

utils.py
import cv2
import matplotlib

We need only two libraries here. One is the OpenCV module to plot the different edges joining the keypoints. The other one is the Matplotlib library to get different sets of colors for different edges.

Define the Different Edges that We Will Join

Remember that Keypoint RCNN outputs 17 keypoints for each person. To get the skeletal lines that we have seen above, we need to define the pairs of keypoints that we want to join. Let’s define those now.

utils.py
# pairs of edges for 17 of the keypoints detected ...
# ... these show which points to be connected to which point ...
# ... we can omit any of the connecting points if we want, basically ...
# ... we can easily connect less than or equal to 17 pairs of points ...
# ... for keypoint RCNN, not  mandatory to join all 17 keypoint pairs
edges = [
    (0, 1), (0, 2), (2, 4), (1, 3), (6, 8), (8, 10),
    (5, 7), (7, 9), (5, 11), (11, 13), (13, 15), (6, 12),
    (12, 14), (14, 16), (5, 6)
]

I have provided detailed documentation in the form of comments in the above code snippet.

So, how will we use the edges list that is holding the tuples?
  • Suppose that we have the 17 keypoints that Keypoint RCNN outputs. Those are indexed from 0 to 16.
  • Now, each of the tuple pairs in the edges list contains the two points that we will connect. This means that we will connect point 0 with 1, point 0 with 2, point 2 with 4, and so on.
  • Also, we need not connect every keypoint with another keypoint. We can choose which points to connect to get the skeletal shape that we want (see the short sketch below).
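
To make this concrete, here is a tiny sketch, hypothetical and not part of utils.py, that prints which body parts each edge pair connects. It assumes the keypoint ordering listed earlier (index 0 is the nose, index 5 is the left shoulder, and so on) and the edges list defined above.

# keypoint names in the order that Keypoint RCNN outputs them (indices 0 to 16)
keypoint_names = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle'
]
# print which two joints each edge pair joins, e.g. (0, 1) -> nose to left_eye
for a, b in edges:
    print(f"({a}, {b}): {keypoint_names[a]} -> {keypoint_names[b]}")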

Code to Join the Keypoints and Draw the Skeletal Lines

We have the pairs of edges now. But how do we connect those keypoints to get the skeletal frame? We will write a very simple Python function for that.

The following block of code does that.

utils.py
def draw_keypoints(outputs, image):
    # the `outputs` is a list which in turn contains the dictionaries
    for i in range(len(outputs[0]['keypoints'])):
        keypoints = outputs[0]['keypoints'][i].cpu().detach().numpy()

        # proceed to draw the lines if the confidence score is above 0.9
        if outputs[0]['scores'][i] > 0.9:
            keypoints = keypoints[:, :].reshape(-1, 3)
            for p in range(keypoints.shape[0]):
                # draw the keypoints
                cv2.circle(image, (int(keypoints[p, 0]), int(keypoints[p, 1])), 
                            3, (0, 0, 255), thickness=-1, lineType=cv2.FILLED)

                # uncomment the following lines if you want to put keypoint number
                # cv2.putText(image, f"{p}", (int(keypoints[p, 0]+10), int(keypoints[p, 1]-5)),
                #             cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)

            for ie, e in enumerate(edges):
                # get different colors for the edges
                rgb = matplotlib.colors.hsv_to_rgb([
                    ie/float(len(edges)), 1.0, 1.0
                ])
                rgb = rgb*255
                # join the keypoint pairs to draw the skeletal structure
                cv2.line(image, (int(keypoints[e, 0][0]), int(keypoints[e, 1][0])),
                        (int(keypoints[e, 0][1]), int(keypoints[e, 1][1])),
                        tuple(rgb), 2, lineType=cv2.LINE_AA)
        else:
            continue

    return image
  • We are defining a draw_keypoints() function which accepts the outputs list and the NumPy image as input parameters.
  • We loop over the keypoints in the list from line 15.
  • Then we detach the keypoints detected for each person from the GPU and convert them to a NumPy array. This happens at line 16.
  • Starting from line 19, we only proceed if the confidence score of the detection is greater than 0.9. Anything less than 0.9 tends to give a lot of false positives and error-prone results.
  • Then we reshape the keypoints into 17 rows. Starting from line 21, we loop over each row and draw the keypoints on the image. Also, if you want to draw the keypoint numbers, then you can uncomment lines 27 and 28.
  • Now, coming to line 30. Here, we loop over each of the edge pairs defined above.
  • At line 32, we first get a different color for each of the lines that we will draw, using the Matplotlib library.
  • Then, starting from line 37, we join each of the edge pairs to draw the skeletal structure over the image.
  • Finally, we return the resulting image.

Actually, most of the heavy lifting is done. Drawing the keypoints and skeletal structure is one of the most important tasks in human pose detection. Now, most of the remaining work will be done by the pre-trained Keypoint RCNN model.

Human Pose Detection in Images

We are all set to write the code for human pose detection using deep learning in images. We will start with images and then move over to videos as well.

The code in this section will go into the keypoint_rcnn_images.py Python script.

We will start with importing all the modules and libraries that we will need.

keypoint_rcnn_images.py
import torch
import torchvision
import numpy as np
import cv2
import argparse
import utils

from PIL import Image
from torchvision.transforms import transforms as transforms

We are importing the utils script at line 6, which we have just written. Also, we need the PIL library to read the image so that the image pixel values are in the proper range for the PyTorch pre-trained model.

Next, let’s define the argument parser to parse the command line arguments.

keypoint_rcnn_images.py
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', required=True, 
                    help='path to the input data')
args = vars(parser.parse_args())

We will provide just one command line argument: the path to the input image.

The next block of code defines the image transforms.

keypoint_rcnn_images.py
# transform to convert the image to tensor
transform = transforms.Compose([
    transforms.ToTensor()
])

The above will transform the PIL image into a PyTorch tensor.

Initialize the Model and Set the Computation Device

We will use the torchvision module from PyTorch to initialize the Keypoint RCNN model.

keypoint_rcnn_images.py
# initialize the model
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True,
                                                               num_keypoints=17)
# set the computation device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# load the model onto the computation device and set to eval mode
model.to(device).eval()

For the keypointrcnn_resnet50_fpn model, we need to provide two arguments. The first one is pretrained=True, so that it returns a pre-trained model. The second one is num_keypoints=17, which instructs the model to detect 17 keypoints on the human body. Here, we are also defining the computation device, loading the Keypoint RCNN model onto it, and setting the model to eval() mode.

Read and Prepare the Image

The following block of code reads the image and prepares it to be fed into the Keypoint RCNN neural network model.

keypoint_rcnn_images.py
image_path = args['input']
image = Image.open(image_path).convert('RGB')
# NumPy copy of the image for OpenCV functions
orig_numpy = np.array(image, dtype=np.float32)
# convert the NumPy image to OpenCV BGR format
orig_numpy = cv2.cvtColor(orig_numpy, cv2.COLOR_RGB2BGR) / 255.
# transform the image
image = transform(image)
# add a batch dimension
image = image.unsqueeze(0).to(device)

At line 28, we are keeping a copy of the image by converting it into a NumPy array so that we can apply different OpenCV functions to it. Then we are also converting it into the OpenCV BGR color format.

Line 32 transforms the image and line 34 adds a batch dimension.

Predict the Keypoints and Visualize the Result

We need to perform a forward pass so that the Keypoint RCNN model can detect the keypoints on the image.

keypoint_rcnn_images.py
with torch.no_grad():
    outputs = model(image)

output_image = utils.draw_keypoints(outputs, orig_numpy)

# visualize the image
cv2.imshow('Keypoint image', output_image)
cv2.waitKey(0)

We do not need to calculate the gradients during inference; therefore, we keep the forward pass within the with torch.no_grad() block.

At line 38, we are calling the draw_keypoints() function from the utils script to draw the keypoints and skeletal structure. Then we are visualizing the image.

Finally, we just need to save our results to disk.

keypoint_rcnn_images.py
# set the save path
save_path = f"../outputs/{args['input'].split('/')[-1].split('.')[0]}.jpg"
cv2.imwrite(save_path, output_image*255.)

We are just setting a save_path using the input path and saving the result to the outputs folder.

This is all the code we need to detect human pose and keypoints using deep learning on images.

Executing keypoint_rcnn_images.py Script to Detect Human Pose in Images

We are all set to execute the keypoint_rcnn_images.py script and take a look at how the Keypoint RCNN model is performing.

We will use the images from the input folder that I have provided above. Let’s start with the first image.

Open up your command line/terminal and cd into the src folder of the project directory. Then type the following command.

python keypoint_rcnn_images.py --input ../input/image1.jpg

You should get the following output.

Human pose estimation using PyTorch, deep learning and Keypoint RCNN neural network model
Figure 4. Human pose estimation in an image using PyTorch, deep learning, and Keypoint RCNN neural network model. We can see that the neural network is detecting all the keypoints accurately.

The PyTorch Keypoint RCNN model is working perfectly in this case. It is correctly detecting all 17 keypoints, and joining the desired keypoint pairs gives a good skeletal structure. Looks like deep learning has made keypoint detection in humans really effective.

Now, let’s throw a bit more challenging image to the model for keypoint detection.

python keypoint_rcnn_images.py --input ../input/image2.jpg
Human pose estimation using PyTorch, deep learning and Keypoint RCNN neural network model.
Figure 5. The Keypoint RCNN model is detecting the keypoints of one of the legs and one of the arms wrongly. It looks like the results suffer when a person is making a complex movement like dancing.

In figure 5, we can see that the model is detecting the keypoints for one of the legs and one of the hands wrongly. It is detecting a keypoint for one of the legs where there should not be any, and it is detecting keypoints for the legs as keypoints for the hands. Looks like our dancing man photo was a bit too much for the Keypoint RCNN model.

What about multi-person detection? Let’s throw a final image challenge at our model.

python keypoint_rcnn_images.py --input ../input/image3.jpg
Human pose estimation using PyTorch, deep learning and Keypoint RCNN neural network model
Figure 6. The PyTorch Keypoint RCNN neural network model can also detect poses and keypoints for multiple people quite accurately. The results are only bad where the lighting is bad, like the leftmost corner.

Frankly, the results are better than I expected. It is detecting all the keypoints correctly, except for the boy’s right arm at the extreme left corner. It looks like, because of the low lighting, the Keypoint RCNN neural network finds it difficult to correctly regress the final keypoint of the right arm. Still, it is much better than expected.

Our Keypoint RCNN deep neural network is performing well on images, but what about videos? This is what we are going to find out in the next section, where we will write the code to detect keypoints and human pose in videos using Keypoint RCNN deep neural network.

Human Pose Detection in Videos using PyTorch and Keypoint RCNN

In this section, we will write the code to detect keypoints and human pose in videos using PyTorch and Keypoint RCNN neural network.

It is going to be very similar to what we did for images. For videos, we just need to treat each individual frame as an image and our work is mostly done. The rest of the work will be done by our utils.py script.

All of this code will go into the keypoint_rcnn_videos.py Python script.

As usual, we will start with the imports.

keypoint_rcnn_videos.py
import torch
import torchvision
import cv2
import argparse
import utils
import time

from PIL import Image
from torchvision.transforms import transforms as transforms

In the above block, we are also importing the time module to keep track of the time and calculate the average FPS (Frames Per Second) which we will do later.

The next block of code does some preliminary work. This includes defining the argument parser, the transforms, and preparing the Keypoint RCNN model. It is going to be the same as we did in the case of images.

keypoint_rcnn_videos.py
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', required=True, 
                    help='path to the input data')
args = vars(parser.parse_args())

# transform to convert the image to tensor
transform = transforms.Compose([
    transforms.ToTensor()
])

# initialize the model
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True,
                                                               num_keypoints=17)
# set the computation device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# load the model onto the computation device and set to eval mode
model.to(device).eval()

Setting Up OpenCV for Video Capture and Saving

We will use OpenCV VideoCapture() to capture the videos.

keypoint_rcnn_videos.py
cap = cv2.VideoCapture(args['input'])
if (cap.isOpened() == False):
    print('Error while trying to read video. Please check path again')
# get the video frames' width and height
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

# set the save path
save_path = f"../outputs/{args['input'].split('/')[-1].split('.')[0]}.mp4"
# define codec and create VideoWriter object 
out = cv2.VideoWriter(save_path, 
                      cv2.VideoWriter_fourcc(*'mp4v'), 20, 
                      (frame_width, frame_height))
frame_count = 0 # to count total frames
total_fps = 0 # to get the final frames per second

After capturing the video, we are getting the video frames’ width and height at lines 31 and 32.

At line 35, we are defining the save_path for the output video file. The resulting video with the keypoints will be saved under this name in the outputs folder.

Then we are using cv2.VideoWriter() to define the codec and the save format of the video file. We will save the resulting video files in .mp4 format.

Finally, at lines 40 and 41, we are defining frame_count and total_fps to keep track of the total number of video frames and the total Frames Per Second as well.

Detecting Keypoints and Pose in Each Video Frame

To detect pose in human bodies and predict the keypoints, we need to treat each individual frame in a video as one image. We can easily do that by iterating over all the video frames using a while loop.

The next block of code does just that. Let’s write the code for that and then get into the explanation part.

keypoint_rcnn_videos.py
# read until end of video
while(cap.isOpened()):
    # capture each frame of the video
    ret, frame = cap.read()
    if ret == True:
        
        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        orig_frame = frame

        # transform the image
        image = transform(pil_image)
        # add a batch dimension
        image = image.unsqueeze(0).to(device)

        # get the start time
        start_time = time.time()

        with torch.no_grad():
            outputs = model(image)

        # get the end time
        end_time = time.time()

        output_image = utils.draw_keypoints(outputs, orig_frame)

        # get the fps
        fps = 1 / (end_time - start_time)
        # add fps to total fps
        total_fps += fps
        # increment frame count
        frame_count += 1

        wait_time = max(1, int(fps/4))

        cv2.imshow('Pose detection frame', output_image)
        out.write(output_image)
        # press `q` to exit
        if cv2.waitKey(wait_time) & 0xFF == ord('q'):
            break

    else:
        break
  • We are iterating over the frames as long as they are present. Else, we break out of the loop.
  • If a frame is present, then we convert the frame from the OpenCV format to the PIL image format at line 48 and convert it into the RGB color format. We also keep a copy of the original frame at line 49.
  • Lines 52 and 54 transform the frame and add a batch dimension to it respectively.
  • At line 57, we define start_time just before we feed our frame to the Keypoint RCNN model.
  • After the detections happen at line 60, we define end_time so that we know the time taken for each forward pass.
  • Line 65 calls the draw_keypoints() function of the utils script to draw the keypoints and skeletal structure.
  • Line 68 calculates the FPS for the current frame, line 70 adds it to the total FPS, and line 72 increments the frame count.
  • We show the image on the screen at line 76 and write it to disk at line 77.
  • Finally, when there are no more frames present, we break out of the loop.

There are just a few more lines of code. We need to release the VideoCapture() object, destroy all cv2 windows, and calculate the average FPS.

keypoint_rcnn_videos.py
# release VideoCapture()
cap.release()
# close all frames and video windows
cv2.destroyAllWindows()
# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

That is all the code we need for human pose detection in videos using Keypoint RCNN and PyTorch. Now, we can move forward and execute the script to see how well the neural network performs.

Executing keypoint_rcnn_videos.py for Human Pose Detection on Videos

We have two videos in our input folder. Let’s begin with the first video.

python keypoint_rcnn_videos.py --input ../input/video1.mp4

The following is the final video that is saved to disk. You should also get a result similar to this.

Clip 1. Human pose detection with the Keypoint RCNN neural network using PyTorch. We can see that the neural network is doing a good job of detecting the keypoints.

In clip 1, we can see that the neural network is predicting the keypoints quite accurately for the most part. Even when the person is lifting his legs and moving his hands, the keypoints are correct. But the detections go wrong when the person is too near to the camera at the end of the clip. This is most probably because the neural network is not getting enough information to predict the keypoints correctly. It is not able to see the whole body in those frames.

Also, I got around 4.8 FPS on average on my GTX 1060. It is not real-time, at least not on a mid-range GPU. Yours may vary according to your hardware.

Now, let’s move on to detecting keypoints on the second video.

python keypoint_rcnn_videos.py --input ../input/video2.mp4

The following is the output clip.

Clip 2. We can also carry out multi-person human pose estimation in videos using Keypoint RCNN.

We can clearly see that Keypoint RCNN can easily detect poses and keypoints for multiple people at the same time. There are some wrong detections for sure, but they are for the people who are far back and appear small. For the people who are closer to the camera, the keypoint detections are quite accurate.

I got around 3.6 FPS on average for this clip. The reduction in FPS is most likely due to the larger number of detections.

Further Improvements

There are many ways in which you can improve upon this project. You can try:

  • Using a bigger and better network as the backbone, like a pre-trained ResNet101 model.
  • You can also try running inference over the video frames in batches instead of a single frame at a time. This may also help to increase the FPS (a rough sketch follows below).
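
For the second point, here is a rough, hypothetical sketch of batched inference over video frames. It reuses the cap, transform, model, device, out, and utils objects from keypoint_rcnn_videos.py, and BATCH_SIZE is an arbitrary value; whether it actually improves the FPS would need to be measured.

# hypothetical batched-inference loop for keypoint_rcnn_videos.py (not benchmarked)
BATCH_SIZE = 4
frames, tensors = [], []
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(frame)
    # convert the BGR frame to RGB, then to a tensor, as in the single-frame version
    tensors.append(transform(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))))
    if len(tensors) == BATCH_SIZE:
        with torch.no_grad():
            # the torchvision detection models accept a list of 3-channel image tensors
            batch_outputs = model([t.to(device) for t in tensors])
        for f, output in zip(frames, batch_outputs):
            # draw_keypoints() expects the list-of-dictionaries format, so wrap each dict
            result = utils.draw_keypoints([output], f)
            out.write(result)
        frames, tensors = [], []
# note: any leftover frames (fewer than BATCH_SIZE) would still need one final pass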

Summary and Conclusion

In this tutorial, you got to learn about human pose detection using deep learning. We used PyTorch Keypoint RCNN neural network model to detect keypoints and human pose in images and videos. I hope that you learned new things regarding deep learning in this article.

If you have any doubts, thoughts, or suggestions, then please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.

