Monocular 3D Object Detection in Autonomous driving

source link: https://mc.ai/monocular-3d-object-detection-in-autonomous-driving/

Lifting 2D to 3D? Hard but tractable

3D object detection from a 2D image is a challenging task. It is fundamentally ill-posed, as the critical depth dimension is collapsed during the formation of the 2D image (for more background, see my previous post on lifting 2D bbox to 3D). However, under specific conditions and with strong prior information, the task is still tractable. In autonomous driving in particular, most objects of interest, such as vehicles, are rigid objects with well-known geometry, so 3D vehicle information can be recovered from monocular images.

1. Representation transformation (BEV, pseudo-lidar)

On prototype self-driving cars, cameras are usually mounted on the roof, or behind the rear-view mirror like a normal dash-cam. Camera images therefore show perspective views of the world. This view is easy for human drivers to understand, as it resembles what we see while driving, but it poses two challenges for computer vision: occlusion and scale variation with distance.

One way to alleviate this is to convert perspective images to a bird's-eye view (BEV). In BEV, cars have the same size regardless of their distance to the ego vehicle, and different vehicles do not overlap (given the reasonable assumption that no car is on top of another in the 3D world under normal driving conditions). Inverse perspective mapping (IPM) is a commonly used technique to generate BEV images, but it assumes that all pixels lie on the ground plane and that accurate extrinsic (and intrinsic) camera parameters are known. In practice, the extrinsics change constantly as the vehicle pitches and rolls, so they need to be calibrated online to be accurate enough for IPM.

Convert perspective image to BEV (from BEV-IPM)

This is what BEV IPM OD (IV 2019) does. It uses IMU data for online calibration of the extrinsics to obtain more accurate IPM images, and then performs object detection on them. Their YouTube demo can be seen here.
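To make the geometry concrete, here is a minimal sketch of the flat-ground homography behind IPM, assuming known intrinsics K, a rotation R and length-3 translation t from the ground frame to the camera frame, and a perfectly flat road. The function name, BEV resolution, and ego-centered BEV convention are illustrative choices of mine, not the paper's implementation.

```python
# Minimal IPM sketch. All conventions here are illustrative assumptions,
# not the implementation used by BEV-IPM OD.
import cv2
import numpy as np

def ipm_warp(image, K, R, t, bev_size=(400, 600), meters_per_pixel=0.05):
    """Warp a perspective image onto a bird's-eye view of the ground plane."""
    # A ground point (x, y, 0) in ground coordinates projects to the image as
    # K @ (R @ [x, y, 0] + t) = K @ [r1 r2 t] @ [x, y, 1]^T, i.e. a homography.
    H_ground_to_img = K @ np.column_stack((R[:, 0], R[:, 1], t))

    # Map a BEV pixel (u, v) to metric ground coordinates; here the ego vehicle
    # sits at the bottom-center of the BEV image and forward distance grows
    # upward (an illustrative convention).
    bev_w, bev_h = bev_size
    S = np.array([[meters_per_pixel, 0.0, -meters_per_pixel * bev_w / 2.0],
                  [0.0, -meters_per_pixel, meters_per_pixel * bev_h],
                  [0.0, 0.0, 1.0]])

    # warpPerspective expects the source->destination homography, so invert
    # the BEV->image mapping to obtain image->BEV.
    H_bev_to_img = H_ground_to_img @ S
    return cv2.warpPerspective(image, np.linalg.inv(H_bev_to_img), (bev_w, bev_h))
```

With a BEV image like this in hand, a standard 2D detector can be run on it, and the IMU-based online calibration keeps R and t current as the car pitches and rolls.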

Orthographic Feature Transform (OFT) (BMVC 2019) is another way to lift perspective images to BEV, but via a deep learning framework. The idea is to use an orthographic feature transform to map perspective image-based features into an orthographic bird's-eye view. A ResNet-18 is used to extract perspective image features. Then voxel-based features are generated by accumulating the image-based features over each voxel's projected image area. (This process reminds me of back-projection in CT image reconstruction.) The voxel features are then collapsed along the vertical dimension to yield orthographic ground-plane features. Finally, another ResNet-like top-down network is used to reason over and refine the BEV map.

Architecture of Orthographic Feature Transform (source)
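To illustrate the accumulation step, here is a simplified PyTorch sketch assuming a known intrinsic matrix K and a precomputed voxel grid given in camera coordinates. The integral-image pooling follows the spirit of the paper, but the mean-pooling collapse along the height axis and all names are my simplifications, not the authors' code.

```python
# Simplified sketch of the orthographic feature transform step.
# Shapes, names, and the mean-pooling collapse are illustrative assumptions.
import torch

def oft_pool(image_feats, voxel_corners, K):
    """Accumulate image features over each voxel's projected 2D box.

    image_feats:   (C, H, W) feature map from the image backbone (e.g. ResNet-18).
    voxel_corners: (X, Y, Z, 8, 3) voxel corners in camera coordinates.
    Returns:       (C, X, Z) BEV features after collapsing the height axis Y.
    """
    C, H, W = image_feats.shape

    # Integral image over the feature map for O(1) box sums.
    integral = image_feats.cumsum(dim=1).cumsum(dim=2)           # (C, H, W)
    integral = torch.nn.functional.pad(integral, (1, 0, 1, 0))   # zero row/col at top-left

    # Project voxel corners into the image: u = fx * x / z + cx, v = fy * y / z + cy.
    x, y, z = voxel_corners.unbind(dim=-1)
    z = z.clamp(min=1e-3)
    u = (K[0, 0] * x / z + K[0, 2]).clamp(0, W - 1)
    v = (K[1, 1] * y / z + K[1, 2]).clamp(0, H - 1)

    # Axis-aligned 2D box covering the projected voxel.
    u0, u1 = u.min(dim=-1).values.long(), u.max(dim=-1).values.long() + 1
    v0, v1 = v.min(dim=-1).values.long(), v.max(dim=-1).values.long() + 1

    # Box sum via the integral image, averaged over the box area.
    box_sum = (integral[:, v1, u1] - integral[:, v0, u1]
               - integral[:, v1, u0] + integral[:, v0, u0])      # (C, X, Y, Z)
    area = ((u1 - u0) * (v1 - v0)).clamp(min=1).float()
    voxel_feats = box_sum / area

    # Collapse the vertical axis to get orthographic ground-plane (BEV) features.
    return voxel_feats.mean(dim=2)                               # (C, X, Z)
```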

The idea of OFT is really simple and interesting, and it works relatively well. That said, the back-projection step could be improved by using heuristics to better initialize the voxel-based features, rather than naively back-projecting; for example, the image features inside a very large bbox cannot correspond to a very distant object. Another issue I have with this method is the assumption of accurate extrinsics, which may not be available online.

Another way to transform perspective images to BEV is BirdGAN (IROS 2019), which uses a GAN to perform image-to-image translation. The paper achieved great results, but as the authors admit, the translation to BEV space only performs well for objects within a frontal distance of 10 to 15 meters, and is thus of limited use.

BirdGAN translates perspective images to BEV (source)

Then enters a line of work built on the idea of pseudo-lidar. The idea is to generate a point cloud from the depth estimated from the image, thanks to recent advances in monocular depth estimation (itself a hot topic in autonomous driving, which I shall review in the future). Previous efforts using RGBD images largely treat depth as a fourth channel and apply normal networks to this input, with minimal changes to the first layer. Multi-Level Fusion (MLF, CVPR 2018) is one of the first to propose lifting the estimated depth information to 3D. It uses the estimated depth (predicted with fixed pretrained weights from MonoDepth) to project each pixel of the RGB image into 3D space, and this generated point cloud is then fused with image features to regress 3D bounding boxes.
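The lifting operation at the heart of pseudo-lidar is just the inverse of the pinhole projection. Below is a minimal numpy sketch assuming a predicted per-pixel depth map in meters and known intrinsics K; the function name and conventions are illustrative, not any particular paper's code.

```python
# Minimal pseudo-lidar sketch: back-project each pixel to a 3D point using an
# estimated depth map. Names and frame conventions are illustrative assumptions.
import numpy as np

def depth_to_point_cloud(depth, K):
    """Back-project an (H, W) depth map in meters to an (H*W, 3) camera-frame point cloud."""
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pixel grid: u runs along image width, v along image height.
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Invert the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack((x, y, z), axis=-1).reshape(-1, 3)
```

The resulting point cloud can then be fused with image features, as MLF does, or treated like a lidar sweep by downstream detectors.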

