How to perform 3D object detection on LiDAR point clouds
This project will leverage the PYNQ-DPU overlay on the KV260, enabling us to do 3D object detection on LiDAR point clouds more efficiently than ever before!
Background
Environmental perception plays an integral role in building self-driving cars, autonomous navigating robots, and other real-world applications.
Why 3D Object Detection on Point Clouds?
While deep learning-based 2D object detection from camera data achieves high accuracy, camera data alone is not well suited to tasks such as localization, measuring distances between objects, and estimating depth.
The point cloud generated by a LiDAR sensor provides 3D information about objects, which makes it possible to locate them more accurately and characterize their shapes. As a result, 3D object detection on point clouds is emerging in a variety of applications, especially autonomous driving.
Nevertheless, designing LiDAR-based 3D object detection systems is challenging. First, such systems require a lot of computation in model inference. Second, since point cloud data is irregular, the processing pipeline requires pre-processing and post-processing to provide end-to-end perception results.
The KV260 is a perfect match for 3D object detection systems. The expensive computation of model inference can be offloaded to and accelerated by the programmable logic portion of the KV260, while its powerful Arm cores handle the pre-processing and post-processing tasks.
Design Overview
We now discuss the selected deep learning models for 3D object detection on point clouds and a system overview including software and hardware.
Network Architecture
After examining existing work, we chose the ResNet-based Keypoint Feature Pyramid Network (KFPN), the first real-time system for monocular 3D detection with state-of-the-art performance on the KITTI benchmark. In particular, we adopted its open-source PyTorch implementation on point clouds, called SFA3D.
PYNQ-DPU on KV260
The reason we use Ubuntu Desktop 20.04.3 LTS for Xilinx development boards instead of Petalinux as the operating system on the KV260 is that Ubuntu provides a convenient development environment for installing the packages needed to pre-process point clouds and post-process results. In addition, the KV260's support for PYNQ and the DPU overlay saves us from designing an efficient DPU from scratch and lets us work in a Python environment. This greatly simplifies migrating CPU/GPU-based deep learning implementations to the KV260.
Setting up the environment
Follow the official guide to install the Ubuntu image on the KV260, then refer to the GitHub repository to install PYNQ in the Ubuntu OS. Clone all the required files and install the required packages on the board by executing the following commands.
git clone https://github.com/SoldierChen/DPU-Accelerated-3D-Object-Detection-on-Point-Clouds.git
cd DPU-Accelerated-3D-Object-Detection-on-Point-Clouds
pip install -r requirements.txt
Note that we need PyTorch 1.4 here, because the VART of the PYNQ DPU is v1.4.
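If pip resolves a newer PyTorch by default, one way to pin the matching versions is shown below (torchvision 0.5.0 is the release paired with PyTorch 1.4.0; defer to the repository's requirements.txt if it pins different versions):
pip install torch==1.4.0 torchvision==0.5.0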
Data Preparation
The data to download comes from the KITTI 3D object detection dataset and includes the following (a typical directory layout is sketched after the list):
Velodyne Point Cloud (29 GB)
Training labels for object dataset (5 MB)
Camera calibration matrices for the object dataset (16 MB)
Left color image of the object dataset (12 GB) (for visualization purposes only)
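How the downloaded files must be arranged depends on the repository's dataset config; the layout below is an assumption based on the usual KITTI convention used by SFA3D-style code, so verify it against kitti_dataset.py before training:
dataset/
  kitti/
    training/
      velodyne/   # Velodyne .bin point clouds
      label_2/    # training labels
      calib/      # calibration files
      image_2/    # left color images (visualization only)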
To visualize the 3D point clouds with their 3D bounding boxes, execute:
cd model_quant_compile/data_process/
python kitti_dataset.py
Model Training
To train the model on a GPU, run:
python train.py --gpu_idx 0
This command uses a single GPU for training, but distributed training is also supported (see the sketch below). In addition, you can choose fpn_resnet or resnet as the target model. The trained model will be stored in a checkpoint folder named "Model_resnet/fpn_resnet_epoch_#". Depending on your hardware, you can train for anywhere between 10 and 300 epochs; more epochs generally yield higher accuracy.
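For reference, a multi-GPU distributed run looks roughly like the following; the flag names follow the SFA3D training script and should be checked against python train.py --help:
python train.py --multiprocessing-distributed --world-size 1 --rank 0 --batch_size 64 --num_workers 8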
Model Quantization and Compilation
Similarly, since PYNQ's VART is v1.4, we need Vitis AI v1.4 instead of the latest version (v2.0) to perform model quantization.
# install the docker image first (if not already installed)
docker pull xilinx/vitis-ai-cpu:1.4.1.978
# run the docker
./docker_run.sh xilinx/vitis-ai-cpu:1.4.1.978
We then quantize the model using the following commands:
# activate the pytorch environment
conda activate vitis-ai-pytorch
# install required packages
pip install -r requirements.txt
# in quantize.py, set the default quant_mode to calib
ap.add_argument('-q', '--quant_mode', type=str, default='calib', choices=['calib','test'], help='Quantization mode (calib or test). Default is calib')
# this calibrates and quantizes the example model: Model_resnet_18_epoch_10.pth
python quantize.py
# then set the default quant_mode to test
ap.add_argument('-q', '--quant_mode', type=str, default='test', choices=['calib','test'], help='Quantization mode (calib or test). Default is calib')
# this exports the quantized model
python quantize.py
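For reference, the core of a Vitis AI 1.4 PyTorch quantization script follows the pytorch_nndct flow. The sketch below is a minimal outline; the stand-in model, input shape, and forward loop are assumptions, and the repository's quantize.py is authoritative:
import torch
import torch.nn as nn
from pytorch_nndct.apis import torch_quantizer  # Vitis AI 1.4 PyTorch quantizer

# Stand-in model: replace with the trained SFA3D fpn_resnet and its checkpoint.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
dummy_input = torch.randn(1, 3, 608, 608)  # assumed BEV input shape; match your config

quant_mode = 'calib'  # first pass: 'calib'; second pass: 'test'
quantizer = torch_quantizer(quant_mode, model, (dummy_input,), device=torch.device('cpu'))
quant_model = quantizer.quant_model

# Run representative data through the quantized wrapper to calibrate (or evaluate).
quant_model(dummy_input)

if quant_mode == 'calib':
    quantizer.export_quant_config()              # writes calibration results
else:
    quantizer.export_xmodel(deploy_check=False)  # emits the deployable xmodel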
Next, we will compile the model:
./compile.sh zcu102 build/
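Under the hood, compile.sh wraps the Vitis AI compiler; a rough sketch of the command it issues is shown below (the xmodel name and paths are assumptions, so read the actual script for the real ones):
vai_c_xir -x quantize_result/CNN_int.xmodel \
          -a /opt/vitis_ai/compiler/arch/DPUCZDX8G/ZCU102/arch.json \
          -o build/ -n CNN_zcu102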
Don't worry about the zcu102 target here: the ZCU102 shares the same DPU architecture as the KV260, so the compiled model runs on both. You will see a success message once compilation completes.
At this point we have a compiled xmodel that can be executed on the DPU of the KV260. Next, we deploy it on the board and develop the application code.
KV260 Deployment
Following the official guide, we first install the Ubuntu operating system on the KV260. Then we install PYNQ on the board according to the PYNQ-DPU GitHub repository.
After setting up the board, we need to install git, clone the code to the board, and copy the compiled xmodel into the project folder, as shown below.
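Concretely, on the board this amounts to something like the following (the destination folder for the xmodel is an assumption; put it wherever the demo scripts load it from):
sudo apt install git
git clone https://github.com/SoldierChen/DPU-Accelerated-3D-Object-Detection-on-Point-Clouds.git
cp build/CNN_zcu102.xmodel DPU-Accelerated-3D-Object-Detection-on-Point-Clouds/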
Application Code Design
Here we will describe how to call and interface with the DPU for inference.
We first load the DPU overlay and the customized xmodel. Then we read the input and output tensor information so that the buffers match the model: here there is exactly one input tensor and five output tensors, and we allocate the input and output buffers accordingly.
# imports needed by this snippet
import numpy as np
from pynq_dpu import DpuOverlay

# load the DPU overlay and the compiled model
overlay = DpuOverlay("dpu.bit")
overlay.load_model("./CNN_zcu102.xmodel")
dpu = overlay.runner
# get tensor information
inputTensors = dpu.get_input_tensors()
outputTensors = dpu.get_output_tensors()
shapeIn = tuple(inputTensors[0].dims)
outputSize = int(outputTensors[0].get_data_size() / shapeIn[0])
shapeOut = tuple(outputTensors[0].dims)
shapeOut1 = tuple(outputTensors[1].dims)
shapeOut2 = tuple(outputTensors[2].dims)
shapeOut3 = tuple(outputTensors[3].dims)
shapeOut4 = tuple(outputTensors[4].dims)
# allocate input and output buffers.
# Note the output is a list of five tensors.
output_data = [np.empty(shapeOut, dtype=np.float32, order="C"),
               np.empty(shapeOut1, dtype=np.float32, order="C"),
               np.empty(shapeOut2, dtype=np.float32, order="C"),
               np.empty(shapeOut3, dtype=np.float32, order="C"),
               np.empty(shapeOut4, dtype=np.float32, order="C")]
# the input is only one tensor.
input_data = [np.empty(shapeIn, dtype=np.float32, order="C")]
image = input_data[0]
The one-shot inference process is encapsulated in the function below. We permute the input tensor into the NHWC layout the DPU expects, and after inference we permute the outputs back into the NCHW layout required for post-processing. Getting these permutations right is critical for correct results.
def do_detect(dpu, shapeIn, image, input_data, output_data, configs, bevmap, is_front):
    if not is_front:
        bevmap = torch.flip(bevmap, [1, 2])
    input_bev_maps = bevmap.unsqueeze(0).to("cpu", non_blocking=True).float()
    # permute NCHW -> NHWC to match the DPU input layout
    input_bev_maps = input_bev_maps.permute(0, 2, 3, 1)
    image[0, ...] = input_bev_maps[0, ...]
    # run inference on the DPU
    job_id = dpu.execute_async(input_data, output_data)
    dpu.wait(job_id)
    # convert the output arrays to tensors for the following post-processing
    outputs0 = torch.tensor(output_data[0])
    outputs1 = torch.tensor(output_data[1])
    outputs2 = torch.tensor(output_data[2])
    outputs3 = torch.tensor(output_data[3])
    outputs4 = torch.tensor(output_data[4])
    # permute NHWC -> NCHW for the decoder
    outputs0 = outputs0.permute(0, 3, 1, 2)
    outputs1 = outputs1.permute(0, 3, 1, 2)
    outputs2 = outputs2.permute(0, 3, 1, 2)
    outputs3 = outputs3.permute(0, 3, 1, 2)
    outputs4 = outputs4.permute(0, 3, 1, 2)
    outputs0 = _sigmoid(outputs0)
    outputs1 = _sigmoid(outputs1)
    # post-processing (decode and post_processing come from SFA3D)
    detections = decode(outputs0, outputs1, outputs2, outputs3, outputs4, K=configs.K)
    detections = detections.cpu().numpy().astype(np.float32)
    detections = post_processing(detections, configs.num_classes, configs.down_ratio, configs.peak_thresh)
    return detections[0], bevmap
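For context, here is a minimal calling sketch, assuming SFA3D's BEV utilities and config module; the module paths and file names are assumptions as far as this repository's exact layout goes:
import numpy as np
import torch
from data_process.kitti_bev_utils import makeBEVMap   # assumed SFA3D-style module path
import config.kitti_config as cnf                      # assumed SFA3D-style config

# load one Velodyne scan (x, y, z, intensity) from the KITTI data (path assumed)
lidar = np.fromfile("dataset/kitti/training/velodyne/000000.bin",
                    dtype=np.float32).reshape(-1, 4)

# build the bird's-eye-view map the network expects and run one DPU pass;
# dpu, shapeIn, image, input_data, output_data come from the snippets above,
# and configs is the demo configuration object from the repository's scripts
bevmap = torch.from_numpy(makeBEVMap(lidar, cnf.boundary))
detections, bevmap = do_detect(dpu, shapeIn, image, input_data, output_data,
                               configs, bevmap, is_front=True)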
Execute on KV260
Inference on the demo data will be performed on the DPU by running the following command:
python demo_2_sides-dpu.py
Then run the following command:
python demo_front-dpu.py
Performance ranges from 10 to 20 FPS, which is 100 to 200 times faster than execution on a server-grade CPU (Intel Xeon Gold 6226R).
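To reproduce the throughput figure for the inference stage alone, a simple timing loop around the DPU call suffices; this is a minimal sketch using the buffers allocated earlier:
import time

num_runs = 100
start = time.time()
for _ in range(num_runs):
    job_id = dpu.execute_async(input_data, output_data)
    dpu.wait(job_id)
elapsed = time.time() - start
print(f"DPU inference throughput: {num_runs / elapsed:.1f} FPS")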
Conclusion
In summary, we have shown how easy it is to use the AMD-Xilinx DPU on the KV260 to accelerate point-cloud-based 3D object detection. To further improve performance, we plan to optimize the model-inference stage by using multiple DPU instances, and the pre-processing and post-processing stages by using multi-threading and batching.