Panoramic/fisheye camera close-range perception for low-speed autonomous driving

Publisher: 平和的心情 · Last updated: 2022-11-28 · Source: elecfans

As shown in Figure 11, for a perspective camera, image translation occurs when the object moves at a constant Z distance from the camera, that is, in a plane parallel to the image plane. For a cylindrical image, by contrast, the object's distance in the horizontal plane must remain constant for image translation to occur (the object must rotate about the cylinder axis). In the raw fisheye image, it is unclear which object motion, if any, produces pure image translation.
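To make the contrast concrete, the sketch below (not from the article; the intrinsics f, cx, cy are placeholder values) compares a pinhole projection with a cylindrical projection and shows which 3D motion produces pure image translation in each case:

```python
# A minimal sketch contrasting perspective and cylindrical projection.
# Focal length and principal point are hypothetical values for illustration.
import numpy as np

f, cx, cy = 500.0, 640.0, 400.0  # assumed intrinsics (pixels)

def project_perspective(P):
    """Pinhole projection: motion in a plane of constant Z maps to image translation."""
    X, Y, Z = P
    return np.array([f * X / Z + cx, f * Y / Z + cy])

def project_cylindrical(P):
    """Cylindrical projection: rotation about the cylinder (vertical) axis,
    i.e. constant horizontal radius sqrt(X^2 + Z^2), maps to horizontal translation."""
    X, Y, Z = P
    theta = np.arctan2(X, Z)   # azimuth around the cylinder axis
    rho = np.hypot(X, Z)       # horizontal distance to the axis
    return np.array([f * theta + cx, f * Y / rho + cy])

# Object moving parallel to the image plane (constant Z):
# pure translation under the perspective model.
print(project_perspective([0.0, 1.0, 10.0]), project_perspective([1.0, 1.0, 10.0]))

# Object rotating about the camera's vertical axis (constant rho):
# pure horizontal translation under the cylindrical model.
for ang in (0.0, 0.2):
    X, Z = 10.0 * np.sin(ang), 10.0 * np.cos(ang)
    print(project_cylindrical([X, 1.0, Z]))
```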


WoodScape dataset

The WoodScape panoramic dataset was collected in two distinct geographical regions: the United States and Europe. While most of the data was obtained from sedans, a significant portion came from a sport utility vehicle, ensuring a strong mix of sensor mechanical configurations; the driving scenarios are split across highway, urban driving, and parking use cases. Intrinsic and extrinsic calibrations are provided for all sensors, as well as timestamp files to enable data synchronization (see the sketch after the sensor list), together with mechanical data of the associated vehicles (e.g., wheel circumference, wheelbase). The sensors recorded for this dataset are as follows:

1) 4x 1MPx RGB fisheye cameras (190◦ horizontal field of view)

2) 1x LiDAR, 20Hz rotation (Velodyne HDL-64E)

3) 1x GNSS/IMU (NovAtel Propak6 and SPAN-IGM-A1)

4) 1x GNSS positioning with SPS (Garmin 18x)

5) Odometer signal from vehicle bus
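As a rough illustration of how the per-sensor timestamp files could be used to synchronize the fisheye frames with the LiDAR sweeps, here is a minimal, hypothetical nearest-timestamp matching sketch; the rates and tolerance below are assumptions, not the WoodScape file layout:

```python
# Hypothetical synchronization sketch: match each camera frame to the
# closest LiDAR sweep by timestamp, rejecting matches with too large a gap.
import numpy as np

def nearest_match(cam_ts, lidar_ts, max_gap_s=0.05):
    """For each camera timestamp, return the index of the closest LiDAR
    timestamp, or -1 if the gap exceeds max_gap_s."""
    lidar_ts = np.asarray(lidar_ts)
    matches = []
    for t in cam_ts:
        i = int(np.argmin(np.abs(lidar_ts - t)))
        matches.append(i if abs(lidar_ts[i] - t) <= max_gap_s else -1)
    return matches

# Example: a 30 Hz camera against a 20 Hz LiDAR (synthetic timestamps).
cam = np.arange(0.0, 1.0, 1 / 30)
lidar = np.arange(0.0, 1.0, 1 / 20)
print(nearest_match(cam, lidar)[:10])
```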


System Architecture Considerations

An important consideration when designing computer vision systems for autonomous driving, especially the processing pipeline, is the set of constraints imposed by embedded systems, where multiple cameras and multiple computer vision algorithms must run in parallel. Since computer vision algorithms are computationally intensive, automotive SoCs include many dedicated hardware accelerators for image signal processing, lens distortion correction, dense optical flow, stereo disparity, and so on. In computer vision, deep learning plays a leading role in various recognition tasks and is gradually being adopted for geometric tasks such as depth and motion estimation.


To maximize the performance of the processing hardware, it is best to think of embedded vision in terms of processing stages and consider shared processing at each processing stage, with pipelines as shown in Figure 12.

1) Preprocessing: The preprocessing stage of the pipeline can be viewed as the processing that prepares the data for computer vision. This includes image signal processing (ISP) steps such as white balancing, denoising, color correction, and color space conversion. For a detailed discussion of ISP and its tuning for computer vision tasks in an automotive context, see [52]. ISP is typically performed by a hardware engine, for example as part of the main SoC; it is rarely done in software because of the large amount of pixel-level processing required. Methods have been proposed to automatically tune the hyperparameters of the ISP pipeline to optimize the performance of computer vision algorithms [52], [53], and, notably, to simplify the ISP for visual perception pipelines [54].

2) Pixel processing stage: Pixel processing can be considered the part of the computer vision architecture that operates directly on the image. In classical computer vision, these algorithms include edge detection, feature detection, descriptors, morphological operations, image registration, stereo disparity, and so on. In a neural network, this is equivalent to the early layers of the CNN encoder. Processing at this stage is dominated by relatively simple algorithms that must run over millions of pixels many times per second; that is, the computational cost comes from the sheer number of invocations rather than from the complexity of the algorithms themselves. The processing hardware at this stage is typically dominated by hardware accelerators and GPUs, although some elements may be suited to DSPs.

3) Intermediate processing stage: As the name suggests, the intermediate processing stage bridges the pixel and object processing stages. Here, the amount of data to be processed is still high, but significantly lower than in the pixel processing stage. This stage may include steps such as estimating vehicle motion through visual odometry, triangulating stereo disparity maps, and reconstructing general scene features; the CNN decoder also belongs to this stage of the pipeline. The processing hardware at this stage is usually a digital signal processor.

4) Object processing stage: The object processing stage is where higher-level reasoning is consolidated: point clouds can be clustered to create objects, objects can be classified, and reasoning such as that described above can be applied, for example to suppress rescaling of moving objects. Processing at this stage is dominated by more complex algorithms operating on far fewer data points. In terms of hardware, these algorithms are usually suited to general-purpose processing units such as ARM cores, although digital signal processors are also commonly used.

5) Post-processing: The final post-processing stage can also be called the global processing stage, since it persists data in time and space. Because long-term persistence and large spatial maps are possible, the overall goal of the previous stages is to minimize the amount of data reaching this stage while retaining all the information that is ultimately relevant to vehicle control. This stage includes steps such as bundle adjustment, map building, high-level object tracking and prediction, and fusion of the various computer vision inputs. Since this stage handles the highest level of reasoning in the system and, ideally, the fewest data points, general-purpose processing units are usually appropriate here. A minimal sketch of the full staged pipeline follows.
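This is an assumed skeleton for illustration only, not the article's implementation; the stage bodies are placeholders for the hardware-accelerated kernels that would run on a real SoC:

```python
# Minimal sketch of the five-stage pipeline: each stage consumes the previous
# stage's output, and the data volume shrinks from pixels towards objects.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Frame:
    image: Any                                         # raw sensor data in
    features: Any = None                               # pixel-stage output (edges, flow, encoder maps)
    geometry: Any = None                               # intermediate-stage output (disparity, odometry)
    objects: List[Dict] = field(default_factory=list)  # object-stage output
    world_state: Any = None                            # post-processing output (map, tracks)

Stage = Callable[[Frame], Frame]

def run_pipeline(frame: Frame, stages: List[Stage]) -> Frame:
    """Run the stages in order; on an embedded SoC each stage would map to
    different hardware (ISP engine, accelerators/GPU, DSP, general-purpose cores)."""
    for stage in stages:
        frame = stage(frame)
    return frame

# Placeholder stages; real implementations would call hardware-accelerated kernels.
def preprocess(f: Frame) -> Frame:          # ISP: white balance, denoise, colour conversion
    return f
def pixel_stage(f: Frame) -> Frame:         # dense per-pixel work (CNN encoder, optical flow)
    f.features = "encoder-features"; return f
def intermediate_stage(f: Frame) -> Frame:  # visual odometry, disparity triangulation, CNN decoder
    f.geometry = "scene-geometry"; return f
def object_stage(f: Frame) -> Frame:        # clustering, classification
    f.objects = [{"class": "pedestrian"}]; return f
def postprocess(f: Frame) -> Frame:         # mapping, tracking, fusion, bundle adjustment
    f.world_state = "fused-map"; return f

result = run_pipeline(Frame(image="raw"),
                      [preprocess, pixel_stage, intermediate_stage, object_stage, postprocess])
print(result.objects, result.world_state)
```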

Introduction to the 4R Components

Recognition

Recognition tasks identify the semantics of a scene through pattern recognition. In the automotive domain, the first successful application was pedestrian detection, which combined hand-crafted features such as histograms of oriented gradients with machine learning classifiers such as support vector machines. More recently, CNNs have shown significant performance leaps across a variety of object recognition tasks; however, this comes at a cost.

First, automotive scenes are very diverse, and the system is expected to work in different countries and under different weather and lighting conditions, so one of the main challenges is building an effective dataset that covers these different aspects. Second, CNNs are computationally intensive and usually require dedicated hardware accelerators or GPUs (unlike classical machine learning methods, which are feasible on general-purpose compute cores), so efficient design techniques are crucial. Finally, while CNNs for standard images have been studied extensively, the translation invariance assumption is, as mentioned earlier, broken for fisheye images, which poses additional challenges.

In this paper's recognition pipeline, a multi-task deep learning network is proposed to recognize objects based on appearance patterns. It comprises three tasks: object detection (pedestrians, vehicles, and cyclists), semantic segmentation (road, curbs, and road markings), and lens soiling detection (opaque, semi-transparent, transparent, clean). Object detection and semantic segmentation are standard tasks; see the FisheyeMultiNet paper for implementation details. One of the challenges is balancing the weights of the three tasks during training, because one task may converge faster than the others.

Fisheye cameras are mounted relatively low on the vehicle (∼0.5 to 1.2 m above the ground) and are susceptible to lens soiling from road spray thrown up by other vehicles or water on the road. It is therefore critical to detect soiling on the camera lens so that the driver can be alerted to clean the camera or a cleaning system can be triggered. The soiling detection task and its use for cleaning and algorithm degradation are discussed in detail in SoilingNet. A closely related task is desoiling, i.e., restoring the soiled areas through inpainting, but such techniques currently belong to the realm of visualization improvement rather than perception.

Inpainting is an ill-posed problem, because it is impossible to predict what lies behind an occlusion (although this can be improved by exploiting temporal information). Due to the limited CNN processing power of low-power automotive ECUs, this paper uses a multi-task architecture in which most of the computation is shared in the encoder, as shown in Figure 13.
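As a rough illustration of the shared-encoder idea, here is a minimal PyTorch sketch with one encoder and three lightweight heads (detection, segmentation, soiling); the layer sizes, class counts, and loss weights are illustrative assumptions, not the FisheyeMultiNet configuration:

```python
# Minimal multi-task sketch: a shared encoder feeds three task-specific heads,
# so the bulk of the computation runs once per frame.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_det=3, num_seg=3, num_soil=4):
        super().__init__()
        # Shared encoder: the bulk of the computation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Lightweight task-specific heads.
        self.seg_head = nn.Conv2d(32, num_seg, 1)              # per-pixel classes
        self.det_head = nn.Conv2d(32, num_det + 4, 1)          # class scores + box offsets per cell
        self.soil_head = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                       nn.Flatten(), nn.Linear(32, num_soil))

    def forward(self, x):
        feats = self.encoder(x)
        return self.det_head(feats), self.seg_head(feats), self.soil_head(feats)

# Weighted sum of task losses; the weights are illustrative and are the knob
# that must be balanced because the tasks converge at different rates.
def total_loss(det_loss, seg_loss, soil_loss, w=(1.0, 1.0, 0.5)):
    return w[0] * det_loss + w[1] * seg_loss + w[2] * soil_loss

net = MultiTaskNet()
det, seg, soil = net(torch.randn(1, 3, 128, 256))
print(det.shape, seg.shape, soil.shape)
```

In such a design, the per-task loss weights (or an uncertainty-based weighting scheme) are what control the training balance issue mentioned above.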


Reconstruction

As mentioned earlier, reconstruction means inferring scene geometry from a video sequence. This often means estimating a point cloud or voxelized representation of the scene, for example. Reconstruction of static objects is traditionally done using methods such as motion stereo [56] or triangulation in multi-view geometry [73]. In the context of designing depth estimation algorithms, a brief overview of how humans infer depth is given in [74], with useful further references. There are four basic approaches to inferring depth: monocular visual cues, motion parallax, stereopsis, and depth from focus.

Each of these approaches has its equivalent in computer vision, with Grimson providing a computational implementation of stereopsis in the early 1980s [76], based on early theoretical work by Marr & Poggio [75], and work on stereopsis continuing since then. However, stereo systems are not commonly deployed in vehicles, and as a result, monocular motion parallax methods remain popular in automotive research. Computationally, depth from motion parallax is traditionally done using feature triangulation [78], but motion stereo has also proven popular [79].
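As a concrete illustration of depth from motion parallax via feature triangulation, here is a minimal two-view sketch using OpenCV; the intrinsics, relative pose, and matched point are synthetic values chosen only for demonstration:

```python
# Two-view triangulation sketch: recover a 3D point from its projections
# in two frames related by a known ego-motion.
import cv2
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Camera 0 at the origin; camera 1 translated 0.5 m along x
# (simulating ego-motion between two frames).
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

# A ground-truth 3D point and its (noise-free) projections in both frames.
Xw = np.array([1.0, 0.2, 8.0, 1.0])
x0 = P0 @ Xw; x0 = x0[:2] / x0[2]
x1 = P1 @ Xw; x1 = x1[:2] / x1[2]

# Triangulate back to 3D (homogeneous output, so divide by the last row).
Xh = cv2.triangulatePoints(P0, P1, x0.reshape(2, 1), x1.reshape(2, 1))
print((Xh[:3] / Xh[3]).ravel())   # ~ [1.0, 0.2, 8.0]
```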

Considering fisheye images increases the complexity of the reconstruction task. Most work in multi-view geometry, stereo vision, and depth estimation assumes a planar perspective image of the scene, and traditional stereo methods further restrict the epipolar lines in the image to be horizontal. This is rarely the case for real cameras, because lens distortion breaks the planar projection model; it is usually addressed by calibrating and rectifying the images. For fisheye images with very severe lens distortion, however, it is not feasible to maintain a wide field of view during rectification. Several fisheye stereo depth estimation methods have been proposed. A common approach is multi-planar rectification, in which the fisheye image is mapped onto multiple perspective planes [82].
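To illustrate one plane of such a rectification, the sketch below maps a fisheye image onto a single virtual perspective view using OpenCV's fisheye (equidistant) model; the intrinsics and distortion coefficients are placeholder values, and a real multi-planar scheme would repeat this for several virtual view directions:

```python
# Single-plane fisheye-to-perspective rectification sketch with OpenCV.
import cv2
import numpy as np

K = np.array([[330.0, 0.0, 640.0],
              [0.0, 330.0, 480.0],
              [0.0, 0.0, 1.0]])                            # assumed fisheye intrinsics
D = np.array([0.05, -0.01, 0.003, -0.001]).reshape(4, 1)   # assumed distortion coefficients

fisheye_img = np.zeros((960, 1280, 3), dtype=np.uint8)     # stand-in for a real frame
out_size = (960, 720)                                      # (width, height) of the virtual plane

# New camera matrix for the virtual perspective plane; the balance parameter
# trades field of view against resampling distortion at the image edges.
K_new = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(
    K, D, fisheye_img.shape[1::-1], np.eye(3), balance=0.4, new_size=out_size)

map1, map2 = cv2.fisheye.initUndistortRectifyMap(
    K, D, np.eye(3), K_new, out_size, cv2.CV_16SC2)
rectified = cv2.remap(fisheye_img, map1, map2, interpolation=cv2.INTER_LINEAR)
print(rectified.shape)
```

Raising the balance parameter keeps more of the original field of view at the cost of stronger resampling distortion toward the edges, which is exactly the limitation discussed next.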

However, as mentioned earlier, any planar rectification (even with multiple planes) suffers from severe resampling distortion. To minimize this resampling, methods have been proposed that warp the fisheye image to non-planar geometries. Some warp to image geometries that preserve the stereo requirement of straight, horizontal epipolar lines [83], while others circumvent the horizontal-epipolar-line requirement altogether; for example, plane-sweep methods [84], [85] have recently been applied to fisheye images [86]. A related problem with resampling fisheye images is that the noise function is distorted by the resampling process, which affects any method that attempts to minimize reprojection error. Kukelova et al. [73] addressed this for standard field-of-view cameras with an iterative technique that minimizes the reprojection error while avoiding distortion; however, the method depends on the specific camera model and is therefore not directly applicable to fisheye cameras.
