Analysis of the characteristics and working principle of TLD visual tracking technology - EEWORLD


Intelligent video analysis was once very popular in urban rail transit monitoring. However, the monitoring environment of urban rail transit is complex: it covers a large area with a long perimeter and includes multiple platforms, multiple entrances and exits, and many fences and related facilities. This complexity makes intelligent analysis difficult. The relatively new TLD (Tracking-Learning-Detection) visual tracking technology can help address these problems.

The biggest feature of the TLD tracking system is that it continuously learns the locked target to obtain its latest appearance features, so that tracking improves over time. At the start, only a single static frame of the target is provided. As the target moves, the system continuously detects and learns the changes in its viewing angle, distance, depth of field and so on, and identifies it in real time. After a period of learning, the target becomes difficult to lose.

TLD consists of three parts: the tracker, the learning process and the detector. It combines tracking and detection into an adaptive and reliable tracking technology. The tracker and the detector run in parallel, and the results of both feed the learning process. The learned model in turn feeds back to the tracker and the detector and updates them in real time, so the target can be tracked continuously even if its appearance changes.

Tracker

Before tracking starts, the target to be tracked must be specified with a rectangular box. The TLD tracker uses an overlapping-block tracking strategy, in which each local block is tracked with the Lucas-Kanade optical flow method. The motion of the whole target is taken as the median of the motions of all local blocks. This local tracking strategy can cope with partial occlusion.
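The median step can be illustrated with a minimal sketch. It assumes that each local block has already been tracked (for example with Lucas-Kanade optical flow, which is not implemented here) and has reported a displacement (dx, dy). The function names `overall_motion` and `move_box` and the sample numbers are hypothetical.

```python
from statistics import median


def overall_motion(block_displacements):
    """Combine per-block (dx, dy) displacements into one target motion.

    Each displacement is assumed to come from tracking one local block.
    The median is taken separately for the x and y components.
    """
    if not block_displacements:
        raise ValueError("no block was tracked")
    dx = median(d[0] for d in block_displacements)
    dy = median(d[1] for d in block_displacements)
    return dx, dy


def move_box(box, motion):
    """Shift an (x, y, w, h) rectangle by the estimated motion."""
    x, y, w, h = box
    dx, dy = motion
    return (x + dx, y + dy, w, h)


# Five blocks: three agree, two are occluded and report wrong motion.
displacements = [(2.0, 1.0), (2.1, 0.9), (1.9, 1.1), (-7.0, 4.0), (9.0, -3.0)]
motion = overall_motion(displacements)
new_box = move_box((10, 20, 40, 30), motion)
print(motion, new_box)
```

Because the median ignores the two outlying blocks, the estimated motion follows the three blocks that agree, which is why partially occluded blocks do not pull the box away.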

Learning process

The learning process of TLD is based on an online model. The online model is a collection of image patches of size 15×15, which are obtained from the results of the tracker and the detector. The initial online model is the target image specified when tracking starts.

The online model is a dynamic model that grows or shrinks as the video sequence advances. Its development is driven by two events: growth events and pruning events. In practice, the appearance of the target changes constantly because of the environment and the target itself, so the target image predicted by the tracker tends to include more and more content other than the target. If all target images on the tracking trajectory are regarded as a feature space, then the feature space produced by the tracker keeps expanding as the video sequence advances. This is called a growth event. To prevent the impurities (non-target images) introduced by growth events from degrading tracking, pruning events are used as a counterbalance: they remove those impurities. The interaction of the two events keeps the online model consistent with the current tracking target.

The expansion of the feature space in a growth event comes from the tracker: appropriate samples are selected from the target images on the tracking trajectory and used to update the online model. There are three selection strategies, as follows.

Image patches similar to the initial target image are added to the online model;

If the target image tracked in the current frame is similar to that of the previous frame, the current tracking result is added to the online model;

The distance between each target image on the tracking trajectory and the online model is calculated, and target images that follow a specific pattern are selected: the distance is small at first, then gradually increases, and then returns to a small value. The trajectory is checked repeatedly for this pattern, and the target images within the pattern are added to the online model.

The sample selection of growth events keeps the online model up to date with the latest state of the tracking target and avoids the tracking loss caused by model updates that lag behind. The last selection strategy is one of the distinguishing characteristics of TLD and reflects its adaptive nature. It relies on the observation that when tracking drifts, the tracker gradually adapts to the background rather than suddenly jumping back to the target.
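The third selection strategy can be sketched as a scan over the per-frame distances. This is only an illustration under stated assumptions: the distance values, the threshold of 0.5, and the function name `select_growth_segments` are hypothetical, and "small" is taken to mean "below the threshold".

```python
def select_growth_segments(distances, threshold):
    """Find trajectory segments whose distance to the online model is
    small, then rises above the threshold, then becomes small again.

    Returns a list of (start, end) frame indices, both inclusive.
    """
    segments = []
    start = None   # index of the most recent frame with a small distance
    left = False   # True once the distance has risen above the threshold
    for i, d in enumerate(distances):
        if d < threshold:
            if start is not None and left:
                segments.append((start, i))  # small -> large -> small
            start = i
            left = False
        elif start is not None:
            left = True
    return segments


# Distance of the tracked patch to the online model, frame by frame.
distances = [0.1, 0.2, 0.6, 0.7, 0.3, 0.2, 0.8, 0.9]
segments = select_growth_segments(distances, threshold=0.5)

# Frames inside a matching segment are the candidates for the online model.
frames_to_add = [i for s, e in segments for i in range(s, e + 1)]
print(segments, frames_to_add)
```

In this example the excursion in frames 2 and 3 is accepted because the distance comes back down at frame 4, while the excursion that starts at frame 6 never returns and is therefore not selected.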

The pruning event assumes that there is only one target in each frame. When the tracker and the detector agree on the target location, all other detected images are treated as wrong samples and removed from the online model.

The samples in the online model provide the material for the learning process of TLD. In addition, TLD applies two constraints when training the classifier (a random forest): the P constraint and the N constraint. The P constraint stipulates that image blocks close to the target image on the tracking trajectory are positive samples. The N constraint stipulates that all other image blocks are negative samples. The P-N constraints reduce the error rate of the classifier, and within a certain range the error rate approaches zero.
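A minimal sketch of this labeling follows. It assumes that "close to the target image" is measured by the overlap (intersection over union) between a candidate box and the box on the tracking trajectory. The overlap measure, the threshold of 0.6, and the function names are assumptions made for illustration, not details given above.

```python
def overlap(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def label_patches(candidates, trajectory_box, min_overlap=0.6):
    """Split candidate boxes into positives (P constraint) and
    negatives (N constraint) by their overlap with the trajectory box."""
    positives, negatives = [], []
    for box in candidates:
        if overlap(box, trajectory_box) >= min_overlap:
            positives.append(box)   # close to the trajectory
        else:
            negatives.append(box)   # everything else
    return positives, negatives


trajectory_box = (10, 10, 20, 20)
candidates = [(10, 10, 20, 20), (12, 10, 20, 20),
              (20, 20, 20, 20), (100, 100, 20, 20)]
positives, negatives = label_patches(candidates, trajectory_box)
print(positives, negatives)
```

The first two candidates overlap the trajectory box heavily and become positive samples; the other two become negative samples.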

Detector

TLD includes a fast and reliable detector that provides the necessary support for the tracker. When the result obtained by the tracker is invalid, the result of the detector is used to supplement and correct it, and the tracker is reinitialized. The specific steps are as follows.

For each frame, the tracker and the detector are run simultaneously. The tracker predicts a single target location, while the detector may return multiple candidate images;

When the final position of the target is determined, the result of the tracker is given priority: if the similarity between the tracked image and the original target image is greater than a certain threshold, the tracking result is accepted; otherwise, the detector result most similar to the original target is selected as the tracking result;

If the detector result was selected in the second step, the initial target model of the tracker is updated: the original target model is replaced with the selected result, the samples in the previous model are deleted, and the model is rebuilt from new samples.
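The three steps can be sketched as a small decision function. This is an illustration only: the similarity function is passed in as a parameter because the text does not define it, and the threshold of 0.7, the name `fuse`, and the string labels standing in for image patches are hypothetical.

```python
def fuse(tracked, detections, similarity, threshold=0.7):
    """Choose the final result for one frame.

    tracked:    the tracker's patch, or None if tracking failed
    detections: list of patches returned by the detector
    similarity: function mapping a patch to its similarity to the
                original target image (assumed to be given)
    Returns (result, reinitialize), where reinitialize tells the caller
    to replace the tracker's target model with the result.
    """
    # Step 2: the tracker has priority if its result is similar enough.
    if tracked is not None and similarity(tracked) > threshold:
        return tracked, False
    if not detections:
        return None, False  # neither component found the target
    # Otherwise take the detection most similar to the original target.
    best = max(detections, key=similarity)
    # Step 3: the tracker must be reinitialized from this detection.
    return best, True


# Hypothetical similarity scores, keyed by patch label.
good = {"t": 0.9, "d1": 0.4, "d2": 0.6}
bad = {"t": 0.3, "d1": 0.4, "d2": 0.6}

kept = fuse("t", ["d1", "d2"], good.get)
replaced = fuse("t", ["d1", "d2"], bad.get)
lost = fuse("t", [], bad.get)
print(kept, replaced, lost)
```

When `reinitialize` is true, the caller would discard the samples of the previous target model and start again from the returned patch, as described in the third step.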

The detector is a random forest classifier trained on the samples in the online model. The selected feature is the edge direction within a region, called the 2bitBP feature, which is robust to changes in illumination. The feature is quantized into 4 possible codes, and a given region has exactly one code. Features at multiple scales can be computed efficiently with the integral image method.
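The following sketch shows one plausible way to compute such a 2-bit code with an integral image. It assumes that the two bits come from comparing the left half of the region with the right half and the top half with the bottom half. That comparison scheme, the function names, and the tiny test images are assumptions for illustration and are not specified above.

```python
def integral_image(img):
    """Summed-area table of a 2-D list; has one extra row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle (x, y, w, h) using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_bit_bp(ii, x, y, w, h):
    """Quantize the edge direction of a region into one of 4 codes."""
    left = box_sum(ii, x, y, w // 2, h)
    right = box_sum(ii, x + w // 2, y, w - w // 2, h)
    top = box_sum(ii, x, y, w, h // 2)
    bottom = box_sum(ii, x, y + h // 2, w, h - h // 2)
    return (int(left > right) << 1) | int(top > bottom)


# Left side brighter than right side, top equal to bottom.
img = [[9, 9, 1, 1]] * 4
# The same scene under twice the illumination.
brighter = [[2 * v for v in row] for row in img]
# Top brighter than bottom, right brighter than left.
img2 = [[5, 5, 9, 9], [5, 5, 9, 9], [1, 1, 5, 5], [1, 1, 5, 5]]

code = two_bit_bp(integral_image(img), 0, 0, 4, 4)
code_brighter = two_bit_bp(integral_image(brighter), 0, 0, 4, 4)
code2 = two_bit_bp(integral_image(img2), 0, 0, 4, 4)
print(code, code_brighter, code2)
```

Because only the order of the half-region sums matters, scaling the brightness of the whole region leaves the code unchanged, and each region sum costs four table lookups regardless of the region size.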

Each image block is represented by a number of 2bitBP features. These features are divided into groups of equal size, and each group gives a different representation of the appearance of the image block. The classifier used for detection is a random forest. The forest is composed of trees, each tree is built from one feature group, and each feature in the group serves as a decision node of the tree.

The random forest is updated and evolves online through growth events and pruning events. At the beginning, each tree is built from the feature set of the initial target template and has only one "branch". As growth events select positive samples, new "branches" are added to the forest; pruning events do the opposite and remove unused "branches". This real-time detector adopts a scanning-window strategy: it scans the input frame over positions and scales and applies the classifier to each sub-window to determine whether it contains the target.
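The growing and pruning of "branches" can be sketched as follows. The sketch assumes that an image block is described by a list of 2-bit feature codes (values 0 to 3), that a "branch" is the sequence of codes seen by one tree's feature group, and that a sub-window is accepted when a majority of trees recognize its branch. The majority vote, the class names, and the sample codes are assumptions for illustration.

```python
class OnlineTree:
    """One tree built from one feature group; stores known branches."""

    def __init__(self, feature_ids):
        self.feature_ids = tuple(feature_ids)
        self.branches = set()

    def branch_of(self, codes):
        # The path through the tree: one feature code per decision node.
        return tuple(codes[i] for i in self.feature_ids)


class OnlineForest:
    def __init__(self, feature_groups):
        self.trees = [OnlineTree(g) for g in feature_groups]

    def grow(self, codes):
        """Growth event: add the branch of a positive sample to each tree."""
        for tree in self.trees:
            tree.branches.add(tree.branch_of(codes))

    def prune(self, codes):
        """Pruning event: remove the branch of a wrong sample."""
        for tree in self.trees:
            tree.branches.discard(tree.branch_of(codes))

    def votes(self, codes):
        return sum(t.branch_of(codes) in t.branches for t in self.trees)

    def is_target(self, codes):
        # Assumed decision rule: a majority of trees must agree.
        return 2 * self.votes(codes) > len(self.trees)


# Six features split into three groups of equal size.
forest = OnlineForest([(0, 1), (2, 3), (4, 5)])
template = [2, 1, 0, 3, 1, 1]   # codes of the initial target template
variant = [2, 1, 0, 3, 0, 0]    # target with a slightly changed look
clutter = [0, 0, 1, 1, 0, 0]    # background block

forest.grow(template)           # each tree now has a single branch
before = (forest.is_target(template), forest.is_target(variant),
          forest.is_target(clutter))

forest.grow(clutter)            # a wrong sample slips in
accepted_clutter = forest.is_target(clutter)
forest.prune(clutter)           # a pruning event removes it again
after = (forest.is_target(template), forest.is_target(clutter))
print(before, accepted_clutter, after)
```

In a scanning-window detector, `is_target` would be called once for the codes of every sub-window, so keeping each tree to a set lookup keeps the per-window cost low.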

TLD technology cleverly combines the tracker, detector and learning process to achieve target tracking.