New SOTA for object detection with real-time recognition on edge devices; Harry Shum gives it a rare repost and like

Last updated: 2024-05-27

By Bai Jiao, from Aofeisi
QbitAI | Official account: QbitAI

Object detection has seen new progress:

Grounding DINO 1.5, produced by the IDEA Research Institute team, can achieve real-time recognition on edge devices.

The news was reposted by AI heavyweight Harry Shum, who typically reposts something only about once a year.

There are two main versions of this release: Pro and Edge. The Pro version is more powerful, and the Edge version is faster.

It retains the dual-encoder, single-decoder structure of its predecessor, Grounding DINO. On this basis, it scales up the model with a larger visual backbone and draws on more than 20 million grounding data samples for a rich training corpus, which greatly improves detection accuracy and speed. The Pro and Edge versions are each optimized for different application scenarios.

The Pro version excels at building large-scale datasets and in scenarios requiring high precision, while the Edge version shows its advantages in edge deployment.

Let’s take a look at each one separately.

Pro version: new SOTA in object detection

The Grounding DINO 1.5 Pro version reaches the current SOTA in open-set object detection. It performs well at joint semantic understanding of images and text, and can quickly and accurately detect and identify target objects in an image based on language prompts.
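To make prompt-driven detection concrete, here is a minimal sketch of the steps around the model call: building a text prompt and filtering the results. It assumes the convention of the original open-source Grounding DINO, where category names are written as one lowercase caption with each phrase ending in a period. Whether the 1.5 service expects the same format is an assumption. The names `Detection`, `build_prompt`, and `filter_detections`, and the 0.35 threshold, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, not a library type)."""
    phrase: str                              # text phrase the box was matched to
    score: float                             # confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def build_prompt(categories: List[str]) -> str:
    """Join category names into one lowercase caption, each phrase ending with a period."""
    cleaned = [c.strip().lower() for c in categories if c.strip()]
    return " ".join(f"{c}." for c in cleaned)


def filter_detections(detections: List[Detection], threshold: float = 0.35) -> List[Detection]:
    """Keep detections at or above the threshold, highest score first."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


if __name__ == "__main__":
    print(build_prompt(["Cat", " remote control "]))
    # Made-up detections standing in for real model output.
    raw = [
        Detection("cat", 0.82, (12.0, 30.0, 200.0, 310.0)),
        Detection("remote control", 0.21, (220.0, 40.0, 260.0, 90.0)),
    ]
    for det in filter_detections(raw):
        print(det.phrase, det.score)
```

Because the categories are ordinary text, changing what the detector looks for only means changing the prompt, which is what separates open-set detection from a fixed label list.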

△ Comparison of zero-shot transfer performance on COCO, LVIS, ODinW35, and ODinW13 benchmarks

Object-level understanding is the perceptual foundation for interaction between machines and the physical world, and an unavoidable issue in addressing the hallucination problem of large multimodal models (VLMs).

As the best performing open-set detection model currently available, Grounding DINO 1.5 Pro can help build massive amounts of multimodal data with object-level semantic information, thereby effectively assisting the training of large multimodal models.

It can accurately match phrases in long text descriptions to specific objects or scenes in images to enhance AI’s understanding of the relationship between visual content and text.
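As an illustration of what object-level data might look like once detections are stored, the sketch below converts a corner-format box into a COCO-style annotation record. The `bbox` layout of `[x, y, width, height]` follows the public COCO annotation format. The function name `to_coco_annotation` and the extra `score` field are assumptions made for this example and are not taken from the Grounding DINO 1.5 release.

```python
from typing import Dict, Tuple


def to_coco_annotation(
    ann_id: int,
    image_id: int,
    category_id: int,
    box_xyxy: Tuple[float, float, float, float],
    score: float,
) -> Dict[str, object]:
    """Convert an (x_min, y_min, x_max, y_max) box into a COCO-style annotation dict."""
    x_min, y_min, x_max, y_max = box_xyxy
    # Clamp to zero so a degenerate box never produces a negative size.
    width = max(0.0, x_max - x_min)
    height = max(0.0, y_max - y_min)
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x_min, y_min, width, height],  # COCO uses top-left corner plus size
        "area": width * height,
        "iscrowd": 0,
        "score": score,  # assumption: kept so low-confidence labels can be filtered later
    }


if __name__ == "__main__":
    ann = to_coco_annotation(1, 7, 3, (10.0, 20.0, 50.0, 80.0), 0.9)
    print(ann["bbox"], ann["area"])  # [10.0, 20.0, 40.0, 60.0] 2400.0
```

Keeping the confidence score next to each machine-generated label makes it possible to tighten or relax the quality bar after the fact without re-running detection.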

In addition, Grounding DINO 1.5 Pro also has great application value in other fields that need to process large amounts of complex data, such as e-commerce, social media, and autonomous driving.

For example, in the e-commerce field, the model can help quickly annotate product images and optimize search and recommendation systems. In social media, the model can automatically annotate pictures uploaded by users and improve the efficiency of content review and classification.

Support for fine-tuning on industry data

In addition, the Pro version supports fine-tuning on industry data to meet the specific needs of different industries and achieve more accurate recognition results.

To verify the improvement brought by fine-tuning, the CVR team ran comparative experiments on public datasets such as LVIS, which is widely used in computer vision.

As can be seen from the last two rows, Grounding DINO 1.5 Pro has shown significant performance improvements on multiple datasets after fine-tuning.

This also makes it well suited to many real-world scenarios.

For example, in the medical field, the fine-tuned Grounding DINO 1.5 Pro can more accurately identify lesions in medical images, assist doctors in diagnosis, and improve diagnosis and treatment efficiency.

In the retail industry, fine-tuned models can more accurately identify and classify products, helping with inventory management and sales analysis.

Edge version can be deployed on edge devices

For edge deployment, Grounding DINO 1.5 Edge was successfully deployed on the NVIDIA Orin NX through model structure optimization, reaching an inference speed of 10 FPS.

Furthermore, it allows robots to interact with open environments.

In autonomous driving, Grounding DINO 1.5 Edge could in the future run in real time on vehicles, providing efficient object detection and environmental perception and improving driving safety. In intelligent security, the model can quickly process video surveillance data, detect abnormal behavior in real time, and speed up the response of security monitoring.

In the future, the running speed of Grounding DINO 1.5 Edge is expected to rise to 20 to 30 FPS, further expanding its applications in edge computing.
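To put the frame rates above in perspective, the short calculation below converts FPS into the time available per frame. It is plain arithmetic rather than a benchmark, and the function name `frame_budget_ms` is chosen only for illustration.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    for fps in (10, 20, 30):
        print(f"{fps} FPS -> {frame_budget_ms(fps):.1f} ms per frame")
    # 10 FPS -> 100.0 ms per frame
    # 20 FPS -> 50.0 ms per frame
    # 30 FPS -> 33.3 ms per frame
```

Moving from 10 FPS to 30 FPS therefore means cutting end-to-end time per frame from 100 ms to roughly a third of that.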

Paper link:
https://arxiv.org/abs/2405.10300
Project demo link:
https://deepdataspace.com/playground/grounding_dino
