# A novel multi-modal retrieval framework for tracking vehicles using natural language descriptions

**Authors:** Changhao Zhang, Zhandong Liu, Ke Li, Yong Li, Xiangwei Qi, Nan Ding, Zhixia Li

PMC · DOI: 10.1371/journal.pone.0327468 · PLOS One · 2025-08-11

## TL;DR

This paper introduces a new system that uses natural language to track vehicles in traffic surveillance by combining text and video data.

## Contribution

The Multi-modal Vehicle Retrieval (MVR) system combines an end-to-end text-video learning model, CLIP-based feature extraction, and multi-context attributes to retrieve vehicle trajectories from natural language descriptions.

## Key findings

- The MVR system achieved a mean reciprocal rank (MRR) of 0.8966 on the test set of the 7th AI City Challenge Track 2.
- This score surpassed the previous top-ranked result on the public leaderboard.
- The framework effectively integrates text-video comparison and multi-context attributes for accurate retrieval.
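The MRR figure reported above averages, over all queries, the reciprocal of the rank at which the correct vehicle track appears. The following is a minimal sketch of that metric, not the challenge's official scoring code; the function name `mean_reciprocal_rank` and the toy track IDs are assumptions made for illustration.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the correct item across queries.

    rankings[i] is the ordered list of candidate IDs returned for query i
    (best match first); targets[i] is the correct ID for that query.
    A query whose target is missing from its ranking contributes 0.
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not rankings:
        raise ValueError("at least one query is required")

    total = 0.0
    for ranked, target in zip(rankings, targets):
        ranked_list = list(ranked)
        if target in ranked_list:
            rank = ranked_list.index(target) + 1  # ranks are 1-based
            total += 1.0 / rank
    return total / len(rankings)


# Hypothetical example: three queries over three candidate tracks.
rankings = [
    ["t3", "t1", "t2"],  # correct track t3 is ranked 1st
    ["t2", "t1", "t3"],  # correct track t1 is ranked 2nd
    ["t1", "t3", "t2"],  # correct track t2 is ranked 3rd
]
targets = ["t3", "t1", "t2"]
score = mean_reciprocal_rank(rankings, targets)  # (1 + 1/2 + 1/3) / 3
```

An MRR close to 1 means the correct track is usually ranked first; a score of 0.8966 therefore indicates that the correct track sits at or very near the top for most queries.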

## Abstract

Recent advances in multimodal and contrastive learning have significantly enhanced image and video retrieval capabilities. This fusion provides numerous opportunities for multi-dimensional and multi-view retrieval, especially in multi-camera surveillance scenarios in traffic environments. This paper introduces a novel Multi-modal Vehicle Retrieval (MVR) system designed to retrieve the trajectories of tracked vehicles using natural language descriptions. The MVR system integrates an end-to-end text-video comparison learning model, utilizes CLIP for feature extraction, and uses a matching control system and multi-context-based attributes. Post-processing techniques are used to eliminate erroneous information. By comprehensively understanding vehicle characteristics, the MVR system can effectively identify trajectories based on natural language descriptions. Our method achieves a mean reciprocal rank (MRR) score of 0.8966 on the test set of the 7th AI City Challenge Track 2 for retrieving tracked vehicles through natural language descriptions, surpassing the previous top-ranked result on the public leaderboard.
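Retrieval with a joint text-video embedding space is commonly done by scoring each candidate track against the query with cosine similarity and sorting by that score. The sketch below illustrates only that ranking step under stated assumptions: the three-dimensional vectors are toy stand-ins for embeddings (they are not CLIP outputs), and the names `cosine_similarity` and `rank_tracks` are hypothetical. It does not reproduce the MVR system's matching control, multi-context attributes, or post-processing.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_tracks(
    query_embedding: Sequence[float],
    track_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (track_id, similarity) pairs, most similar first."""
    scored = [
        (track_id, cosine_similarity(query_embedding, emb))
        for track_id, emb in track_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values, for illustration only).
tracks = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.1, 0.9, 0.2],
    "track_c": [0.5, 0.5, 0.5],
}
query = [1.0, 0.2, 0.0]  # stand-in for an encoded text description

ranking = rank_tracks(query, tracks)
ranked_ids = [track_id for track_id, _ in ranking]
```

In this toy setup `track_a` points in nearly the same direction as the query, so it is ranked first. A ranked list of this form is what a metric such as MRR is computed over.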

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12338829/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12338829/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/PMC12338829/full.md

---
Source: https://tomesphere.com/paper/PMC12338829