Driving with InternVL: Outstanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024

Jiahan Li; Zhiqi Li; Tong Lu

arXiv:2412.07247·cs.CV·December 11, 2024

TL;DR

This paper fine-tunes the open-source multimodal model InternVL-1.5 for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. It reaches the top of the leaderboard by concatenating nuScenes multi-view images into a form the model can consume and by automatically generating bounding-box annotations from object center points.

Contribution

The work adapts InternVL-1.5 to language-based autonomous driving tasks through two steps: formatting and concatenating the multi-view images, and automatically converting object center points into bounding boxes. Together these give the top score in the challenge.

Findings

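01. A single model achieved a score of 0.6002 on the challenge leaderboard.

02. Multi-view nuScenes images were formatted and concatenated so that InternVL-1.5 could process them while keeping its image understanding capabilities.

03. A simple automatic annotation strategy, which converts object center points in DriveLM-nuScenes into bounding boxes, proved effective.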
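The multi-view formatting in the findings above can be illustrated with a small sketch. The summary does not say how the views are arranged, so the 2x3 grid below is an assumption, as are the helper names `tile_offsets`, `to_canvas`, and `concat_views`. The camera names and the 1600x900 frame size are those of the nuScenes cameras. Images are modeled as nested lists of pixels to keep the example dependency-free.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: a 2x3 grid ordering; the report's actual layout may differ.
GRID: List[List[str]] = [
    ["CAM_FRONT_LEFT", "CAM_FRONT", "CAM_FRONT_RIGHT"],
    ["CAM_BACK_LEFT", "CAM_BACK", "CAM_BACK_RIGHT"],
]


def tile_offsets(grid: List[List[str]], view_w: int, view_h: int) -> Dict[str, Tuple[int, int]]:
    """Return the top-left canvas offset (x, y) of each camera tile."""
    offsets = {}
    for r, row in enumerate(grid):
        for c, name in enumerate(row):
            offsets[name] = (c * view_w, r * view_h)
    return offsets


def to_canvas(camera: str, x: float, y: float,
              offsets: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """Map a pixel coordinate in one camera view to the concatenated canvas."""
    ox, oy = offsets[camera]
    return (ox + x, oy + y)


def concat_views(views: Dict[str, List[list]], grid: List[List[str]]) -> List[list]:
    """Concatenate equally sized views (lists of pixel rows) into one canvas."""
    canvas = []
    for row in grid:
        height = len(views[row[0]])
        for y in range(height):
            line = []
            for name in row:
                line.extend(views[name][y])  # append this view's row side by side
            canvas.append(line)
    return canvas


offsets = tile_offsets(GRID, view_w=1600, view_h=900)
point_on_canvas = to_canvas("CAM_BACK", 100.0, 50.0, offsets)
```

Keeping an explicit offset table means any point given in one camera's pixel coordinates can be re-expressed on the single concatenated image, which is what a model that sees only the combined image would need.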

Abstract

This technical report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We utilized a powerful open-source multimodal model, InternVL-1.5, and conducted a full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To effectively handle the multi-view images of nuScenes and seamlessly inherit InternVL's outstanding multimodal understanding capabilities, we formatted and concatenated the multi-view images in a specific manner. This ensured that the final model could meet the specific requirements of the competition task while leveraging InternVL's powerful image understanding capabilities. Meanwhile, we designed a simple automatic annotation strategy that converts the center points of objects in DriveLM-nuScenes into corresponding bounding boxes. As a result, our single model achieved a score of 0.6002 on the…
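The automatic annotation step mentioned at the end of the abstract turns an object's center point into a bounding box. The excerpt does not describe how the box extent is chosen, so the following is only a minimal sketch under a stated assumption: a fixed box size supplied by the caller, clipped to the image bounds. The function name `center_to_bbox` and its parameters are hypothetical, and the report's actual strategy may derive box sizes differently.

```python
from typing import Tuple


def center_to_bbox(cx: float, cy: float, box_w: float, box_h: float,
                   img_w: int = 1600, img_h: int = 900) -> Tuple[float, float, float, float]:
    """Return (x1, y1, x2, y2) for a box of the given size centered at (cx, cy).

    ASSUMPTION: the box size is fixed and provided by the caller; this is an
    illustrative stand-in, not the method used in the report. The default
    image size matches nuScenes camera frames (1600x900).
    """
    x1 = max(0.0, cx - box_w / 2)
    y1 = max(0.0, cy - box_h / 2)
    x2 = min(float(img_w), cx + box_w / 2)
    y2 = min(float(img_h), cy + box_h / 2)
    return (x1, y1, x2, y2)


inner_box = center_to_bbox(800, 450, box_w=100, box_h=50)
edge_box = center_to_bbox(20, 10, box_w=100, box_h=50)
```

Clipping matters for objects near the frame edge: without it, the box would extend outside the image and produce invalid coordinates for training targets.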

Taxonomy

Topics: Semantic Web and Ontologies