Scalable Object Detection in the Car Interior With Vision Foundation Models

Sebastian Schmidt; Bálint Mészáros; Ahmet Firintepe; Stephan Günnemann

arXiv:2508.19651·cs.CV·May 14, 2026

TL;DR

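The paper introduces ODAL, a distributed framework that uses vision foundation models for object detection and localization in the car interior. It works around on-board resource constraints by splitting computation between the vehicle and the cloud, and it is evaluated with a new metric, ODALbench.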

Contribution

It proposes a novel distributed architecture for interior scene understanding using foundation models and introduces ODALbench for comprehensive performance assessment.
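The on-board/cloud split described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the JSON message layout, the `Detection` fields, and the stub model are all hypothetical and chosen only to show the idea that the vehicle packages a frame cheaply while the expensive foundation model runs remotely.

```python
import base64
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Detection:
    # Hypothetical result record; the paper's actual output format is not given here.
    label: str      # object name, e.g. "backpack"
    location: str   # coarse cabin position, e.g. "rear left seat"


def onboard_build_request(frame: bytes, camera_id: str) -> str:
    """On-board side: wrap the raw frame in a JSON message. No model runs here."""
    return json.dumps({
        "camera_id": camera_id,
        "image_b64": base64.b64encode(frame).decode("ascii"),
    })


def cloud_handle_request(request: str,
                         model: Callable[[bytes], List[Detection]]) -> str:
    """Cloud side: decode the frame, run the (assumed) foundation model, reply."""
    payload = json.loads(request)
    frame = base64.b64decode(payload["image_b64"])
    detections = model(frame)
    return json.dumps({
        "camera_id": payload["camera_id"],
        "objects": [{"label": d.label, "location": d.location}
                    for d in detections],
    })


def onboard_parse_response(response: str) -> List[Detection]:
    """On-board side: turn the cloud reply back into Detection records."""
    body = json.loads(response)
    return [Detection(o["label"], o["location"]) for o in body["objects"]]


def stub_model(frame: bytes) -> List[Detection]:
    # Placeholder standing in for a vision foundation model (assumption).
    return [Detection("backpack", "rear left seat")] if frame else []


if __name__ == "__main__":
    request = onboard_build_request(b"fake-image-bytes", camera_id="cabin-0")
    response = cloud_handle_request(request, stub_model)
    print(onboard_parse_response(response))
```

In this sketch the on-board functions only encode and decode messages, so their cost stays small regardless of how large the cloud-side model is. A real deployment would replace the direct function call with a network request.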

Findings

01

Fine-tuned ODAL-LLaVA achieves an ODAL score of 89%, a 71% improvement over its baseline before fine-tuning.

02

ODAL-LLaVA outperforms GPT-4o by nearly 20% in ODAL score.

03

Fine-tuning reduces hallucinations and maintains high detection accuracy.

Abstract

AI tasks in the car interior, such as identifying and localizing externally introduced objects, are crucial for the response quality of personal assistants. However, the computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board and cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework's potential to establish new standards in this domain. We…
