BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
Zekun Qian, Ruize Han, Wei Feng

TL;DR
BoxTuning introduces a visual prompting method that injects object spatial-temporal information directly into video frames, significantly reducing token costs and preserving fine-grained dynamics for improved video question answering.
Contribution
BoxTuning presents a novel visual prompting approach that embeds object information directly into frames, outperforming text-coordinate methods in efficiency and accuracy across multiple benchmarks.
Findings
Achieves an 87-93% reduction in text tokens compared to text-coordinate methods.
Surpasses baselines on spatially oriented video QA tasks.
Nearly eliminates accuracy loss on reasoning-centric tasks.
Abstract
Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token…
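The rendering idea described in the abstract (colored boxes and trajectory trails drawn onto frames, plus a short color-to-object legend kept as text) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`render_prompts`, `draw_box`, `draw_trail`), the three-color `PALETTE`, the list-of-pixels frame format, and the legend wording are all assumptions made for illustration, and the trail is drawn as dots at past box centers rather than as a connected line.

```python
# Illustrative sketch only; names, palette, and legend format are hypothetical.
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]
Frame = List[List[Color]]        # frame[y][x] = (R, G, B)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel coordinates

PALETTE: Dict[str, Color] = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
}

def blank_frame(width: int, height: int) -> Frame:
    """Create an all-black frame."""
    return [[(0, 0, 0) for _ in range(width)] for _ in range(height)]

def set_pixel(frame: Frame, x: int, y: int, color: Color) -> None:
    """Color one pixel, silently ignoring out-of-bounds coordinates."""
    if 0 <= y < len(frame) and 0 <= x < len(frame[0]):
        frame[y][x] = color

def draw_box(frame: Frame, box: Box, color: Color) -> None:
    """Draw a one-pixel box outline."""
    x0, y0, x1, y1 = box
    for x in range(x0, x1 + 1):
        set_pixel(frame, x, y0, color)
        set_pixel(frame, x, y1, color)
    for y in range(y0, y1 + 1):
        set_pixel(frame, x0, y, color)
        set_pixel(frame, x1, y, color)

def box_center(box: Box) -> Tuple[int, int]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)

def draw_trail(frame: Frame, centers: List[Tuple[int, int]], color: Color) -> None:
    """Mark past box centers as single dots."""
    for x, y in centers:
        set_pixel(frame, x, y, color)

def render_prompts(frames: List[Frame], tracks: Dict[str, List[Box]]) -> str:
    """Draw each object's box and trail in its own color; return the text legend.

    tracks maps an object name to one box per frame.
    """
    if len(tracks) > len(PALETTE):
        raise ValueError("more objects than palette colors")
    legend = []
    for color_name, (obj, boxes) in zip(PALETTE, tracks.items()):
        color = PALETTE[color_name]
        for t, frame in enumerate(frames):
            # Trail first (positions from earlier frames), then the current box.
            draw_trail(frame, [box_center(b) for b in boxes[:t]], color)
            draw_box(frame, boxes[t], color)
        legend.append(f"{color_name}: {obj}")
    return "; ".join(legend)

# Usage: two objects tracked over three tiny frames.
frames = [blank_frame(16, 12) for _ in range(3)]
tracks = {
    "dog": [(1, 1, 4, 4), (3, 1, 6, 4), (5, 1, 8, 4)],
    "ball": [(10, 6, 12, 8)] * 3,
}
legend = render_prompts(frames, tracks)
print(legend)
```

In this sketch the only text passed alongside the frames is the legend string, whose length grows with the number of objects rather than with the number of frames; per-frame positions live in the pixels. That is the property the abstract attributes to visual prompting, although the actual rendering style and legend format used by BoxTuning are not specified in this summary.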
