BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Zekun Qian; Ruize Han; Wei Feng

arXiv:2604.11136 · cs.CV · April 14, 2026

TL;DR

BoxTuning is a visual prompting method that renders object spatial-temporal information directly onto video frames, cutting text token cost by 87-93% relative to text-coordinate methods while preserving fine-grained dynamics for video question answering.

Contribution

The paper presents a visual prompting approach that embeds object information directly into frames and outperforms text-coordinate methods in both efficiency and accuracy across multiple benchmarks.

Findings

01 Achieves 87-93% text token reduction compared to text-coordinate methods.

02 Surpasses baselines on spatially oriented video QA tasks.

03 Nearly eliminates accuracy loss on reasoning-centric tasks.

Abstract

Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token…
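The rendering idea described in the abstract (colored boxes and trajectory trails drawn on frames, plus a short color-to-object legend kept as text) can be illustrated with a minimal sketch. This is not the authors' implementation: the frame representation (nested lists of RGB tuples), the palette, the legend wording, and every function name below are assumptions chosen to keep the example dependency-free.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]
Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), inclusive pixel coordinates
Frame = List[List[Color]]         # frame[y][x] is an RGB tuple

# Hypothetical palette; the paper's actual colors are not given in the abstract.
PALETTE: Dict[str, Color] = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
}


def blank_frame(width: int, height: int, fill: Color = (0, 0, 0)) -> Frame:
    """Create a solid-color frame as a stand-in for a decoded video frame."""
    return [[fill for _ in range(width)] for _ in range(height)]


def draw_box(frame: Frame, box: Box, color: Color) -> None:
    """Draw a one-pixel box outline in place, clipped to the frame."""
    h, w = len(frame), len(frame[0])
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(w - 1, x2)
    y1, y2 = max(0, y1), min(h - 1, y2)
    if x1 > x2 or y1 > y2:
        return  # box lies entirely outside the frame
    for x in range(x1, x2 + 1):
        frame[y1][x] = color
        frame[y2][x] = color
    for y in range(y1, y2 + 1):
        frame[y][x1] = color
        frame[y][x2] = color


def box_center(box: Box) -> Tuple[int, int]:
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, (y1 + y2) // 2


def draw_trail(frame: Frame, centers: List[Tuple[int, int]], color: Color) -> None:
    """Mark past box centers as single pixels (a crude trajectory trail)."""
    h, w = len(frame), len(frame[0])
    for cx, cy in centers:
        if 0 <= cx < w and 0 <= cy < h:
            frame[cy][cx] = color


def render_prompts(frames: List[Frame], tracks: Dict[str, List[Box]]) -> str:
    """Draw each object's trail and current box on every frame, in place.

    `tracks` maps an object name to one box per frame. Returns the text
    legend, which is the only object information left in the text modality.
    """
    color_names = list(PALETTE)
    assignment = {
        name: color_names[i % len(color_names)] for i, name in enumerate(tracks)
    }
    for t, frame in enumerate(frames):
        for name, boxes in tracks.items():
            color = PALETTE[assignment[name]]
            draw_trail(frame, [box_center(b) for b in boxes[:t]], color)
            draw_box(frame, boxes[t], color)
    return "; ".join(f"{assignment[name]} box = {name}" for name in tracks)


if __name__ == "__main__":
    frames = [blank_frame(8, 6) for _ in range(2)]
    tracks = {
        "dog": [(0, 0, 2, 2), (3, 1, 5, 3)],
        "ball": [(5, 3, 7, 5), (5, 3, 7, 5)],
    }
    print(render_prompts(frames, tracks))
```

In this sketch the text side grows only with the number of objects (one legend entry each), whereas a text-coordinate encoding would need coordinates for every object in every frame. That difference is the source of the token savings the abstract reports.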
