TL;DR
Hydra serves both document retrieval and generation from a single vision-language model, cutting peak GPU memory by over 60% relative to two-model setups while keeping performance comparable to specialized models.
Contribution
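The paper introduces Hydra, a dual-head vision-language model in which a single LoRA adapter, trained only for retrieval, is toggled at inference: enabled, the model produces multi-vector embeddings for late-interaction retrieval; disabled, it recovers the base model's generation quality. This design removes the need for a second model and structurally avoids two failure modes that can silently break generation in retrieval-fine-tuned VLMs.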
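The adapter toggle can be illustrated with a toy linear layer. This is a minimal sketch of the general LoRA idea (frozen base weight plus an optional low-rank update), not Hydra's actual code; the class `LoRALinear`, the helper `matvec`, and the `enabled` flag are hypothetical names chosen for illustration.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


class LoRALinear:
    """Toy linear layer: y = W x, plus scale * B (A x) when the adapter is on.

    Hypothetical illustration only; real adapters wrap many layers.
    """

    def __init__(self, weight, lora_a, lora_b, scale=1.0):
        self.weight = weight    # frozen base weight, shape (out, in)
        self.lora_a = lora_a    # low-rank down-projection, shape (r, in)
        self.lora_b = lora_b    # low-rank up-projection, shape (out, r)
        self.scale = scale
        self.enabled = False    # adapter off: behaves exactly like the base layer

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.enabled:
            # The low-rank update is added on the fly; base weights are never modified.
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


layer = LoRALinear(weight=[[1.0, 2.0], [3.0, 4.0]],
                   lora_a=[[1.0, 1.0]],
                   lora_b=[[1.0], [0.0]])
x = [1.0, 1.0]

generation_out = layer.forward(x)   # adapter disabled: base-model path
layer.enabled = True
retrieval_out = layer.forward(x)    # adapter enabled: retrieval path
layer.enabled = False
restored_out = layer.forward(x)     # identical to the first call
```

Because the update is applied as a separate additive term, switching the adapter off leaves the base weights untouched, which is the property that lets generation quality be recovered exactly.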
Findings
Hydra reduces peak GPU memory by over 60% compared to two-model setups.
Hydra achieves retrieval and generation performance comparable to specialized models.
A proof-of-concept demonstrates Hydra's transferability to other modalities like audio.
Abstract
Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model. A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality, with 426 of 426 language-model weight tensors byte-for-byte identical to a freshly-loaded Qwen3.5-4B. We identify two failure modes that can silently break generation in retrieval-fine-tuned VLMs (attention-mode restoration and lm_head preservation) plus an efficiency requirement (KV-cache-aware decoding); Hydra sidesteps the first two structurally and addresses the third in the decode loop. We release two scales,…
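For readers unfamiliar with ColBERT-style late interaction, the scoring rule can be sketched as follows: each query token embedding is matched to its most similar document embedding, and those maxima are summed (often called MaxSim). This is a generic illustration in plain Python under the assumption of cosine similarity on L2-normalized vectors; the function names and toy vectors are hypothetical and do not come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query vectors of the max cosine
    similarity against any document vector."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Toy multi-vector embeddings (assumed values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both query vectors
doc_b = [[2.0, 0.0]]               # covers only the first query vector

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
ranked = sorted([("doc_a", score_a), ("doc_b", score_b)],
                key=lambda item: item[1], reverse=True)
```

Here `doc_a` ranks above `doc_b` because it has a close match for every query vector, whereas a single pooled embedding per document could not capture that per-token coverage.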
