Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models
Chaeyoung Jung, Youngjoon Jang, Jongmin Choi, Joon Son Chung

TL;DR
This paper introduces Fork-Merge Decoding, an inference-time strategy that improves multimodal understanding in audio-visual large language models by reducing modality bias without additional training.
Contribution
It proposes a simple inference-time method that separates and then merges modality-specific reasoning to enhance balanced multimodal understanding in AV-LLMs.
Findings
Consistent performance improvements across multiple AV-LLMs.
Effective reduction of modality bias during inference.
No additional training or architectural changes needed.
Abstract
The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model tends to over-rely on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (fork), and then merges the resulting hidden states for joint reasoning in the remaining layers (merge). This separation allows each modality to be…
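The fork and merge steps described in the abstract can be sketched on a toy decoder. This is a minimal illustration, not the authors' implementation: the toy attention layer, the zero-masking of the other modality's tokens, and the convex combination weighted by `alpha` (including which branch receives `alpha`) are all assumptions made for the example. The names `toy_layer`, `run_layers`, and `fork_merge_decode` are hypothetical.

```python
import numpy as np


def toy_layer(h, w):
    """Stand-in for one decoder layer (assumption): softmax self-attention plus a residual."""
    scores = h @ h.T / np.sqrt(h.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return h + np.tanh(attn @ h @ w)


def run_layers(h, layers):
    """Apply a list of toy layers in order."""
    for w in layers:
        h = toy_layer(h, w)
    return h


def fork_merge_decode(tokens, audio_idx, video_idx, layers, l_fork, alpha):
    """Fork: run audio-only and video-only inputs through the first l_fork layers.
    Merge: combine the two hidden states, then run the remaining layers jointly."""
    # Assumption: "audio-only" / "video-only" inputs are built by zeroing
    # the other modality's token embeddings.
    audio_only = tokens.copy()
    audio_only[video_idx] = 0.0
    video_only = tokens.copy()
    video_only[audio_idx] = 0.0

    # Fork: modality-specific reasoning in the early layers.
    h_audio = run_layers(audio_only, layers[:l_fork])
    h_video = run_layers(video_only, layers[:l_fork])

    # Merge: assumed convex combination of the two hidden states.
    h = alpha * h_video + (1.0 - alpha) * h_audio

    # Joint reasoning in the remaining layers.
    return run_layers(h, layers[l_fork:])


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))  # 6 tokens, hidden size 8
layers = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(4)]
out = fork_merge_decode(tokens, audio_idx=[0, 1, 2], video_idx=[3, 4, 5],
                        layers=layers, l_fork=2, alpha=0.8)
print(out.shape)  # (6, 8)
```

Because the two forked passes share weights and differ only in their inputs, they could also be stacked into a single batch, so the extra cost is confined to the first `l_fork` layers.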
Peer Reviews
Decision · Submitted to ICLR 2026
- The design of FMD avoids additional training or architectural modifications.
- The paper is easy to follow.
- Lack of technical novelty: This method does not alter the fundamental nature of modality imbalance. During training, visual information has higher information density, and there are cases where the model can answer correctly by focusing solely on one modality—these factors lead to the model assigning different levels of attention to different modalities. Therefore, even though the model adopts a separate masking strategy, it still fails to address the aforementioned issues and thus cannot effe
1. **Novel yet simple method:** FMD is easy to implement, making it broadly applicable across different AV-LLMs.
2. **Comprehensive analysis:** Detailed quantitative studies are conducted, including attention analysis and ablation experiments on fork layers.
3. **Latency:** FMD places the fork layer among the early layers. This limits the additional computation, which may be reduced further through engineering (e.g., processing the masked audio and video inputs in parallel).
1. **Dataset/task-specific $\alpha$:** The optimization of $\alpha$ for specific datasets limits the robustness of FMD. For real-world applications, it may be infeasible to retrieve representative samples in advance.
2. **Models and benchmarks for validation:** Although representative AV-LLMs have been evaluated on benchmarks, the robustness still needs to be assessed on more models, such as reasoning models, since the improvement on VideoLLaMA2 and Qwen2.5-Omni is slight
1. Important and practical problem: Modality bias is a core challenge in modern multimodal LLMs; this work tackles a real-world issue with clear applicability.
2. Simple and efficient method: FMD requires no additional training or architectural changes, only input masking and hidden-state fusion during inference, making it a plug-and-play solution that is easy to deploy.
3. Comprehensive experiments: The evaluation covers diverse AV-LLM architectures and multiple benchmark datasets, demonstratin
1. Heuristic hyperparameter selection: The fusion weight $\alpha$ is estimated as a fixed value from 100 samples on AVHBench (e.g., $\alpha = 0.8$ for VideoLLaMA2). Although it generalizes well to other datasets, it lacks theoretical grounding or an adaptive mechanism. Manual calibration per model undermines true plug-and-play usability in real-world settings. As shown in Figure 6, the optimal $L_{\text{fork}}$ (fork layer depth) varies significantly across tasks (e.g., A→V prefers shallow layers, V→A prefers deeper ones
Taxonomy
Topics: Music and Audio Processing
