Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

Chaeyoung Jung; Youngjoon Jang; Jongmin Choi; Joon Son Chung

arXiv:2505.20873·cs.CV·October 1, 2025


Open Access 3 Reviews

TL;DR

This paper introduces Fork-Merge Decoding, an inference-time strategy that improves multimodal understanding in audio-visual large language models by reducing modality bias without additional training.

Contribution

It proposes a simple inference-time method that separates and then merges modality-specific reasoning to enhance balanced multimodal understanding in AV-LLMs.

Findings

1. Consistent performance improvements across multiple AV-LLMs.
2. Effective reduction of modality bias during inference.
3. No additional training or architectural changes needed.

Abstract

The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model tends to over-rely on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (fork), and then merges the resulting hidden states for joint reasoning in the remaining layers (merge). This separation allows each modality to be…
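The fork and merge steps described in the abstract can be sketched with a toy NumPy model. This is only an illustration under stated assumptions: `make_toy_layers`, `toy_layer`, and `fork_merge_decode` are hypothetical names, the "decoder layers" are random stand-ins rather than a real AV-LLM, masking is done by zeroing the other modality's tokens, and the merge is assumed to be a convex combination with weight `alpha`. The abstract as quoted here does not specify the exact merge rule, so the paper's method may differ.

```python
import numpy as np

def make_toy_layers(num_layers, dim, seed=0):
    """Hypothetical stand-in for decoder layers: one random weight matrix each."""
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(num_layers)]

def toy_layer(h, w):
    """Toy 'decoder layer': mean-pooled context plus a residual nonlinear update."""
    context = h.mean(axis=0, keepdims=True)  # crude stand-in for attention
    return h + np.tanh((h + context) @ w)

def run_layers(h, layers):
    for w in layers:
        h = toy_layer(h, w)
    return h

def fork_merge_decode(tokens, audio_mask, video_mask, layers, fork_layer, alpha):
    """Fork: run audio-only and video-only inputs through the first `fork_layer` layers.
    Merge: blend the two hidden states and continue through the remaining layers."""
    early, late = layers[:fork_layer], layers[fork_layer:]
    # Assumption: "masking" means zeroing the embeddings of the other modality.
    audio_only = np.where(video_mask[:, None], 0.0, tokens)
    video_only = np.where(audio_mask[:, None], 0.0, tokens)
    h_audio = run_layers(audio_only, early)
    h_video = run_layers(video_only, early)
    # Assumption: convex combination; which branch alpha weights is arbitrary here.
    merged = alpha * h_video + (1.0 - alpha) * h_audio
    return run_layers(merged, late)

rng = np.random.default_rng(1)
tokens = rng.normal(size=(6, 8))  # 2 audio, 2 video, 2 text tokens
audio_mask = np.array([True, True, False, False, False, False])
video_mask = np.array([False, False, True, True, False, False])
layers = make_toy_layers(num_layers=4, dim=8)
out = fork_merge_decode(tokens, audio_mask, video_mask, layers, fork_layer=2, alpha=0.8)
print(out.shape)  # (6, 8)
```

In this sketch, setting `alpha=1.0` reduces the procedure to an ordinary video-only forward pass, which makes the role of the merge weight easy to inspect.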

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The design of FMD avoids additional training or architectural modifications.
- The paper is easy to follow.

Weaknesses

- Lack of technical novelty: This method does not alter the fundamental nature of modality imbalance. During training, visual information has higher information density, and there are cases where the model can answer correctly by focusing solely on one modality—these factors lead to the model assigning different levels of attention to different modalities. Therefore, even though the model adopts a separate masking strategy, it still fails to address the aforementioned issues and thus cannot effe

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. **Novel yet simple method:** FMD is easy to implement, making it broadly applicable across different AV-LLMs.
2. **Comprehensive analysis:** Detailed quantitative studies are conducted, including attention analysis and ablation experiments on fork layers.
3. **Latency:** FMD places the fork layer among the early layers, so the additional computation is limited and may be reduced further by engineering (e.g., processing the masked audio and video inputs in parallel).
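The parallel-processing idea in the latency point can be illustrated with a small sketch: the two masked inputs are stacked along a batch axis and pushed through the early layers in one pass instead of two. Everything here is an assumption for illustration only: `toy_layer` and `run_early_layers` are hypothetical stand-ins for real decoder layers, and the random arrays stand in for the masked audio-only and video-only inputs.

```python
import numpy as np

def toy_layer(h, w):
    """Hypothetical decoder layer acting on arrays shaped (..., tokens, dim)."""
    context = h.mean(axis=-2, keepdims=True)  # crude stand-in for attention
    return h + np.tanh((h + context) @ w)

def run_early_layers(h, weights):
    for w in weights:
        h = toy_layer(h, w)
    return h

rng = np.random.default_rng(0)
dim, fork_layer = 8, 2
weights = [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(fork_layer)]
audio_only = rng.normal(size=(6, dim))  # stand-in for the video-masked input
video_only = rng.normal(size=(6, dim))  # stand-in for the audio-masked input

# Sequential fork: two separate passes through the early layers.
h_audio_seq = run_early_layers(audio_only, weights)
h_video_seq = run_early_layers(video_only, weights)

# Batched fork: one pass over a batch of size 2.
batch = np.stack([audio_only, video_only])  # shape (2, 6, dim)
h_audio_bat, h_video_bat = run_early_layers(batch, weights)

print(np.allclose(h_audio_seq, h_audio_bat), np.allclose(h_video_seq, h_video_bat))
```

Batching does not reduce the total arithmetic, but on parallel hardware it can hide much of the extra latency of the second branch.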

Weaknesses

1. **Dataset/task-specific $\alpha$:** The optimization of $\alpha$ for specific datasets limits the robustness of FMD. For real-world applications, it may be infeasible to retrieve representative samples in advance.
2. **Models and benchmarks for validation:** Although representative AV-LLMs have been evaluated on benchmarks, the robustness still needs to be assessed on more models, such as reasoning models, since the improvement on VideoLLaMA2 and Qwen2.5-Omni is slight

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Important and practical problem: Modality bias is a core challenge in modern multimodal LLMs; this work tackles a real-world issue with clear applicability.
2. Simple and efficient method: FMD requires no additional training or architectural changes, only input masking and hidden-state fusion during inference, making it a plug-and-play solution that is easy to deploy.
3. Comprehensive experiments: The evaluation covers diverse AV-LLM architectures and multiple benchmark datasets, demonstratin

Weaknesses

1. Heuristic hyperparameter selection: The fusion weight α is estimated as a fixed value from 100 samples on AVHBench (e.g., α = 0.8 for VideoLLaMA2). Although it generalizes well to other datasets, it lacks theoretical grounding or an adaptive mechanism. Manual calibration per model undermines true plug-and-play usability in real-world settings. As shown in Figure 6, the optimal $L_{\text{fork}}$ (fork layer depth) varies significantly across tasks (e.g., A→V prefers shallow layers, V→A prefers deeper ones
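To make the calibration concern concrete, the sketch below picks a fixed fusion weight from a small labeled sample by plain grid search on accuracy. This is a generic stand-in, not the paper's estimation procedure, which is not described in the text above. `calibrate_alpha` and `toy_predict` are hypothetical names, and the four samples are made-up toy data in which each input is an (audio score, video score) pair.

```python
def calibrate_alpha(samples, predict, candidates):
    """Return the (alpha, accuracy) pair with the highest accuracy on `samples`.

    `predict(x, alpha)` is a hypothetical callable wrapping the model;
    ties are broken in favor of the earliest candidate.
    """
    best_alpha, best_acc = candidates[0], -1.0
    for alpha in candidates:
        correct = sum(predict(x, alpha) == y for x, y in samples)
        acc = correct / len(samples)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc

def toy_predict(x, alpha):
    """Toy model: blend per-modality 'yes' scores and threshold at 0.5."""
    audio_score, video_score = x
    score = alpha * video_score + (1.0 - alpha) * audio_score
    return "yes" if score > 0.5 else "no"

# Made-up calibration set: ((audio_score, video_score), label).
samples = [
    ((0.9, 0.2), "no"),
    ((0.1, 0.8), "yes"),
    ((0.7, 0.6), "yes"),
    ((0.9, 0.3), "no"),
]
candidates = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
best_alpha, best_acc = calibrate_alpha(samples, toy_predict, candidates)
print(best_alpha, best_acc)
```

A value chosen this way is tied to the calibration sample, which is the robustness issue the reviewers raise: a different task or data distribution may favor a different weight.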


Taxonomy

Topics: Music and Audio Processing