Is Extending Modality The Right Path Towards Omni-Modality?
Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su

TL;DR
This paper critically examines whether extending modality in language models is an effective path toward true omni-modality, analyzing trade-offs, model merging, and generalization through extensive experiments.
Contribution
It provides an empirical analysis of modality extension techniques, assessing their impact on language abilities and the potential of model merging for omni-modality.
Findings
Modality extension may compromise core language skills.
Model merging can help achieve omni-modality.
Omni-modality extension does not always improve knowledge sharing.
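As a rough illustration of what "model merging" means in the findings above, here is a minimal sketch of weighted parameter averaging between two fine-tuned models that share one base architecture. It is not the paper's procedure: the function name `merge_weighted`, the toy parameter dictionaries, and the weights are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def merge_weighted(models: List[StateDict], weights: List[float]) -> StateDict:
    """Linearly interpolate parameters that share names and lengths.

    Hypothetical helper: weights are normalized to sum to 1 before averaging.
    """
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    reference = models[0]
    merged: StateDict = {}
    for name, ref_values in reference.items():
        # Every model must expose the same parameter with the same length.
        for model in models:
            if name not in model or len(model[name]) != len(ref_values):
                raise ValueError(f"parameter mismatch for {name!r}")
        merged[name] = [
            sum(w * model[name][i] for w, model in zip(norm, models))
            for i in range(len(ref_values))
        ]
    return merged


# Assumed toy checkpoints for a vision-tuned and an audio-tuned model.
vision_model = {"layer.weight": [1.0, 2.0]}
audio_model = {"layer.weight": [3.0, 4.0]}

equal_merge = merge_weighted([vision_model, audio_model], [1.0, 1.0])
skewed_merge = merge_weighted([vision_model, audio_model], [3.0, 1.0])
```

With equal weights each merged parameter is the plain mean of the two inputs. Unequal weights pull the result toward one model, which is one simple way to trade off modality-specific behavior against the other model's abilities.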
Abstract
Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities--such as text, images, video, and audio--while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead…
Peer Reviews
Decision: Submitted to ICLR 2026
The paper tackles important questions that, if addressed well, would be useful to the community. The paper does a good job of identifying benchmarks for language capabilities and considers a range of models as well. The methodology was easy to follow overall, but the experiments can be improved as discussed below.
My major concern is the lack of evidence for the claims throughout the paper. For instance, * **Visual modality extends the knowledge scope:** The paper highlights that Qwen-VL-Instruct obtains a 5% improvement compared with LLaVA. But this interpretation is based on MMLU-Pro. There is no difference in performance on MMLU. This is further confounded by the use of 1.4T paired samples by Qwen and only around 10M by LLaVA. It is thus also unfair to make claims based on the efficiency of vision over a
Deep empirical insight into multimodal learning trade-offs: The paper provides one of the most systematic investigations into how modality extension reshapes an LLM’s internal balance between language, reasoning, and multimodal understanding. By quantifying these effects across diverse benchmarks, it gives the community a clearer, evidence-based understanding of why multimodal expansion may harm reasoning, offering diagnostic insight rather than only performance metrics.
1. The paper’s conclusions are based mainly on Qwen2-era models, and newer architectures such as Qwen3 already show that the observed trade-offs between language and multimodal performance may not hold universally. This limits how far the conclusions can be generalized. 2. The paper does not provide a theoretical explanation for why omni-modality fine-tuning or model merging work or fail. Without an analytical view of optimization dynamics or representation sharing, it is difficult to know wheth
1. The paper systematically analyzes two major strategies for achieving omni-modality, covering text, image, video, and audio. 2. Clear introduction of the multimodal benchmarks used, and well-chosen metrics.
1. The paper's novelty falls below the bar of a top-tier conference such as ICLR, and it does not draw out new insights or novel conclusions. **The answers to the three main questions listed in the abstract have already been studied by previous relevant works**. For example, * [a] analyzed and conducted experiments to demonstrate that multimodality does not enhance the model's language capability (RQ1); * [a] also examined whether multimodal fine-tuning leads to better knowledge shar
Taxonomy
Topics: Syntax, Semantics, Linguistic Variation · Philosophy and Theoretical Science · Linguistics and Discourse Analysis
