Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning

Neale Ratzlaff; Man Luo; Xin Su; Vasudev Lal; Phillip Howard

arXiv:2412.03467·cs.CV·December 5, 2024



TL;DR

This paper investigates how multimodal instruction tuning affects the language reasoning of the underlying large language model and proposes a training-free model merging method that mitigates the resulting reasoning degradation while also improving visual task performance.

Contribution

The paper shows that the effect of multimodal tuning on language reasoning varies by base model and by task type, and it introduces a training-free model merging technique to counteract the loss in reasoning performance.

Findings

01

Multimodal tuning degrades Mistral's language reasoning but improves Vicuna's.

02

Mathematical reasoning performance declines, while commonsense reasoning improves after multimodal tuning.

03

A training-free model merging method mitigates reasoning degradation and enhances visual task performance.
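
The findings above do not spell out which merging scheme is used, so the following is only a minimal sketch of one common training-free form of model merging: element-wise linear interpolation between the weights of the original LLM and its multimodal-tuned counterpart. The function name `merge_linear`, the `alpha` parameter, and the toy dictionaries standing in for model weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def merge_linear(base: StateDict, tuned: StateDict, alpha: float = 0.5) -> StateDict:
    """Return alpha * base + (1 - alpha) * tuned for every shared parameter.

    This is an illustrative assumption, not the paper's confirmed method.
    alpha = 1.0 recovers the original LLM; alpha = 0.0 recovers the tuned model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("both models must expose the same parameter names")

    merged: StateDict = {}
    for name, base_w in base.items():
        tuned_w = tuned[name]
        if len(base_w) != len(tuned_w):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # No gradient steps are involved: the merge is pure arithmetic on weights.
        merged[name] = [alpha * b + (1.0 - alpha) * t for b, t in zip(base_w, tuned_w)]
    return merged


# Hypothetical toy weights for the original and multimodal-tuned language model.
original_llm = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
multimodal_llm = {"layer0.weight": [3.0, 4.0], "layer0.bias": [2.0]}
merged_llm = merge_linear(original_llm, multimodal_llm, alpha=0.5)
```

In a sketch like this, `alpha` sets the trade-off between keeping the original model's language behavior and keeping what was learned during multimodal tuning. Only parameters present in both models can be merged this way, so components such as the vision encoder would be left untouched.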

Abstract

Multimodal models typically combine a powerful large language model (LLM) with a vision encoder and are then trained on multimodal data via instruction tuning. While this process adapts LLMs to multimodal settings, it remains unclear whether this adaptation compromises their original language reasoning capabilities. In this work, we explore the effects of multimodal instruction tuning on language reasoning performance. We focus on LLaVA, a leading multimodal framework that integrates LLMs such as Vicuna or Mistral with the CLIP vision encoder. We compare the performance of the original LLMs with their multimodal-adapted counterparts across eight language reasoning tasks. Our experiments yield several key insights. First, the impact of multimodal learning varies between Vicuna and Mistral: we observe a degradation in language reasoning for Mistral but improvements for Vicuna across most…

Taxonomy

Topics: Speech and dialogue systems · Neurobiology of Language and Bilingualism

Methods: Contrastive Language-Image Pre-training · Focus