Rethinking Cross-Modal Fine-Tuning: Optimizing the Interaction Between Feature Alignment and Target Fitting

Trong Khiem Tran; Manh Cuong Dao; Phi Le Nguyen; Thao Nguyen Truong; Trong Nghia Hoang

arXiv:2601.18231·cs.LG·April 21, 2026

TL;DR

This paper introduces a theoretical framework for calibrating how feature alignment and target fine-tuning interact when a pre-trained model is adapted to a new modality, with the aim of improving generalization on the target task.

Contribution

It provides the first provable generalization bound on the target error that accounts for the interaction between feature alignment and target fitting, which can guide algorithm design.

Findings

01. Achieves state-of-the-art performance on benchmark datasets.

02. Develops a novel concept of feature-label distortion.

03. Provides theoretical insights into cross-modal adaptation.

Abstract

Adapting pre-trained models to unseen feature modalities has become increasingly important due to the growing need for cross-disciplinary knowledge integration. A key challenge here is how to align the representation of new modalities with the most relevant parts of the pre-trained model's representation space to enable accurate knowledge transfer. This requires combining feature alignment with target fine-tuning, but uncalibrated combinations can exacerbate misalignment between the source and target feature-label structures and reduce target generalization. Existing work, however, lacks a theoretical understanding of this critical interaction between feature alignment and target fitting. To bridge this gap, we develop a principled framework that establishes a provable generalization bound on the target error, which explains the interaction between feature alignment and target fitting…
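
To make "combining feature alignment with target fine-tuning" concrete, the following is a minimal sketch of a weighted objective that adds an alignment term to a target-fitting term. It is not the paper's method: the mean-embedding gap, the squared-error loss, the weight `lam`, and all function names are assumptions chosen purely for illustration.

```python
import numpy as np


def alignment_gap(source_feats, target_feats):
    """Squared distance between mean source and mean target embeddings.

    Hypothetical stand-in for an alignment term; the paper's actual
    alignment measure is not given in this abstract.
    """
    diff = source_feats.mean(axis=0) - target_feats.mean(axis=0)
    return float(diff @ diff)


def target_fit(preds, labels):
    """Mean squared error on the target task (assumed loss)."""
    return float(np.mean((preds - labels) ** 2))


def combined_objective(source_feats, target_feats, preds, labels, lam):
    """Target-fitting loss plus lam times the alignment term.

    lam is a hypothetical trade-off weight, not a quantity from the paper.
    """
    return target_fit(preds, labels) + lam * alignment_gap(source_feats, target_feats)


# Toy inputs: 4 source and 4 target embeddings of dimension 2.
source = np.zeros((4, 2))
target = np.ones((4, 2))
preds = np.array([1.0, 2.0])
labels = np.array([1.0, 4.0])

loss = combined_objective(source, target, preds, labels, lam=0.5)
```

With `lam = 0` the objective reduces to pure target fitting, and larger values weight alignment more heavily. An uncalibrated choice of such a weight is the kind of combination the abstract warns can reduce target generalization.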
