Directional Gradient Projection for Robust Fine-Tuning of Foundation Models
Chengyue Huang, Junjiao Tian, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira

TL;DR
This paper introduces DiGraP, a novel layer-wise method using directional gradient information to improve robust fine-tuning of foundation models across image classification and multi-modal tasks, especially under distribution shifts.
Contribution
It proposes DiGraP, a new gradient-based regularization technique for robust fine-tuning, and extends robust fine-tuning evaluation to multi-modal VQA benchmarks with analysis of distribution shifts.
Findings
DiGraP outperforms existing methods in image classification and VQA tasks.
It improves both in-distribution accuracy and out-of-distribution robustness.
The method generalizes well across different model backbones.
Abstract
Robust fine-tuning aims to adapt large foundation models to downstream tasks while preserving their robustness to distribution shifts. Existing methods primarily focus on constraining and projecting the current model towards the pre-trained initialization based on the magnitudes between fine-tuned and pre-trained weights, which often require extensive hyper-parameter tuning and can sometimes result in underfitting. In this work, we propose Directional Gradient Projection (DiGraP), a novel layer-wise trainable method that incorporates directional information from gradients to bridge regularization and multi-objective optimization. Besides demonstrating our method on image classification, as another contribution we extend this area to multi-modal evaluation settings for robust fine-tuning. Specifically, we first bridge the uni-modal and multi-modal gap by performing analysis on Image…
Peer Reviews
Decision: ICLR 2025 Poster
1. The proposed approach makes the projection strength $\omega$ a learnable parameter, tuned with a learning rate $\mu$. This makes the approach less sensitive to hyperparameters. The authors show good experimental evidence for the claim.
2. The proposed evaluation, which reformulates image classification as a VQA task, allows for evaluations on foundation VLMs like PaliGemma.
3. The authors show that the proposed approach can be adapted to PEFT approaches like LoRA.
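To make the role of the projection strength $\omega$ concrete, here is a minimal sketch of a PCGrad-style projection with a strength knob. It is an illustration under stated assumptions, not the paper's exact update rule: the regularization gradient is assumed to be $\theta - \theta_0$ (the gradient of half the squared distance to the pre-trained weights), $\omega$ is passed in as a fixed number rather than learned, and the names `dot` and `project_gradient` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def project_gradient(task_grad, reg_grad, omega):
    """PCGrad-style projection with strength omega (illustrative sketch).

    task_grad: gradient of the fine-tuning loss for one layer.
    reg_grad:  assumed regularization gradient, theta - theta_0.
    omega:     projection strength; 0 leaves task_grad unchanged,
               1 removes the whole conflicting component.
    """
    inner = dot(task_grad, reg_grad)
    norm_sq = dot(reg_grad, reg_grad)
    # No conflict (non-negative inner product) or no drift from the
    # pre-trained weights: keep the task gradient as it is.
    if inner >= 0.0 or norm_sq == 0.0:
        return list(task_grad)
    scale = omega * inner / norm_sq
    return [g - scale * r for g, r in zip(task_grad, reg_grad)]


theta = [1.0, 2.0]        # current layer weights (toy values)
theta_0 = [0.0, 0.0]      # pre-trained layer weights (toy values)
reg_grad = [t - t0 for t, t0 in zip(theta, theta_0)]
task_grad = [-1.0, 0.0]   # conflicts with reg_grad: inner product is -1

full = project_gradient(task_grad, reg_grad, omega=1.0)
half = project_gradient(task_grad, reg_grad, omega=0.5)
none = project_gradient(task_grad, reg_grad, omega=0.0)
```

With `omega=1.0` the projected gradient is orthogonal to the regularization gradient, which matches the usual PCGrad projection. Intermediate values remove only part of the conflicting component, which is the quantity a learnable $\omega$ would control per layer.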
1. While the paper addresses the problem of robust fine-tuning, the proposed approach Directional Gradient Projection appears similar to PCGrad. The problem of conflicting gradients and the idea of projecting the gradient to the normal plane have been addressed in PCGrad. I would encourage the authors to highlight the unique contributions more clearly.
2. The paper proposes a general approach for robust fine-tuning of foundation models, but the paper focuses on Image classification and VQA task
+ This work adopts multi-objective learning to alleviate previous methods' sensitivity to hyper-parameters and their underfitting problems, which seems to be a relatively novel attempt in the field.
+ The work expands the experimental setting from single-modal to multi-modal, filling a gap in experiments.
- Although the work attempts to utilize gradient direction information, the novelty of the method is still lacking. It looks like a combination of previous work, so the innovation needs to be explained further.
- Sec 4.1 lacks qualitative analysis of the experimental results, especially the reasons why the ID performance on the real domain is worse than LP-FT.
- The article seems to have only conducted a quantitative analysis of hyper-parameters for multi-modal tasks, and correspon
DiGraP’s gradient-based approach is innovative, incorporating directional information to handle conflicting objectives in a way that traditional regularization does not. The expansion from uni-modal to multi-modal evaluation settings is a contribution to robust fine-tuning. Comprehensive ablation studies and sensitivity analyses of the hyper-parameters (e.g., projection strength) strengthen the validity of the findings, showing DiGraP’s robustness and effectiveness across different c
There are many public unimodal and multimodal foundation models, e.g., MAE, CLIP, BEiT3, LLaVA, etc. It is unclear why ResNet50 and PaliGemma are selected as foundation models. A ResNet50 pretrained in a supervised manner on ImageNet can hardly be deemed a foundation model. PaliGemma is pretrained on a broad mixture of large-scale vision-language tasks. Whether the conclusion of this paper holds across other, more general foundation models, e.g., CLIP or LLaVA, is questionable. Although
