TRAM: Bridging Trust Regions and Sharpness Aware Minimization

Tom Sherborne; Naomi Saphra; Pradeep Dasigi; Hao Peng

arXiv:2310.03646 · cs.LG · March 13, 2024

TL;DR

TRAM combines trust-region and sharpness-aware minimization techniques to improve out-of-domain generalization in vision and language tasks by optimizing for low curvature in both parameter space and function space.

Contribution

The paper introduces TRAM, a fine-tuning algorithm that unifies trust-region and SAM strategies to improve domain transfer and representation robustness.

Findings

01

TRAM outperforms existing SAM and trust-region methods across vision and language tasks.

02

TRAM achieves superior domain transfer, especially in challenging anticorrelated domain scenarios.

03

Minimal additional computation is required compared to previous sharpness-aware methods.

Abstract

Sharpness-aware minimization (SAM) is reported to improve domain generalization by reducing the loss surface curvature in the parameter space. However, generalization during fine-tuning is often more dependent on the transferability of representations in the function space. Trust-region methods (TR) target this goal by regularizing representation curvature to reduce catastrophic forgetting of pre-trained task-agnostic information while adopting task-specific skills. We consider unifying these strategies for low curvature in both parameter space and function space to improve out-of-domain (OOD) generalization. We propose Trust Region Aware Minimization (TRAM), a SAM algorithm that fine-tunes for low parameter sharpness and for smooth, informative representations that preserve pre-trained structure. TRAM uses a trust region bound to inform the SAM adversarial neighborhood, introducing an awareness of…
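To make the idea in the abstract concrete, the following is a minimal sketch of a SAM-style update whose neighborhood radius is set from a function-space distance. It is an illustration under stated assumptions, not the authors' implementation (the official code is in the repository listed under Code & Models). The toy linear softmax classifier, the helper names (`loss_and_grad`, `mean_kl`, `sam_step`, `trust_region_radius`), the choice of KL divergence between successive predictive distributions as the distance, and the `floor` and `cap` clipping values are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def loss_and_grad(w, X, y):
    # Mean cross-entropy of a linear softmax classifier and its gradient w.r.t. w.
    p = softmax(X @ w)
    n = X.shape[0]
    loss = -np.log(p[np.arange(n), y] + 1e-12).mean()
    p[np.arange(n), y] -= 1.0  # p now holds (probabilities - one-hot labels)
    return loss, X.T @ p / n


def mean_kl(p, q):
    # Average KL(p || q) over examples: a distance between predictive distributions.
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))


def trust_region_radius(w, w_prev, X, floor=1e-3, cap=0.1):
    # ASSUMPTION: use the KL between the previous and current predictions as the
    # SAM radius, clipped to [floor, cap]. This is an illustrative rule only.
    kl = mean_kl(softmax(X @ w_prev), softmax(X @ w))
    return float(np.clip(kl, floor, cap))


def sam_step(w, X, y, rho, lr):
    # Standard two-pass SAM update: ascend to the worst-case point inside a
    # ball of radius rho, then descend using the gradient taken at that point.
    _, g = loss_and_grad(w, X, y)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    _, g_adv = loss_and_grad(w + eps, X, y)
    return w - lr * g_adv


# Toy usage on synthetic, linearly separable 3-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.argmax(X @ rng.normal(size=(4, 3)), axis=1)

w = np.zeros((4, 3))
w_prev = w.copy()
initial_loss, _ = loss_and_grad(w, X, y)
for _ in range(100):
    rho = trust_region_radius(w, w_prev, X)
    w_prev, w = w, sam_step(w, X, y, rho=rho, lr=0.5)
final_loss, _ = loss_and_grad(w, X, y)
```

In this sketch the radius shrinks when successive predictions barely change and grows, up to `cap`, when they change a lot; the clipping bounds are arbitrary and would need tuning in any real setting.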

Peer Reviews

Decision · ICLR 2024 spotlight

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- The paper did a good job summarizing the existing approaches and how the proposed method builds on top of them.
- The experimental settings are detailed and reasonable.
- Experiments seem quite comprehensive, at least for the settings considered in this work.
- It's quite remarkable that the proposed methods achieve the best performance across different fine-tuning tasks.

Weaknesses

- See the question section below.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. The proposed method is intuitive and well-motivated. The combination of SAM and Trust region methods is reasonable and interesting.
2. Extensive experiments on multiple NLP tasks demonstrate the effectiveness of the proposed method.

Weaknesses

1. Theoretical motivation for unifying SAM and Trust region methods is not provided.
2. Some results have high variance across runs. More runs may better characterize the performance.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The paper is clearly written and easy to follow.
2. I think the paper aims to contribute to SAM from a very interesting perspective, i.e. fine-tuning techniques. Considering that fine-tuning has become a nearly necessary procedure in NLP tasks, the paper may provide some promising guidance for further work.
3. Combining the proposed method with Fisher-SAM can reduce the extra forward-propagation count to the same count as in vanilla SAM.

Weaknesses

1. The core of the proposed method is to adaptively change the neighbourhood radius in SAM (or ASAM) based on a certain distance measure. This does not quite follow the idea of Trust Region Regularization, which adds an additional constraint on top of the loss according to the measure. More accurately, they are two different things. I also could not find a clear explanation of why using such a distance as the neighbourhood radius gives the "Trust". Several questions arise: what does the

Code & Models

Repositories

tomsherborne/tram_optimizer
PyTorch · Official

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling

Methods: Segment Anything Model · Attentive Walk-Aggregating Graph Neural Network · Sharpness-Aware Minimization