SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds

Wuxinlin Cheng; Yupeng Cao; Jinwen Wu; Koduvayur Subbalakshmi; Tian Han; Zhuo Feng

arXiv:2508.18306·cs.LG·August 27, 2025

3 Reviews

TL;DR

SALMAN introduces a unified framework for assessing language model robustness using a novel distance mapping distortion measure, enabling efficient, model-agnostic stability analysis without complex adversarial modifications.

Contribution

The paper proposes a new local robustness measure, Distance Mapping Distortion (DMD), that simplifies and unifies robustness assessment across different language models.

Findings

01 Significant improvements in attack efficiency.

02 Enhanced robustness training results.

03 Model-agnostic applicability demonstrated.

Abstract

Recent strides in pretrained transformer-based language models have propelled state-of-the-art performance in numerous NLP tasks. Yet, as these models grow in size and deployment, their robustness under input perturbations becomes an increasingly urgent question. Existing robustness methods often diverge between small-parameter and large-scale models (LLMs), and they typically rely on labor-intensive, sample-specific adversarial designs. In this paper, we propose a unified, local (sample-level) robustness framework (SALMAN) that evaluates model stability without modifying internal parameters or resorting to complex perturbation heuristics. Central to our approach is a novel Distance Mapping Distortion (DMD) measure, which ranks each sample's susceptibility by comparing input-to-output distance mappings in a near-linear complexity manner. By demonstrating significant gains in attack…
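The idea of ranking samples by how much the input-to-output mapping stretches distances can be illustrated with a minimal sketch. This is not the authors' DMD algorithm: it substitutes plain Euclidean distances for the graph-based manifold distances the abstract refers to, uses a brute-force O(n²) distance matrix rather than a near-linear method, and the names `pairwise_dist`, `distortion_scores`, `k`, and `eps` are hypothetical, introduced only for illustration.

```python
import numpy as np


def pairwise_dist(X):
    """Euclidean distance matrix for the rows of X (brute force, O(n^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off


def distortion_scores(inputs, outputs, k=5, eps=1e-12):
    """Illustrative per-sample distortion score (an assumption, not the paper's DMD).

    For each sample, average the ratio (output distance / input distance)
    over its k nearest neighbours in input space. A large score means that
    nearby inputs are mapped to distant outputs, i.e. the sample looks
    locally unstable under this simplified criterion.
    """
    d_in = pairwise_dist(inputs)
    d_out = pairwise_dist(outputs)
    n = d_in.shape[0]
    np.fill_diagonal(d_in, np.inf)            # a sample is not its own neighbour
    nbrs = np.argsort(d_in, axis=1)[:, :k]    # k nearest neighbours in input space
    rows = np.arange(n)[:, None]
    ratios = d_out[rows, nbrs] / (d_in[rows, nbrs] + eps)
    return ratios.mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(50, 8))              # stand-in for input embeddings
    y = x @ rng.normal(size=(8, 4))           # stand-in for model outputs
    scores = distortion_scores(x, y, k=5)
    ranking = np.argsort(-scores)             # most distorted samples first
    print(ranking[:5])
```

Under these assumptions, sorting samples by descending score gives a susceptibility ranking; the embeddings passed in would come from whatever input and output representations one chooses for the model under study.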

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 3

Strengths

- Originality: The paper proposes an original method for evaluating the per-sample stability of LLM outputs.
- Quality: The paper derives a novel algorithm, then investigates its core claims with empirical experiments.
- Clarity: The paper is generally clearly written.
- Significance: LLM robustness is an important area.

Weaknesses

W1 (Significance): It's unclear how to really interpret the cosine similarities in section 4.1 -- for the most part the similarities are high for robust and non-robust examples, and while the experiment shows that the metric better corresponds to sensitivity to changes in the input space, it's not really clear what the implications are.

W2 (Significance): The paper combines motivations from both adversarial robustness and robustness to random text modifications. It's unclear to me what benefits

Reviewer 02·Rating 4·Confidence 4

Strengths

- The article proposes numerous ways to evaluate a robustness measure for an LLM.
- The authors completed a large body of numerical experiments, some of which I had never encountered. Focusing on specific ones may be of separate interest (along with the overall positioning of the paper as a framework for LLM robustness evaluation from multiple points of view).
- Clear structure; the contributions are easy to understand and the paper is easy to read.

Weaknesses

The main concern for me in this paper is a weak answer to the question: Is the introduced SALMAN measure better than other possible measures (and the best overall)? Specifically,
- Weak connections to topological methods are presented. See [1] for a review and [2] for a specific MTop-Div similarity measure (also CKA can be used this way, I believe), suitable for comparing "input" and "output" embeddings, as applied previously to LLMs in e.g. [3]. See also another review on the similarity measure

Reviewer 03·Rating 4·Confidence 2

Strengths

- The paper is well-structured and easy to follow.

Weaknesses

- One concern is the alignment between the newly proposed stability metric and existing stability metrics. The definition is reasonable and the results look promising, but the metric is not compared with existing stability metrics. Ideally, the newly proposed approach would align with existing metrics in most cases and stand out on some special tasks to highlight its value.
