TL;DR
SALMAN introduces a unified framework for assessing language model robustness using a novel distance mapping distortion measure, enabling efficient, model-agnostic stability analysis without complex adversarial modifications.
Contribution
It proposes a new local robustness evaluation method, DMD, that simplifies and unifies robustness assessment across different language models.
Findings
Significant improvements in attack efficiency.
Enhanced robustness training results.
Model-agnostic applicability demonstrated.
Abstract
Recent strides in pretrained transformer-based language models have propelled state-of-the-art performance in numerous NLP tasks. Yet, as these models grow in size and deployment, their robustness under input perturbations becomes an increasingly urgent question. Existing robustness methods often diverge between small-parameter and large-scale models (LLMs), and they typically rely on labor-intensive, sample-specific adversarial designs. In this paper, we propose a unified, local (sample-level) robustness framework (SALMAN) that evaluates model stability without modifying internal parameters or resorting to complex perturbation heuristics. Central to our approach is a novel Distance Mapping Distortion (DMD) measure, which ranks each sample's susceptibility by comparing input-to-output distance mappings in a near-linear complexity manner. By demonstrating significant gains in attack…
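To make the idea of comparing input-to-output distance mappings concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes a simple per-pair ratio of output distance to input distance, averages that ratio over each sample's k nearest input-space neighbors, and builds a brute-force O(n^2) distance matrix rather than the near-linear procedure the abstract mentions. The function names (`dmd_scores`, `rank_by_distortion`), the Euclidean metric, and the parameter `k` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def pairwise_dist(A):
    """Dense Euclidean distance matrix for the rows of A (brute force)."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def dmd_scores(X, Y, k=3, eps=1e-12):
    """Hypothetical per-sample distortion score.

    X: (n, d_in) input embeddings, Y: (n, d_out) output embeddings.
    For each sample, average d_out / d_in over its k nearest neighbors
    in input space. Larger values mean small input changes map to large
    output changes, i.e. the sample looks less stable. Requires k < n.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d_in = pairwise_dist(X)
    d_out = pairwise_dist(Y)
    np.fill_diagonal(d_in, np.inf)           # never pick a sample as its own neighbor
    nbrs = np.argsort(d_in, axis=1)[:, :k]   # k nearest neighbors in input space
    rows = np.arange(X.shape[0])[:, None]
    ratios = d_out[rows, nbrs] / (d_in[rows, nbrs] + eps)
    return ratios.mean(axis=1)


def rank_by_distortion(X, Y, k=3):
    """Sample indices sorted from most to least distorted."""
    return np.argsort(-dmd_scores(X, Y, k))
```

In practice `X` and `Y` would be embeddings of the same samples taken before and after the model, for example pooled input-token embeddings and pooled final-layer hidden states. That choice is also an assumption of this sketch; the truncated abstract does not say which representations the authors use.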
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Originality: The paper proposes an original method for evaluating the per-sample stability of LLM outputs.
Quality: The paper derives a novel algorithm, then investigates its core claims with empirical experiments.
Clarity: The paper is generally clearly written.
Significance: LLM robustness is an important area.
W1 (Significance): It's unclear how to interpret the cosine similarities in section 4.1 -- for the most part the similarities are high for both robust and non-robust examples, and while the experiment shows that the metric better corresponds to sensitivity to changes in the input space, it's not clear what the implications are.
W2 (Significance): The paper combines motivations from both adversarial robustness and robustness to random text modifications. It's unclear to me what benefits
- The article proposes numerous ways to evaluate a robustness measure for an LLM.
- The authors completed a large body of numerical experiments, some of which I had never encountered. Focusing on specific ones might be of separate interest (as might the overall positioning of the paper as a framework for LLM robustness evaluation from multiple points of view).
- Clear structure, easy to understand the contributions, easy to read.
The main concern for me in this paper is a weak answer to the question: Is the introduced SALMAN measure better than other possible measures (and the best overall)? Specifically,
- Weak connections to topological methods are presented. See [1] for a review and [2] for a specific MTop-Div similarity measure (also CKA can be used this way, I believe), suitable for comparing "input" and "output" embeddings, as applied previously to LLMs in e.g. [3]. See also another review on the similarity measure
- The paper is well-structured and easy to follow.
- One concern is the alignment between the newly proposed stability metric and existing stability metrics. The definition is reasonable and the results look promising, but the metric is not compared with existing stability metrics. Ideally, the newly proposed approach would align with existing metrics in most cases and stand out on some special tasks to highlight its value.
