Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis

Liu Peng; Yaochu Jin

arXiv:2512.11912·cs.AI·December 16, 2025

Open Access · 3 Reviews

TL;DR

This paper systematically compares the robustness of various probabilistic models to low-quality data, revealing that autoregressive language models are highly resilient, while diffusion models are highly sensitive, with robustness influenced by conditioning richness and data information content.

Contribution

It provides a comprehensive multi-perspective analysis explaining the differing robustness levels of probabilistic models under data corruption.

Findings

01

Autoregressive language models show modest performance degradation with corrupted data.

02

Diffusion models degrade significantly under the same corruption levels.

03

Classifiers' robustness improves with larger datasets.

Abstract

A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient (for GPT-2, test NLL increases modestly from 2.87 to 3.59 despite 50% token corruption). By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens, integrating information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is heavily influenced by two key principles: the richness of conditioning…
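As a rough illustration of the kind of corruption the abstract refers to, the sketch below replaces each token with a uniformly drawn token id at a fixed rate. This is only an assumption about the procedure, based on the reviewers' description of the noise as uniform token replacement; the function name `corrupt_tokens`, the seed, and the toy input are hypothetical, and 50257 is the GPT-2 vocabulary size. The last lines compute the relative NLL increase implied by the two figures quoted in the abstract.

```python
import random


def corrupt_tokens(tokens, vocab_size, rate, seed=0):
    """Replace each token, independently with probability `rate`,
    by a token id drawn uniformly from [0, vocab_size).

    Hypothetical helper: the paper's actual corruption code is not shown here.
    """
    rng = random.Random(seed)
    return [
        rng.randrange(vocab_size) if rng.random() < rate else tok
        for tok in tokens
    ]


# Toy "sequence" of 1000 token ids, corrupted at the 50% rate from the abstract.
clean = list(range(1000))
noisy = corrupt_tokens(clean, vocab_size=50257, rate=0.5)

# A replacement can coincide with the original token by chance, so the
# observed fraction of changed tokens is only approximately equal to `rate`.
changed_fraction = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
print(f"fraction of tokens changed: {changed_fraction:.2f}")

# Relative increase in test NLL, using the two values quoted in the abstract.
nll_clean, nll_corrupted = 2.87, 3.59
relative_increase = (nll_corrupted - nll_clean) / nll_clean
print(f"relative NLL increase: {relative_increase:.1%}")
```

The corruption here is independent per token and unstructured, which is the setting Reviewer 01 distinguishes from structured, real-world noise.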

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

This paper has many strengths:
- I don't think I've ever seen a paper put this range of models into one controlled framework. I think the design here is very careful, and it highlights meaningful differences between these model types and their training dynamics.
- The theoretical framing is clean. It ties together a variety of ideas (information theory, PAC learning, gradient perspective) to try to explain what's going on here, giving more explanation beyond the high-level takeaways (e.g., that the richness of

Weaknesses

As stated in my summary, please note I am not an expert on all material in this paper. I've combined observations that I think might be weaknesses with related questions. **Noise model** All injected noise is random and unstructured (uniform token or label replacement). In practical settings, low-quality data are often structured (e.g., correlated, systematically mislabeled). Is it fair to say that the results therefore demonstrate robustness to _stochastic corruption_, not necessarily to _rea

Reviewer 02 · Rating 2 · Confidence 3

Strengths

In this paper, the authors provide a comprehensive review of two classes of generative models, autoregressive models for text generation and class-conditional diffusion models, in the presence of noisy or low-quality data at several signal-to-noise ratios. To perform their analysis, they employ metrics from a spectrum of perspectives, namely information theory, PAC learning, and gradient dynamics. They conclude that the robustness of the method is dependent on the task (richness of conditi

Weaknesses

The authors test two different types of generative models on different tasks, for different types of data (categorical and continuous), using different metrics. They do not provide a single task on which these models are tested using the same criteria, so in a sense the comparison is not objective or very informative. Also, the results that they present for uncorrupted data do not match the ones in the corresponding literature. For example, the test accuracy of CIFAR-100 was found to be signif

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The empirical observation that GPT-2 and ImageNet-scale classification can tolerate, or even thrive under, heavy label/target corruption is interesting and perhaps worth documenting.

Weaknesses

Beyond that, the theoretical analysis/explanation they provide seems weak and superficial. Each of the three perspectives allows for straightforward counter-arguments, and the paper stops short of unifying them into a single, predictive theory. The writing is polished and persuasive on the surface, but the theoretical substance is thin; as a result, the paper takes rather longer to read than it should. Below, I provide my view on each perspective the paper provides. 1) Information-theoretic per

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis