RN-F: A Novel Approach for Mitigating Contaminated Data in Large Language Models

Le Vu Anh; Dinh Duc Nha Nguyen; Phi Long Nguyen

arXiv:2505.13249·cs.LG·May 20, 2025

TL;DR

This paper introduces RN-F, a lightweight, model-agnostic method for detecting contaminated data in large language models. It runs in a single, gradient-free pass, adds no floating-point operations, and improves detection metrics by up to 10.5% over existing methods.

Contribution

We propose RN-F, a novel residual-noise fingerprinting framework that effectively identifies contaminated data in LLMs without additional floating-point operations.

Findings

01. RN-F outperforms existing methods by up to 10.5% in detection metrics.

02. RN-F is lightweight, gradient-free, and model-agnostic.

03. The approach is effective across multiple LLMs and datasets.

Abstract

Large Language Models (LLMs) have become foundational in modern artificial intelligence, powering a wide range of applications from code generation and virtual assistants to scientific research and enterprise automation. However, concerns about data contamination--where test data overlaps with training data--have raised serious questions about the reliability of these applications. Despite awareness of this issue, existing methods fall short in effectively identifying or mitigating contamination. In this paper, we propose Residual-Noise Fingerprinting (RN-F), a novel framework for detecting contaminated data in LLMs. RN-F is a single-pass, gradient-free detection method that leverages residual signal patterns without introducing additional floating-point operations. Our approach is lightweight, model-agnostic, and efficient. We evaluate RN-F on multiple LLMs across various contaminated…
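The abstract is truncated here and does not state the scoring rule, so the sketch below is only a generic illustration of residual-based scoring, not the authors' algorithm. It assumes the "residual" is the gap between a full-precision output and a quantized output; the repository name hints at this, but the source does not confirm it. The function names, the L2 norm, the quantile threshold, and the direction of the test (larger residual means flagged) are all hypothetical choices made for illustration. The sketch also makes no attempt to reproduce the paper's claim of adding no floating-point operations.

```python
import numpy as np


def fake_quantize(x, num_bits=4):
    """Uniform symmetric quantize-then-dequantize (illustrative stand-in
    for a real low-bit model; an assumption, not taken from the paper)."""
    x = np.asarray(x, dtype=float)
    levels = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / levels if max_abs > 0 else 1.0
    return np.round(x / scale) * scale


def residual_scores(full_outputs, quant_outputs):
    """Per-sample L2 norm of the difference between two output arrays
    of shape (n_samples, n_features)."""
    residual = np.asarray(full_outputs, dtype=float) - np.asarray(quant_outputs, dtype=float)
    return np.linalg.norm(residual, axis=-1)


def fit_threshold(clean_scores, quantile=0.95):
    """Pick a cutoff from scores computed on data assumed to be clean."""
    return float(np.quantile(clean_scores, quantile))


def flag(scores, threshold):
    """Flag samples whose score exceeds the cutoff (assumed direction)."""
    return np.asarray(scores) > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(8, 3))        # toy stand-in for a model
    weights_q = fake_quantize(weights, 4)    # toy "quantized" counterpart
    calib = rng.normal(size=(100, 8))        # assumed-clean calibration inputs
    test = rng.normal(size=(10, 8))          # inputs to be scored

    cutoff = fit_threshold(residual_scores(calib @ weights, calib @ weights_q))
    flags = flag(residual_scores(test @ weights, test @ weights_q), cutoff)
    print(cutoff, flags)
```

In practice the two output arrays would come from one forward pass each through a real model and its quantized counterpart; the toy linear map above only keeps the example self-contained.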

Code & Models

Repositories

csplevuanh/quant_anomaly (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)