MemGuard-Alpha: Detecting and Filtering Memorization-Contaminated Signals in LLM-Based Financial Forecasting via Membership Inference and Cross-Model Disagreement
Anisha Roy, Dip Roy

TL;DR
This paper introduces MemGuard-Alpha, a zero-cost, post-generation framework that detects and filters memorization-contaminated signals in LLM-based financial forecasting, improving out-of-sample performance.
Contribution
MemGuard-Alpha combines membership inference and cross-model disagreement to identify and filter memorized signals in LLM-based financial prediction.
Findings
CMMD achieves a Sharpe ratio of 4.11 versus 2.76 for unfiltered signals.
Clean signals yield 14.48 bps daily return versus 2.13 bps for contaminated signals.
In-sample accuracy increases with contamination, while out-of-sample accuracy decreases.
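For readers comparing the Sharpe ratios above, the following is a minimal sketch of how an annualized Sharpe ratio is commonly computed from a series of daily returns. It assumes 252 trading days per year and a zero risk-free rate by default. The function name and parameters are illustrative, and this summary does not state which annualization convention the paper uses.

```python
import math
import statistics


def annualized_sharpe(daily_returns, periods_per_year=252, risk_free_daily=0.0):
    """Annualized Sharpe ratio from daily returns expressed as decimals.

    Illustrative helper (not from the paper): mean excess return divided by
    the sample standard deviation, scaled by sqrt(periods_per_year).
    """
    excess = [r - risk_free_daily for r in daily_returns]
    mean_excess = statistics.fmean(excess)
    volatility = statistics.stdev(excess)  # sample standard deviation (n - 1)
    if volatility == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return mean_excess / volatility * math.sqrt(periods_per_year)


def bps_to_decimal(bps):
    """Convert basis points to a decimal return (1 bp = 0.0001)."""
    return bps / 10_000
```

Returns quoted in basis points, such as the 14.48 bps daily figure above, would first be converted with `bps_to_decimal` (14.48 bps is 0.001448 as a decimal) before being passed to `annualized_sharpe`.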
Abstract
Large language models (LLMs) are increasingly used to generate financial alpha signals, yet growing evidence shows that LLMs memorize historical financial data from their training corpora, producing spurious predictive accuracy that collapses out-of-sample. This memorization-induced look-ahead bias threatens the validity of LLM-based quantitative strategies. Prior remedies -- model retraining and input anonymization -- are either prohibitively expensive or introduce significant information loss. No existing method offers practical, zero-cost signal-level filtering for real-time trading. We introduce MemGuard-Alpha, a post-generation framework comprising two algorithms: (i) the MemGuard Composite Score (MCS), which combines five membership inference attack (MIA) methods with temporal proximity features via logistic regression, achieving Cohen's d = 18.57 for contamination separation (d =…
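The abstract reports contamination separation as a Cohen's d effect size. As a rough illustration of that statistic, the sketch below computes Cohen's d between two groups of scores using the standard pooled-standard-deviation definition. The function name is hypothetical, the inputs stand in for scores of contaminated and clean signals, and the paper's exact computation is an assumption here since the abstract is truncated.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d between two samples using the pooled standard deviation.

    Illustrative helper (not from the paper). Each group needs at least
    two observations so that a sample variance can be computed.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("Cohen's d is undefined when both groups have zero variance")
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd
```

A larger absolute value of d means the two score distributions overlap less, which is why it is a natural summary of how well a composite score separates contaminated from clean signals.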
