AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection
Junru Zhang, Lang Feng, Haoran Shi, Xu Guo, Han Yu, Yabo Dong, Duanqing Xu

TL;DR
AnomSeer enhances multimodal large language models for time-series anomaly detection by grounding reasoning in detailed structural analysis, improving accuracy and interpretability across diverse scenarios.
Contribution
Introduces a reinforcement learning approach, time-series grounded policy optimization (TimerPO), that uses expert chain-of-thought reasoning traces to improve anomaly classification, localization, and explanation.
Findings
Outperforms larger commercial models in classification accuracy
Provides verifiable, fine-grained reasoning traces
Effective across diverse anomaly scenarios
Abstract
Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide verifiable, fine-grained reasoning derived from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on…
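To make the idea of a verifiable trace built from classical analyses concrete, here is a minimal sketch. It is not the authors' ExpCoT generator: the function names (`zscore_outliers`, `dominant_period`, `build_trace`), the z-score threshold of 3.0, the step wording, and the synthetic sine series are all assumptions chosen for illustration. It only shows how a statistical measure and a frequency transform can each produce a step that a reader can check against the data.

```python
import cmath
import math
import statistics


def zscore_outliers(series, threshold=3.0):
    """Return indices whose z-score magnitude exceeds `threshold`."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mean) / std > threshold]


def dominant_period(series):
    """Return the period (in samples) of the strongest non-DC DFT component."""
    n = len(series)
    mean = statistics.fmean(series)
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        # Naive O(n^2) discrete Fourier transform, fine for a short example.
        coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n / best_k


def build_trace(series, threshold=3.0):
    """Assemble a short, checkable reasoning trace (hypothetical template)."""
    outliers = zscore_outliers(series, threshold)
    steps = [
        f"Step 1 (statistics): mean={statistics.fmean(series):.2f}, "
        f"std={statistics.pstdev(series):.2f}.",
        f"Step 2 (frequency): dominant period is about "
        f"{dominant_period(series):.1f} samples.",
        f"Step 3 (localization): points with |z| > {threshold} "
        f"at indices {outliers}.",
    ]
    verdict = "anomalous" if outliers else "normal"
    steps.append(f"Conclusion: the series looks {verdict}.")
    return steps


# Synthetic example: sine wave with period 16 and one injected spike at index 40.
series = [math.sin(2 * math.pi * t / 16) for t in range(64)]
series[40] += 8.0
trace = build_trace(series)
print("\n".join(trace))
```

Each step cites a number that can be recomputed from the raw series, which is the property the abstract calls verifiable.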
Peer Reviews
Decision · Submitted to ICLR 2026
- **Originality (compositional):** A thoughtful pairing of *process-evidence alignment* via OT with *orthogonal advantage composition* inside GRPO, targeted at TSAD. The ExpCoT design grounds CoT in verifiable TSAD signals rather than generic text heuristics.
- **Quality:** Solid performance improvements over zero-shot MLLMs and a strong RL baseline, with the biggest wins on the hard frequency/trend categories that motivated the paper. Ablations show each component matters; sensitivity to (\alph
- **Related Work coverage:** Missing discussion of **OT in RL/alignment** and **multi-objective/gradient-projection** literature (e.g., PCGrad). As a result, novelty may be under-justified as more than a careful composition.
- **Baselines:** Ablations remove components, but comparisons lack *alternative* multi-objective schemes: (i) simple weighted-sum (no projection), (ii) PCGrad-style gradient orthogonalization, (iii) replacing OT with cosine/CLIP-style similarity. These are crucia
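The contrast this review draws between orthogonal advantage composition and a simple weighted sum can be sketched on a single GRPO-style group. This is an illustration under stated assumptions, not the paper's TimerPO formula, which this page does not reproduce: the helper names (`group_normalize`, `weighted_sum`, `orthogonal_compose`), the weight `alpha`, and the reward values are hypothetical. Note also that PCGrad, as published, projects only when two gradients conflict (negative dot product), whereas this sketch always removes the parallel component.

```python
import math


def group_normalize(rewards):
    """GRPO-style advantage: center and scale rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant groups
    return [(r - mean) / std for r in rewards]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(base, aux, alpha):
    """Baseline composition: add the auxiliary advantage without projection."""
    return [b + alpha * a for b, a in zip(base, aux)]


def orthogonal_compose(base, aux, alpha):
    """Strip from `aux` its component along `base`, then add the remainder."""
    denom = dot(base, base)
    if denom == 0:
        return weighted_sum(base, aux, alpha)
    scale = dot(aux, base) / denom
    aux_perp = [a - scale * b for a, b in zip(aux, base)]
    return [b + alpha * a for b, a in zip(base, aux_perp)]


# Hypothetical group of four sampled responses.
outcome_rewards = [1.0, 0.0, 1.0, 0.0]  # e.g. answer correctness
process_scores = [0.9, 0.1, 0.3, 0.5]   # e.g. reasoning-alignment score

base = group_normalize(outcome_rewards)
aux = group_normalize(process_scores)

naive = weighted_sum(base, aux, alpha=0.5)
composed = orthogonal_compose(base, aux, alpha=0.5)

# What each scheme adds on top of the base advantage, measured along `base`.
naive_shift = dot([n - b for n, b in zip(naive, base)], base)
ortho_shift = dot([c - b for c, b in zip(composed, base)], base)
print(naive_shift, ortho_shift)
```

With the weighted sum, the auxiliary term changes the magnitude of the outcome signal; after projection, the added term is orthogonal to the base advantage and only redistributes credit among responses.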
The idea of incorporating expert-generated reasoning traces from classical TSAD methods is conceptually sound. It provides a structured and verifiable way to include numerical priors into the model. The TimerPO algorithm is also technically interesting, especially its use of Optimal Transport to measure reasoning similarity. The framework demonstrates decent generalization, with stable performance across datasets despite being trained only on synthetic data. This suggests a certain degree of ro
- Several issues limit the strength of the paper’s claims. First, the evaluation on the AnomLLM dataset may be affected by potential information leakage, since ExpCoT traces include ground-truth anomaly intervals.
- Second, the paper lacks ablation studies isolating the effects of ExpCoT and GRPO, which makes it difficult to understand their individual contributions.
- Third, the manuscript does not provide clear definitions or implementation details for the reported Affinity-Precision, Affinity
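The reviews describe Optimal Transport as the measure of reasoning similarity and ask how it compares with a cosine-style alternative. The sketch below illustrates one possible difference, assuming each reasoning step has already been embedded as a vector. The embeddings, the `1 - cosine` cost, the uniform step weights, and the regularization `eps` are all assumptions for illustration, and the paper's actual cost and solver may differ. It uses plain Sinkhorn iterations to compute an entropy-regularized transport cost.

```python
import math


def cosine_cost(u, v):
    """Cost between two step embeddings: 1 - cosine similarity (nonzero vectors)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (nu * nv)


def pooled_cosine_cost(xs, ys):
    """Cosine-style baseline: compare the mean embedding of each trace."""
    mean_x = [sum(col) / len(xs) for col in zip(*xs)]
    mean_y = [sum(col) / len(ys) for col in zip(*ys)]
    return cosine_cost(mean_x, mean_y)


def sinkhorn_cost(xs, ys, eps=0.05, iters=500):
    """Entropy-regularized OT cost between two step sets with uniform weights."""
    n, m = len(xs), len(ys)
    cost = [[cosine_cost(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling so the plan's row and column sums match a and b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))


# Hypothetical 2-D step embeddings: the expert trace has two distinct steps,
# while the "vague" trace repeats one blended step twice.
expert = [[1.0, 0.0], [0.0, 1.0]]
vague = [[1.0, 1.0], [1.0, 1.0]]

pooled = pooled_cosine_cost(expert, vague)
transport = sinkhorn_cost(expert, vague)
print(pooled, transport)
```

In this toy case the pooled cosine cost is essentially zero because the averaged embeddings point the same way, while the transport cost stays positive because no individual step in the vague trace matches an expert step.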
Clear and meaningful motivation addressing the lack of fine-grained reasoning in MLLMs for time-series tasks. Well-designed framework combining ExpCoT (expert chain-of-thought supervision) and TimerPO (reinforcement optimization). Strong experimental results demonstrating improved interpretability and reasoning quality.
My concerns are as follows:
1. The generation of ExpCoT requires traditional statistical analyses (FFT, residual detection, Matrix Profile, etc.), and each anomaly type needs specific parameters and templates. When transferring to new domains, the “expert reasoning templates” must be redefined, and the approach is difficult to scale automatically to large heterogeneous datasets.
2. Since each anomaly type is defined by fixed parameters and templates, if such reasoning chains are already effective, I am
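Of the classical analyses this review lists, Matrix Profile is probably the least familiar, so a naive version is sketched here. It is a brute-force illustration, not the pipeline used to build ExpCoT: the window length `m`, the exclusion zone of `m // 2`, the function names, and the synthetic series are assumptions. It also shows the reviewer's point that such analyses carry per-case parameters that must be chosen.

```python
import math


def znorm(window):
    """Z-normalize a window; constant windows map to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def matrix_profile(series, m):
    """Naive O(n^2 * m) profile: distance from each window to its nearest
    non-overlapping-neighbourhood match."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    exclusion = m // 2  # skip trivial matches with near-identical offsets
    profile = []
    for i, w in enumerate(windows):
        best = math.inf
        for j, other in enumerate(windows):
            if abs(i - j) <= exclusion:
                continue
            best = min(best, math.dist(w, other))
        profile.append(best)
    return profile


def top_discord(series, m):
    """Return (start, end) of the window least similar to any other window."""
    profile = matrix_profile(series, m)
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, start + m - 1


# Synthetic example: periodic signal with a flattened segment at indices 60..67.
series = [math.sin(2 * math.pi * t / 16) for t in range(128)]
for t in range(60, 68):
    series[t] = 0.0

start, end = top_discord(series, m=16)
print(start, end)
```

The reported window overlaps the flattened segment because every other window has a near-exact repeat one period away. A different anomaly type, or a series with a different period, would need a different `m`, which is the scaling concern raised above.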
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Time Series Analysis and Forecasting · Topic Modeling
