LLmFPCA-detect: LLM-powered Multivariate Functional PCA for Anomaly Detection in Sparse Longitudinal Texts
Prasanjit Dubey, Aritra Guha, Zhengyi Zhou, Qiong Wu, Xiaoming Huo, Paromita Dubey

TL;DR
This paper introduces LLmFPCA-detect, a novel framework combining large language model embeddings with functional principal component analysis to identify patterns and anomalies in sparse longitudinal textual data across various domains.
Contribution
It presents a new method that integrates LLM-based text embeddings with multivariate functional PCA for anomaly detection in complex, sparse longitudinal text datasets.
Findings
Outperforms state-of-the-art baselines in experiments.
Effective in diverse applications like customer reviews and online comments.
Enhances downstream predictive tasks with cluster-specific features.
Abstract
Sparse longitudinal (SL) textual data arises when individuals generate text repeatedly over time (e.g., customer reviews, occasional social media posts, electronic medical records across visits), but the frequency and timing of observations vary across individuals. These complex textual data sets have immense potential to inform future policy and targeted recommendations. However, because SL text data lack dedicated methods and are noisy, heterogeneous, and prone to anomalies, detecting and inferring key patterns is challenging. We introduce LLmFPCA-detect, a flexible framework that pairs LLM-based text embeddings with functional data analysis to detect clusters and infer anomalies in large SL text datasets. First, LLmFPCA-detect embeds each piece of text into an application-specific numeric space using LLM prompts. Sparse multivariate functional principal component analysis (mFPCA)…
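To make the functional PCA step in the abstract more concrete, here is a minimal, hedged sketch of the pooling idea behind sparse FPCA: each subject contributes only a few observations, so the mean and covariance are estimated by pooling across all subjects, and the leading eigenfunction is then extracted. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the equal-width time binning in place of smoothing, the single embedding dimension (a multivariate version would handle several dimensions jointly), the simulated data, and the least-squares score.

```python
import math
import random

def bin_index(t, n_bins):
    """Map a time t in [0, 1] to one of n_bins equal-width bins."""
    return min(int(t * n_bins), n_bins - 1)

def pooled_mean(subjects, n_bins):
    """Mean function on the grid, pooling observations from all subjects."""
    total = [0.0] * n_bins
    count = [0] * n_bins
    for obs in subjects:
        for t, y in obs:
            b = bin_index(t, n_bins)
            total[b] += y
            count[b] += 1
    return [total[b] / count[b] if count[b] else 0.0 for b in range(n_bins)]

def pooled_covariance(subjects, mean, n_bins):
    """Covariance surface from within-subject pairs of distinct observations.

    Pairing an observation with itself is skipped so that measurement
    noise does not inflate the diagonal. Empty cells are set to 0.
    """
    total = [[0.0] * n_bins for _ in range(n_bins)]
    count = [[0] * n_bins for _ in range(n_bins)]
    for obs in subjects:
        resid = []
        for t, y in obs:
            b = bin_index(t, n_bins)
            resid.append((b, y - mean[b]))
        for j, (bj, rj) in enumerate(resid):
            for k, (bk, rk) in enumerate(resid):
                if j != k:
                    total[bj][bk] += rj * rk
                    count[bj][bk] += 1
    return [[total[a][b] / count[a][b] if count[a][b] else 0.0
             for b in range(n_bins)] for a in range(n_bins)]

def leading_eigenvector(matrix, n_iter=200):
    """Power iteration; returns a unit-norm vector on the grid."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        w = [sum(matrix[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def least_squares_score(obs, mean, phi, n_bins):
    """Simplified per-subject score: regress centered values on phi."""
    num = den = 0.0
    for t, y in obs:
        b = bin_index(t, n_bins)
        num += (y - mean[b]) * phi[b]
        den += phi[b] * phi[b]
    return num / den if den else 0.0

def simulate(n_subjects=400, seed=0):
    """Hypothetical data: y_i(t) = a_i * sin(pi * t) + noise, 3 to 6 visits each."""
    rng = random.Random(seed)
    subjects = []
    for _ in range(n_subjects):
        a = rng.gauss(0.0, 1.0)
        times = [rng.random() for _ in range(rng.randint(3, 6))]
        subjects.append([(t, a * math.sin(math.pi * t) + rng.gauss(0.0, 0.1))
                         for t in times])
    return subjects

if __name__ == "__main__":
    n_bins = 10
    data = simulate()
    mu = pooled_mean(data, n_bins)
    phi = leading_eigenvector(pooled_covariance(data, mu, n_bins))
    print([round(x, 2) for x in phi])
```

In this toy setup, subjects whose scores are unusually large in magnitude would be natural candidates for closer inspection. How the paper actually calibrates anomalies is not reproduced here.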
Peer Reviews
Decision·Submitted to ICLR 2026
* The pipeline described in the paper is an end-to-end procedure for analyzing a textual corpus with timestamps. The highlight of the algorithm is the interpretability of the anomalies found, which is very useful in real scenarios.
* The application of the pipeline is demonstrated in detail on two use cases. The many qualitative results are interesting to read.
* The descriptions of the algorithms in the many Alg blocks are clear and help readers understand the method.
* It is a complex system with many steps. The design of some important steps is only presented, not justified, so the paper reads somewhat like a product white paper. (1) Why mFPCA instead of uFPCA, and what is the benefit? (2) Is Alg. 2 a standard way of doing anomaly detection (if so, what is the citation), or a novel one (if so, why is it a good design)? (3) Why is clustering required at all, and how is the number of clusters determined?
* The evaluation is mostly qualitative, with very few baselines
The paper presents a novel integration of LLM-derived embeddings with statistical functional analysis via mFPCA. It introduces outlier detection, featuring a two-stage anomaly calibration procedure (Algorithms 2 and 3) that employs sample splitting and multiple-comparison control. The authors also provide empirical studies demonstrating the effectiveness of the proposed approach.
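For readers unfamiliar with how sample splitting and multiple-comparison control can fit together, the following is a generic sketch and not a reproduction of the paper's Algorithms 2 and 3. It assumes a swapped-in technique: split-conformal p-values computed against a held-out calibration set of anomaly scores, followed by the Benjamini-Hochberg step-up rule. The function names, the toy scores, and the level `alpha` are all hypothetical.

```python
def conformal_p_values(calibration_scores, test_scores):
    """P-value of each test score against held-out calibration scores.

    Larger scores are treated as more anomalous. The +1 terms keep the
    p-value strictly positive.
    """
    n = len(calibration_scores)
    return [(1 + sum(1 for c in calibration_scores if c >= s)) / (n + 1)
            for s in test_scores]

def benjamini_hochberg(p_values, alpha=0.1):
    """Indices flagged by the Benjamini-Hochberg step-up rule at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k_max = rank
    return sorted(order[:k_max])

if __name__ == "__main__":
    calibration = list(range(1, 100))   # hypothetical scores from the held-out split
    test = [5, 500, 30, 600]            # hypothetical scores to be screened
    p = conformal_p_values(calibration, test)
    print(p, benjamini_hochberg(p, alpha=0.1))
```

The split matters because the calibration scores must not be reused to fit whatever produced the test scores. Otherwise the p-values are no longer valid.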
The novelty of the work is not very clear, as the proposed method appears to combine elements of existing algorithms rather than introduce a fundamentally new approach. Moreover, the paper lacks strong theoretical justification. There are some guarantees in the appendix, but their scope seems limited. Given the absence of strong theoretical results, the empirical validation also feels limited. A broader set of simulations or comparisons would strengthen the paper’s overall contribution.
1. The combination of LLM embeddings and functional data analysis (mFPCA) is both creative and technically elegant. It bridges modern NLP with classical statistical modeling, addressing an underexplored setting: sparse, irregular longitudinal text.
2. The paper clearly defines the challenge of SL texts, distinguishing them from standard time series or dense sequences. The motivation for type-I-controlled anomaly detection in irregular textual data is compelling.
1. The empirical evaluation mainly compares against Isolation Forests (on BERT and GPT embeddings). This feels weak given the paper’s ambition.
2. The performance hinges on the quality of LLM-derived embeddings (e.g., emotion or toxicity scores). However, prompt variability or model drift could affect reproducibility.
3. mFPCA requires covariance estimation and eigen-decomposition, which may scale poorly with large numbers of subjects or long trajectories.
Taxonomy
Topics: Text and Document Classification Technologies · Authorship Attribution and Profiling · Anomaly Detection Techniques and Applications
