HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens
Ivan Karpukhin, Andrey Savchenko

TL;DR
This paper introduces history tokens for transformer-based event sequence classification. They target two weaknesses identified by the authors: transformers lack a single compact state that summarizes the sequence history, and contrastively pretrained embeddings fail to capture local context. The authors report better classification performance across several domains.
Contribution
The paper proposes history tokens, a novel method for accumulating sequence history in transformers, improving classification accuracy in event sequence tasks.
Findings
Significant performance improvements in finance, e-commerce, and healthcare tasks.
History tokens effectively capture sequence history and local context.
Transformers with history tokens outperform traditional models in sequence classification.
Abstract
Deep learning has achieved remarkable success in modeling sequential data, including event sequences, temporal point processes, and irregular time series. Recently, transformers have largely replaced recurrent networks in these tasks. However, transformers often underperform RNNs in classification tasks where the objective is to predict future targets. The reason behind this performance gap remains largely unexplored. In this paper, we identify a key limitation of transformers: the absence of a single state vector that provides a compact and effective representation of the entire sequence. Additionally, we show that contrastive pretraining of embedding vectors fails to capture local context, which is crucial for accurate prediction. To address these challenges, we introduce history tokens, a novel concept that facilitates the accumulation of historical information during next-token…
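To make the idea of accumulating prefix information concrete, here is a minimal sketch of how history tokens could be interleaved with events and how an attention mask could route information through them. It is an illustration under stated assumptions, not the authors' implementation: the function names `insert_history_tokens` and `build_attention_mask`, the placement convention, and the exact masking rule (a history token sees all earlier events; an event sees only the latest history token and the events after it) are all hypothetical, since the truncated abstract does not specify them.

```python
import numpy as np


def insert_history_tokens(n_events, positions):
    """Mark which slots of the extended sequence hold history tokens.

    positions: for each history token, the number of events that precede it
    (a hypothetical convention chosen for this sketch).
    """
    wanted = set(positions)
    is_history = []
    for i in range(n_events + 1):
        if i in wanted:
            is_history.append(True)   # history token placed after i events
        if i < n_events:
            is_history.append(False)  # regular event token
    return np.array(is_history, dtype=bool)


def build_attention_mask(is_history):
    """Boolean mask where mask[q, k] is True if query q may attend to key k.

    Assumed rule (not taken from the paper):
      * a history token attends to every earlier event and to itself;
      * an event attends to the most recent history token, if any,
        and to the events between that token and itself.
    """
    n = len(is_history)
    mask = np.zeros((n, n), dtype=bool)
    last_history = -1  # index of the most recent history token seen so far
    for q in range(n):
        if is_history[q]:
            mask[q, : q + 1] = ~is_history[: q + 1]  # all earlier events
            mask[q, q] = True                        # plus itself
            last_history = q
        else:
            start = max(last_history, 0)
            mask[q, start : q + 1] = True
    return mask


# Four events with one history token inserted after the second event.
is_history = insert_history_tokens(4, [2])
mask = build_attention_mask(is_history)
print(is_history)
print(mask.astype(int))
```

Under this assumed rule, events that follow a history token can reach the earlier part of the sequence only through that token, which is what would force it to act as a compact summary in the spirit of an RNN hidden state. A boolean mask with True meaning "may attend" matches the boolean `attn_mask` convention of PyTorch's `torch.nn.functional.scaled_dot_product_attention`.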
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
(1). The concept of history tokens with specialized attention patterns thoughtfully integrates recurrent principles into Transformer architectures. (2). The writing is generally clear, with the method explained step-by-step.
(1). The paper positions HT-Transformer as a significant departure from prior work, but the core idea of using special tokens to aggregate sequence information has been explored in existing works. The authors should more clearly differentiate their contribution. (2). The paper lacks explanation for why the Random strategy and Bias-End placement work better. It's better to provide analysis about why these strategies improve performance. (3). The paper uses gradient boosting on frozen embeddings
**Clear motivation and conceptual description:** The paper clearly explains the limitations of the existing transformers and presents the differences among the proposed HT-Transformer and previous variants, especially on the masking strategy. Detailed descriptions of the history token insertion strategies are provided. **Comprehensive experiment:** The paper evaluates across multiple domains and covers different methods (contrastive learning, NTP, etc.) on both network architectures (RNN, Trans
1. The pretraining part should have more details. Currently, Sec. 2 includes part of the training objectives and Sec. 3.1 introduces the masking and inserting strategies. However, it is still not clear what the optimization target is for this pretraining stage. This is important as the pretrained embeddings are directly used for classification. 2. The performance gain is marginal, considering that LongFormer also aims to reduce memory. The implementation details and ablation study actually weake
1. It proposes “history tokens” as accumulators of prefix information that mimic the RNN hidden-state mechanism. 2. Strong empirical evidence: The method is evaluated across multiple domains.
1. Lack of methodological contribution: This work proposes a "history token", which is very similar to the [SEP] and [EOS] tokens in BERT that can be used to summarize past information. 2. Token aggregation: For classification, this work does not compare against, or make clear why the history token is better than, the traditional [EOS] token or simple average pooling. It seems that if you use a history token at the end of the sequence (Fig 1.b), you are actually using the [EOS] token, which GPT already uses. 3. Metho
1. Novelty and Practicality: This paper introduces a simple yet effective idea called history tokens to flexibly incorporate historical context, addressing a key limitation in temporal sequence modeling. The concept of the history token is very interesting. 2. Well Explored Design Choices: The authors provide a systematic study of token placement and attention strategies with solid ablation experiments that enhance both interpretability and adaptability. 3. Strong Empirical Results: Experiments dem
While this paper presents valuable contributions, addressing the following points could further strengthen its impact and clarity: 1. Clarify the Relationship with CLS Tokens: It would be more comprehensive to include a detailed comparison between history tokens and the CLS token mechanism used in models like BERT, beyond the fact that history tokens appear multiple times in a sequence. Elaborating on how they differ functionally and architecturally will help clarify the history token's unique role.
Originality: the authors observe the effect of history embeddings/tokens and their success in NLU from the Recurrent Memory Transformer paper, and apply them to event sequences and (irregular) time series. This is a new adaptation. Quality: the authors conduct many experiments and variants of HT-Transformer on 5 datasets that support their 4 main claims. Clarity: overall decent; I can follow the structure of the paper fairly well. Significance: I think the history token or in general recurrent mem
Originality: it can be improved by thinking about what the unique aspects of time series/event sequences are from the adaptation perspective. It can also be improved by providing some theoretical analysis to explain why bringing in history tokens helps learn global characteristics that predict well on classification. Quality: it can be improved by extending experiments to a broader setting, including time series and adding regression, etc. Currently it is a little bit weak as the authors did mention time series/ ev
Taxonomy
Topics: Machine Learning in Healthcare · Time Series Analysis and Forecasting · Data Quality and Management
