The first step is the hardest: Pitfalls of Representing and Tokenizing Temporal Data for Large Language Models
Dimitris Spathis, Fahim Kawsar

TL;DR
This paper examines the challenges of representing and tokenizing temporal data in Large Language Models, highlighting issues with current tokenizers and proposing potential solutions like prompt tuning and multimodal adapters.
Contribution
It identifies the pitfalls of tokenizing temporal data in LLMs and discusses methods to improve their understanding of numerical and temporal information.
Findings
Popular LLMs tokenize temporal data incorrectly
Tokenization issues hinder understanding of temporal relationships
Proposed solutions include prompt tuning and multimodal adapters
Abstract
Large Language Models (LLMs) have demonstrated remarkable generalization across diverse tasks, leading individuals to increasingly use them as personal assistants and universal computing engines. Nevertheless, a notable obstacle emerges when feeding numerical/temporal data into these models, such as data sourced from wearables or electronic health records. LLMs employ tokenizers in their input that break down text into smaller units. However, tokenizers are not designed to represent numerical values and might struggle to understand repetitive patterns and context, treating consecutive values as separate tokens and disregarding their temporal relationships. Here, we discuss recent works that employ LLMs for human-centric tasks such as in mobile health sensing and present a case study showing that popular LLMs tokenize temporal data incorrectly. To address that, we highlight potential…
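The splitting behaviour the abstract describes can be illustrated with a toy example. The sketch below is not the tokenizer of any real LLM and does not reproduce the paper's case study. It is a minimal greedy longest-match tokenizer over a small vocabulary (`TOY_VOCAB`, a hypothetical set chosen for illustration), meant only to show how a subword vocabulary can split consecutive sensor readings inconsistently.

```python
# Toy illustration only: TOY_VOCAB is a hypothetical vocabulary, not taken
# from any real LLM tokenizer. Some numbers happen to be whole tokens,
# others are not, which mimics how learned subword vocabularies behave.
TOY_VOCAB = set("0123456789") | {".", ",", " ", "10", "98", "99", "100"}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization of `text` against `vocab`."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Character not in the vocabulary: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


# Four consecutive readings, e.g. heart-rate samples from a wearable.
for reading in ["98", "99", "100", "101"]:
    print(reading, "->", tokenize(reading))
# 98 -> ['98']
# 99 -> ['99']
# 100 -> ['100']
# 101 -> ['10', '1']
```

In this toy setting the neighbouring values 100 and 101 map to one and two tokens respectively, so the token sequence carries no hint that the readings are adjacent on the number line. Real tokenizers differ in their vocabularies, but the same kind of inconsistency is what the abstract refers to.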
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · AI in Service Interactions
