Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
Paul Quinlan, Jeremy Levasseur, Qingguo Li, Xiaodan Zhu

TL;DR
Chronicle is a 324M-parameter decoder-only transformer trained from scratch on both text and time series. It matches Gemma-3-270M-PT on language understanding, sets a new standard for time series classification, and outperforms supervised fusion baselines in multimodal forecasting.
Contribution
The first model jointly pretrained from scratch on text and time series with shared parameters, and evaluated against dedicated unimodal foundation models in both domains.
Findings
Matches Gemma-3-270M-PT on 19 NLU tasks
Sets a new standard for time series classification on 24 datasets
Outperforms supervised fusion baselines in multimodal forecasting
Abstract
Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability…
