Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks
Xiangkai Ma, Han Zhang, Wenzhong Li, Sanglu Lu

TL;DR
TimeArtist introduces a novel framework that aligns temporal data with visual concepts at the semantic level, enabling high-quality image generation from time series and improving zero-shot temporal task performance.
Contribution
It pioneers a semantic-level alignment method between time series and visual concepts using a dual-autoencoder and shared quantizer, facilitating cross-modal generation and analysis.
Findings
Achieves high-quality image generation from time series data.
Outperforms existing methods in zero-shot temporal tasks.
Establishes a new paradigm for cross-modal temporal-visual alignment.
Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert series into "pseudo-images" for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It introduces a "warmup-align" paradigm: first, a dual-autoencoder and a shared quantizer are trained in a self-supervised manner on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the…
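The shared-quantizer idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the codebook values, the two-dimensional latents, and the names `quantize` and `SHARED_CODEBOOK` are all assumptions chosen for illustration, and the encoders are replaced by hard-coded stand-in vectors. It only shows how latents from two modalities can be mapped into one discrete index space by nearest-neighbour lookup in a common codebook.

```python
# Hypothetical sketch of a quantizer shared by two modalities.
# All names and values are illustrative assumptions, not taken from the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent, codebook):
    """Return (index, code vector) of the codebook entry nearest to latent."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(latent, codebook[i]),
    )
    return index, codebook[index]


# One codebook used for both modalities (assumed values).
SHARED_CODEBOOK = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

# Stand-ins for the outputs of a time-series encoder and an image encoder.
series_latent = [0.9, 0.1]
image_latent = [0.8, -0.2]

series_index, _ = quantize(series_latent, SHARED_CODEBOOK)
image_index, _ = quantize(image_latent, SHARED_CODEBOOK)

# Both latents land on the same discrete code, so they share a token.
print(series_index, image_index)
```

Because both modalities index into the same codebook, a later alignment step can compare or map discrete codes directly instead of working with two unrelated latent spaces.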
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
