Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

Xiangkai Ma; Han Zhang; Wenzhong Li; Sanglu Lu

arXiv:2511.19856·cs.CV·November 26, 2025

TL;DR

TimeArtist introduces a novel framework that aligns temporal data with visual concepts at the semantic level, enabling high-quality image generation from time series and improving zero-shot temporal task performance.

Contribution

TimeArtist pioneers a semantic-level alignment method between time series and visual concepts, built on a dual-autoencoder and a shared quantizer, that supports cross-modal generation and analysis.

Findings

01. Achieves high-quality image generation from time series data.

02. Outperforms existing methods in zero-shot temporal tasks.

03. Establishes a new paradigm for cross-modal temporal-visual alignment.

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert time series into "pseudo-images" for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It introduces a "warmup-align" paradigm: first, a dual-autoencoder and a shared quantizer are trained in a self-supervised manner on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the…
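To make the "shared quantizer" idea in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each modality has its own encoder producing a latent vector of the same dimension, and that a single codebook maps both latents to discrete token indices by nearest-neighbor lookup. The names `toy_series_encoder`, `toy_image_encoder`, `nearest_code`, and the hand-picked `CODEBOOK` are all hypothetical; in a trained system the encoders and the codebook would be learned.

```python
# Illustrative sketch only: toy encoders and a hand-picked codebook stand in
# for the learned dual-autoencoder and shared quantizer described above.

def nearest_code(latent, codebook):
    """Return the index of the codebook entry closest to `latent` (squared L2)."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(latent, code))
        for code in codebook
    ]
    return distances.index(min(distances))


def toy_series_encoder(series, dim=2):
    """Hypothetical temporal encoder: average `dim` equal-length chunks."""
    chunk = len(series) // dim
    return [sum(series[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def toy_image_encoder(image, dim=2):
    """Hypothetical visual encoder: flatten rows, then pool like the series."""
    flat = [pixel for row in image for pixel in row]
    return toy_series_encoder(flat, dim)


# One codebook shared by both modalities (assumed values, not learned).
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

# A series that rises from low to high, and an image that is dark on top
# and bright on the bottom.
series = [0.1, 0.0, 0.2, 0.1, 0.9, 1.0, 0.8, 1.1]
image = [[0.0, 0.1], [0.9, 1.0]]

series_token = nearest_code(toy_series_encoder(series), CODEBOOK)
image_token = nearest_code(toy_image_encoder(image), CODEBOOK)
print(series_token, image_token)
```

In this toy setup the rising series and the dark-to-bright image land on the same codebook index, which is the sense in which a shared quantizer gives two modalities a common discrete vocabulary. The subsequent alignment stage, with frozen encoders and a learned projection, is not modeled here.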

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis