Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging

Nico Albert Disch; Yannick Kirchhoff; Robin Peretzke; Maximilian Rokuss; Saikat Roy; Constantin Ulrich; David Zimmerer; Klaus Maier-Hein

arXiv:2508.21580·cs.CV·September 1, 2025

TL;DR

This paper introduces Temporal Flow Matching (TFM), a novel generative method for modeling and predicting complex spatio-temporal trajectories in 4D longitudinal medical imaging, addressing limitations of existing approaches.

Contribution

The paper presents TFM, a unified generative trajectory model capable of handling 3D volumes, multiple scans, and irregular sampling, surpassing existing methods in 4D medical image prediction.

Findings

01. TFM outperforms existing spatio-temporal methods on three public datasets.

02. Establishes a new state-of-the-art in 4D medical image prediction.

03. Supports flexible sampling and multiple prior scans.

Abstract

Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports $3 D$ volumes, multiple prior scans, and irregular sampling.…

Tables

Table 1: Technical comparison of spatio-temporal prediction methods. Methods are grouped by origin (medical or natural imaging). TFM satisfies all requirements. *Difference Modeling indicates modeling changes from context instead of full images.

Category	Method	$3D$	Disease Agnostic	Multiple Contexts	Difference Modeling*
Medical Imaging	NODER [2]	✓	✓	✗	✗
	Image Flow [12]	✗	✓	✗	✓
	BrLP [16]	✓	✗	✗	✗
Natural Imaging	ConvLSTM [18]	✓	✓	✓	✗
	SimVP [6]	✓	✓	✓	✗
	ViViT [1]	✓	✓	✓	✗
Ours	TFM	✓	✓	✓	✓

Table 2: Quantitative evaluation on test sequences. Reported values are mean (standard deviation) over three runs. Metrics include normalized root $MSE$ ($NRMSE$), structural similarity index ($SSIM[\%]$) and peak signal-to-noise ratio ($PSNR$). *ViViT on Lumiere ran out of $40GB$ memory, despite having a smaller batch size and the lowest possible feature size.

Dataset	Model	NRMSE	SSIM [%]	PSNR
ACDC	LCI	0.056	93.3	28.49
	ConvLSTM	0.112 (0.005)	50.4 (1.5)	19.12 (0.31)
	SimVP	0.124 (0.001)	52.8 (1.6)	21.21 (0.13)
	ViViT	0.120 (0.008)	30.1 (6.9)	18.47 (0.53)
	TFM (ours)	0.040 (0.012)	94.5 (0.8)	30.51 (1.56)
ISLES	LCI	0.057	95.6	28.39
	ConvLSTM	0.182 (0.005)	40.8 (0.9)	17.85 (0.23)
	SimVP	0.124 (0.001)	52.8 (1.6)	21.21 (0.13)
	ViViT	0.162 (0.003)	32.5 (0.8)	18.84 (0.21)
	TFM (ours)	0.041 (0.007)	97.6 (0.8)	31.03 (1.08)
Lumiere*	LCI	0.085	89.3	21.55
	ConvLSTM	0.352 (0.009)	7.9 (4.2)	9.12 (0.22)
	SimVP	0.711 (0.028)	-2.5 (0.8)	2.98 (0.34)
	TFM (ours)	0.069 (0.007)	89.7 (1.2)	23.73 (0.82)

Table 3: Ablation results for TFM on ACDC. This table compares TFM under different design changes, showing the performance under each scenario. The ablations were done on an ACDC validation set. First, we evaluate the effect of using a more lightweight version of the UNet which does not use attention (’No Att’). Instead, $\tau$ and image embeddings are merged via concatenation in the bottleneck. Second, we compare aggregating via the mean and the last image; this choice only affects inference, as training is done the same way in both cases. Third, we compare sparsity filling with the alternative of using the image sequences $\mathcal{I}$ as they are given. This notably reduces performance. *Limiting the model to only see LCI during training and performing FM on this is unstable, which highlights the importance of temporal context.

Change	NRMSE	SSIM [%]	PSNR
Att UNet & Mean(6)	0.0261	96.04	32.30
No Att: Mean	0.0270	95.77	31.88
No Att: Last	0.0271	95.77	31.87
No Sparsity Filling	0.0444	90.92	27.30
LCI + FM*	0.1029	66.83	19.97
LCI	0.0380	93.50	29.49

Table 4: Evaluating $SSIM$ vs. number of function evaluations. We evaluate how the number of function evaluations (NFEs) affects $SSIM$ performance on one ACDC validation set. $SSIM$ increases with more evaluations and peaks at $25$ NFEs, after which it plateaus. However, the improvement becomes marginal beyond just $5$ NFEs.

NFEs	SSIM
1	0.956645
5	0.959890
10	0.959926
25	0.959954
50	0.959951
100	0.959926
150	0.959879
200	0.959877
300	0.959884
400	0.959920

Equations

$$\frac{d}{d\tau}\psi_{\tau}(x)=u_{\tau}(\psi_{\tau}(x)),\quad\text{with}\quad X_{1}=X_{0}+\int_{0}^{1}u_{\tau}(X_{\tau})\,d\tau, \tag{1}$$

$$v_{\theta}(X_{\tau},\tau)\approx u_{\tau}, \tag{2}$$

$$X_{\tau}=(1-\tau)X_{0}+\tau X_{1}, \tag{3}$$

$$L_{FM}=\mathbb{E}_{\tau\sim U(0,1),\,X_{0}\sim p}\,\lVert v_{\theta}(X_{\tau},\tau)-u_{\tau}\rVert. \tag{4}$$

$$X_{1}:=[\underbrace{I_{\text{target}},\dots,I_{\text{target}}}_{T}], \tag{5}$$

$$X_{0}=\mathcal{I}\quad\text{and}\quad X_{1}=[\underbrace{I_{\text{target}},\dots,I_{\text{target}}}_{T}] \tag{6}$$

$$\hat{X}_{1}=X_{0}+\int_{0}^{1}v_{\theta}(X_{\tau},\tau)\,d\tau. \tag{7}$$

$$L_{\text{baseline}}=\lVert f_{\theta}(\mathcal{I})-I_{\text{target}}\rVert. \tag{8}$$


Taxonomy

Topics: Machine Learning in Healthcare · Time Series Analysis and Forecasting · Generative Adversarial Networks and Image Synthesis

Full text

Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging

Nico Albert Disch 1,2,3

0000-0001-8791-622x

Yannick Kirchhoff 1,2,3

0000-0001-8124-8435

Robin Peretzke 1,5

0000-0002-6187-3636

Maximilian Rokuss 1,3

0009-0004-4560-0760

Saikat Roy 1,3

0000-0002-0809-6524

Constantin Ulrich 1,5

0000-0003-3002-8170

David Zimmerer 1,2

0000-0002-8865-2171

Klaus Maier-Hein 1,2,4,6

1 Division of Medical Image Computing, German Cancer Research Center, Heidelberg, Germany

2 HIDSS4Health - Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany

3 Faculty of Mathematics and Computer Science, University of Heidelberg, Heidelberg, Germany

4 Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany

5 Medical Faculty Heidelberg, University of Heidelberg, Heidelberg, Germany

6 Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital

[email protected]

0000-0002-6626-2463

Abstract

Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports $3D$ volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for $4D$ medical image prediction.

Code will be published at https://github.com/MIC-DKFZ/Temporal-Flow-Matching

1 Introduction

Longitudinal medical imaging is essential for tracking disease progression, monitoring treatment effects, and modeling anatomical development. When a patient undergoes imaging across multiple visits, whether for disease monitoring or post-treatment follow-ups, a longitudinal series is created. Moreover, there are multiple modalities which intrinsically contain temporal dimensions, such as ultrasound, Cine-MRI or perfusion CT. Despite the inherent temporal structure of such data, most current deep learning approaches analyze images as isolated time points, ignoring the valuable temporal dimension. Applications of longitudinal imaging span a wide range of clinical tasks, including neurodegenerative disease progression (e.g. Alzheimer’s disease [14]), cardiac motion analysis [3], and treatment response prediction in oncology [19, 4]. However, deep learning for spatio-temporal medical imaging remains underexplored compared to image analysis approaches that focus on single timepoints. Most existing approaches focus on classification and regression such as [24, 25]. Albeit valuable, these tasks do not fully represent fine-grained changes in the images. High-dimensional generative models offer richer insights, as they can model the evolution of structures like tumors over time rather than merely detecting changes. Generative models, such as diffusion models [11, 23, 15] and Neural ODEs [8, 12], have been applied to medical imaging, but they also predominantly operate on single time points, having only partially available context, restricting their applicability. Some approaches embed multiple time points [2], yet they still only encode single images independently. In contrast, jointly using multiple observations has been shown to enhance prediction accuracy [5]. Other approaches interpolate images between two time points [26], limiting their use for predictive purposes. 
Consequently, current techniques are either technically constrained, limiting their general application to longitudinal imaging, or rely on disease-specific priors.

Yet our experiments demonstrate that spatio-temporal methods from natural imaging cannot outperform a simple baseline: the Last Context Image (LCI), which uses the most recent image as the prediction. Table 1 summarizes a technical comparison of baselines from medical and natural imaging. Static Bias: Pixel-level scores are dominated by unchanged anatomy. Figure 1 shows the differences between consecutive frames. We note that changes are small and, in some cases, quite localized. Full dataset statistics can be found in Figure 7. For example, in the ACDC dataset [3] (see Figure 1), the temporal differences account for only $\sim 3\%$ of $NRMSE$.

Motivated by these observations, we introduce Temporal Flow Matching (TFM), a unified generative trajectory model that captures $3D$ temporal evolution across multiple scans, modeling only the changes. We term this mechanism Difference Modeling. Crucially, this modeling objective imposes no architectural or regularization constraints, since it is mathematically just a transformation of the output space (see Appendix A for further discussion). Therefore, TFM remains fully flexible and offers the following capabilities:

• Efficient Training: Offers end-to-end optimization within $11.3GB$ during training.

• 4D Time Series: Handles 3D volumetric time series of variable length and amount of context.

• Robust to Sparse and Irregular Sampling: Robust to irregular or missing follow-up scans.

• Disease and Modality Agnostic: Generalizes across heterogeneous applications, including cardiac function (Cine-MRI), stroke progression (perfusion CT) and glioblastoma growth (MRI).

Through extensive benchmarks on three public longitudinal and spatio-temporal datasets, TFM consistently outperforms prior spatio-temporal baselines, including LCI. To the best of our knowledge, this results in the first comprehensive benchmark of spatio-temporal prediction methods in medical imaging. With its strong performance and broad technical flexibility, TFM establishes a robust foundation and new baseline for future advances in $4D$ medical image analysis.

2 Methods

Longitudinal medical imaging requires handling of irregularly sampled time series, while capturing spatial and temporal dynamics. In Section 2.1, we formalize the problem of irregular medical imaging. We summarize Flow Matching (FM) in Section 2.2, and discuss challenges with integrating FM into image time series. We then introduce a novel extension of FM, namely TFM, in Section 2.3, designed to explicitly address these challenges. Finally, in Section 2.4, we address missing images by introducing a sparsity filling strategy, which is essential for maximizing the performance of TFM.

2.1 Problem Setup

Let us assume a dataset of $p$ spatio-temporal image sequences (i.e. one per patient). For each patient, we assume $T$ context images $\mathcal{I}=\{I_{1},\dots,I_{T}\}$ with $I_{i}\in\mathbb{R}^{H\times D\times W}$ acquired at ordered, and possibly irregular, time points $\mathcal{T}=\{t_{1},\dots,t_{T}\}$, with a target image $I_{\text{target}}$ at a time $t_{\text{target}}$. Due to irregular and sparse acquisitions, missing context images are set to $0$. For this task, we propose Temporal Flow Matching (TFM), a generative model that extends Flow Matching (FM) to predict future medical images from sparse and irregular historical observations.

2.2 Flow Matching (FM)

We adopt the notation of Flow Matching (FM) as introduced in [9]. FM learns a continuous transformation between a source sample $X_{0}\sim p$ and a target sample $X_{1}\sim q$ by modeling an optimal transport field $\psi$ parametrized by an Ordinary Differential Equation (ODE):

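$$\frac{d}{d\tau}\psi_{\tau}(x)=u_{\tau}(\psi_{\tau}(x)),\quad\text{with}\quad X_{1}=X_{0}+\int_{0}^{1}u_{\tau}(X_{\tau})\,d\tau, \tag{1}$$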

where $\psi_{\tau}(X_{0})=X_{\tau}$ defines the trajectory at interpolation step $\tau\in[0,1]$. Since equation (1) is an ODE, we can fix the initial condition as $X_{0}$. The vector field $u_{\tau}$ denotes the velocity of the transport field $\psi$ at position $\psi_{\tau}(X_{0})$. To avoid ambiguity with real-valued medical timepoints, we refer to $\tau$ as the FM step, rather than the time $t$. A neural network $v_{\theta}$ is trained to predict the true velocity field:

$$v_{\theta}(X_{\tau},\tau)\approx u_{\tau}, \tag{2}$$

where $X_{\tau}$ is the intermediate state at step $\tau$ , and $\theta$ are the network parameters. As only $X_{0}$ and $X_{1}$ are observed, we define $X_{\tau}$ using a known transport map $\psi$ . Typically, a linear interpolation is used:

$$X_{\tau}=(1-\tau)X_{0}+\tau X_{1}, \tag{3}$$

while other choices are possible. The FM training objective then minimizes the discrepancy between the predicted velocity and true velocity:

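$$L_{FM}=\mathbb{E}_{\tau\sim U(0,1),\,X_{0}\sim p}\,\lVert v_{\theta}(X_{\tau},\tau)-u_{\tau}\rVert. \tag{4}$$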

Unlike diffusion models, which rely on iterative denoising guided by learned score functions, FM learns a direct mapping via velocity fields. Under certain conditions, FM can be shown to be equivalent to diffusion models [10].
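The quantities used by the FM objective can be illustrated with a minimal sketch in plain Python. It assumes flattened samples stored as lists, a squared-error discrepancy (the exact norm is an assumption), and a hypothetical stand-in `zero_velocity` in place of the trained network $v_{\theta}$; none of these names come from the paper.

```python
import random

def interpolate(x0, x1, tau):
    """Linear path X_tau = (1 - tau) * X_0 + tau * X_1."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """For the linear path, d/dtau X_tau = X_1 - X_0, independent of tau."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(v_theta, x0, x1, tau):
    """Mean squared discrepancy for one sample (choice of norm is an assumption)."""
    x_tau = interpolate(x0, x1, tau)
    predicted = v_theta(x_tau, tau)
    true = target_velocity(x0, x1)
    return sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true)

def zero_velocity(x_tau, tau):
    """Hypothetical stand-in for the network: always predicts zero velocity."""
    return [0.0] * len(x_tau)

x0 = [0.0, 1.0, 2.0]  # flattened source sample
x1 = [1.0, 1.0, 4.0]  # flattened target sample
tau = random.Random(0).uniform(0.0, 1.0)  # FM step tau ~ U(0, 1)
loss = fm_loss(zero_velocity, x0, x1, tau)
```

Because the true velocity of the linear path does not depend on $\tau$, the loss of the zero predictor is the same for every sampled step; a trained network would replace `zero_velocity`.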

2.3 Temporal Flow Matching (TFM)

Medical image follow-ups are often irregular, both in terms of temporal spacing and the number of available context images. This poses a challenge for standard generative models, such as Flow Matching (FM) or diffusion, which model a transformation between two distributions $q$ and $p$. Therefore, FM cannot be directly applied when the input and target sequences differ in dimensionality. There are two canonical strategies to address this: (i) Temporal Pooling: compress $\mathcal{I}$ via a spatio-temporal encoder, or predict the flow only from the last available image; (ii) Dimension Padding: extend the dimensionality of the target and the context to a fixed context sequence length. We adopt Dimension Padding, since we find that only using flows from the last image is not stable. This strategy is in part inspired by [6], which also lifts all predictions to the same temporal dimension as a fixed input dimension. With this, we propose Temporal Flow Matching (TFM), a generative model that directly learns transformations from each context image to the target image within a unified spatio-temporal flow formulation. Unlike approaches that compress temporal information early, or operate on latent representations, TFM retains full spatial resolution. This is feasible because TFM has a computational footprint comparable to other spatio-temporal methods, so it can afford the compute needed for full-resolution modeling. By jointly processing the entire input sequence, the model can leverage spatio-temporal dependencies between input images. To enable this, we define the FM target as

$$X_{1}:=[\underbrace{I_{\text{target}},\dots,I_{\text{target}}}_{T}], \tag{5}$$

where $X_{1}\in\mathbb{R}^{T\times D\times H\times W}$, with $T$ being the number of context images. The FM initial conditions for equation (1) then read:

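$$X_{0}=\mathcal{I}\quad\text{and}\quad X_{1}=[\underbrace{I_{\text{target}},\dots,I_{\text{target}}}_{T}] \tag{6}$$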

where $\mathcal{I}$ is the series of input images. The transport map is then $\psi_{\tau}:\mathbb{R}^{T\times D\times H\times W}\to\mathbb{R}^{T\times D\times H\times W}$. Training and inference are described in Algorithm 1. We then calculate $X_{\tau}=\psi_{\tau}(X_{0})$ and $u_{\tau}(x)=\frac{d}{d\tau}\psi_{\tau}(x)$ using (3). The neural network $v_{\theta}$ then predicts a velocity $\hat{u}_{\tau}$ (2), and is trained via (4). Inference is then done using eq. (1), i.e.

$$\hat{X}_{1}=X_{0}+\int_{0}^{1}v_{\theta}(X_{\tau},\tau)\,d\tau. \tag{7}$$

In practice, equation (7) can only be solved numerically. This requires choosing an ODE solver (e.g. Euler or Runge-Kutta) and the number of integration steps (and optionally solver hyperparameters). Since $\hat{X}_{1}\in\mathbb{R}^{T\times D\times H\times W}$ , we need to reduce the temporal dimension. For the final temporal reduction, we use either the last predicted time channel or the mean across time.
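As a rough illustration of numerical inference, the sketch below integrates a velocity field with forward Euler and then collapses the temporal axis. It is not the authors' implementation: frames are flattened lists, and `oracle_velocity` is a hypothetical velocity function that returns the exact per-frame difference to the target, standing in for a trained $v_{\theta}$.

```python
def euler_integrate(v_theta, x0, n_steps):
    """Approximate X_1 = X_0 + int_0^1 v(X_tau, tau) dtau with forward Euler."""
    x = [list(frame) for frame in x0]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * dt
        v = v_theta(x, tau)  # one function evaluation per step
        x = [[xi + dt * vi for xi, vi in zip(fx, fv)] for fx, fv in zip(x, v)]
    return x

def reduce_time(x1_hat, mode="last"):
    """Collapse the T axis: take the last frame or the mean over frames."""
    if mode == "last":
        return list(x1_hat[-1])
    if mode == "mean":
        n_frames = len(x1_hat)
        return [sum(column) / n_frames for column in zip(*x1_hat)]
    raise ValueError(f"unknown mode: {mode}")

# Toy data: T = 3 context frames with 2 voxels each, and one target frame.
context = [[0.0, 0.0], [1.0, 2.0], [2.0, 2.0]]
target = [2.0, 4.0]

def oracle_velocity(x, tau):
    """Hypothetical velocity: constant difference between target and each context frame."""
    return [[t - c for t, c in zip(target, frame)] for frame in context]

x1_hat = euler_integrate(oracle_velocity, context, n_steps=10)
prediction_mean = reduce_time(x1_hat, mode="mean")
prediction_last = reduce_time(x1_hat, mode="last")
```

With a constant velocity, Euler integration recovers the target up to floating-point error for any number of steps, so both reductions agree here; with a learned, state-dependent velocity the step count and the reduction mode can matter.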

Difference Modeling

Rather than modeling the whole spatio-temporal image distribution, our method predicts the velocity field, meaning the differences between context and target. Standard Flow Matching transforms between two distributions $p$ and $q$, but here both stem from the same patient at different timepoints. Consequently, the velocity is the difference between $\mathcal{I}^{\prime}$ and $I_{\text{target}}$. Hence, we call this mechanism Difference Modeling, since $v_{\theta}$ models this difference. See Appendix C for a toy example illustrating how this modeling can influence evaluation metrics.

2.4 Handling Missing Data: Sparsity Filling

Irregular sampling in longitudinal data creates ’holes’ in the time axis (i.e., missing images for certain time points), which can distort the estimated flow between $I_{i}$ and $I_{\text{target}}$. This reflects the same issue discussed in the motivation, but now arising from missing context frames. To address missing context images, we apply sparsity filling, replacing them with the most recent available scan (see Fig. 2 for visualization). This ensures smoother inputs and more stable flow estimation across masked inputs. If missing frames occur before the first available scan, we fill them using the earliest available image. We denote the filled context sequence by $\mathcal{I}^{\prime}$. We hypothesize this helps because each filled image in $\mathcal{I}^{\prime}$ is closer to $I_{\text{target}}$ than an empty or zero-filled image, resulting in more homogeneous flow fields. In our ablation studies, sparsity filling was essential; omitting it leads to unstable training and degraded convergence.
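The filling rule described above can be sketched in a few lines. This is an illustrative version only: it assumes missing frames are marked with `None` (the paper stores them as zeros), and the function name `sparsity_fill` is hypothetical.

```python
def sparsity_fill(frames):
    """Fill missing frames (None) with the most recent available frame.

    Frames missing before the first available scan are filled with the
    earliest available frame. Returns a new list; the input is not modified.
    """
    available = [f for f in frames if f is not None]
    if not available:
        raise ValueError("at least one context frame must be available")
    filled = []
    last = available[0]  # used for gaps before the first scan
    for frame in frames:
        if frame is not None:
            last = frame
        filled.append(last)
    return filled

# Scans "A" and "B" were acquired; the other time points are missing.
sequence = [None, "A", None, None, "B", None]
filled_sequence = sparsity_fill(sequence)
```

In this example the leading gap is back-filled with "A", and every later gap is forward-filled with the closest earlier scan.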

3 Data and Experimental Design

We compare TFM to methods that jointly model spatial and temporal information across multiple time points. SimVP [6]: This method originates from the natural image domain and simply uses all context images as input to its network. The original architecture consists of a 2D UNet, which we extend here to 3D. The temporal information is handled by flattening the time dimension into the channel dimension. ConvLSTM [18]: It extracts spatial features using convolutions while capturing temporal dependencies through an LSTM’s recurrent states [7]. At each time step, the model processes an input image using convolutional layers and updates its internal memory, which maintains information across the sequence. ViViT [1]: The Video Vision Transformer (ViViT) first splits all input context images into image patches, where the patch size is $8\times 32\times 32$. We use the ViViT as in [22], for a fair comparison of the pure spatio-temporal backbone.

Baseline Training

The baseline methods directly predict the target given the context sequence $\mathcal{I}$ . So the loss for those methods reads:

$$L_{\text{baseline}}=\lVert f_{\theta}(\mathcal{I})-I_{\text{target}}\rVert. \tag{8}$$

Further definitions of architecture, model and method are found in Appendix A.

Last Context Image

Furthermore, we use the Last Context Image (LCI), a heuristic that serves as an estimated lower bound. LCI is defined as the last image in the sequence which is non-zero. This baseline is medically motivated, as it serves as a part of medical decision making when looking at longitudinal series (see [20]). While LCI is optimal for monotonic sequences, it might not be the best performing image from the context sequence. However, selecting the best image from the sequence would require further insight or an oracle model. So LCI is the best we can naively do for most tasks. Yet for the datasets we consider, LCI is in fact the best.
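A minimal sketch of the LCI heuristic, assuming each frame is a flattened list of voxel values and that missing frames are all zeros; the function name is hypothetical.

```python
def last_context_image(frames):
    """Return the last frame in the sequence that has any non-zero voxel."""
    for frame in reversed(frames):
        if any(value != 0 for value in frame):
            return frame
    raise ValueError("no non-zero context frame found")

# The second and fourth time points are missing (all zeros).
context = [[1, 2], [0, 0], [3, 4], [0, 0]]
lci_prediction = last_context_image(context)
```

The prediction is simply the frame `[3, 4]`; no parameters are learned, which is why LCI is a cheap reference point for any learned predictor.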

3.1 Datasets

ACDC [3] is a cardiac MRI dataset for different states of the heart. Images are reshaped to $[T,H,D,W]=[11,32,128,128]$ , where the target is a single image having the same spatial dimensions. For the ACDC dataset, we randomly mask out time points, in order to make it irregular. We split the dataset into $80$ training, $20$ validation and $50$ test images. This dataset was used for method development and ablations were done on the validation set.

ISLES [17] consists of perfusion CT images from stroke patients. For our experiments, we utilize this 4D modality. Since there are dozens of time steps with minor changes in the image, we further process the image. For that, we only take every other time step of the perfusion sequence. From the resulting series, we randomly pick 4 consecutive points, where the last point is the target, and we randomly mask context images. The context then has shape $[T,H,D,W]=[7,16,128,128]$ . This dataset is split into $92$ training, $23$ validation and $34$ test images.

Lumiere [19] is a longitudinal dataset of tumor growth in gliomas, consisting of 3D MRI scans. The images are reshaped to $[T,H,D,W]=[7,96,96,64]$ . Since not all patients have many acquisitions, we prepended zeros to ensure pre-processing is consistent. For Lumiere we have $48$ training, $12$ validation and $14$ test images. Example images from two timepoints for each dataset are shown in Figure 1.

3.2 Experimental settings

All methods (see Appendix A for notation) were trained with AdamW and a cosine-annealed learning-rate schedule, using a batch size of 4. The learning rate was fixed at $1\mathrm{e}{-4}$ for all experiments. For TFM, we used 10 integration steps during inference (see Table 4). Our TFM builds on the standard UNet from the TorchCFM library [21], using cross-attention between time embeddings and spatial feature maps (see Figure 2 for an overview). To ensure fair comparison, we ran each experiment three times with different validation splits and the same random seed within each split.

Random Masking

For ACDC and ISLES, we randomly omit context images during both training and validation to simulate irregular sampling. Since we believe we are the first to benchmark methods in this very specific irregular setting, we highlight a potentially grave pitfall: if validation masks are resampled at each validation epoch, even with a fixed seed, the evaluation metrics change every time, which is exacerbated by our small validation set. This variability affects even the trivial LCI baseline and makes "best" epoch selection arbitrary. Since the validation set is small, context sequences can be extremely sparse or dense, causing the LCI baseline’s performance to fluctuate drastically. (In natural imaging, validation sets are much larger, so random fluctuations are less severe. In medical imaging, however, smaller validation sets make these fluctuations significant.) To avoid this issue, we generate one fixed set of masks per split (using a single seed) and reuse those exact masks for every model at every validation epoch. This ensures consistent validation conditions, meaningful epoch selection, and fully reproducible and interpretable validation results. In all cases, models were selected by the lowest validation $MSE$ and then evaluated on the held-out test set.
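One way to realize fixed validation masks is to draw them once from a seeded generator and store them. The sketch below is an assumption-laden illustration: the keep probability of 0.5, the rule of always keeping at least one frame, and all function names are hypothetical; only the sizes (20 validation cases, 11 time points, as for ACDC) come from the text.

```python
import random

def make_fixed_masks(n_cases, n_context, keep_prob, seed):
    """Draw one keep/drop mask per validation case, once, from a seeded RNG."""
    rng = random.Random(seed)
    masks = []
    for _ in range(n_cases):
        mask = [rng.random() < keep_prob for _ in range(n_context)]
        if not any(mask):
            # Assumption: keep at least one frame so that LCI stays defined.
            mask[rng.randrange(n_context)] = True
        masks.append(mask)
    return masks

def apply_mask(frames, mask):
    """Zero out dropped frames, following the zero convention for missing scans."""
    return [f if keep else [0.0] * len(f) for f, keep in zip(frames, mask)]

# Generated once per split and reused for every model and every validation epoch.
VAL_MASKS = make_fixed_masks(n_cases=20, n_context=11, keep_prob=0.5, seed=0)
```

Storing `VAL_MASKS` rather than resampling inside the validation loop keeps the LCI score identical across epochs, so changes in the validation metric reflect the model and not the masks.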

4 Results and Discussion

TFM outperforms LCI across all datasets and metrics, as shown in Table 2. It achieves top performance on every metric and dataset tested. Competing methods struggle to generate realistic images, often scoring below the LCI baseline. We stress that pixel-wise metrics such as $MSE$ uniformly penalize any change, so unchanged anatomy dominates the score and can mask fine spatio-temporal predictions (see e.g. Figure 7). By modeling the difference directly, we argue that TFM sidesteps this bias and more faithfully captures true temporal evolution. Future work should consider modeling differences directly, computing metrics only on regions of interest with substantial change, or adopting task-specific metrics that align more with clinical motivations [13]. Lumiere is a longitudinal tumor growth dataset, characterized by sparse sequences and a small number of training cases. The strong differences in patient-specific trajectories and its data scarcity make Lumiere particularly difficult; most methods fail to come close to LCI, except for TFM. We attribute the poorer performance of other methods to the small training set size and high inter-subject variability. This leads to negative $SSIM$ for the SimVP method, and qualitatively noisy results. Despite these challenges, TFM outperforms LCI on Lumiere in both $MSE$ and $PSNR$. This supports our hypothesis that TFM benefits from Difference Modeling, which makes it more robust and especially well-suited for real-world scenarios. Figure 4 illustrates that TFM generates realistic images. Further qualitative results can be found in the appendix. Thanks to our use of Runge-Kutta integration, memory savings are non-trivial; memory usage can be further reduced by switching to Euler integration and detaching tensors after every step, or by aiming for single-step predictions. This could significantly reduce memory usage, if needed.
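The effect of unchanged anatomy on pixel-wise scores can be made concrete with a toy calculation. The sketch assumes a one-dimensional "image" of 100 voxels in which only 4 voxels change between scans; the data and function names are illustrative and not taken from the paper.

```python
def mse(a, b):
    """Mean squared error over all voxels."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def masked_mse(a, b, mask):
    """Mean squared error restricted to voxels where mask is True."""
    pairs = [(x, y) for x, y, m in zip(a, b, mask) if m]
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)

# Toy 1D "image": 100 voxels, only the last 4 change between scans.
last_context = [0.5] * 100
target = [0.5] * 96 + [1.0] * 4
changed = [lc != t for lc, t in zip(last_context, target)]

# Score the LCI prediction (the unchanged last scan) in two ways.
full_error = mse(last_context, target)                 # 0.01 over all voxels
roi_error = masked_mse(last_context, target, changed)  # 0.25 on changed voxels
```

Although LCI misses the entire change, its full-image error is 25 times smaller than its error on the changed region, which is the static bias that region-restricted metrics would expose.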

Insights on design decisions: Table 3 summarizes the impact of design choices on ACDC, including an alternative lightweight architecture that replaces the attention mechanism with concatenation of time embeddings in the bottleneck, no sparsity filling, and two aggregation methods. We see that switching to a lighter-weight architecture had only a minor effect on performance. Future work may explore alternative architectures for the flow network (2), either to improve efficiency or further boost performance. Yet this choice shows the flexibility of TFM. We observe no significant difference between aggregating by the mean or using only the final predicted frame. Since the model is trained on full flows in both settings, this suggests it learns to predict the target from any context frame. Crucially, our sparsity filling strategy significantly improves TFM’s performance. We attribute this to the fact that filled frames are closer to the target image than zero tensors, resulting in more learnable and stable flow velocities. This reinforces our core design principle: TFM focuses on temporal changes, not the entire image distribution.

An important additional finding is that the LCI + FM baseline (using TFM’s UNet but a single time channel) performs poorly. This occurs even though TFM can handle single context inputs (see Figure 3). Intuitively, both methods should yield similar results at inference, since they receive the same input. We believe this discrepancy stems from the training dynamics: in the LCI + FM setting, the model learns the flow over uneven time intervals. These inconsistent intervals introduce high variance and instabilities in the flows. In contrast, TFM’s end-to-end training on the randomly masked sequence yields more information on the temporal spacing, making the predictions robust even in the single-input case. We suspect that this consistency yields a more stable training regime and enables reliable performance even when reduced to a single input at test time.

4.1 Future Directions and Limitations

Building on TFM’s core strengths (full-resolution flow, end-to-end training, and robust handling of irregular sampling), numerous promising avenues emerge. None of these expose a fundamental limitation of our method; instead, they capitalize on its flexibility:

• Advanced Sparsity Filling: We demonstrated that even a simple nearest-image fill substantially stabilized training (see Table 3). Future work can explore more sophisticated schemes, such as learned imputation networks or temporal interpolation priors, which could seamlessly plug into TFM.

• Explicit Continuous Time Modeling: While our current setting treats $\tau$ as abstract interpolation scalars, this can be naturally extended to model continuous, real-valued time steps. This would allow for more flexible predictions, ideal for clinical workflows.

• Stochastic Generation: It is technically simple to include stochastic sampling in TFM. This would allow TFM to sample multiple possible futures, which could again be valuable in the clinical context for risk assessment and planning. On a technical level, this only requires extending the ODE to SDE integration and adding noise during training (see step (6)1).

Limitations

Several challenges remain, which stem from the realities of longitudinal imaging. First, truly large-scale, high quality follow-up cohorts remain rare for many diseases. Acquiring more multi-timepoint studies can be costly, yet such data are essential for validating disease trajectory models. Second, our current approach of globally sampling to a set resolution might not be optimal. A more localized, patch based strategy could capture finer details better but remains challenging for generative modeling. Finally, conventional baseline methods struggle when data are scarce, a prominent issue in our setting. To overcome this limitation, future work might need to leverage large-scale pretrained models, and fine tune them on specific 4D prediction tasks, although such pretrained resources are not yet readily available. Despite these constraints, TFM maintains remarkably stable performance. We hope that this contribution will encourage the acquisition of larger longitudinal datasets and inspire further clinical studies.

5 Conclusion

In this paper, we address the challenge of modeling longitudinal medical imaging with sparse and irregular time series by introducing Temporal Flow Matching (TFM), a state-of-the-art generative approach for 4D medical image prediction. In our datasets, and often in clinical practice, temporal changes constitute only a small portion of the total image content, relative to inter-patient differences. TFM leverages this insight by explicitly modeling differences between context and target, an approach which we call Difference Modeling. Through extensive experiments on publicly available datasets, we demonstrate that TFM consistently outperforms prior methods, establishing a new baseline for disease progression modeling. While we gratefully acknowledge the public datasets which support this work, advancing spatio-temporal prediction will demand larger cohorts with detailed records of confounding factors, such as surgeries and treatment changes.

Acknowledgements

The present contribution is supported by the Helmholtz Association under the joint research school HIDSS4Health – Helmholtz Information and Data Science School for Health.

Appendix A Method vs Model

In previous sections we referred to TFM as a baseline method. To disambiguate the terminology, we define two terms. Architecture denotes the neural network itself, that is, the function that maps the network's input to its output. This may seem redundant, but it is important for the comparison. Model denotes the choice of input space and, especially, output space. The clearest example is diffusion versus flow matching: both can be implemented with the same network $f_{\theta}$, but in diffusion the input is a noisy sample and the output is the noise, whereas in flow matching the network receives $\{X_{\tau},\tau\}$ and predicts $u_{\tau}$. We say that the diffusion network models the noise and the flow matching network models the velocity.
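The distinction between architecture and model can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: `f_theta` is a hypothetical stand-in for a shared network, the noise schedule is an arbitrary toy choice, and the straight-line path used to build $X_{\tau}$ and $u_{\tau}$ is an assumption made here for illustration.

```python
import numpy as np

def f_theta(x, t):
    """Stand-in for one shared architecture (hypothetical; a real network is learned)."""
    return np.zeros_like(x)

def diffusion_example(x_data, t, rng):
    """Diffusion-style model: the input is a noisy sample, the target is the noise."""
    eps = rng.standard_normal(x_data.shape)
    # Assumed toy noise schedule, chosen only for illustration.
    x_noisy = np.sqrt(1.0 - t) * x_data + np.sqrt(t) * eps
    return x_noisy, eps

def flow_matching_example(x0, x1, tau):
    """Flow-matching model: the input is (X_tau, tau), the target is the velocity u_tau."""
    # Assumed straight-line path between the two endpoints.
    x_tau = (1.0 - tau) * x0 + tau * x1
    u_tau = x1 - x0
    return x_tau, u_tau

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
x1 = rng.standard_normal((4, 4))

x_noisy, eps = diffusion_example(x1, t=0.3, rng=rng)
x_tau, u_tau = flow_matching_example(x0, x1, tau=0.3)

# Same architecture, two different regression targets.
diffusion_loss = mse(f_theta(x_noisy, 0.3), eps)
flow_loss = mse(f_theta(x_tau, 0.3), u_tau)
```

In both cases the call signature of `f_theta` is identical; only the construction of the input and the regression target changes, which is what the term Model refers to above.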

Appendix B Datasets

In Table 4 we ablate the number of function evaluations (NFE). We note that performance drops significantly for a single NFE. As a trade-off, we chose 10 integration steps for all datasets.
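To make the role of the NFE concrete, the following sketch integrates a velocity field from $\tau=0$ to $\tau=1$ with a fixed number of steps. It assumes a plain explicit Euler solver, which is an assumption made here and not a statement about the solver used in the experiments; `toy_velocity` is a hypothetical field with a known exact solution, used only to show how the discretization error depends on the step count.

```python
import numpy as np

def euler_integrate(velocity, x_start, nfe):
    """Integrate dx/dtau = velocity(x, tau) over [0, 1] with `nfe` Euler steps."""
    x = np.asarray(x_start, dtype=float).copy()
    dt = 1.0 / nfe
    for k in range(nfe):
        # One function evaluation per step.
        x = x + dt * velocity(x, k * dt)
    return x

def toy_velocity(x, tau):
    """Hypothetical velocity field with exact solution x(1) = x(0) * exp(-1)."""
    return -x

x_start = np.ones(3)
exact = x_start * np.exp(-1.0)

# Maximum absolute error of the Euler solution for several step counts.
errors = {
    n: float(np.max(np.abs(euler_integrate(toy_velocity, x_start, n) - exact)))
    for n in (1, 10, 100)
}
```

For this toy field the error shrinks as the number of steps grows, while the cost grows linearly with the NFE, which is the trade-off referred to above.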

Appendix C Toy Example - Difference Modeling

To clarify why modeling differences can be advantageous in low-change environments (even when full image reconstruction is limited), we present a simplified toy example illustrating a resolution-performance paradox. To show how limited resolution and bounded changes can impact error metrics, consider the following setting: Assume an $8\times 8$ checkerboard pattern where each pixel alternates between $0$ and $1$ . In the center, a $4\times 4$ patch undergoes a change, specifically within a $2\times 2$ region. For simplicity, let $I_{0}$ contain two black squares in the center, and $I_{1}$ contain three. Suppose a perfect longitudinal model captures the central change but operates at a coarse resolution of $2\times 2$ . Since it cannot represent the high-frequency checkerboard pattern, its best prediction is a uniform value of $0.5$ across the image. This yields a total MSE of $12/64$ , but an LCI MSE of only $4/64$ . Now consider the difference image $I_{1}-I_{0}$ , which contains a single $2\times 2$ black square and zeros elsewhere. The same low-resolution model can now perfectly represent this difference, achieving an MSE of $0$ , despite lacking the resolution to represent the full images individually. This illustrates how Difference Modeling can resolve the apparent paradox where a model with limited spatial capacity still performs well under certain metrics. While this does not fully explain the behavior of TFM, it provides intuition for how modeling the difference yields strong starting conditions, and how methods can benefit from this formulation.
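The intuition can be checked numerically with a simplified variant of this setting. The sketch below is an illustration under its own assumptions and does not reproduce the exact numbers quoted above: the change is modeled as an additive, block-constant offset of $-0.5$ on one $2\times 2$ block inside the central $4\times 4$ patch, and the low-resolution model is emulated by averaging over $2\times 2$ blocks (the helper `low_res` is hypothetical).

```python
import numpy as np

def low_res(img, block=2):
    """Emulate a coarse model: average each block, then upsample back."""
    h, w = img.shape
    pooled = img.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return np.kron(pooled, np.ones((block, block)))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# 8x8 checkerboard with pixel values alternating between 0 and 1.
i0 = np.indices((8, 8)).sum(axis=0) % 2.0

# Assumed change: a block-constant offset on one 2x2 block
# inside the central 4x4 patch (rows/cols 2..5).
delta = np.zeros((8, 8))
delta[2:4, 2:4] = -0.5
i1 = i0 + delta

# (a) Predict the full target image at low resolution.
full_image_mse = mse(low_res(i1), i1)

# (b) Last context image (LCI): predict that nothing changes.
lci_mse = mse(i0, i1)

# (c) Difference modeling: predict the low-resolution change, add it to the context.
difference_mse = mse(i0 + low_res(i1 - i0), i1)
```

In this variant the low-resolution full-image prediction loses the checkerboard entirely and is worse than LCI, whereas the low-resolution difference prediction recovers the target exactly, since the change itself contains no high-frequency content.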

Appendix D Further Experimental Settings

Models were trained for $500$ epochs. All methods are trained with the AdamW optimizer, a cosine annealing learning rate schedule, and a batch size of 4. We additionally used a warm-up scheduler for $10\%$ of the total epochs and gradient clipping of magnitude $1$ .
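The schedule described above can be sketched without any framework. The snippet is a minimal illustration under stated assumptions: the base learning rate `base_lr` is a placeholder value not given in the text, the warm-up is assumed to be linear, and the gradient clipping is assumed to act on the global gradient norm rather than on individual values.

```python
import math

def lr_at_epoch(epoch, total_epochs=500, base_lr=1e-4, warmup_frac=0.10):
    """Warm-up for the first 10% of epochs, then cosine annealing to zero."""
    warmup_epochs = round(total_epochs * warmup_frac)
    if epoch < warmup_epochs:
        # Assumed linear warm-up towards base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

schedule = [lr_at_epoch(e) for e in range(500)]
clipped = clip_by_global_norm([3.0, 4.0])
```

Here `schedule` rises during the first 50 epochs and decays afterwards, and `clipped` has unit norm because the input norm of 5 exceeds the threshold of 1.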

TFM network details. For the experiments we used a feature size of $32$ and a channel multiplier per layer of $(1,1,2,4)$ , with one residual block per layer. The attention resolution was set to $16$ . For everything else, the default parameters of the UNet from [21] were used.

Bibliography


[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A Video Vision Transformer, Nov. 2021.
[2] Hao Bai and Yi Hong. NODER: Image Sequence Regression Based on Neural Ordinary Differential Equations, July 2024.
[3] Olivier Bernard, Alain Lalande, Clement Zotti, Frederick Cervenansky, Xin Yang, Pheng-Ann Heng, Irem Cetin, Karim Lekadir, Oscar Camara, Miguel Angel Gonzalez Ballester, Gerard Sanroma, Sandy Napel, Steffen Petersen, Georgios Tziritas, Elias Grinias, Mahendra Khened, Varghese Alex Kollerathu, Ganapathy Krishnamurthi, Marc-Michel Rohé, Xavier Pennec, Maxime Sermesant, Fabian Isensee, Paul Jäger, Klaus H. Maier-Hein, Peter M. Full, Ivo Wolf, Sandy Engelhardt, Christian F. Baumgartner, Lisa
[4] Evan Calabrese, Javier E. Villanueva-Meyer, Jeffrey D. Rudie, Andreas M. Rauschecker, Ujjwal Baid, Spyridon Bakas, Soonmee Cha, John T. Mongan, and Christopher P. Hess. The University of California San Francisco Preoperative Diffuse Glioma MRI Dataset. Radiology. Artificial Intelligence, 4(6):e220058, Nov. 2022.
[5] Cong Fang, Song Bai, Qianlan Chen, Yu Zhou, Liming Xia, Lixin Qin, Shi Gong, Xudong Xie, Chunhua Zhou, Dandan Tu, Changzheng Zhang, Xiaowu Liu, Weiwei Chen, Xiang Bai, and Philip H. S. Torr. Deep learning for predicting COVID-19 malignant progression. Medical Image Analysis, 72:102096, Aug. 2021.
[6] Zhangyang Gao, Cheng Tan, Lirong Wu, and Stan Z. Li. SimVP: Simpler Yet Better Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3170–3180, 2022.
[7] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
[8] Dmitrii Lachinov, Arunava Chakravarty, Christoph Grechenig, Ursula Schmidt-Erfurth, and Hrvoje Bogunovic. Learning Spatio-Temporal Model of Disease Progression with Neural ODEs from Longitudinal Volumetric Data, Nov. 2022.