Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging
Nico Albert Disch, Yannick Kirchhoff, Robin Peretzke, Maximilian Rokuss, Saikat Roy, Constantin Ulrich, David Zimmerer, Klaus Maier-Hein

TL;DR
This paper introduces Temporal Flow Matching (TFM), a novel generative method for modeling and predicting complex spatio-temporal trajectories in 4D longitudinal medical imaging, addressing limitations of existing approaches.
Contribution
The paper presents TFM, a unified generative trajectory model capable of handling 3D volumes, multiple scans, and irregular sampling, surpassing existing methods in 4D medical image prediction.
Findings
TFM outperforms existing spatio-temporal methods on three public datasets.
Establishes a new state-of-the-art in 4D medical image prediction.
Supports flexible sampling and multiple prior scans.

Table 2: Comparison of TFM with the baselines on the three datasets.

| Dataset | Model | NRMSE | SSIM [%] | PSNR |
|---|---|---|---|---|
| ACDC | LCI | 0.056 | 93.3 | 28.49 |
| | ConvLSTM | 0.112 (0.005) | 50.4 (1.5) | 19.12 (0.31) |
| | SimVP | 0.124 (0.001) | 52.8 (1.6) | 21.21 (0.13) |
| | ViViT | 0.120 (0.008) | 30.1 (6.9) | 18.47 (0.53) |
| | TFM (ours) | 0.040 (0.012) | 94.5 (0.8) | 30.51 (1.56) |
| ISLES | LCI | 0.057 | 95.6 | 28.39 |
| | ConvLSTM | 0.182 (0.005) | 40.8 (0.9) | 17.85 (0.23) |
| | SimVP | 0.124 (0.001) | 52.8 (1.6) | 21.21 (0.13) |
| | ViViT | 0.162 (0.003) | 32.5 (0.8) | 18.84 (0.21) |
| | TFM (ours) | 0.041 (0.007) | 97.6 (0.8) | 31.03 (1.08) |
| Lumiere* | LCI | 0.085 | 89.3 | 21.55 |
| | ConvLSTM | 0.352 (0.009) | 7.9 (4.2) | 9.12 (0.22) |
| | SimVP | 0.711 (0.028) | -2.5 (0.8) | 2.98 (0.34) |
| | TFM (ours) | 0.069 (0.007) | 89.7 (1.2) | 23.73 (0.82) |


Table 3: Impact of design choices on ACDC.

| Change | NRMSE | SSIM [%] | PSNR |
|---|---|---|---|
| Att UNet & Mean(6) | 0.0261 | 96.04 | 32.30 |
| No Att: Mean | 0.0270 | 95.77 | 31.88 |
| No Att: Last | 0.0271 | 95.77 | 31.87 |
| No Sparsity Filling | 0.0444 | 90.92 | 27.30 |
| LCI + FM* | 0.1029 | 66.83 | 19.97 |
| LCI | 0.0380 | 93.50 | 29.49 |


Table 4: SSIM as a function of the number of function evaluations (NFEs).

| NFEs | SSIM |
|---|---|
| 1 | 0.956645 |
| 5 | 0.959890 |
| 10 | 0.959926 |
| 25 | 0.959954 |
| 50 | 0.959951 |
| 100 | 0.959926 |
| 150 | 0.959879 |
| 200 | 0.959877 |
| 300 | 0.959884 |
| 400 | 0.959920 |

Nico Albert Disch 1,2,3
Yannick Kirchhoff 1,2,3
Robin Peretzke 1,5
Maximilian Rokuss 1,3
Saikat Roy 1,3
Constantin Ulrich 1,5
David Zimmerer 1,2
Klaus Maier-Hein 1,2,4,6
1 Division of Medical Image Computing, German Cancer Research Center, Heidelberg, Germany
2 HIDSS4Health - Helmholtz Information and Data Science School for Health,
Karlsruhe/Heidelberg, Germany
3 Faculty of Mathematics and Computer Science, University of Heidelberg, Heidelberg, Germany
4 Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital,
Heidelberg, Germany
5 Medical Faculty Heidelberg, University of Heidelberg, Heidelberg, Germany
6 Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital
Abstract
Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for medical image prediction. (Code will be published at https://github.com/MIC-DKFZ/Temporal-Flow-Matching)
1 Introduction
Longitudinal medical imaging is essential for tracking disease progression, monitoring treatment effects, and modeling anatomical development. When a patient undergoes imaging across multiple visits, whether for disease monitoring or post-treatment follow-ups, a longitudinal series is created. Moreover, there are multiple modalities which intrinsically contain temporal dimensions, such as ultrasound, Cine-MRI or perfusion CT. Despite the inherent temporal structure of such data, most current deep learning approaches analyze images as isolated time points, ignoring the valuable temporal dimension. Applications of longitudinal imaging span a wide range of clinical tasks, including neurodegenerative disease progression (e.g. Alzheimer’s disease [14]), cardiac motion analysis [3], and treatment response prediction in oncology [19, 4]. However, deep learning for spatio-temporal medical imaging remains underexplored compared to image analysis approaches that focus on single timepoints. Most existing approaches focus on classification and regression such as [24, 25]. Albeit valuable, these tasks do not fully represent fine-grained changes in the images. High-dimensional generative models offer richer insights, as they can model the evolution of structures like tumors over time rather than merely detecting changes. Generative models, such as diffusion models [11, 23, 15] and Neural ODEs [8, 12], have been applied to medical imaging, but they also predominantly operate on single time points, having only partially available context, restricting their applicability. Some approaches embed multiple time points [2], yet they still only encode single images independently. In contrast, jointly using multiple observations has been shown to enhance prediction accuracy [5]. Other approaches interpolate images between two time points [26], limiting their use for predictive purposes. 
Consequently, current techniques are either technically constrained, limiting their general application to longitudinal imaging, or rely on disease-specific priors.
Yet our experiments demonstrate that spatio-temporal methods from natural imaging cannot outperform a simple baseline: the Last Context Image (LCI), which uses the most recent image as the prediction. Table 1 summarizes a technical comparison of baselines from medical and natural imaging.
Static Bias: Pixel-level scores are dominated by unchanged anatomy. Figure 1 shows the differences between consecutive frames. We note that changes are small, and in some cases quite localized. Full dataset statistics can be found in Figure 7. For example, in the ACDC dataset [3] (see Figure 1), the temporal differences account for only a small portion of the total image content.
Motivated by these observations, we introduce Temporal Flow Matching (TFM), a unified generative trajectory model that captures temporal evolution across multiple scans, modeling only the changes. We term this mechanism Difference Modeling. Crucially, this modeling objective imposes no architectural or regularization constraints, since it is mathematically just a transformation of the output space (see Appendix A for further discussion). Therefore, TFM remains fully flexible and offers the following capabilities:
- Efficient Training: Offers end-to-end optimization during training.
- 4D Time Series: Handles 3D volumetric time series of variable length and amount of context.
- Robust to Sparse and Irregular Sampling: Robust to irregular or missing follow-up scans.
- Disease and Modality Agnostic: Generalizes across heterogeneous applications, including cardiac function (Cine-MRI), stroke progression (perfusion CT) and glioblastoma growth (MRI).
Through extensive benchmarks on three public longitudinal and spatio-temporal datasets, TFM consistently outperforms prior spatio-temporal baselines, including LCI. To the best of our knowledge, this results in the first comprehensive benchmark of spatio-temporal prediction methods in medical imaging. With its strong performance and broad technical flexibility, TFM establishes a robust foundation and new baseline for future advances in medical image analysis.
2 Methods
Longitudinal medical imaging requires handling of irregularly sampled time series, while capturing spatial and temporal dynamics. In Section 2.1, we formalize the problem of irregular medical imaging. We summarize Flow Matching (FM) in Section 2.2, and discuss challenges with integrating FM into image time series. We then introduce a novel extension of FM, namely TFM, in Section 2.3, designed to explicitly address these challenges. Finally, in Section 2.4, we address missing images by introducing a sparsity filling strategy, which is essential for maximizing the performance of TFM.
2.1 Problem Setup
Let us assume a dataset of $N$ spatio-temporal image sequences (i.e. one per patient). For each patient, we assume $T$ context images $X_C = (I_1, \dots, I_T)$ with $I_i \in \mathbb{R}^{H \times W \times D}$, acquired at ordered, and possibly irregular, time points $t_1 < \dots < t_T$, with a target image $I_{T+1}$ at a time $t_{T+1} > t_T$. Due to irregular and sparse acquisitions, missing context images are set to $0$. For this task, we propose Temporal Flow Matching (TFM), a generative model that extends Flow Matching (FM) to predict future medical images from sparse and irregular historical observations.
2.2 Flow Matching (FM)
We adopt the notation of Flow Matching (FM) as introduced in [9]. FM learns a continuous transformation between a source sample $x_0 \sim p_0$ and a target sample $x_1 \sim p_1$ by modeling an optimal transport field parametrized by an Ordinary Differential Equation (ODE):

$$\frac{d x_s}{d s} = v(x_s, s), \qquad s \in [0, 1], \tag{1}$$

where $x_s$ defines the trajectory at interpolation step $s$. Since equation (1) is an ODE, we can fix the initial conditions as $x_{s=0} = x_0$. The vector field $v(x_s, s)$ denotes the velocity of the transport field at position $x_s$. To avoid ambiguity with real-valued medical timepoints, we refer to $s$ as the FM step, rather than the time $t$. A neural network $v_\theta$ is trained to predict the true velocity field:

$$v_\theta(x_s, s) \approx v(x_s, s), \tag{2}$$

where $x_s$ is the intermediate state at step $s$, and $\theta$ are the network parameters. As only $x_0$ and $x_1$ are observed, we define $x_s$ using a known transport map. Typically, a linear interpolation is used:

$$x_s = (1 - s)\, x_0 + s\, x_1, \tag{3}$$

while other choices are possible. For the linear interpolation, the true velocity is constant along the path, $v(x_s, s) = x_1 - x_0$. The FM training objective then minimizes the discrepancy between the predicted velocity and true velocity:

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{s,\, x_0,\, x_1} \left\| v_\theta(x_s, s) - (x_1 - x_0) \right\|^2. \tag{4}$$
Unlike diffusion models, which rely on iterative denoising guided by learned score functions, FM learns a direct mapping via velocity fields. Under certain conditions, FM can be shown to be equivalent to diffusion models [10].
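The linear interpolation and the velocity-matching objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `interpolate` and `fm_loss`, the squared-error form of the loss, and the two stand-in "models" are assumptions made for the example.

```python
import numpy as np

def interpolate(x0, x1, s):
    """Linear transport map: x_s = (1 - s) * x0 + s * x1."""
    return (1.0 - s) * x0 + s * x1

def fm_loss(velocity_model, x0, x1, s):
    """Mean squared error between predicted velocity and x1 - x0."""
    xs = interpolate(x0, x1, s)
    target_velocity = x1 - x0  # derivative of x_s with respect to s
    predicted = velocity_model(xs, s)
    return float(np.mean((predicted - target_velocity) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))             # source sample
x1 = x0 + 0.1 * rng.normal(size=(4, 4))  # target sample, close to the source
s = 0.3                                  # FM step in [0, 1]

# Hypothetical stand-ins for a trained network (assumptions for illustration).
oracle = lambda xs, s: x1 - x0                 # returns the true velocity
zero_model = lambda xs, s: np.zeros_like(xs)   # always predicts "no change"

loss_oracle = fm_loss(oracle, x0, x1, s)
loss_zero = fm_loss(zero_model, x0, x1, s)
```

A model that returns the true velocity reaches zero loss, while a model that predicts no change is penalized by the mean squared difference between target and source.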
2.3 Temporal Flow Matching (TFM)
Medical image follow-ups are often irregular, both in terms of temporal spacing and the number of available context images. This poses a challenge for standard generative models, such as Flow Matching (FM) or diffusion, which model a transformation between two distributions $p_0$ and $p_1$ defined on the same space. Therefore, FM cannot be directly applied when the input and target sequences differ in dimensionality. There are two canonical strategies to address this: (i) Temporal Pooling: compress the context via a spatio-temporal encoder, or predict the flow only from the last available image. (ii) Dimension Padding: extend the target and the context to a set context sequence length. We adopt Dimension Padding, since we find that only using flows from the last image is not stable. This strategy is in part inspired by [6], which also lifts all predictions to the same temporal dimension as a fixed input dimension. With this, we propose Temporal Flow Matching (TFM), a generative model that directly learns transformations from each context image to the target image within a unified spatio-temporal flow formulation. Unlike approaches that compress temporal information early, or operate on latent representations, TFM retains full spatial resolution. This is feasible because TFM has a computational footprint comparable to other spatio-temporal methods. By jointly processing the entire input sequence, the model can leverage spatio-temporal dependencies between input images. To enable this, we define the FM target as

$$X_1 = (I_{T+1}, \dots, I_{T+1}), \tag{5}$$

where $X_1 \in \mathbb{R}^{T \times H \times W \times D}$ repeats the target image $I_{T+1}$ along the temporal axis, with $T$ being the number of context images. The FM initial conditions for equation (1) then read:

$$X_0 = X_C, \tag{6}$$

where $X_C = (I_1, \dots, I_T)$ is the series of input images. Then we have the vector field $v_\theta(X_s, s) \in \mathbb{R}^{T \times H \times W \times D}$. Training and inference are described in Algorithm 1. During training, we calculate $X_s$ and the target velocity $X_1 - X_0$ using (3). The neural net then predicts a velocity (2), and is trained via (4). Inference is then done using eq. (1), i.e.

$$\hat{X}_1 = X_0 + \int_0^1 v_\theta(X_s, s)\, ds. \tag{7}$$

In practice, equation (7) can only be solved numerically. This requires choosing an ODE solver (e.g. Euler or Runge-Kutta) and the number of integration steps (and optionally solver hyperparameters). Since $\hat{X}_1 \in \mathbb{R}^{T \times H \times W \times D}$ still carries a temporal dimension while the target is a single image, we need to reduce the temporal dimension. For the final temporal reduction, we use either the last predicted time channel or the mean across time.
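The inference pipeline described above (repeat the target to the context length, integrate the velocity field from the context, then reduce the time axis) can be sketched as follows. This is a toy illustration under stated assumptions: it uses small 2D frames instead of 3D volumes, a plain explicit Euler solver, and a hypothetical "perfect" velocity in place of a trained network. The helper names `build_target`, `euler_integrate` and `reduce_time` are made up for this example.

```python
import numpy as np

def build_target(target_image, num_context):
    """Repeat the target along a new leading time axis (Dimension Padding)."""
    return np.repeat(target_image[None, ...], num_context, axis=0)

def euler_integrate(velocity_model, x_start, num_steps):
    """Explicit Euler solve of dx/ds = v(x, s) from s = 0 to s = 1."""
    x = x_start.copy()
    ds = 1.0 / num_steps
    for k in range(num_steps):
        x = x + ds * velocity_model(x, k * ds)
    return x

def reduce_time(x_pred, mode="mean"):
    """Collapse the time axis: mean over time, or the last time channel."""
    if mode == "mean":
        return x_pred.mean(axis=0)
    if mode == "last":
        return x_pred[-1]
    raise ValueError(f"unknown reduction mode: {mode}")

rng = np.random.default_rng(0)
context = rng.normal(size=(3, 8, 8))  # T = 3 context frames (toy 2D images)
target = rng.normal(size=(8, 8))
x1 = build_target(target, num_context=context.shape[0])

# Hypothetical stand-in for the trained network: the exact constant velocity.
perfect_velocity = lambda x, s: x1 - context

pred = euler_integrate(perfect_velocity, context, num_steps=10)
pred_mean = reduce_time(pred, mode="mean")
pred_last = reduce_time(pred, mode="last")
```

With an exact constant velocity, every time channel of `pred` lands on the target, so the two reductions agree; with a learned network they may differ.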
Difference Modeling
Rather than modeling the whole spatio-temporal image distribution, our method predicts the velocity field, meaning the differences between context and target. Standard Flow Matching transforms between two distributions $p_0$ and $p_1$, but here both stem from the same patient at different timepoints. Consequently, under the linear interpolation (3), the velocity is the difference between the target $X_1$ and the context $X_0$. Hence, we call this mechanism Difference Modeling, since the network $v_\theta$ models this difference. See Appendix C for a toy example illustrating how this modeling can influence evaluation metrics.
2.4 Handling Missing Data: Sparsity Filling
Irregular sampling in longitudinal data creates ’holes’ in the time axis (i.e., missing images for certain time points), which can distort the estimated flow between the context sequence $X_0$ and the target $X_1$. This reflects the same issue discussed in the motivation, but now arising from missing context frames. To address missing context images, we apply sparsity filling, replacing them with the most recent available scan (see Fig. 2 for visualization). This ensures smoother inputs and more stable flow estimation across masked inputs. If missing frames occur before the first available scan, we fill them using the earliest available image. We denote the filled context sequence by $\tilde{X}_C$. We hypothesize this helps because each filled image in $\tilde{X}_C$ is closer to the target image than an empty (zero-filled) image, resulting in more homogeneous flow fields. In our ablation studies, sparsity filling was essential; omitting it leads to unstable training and degraded convergence.
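The filling rule described above can be sketched as a small function. This is an illustrative sketch, assuming that missing frames are stored as all-zero arrays (as in the problem setup) and that the context has shape `(T, ...)`; the name `sparsity_fill` is chosen for this example.

```python
import numpy as np

def sparsity_fill(context):
    """Replace all-zero (missing) frames with the most recent available frame.

    Frames missing before the first available scan are filled with the
    earliest available frame.
    """
    filled = context.copy()
    available = [i for i in range(len(context)) if np.any(context[i] != 0)]
    if not available:
        return filled  # nothing to fill from
    last = available[0]  # leading gaps use the earliest available frame
    for i in range(len(filled)):
        if i in available:
            last = i  # remember the most recent available frame
        else:
            filled[i] = context[last]
    return filled

# Toy sequence: missing, scan A, missing, scan B
A = np.ones((2, 2))
B = 2.0 * np.ones((2, 2))
Z = np.zeros((2, 2))
context = np.stack([Z, A, Z, B])
filled = sparsity_fill(context)  # frames become A, A, A, B
```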
3 Data and Experimental Design
We compare TFM to methods that jointly model spatial and temporal information across multiple time points.
SimVP [6]: This method originates from the natural image domain and simply uses all context images as input to its network. The original architecture consists of a 2D UNet, which we extend here to 3D. The temporal information is handled by flattening the time dimension into the channel dimension.
ConvLSTM [18]: It extracts spatial features using convolutions while capturing temporal dependencies through an LSTM’s recurrent states [7]. At each time step, the model processes an input image using convolutional layers and updates its internal memory, which maintains information across the sequence.
ViViT [1]: The Video Vision Transformer (ViViT) first splits all input context images into patches of a fixed size. We use ViViT as in [22], for a fair comparison of the pure spatio-temporal backbone.
Baseline Training
The baseline methods directly predict the target given the context sequence . So the loss for those methods reads:
[TABLE]
Further definitions of architecture, model and method are found in Appendix A.
Last Context Image
Furthermore, we use the Last Context Image (LCI), a heuristic that serves as an estimated lower bound. LCI is defined as the last image in the sequence which is non-zero. This baseline is medically motivated, as it serves as a part of medical decision making when looking at longitudinal series (see [20]). (Footnote 2: While LCI is optimal for monotonic sequences, it might not be the best performing image from the context sequence. However, selecting the best image from the sequence would require further insight or an oracle model. So LCI is the best we can naively do for most tasks. Yet for the datasets we consider, LCI is in fact the best.)
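As a concrete reading of this definition, the LCI baseline can be sketched as below, assuming missing frames are stored as all-zero arrays and the context has shape `(T, ...)`. The function name is chosen for this example.

```python
import numpy as np

def last_context_image(context):
    """Return the last non-zero frame of a context sequence of shape (T, ...)."""
    for frame in context[::-1]:  # walk backwards from the most recent slot
        if np.any(frame != 0):
            return frame
    raise ValueError("context contains no available image")

# Toy sequence: scan A, scan B, missing
A = np.ones((2, 2))
B = 2.0 * np.ones((2, 2))
Z = np.zeros((2, 2))
lci = last_context_image(np.stack([A, B, Z]))  # picks B, skipping the missing frame
```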
3.1 Datasets
ACDC [3] is a cardiac MRI dataset for different states of the heart. Images are reshaped to a fixed shape, and the target is a single image with the same spatial dimensions. For the ACDC dataset, we randomly mask out time points in order to make the series irregular. We split the dataset into training, validation and test images. This dataset was used for method development, and ablations were done on the validation set.
ISLES [17] consists of perfusion CT images from stroke patients. For our experiments, we utilize this 4D modality. Since there are dozens of time steps with minor changes in the image, we further process the image: we only take every other time step of the perfusion sequence. From the resulting series, we randomly pick 4 consecutive points, where the last point is the target, and we randomly mask context images. The context therefore consists of the three preceding time points. This dataset is also split into training, validation and test images.
Lumiere [19] is a longitudinal dataset of tumor growth in gliomas, consisting of 3D MRI scans. The images are reshaped to a fixed shape. Since not all patients have many acquisitions, we prepended zeros to shorter sequences to keep pre-processing consistent. Lumiere is likewise split into training, validation and test images. Example images from two timepoints for each dataset are shown in Figure 1.
3.2 Experimental settings
All methods (see Appendix A for notation) were trained with AdamW and a cosine-annealed learning-rate schedule, using a batch size of 4. The same initial learning rate was used for all experiments. For TFM, we used 10 integration steps during inference (see Table 4). Our TFM builds on the standard UNet from the TorchCFM library [21], using cross-attention between time embeddings and spatial feature maps (see Figure 2 for an overview). To ensure a fair comparison, we ran each experiment three times with different validation splits and the same random seed within each split.
Random Masking
For ACDC and ISLES, we randomly omit context images during both training and validation, to simulate irregular sampling. Since we believe we are the first to benchmark methods in this very specific irregular setting, we highlight a potentially grave pitfall: if validation masks are resampled at each validation epoch, even with a fixed seed, the masks and hence the evaluation metrics change every time, which is exacerbated by our small validation set. This variability affects even the trivial LCI baseline and makes "best" epoch selection arbitrary. Since the validation set is small, context sequences can be extremely sparse or dense, causing the LCI baseline’s performance to fluctuate drastically. (Footnote 3: In natural imaging, validation sets are much larger, so random fluctuations are less severe. In medical imaging, however, smaller validation sets make these fluctuations significant.) To avoid this issue, we generate one fixed set of masks per split (using a single seed) and reuse those exact masks for every model at every validation epoch. This ensures consistent validation conditions, meaningful epoch selection, and fully reproducible and interpretable validation results. In all cases, models were selected by the lowest validation error and then evaluated on the held-out test set.
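One way to realize the fixed-mask protocol is to draw all validation masks once from a single seed and store them, instead of sampling inside the validation loop. The sketch below is an assumption-laden illustration: the keep probability of 0.5, the rule that every case keeps at least one frame, and the helper names `make_fixed_masks` and `apply_mask` are choices made for this example, not details from the paper.

```python
import numpy as np

def make_fixed_masks(num_cases, num_context, seed, keep_prob=0.5):
    """Draw one boolean keep-mask per validation case, once, from a single seed."""
    rng = np.random.default_rng(seed)
    masks = rng.random((num_cases, num_context)) < keep_prob
    # Assumption: every case keeps at least one context frame.
    for row in np.flatnonzero(~masks.any(axis=1)):
        masks[row, rng.integers(num_context)] = True
    return masks

def apply_mask(context, mask):
    """Zero out the dropped frames of a context sequence of shape (T, ...)."""
    shape = (-1,) + (1,) * (context.ndim - 1)
    return context * mask.reshape(shape)

# Generated once per split, then reused for every model and every epoch.
val_masks = make_fixed_masks(num_cases=5, num_context=3, seed=0)
masked = apply_mask(np.ones((3, 4, 4)), val_masks[0])
```

Storing `val_masks` (for example alongside the split definition) keeps the LCI score and every model's validation score comparable across epochs.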
4 Results and Discussion
TFM outperforms LCI across all datasets and metrics, as shown in Table 2, achieving top performance on every metric and dataset tested. Competing methods struggle to generate realistic images, often scoring below the LCI baseline. We stress that pixel-wise metrics uniformly penalize any change, so unchanged anatomy dominates the score and can mask fine spatio-temporal predictions (see e.g. Figure 7). By modeling the difference directly, we argue that TFM sidesteps this bias and more faithfully captures true temporal evolution. Future work should consider modeling differences directly, computing metrics only on regions of interest with substantial change, or adopting task-specific metrics that align more closely with clinical motivations [13].
Lumiere is a longitudinal tumor growth dataset, characterized by sparse sequences and a small number of training cases. The strong differences in patient-specific trajectories and its data scarcity make Lumiere particularly difficult; most methods fail to come close to LCI, except for TFM. We attribute the poorer performance of other methods to the small training set size and high inter-subject variability. This leads to a negative SSIM for the SimVP method, and qualitatively noisy results. Despite these challenges, TFM outperforms LCI on Lumiere (Table 2). This supports our hypothesis that TFM benefits from Difference Modeling, making it more robust and especially well-suited for real-world scenarios. Figure 4 illustrates that TFM generates realistic images. Further qualitative results can be found in the appendix. Thanks to our use of Runge-Kutta integration, memory savings are non-trivial; if needed, memory usage can be further reduced by switching to Euler integration and detaching tensors after every step, or by aiming for single-step predictions.
Insights on design decisions: Table 3 summarizes the impact of design choices on ACDC, including an alternative lightweight architecture that replaces the attention mechanism with concatenated time embeddings in the bottleneck, no sparsity filling, and two aggregation methods. We see that switching to a lighter-weight architecture had only a minor effect on performance. Future work may explore alternative architectures for the flow network (2), either to improve efficiency or to further boost performance. Yet this choice shows the flexibility of TFM. We observe no significant difference between aggregating by the mean or using only the final predicted frame. Since the model is trained on full flows in both settings, this suggests it learns to predict the target from any context frame. Crucially, our sparsity filling strategy significantly improves TFM’s performance. We attribute this to the fact that filled frames are closer to the target image than zero tensors, resulting in more learnable and stable flow velocities. This reinforces our core design principle: TFM focuses on temporal changes, not the entire image distribution.
An important additional finding is that the LCI + FM baseline (using TFM’s UNet but a single time channel) performs poorly. This occurs even though TFM can handle single context inputs (see Figure 3). Intuitively, both methods should yield similar results at inference, since they receive the same input. We believe this discrepancy stems from the training dynamics: in the LCI + FM setting, the model learns the flow over uneven time intervals. These inconsistent intervals introduce high variance and instabilities in the flows. In contrast, TFM’s end-to-end training on the randomly masked sequence yields more information on the temporal spacing, making predictions more robust in the single-input setting. We suspect that this consistency yields a more stable training regime and enables reliable performance even when reduced to a single input at test time.
4.1 Future Directions and Limitations
Building on TFM’s core strengths (full-resolution flow, end-to-end training, and robust handling of irregular sampling), numerous promising avenues emerge. None of these exposes a fundamental limitation of our method; instead, they capitalize on its flexibility:
- Advanced Sparsity Filling: We demonstrated that even a simple nearest-image fill substantially stabilized training (see Table 3). Future work can explore more sophisticated schemes, such as learned imputation networks or temporal interpolation priors, which could seamlessly plug into TFM.
- Explicit Continuous Time Modeling: While our current setting treats the FM steps as abstract interpolation scalars, this can be naturally extended to model continuous, real-valued time steps. This would allow for more flexible predictions, ideal for clinical workflows.
- Stochastic Generation: It is technically simple to include stochastic sampling in TFM. This would allow TFM to sample multiple possible futures, which could again be valuable in the clinical context for risk assessment and planning. On a technical level, this only requires extending the ODE to SDE integration and adding noise during training (see step (6)1).
Limitations
Several challenges remain, which stem from the realities of longitudinal imaging. First, truly large-scale, high quality follow-up cohorts remain rare for many diseases. Acquiring more multi-timepoint studies can be costly, yet such data are essential for validating disease trajectory models. Second, our current approach of globally sampling to a set resolution might not be optimal. A more localized, patch based strategy could capture finer details better but remains challenging for generative modeling. Finally, conventional baseline methods struggle when data are scarce, a prominent issue in our setting. To overcome this limitation, future work might need to leverage large-scale pretrained models, and fine tune them on specific 4D prediction tasks, although such pretrained resources are not yet readily available. Despite these constraints, TFM maintains remarkably stable performance. We hope that this contribution will encourage the acquisition of larger longitudinal datasets and inspire further clinical studies.
5 Conclusion
In this paper, we address the challenge of modeling longitudinal medical imaging with sparse and irregular time series by introducing Temporal Flow Matching (TFM), a state-of-the-art generative approach for 4D medical image prediction. In our datasets, and often in clinical practice, temporal changes constitute only a small portion of the total image content, relative to inter-patient differences. TFM leverages this insight by explicitly modeling differences between context and target, an approach which we call Difference Modeling. Through extensive experiments on publicly available datasets, we demonstrate that TFM consistently outperforms prior methods, establishing a new baseline for disease progression modeling. While we gratefully acknowledge the public datasets which support this work, advancing spatio-temporal prediction will demand larger cohorts with detailed records of confounding factors, such as surgeries and treatment changes.
Acknowledgements
The present contribution is supported by the Helmholtz Association under the joint research school HIDSS4Health – Helmholtz Information and Data Science School for Health.
Appendix A Method vs Model
In previous sections we referred to TFM as a baseline method. To disambiguate the terminology, we define the following. Architecture is defined here as the actual neural network, that is, the function $f_\theta$ mapping the network input to the network output. This seems redundant, but it is important for comparison. We define Model as the choice of input and, especially, output space. The best example here is diffusion vs. flow matching: both can be implemented with the same network $f_\theta$, but in the case of diffusion, the input is a noisy sample and the output is the noise, whereas for flow matching, the network receives the interpolated state $x_s$ and predicts the velocity. We say the diffusion network models the noise, and the flow matching network models the velocity.
Appendix B Datasets
In Table 4 we ablate the number of function evaluations (NFEs). We note that performance drops significantly for a single NFE. As a trade-off, we chose 10 integration steps for all datasets.
Appendix C Toy Example - Difference Modeling
To clarify why modeling differences can be advantageous in low-change environments (even when full image reconstruction is limited), we present a simplified toy example illustrating a resolution-performance paradox. To show how limited resolution and bounded changes can impact error metrics, consider the following setting: Assume a checkerboard pattern where each pixel alternates between $0$ and $1$. In the center, a small patch undergoes a change. For simplicity, let the context image $x_0$ contain two black squares in the center, and the target image $x_1$ contain three. Suppose a perfect longitudinal model captures the central change but operates at a coarser resolution than the image. Since it cannot represent the high-frequency checkerboard pattern, its best prediction is a uniform value across the image. This yields a large total MSE, whereas the MSE of LCI is much smaller, since LCI is wrong only on the changed patch. Now consider the difference image $x_1 - x_0$, which contains a single black square and zeros elsewhere. The same low-resolution model can now perfectly represent this difference, achieving an MSE of $0$, despite lacking the resolution to represent the full images individually. This illustrates how Difference Modeling can resolve the apparent paradox where a model with limited spatial capacity still performs well under certain metrics. While this does not fully explain the behavior of TFM, it provides intuition for how modeling the difference yields strong starting conditions, and how methods can benefit from this formulation.
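The toy example can be reproduced numerically. The sketch below is a hypothetical instance, not the paper's exact configuration: the 8×8 image size, the 2×2 block resolution of the coarse model, and the additive change of 0.5 on one aligned block are all assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np

def coarse(img, block=2):
    """Best block-constant approximation: average each block, then upsample."""
    n = img.shape[0]
    means = img.reshape(n // block, block, n // block, block).mean(axis=(1, 3))
    return np.repeat(np.repeat(means, block, axis=0), block, axis=1)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

n = 8
x0 = (np.indices((n, n)).sum(axis=0) % 2).astype(float)  # 0/1 checkerboard
x1 = x0.copy()
x1[2:4, 2:4] += 0.5  # assumed change: one aligned 2x2 block brightens by 0.5

# (a) Coarse model predicts the full target image directly.
mse_direct = mse(coarse(x1), x1)

# (b) Last Context Image: reuse x0 as the prediction.
mse_lci = mse(x0, x1)

# (c) Coarse model predicts only the difference, which is added to x0.
mse_difference = mse(x0 + coarse(x1 - x0), x1)
```

In this instance the coarse model cannot represent the checkerboard, so predicting the image directly gives an MSE of 0.25, LCI gives 4 · 0.25 / 64 ≈ 0.0156, and predicting the block-constant difference gives an MSE of 0.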
Appendix D Further Experimental Settings
All methods were trained with the AdamW optimizer and a batch size of 4. We used a cosine-annealing learning-rate schedule with a warm-up phase at the start of training, as well as gradient clipping.
TFM network details: For the experiments we used a feature size of , and a channel multiplication per layer of , with one res block per layer. The attention resolution was set to . For everything else, the default parameters of the UNet from [21] were used.
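A learning-rate schedule with warm-up followed by cosine annealing, as mentioned in the training settings above, can be sketched as a plain function of the step index. The linear warm-up shape, the decay to zero, and all numeric values in the usage lines are assumptions for illustration; the settings used in the paper are not recoverable from this text.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_fraction):
    """Linear warm-up to base_lr, then cosine annealing down to zero."""
    warmup_steps = int(warmup_fraction * total_steps)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical values, chosen only to exercise the function.
schedule = [lr_at(s, total_steps=100, base_lr=1e-3, warmup_fraction=0.1)
            for s in range(101)]
```

In a PyTorch training loop, the same shape is commonly obtained by combining a warm-up scheduler with a cosine scheduler, with gradient clipping applied before each optimizer step.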
References
- [1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A Video Vision Transformer, Nov. 2021.
- [2] Hao Bai and Yi Hong. NODER: Image Sequence Regression Based on Neural Ordinary Differential Equations, July 2024.
- [3] Olivier Bernard, Alain Lalande, Clement Zotti, Frederick Cervenansky, Xin Yang, Pheng-Ann Heng, Irem Cetin, Karim Lekadir, Oscar Camara, Miguel Angel Gonzalez Ballester, Gerard Sanroma, Sandy Napel, Steffen Petersen, Georgios Tziritas, Elias Grinias, Mahendra Khened, Varghese Alex Kollerathu, Ganapathy Krishnamurthi, Marc-Michel Rohé, Xavier Pennec, Maxime Sermesant, Fabian Isensee, Paul Jäger, Klaus H. Maier-Hein, Peter M. Full, Ivo Wolf, Sandy Engelhardt, Christian F. Baumgartner, Lisa
- [4] Evan Calabrese, Javier E. Villanueva-Meyer, Jeffrey D. Rudie, Andreas M. Rauschecker, Ujjwal Baid, Spyridon Bakas, Soonmee Cha, John T. Mongan, and Christopher P. Hess. The University of California San Francisco Preoperative Diffuse Glioma MRI Dataset. Radiology: Artificial Intelligence, 4(6):e220058, Nov. 2022.
- [5] Cong Fang, Song Bai, Qianlan Chen, Yu Zhou, Liming Xia, Lixin Qin, Shi Gong, Xudong Xie, Chunhua Zhou, Dandan Tu, Changzheng Zhang, Xiaowu Liu, Weiwei Chen, Xiang Bai, and Philip H. S. Torr. Deep learning for predicting COVID-19 malignant progression. Medical Image Analysis, 72:102096, Aug. 2021.
- [6] Zhangyang Gao, Cheng Tan, Lirong Wu, and Stan Z. Li. SimVP: Simpler Yet Better Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3170–3180, 2022.
- [7] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
- [8] Dmitrii Lachinov, Arunava Chakravarty, Christoph Grechenig, Ursula Schmidt-Erfurth, and Hrvoje Bogunovic. Learning Spatio-Temporal Model of Disease Progression with Neural ODEs from Longitudinal Volumetric Data, Nov. 2022.
