A Multimodal Vision Transformer-based Modeling Framework for Prediction of Fluid Flows in Energy Systems

Kiran Yalamanchi; Shivam Barwey; Ibrahim Jarrah; Pinaki Pal

arXiv:2604.02483 · physics.flu-dyn · April 6, 2026

TL;DR

This paper introduces a hierarchical Vision Transformer framework that predicts and reconstructs complex fluid flows in energy systems using multimodal CFD data, enabling efficient and accurate flow forecasting.

Contribution

It develops a novel multimodal Vision Transformer architecture conditioned on data modality and time, capable of generalizing across resolutions and inferring unobserved flow features.

Findings

01. Accurately predicts future flow states in high-pressure gas injection scenarios.

02. Successfully reconstructs missing flow information from limited observational data.

03. Demonstrates generalization across different CFD simulation configurations.

Abstract

Computational fluid dynamics (CFD) simulations of complex fluid flows in energy systems are prohibitively expensive due to strong nonlinearities and multiscale-multiphysics interactions. In this work, we present a transformer-based modeling framework for prediction of fluid flows, and demonstrate it for high-pressure gas injection phenomena relevant to reciprocating engines. The approach employs a hierarchical Vision Transformer (SwinV2-UNet) architecture that processes multimodal flow datasets from multi-fidelity simulations. The model architecture is conditioned on auxiliary tokens explicitly encoding the data modality and time increment. Model performance is assessed on two different tasks: (1) spatiotemporal rollouts, where the model autoregressively predicts the flow state at future times; and (2) feature transformation, where the model infers unobserved fields/views from observed…
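The autoregressive rollout described in task (1) can be illustrated with a minimal sketch. This is not the authors' implementation: `rollout`, `toy_step`, `modality_id`, and `dt` are hypothetical names chosen for illustration, and the toy step function is a simple decay rule standing in for the trained SwinV2-UNet. The sketch only shows the control flow, in which each prediction is fed back as the next input and the modality and time increment are passed as conditioning arguments.

```python
from typing import Callable, List, Sequence

# Placeholder for a flow field; a real model would operate on image-like tensors.
State = List[float]


def rollout(
    step: Callable[[State, int, float], State],
    initial_state: Sequence[float],
    modality_id: int,
    dt: float,
    n_steps: int,
) -> List[State]:
    """Autoregressively apply `step`, feeding each prediction back as the next input."""
    states: List[State] = [list(initial_state)]
    for _ in range(n_steps):
        # The conditioning values (modality, time increment) are passed at every step.
        states.append(step(states[-1], modality_id, dt))
    return states


def toy_step(state: State, modality_id: int, dt: float) -> State:
    # Hypothetical stand-in for the trained network: a linear decay rule so the
    # example runs without any ML library. `modality_id` is unused in this toy.
    return [x * (1.0 - 0.5 * dt) for x in state]


trajectory = rollout(toy_step, [1.0, 2.0], modality_id=0, dt=0.1, n_steps=3)
print(len(trajectory))  # initial state plus three predicted states
```

Because each step consumes the previous prediction rather than ground truth, errors can accumulate over the rollout, which is why long-horizon accuracy is usually assessed separately from single-step accuracy.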
