TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

Hyunwoo Oh; Hanning Chen; Sanggeon Yun; Yang Ni; Suyeon Jang; Behnam Khaleghi; Fei Wen; Mohsen Imani

arXiv:2603.22867 · cs.AR · March 25, 2026

TL;DR

TRINE is a single-bitstream FPGA inference engine that executes multimodal AI models adaptively and without reconfiguration, targeting low-latency, real-time processing across vision, language, and graph tasks.

Contribution

TRINE introduces a unified, runtime-adaptive FPGA accelerator that executes diverse multimodal models without reconfiguration, combining a mode-switchable engine with dependency-aware layer offloading (DALO).

Findings

01

Reduces latency by up to 22.57x compared to RTX 4090.

02

Achieves up to 6.86x latency reduction over Jetson Orin Nano at 20-21 W.

03

Token pruning and dependency-aware layer offloading (DALO) improve throughput; the abstract credits token pruning alone with a gain of up to 7.8x.
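
To illustrate the token-pruning finding, here is a minimal software sketch of two-stage top-k selection. It is not the paper's hardware unit: the function names `two_stage_topk` and `prune_tokens`, the `chunk` width, and the use of a single importance score per token are all assumptions made for illustration. Stage 1 keeps up to k candidates from each fixed-width chunk; stage 2 picks the global top k among the survivors.

```python
import numpy as np

def two_stage_topk(scores, k, chunk=8):
    """Return indices of the k largest scores, in ascending index order.

    Hypothetical helper: `chunk` stands in for a fixed hardware lane width.
    """
    n = scores.shape[0]
    k = min(k, n)
    candidates = []
    # Stage 1: each chunk keeps at most k local winners, which is
    # enough to guarantee that no global top-k element is dropped.
    for start in range(0, n, chunk):
        block = scores[start:start + chunk]
        keep = min(k, block.shape[0])
        local = np.argsort(-block, kind="stable")[:keep] + start
        candidates.append(local)
    candidates = np.concatenate(candidates)
    # Stage 2: global top-k over the much smaller candidate set.
    order = np.argsort(-scores[candidates], kind="stable")[:k]
    return np.sort(candidates[order])

def prune_tokens(tokens, scores, k):
    """Keep the k highest-scoring token rows, preserving their order."""
    keep = two_stage_topk(scores, k)
    return tokens[keep], keep
```

Calling `prune_tokens(tokens, scores, k)` on a `(num_tokens, dim)` array returns the surviving rows and their original positions, so later layers operate on a shorter sequence.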

Abstract

Multimodal stacks that mix ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms because their compute/memory patterns diverge and hard real-time targets leave little slack. TRINE is a single-bitstream FPGA accelerator and compiler that executes end-to-end multimodal inference without reconfiguration. Layers are unified as DDMM/SDDMM/SpMM and mapped to a mode-switchable engine that toggles at runtime among weight/output-stationary systolic, 1xCS SIMD, and a routable adder tree (RADT) on a shared PE array. A width-matched, two-stage top-k unit enables in-stream token pruning, while dependency-aware layer offloading (DALO) overlaps independent kernels across reconfigurable processing units to sustain utilization. Evaluated on Alveo U50 and ZCU104, TRINE reduces latency by up to 22.57x vs. RTX 4090 and 6.86x vs. Jetson Orin Nano at 20-21 W; token pruning alone yields up to 7.8x…
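
The three kernel classes named in the abstract can be written as short NumPy reference functions. This is a sketch of the general definitions only, not of TRINE's compiler or hardware mapping: the function names, the COO-style `rows`/`cols`/`vals` layout, and the `n_rows` parameter are assumptions. DDMM is a dense-dense product, SDDMM evaluates a dense-dense product only at sampled positions, and SpMM multiplies a sparse matrix by a dense one.

```python
import numpy as np

def ddmm(a, b):
    """Dense-dense matrix multiply."""
    return a @ b

def sddmm(rows, cols, a, b):
    """Sampled dense-dense multiply: (a @ b)[rows[i], cols[i]] for each i.

    Only the sampled positions are computed, one dot product each.
    """
    return np.einsum("ij,ij->i", a[rows], b[:, cols].T)

def spmm(rows, cols, vals, x, n_rows):
    """Sparse (COO) times dense: accumulates vals[i] * x[cols[i]] into row rows[i]."""
    out = np.zeros((n_rows, x.shape[1]), dtype=x.dtype)
    # np.add.at performs unbuffered accumulation, so repeated row
    # indices are summed rather than overwritten.
    np.add.at(out, rows, vals[:, None] * x[cols])
    return out
```

As a general illustration rather than the paper's stated mapping, a fully connected layer fits `ddmm`, per-edge scores on a graph fit `sddmm`, and neighbor aggregation fits `spmm`; chaining `sddmm` into `spmm` gives a sparse attention-style pass.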

Taxonomy

Topics: Advanced Neural Network Applications · Embedded Systems Design Techniques · Big Data and Digital Economy