Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li, Minfeng Xu

TL;DR
Photon is a novel framework that efficiently processes 3D medical images with large language models by adaptively reducing tokens, leading to faster computation and state-of-the-art accuracy in clinical visual question answering.
Contribution
Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation for adaptive token reduction in 3D medical imaging, improving efficiency without sacrificing accuracy.
Findings
Achieves state-of-the-art accuracy on medical VQA tasks.
Reduces computational resource usage significantly.
Accelerates training and inference processes.
Abstract
Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies…
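The abstract's combination of a discrete token drop with a custom backward rule can be illustrated with a generic surrogate-gradient sketch. This is not Photon's actual rule, which the text above does not spell out: the dot-product relevance score, the fixed threshold, the sigmoid surrogate, and all function names below are assumptions chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def instruction_scores(tokens, instruction):
    """Hypothetical relevance score: dot product of each token with an instruction vector."""
    return [sum(t_i * q_i for t_i, q_i in zip(t, instruction)) for t in tokens]


def hard_keep_mask(scores, threshold=0.0):
    """Forward pass: discrete keep/drop decision per token (1.0 = keep, 0.0 = drop)."""
    return [1.0 if s > threshold else 0.0 for s in scores]


def surrogate_mask_grad(scores, upstream, threshold=0.0, temperature=1.0):
    """Backward pass: differentiate as if the mask were sigmoid((s - threshold) / T).

    The hard mask has zero gradient almost everywhere, so a smooth stand-in
    is used to pass a learning signal back to the scores (an assumed surrogate).
    """
    grads = []
    for s, g in zip(scores, upstream):
        p = sigmoid((s - threshold) / temperature)
        grads.append(g * p * (1.0 - p) / temperature)
    return grads


# Four toy "visual tokens" and one toy instruction embedding (made-up values).
tokens = [[1.0, 0.0], [-1.0, 0.5], [0.2, 0.9], [-0.3, -0.4]]
instruction = [1.0, 1.0]

scores = instruction_scores(tokens, instruction)
mask = hard_keep_mask(scores)
kept = [t for t, m in zip(tokens, mask) if m == 1.0]  # variable-length output
grads = surrogate_mask_grad(scores, upstream=[1.0] * len(scores))
```

In this sketch the number of kept tokens depends on the instruction, so the output sequence length varies per input, and dropped tokens still receive a nonzero gradient on their scores through the surrogate.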
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed method shows a significant efficiency improvement without sacrificing its performance, which is quite impressive. According to the experiments, the proposed method can successfully reduce the token length by more than 50%, speed up the training process by over 5 times, and reduce the memory usage at the same time. It also shows a non-trivial improvement against SoTA baselines, including larger ones, like the 7B Lingshu model. 2. The idea of instruction-conditioned token schedul
The reviewer has 2 major concerns about this paper. 1. The paper itself is well-written, although it is not that easy to fully understand all the equations; it is at least clear and detailed. However, the figures in the paper, especially Figure 1 and Figure 2, are very difficult to follow. The reviewer understands that using a uniform color scheme can make it look nicer, but it should not harm the readability. For example, (a) The performance bar plot and radar plot at the bottom of Figure 1 ar
1. This work addresses an important bottleneck in medical AI (computational challenge for medical QA with MLLM due to 3D nature of the data). The motivation is clear, as processing full 3D volumes preserves volumetric information while dynamic token selection is relatively computationally affordable. 2. The proposed method IST is novel and seems to be effective. Instruction-conditioned, instance-adaptive token pruning is a step beyond common, instruction-agnostic pruning or fixed-ratio compress
1. The proposed method is overly complex, making it hard to follow let alone reproduce. For example, the derivation of the final surrogate gradient involves a long chain of heuristic-based calculations like standardization z_j, monotonic mapping r_j, directional term d_j, magnitude term m_j, and several clipping and clamping operations. Then there are also several regularization terms. These combined make the proposed method inherently brittle. This high sensitivity to implementation details not
- The combination of *instruction-conditioned pruning* and *surrogate gradient backpropagation* for efficient 3D token handling is innovative and mathematically well-founded. - Photon outperforms all major baselines (RadFM, M3D, OmniV, Lingshu, etc.) by **3–14%** across multiple Med-VQA tasks. - Achieves ~2/3 GPU memory reduction and ~5× training/inference speedup, verified through detailed benchmarks and ablations. - Includes both 3D-RAD and DeepTumorVQA, along with visualizations, abl
- Can the instruction-conditioned token scheduling transfer to domains with different spatial and noise characteristics? - How does Photon perform in zero-shot settings on out-of-distribution datasets? - While comparisons are thorough within medical MLLMs, the study omits re-implementations of VisionZip, LLaVA-PruMerge, or ATP-LLaVA under medical conditions, limiting cross-domain efficiency comparisons. - How would Photon scale when integrated with larger base models in terms of train
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
