TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens
Jianpeng Cheng, Xian Wu, Jiangfan Zhang, Wentao Bao, Chaitanya Ahuja, Shlok Kumar Mishra, Hanchao Yu, Yang Gao, Fan Xia, Qi Guo, Shaodan Zhai, Xiangjun Fan, and Jun Xiao

TL;DR
TTE-Flash replaces explicit Chain-of-Thought (CoT) reasoning in multimodal embedding models with latent think tokens, producing high-performance, reasoning-aware representations at a constant inference cost while keeping the reasoning traces interpretable.
Contribution
The paper proposes latent think tokens, which improve reasoning efficiency and interpretability in multimodal representations and outperform explicit-CoT methods.
Findings
Outperforms explicit-CoT models on the MMEB-v2 benchmark
Produces think tokens that are interpretable both textually and visually
Shows scaling benefits as the number of think tokens increases, across datasets
Abstract
Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an <eos> embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: 1) how…
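The two-part objective described in the abstract (a CoT generation loss on the think tokens plus a contrastive loss on the embedding tokens) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption: the InfoNCE form of the contrastive loss, the temperature value, the `gen_weight` mixing coefficient, and the function names are all hypothetical, and the inputs are plain Python lists rather than real model outputs.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(query_embs, target_embs, temperature=0.05):
    """Contrastive loss (InfoNCE, an assumed choice): the i-th query
    should score highest against the i-th target in the batch."""
    q = [normalize(v) for v in query_embs]
    t = [normalize(v) for v in target_embs]
    loss = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        loss -= log_softmax(logits)[i]
    return loss / len(q)


def cot_generation_loss(token_logits, cot_token_ids):
    """Mean cross-entropy of the reference CoT tokens.
    token_logits[k] is the (hypothetical) vocabulary logit vector the
    model assigns at CoT position k, conditioned on the think tokens."""
    total = 0.0
    for logits, token_id in zip(token_logits, cot_token_ids):
        total -= log_softmax(logits)[token_id]
    return total / len(cot_token_ids)


def combined_loss(query_embs, target_embs, token_logits, cot_token_ids,
                  gen_weight=1.0):
    """Sum of both terms; gen_weight is an assumed mixing coefficient."""
    contrastive = info_nce(query_embs, target_embs)
    generation = cot_generation_loss(token_logits, cot_token_ids)
    return contrastive + gen_weight * generation
```

In this sketch the generation term is needed only during training. At inference, only the think tokens and the embedding token would be computed, which is consistent with the constant inference cost the abstract claims.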
