TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

Minsu Kim; Jee-weon Jung; Hyeongseop Rha; Soumi Maiti; Siddhant Arora; Xuankai Chang; Shinji Watanabe; Yong Man Ro

arXiv:2402.16021 · cs.CL · June 9, 2025 · 1 citation

TL;DR

This paper introduces TMT, a tri-modal translation model that treats speech, image, and text as different languages. A single model translates between all three modalities and outperforms single-model counterparts.

Contribution

The paper presents a novel tri-modal translation framework that interprets different modalities as languages, reducing computational costs and enhancing multi-modal translation performance.

Findings

01. TMT outperforms single-model counterparts on all six modality translation tasks.

02. Tokenizing speech and images into discrete tokens provides a unified interface across modalities.

03. Treating modalities as languages benefits both practicality and accuracy.

Abstract

The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization…
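The "modalities as languages" idea in the abstract can be illustrated with a minimal sketch. It assumes that each modality already has its own tokenizer producing integer ids, that the three vocabularies are laid end to end in one shared id space, and that a target-modality tag is prepended to the source sequence in the style of multilingual machine translation. The vocabulary sizes, the tag scheme, and every function name below are hypothetical and chosen for illustration; they are not taken from the paper or its repository.

```python
from dataclasses import dataclass

# Hypothetical vocabulary sizes, for illustration only (not from the paper).
VOCAB_SIZES = {"text": 1000, "speech": 200, "image": 512}


@dataclass(frozen=True)
class SharedVocab:
    sizes: dict    # modality -> number of modality-local token ids
    offsets: dict  # modality -> first shared id used by that modality
    tag_ids: dict  # modality -> shared id of its "translate into" tag
    size: int      # total number of shared ids


def build_shared_vocab(sizes):
    """Lay the per-modality vocabularies end to end, then append one tag per modality."""
    offsets, cursor = {}, 0
    for modality, n in sizes.items():
        offsets[modality] = cursor
        cursor += n
    tag_ids = {}
    for modality in sizes:
        tag_ids[modality] = cursor
        cursor += 1
    return SharedVocab(dict(sizes), offsets, tag_ids, cursor)


def encode_source(vocab, modality, local_ids, target_modality):
    """Shift modality-local ids into the shared space and prepend the target tag."""
    for i in local_ids:
        if not 0 <= i < vocab.sizes[modality]:
            raise ValueError(f"id {i} is outside the {modality} vocabulary")
    shifted = [vocab.offsets[modality] + i for i in local_ids]
    return [vocab.tag_ids[target_modality]] + shifted


def decode_target(vocab, modality, shared_ids):
    """Map shared ids back to modality-local ids for a modality-specific detokenizer."""
    lo = vocab.offsets[modality]
    hi = lo + vocab.sizes[modality]
    for i in shared_ids:
        if not lo <= i < hi:
            raise ValueError(f"shared id {i} does not belong to {modality}")
    return [i - lo for i in shared_ids]


vocab = build_shared_vocab(VOCAB_SIZES)
# Speech tokens as the source "language", image tokens as the target "language".
src = encode_source(vocab, "speech", [0, 5, 199], target_modality="image")
# Pretend the translation model emitted these shared ids for the image side.
out = decode_target(vocab, "image", [1200, 1711])
print(src, out)
```

In this sketch the translation model itself is left out: it would be any sequence-to-sequence model over the shared id space, with modality-specific work confined to the tokenizers that produce the local ids and the detokenizers that consume them.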

Code & Models

Repositories

ms-dot-k/tmt (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques