Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation

Sen Fang; Sizhou Chen; Yalin Feng; Xiaofeng Zhang; Teik Toe Teoh

arXiv:2410.03146 · cs.CV · October 15, 2024

TL;DR

This paper introduces BGTAI, a novel multimodal framework that uses gloss-based annotations to improve alignment and understanding across text, audio, and images, enhancing multimodal representation quality.

Contribution

It proposes the first Langue2Gloss model and integrates it into UniBriVL, along with new modules and loss functions to improve multimodal alignment and training stability.

Findings

1. Outperforms previous multimodal models in experiments
2. Enhances compatibility among text, audio, and visual modalities
3. Demonstrates improved multimodal representations

Abstract

This paper presents BGTAI, an approach that aims to simplify multimodal understanding by using gloss-based annotation as an intermediate step in aligning text and audio with images. Text and audio inputs carry dynamic temporal factors, including various predicate adjectives that influence the meaning of the entire sentence, whereas images present static scenes. By representing text and audio as gloss notations that omit complex semantic nuances, a better alignment with images can potentially be achieved. This study explores the feasibility of this idea: we first propose the first Langue2Gloss model and then integrate it into the multimodal model UniBriVL for joint training. To strengthen the adaptability of gloss with text/audio and overcome the efficiency and instability issues in multimodal training, we propose a DS-Net (Data-Pair Selection…
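As a rough illustration of what "gloss notation" means here, the sketch below reduces a sentence to uppercase content words by dropping a small set of function words and intensifiers. It is a toy, rule-based stand-in and not the paper's Langue2Gloss model, which is a trained model whose details are not given in this summary; the name `to_gloss` and the `FUNCTION_WORDS` list are hypothetical and chosen only for this example.

```python
# Toy illustration only: a rule-based stand-in, NOT the paper's Langue2Gloss model.
# FUNCTION_WORDS and to_gloss are hypothetical names invented for this sketch.
FUNCTION_WORDS = {
    "a", "an", "the", "is", "are", "was", "were",
    "very", "quite", "really", "of", "in", "on", "at", "to", "and",
}


def to_gloss(sentence: str) -> list[str]:
    """Reduce a sentence to uppercase content-word tokens (a crude gloss)."""
    # Lowercase each whitespace-separated token and strip surrounding punctuation.
    tokens = [word.strip(".,!?;:").lower() for word in sentence.split()]
    # Keep only non-empty tokens that are not in the function-word list.
    return [tok.upper() for tok in tokens if tok and tok not in FUNCTION_WORDS]


print(to_gloss("The dog is running very quickly on the beach."))
```

The point of the example is only that the gloss sequence is shorter and closer to a list of depicted concepts than the original sentence, which is the intuition the abstract gives for why glosses might align more easily with static images.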

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Handwritten Text Recognition Techniques