Grammar Induction from Visual, Speech and Text

Yu Zhao; Hao Fei; Shengqiong Wu; Meishan Zhang; Min Zhang; Tat-Seng Chua

arXiv:2410.03739·cs.CL·February 21, 2025

TL;DR

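This paper introduces VAT-GI, a novel unsupervised grammar induction task that induces constituency trees from parallel visual, speech, and text signals, along with a textless setting that relies only on visual and auditory inputs. To approach the task, it proposes VaTiora, a visual-audio-text inside-outside recursive autoencoder framework, and reports state-of-the-art results.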

Contribution

The paper proposes the first unsupervised multimodal grammar induction setup that integrates visual, speech, and text modalities (the VAT-GI task), including a textless setting, together with a new inside-outside recursive autoencoder framework (VaTiora).

Findings

01

VaTiora outperforms previous methods on benchmark datasets.

02

The proposed VaTiora framework effectively combines multimodal signals.

03

VaTiora achieves state-of-the-art performance on the VAT-GI task.

Abstract

Grammar induction could benefit from rich heterogeneous signals, such as text, vision, and acoustics. In the process, features from distinct modalities essentially serve complementary roles to each other. With this intuition, this work introduces a novel unsupervised visual-audio-text grammar induction task (named VAT-GI), which induces constituent grammar trees from parallel images, text, and speech inputs. Inspired by the fact that language grammar natively exists beyond text, we argue that text need not be the predominant modality in grammar induction. We therefore further introduce a textless setting of VAT-GI, in which the task relies solely on visual and auditory inputs. To approach the task, we propose a visual-audio-text inside-outside recursive autoencoder (VaTiora) framework, which leverages rich modal-specific and complementary features…
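
To make the notion of inducing a constituent tree concrete, the following is a minimal sketch of chart-based (CKY-style) decoding: given a score for every span, it returns the highest-scoring binary bracketing. It is not the VaTiora model. The function `best_tree` and the `span_score` callback are hypothetical names introduced for illustration, and the toy scores stand in for whatever a trained model would produce.

```python
from functools import lru_cache


def best_tree(words, span_score):
    """Return (score, tree) for the highest-scoring binary bracketing.

    `span_score(i, j)` is a hypothetical scorer for the span words[i:j];
    in a real system it would come from a learned model.
    """
    if not words:
        raise ValueError("words must be non-empty")

    @lru_cache(maxsize=None)
    def solve(i, j):
        # A single word is a leaf and contributes no span score.
        if j - i == 1:
            return 0.0, words[i]
        best = None
        # Try every split point and keep the best pair of subtrees.
        for k in range(i + 1, j):
            left_score, left = solve(i, k)
            right_score, right = solve(k, j)
            total = left_score + right_score
            if best is None or total > best[0]:
                best = (total, (left, right))
        # Add the score of the span itself to the best split.
        return best[0] + span_score(i, j), best[1]

    return solve(0, len(words))


# Toy scorer (an assumption): only the span "a dog" is rewarded.
def toy_score(i, j):
    return 1.0 if (i, j) == (0, 2) else 0.0


score, tree = best_tree(["a", "dog", "runs"], toy_score)
print(score, tree)
```

With the toy scorer, the decoder groups "a dog" before attaching "runs". In a multimodal setting the span scores would presumably also depend on visual or acoustic features, which this sketch does not model.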

Taxonomy

Topics: Subtitles and Audiovisual Media · Speech and dialogue systems