Grammar Induction from Visual, Speech and Text
Yu Zhao, Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua

TL;DR
This paper introduces VAT-GI, a novel unsupervised multimodal grammar induction task that leverages visual, speech, and text signals and includes a new textless setting. It also proposes VaTiora, an inside-outside recursive autoencoder framework that achieves state-of-the-art results on the task.
Contribution
It proposes a novel unsupervised multimodal grammar induction task (VAT-GI) that integrates visual, speech, and text modalities, including a textless setting, together with a new inside-outside recursive autoencoder framework (VaTiora).
Findings
VaTiora outperforms previous methods on benchmark datasets.
The proposed VaTiora framework effectively combines multimodal signals.
VaTiora achieves state-of-the-art performance on the VAT-GI task.
Abstract
Grammar induction could benefit from rich heterogeneous signals, such as text, vision, and acoustics. In the process, features from distinct modalities essentially serve complementary roles to each other. With such intuition, this work introduces a novel unsupervised visual-audio-text grammar induction task (named VAT-GI), to induce the constituent grammar trees from parallel images, text, and speech inputs. Inspired by the fact that language grammar natively exists beyond the texts, we argue that text need not be the predominant modality in grammar induction. Thus we further introduce a textless setting of VAT-GI, wherein the task solely relies on visual and auditory inputs. To approach the task, we propose a visual-audio-text inside-outside recursive autoencoder (VaTiora) framework, which leverages rich modal-specific and complementary features…
Taxonomy
Topics: Subtitles and Audiovisual Media · Speech and dialogue systems
