Video-aided Unsupervised Grammar Induction

Songyang Zhang; Linfeng Song; Lifeng Jin; Kun Xu; Dong Yu; Jiebo Luo

arXiv:2104.04369 · cs.CV · May 5, 2021

TL;DR

This paper introduces a multi-modal grammar induction model that leverages rich video features to improve unsupervised constituency parsing, outperforming previous models on multiple benchmarks.

Contribution

The paper proposes the Multi-Modal Compound PCFG (MMC-PCFG) model, which aggregates diverse video features for grammar induction, going beyond prior methods that learn from text-image pairs.

Findings

01. MMC-PCFG outperforms previous state-of-the-art systems.

02. Leveraging video information improves grammar induction accuracy.

03. The model is trained end-to-end and evaluated on multiple benchmarks.

Abstract

We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video. Existing methods of multi-modal grammar induction focus on learning syntactic grammars from text-image pairs, with promising results showing that the information from static images is useful in induction. However, videos provide even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases. In this paper, we explore rich features (e.g. action, object, scene, audio, face, OCR and speech) from videos, taking the recent Compound PCFG model as the baseline. We further propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities. Our proposed MMC-PCFG is trained end-to-end and outperforms each individual modality and previous…
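
To make the idea of aggregating features from several modalities concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the cosine similarity, and the softmax-weighted sum are all illustrative assumptions, and the vectors are assumed to be already projected into a shared space. In the actual model, the embeddings and the aggregation are learned end-to-end.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_similarity(
    span: Vector,
    video_features: Dict[str, Vector],
    modality_logits: Dict[str, float],
) -> float:
    """Hypothetical aggregation: a softmax-weighted sum of per-modality similarities.

    span            -- embedding of a candidate text span (assumed precomputed)
    video_features  -- one embedding per modality, e.g. "action", "object"
    modality_logits -- unnormalized importance of each modality (assumed given)
    """
    names = sorted(video_features)
    weights = softmax([modality_logits[name] for name in names])
    sims = [cosine(span, video_features[name]) for name in names]
    return sum(w * s for w, s in zip(weights, sims))


# Toy usage: a span that matches the "action" feature and not the "object" feature.
span_embedding = [1.0, 0.0]
features = {"action": [1.0, 0.0], "object": [0.0, 1.0]}
equal_logits = {"action": 0.0, "object": 0.0}
score = aggregate_similarity(span_embedding, features, equal_logits)
```

With equal logits the two modalities are averaged, so a span that aligns with only one of them gets a middling score; raising the logit of the matching modality pushes the score toward that modality's similarity.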

Code & Models

Repositories

Sy-Zhang/MMC-PCFG (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Human Pose and Action Recognition