Factor-Conditioned Speaking-Style Captioning

Atsushi Ando; Takafumi Moriya; Shota Horiguchi; Ryo Masumura

arXiv:2406.18910 · cs.CL · June 28, 2024

TL;DR

This paper introduces factor-conditioned captioning, which explicitly predicts speaking-style factors before generating a caption, to produce diverse and accurate speaking-style captions. It improves on conventional training that uses raw captions, in which syntax words are mixed with speaking-style factor terms.

Contribution

The paper proposes factor-conditioned captioning (FCC) and greedy-then-sampling (GtS) decoding to explicitly learn speaking-style factors and enhance caption diversity and accuracy.
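
The FCC idea of emitting a factor phrase before the caption can be illustrated with a minimal sketch of how a training target might be assembled. The separator token `<sep>`, the phrase layout, the helper name `build_fcc_target`, and the example factor values are all assumptions made for illustration; the paper's actual target format is not given here.

```python
from typing import Mapping

# Hypothetical separator marking the end of the factor phrase.
# The paper's actual target format may differ.
FACTOR_END = "<sep>"


def build_fcc_target(factors: Mapping[str, str], caption: str) -> str:
    """Prepend a factor phrase to the caption so the factors come first."""
    # Render each factor as "<value> <name>", e.g. "high pitch".
    factor_phrase = ", ".join(f"{value} {name}" for name, value in factors.items())
    return f"{factor_phrase} {FACTOR_END} {caption}"


# Illustrative values only, not taken from the paper's data.
example_target = build_fcc_target(
    {"gender": "female", "pitch": "high"},
    "A woman speaks in a high-pitched voice.",
)
print(example_target)
```

Because the factor phrase precedes the caption, a left-to-right decoder trained on such targets has to commit to the factors before it produces any syntax words.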

Findings

01. FCC outperforms traditional caption-based training.

02. GtS decoding improves caption diversity while maintaining style accuracy.

03. The method effectively predicts speaking-style factors in generated captions.

Abstract

This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which hinders the learning of speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender and pitch) and then generates a caption, ensuring that the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based…
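
The two-phase GtS decoding described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the `<sep>` and `<eos>` tokens, the `next_token_probs` callback interface, and the `toy_model` distribution are all hypothetical, and plain sampling from the prefix-conditioned distribution stands in for whatever sampling scheme the paper uses.

```python
import random
from typing import Callable, Dict, List, Optional

FACTOR_END = "<sep>"  # hypothetical token that closes the factor phrase
EOS = "<eos>"         # hypothetical end-of-sequence token

NextTokenFn = Callable[[List[str]], Dict[str, float]]


def greedy_then_sampling(
    next_token_probs: NextTokenFn,
    max_len: int = 30,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Decode the factor phrase greedily, then sample the caption tokens."""
    if rng is None:
        rng = random.Random()
    tokens: List[str] = []
    in_factor_phase = True
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)
        if in_factor_phase:
            # Deterministic step: take the most probable token.
            token = max(probs, key=probs.get)
        else:
            # Stochastic step: sample, conditioned on the fixed factor prefix.
            candidates = list(probs)
            weights = [probs[t] for t in candidates]
            token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == EOS:
            break
        tokens.append(token)
        if token == FACTOR_END:
            in_factor_phase = False
    return tokens


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a trained captioner; returns a next-token distribution."""
    if FACTOR_END not in prefix:
        script = ["female", "high", FACTOR_END]
        return {script[len(prefix)]: 0.7, "male": 0.3}
    caption_len = len(prefix) - prefix.index(FACTOR_END) - 1
    if caption_len >= 4:
        return {EOS: 1.0}
    return {"a": 0.25, "woman": 0.25, "speaks": 0.25, "brightly": 0.25}


print(greedy_then_sampling(toy_model, rng=random.Random(0)))
```

In this sketch, repeated calls with different random seeds always share the same factor prefix, while the tokens after `<sep>` may vary from run to run. That mirrors the stated goal of keeping the factors fixed while diversifying the caption.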

Taxonomy

Topics: Subtitles and Audiovisual Media · Natural Language Processing Techniques · Speech and dialogue systems