Factor-Conditioned Speaking-Style Captioning
Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura

TL;DR
This paper introduces a factor-conditioned captioning approach that explicitly models speaking-style factors to generate diverse and accurate speaking-style captions, improving over conventional training on raw captions, where syntax words obscure the speaking-style information.
Contribution
The paper proposes factor-conditioned captioning (FCC) and greedy-then-sampling (GtS) decoding to explicitly learn speaking-style factors and enhance caption diversity and accuracy.
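As a rough illustration of the FCC idea, the training target can be pictured as a factor phrase followed by the original caption. The sketch below is a minimal example under assumptions: the `<sep>` marker, the `name: value` phrase format, and the helper `build_fcc_target` are hypothetical and are not taken from the paper.

```python
from typing import Dict

# Hypothetical separator marking the end of the factor phrase (an assumption).
FACTOR_END = "<sep>"


def build_fcc_target(factors: Dict[str, str], caption: str) -> str:
    """Prefix a caption with a phrase listing its speaking-style factors."""
    # The "name: value" layout is illustrative only.
    factor_phrase = ", ".join(f"{name}: {value}" for name, value in factors.items())
    return f"{factor_phrase} {FACTOR_END} {caption}"


target = build_fcc_target(
    {"gender": "female", "pitch": "high"},
    "A woman speaks in a high-pitched voice.",
)
print(target)
```

With a target of this shape, the model has to emit the factor terms before any syntax words, which is the property the contribution describes.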
Findings
FCC outperforms traditional caption-based training.
GtS decoding improves caption diversity while maintaining style accuracy.
The method effectively predicts speaking-style factors in generated captions.
Abstract
This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.), and then generates a caption to ensure the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based…
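The greedy-then-sampling procedure described in the abstract can be sketched as a two-phase decoding loop: take the most likely token until the factor phrase ends, then sample the remaining tokens. This is a minimal sketch, not the authors' implementation. The `<sep>` and `<eos>` tokens, the `gts_decode` function, and the `toy_model` stand-in for a real captioning model are all hypothetical.

```python
import random
from typing import Callable, Dict, List

FACTOR_END = "<sep>"  # hypothetical marker ending the factor phrase
EOS = "<eos>"         # hypothetical end-of-sequence token

NextTokenFn = Callable[[List[str]], Dict[str, float]]


def gts_decode(next_token_probs: NextTokenFn, rng: random.Random,
               max_len: int = 50) -> List[str]:
    """Decode factors greedily, then sample the caption tokens."""
    tokens: List[str] = []
    sampling = False  # switches to True once the factor phrase is complete
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        if sampling:
            # Caption phase: draw the next token from the distribution.
            choices, weights = zip(*probs.items())
            tok = rng.choices(choices, weights=weights, k=1)[0]
        else:
            # Factor phase: deterministic argmax.
            tok = max(probs, key=probs.get)
        if tok == EOS:
            break
        tokens.append(tok)
        if tok == FACTOR_END:
            sampling = True
    return tokens


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Toy next-token distribution standing in for a trained model."""
    if FACTOR_END not in prefix:
        table = [
            {"female": 0.6, "male": 0.4},
            {"high-pitch": 0.7, "low-pitch": 0.3},
            {FACTOR_END: 1.0},
        ]
        return table[len(prefix)]
    n_caption = len(prefix) - prefix.index(FACTOR_END) - 1
    caption_table = [
        {"a": 0.5, "the": 0.5},
        {"woman": 0.5, "lady": 0.5},
        {"speaks": 0.5, "talks": 0.5},
    ]
    return caption_table[n_caption] if n_caption < 3 else {EOS: 1.0}


outputs = [gts_decode(toy_model, random.Random(seed)) for seed in range(20)]
print(outputs[0])
```

In this toy setup every run shares the same greedily chosen factor prefix, while the caption tokens after `<sep>` vary with the random seed.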
Taxonomy
Topics: Subtitles and Audiovisual Media · Natural Language Processing Techniques · Speech and dialogue systems
