Hidden bawls, whispers, and yelps: can text be made to sound more than just its words?
Caluã de Lacerda Pataca, Paula Dornhofer Paro Costa

TL;DR
This paper introduces a novel method to visually encode vocal prosody into text typography, enhancing the conveyance of speech nuances like emotion and emphasis in captions.
Contribution
It presents a model that maps vocal prosody features into visual typographical elements, enabling richer, more expressive captions that reflect speech nuances.
Findings
Participants identified speech-modulated typography with 65% accuracy.
Recognition accuracy did not differ significantly between animated and static text.
Participants' mental models of speech modulation varied widely.
Abstract
Whether a word was bawled, whispered, or yelped, captions will typically represent it in the same way. If they are your only way to access what is being said, subjective nuances expressed in the voice will be lost. Since so much of communication is carried by these nuances, we posit that if captions are to be used as an accurate representation of speech, embedding visual representations of paralinguistic qualities into captions could help readers use them to better understand speech beyond its mere textual content. This paper presents a model for processing vocal prosody (its loudness, pitch, and duration) and mapping it into visual dimensions of typography (respectively, font-weight, baseline shift, and letter-spacing), creating a visual representation of these lost vocal subtleties that can be embedded directly into the typographical form of text. An evaluation was carried out where…
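As a rough illustration of the mapping the abstract describes (loudness to font-weight, pitch to baseline shift, duration to letter-spacing), here is a minimal Python sketch. It is not the authors' implementation: the `Prosody` class, the function names, the assumption that each feature is already normalized to [0, 1], and every numeric range below are hypothetical choices made only for this example.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Prosody:
    """Per-word prosody; each feature is ASSUMED to be normalized to [0, 1]."""
    loudness: float
    pitch: float
    duration: float


def lerp(lo: float, hi: float, t: float) -> float:
    """Linearly interpolate between lo and hi, clamping t to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return lo + (hi - lo) * t


def to_css(p: Prosody) -> str:
    """Map prosody to inline CSS. All ranges are illustrative, not from the paper."""
    weight = round(lerp(100, 900, p.loudness))   # loudness -> font-weight
    shift = lerp(-0.3, 0.3, p.pitch)             # pitch -> baseline shift (em)
    spacing = lerp(-0.05, 0.3, p.duration)       # duration -> letter-spacing (em)
    return (
        f"font-weight:{weight};"
        f"vertical-align:{shift:.2f}em;"
        f"letter-spacing:{spacing:.2f}em"
    )


def caption_html(words: list[tuple[str, Prosody]]) -> str:
    """Wrap each word in a span styled by its prosody and join with spaces."""
    return " ".join(
        f'<span style="{to_css(p)}">{escape(w)}</span>' for w, p in words
    )


if __name__ == "__main__":
    # A quiet, low, short word followed by a loud, high, drawn-out one.
    print(caption_html([
        ("no", Prosody(loudness=0.1, pitch=0.2, duration=0.1)),
        ("way", Prosody(loudness=0.9, pitch=0.8, duration=0.9)),
    ]))
```

Rendering the output with a variable font that supports a continuous weight axis would let the weight change smoothly; with a static font family, browsers would snap to the nearest available weight.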
Taxonomy
Topics: Subtitles and Audiovisual Media · Translation Studies and Practices
