When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs
Weiyan Shi, Dorien Herremans, Kenny Tsu Wei Choo

TL;DR
This paper presents TalkSketchD, a new dataset combining sketches and spontaneous speech to improve multimodal large language models' ability to interpret designer intent during early-stage ideation.
Contribution
Introduces the TalkSketchD dataset and a sketch-to-image generation study showing that augmenting sketches with concurrent speech transcripts improves intent alignment.
Findings
Sketches augmented with speech transcripts yield generated images that better match designers' self-reported intent, as judged by a reasoning MLLM.
Intent alignment improves quantitatively across form, function, and experience.
The results demonstrate the value of multimodal data in early-stage design ideation.
Abstract
Early-stage design ideation often relies on rough sketches created under time pressure, leaving much of the designer's intent implicit. In practice, designers frequently speak while sketching, verbally articulating functional goals and ideas that are difficult to express visually. We introduce TalkSketchD, a sketch-while-speaking dataset that captures spontaneous speech temporally aligned with freehand sketches during early-stage toaster ideation. To examine the dataset's value, we conduct a sketch-to-image generation study comparing sketch-only inputs with sketches augmented by concurrent speech transcripts using multimodal large language models (MLLMs). Generated images are evaluated against designers' self-reported intent using a reasoning MLLM as a judge. Quantitative results show that incorporating spontaneous speech significantly improves judged intent alignment of generated…
