Understanding Pedestrian Gesture Misrecognition: Insights from Vision-Language Model Reasoning
Tram Thi Minh Tran, Xinyan Yu, Callum Parker, Julie Stephany Berrio Perez, Stewart Worrall, Martin Tomitsch

TL;DR
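This paper uses GPT-4V as a diagnostic tool, rather than a performance benchmark, to analyze pedestrian gesture misrecognition in autonomous vehicle interactions. It identifies recurring factors behind misrecognition (gesture visibility, pedestrian behavior, interaction context, and environmental conditions) and suggests improvements for system design.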
Contribution
It introduces a novel diagnostic approach using vision-language models to understand gesture misrecognition, providing insights applicable to various human-machine interaction domains.
Findings
Gesture visibility and interaction context are recurring factors in gesture misrecognition.
Environmental conditions and pedestrian behavior influence misrecognition patterns.
Recommendations for gesture design include enhancing salience and contextual redundancy.
Abstract
Pedestrian gestures play an important role in traffic communication, particularly in interactions with autonomous vehicles (AVs), yet their subtle, ambiguous, and context-dependent nature poses persistent challenges for machine interpretation. This study investigates these challenges by using GPT-4V, a vision-language model, not as a performance benchmark but as a diagnostic tool to reveal patterns and causes of gesture misrecognition. We analysed a public dataset of pedestrian-vehicle interactions, combining manual video review with thematic analysis of the model's qualitative reasoning. This dual approach surfaced recurring factors influencing misrecognition, including gesture visibility, pedestrian behaviour, interaction context, and environmental conditions. The findings suggest practical considerations for gesture design, including the value of salience and contextual redundancy,…
