Understanding Pedestrian Gesture Misrecognition: Insights from Vision-Language Model Reasoning

Tram Thi Minh Tran; Xinyan Yu; Callum Parker; Julie Stephany Berrio Perez; Stewart Worrall; Martin Tomitsch

arXiv:2508.06801·cs.HC·August 19, 2025

TL;DR

This paper uses GPT-4V as a diagnostic tool to analyze pedestrian gesture misrecognition in autonomous vehicle interactions, revealing key factors affecting recognition accuracy and suggesting improvements for system design.

Contribution

The paper introduces a novel diagnostic approach that uses vision-language models to understand gesture misrecognition, providing insights applicable to various human-machine interaction domains.

Findings

01 Gesture visibility and context significantly impact recognition accuracy.

02 Environmental conditions and pedestrian behavior influence misrecognition patterns.

03 Recommendations for gesture design include enhancing salience and contextual redundancy.

Abstract

Pedestrian gestures play an important role in traffic communication, particularly in interactions with autonomous vehicles (AVs), yet their subtle, ambiguous, and context-dependent nature poses persistent challenges for machine interpretation. This study investigates these challenges by using GPT-4V, a vision-language model, not as a performance benchmark but as a diagnostic tool to reveal patterns and causes of gesture misrecognition. We analysed a public dataset of pedestrian-vehicle interactions, combining manual video review with thematic analysis of the model's qualitative reasoning. This dual approach surfaced recurring factors influencing misrecognition, including gesture visibility, pedestrian behaviour, interaction context, and environmental conditions. The findings suggest practical considerations for gesture design, including the value of salience and contextual redundancy,…
