TL;DR
EmoSign introduces a dataset of 200 ASL videos with sentiment and emotion labels, supporting research and benchmarking in emotion recognition for sign language understanding.
Contribution
This paper presents EmoSign, the first dataset with sentiment and emotion annotations for ASL videos, facilitating advances in multimodal emotion recognition research.
Findings
Baseline models demonstrate the dataset's utility for sentiment and emotion classification.
The dataset provides a new benchmark for multimodal emotion recognition in sign language.
Annotations by Deaf ASL signers ensure cultural and linguistic relevance.
Abstract
Unlike spoken languages where the use of prosodic features to convey emotion is well studied, indicators of emotion in sign language remain poorly understood, creating communication barriers in critical settings. Sign languages present unique challenges as facial expressions and hand movements simultaneously serve both grammatical and emotional functions. To address this gap, we introduce EmoSign, the first sign video dataset containing sentiment and emotion labels for 200 American Sign Language (ASL) videos. We also collect open-ended descriptions of emotion cues. Annotations were done by 3 Deaf ASL signers with professional interpretation experience. Alongside the annotations, we include baseline models for sentiment and emotion classification. This dataset not only addresses a critical gap in existing sign language research but also establishes a new benchmark for understanding model…
Peer Reviews
Decision · Submitted to ICLR 2026
This is a relatively novel task that's especially interesting because facial expressions in sign languages are indeed often misunderstood (though I think typically in the opposite direction of the LLMs; I've seen people think signers look angry). The paper is presented well and written in a way that meets best practices in the sign language field, including paying attention to data subject/annotator qualifications (Deaf native signers).
The two biggest problems with the paper to me are small dataset size and lack of motivation/insufficient baselines.
1. 16 minutes / 200 utterances is very small for an eval, when that artifact is essentially the contribution of the paper. (To be fair, if you're not hill-climbing against it, having some moderate amount of noise isn't terrible.) But the paper isn't framed as an "eval", it's a "dataset". And you don't see that it's 16 minutes long until Section 3 of the paper.
2. The benchmark/ta
Emotion expression in sign language translation is a long-standing topic. This paper establishes an emotion detection benchmark for ASL and evaluates the performance of four popular LMs on their benchmark. The benchmark also comes with three different tasks and corresponding well-designed evaluation metrics with different input settings, including video, caption, and video+caption. The paper is easy to understand.
1. The dataset is relatively small. And as all the data comes from only one public dataset, it may have domain issues. As such, the effectiveness of the benchmark is questionable.
2. The paper discusses multimodal models in the abstract and introduction sections, but only evaluates several VLMs in the experiments. In SLT, many works have validated the effectiveness of facial input as well as body movement input. Should these works be included in the scope?
3. Continuing from 1 and 2, only four tested mo
1. Novelty in Data Collection: The EmoSign dataset is the first to focus specifically on emotion and sentiment analysis within ASL, providing an essential resource for the field.
2. Benchmarking Potential: The dataset presents a clear baseline for multimodal emotion recognition models, which is a useful resource for researchers looking to build on this area.
3. Multimodal Approach: The dataset takes into account both visual cues (e.g., facial expressions and hand gestures) and sentiment labels
1. Lack of Clear Novelty: While EmoSign is presented as a novel contribution, similar datasets (e.g., FePh) already exist in the domain of ASL emotion recognition. The paper does not clearly explain how EmoSign offers a significant advancement over these existing resources.
2. Inconsistency in Data Collection: The authors argue against using artificially recorded videos due to concerns over the "realism" of emotion expression, yet the dataset is based on pre-recorded videos from existing datase
