EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language

MIT Media Lab¹, National University of Singapore², The University of Tokyo³, Gallaudet University⁴

*Indicates Equal Contribution

Abstract

Unlike spoken languages where the use of prosodic features to convey emotion is well studied, indicators of emotion in sign language remain poorly understood, creating communication barriers in critical settings. Sign languages present unique challenges as facial expressions and hand movements simultaneously serve both grammatical and emotional functions. To address this gap, we introduce EmoSign, the first sign video dataset containing sentiment and emotion labels for 200 American Sign Language (ASL) videos. We also collect open-ended descriptions of emotion cues. Annotations were done by 3 Deaf ASL signers with professional interpretation experience. Alongside the annotations, we include baseline models for sentiment and emotion classification. This dataset not only addresses a critical gap in existing sign language research but also establishes a new benchmark for understanding model capabilities in multimodal emotion recognition for sign languages. The dataset is made available at Hugging Face.

Dataset Overview

EmoSign is the first comprehensive dataset specifically designed for studying emotional expression in American Sign Language (ASL). The dataset addresses a critical gap in sign language research by providing detailed emotion annotations from native ASL signers.

  • 200 ASL video clips: average duration 4.8 seconds, total duration ~16 minutes
  • 3 Deaf ASL annotators: professional interpreters with native ASL fluency
  • 10+1 emotion categories: 10 specific emotions plus an overall sentiment rating

Key Features:

  • Comprehensive Annotations: Each video includes sentiment ratings on a 7-point scale, presence and intensity ratings for 10 emotion categories, and detailed descriptions of emotion cues
  • Expert Annotation: All labels provided by Deaf ASL signers with professional interpretation experience
  • Multimodal Focus: Captures both manual (hand movements) and non-manual (facial expressions, body language) components of emotional expression
  • Baseline Models: Includes evaluation of state-of-the-art multimodal language models for sentiment analysis and emotion classification
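
For quick inspection, the annotations can be loaded directly from the Hugging Face Hub. The snippet below is a minimal sketch using the standard datasets library; it assumes only the repository id given in the BibTeX entry and simply prints whatever splits and columns the dataset card actually defines.

from datasets import load_dataset

# Minimal sketch: pull EmoSign from the Hugging Face Hub and inspect its schema.
# The repository id comes from the paper's BibTeX note; no column names are assumed here.
ds = load_dataset("catfang/emosign")
print(ds)                                  # available splits and features

first_split = next(iter(ds))               # e.g., "train"
print(ds[first_split].column_names)        # fields as defined by the dataset card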

Methodology

Data Collection & Selection

We built EmoSign using videos from the ASLLRP (American Sign Language Linguistic Research Project) continuous signing corpus. Videos were pre-selected based on emotional expressiveness using VADER sentiment analysis of text captions, then manually curated to ensure a balanced representation of emotions.
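
As an illustration of the pre-selection step, the sketch below scores each clip's English caption with VADER and flags clips whose compound score is strongly polar. The 0.4 threshold and the toy captions are assumptions for illustration only, not the criteria used to build EmoSign.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Illustrative pre-selection by caption sentiment. The 0.4 threshold is an
# assumption for this sketch, not the value used in the actual curation.
analyzer = SentimentIntensityAnalyzer()

def is_expressive(caption: str, threshold: float = 0.4) -> bool:
    """Flag captions whose VADER compound score is strongly positive or negative."""
    compound = analyzer.polarity_scores(caption)["compound"]
    return abs(compound) >= threshold

captions = {
    "clip_001": "I was so thrilled when she finally arrived!",
    "clip_002": "The meeting is scheduled for Tuesday.",
}
candidates = [clip for clip, text in captions.items() if is_expressive(text)]
print(candidates)   # clips passed on to manual curation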

Annotation Process

Three Deaf ASL signers with professional interpretation experience annotated each video through a structured process (a sketch of a possible record layout follows this list):

  • Sentiment Analysis: Overall sentiment rating on a 7-point scale (-3 to +3)
  • Emotion Classification: Presence and intensity ratings (0-3) for 10 emotion categories: joy, excited, surprise (positive), surprise (negative), worry, sadness, fear, disgust, frustration, and anger
  • Cue Description: Open-ended descriptions of specific visual cues that led to their emotion assessments
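
One hypothetical way to hold a single annotator's labels for one clip is sketched below, mirroring the scales described above; the field names and validation checks are illustrative assumptions, not the dataset's published schema.

from dataclasses import dataclass, field
from typing import Dict

# The 10 emotion categories annotated in EmoSign.
EMOTIONS = [
    "joy", "excited", "surprise (positive)", "surprise (negative)",
    "worry", "sadness", "fear", "disgust", "frustration", "anger",
]

@dataclass
class ClipAnnotation:
    """One annotator's labels for one clip (illustrative layout, not the real schema)."""
    video_id: str
    sentiment: int                                           # overall sentiment, -3 to +3
    emotions: Dict[str, int] = field(default_factory=dict)   # intensity 0 (absent) to 3
    cue_description: str = ""                                 # open-ended description of cues

    def __post_init__(self):
        assert -3 <= self.sentiment <= 3, "sentiment must be on the 7-point scale"
        assert set(self.emotions) <= set(EMOTIONS), "unknown emotion category"
        assert all(0 <= v <= 3 for v in self.emotions.values()), "intensity must be 0-3"

ann = ClipAnnotation(
    video_id="clip_001",
    sentiment=2,
    emotions={"joy": 3, "excited": 2},
    cue_description="raised eyebrows, enlarged signing space, repeated emphatic movement",
)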

Emotion Categories

Our emotion framework builds on Ekman's basic emotions and the circumplex model of affect, expanded to capture the rich emotional expressions found in sign language. The categories were informed by prior emotion recognition datasets and pilot testing with ASL signers.

Quality Assurance

Inter-annotator agreement was measured using Krippendorff's alpha, with an average score of 0.593 across all labels. Positive emotions showed higher agreement than negative emotions, with sentiment analysis achieving α=0.738.
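
For reference, the sketch below shows how such agreement scores can be computed with the open-source krippendorff package, treating the 7-point sentiment scale as ordinal data; the ratings here are made-up toy values, not the study's annotations.

import numpy as np
import krippendorff

# Toy reliability matrix: rows are the 3 annotators, columns are clips,
# values are 7-point sentiment ratings (-3 to +3); np.nan would mark a missing rating.
ratings = np.array([
    [ 2.0, 1.0, -1.0, 0.0, 3.0],
    [ 2.0, 0.0, -2.0, 0.0, 3.0],
    [ 1.0, 1.0, -1.0, 1.0, 2.0],
])

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha (toy sentiment data): {alpha:.3f}")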

Key Contributions

First ASL Emotion Dataset

EmoSign is the first dedicated dataset for studying emotional expression in American Sign Language, filling a critical gap in multimodal emotion recognition research.

Expert Annotations

All emotion labels provided by Deaf ASL signers with professional interpretation experience, ensuring cultural and linguistic authenticity.

Emotion Cue Documentation

Detailed descriptions of how emotions manifest in ASL through manual and non-manual components, providing insights for future research.

Model Benchmarks

Comprehensive evaluation of state-of-the-art multimodal models, revealing significant limitations in sign language emotion understanding.

Impact

This work establishes a new benchmark for multimodal AI systems and provides crucial insights for developing more emotionally aware sign language technologies, with applications in education, healthcare, and accessibility.

Benchmark Results

We evaluated several state-of-the-art multimodal large language models (MLLMs) on EmoSign to establish baseline performance for sentiment analysis and emotion classification tasks.

Models Evaluated

We benchmarked four multimodal models (GPT-4o, AffectGPT, Qwen2.5-VL, and MiniGPT4), each under two input conditions: video only and video plus the English caption. A hearing annotator who is not fluent in ASL provides a human reference point for the video-only condition.

Key Findings

Video-Only Performance

Models struggle with emotion recognition from visual cues alone, with most showing strong biases toward neutral or positive predictions.

  • GPT-4o: 40.72% wAcc (3-class sentiment), the best-performing model without captions
  • Limited emotion recognition capabilities from video alone

Video + Caption Performance

Adding text captions significantly improves model performance across all tasks.

  • AffectGPT: 56.18% wAcc (3-class sentiment)
  • GPT-4o: 35.97% wAcc (emotion classification)
  • Models can better distinguish emotions

Detailed Results

Sentiment Analysis Results

Model           Modality          3-class wAcc   3-class wF1   7-class wAcc   7-class wF1
GPT-4o          Video only        40.72%         24.43%        19.81%         5.97%
GPT-4o          Video + Caption   52.13%         76.72%        22.89%         26.35%
AffectGPT       Video only        33.33%         0.04%         14.29%         0.04%
AffectGPT       Video + Caption   56.18%         64.37%        21.02%         16.13%
Qwen2.5-VL      Video only        27.34%         16.47%        10.26%         2.44%
Qwen2.5-VL      Video + Caption   41.10%         54.29%        15.84%         14.51%
MiniGPT4        Video only        34.68%         40.00%        14.46%         13.03%
MiniGPT4        Video + Caption   21.65%         36.89%        9.76%          12.18%
Hearing Person  Video only        55.64%         57.64%        25.48%         21.39%
Single-Label Emotion Classification Results

Model        Modality          Weighted Accuracy   Weighted F1
GPT-4o       Video only        11.50%              20.76%
GPT-4o       Video + Caption   35.97%              55.09%
AffectGPT    Video only        12.62%              11.03%
AffectGPT    Video + Caption   30.17%              47.77%
Qwen2.5-VL   Video only        14.39%              18.53%
Qwen2.5-VL   Video + Caption   34.96%              44.67%
MiniGPT4     Video only        13.01%              22.02%
MiniGPT4     Video + Caption   23.56%              35.89%

Key Observation: All models show significant performance improvement when provided with text captions alongside video input, highlighting the current limitations of pure visual emotion recognition in sign language.
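
For readers reproducing the headline numbers, the sketch below shows one common way to compute class-weighted scores with scikit-learn on toy labels. Reading wAcc as balanced accuracy (mean per-class recall) and wF1 as support-weighted F1 is an assumption of this sketch, not a definition taken from the paper.

from sklearn.metrics import balanced_accuracy_score, f1_score

# Toy 3-class sentiment labels: -1 = negative, 0 = neutral, 1 = positive.
y_true = [1, 0, -1, 1, 0, -1, 1, 1]
y_pred = [1, 1, -1, 0, 0, -1, 1, 0]

# Assumption: wAcc is interpreted here as balanced accuracy (mean per-class recall),
# and wF1 as the support-weighted F1 score.
w_acc = balanced_accuracy_score(y_true, y_pred)
w_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"wAcc = {w_acc:.2%}, wF1 = {w_f1:.2%}")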

Emotion Cue Analysis

Our analysis of annotator descriptions revealed three key types of emotion cues in ASL:

  1. Non-manual markers: Facial expressions, head movements, mouth shapes, and body positioning
  2. Manual modifications: Changes in sign size, speed, repetition, and emphasis
  3. Contextual cues: Role shifts, eye gaze changes, and spatial positioning for narrative perspective

Human Baseline: A hearing annotator who is not fluent in ASL achieved 55.64% accuracy on 3-class sentiment, highlighting the challenge of cross-modal emotion understanding.

BibTeX

@inproceedings{chua2025emosign,
  title={EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language},
  author={Chua, Phoebe and Fang, Cathy Mengying and Ohkawa, Takehiko and Kushalnagar, Raja and Nanayakkara, Suranga and Maes, Pattie},
  booktitle={39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks},
  year={2025},
  note={Available at \url{https://huggingface.co/datasets/catfang/emosign}}
}