EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech
Besher Hassan, Ibrahim Alsarraj, Musaab Hasan, Yousef Melhim, Shahem Fadi, Shahem Sultan

TL;DR
EmoAra is an integrated pipeline that preserves emotional nuance in English-to-Arabic speech translation by chaining speech emotion recognition, speech recognition, machine translation, and text-to-speech.
Contribution
The paper introduces EmoAra, a novel end-to-end system that preserves emotion in cross-lingual speech translation, integrating emotion recognition, transcription, translation, and speech synthesis.
Findings
Emotion classification F1-score of 94%
Translation BLEU score of 56, BERTScore F1 of 88.7%
Average human evaluation score of 81% on banking-domain translations
Abstract
This work presents EmoAra, an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, motivated by banking customer service where emotional context affects service quality. EmoAra integrates Speech Emotion Recognition, Automatic Speech Recognition, Machine Translation, and Text-to-Speech to process English speech and deliver an Arabic spoken output while retaining emotional nuance. The system uses a CNN-based emotion classifier, Whisper for English transcription, a fine-tuned MarianMT model for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis. Experiments report an F1-score of 94% for emotion classification, translation performance of BLEU 56 and BERTScore F1 88.7%, and an average human evaluation score of 81% on banking-domain translations. The implementation and resources are available at the accompanying GitHub repository.
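The four stages named in the abstract can be sketched as a simple chain. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `PipelineResult`, and the four stage parameters are hypothetical names, and each stage is passed in as a plain callable so that no particular model API is assumed. The abstract does not say how the detected emotion conditions the Arabic output, so the sketch only carries the emotion label alongside the translated text and audio.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    """Outputs of one pass through the (hypothetical) four-stage chain."""
    emotion: str         # label from the speech emotion recognizer
    english_text: str    # transcript of the English input audio
    arabic_text: str     # translation of the transcript
    arabic_audio: bytes  # synthesized Arabic speech


def run_pipeline(
    audio: bytes,
    classify_emotion: Callable[[bytes], str],
    transcribe: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> PipelineResult:
    """Run SER and ASR on the input audio, then MT, then TTS.

    Assumption: the stages are independent callables. In the paper they
    would correspond to the CNN classifier, Whisper, the fine-tuned
    MarianMT model, and MMS-TTS-Ara, wrapped however the caller likes.
    """
    emotion = classify_emotion(audio)      # emotion is read from the audio itself
    english_text = transcribe(audio)       # English speech -> English text
    arabic_text = translate(english_text)  # English text -> Arabic text
    arabic_audio = synthesize(arabic_text) # Arabic text -> Arabic speech
    return PipelineResult(emotion, english_text, arabic_text, arabic_audio)
```

Passing the stages as callables keeps the sketch testable with stand-in functions and leaves open how each real model is loaded and invoked.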
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis
