Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models

Vamsi Krishna Mulukutla; Sai Supriya Pavarala; Srinivasa Raju Rudraraju; Sridevi Bonthu

arXiv:2508.13524·cs.CV·August 20, 2025

TL;DR

This paper empirically compares open-source vision-language models (VLMs) with traditional deep learning models for facial emotion recognition (FER) on FER-2013. The traditional models outperform the VLMs on this low-quality data, which highlights the limitations of VLMs on noisy, low-resolution images.

Contribution

The paper introduces a novel pipeline that combines GFPGAN-based image restoration with FER evaluation, and it benchmarks VLMs (Phi-3.5 Vision, CLIP) against traditional models (VGG19, ResNet-50, EfficientNet-B0) on FER-2013.

Findings

01

Traditional models reach the highest accuracy: EfficientNet-B0 achieves 86.44% and ResNet-50 achieves 85.72%.

02

VLMs underperform the traditional models: CLIP achieves 64.07% accuracy and Phi-3.5 Vision achieves 51.66%.

03

The study offers a detailed computational cost analysis for practical deployment.

Abstract

Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset, which contains 35,887 low-resolution grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In…
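The restore-then-evaluate idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: `evaluate`, `restore`, and `classify` are hypothetical names, and the sketch only assumes that restoration and classification can each be wrapped as a callable. In a real run, `restore` would wrap a GFPGAN call and `classify` would wrap a VLM or a CNN; here both are toy stand-ins so the example runs without any model.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple, TypeVar

Image = TypeVar("Image")

# The seven emotion classes of FER-2013.
EMOTIONS = ("angry", "disgust", "fear", "happy", "sad", "surprise", "neutral")


def evaluate(
    samples: Iterable[Tuple[Image, str]],
    restore: Callable[[Image], Image],
    classify: Callable[[Image], str],
) -> Tuple[float, Dict[str, float]]:
    """Restore each image, classify it, and return (overall, per-class) accuracy."""
    correct: Counter = Counter()
    total: Counter = Counter()
    for image, label in samples:
        if label not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {label!r}")
        total[label] += 1
        # Hypothetical hooks: restoration runs before the classifier sees the image.
        if classify(restore(image)) == label:
            correct[label] += 1
    if not total:
        raise ValueError("no samples to evaluate")
    overall = sum(correct.values()) / sum(total.values())
    per_class = {label: correct[label] / total[label] for label in total}
    return overall, per_class


if __name__ == "__main__":
    # Toy stand-ins: strings instead of images, identity restoration,
    # and a classifier that always answers "happy".
    toy = [("img0", "happy"), ("img1", "sad"), ("img2", "happy"), ("img3", "fear")]
    overall, per_class = evaluate(
        toy, restore=lambda img: img, classify=lambda img: "happy"
    )
    print(f"overall accuracy: {overall:.2%}")
    print(per_class)
```

Keeping `restore` and `classify` as plain callables means the same loop can score every model with and without restoration (pass an identity function to skip it), which is one way to keep such a comparison like-for-like.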
