# Multimodal Behavioral Sensors for Lie Detection: Integrating Visual, Auditory, and Generative Reasoning Cues

**Authors:** Daniel Grabowski, Kamila Łuczaj, Khalid Saeed

PMC · DOI: 10.3390/s25196086 · 2025-10-02

## TL;DR

This paper introduces a lie detection system that combines visual, audio, and language cues to improve accuracy and explainability in deception analysis.

## Contribution

A novel multimodal framework using ViViT, HuBERT, and GPT-5 for interpretable lie detection with chain-of-thought reasoning.

## Key findings

- The ViViT-based visual model achieved 74.4% accuracy in detecting deception.
- Multimodal fusion and CoT-based reasoning improved classification accuracy and interpretability.
- GPT-5-based prompt-level fusion enabled zero-shot inference and explainable AI outputs.

## Abstract

What are the main findings?

- A multimodal deception detection framework combining visual, audio, and language-based reasoning achieved high accuracy on the DOLOS dataset.
- The ViViT-based visual model reached 74.4% accuracy, while HuBERT audio classification showed strong performance on prosodic cues.

What is the implication of the main finding?

- Multimodal fusion enhances robustness and interpretability in behavioral biometrics for deception analysis.
- Language-guided approaches such as GPT-5 prompt-level fusion provide explainable AI outputs, facilitating trust and real-world applicability.

Advances in multimodal artificial intelligence enable new sensor-inspired approaches to lie detection by combining behavioral perception with generative reasoning. This study presents a deception detection framework that integrates deep video and audio processing with large language models guided by chain-of-thought (CoT) prompting. We interpret neural architectures such as ViViT (for video) and HuBERT (for speech) as digital behavioral sensors that extract implicit emotional and cognitive cues, including micro-expressions, vocal stress, and timing irregularities. We further incorporate a GPT-5-based prompt-level fusion approach for video–language–emotion alignment and zero-shot inference. This method jointly processes visual frames, textual transcripts, and emotion recognition outputs, enabling the system to generate interpretable deception hypotheses without any task-specific fine-tuning. Facial expressions are treated as high-resolution affective signals captured via visual sensors, while audio encodes prosodic markers of stress. Our experimental setup is based on the DOLOS dataset, which provides high-quality multimodal recordings of deceptive and truthful behavior. We also evaluate a continual learning setup that transfers emotional understanding to deception classification. Results indicate that multimodal fusion and CoT-based reasoning increase classification accuracy and interpretability. The proposed system bridges the gap between raw behavioral data and semantic inference, laying a foundation for AI-driven lie detection with interpretable sensor analogues.
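The prompt-level fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ClipEvidence` container, its field names, the prompt wording, and the `Verdict:` output convention are all hypothetical, and sampled frames are represented here by short text notes rather than images. The sketch only shows how a transcript, emotion recognition outputs, and visual observations might be merged into one zero-shot chain-of-thought prompt, and how a final label could be read back from a model response. No model call is made.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipEvidence:
    """Per-clip inputs to the prompt. All field names are hypothetical."""
    transcript: str
    emotion_labels: List[str]  # assumed output of an emotion recognizer
    frame_notes: List[str]     # assumed text notes standing in for sampled frames


def build_cot_prompt(evidence: ClipEvidence) -> str:
    """Merge the three modalities into one zero-shot chain-of-thought prompt."""
    emotions = ", ".join(evidence.emotion_labels) or "none detected"
    frames = "\n".join(f"- {note}" for note in evidence.frame_notes) or "- (no frames)"
    return (
        "You are analysing a short video clip for signs of deception.\n"
        f"Transcript:\n{evidence.transcript}\n\n"
        f"Detected emotions: {emotions}\n\n"
        f"Visual observations:\n{frames}\n\n"
        "Reason step by step about the verbal, vocal, and facial cues, "
        "then finish with one line: 'Verdict: truthful' or 'Verdict: deceptive'."
    )


def parse_verdict(response: str) -> str:
    """Read the last 'Verdict:' line of a response; return 'unknown' if absent."""
    for line in reversed(response.strip().splitlines()):
        lowered = line.strip().lower()
        if lowered.startswith("verdict:"):
            label = lowered.split(":", 1)[1].strip().rstrip(".")
            if label in ("truthful", "deceptive"):
                return label
    return "unknown"
```

In this sketch the reasoning text that precedes the verdict line would serve as the interpretable explanation, while `parse_verdict` supplies the label used for scoring. Keeping the two separate means the explanation can be shown to a reviewer without affecting how accuracy is computed.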

## Full-text entities

- **Diseases:** eye contact aversion (MESH:D020018), ACC (MESH:D004476), anxiety (MESH:D001007), injury to (MESH:D014947), AI (MESH:C538142)
- **Chemicals:** GPT-5 (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12526670/full.md

---
Source: https://tomesphere.com/paper/PMC12526670