An Architecture-Led Hybrid Report on Body Language Detection Project

Thomson Tong; Diba Darooneh

arXiv:2512.23028·cs.CV·December 30, 2025

TL;DR

This paper analyzes two vision-language models within a video-to-artifact pipeline for body language detection, emphasizing architectural properties, system constraints, and practical implementation considerations.

Contribution

It provides an architecture-led analysis linking model design to system implementation and evaluation challenges in body language detection.

Findings

01  Shared multimodal foundation identified (visual tokenization, Transformer attention, instruction following)

02  Structured outputs can be syntactically valid but semantically incorrect

03  Schema validation is structural, not geometric
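
The gap between structural and geometric validity can be illustrated with a minimal sketch. It is not the repository's validator: the field names `bbox` and `emotion`, the `[x1, y1, x2, y2]` pixel layout, the 640x480 frame size, and both helper functions are assumptions made for illustration only.

```python
import json


def is_structurally_valid(detections):
    """Structural check only: keys and types, with no knowledge of the frame.

    Assumed (hypothetical) shape: a list of objects, each with a 4-number
    "bbox" and a string "emotion".
    """
    if not isinstance(detections, list):
        return False
    for det in detections:
        if not isinstance(det, dict):
            return False
        bbox = det.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4):
            return False
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox):
            return False
        if not isinstance(det.get("emotion"), str):
            return False
    return True


def geometric_problems(bbox, frame_width, frame_height):
    """Geometric check: does the box make sense inside this frame?"""
    x1, y1, x2, y2 = bbox
    problems = []
    if x1 >= x2 or y1 >= y2:
        problems.append("degenerate or inverted box")
    if x1 < 0 or y1 < 0 or x2 > frame_width or y2 > frame_height:
        problems.append("box extends outside the frame")
    return problems


# A made-up model response: well-formed JSON with the right keys and types,
# but the box is inverted and partly outside an assumed 640x480 frame.
raw = '[{"bbox": [900, 50, 700, 400], "emotion": "happy"}]'
detections = json.loads(raw)

structurally_ok = is_structurally_valid(detections)
problems = geometric_problems(detections[0]["bbox"], frame_width=640, frame_height=480)

print(structurally_ok)
print(problems)
```

Under these assumptions the response passes the structural check while failing both geometric checks. Even a box that passes both may still not cover a person, so neither check addresses the semantic correctness named in finding 02.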

Abstract

This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be…
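
The sample, detect, and validate stages described in the abstract can be sketched as a small loop. This is an illustrative outline under stated assumptions, not the BodyLanguageDetection implementation: `sample_frame_indices`, `run_pipeline`, the fixed-stride sampling rule, and the stand-in `detect` and `validate` callables are all hypothetical, and no real VLM is called.

```python
from typing import Any, Callable, Dict, List


def sample_frame_indices(total_frames: int, native_fps: float, sample_fps: float) -> List[int]:
    """Pick frame indices at a fixed stride approximating sample_fps.

    Fixed-stride sampling is an assumption; the report only says that
    the system samples video frames.
    """
    if total_frames <= 0 or native_fps <= 0 or sample_fps <= 0:
        return []
    stride = max(1, round(native_fps / sample_fps))
    return list(range(0, total_frames, stride))


def run_pipeline(
    frame_indices: List[int],
    detect: Callable[[int], Any],
    validate: Callable[[Any], bool],
) -> Dict[int, Any]:
    """Run detection per sampled frame and keep only outputs that validate."""
    results: Dict[int, Any] = {}
    for idx in frame_indices:
        output = detect(idx)      # stand-in for prompting a VLM on frame idx
        if validate(output):      # structural check only
            results[idx] = output
    return results


# Stand-in detector: returns a malformed output for frame 3 to show filtering.
def fake_detect(idx: int) -> Any:
    if idx == 3:
        return "not a list"
    return [{"bbox": [0, 0, 10, 10], "emotion": "neutral"}]


indices = sample_frame_indices(total_frames=10, native_fps=30.0, sample_fps=10.0)
kept = run_pipeline(indices, fake_detect, validate=lambda out: isinstance(out, list))

print(indices)
print(sorted(kept))
```

Passing `detect` and `validate` as callables keeps the loop independent of any particular model or schema, so a real VLM client and validator could be substituted without changing the control flow.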

Taxonomy

Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Human Pose and Action Recognition