An Architecture-Led Hybrid Report on Body Language Detection Project
Thomson Tong, Diba Darooneh

TL;DR
This paper analyzes two vision-language models within a video-to-artifact pipeline for body language detection, emphasizing architectural properties, system constraints, and practical implementation considerations.
Contribution
It provides an architecture-led analysis linking model design to system implementation and evaluation challenges in body language detection.
Findings
Shared multimodal foundation identified (visual tokenization, Transformer attention, instruction following)
Structured outputs can be syntactically valid but semantically incorrect
Schema validation is structural, not geometric
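To make the last two findings concrete, here is a minimal sketch of a detection that passes a structural check but fails a geometric one. The field names `bbox` and `emotion`, the `[x1, y1, x2, y2]` corner ordering, and both helper functions are assumptions made for illustration; they are not taken from the BodyLanguageDetection repository or its schema.

```python
from typing import Any


def is_structurally_valid(det: Any) -> bool:
    """Structural check: required keys exist and have the expected types.

    The keys "bbox" and "emotion" are hypothetical, not the repository's schema.
    """
    if not isinstance(det, dict):
        return False
    bbox = det.get("bbox")
    if not isinstance(bbox, list) or len(bbox) != 4:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox):
        return False
    return isinstance(det.get("emotion"), str)


def is_geometrically_valid(det: dict, width: int, height: int) -> bool:
    """Geometric check: the box has positive area and lies inside the frame.

    Assumes pixel-space corners ordered as [x1, y1, x2, y2].
    """
    x1, y1, x2, y2 = det["bbox"]
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


# Well-formed output whose corners are inverted: the structure is fine, the box is not.
detection = {"bbox": [500, 400, 100, 50], "emotion": "happy"}
structural_ok = is_structurally_valid(detection)
geometric_ok = is_geometrically_valid(detection, width=640, height=480)
```

Under these assumptions the structural check accepts the detection while the geometric check rejects it, so a pipeline that needs usable boxes would have to run a check of the second kind in addition to schema validation.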
Abstract
This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be…
Taxonomy
Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Human Pose and Action Recognition
