Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution
Yufeng Wu

TL;DR
This study evaluates whether large language models such as GPT-5.4 can perform skill-based Behavioral Profile (BP) annotation, and argues for assessing execution skill by skill rather than treating annotation as a single task to automate.
Contribution
It introduces a skill-file-driven pipeline for BP annotation and demonstrates GPT-5.4's selective reliability in executing specific annotation skills.
Findings
GPT-5.4 achieves 0.678 accuracy in skill execution.
Human and GPT difficulty profiles are highly correlated at the skill level.
Failures in open-source models are mainly due to schema-to-skill execution issues.
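To make the skill-level comparison concrete, here is a minimal sketch of how a per-skill difficulty profile for humans and for a model could be compared. The summary does not say which correlation measure the paper uses, so Spearman rank correlation is an assumption here, and the numbers in the usage lines are toy values, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Toy per-skill scores (hypothetical): one entry per annotation skill.
human_agreement = [0.9, 0.5, 0.7]
model_accuracy = [0.8, 0.4, 0.6]
rho = spearman(human_agreement, model_accuracy)
```

A rank-based measure is used in this sketch because it asks only whether the same skills are hard for both annotators and the model, not whether their scores sit on the same scale.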
Abstract
Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is…
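The skill-file idea and the three-way operability classification described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the `SkillFile` fields, the use of raw percent agreement, and the 0.8 threshold are all hypothetical choices, since the abstract does not specify the file format, the agreement measure, or any cut-off.

```python
from dataclasses import dataclass, field


@dataclass
class SkillFile:
    """Hypothetical external definition of one BP feature (one annotation skill)."""
    feature: str
    labels: list[str]
    decision_rules: list[str] = field(default_factory=list)
    # Each example pairs a concordance line with its gold label.
    examples: list[tuple[str, str]] = field(default_factory=list)


def agreement(a: list[str], b: list[str]) -> float:
    """Raw proportion of instances on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("annotation lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def classify_skill(round1: float, round2: float, threshold: float = 0.8) -> str:
    """Map two-round agreement scores to an operability class.

    The threshold is an assumed value chosen for illustration only.
    """
    if round1 >= threshold:
        return "directly operable"
    if round2 >= threshold:
        return "recoverable under focused re-annotation"
    return "structurally underspecified"
```

For instance, a skill with agreement 0.6 in the first round and 0.85 after focused re-annotation would fall into the second class under these assumed settings, while one that stays below the threshold in both rounds would be flagged as underspecified in the schema itself.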
