TL;DR
Humans-Junior, a 3.8B language model, matches GPT-4o on factual grounding accuracy through directed reasoning and fine-tuning, offering a cheaper alternative with potential for edge deployment.
Contribution
This paper introduces Humans-Junior, a small language model that matches GPT-4o's factual grounding accuracy using a novel combination of directed reasoning scaffolds and behavioral fine-tuning.
Findings
On the FACTS Grounding public subset (Q1–Q500), Humans-Junior scores 72.7% against GPT-4o's 73.5%, which is statistically equivalent within a ±5 percentage point margin.
Its base model, Phi-3.5-mini-instruct, is approximately 19 times cheaper than GPT-4o when both are purchased as managed APIs on Microsoft AI Foundry pricing.
Directed reasoning also improves the performance of frontier models in prompt-only settings.
Abstract
We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a ±5 pp equivalence margin. Results. On Q1–Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5–77.2) and Humans-Junior 72.7% (95% CI 68.7–76.5); the paired difference is 0.8 pp (bootstrap 95% CI […] to […]; permutation p = […]; Cohen's d = […]). TOST establishes equivalence at ±5 pp (not at ±[…] pp). When purchased as managed APIs, Humans-Junior's base model (Phi-3.5-mini-instruct) is approximately 19× less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs. estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol…
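The equivalence claim in the abstract rests on paired statistics over per-question scores. The sketch below illustrates, on synthetic data, how a paired bootstrap confidence interval and a TOST equivalence check can be computed. It is not the authors' code. The per-question outcomes, the function names `paired_bootstrap_ci` and `tost_paired`, the normal approximation inside TOST, and the tighter 0.02 margin are all assumptions made for illustration. Only the ±5 pp margin and the 500-question sample size come from the text above.

```python
import random
import statistics
from statistics import NormalDist

# Synthetic paired differences for 500 questions (assumption, not paper data):
# +1 = only model A correct, -1 = only model B correct, 0 = both agree.
diffs = [0] * 368 + [1] * 68 + [-1] * 64


def paired_bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample questions with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def tost_paired(diffs, margin, alpha=0.05):
    """Two one-sided tests (TOST) using a normal approximation.

    Returns (p_value, equivalent), where equivalence means the mean
    difference is significantly inside (-margin, +margin).
    """
    n = len(diffs)
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / n ** 0.5
    z = NormalDist()
    p_lower = 1 - z.cdf((mean + margin) / se)  # H0: mean <= -margin
    p_upper = z.cdf((mean - margin) / se)      # H0: mean >= +margin
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


lo, hi = paired_bootstrap_ci(diffs)
print(f"mean diff = {statistics.fmean(diffs):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
for margin in (0.05, 0.02):  # 0.02 is a hypothetical tighter margin
    p, ok = tost_paired(diffs, margin)
    print(f"margin ±{margin:.2f}: TOST p = {p:.4f}, equivalent = {ok}")
```

On this synthetic sample the check passes at the ±0.05 margin and fails at the tighter one. This shows why an equivalence result has to be stated together with its margin.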
