Bridging the Reasoning Gap in Vietnamese with Small Language Models via Test-Time Scaling
Bui The Trung, Do Minh Duc, Nguyen Van Vinh, Bui Nguyen Quoc Trinh

TL;DR
This paper explores test-time scaling and supervised fine-tuning to improve Vietnamese-language reasoning in Small Language Models, reporting performance gains and analyzing trade-offs among prompting strategies.
Contribution
The paper introduces two resources, the Vi-S1K reasoning dataset and the Vi-Elementary-Bench benchmark, and shows that supervised fine-tuning combined with test-time scaling enhances reasoning in resource-constrained models.
Findings
Supervised fine-tuning improves explanation quality by 77%.
The base model has strong latent knowledge (accuracy 4.05/5.00) but suffers from a severe "formatting gap" in how it communicates answers.
Structured prompting can degrade performance due to cognitive load.
Abstract
The democratization of ubiquitous AI hinges on deploying sophisticated reasoning capabilities on resource-constrained devices. However, Small Language Models (SLMs) often face a "reasoning gap", particularly in non-English languages like Vietnamese, where they struggle to maintain coherent chains of thought. This paper investigates Test-Time Scaling strategies for the Qwen3-1.7B architecture within the context of Vietnamese Elementary Mathematics. We introduce Vi-S1K, a high-fidelity reasoning dataset localized via a Gemini 2.5 Flash-Lite powered pipeline, and Vi-Elementary-Bench, a dual-resource benchmark for rigorous evaluation. Using an LLM-as-a-Judge protocol, we reveal that the base model possesses robust latent knowledge (Accuracy: 4.05/5.00) but suffers from a severe "formatting gap" in communication. Supervised Fine-Tuning (SFT) acts as a critical "reasoning unlocker", yielding…
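The abstract reports judge scores on a 5-point scale (for example, Accuracy: 4.05/5.00). As a minimal sketch of how per-item scores from an LLM-as-a-Judge protocol might be averaged into such a figure, consider the snippet below. The rubric names, the record layout, and the function `aggregate_judge_scores` are illustrative assumptions rather than the authors' actual pipeline, and the sample scores are made up.

```python
from statistics import mean

# Hypothetical rubric: only "accuracy" is named in the abstract;
# "explanation" is an assumed second criterion for illustration.
CRITERIA = ("accuracy", "explanation")


def aggregate_judge_scores(records):
    """Average per-criterion judge scores (each on a 1-5 scale) over all items."""
    collected = {criterion: [] for criterion in CRITERIA}
    for record in records:
        for criterion in CRITERIA:
            score = record[criterion]
            # Reject scores outside the assumed 1-5 rubric range.
            if not 1 <= score <= 5:
                raise ValueError(f"score out of range for {criterion}: {score}")
            collected[criterion].append(score)
    return {criterion: round(mean(scores), 2) for criterion, scores in collected.items()}


# Made-up judge outputs, not results from the paper.
sample = [
    {"accuracy": 5, "explanation": 2},
    {"accuracy": 4, "explanation": 3},
    {"accuracy": 3, "explanation": 1},
]
print(aggregate_judge_scores(sample))
```

Keeping the criteria separate, rather than collapsing them into one number, is what makes a gap between what a model knows and how well it explains itself visible in the first place.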
