From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning

Nishit Neema; Srinjoy Mukherjee; Sapan Shah; Gokul Ramakrishnan; Ganesh Venkatesh

arXiv:2510.26336 · cs.CL · October 31, 2025

TL;DR

This paper introduces ACER, a curriculum learning method that systematically strengthens large language models' domain-specific knowledge without degrading their general capabilities. Reported gains include a 5 percentage point accuracy improvement in microeconomics.

Contribution

ACER is a novel automated curriculum learning approach that synthesizes domain curricula and guides continual pretraining to improve LLMs' specialized knowledge.

Findings

01

ACER improves accuracy by 5 percentage points in microeconomics.

02

ACER achieves a 3 percentage point macro-average gain across target domains.

03

ACER enhances performance on knowledge-intensive benchmarks by over 2 points.
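
As a reading aid for the second finding, here is a minimal sketch of how a macro-average gain in percentage points is usually computed: take the per-domain accuracy difference, then average those differences with equal weight per domain. The function name `macro_average_gain`, the domain keys, and every number below are hypothetical placeholders, not results from the paper.

```python
def macro_average_gain(base_acc, tuned_acc):
    """Mean per-domain accuracy gain, in percentage points.

    Both arguments map a domain name to an accuracy expressed in percent.
    Each domain counts equally, regardless of how many questions it has.
    """
    if base_acc.keys() != tuned_acc.keys():
        raise ValueError("both runs must cover the same domains")
    gains = [tuned_acc[d] - base_acc[d] for d in base_acc]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    # Hypothetical accuracies (percent); NOT taken from the paper.
    base = {"domain_a": 40.0, "domain_b": 50.0}
    tuned = {"domain_a": 45.0, "domain_b": 51.0}
    print(macro_average_gain(base, tuned))
```

A macro average of this kind can hide uneven per-domain behavior: a large gain in one domain and a small gain in another produce the same figure as two moderate gains.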

Abstract

Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom's taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains…
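
The abstract does not spell out how the interleaved schedule orders the synthetic data, so the following is only an illustrative sketch of one possible reading, not the authors' implementation. It assumes each QA pair carries a table-of-contents position and a Bloom level, and it orders items by a diagonal sweep so that later chapters and harder cognitive levels are introduced together. The `QAItem` type, the `interleaved_schedule` function, the sort key, and the choice of the 1956 Bloom level names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Levels of Bloom's original (1956) taxonomy, easiest first.
# Which level set ACER actually uses is an assumption here.
BLOOM_LEVELS = [
    "knowledge", "comprehension", "application",
    "analysis", "synthesis", "evaluation",
]
BLOOM_RANK = {name: rank for rank, name in enumerate(BLOOM_LEVELS)}


@dataclass(frozen=True)
class QAItem:
    chapter: int  # hypothetical: position in the generated table of contents
    bloom: str    # hypothetical: Bloom level tag attached to the QA pair
    text: str     # the question-answer pair itself


def interleaved_schedule(items):
    """Order items so content and difficulty advance together.

    Items are sorted by chapter index plus Bloom rank (a diagonal sweep
    over the chapter x difficulty grid); ties are broken by chapter.
    This is one hypothetical interleaving, not the paper's schedule.
    """
    return sorted(
        items,
        key=lambda it: (it.chapter + BLOOM_RANK[it.bloom], it.chapter),
    )


if __name__ == "__main__":
    corpus = [
        QAItem(chapter=c, bloom=b, text=f"QA for chapter {c} at level {b}")
        for c in range(3)
        for b in BLOOM_LEVELS[:3]
    ]
    for item in interleaved_schedule(corpus):
        print(item.chapter, item.bloom)
```

Under this ordering, the easiest questions of a later chapter appear alongside harder questions of earlier chapters, which is one way to read "aligning learning across both content and cognitive dimensions". A purely content-first or difficulty-first schedule would instead use `(chapter, rank)` or `(rank, chapter)` as the sort key.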

Peer Reviews

Decision: ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper presents a systematic synthetic corpus generation pipeline that is well-motivated and clearly described.
2. The curriculum scheduling is a reasonable design choice, enabling ablations that highlight the value of ordering in continual pretraining.
3. Empirical results demonstrate improvements in both target niche domains and stability on general capability benchmarks.

Weaknesses

1. The impact of the synthetic book corpus generation pipeline is not sufficiently discussed. The experiments center on using the same pipeline under different curriculum schedules, so it is unclear how large a role the generation pipeline plays in the overall performance improvement.
2. Limited insight is offered into the differences among the curriculum schedules, e.g., what makes some schedules work better than others.
3. The performance improvements in some domains are rather limited.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The paper identifies that state-of-the-art LLMs struggle in specialized domains requiring deep, principled understanding.

Weaknesses

- The paper only compares the effectiveness of different curriculum schedules within its own method in Table 1, and lacks comparisons against other baseline approaches in the same domain (e.g., those referenced in lines 53–55).
- The ACER synthesis seems to rely on Gemini 2.0 Flash as the LLM teacher, but this is not explicitly stated in Section 3. Moreover, the paper does not report Gemini 2.0 Flash's performance on the evaluation benchmarks, which matters because it is the teacher model.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Well-motivated problem: The paper addresses a genuine limitation of current LLMs (their shallow understanding of specialized domains), highlighted by consistent performance gaps on benchmarks like MMLU (Hendrycks et al., 2021).
2. Systematic synthesis pipeline: ACER's multi-stage generation process (domain detailing → outline → textbook → QA pairs) is pedagogically grounded and scalable, drawing on established educational frameworks like Bloom's taxonomy (Bloom et al., 1956).
3. Comprehensive ev

Weaknesses

1. Lack of comparison to strong synthetic data baselines: The paper does not compare ACER against recent, high-impact synthetic data methods, like Phi-4 (Abdin et al., 2024), which uses multi-agent, multi-stage pipelines to generate diverse, high-quality textbook-like data. Without such comparisons, it is unclear whether gains stem from curriculum structure or simply from high-quality synthetic data.
2. Marginal and unstable gains from curriculum scheduling: While the “Flat” baseline (random mix