A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature

Faruk Alpay; Bugra Kilictas; Hamdi Alakkad

arXiv:2508.04612·cs.IR·August 7, 2025

TL;DR

This paper introduces an open-source, scalable pipeline for autoregressive-model research that automates literature retrieval, relevance filtering, metadata extraction, and experiment reproduction, with the aim of making research synthesis faster and results easier to reproduce.

Contribution

The authors present a novel, fully automated pipeline that streamlines literature analysis and reproduces experiments in autoregressive modeling, addressing scalability and reproducibility challenges.

Findings

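01. F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification, measured on 50 manually annotated papers

02. Near-linear scalability on corpora of up to 1000 papers with eight CPU workers

03. Reproduced experiments reach perplexities within 1-3% of the originally reported results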
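
The three findings rest on standard quantities, which can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the numbers in the usage note are made up for the example rather than taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_gap(reproduced: float, original: float) -> float:
    """Relative deviation of a reproduced perplexity from the reported one."""
    return abs(reproduced - original) / original


def parallel_efficiency(t_one_worker: float, t_n_workers: float, n_workers: int) -> float:
    """Speedup divided by worker count; 1.0 means perfectly linear scaling."""
    speedup = t_one_worker / t_n_workers
    return speedup / n_workers
```

Under these definitions, a reproduced perplexity of 102.0 against a reported 100.0 gives a relative gap of 0.02, which would fall inside a 1-3% band, and "near-linear" scaling corresponds to a parallel efficiency close to 1.0.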

Abstract

The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, filters them for relevance, extracts metadata, hyper-parameters and reported results, clusters topics, produces retrieval-augmented summaries and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually-annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies -- AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an…
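
The filter-then-extract portion of the pipeline described in the abstract might look roughly like the sketch below. Everything here is an assumption for illustration: the `Paper` class, the keyword list, and the placeholder extraction step are hypothetical and do not come from the paper, whose relevance classifier and extractors are not described in this excerpt. A thread pool is used for simplicity; a process pool would be the closer analogue of CPU workers for CPU-bound extraction.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

# Hypothetical keyword list standing in for a real relevance classifier.
KEYWORDS = ("autoregressive", "language model", "transformer")


@dataclass
class Paper:
    title: str
    abstract: str
    metadata: dict = field(default_factory=dict)


def is_relevant(paper: Paper) -> bool:
    """Naive relevance filter: keep papers mentioning any keyword."""
    text = f"{paper.title} {paper.abstract}".lower()
    return any(keyword in text for keyword in KEYWORDS)


def extract_metadata(paper: Paper) -> Paper:
    """Placeholder extraction step: records only the abstract word count."""
    paper.metadata["abstract_words"] = len(paper.abstract.split())
    return paper


def run_pipeline(papers: list[Paper], workers: int = 8) -> list[Paper]:
    """Filter for relevance, then run extraction across a pool of workers."""
    relevant = [p for p in papers if is_relevant(p)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `relevant`.
        return list(pool.map(extract_metadata, relevant))
```

Because each paper is processed independently, the extraction stage parallelises without coordination between workers, which is the kind of structure that makes near-linear scaling plausible.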
