A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature
Faruk Alpay, Bugra Kilictas, Hamdi Alakkad

TL;DR
This paper introduces an open-source, scalable pipeline that automates literature retrieval, relevance filtering, metadata extraction, and experiment reproduction for research on autoregressive models, with the aim of keeping literature synthesis and reproduction studies practical as the number of papers grows.
Contribution
The authors present a novel, fully automated pipeline that streamlines literature analysis and reproduces experiments in autoregressive modeling, addressing scalability and reproducibility challenges.
Findings
F1 scores above 0.85 for relevance classification, hyper-parameter extraction, and citation identification on 50 manually annotated papers
Near-linear scalability with up to 1000 papers using 8 CPU workers
Reproduced experiments achieve perplexities within 1-3% of original results
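For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how an F1 score and a relative perplexity gap are typically computed. It is not taken from the authors' pipeline: the function names `f1_score` and `relative_gap` and all numeric inputs are hypothetical and chosen purely for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        # No true positives: define F1 as 0 and avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_gap(reproduced: float, original: float) -> float:
    """Absolute deviation of a reproduced perplexity from the reported one,
    expressed as a fraction of the reported value."""
    return abs(reproduced - original) / original


if __name__ == "__main__":
    # Hypothetical counts and perplexities, not results from the paper.
    print(f"F1: {f1_score(tp=45, fp=5, fn=5):.3f}")
    print(f"Relative gap: {relative_gap(reproduced=102.0, original=100.0):.1%}")
```

Under this definition, a reproduction "within 1-3%" of the original corresponds to a `relative_gap` value between 0.01 and 0.03.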
Abstract
The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories; filters them for relevance; extracts metadata, hyper-parameters and reported results; clusters topics; produces retrieval-augmented summaries; and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies -- AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an…
