PaPaformer: Language Model from Pre-trained Parallel Paths
Joonas Tapaninaho, Mourad Oussala

TL;DR
PaPaformer introduces a novel decoder-only transformer architecture with parallel paths, enabling faster training, customization for specific tasks, and reduced computational costs compared to traditional models.
Contribution
The paper presents PaPaformer, a transformer variant with parallel paths that can be trained separately and combined, reducing training time and allowing task-specific customization.
Findings
Training time reduced from days to hours.
Lower-dimensional paths can be trained independently.
Combining the separately trained paths improves model performance.
Abstract
The training of modern large-language models requires an increasing amount of computation power and time. Even smaller variants, such as small-language models (SLMs), take several days to train in the best-case scenarios, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days or weeks. We introduce \textit{PaPaformer}, a decoder-only transformer architecture variant whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually with different types of training data and then combined into one larger model. This method gives the option to reduce the total number of model parameters and the training time while increasing performance. Moreover, the use of parallel path structure opens interesting…
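The abstract does not say how the parallel paths are wired together, so the following is only a toy sketch under stated assumptions, not the authors' architecture. It assumes each path is a single square linear map standing in for a pre-trained low-dimensional sub-model, and that the combined model splits its input into equal slices, runs each slice through its own path, and concatenates the results. All names (`make_path`, `combine_paths`, `param_count`) are hypothetical. The sketch also shows the parameter arithmetic behind the claim that parallel paths can reduce the total parameter count: one dense d x d map has d^2 weights, while k paths of width d/k have d^2/k in total.

```python
import random

def make_path(width, seed):
    """Hypothetical stand-in for one pre-trained path: a width x width linear map."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(width)] for _ in range(width)]

def apply_path(weights, x):
    """Matrix-vector product for a single path."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def combine_paths(paths, x):
    """Assumed combination rule: slice the input, run each slice through
    its own path, and concatenate the path outputs."""
    width = len(x) // len(paths)
    out = []
    for i, weights in enumerate(paths):
        out.extend(apply_path(weights, x[i * width:(i + 1) * width]))
    return out

def param_count(paths):
    """Total number of weights across all paths."""
    return sum(len(w) * len(w[0]) for w in paths)

d_model, n_paths = 8, 2
# Each path could be trained separately (e.g. on different data) before combining.
paths = [make_path(d_model // n_paths, seed=i) for i in range(n_paths)]

x = [1.0] * d_model
y = combine_paths(paths, x)

dense_params = d_model * d_model      # one full-width d x d map
parallel_params = param_count(paths)  # k maps of size (d/k) x (d/k)
print(len(y), dense_params, parallel_params)
```

Because each path in this sketch only sees its own slice, swapping one path for a differently trained one changes only that path's part of the output, which is one way to picture the task-specific customization described above.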
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
