L2MAC: Large Language Model Automatic Computer for Extensive Code Generation
Samuel Holt, Max Ruiz Luyten, Mihaela van der Schaar

TL;DR
L2MAC is a novel LLM-based framework that enables long, coherent output generation by mimicking a stored-program computer architecture, surpassing existing methods in large codebase and extensive text generation tasks.
Contribution
This paper introduces L2MAC, a practical multi-agent LLM system with a dual-component memory architecture, allowing for extensive output generation beyond fixed context limitations.
Findings
Achieves state-of-the-art performance in large codebase generation.
Successfully generates entire books and complex texts.
Outperforms existing coding and text generation methods.
Abstract
Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and coherent outputs. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long output generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation of new memories or (2) use very specialized memories that cannot adapt to other domains. This paper presents L2MAC, the first practical LLM-based general-purpose stored-program automatic computer (von Neumann architecture) framework, an LLM-based multi-agent system, for long and consistent output generation. Its memory has two components: the instruction registry, which is populated with a prompt program to solve the user-given task, and a file store, which will contain the final and…
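The two-part memory described in the abstract can be pictured with a small toy model. The sketch below is illustrative only and is not the authors' implementation: the class name `L2MACSketch`, the stub `echo_llm`, the character-based `context_limit`, and the `step_N.txt` file naming are all assumptions made for this example. It shows a control loop that takes one instruction at a time from an instruction registry, builds a prompt that stays within a fixed budget, and writes each result to a file store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: echoes the instruction line."""
    return f"output for: {prompt.splitlines()[0]}"


@dataclass
class L2MACSketch:
    """Toy model of an instruction registry, a file store, and a control loop."""
    llm: Callable[[str], str]              # assumed interface: prompt in, text out
    context_limit: int = 200               # assumed budget, counted in characters
    instruction_registry: List[str] = field(default_factory=list)
    file_store: Dict[str, str] = field(default_factory=dict)

    def build_context(self, instruction: str) -> str:
        """Join the instruction with as many stored files as fit the budget."""
        parts = [instruction]
        used = len(instruction)
        for name, body in self.file_store.items():
            entry = f"\n# {name}\n{body}"
            if used + len(entry) > self.context_limit:
                break  # stop before the prompt would exceed the budget
            parts.append(entry)
            used += len(entry)
        return "".join(parts)

    def run(self) -> Dict[str, str]:
        """Execute the instructions in order, writing one file per step."""
        for step, instruction in enumerate(self.instruction_registry):
            prompt = self.build_context(instruction)
            self.file_store[f"step_{step}.txt"] = self.llm(prompt)
        return self.file_store


machine = L2MACSketch(
    llm=echo_llm,
    context_limit=80,
    instruction_registry=["write a", "write b", "write c"],
)
files = machine.run()
```

In this toy version the budget only holds if each instruction is itself shorter than `context_limit`. How the real system selects which files to read or update is not specified in the text above, so the first-come ordering used here is purely a placeholder.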
Peer Reviews
Decision: ICLR 2024 poster
- Structured framework for LLM-based computation that can deal with limited context, file input/output and output evaluation and testing.
- Context handling that preserves information needed for the tasks and limits context to the context size.
- Read and write implementation for files generated during subtasks. Demonstrated capabilities to write, then read and update files.
- Strongly improved results on benchmark tasks compared to strong baseline models/tools.
- File read/write implementation details are not clear. Please explain how your system decides what files to write, read, and update and how this is different from previous systems that did not have this functionality.
- Benchmark set is not described. It is not clear if the benchmarks are representative of large code base creation tasks. Evaluation is done on only 3 tasks. The number of tasks should be increased to show the versatility and that the results are not outliers.
Minor comments:
- S
1. **Relatively Novel Approach**: The paper presents a novel idea of employing a control unit, instruction registry, and a file store to enhance LLMs. Although the individual components have been introduced in prior work (planning, test case generation, using external tools, refining with code execution feedback), the application of these to a stored-program computer in this context seems to be a fresh approach.
2. **Detailed Descriptions**: The paper provides a thorough description of all the
1. **Questionable Evaluation Metrics**: The paper employs several evaluation metrics that are based on LLMs rather than ground truths like human-written test cases. This approach raises concerns about the representation of these metrics in terms of code quality. For instance, 'Features %' is determined by a GPT-4 call and not by running and testing the code. Similarly, 'Tests Passed' is based on the LLM-generated test cases, which may not accurately reflect test coverage or code quality. The pap
+ I found the synergy between the conventional von Neumann architecture and L2MAC interesting, as well as how the authors created a one-to-one mapping between the components of conventional computing platforms and their proposed design.
+ The results for code generation tasks are promising.
- While I think the paper proposes an interesting idea, I found the writing very challenging and difficult to follow.
- While general-purpose computers can excel at a variety of tasks, L2MAC focuses on one particular task, and it is not clear how such a model can generalize to different applications and programs.
- While the core idea is still new, most of the explored ideas, like self-refinement, using external memory, etc., have been expl
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus
