TL;DR
This paper introduces P2P, a novel LLM-based multi-agent framework for automated academic paper-to-poster generation, along with a large-scale dataset and comprehensive benchmark to advance research in this area.
Contribution
We propose P2P, the first flexible multi-agent system for generating high-quality academic posters from research papers, and release P2PInstruct and P2PEval for standardized training and evaluation.
Findings
P2P achieves high-quality poster generation with iterative refinement.
P2PEval provides a dual evaluation methodology for comprehensive assessment.
P2PInstruct offers a large-scale dataset for training and benchmarking.
Abstract
Academic posters are vital for scholarly communication, yet their manual creation is time-consuming. However, automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual-textual integration. Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively. To address these limitations, we introduce P2P, the first flexible, LLM-based multi-agent framework that generates high-quality, HTML-rendered academic posters directly from research papers, demonstrating strong potential for practical applications. P2P employs three specialized agents (for visual element processing, content generation, and final poster assembly), each integrated with dedicated checker modules to enable iterative refinement and ensure…
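The agent-plus-checker refinement described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' implementation: the names `Stage` and `run_stage`, the `(passed, feedback)` checker contract, the `max_rounds` cap, and the toy generator and checker that stand in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical contracts (not from the paper):
#   generate(source, feedback) -> output
#   check(output) -> (passed, feedback)
Generate = Callable[[str, Optional[str]], str]
Check = Callable[[str], Tuple[bool, str]]


@dataclass
class Stage:
    name: str
    generate: Generate
    check: Check


def run_stage(stage: Stage, source: str, max_rounds: int = 3) -> Tuple[str, int]:
    """Run one agent, regenerating with checker feedback until it passes
    or the round budget is exhausted. Returns (output, rounds_used)."""
    feedback: Optional[str] = None
    output = ""
    for round_no in range(1, max_rounds + 1):
        output = stage.generate(source, feedback)
        passed, feedback = stage.check(output)
        if passed:
            return output, round_no
    # Budget exhausted: return the last attempt even though it failed the check.
    return output, max_rounds


def toy_generate(source: str, feedback: Optional[str]) -> str:
    # Toy stand-in for an LLM call: truncate to the length the checker asked for.
    limit = len(source) if feedback is None else int(feedback)
    return source[:limit]


def toy_check(output: str) -> Tuple[bool, str]:
    # Toy rule-based checker: pass when the text fits a 20-character budget,
    # otherwise ask for half the current length.
    if len(output) <= 20:
        return True, ""
    return False, str(len(output) // 2)


if __name__ == "__main__":
    stage = Stage("content", toy_generate, toy_check)
    text, rounds = run_stage(stage, "x" * 100, max_rounds=5)
    print(len(text), rounds)
```

In this sketch the checker is a programmatic rule, but the same loop would accept an LLM-as-a-judge or a trained classifier behind the `check` callable; the paper summary here does not say which of these P2P uses.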
Peer Reviews
Decision·ICLR 2026 Poster
The paper makes three core contributions:
1. A multi-agent system that decomposes poster generation into specialized sub-tasks (figure extraction, content summarization, and layout assembly) with integrated reflection mechanisms for iterative improvement.
2. A large-scale instruction dataset containing over 30,000 examples, designed to support training and fine-tuning of models for the poster generation task.
3. A benchmark featuring fine-grained checklists and a universal evaluation metric, combining
1. The checker-reflection paradigm represents a significant architectural innovation, but its failure boundaries remain unclear. Could you provide a typology of errors that persist despite reflection cycles? Specifically, we're interested in cases where the system's compositional reasoning breaks down - for instance, when reconciling complex multi-panel figures with nuanced methodological descriptions. Understanding these limitations would help define the theoretical ceiling of this approach. 2.
1. **Innovative P2P Framework** The P2P multi-agent architecture is a novel contribution to complex document transformation tasks. The inclusion of a **checker-reflection mechanism** is particularly strong, as it mimics the human design process of drafting and revision. This iterative approach helps ensure both scientific accuracy and structural integrity in the final output.
2. **Valuable Dataset (P2PINSTRUCT)** The paper introduces P2PINSTRUCT, the first large-scale (30K+) instructi
### 1. Lack of Detail on P2P Checker Mechanisms

The paper states, "Each agent operates in conjunction with a dedicated checker module that triggers a reflection loop if its output fails to meet quality standards". However, the manuscript provides no concrete details on how these critical checker modules are implemented.

* What is the core component of each checker? Is it an LLM-as-a-judge, a set of programmatic rules, or a trained classifier?
* **Figure Checker**: The paper mentions "an initial
1. The authors have built an end-to-end solution, from the generation framework (P2P) to the large-scale data for training (P2PINSTRUCT) and a novel benchmark for evaluation (P2PEVAL). This is a significant and impressive amount of work. 2. The authors conducted comprehensive experiments, including evaluation on 35 models and a human preference study. The P2P framework is shown to be effective via an ablation study.
1. There is a lack of analysis for the effectiveness of the P2PINSTRUCT dataset. 2. The methodology of this metric measures fidelity to a specific human instance, which itself may be suboptimal (as shown by human preference research). The authors should discuss this tension between "imitation" and "fidelity" more openly. 3. The two-step, MLLM-Featurizer-plus-XGBoost-Regressor methodology for the universal score is not justified over simpler, more direct MLLM-as-a-Judge approaches. This adds a "
