From Threads to Trajectories: A Multi-LLM Pipeline for Community Knowledge Extraction from GitHub Issue Discussions

Nazia Shehnaz Joynab; Soneya Binta Hossain

arXiv:2604.25880·cs.SE·April 29, 2026

TL;DR

This paper introduces SWE-MIMIC-Bench, an issue trajectory dataset built by an automated multi-LLM pipeline that extracts structured, coherent issue trajectories from GitHub discussions to aid developers and train expert-like LLM agents.

Contribution

The paper presents a novel multi-LLM pipeline that generates detailed, label-aware issue trajectories from raw GitHub discussions, both to help developers understand long threads and to provide training data for AI agents.

Findings

01. Achieved a 91.7% success rate in extracting high-fidelity trajectories.

02. Generated 734 detailed issue trajectories from 800 real-world GitHub issues.

03. Demonstrated the system's effectiveness on multiple SWE-Bench datasets.
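
The counts in finding 02 imply a yield that can be checked with one line of arithmetic. This is only a sanity check on the reported numbers: the page does not say how the 91.7% in finding 01 is computed, so treating it as this ratio is an assumption, and the variable names are illustrative.

```python
# Counts taken from finding 02 on this page.
issues_processed = 800
trajectories_generated = 734

# Share of issues that produced a trajectory, as a percentage.
# Assumption: this ratio is what "success rate" refers to; the page does not confirm it.
yield_percent = 100 * trajectories_generated / issues_processed

print(f"{yield_percent:.2f}%")  # prints 91.75%
```

The result, 91.75%, is close to the reported 91.7%, which is consistent with the two findings describing the same outcome.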

Abstract

Resolving complex post-production issues in large-scale open-source software (OSS) projects requires significant cognitive effort, because developers must first work through long, unstructured, and fragmented issue discussion threads. In this paper, we present SWE-MIMIC-Bench, an issue trajectory dataset generated from raw GitHub discussions using an automated multi-LLM pipeline. Unlike simple summarization, this pipeline uses a group of closed-source LLMs to perform granular tasks: analyzing individual comments with awareness of externally linked resources, classifying comment analyses into label-specific fields (e.g., root cause, solution plan, implementation progress), and synthesizing label-aware trajectories that capture a structured and coherent narrative of the entire discussion thread. Our pipeline uses five closed-source LLM configurations for distinct purposes: label…
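
The three granular tasks named in the abstract (comment analysis, classification into label-specific fields, trajectory synthesis) can be sketched as a simple data flow. This is a minimal illustration under stated assumptions, not the authors' implementation: every function, type, and prompt below is hypothetical, the `LLM` type stands in for any model call, the three field names are only the examples the abstract gives, and the five LLM configurations the abstract mentions are not modeled because the listing truncates their description.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

# Fields named as examples in the abstract; the paper's full label set is not given there.
LABEL_FIELDS = ("root_cause", "solution_plan", "implementation_progress")

# Hypothetical stand-in for a model call: prompt text in, completion text out.
LLM = Callable[[str], str]


@dataclass
class Trajectory:
    issue_id: str
    fields: Dict[str, List[str]] = field(default_factory=dict)
    narrative: str = ""


def analyze_comment(llm: LLM, text: str, linked: Sequence[str]) -> str:
    """Stage 1: analyze one comment together with its externally linked resources."""
    context = "\n".join(linked)
    return llm(f"Analyze this comment.\nComment: {text}\nLinked resources:\n{context}")


def classify_analysis(llm: LLM, analysis: str) -> str:
    """Stage 2: assign a comment analysis to one label-specific field."""
    answer = llm(f"Pick one of {', '.join(LABEL_FIELDS)} for: {analysis}").strip()
    # Anything outside the known fields is bucketed as "other" (an illustrative choice).
    return answer if answer in LABEL_FIELDS else "other"


def synthesize(llm: LLM, issue_id: str, fields: Dict[str, List[str]]) -> Trajectory:
    """Stage 3: turn the grouped analyses into one narrative for the whole thread."""
    outline = "\n".join(f"{name}: {'; '.join(items)}" for name, items in fields.items())
    return Trajectory(issue_id, fields, llm(f"Write a trajectory from:\n{outline}"))


def build_trajectory(
    issue_id: str,
    comments: Sequence[Tuple[str, Sequence[str]]],
    analyzer: LLM,
    classifier: LLM,
    synthesizer: LLM,
) -> Trajectory:
    """Run the three stages in order over (comment text, linked resources) pairs."""
    fields: Dict[str, List[str]] = {}
    for text, linked in comments:
        analysis = analyze_comment(analyzer, text, linked)
        label = classify_analysis(classifier, analysis)
        fields.setdefault(label, []).append(analysis)
    return synthesize(synthesizer, issue_id, fields)


# Toy stand-ins so the sketch runs without any model; real calls would replace these.
def toy_analyzer(prompt: str) -> str:
    line = next(l for l in prompt.splitlines() if l.startswith("Comment: "))
    return line[len("Comment: "):]


def toy_classifier(prompt: str) -> str:
    return "root_cause" if "null" in prompt else "solution_plan"


def toy_synthesizer(prompt: str) -> str:
    return prompt.split("\n", 1)[1]  # echo the outline without the instruction line


demo = build_trajectory(
    "demo-1",
    [("Crash is caused by a null config", ["stack trace"]),
     ("Plan: add a guard clause", [])],
    analyzer=toy_analyzer,
    classifier=toy_classifier,
    synthesizer=toy_synthesizer,
)
print(demo.fields)
```

Passing a separate callable for each stage mirrors the abstract's point that different LLM configurations serve distinct purposes, while leaving open which model handles which task.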
