Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation
Kaikai An, Fangkai Yang, Liqun Li, Junting Lu, Sitao Cheng, Shuzheng Si, Lu Wang, Pu Zhao, Lele Cao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Baobao Chang

TL;DR
This paper introduces Thread, a new data organization paradigm that structures documents into logical units to improve how-to question answering, significantly enhancing success rates and adaptability over traditional chunk-based methods.
Contribution
The paper proposes Thread, a novel paradigm using logic units for better data organization, enabling more effective handling of how-to questions with improved accuracy and flexibility.
Findings
Thread outperforms existing paradigms by 21-33% in success rate.
Thread reduces the amount of retrieved information by up to 75%.
Thread demonstrates high adaptability across document formats.
Abstract
Recent advances in retrieval-augmented generation (RAG) have substantially improved question-answering systems, particularly for factoid '5Ws' questions. However, significant challenges remain when addressing '1H' questions, specifically how-to questions, which are integral for decision-making and require dynamic, step-by-step responses. The key limitation lies in the prevalent data organization paradigm, chunk, which commonly divides documents into fixed-size segments, and disrupts the logical coherence and connections within the context. To address this, we propose Thread, a novel data organization paradigm enabling systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, 'logic unit' (LU), where large language models transform documents into more structured and loosely interconnected LUs. Extensive experiments across both…
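The contrast the abstract draws between fixed-size chunks and interconnected logic units can be illustrated with a minimal sketch. The `LogicUnit` fields, the outcome-keyed `links` mapping, the `follow` helper, and the sample troubleshooting steps are all hypothetical names chosen for illustration; they are assumptions, not the paper's actual LU definition or pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class LogicUnit:
    """Hypothetical LU: one self-contained step plus links to follow-up units."""
    uid: str
    body: str
    # Assumed linking scheme: an observed outcome maps to the id of the next unit.
    links: dict[str, str] = field(default_factory=dict)


def follow(units: dict[str, LogicUnit], start: str, outcomes: list[str]) -> list[str]:
    """Walk from `start`, choosing the next unit by each observed outcome."""
    path = [start]
    current = units[start]
    for outcome in outcomes:
        next_id = current.links.get(outcome)
        if next_id is None:
            break  # no link for this outcome, so the walk ends here
        path.append(next_id)
        current = units[next_id]
    return path


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Baseline chunking: cut every `size` characters, ignoring step boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative how-to content (invented for this sketch).
units = {
    "check": LogicUnit("check", "Check whether the service is running.",
                       {"running": "logs", "stopped": "restart"}),
    "restart": LogicUnit("restart", "Restart the service.", {"ok": "done"}),
    "logs": LogicUnit("logs", "Inspect the error logs."),
    "done": LogicUnit("done", "Confirm the service responds."),
}

path = follow(units, "check", ["stopped", "ok"])
```

In this sketch the step-by-step answer is recovered by following links between units, whereas `fixed_size_chunks` may cut a step in the middle and keeps no record of which step comes next.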
Peer Reviews
Decision · Submitted to ICLR 2025
1. The authors propose a novel way to organize a document corpus. Compared to the usual chunking methods, it facilitates building logical connections between units; compared to knowledge-graph approaches, it facilitates building meta connections between units that are larger than atomic-level knowledge base entities.
2. This approach is shown to be effective at improving answers to how-to questions
### Clarity
In general, the major problem for me is clarity. The paper is a bit hard to follow and requires multiple re-reads to understand how the system is supposed to be implemented.
1. For the methodology section, reading the text itself is clearer than trying to parse Figures 2 and 3. For Figure 2, the arrow from (b) to the right side confused me. As I understand it, the right panel is an individual view of an LU. Maybe split up the high-level flow with component details into different figure
1. THREAD introduces a novel logic-based organization paradigm that emphasizes the logical flow needed to handle complex how-to questions effectively.
2. The paper’s design of logic units with structured components is well-thought-out. Each component serves a specific purpose, enhancing logical continuity and allowing for more coherent, step-by-step responses.
3. Experimental results show that THREAD outperforms traditional data organization paradigms.
4. THREAD reduces the number of retrieval
1. The experiments mainly compare THREAD with chunk-based and document-based paradigms, with limited discussion on other advanced retrieval techniques, such as graph-based retrieval or hierarchical indexing.
2. In the industrial setting, the evaluation relies on human engineers. However, the paper does not provide enough details about the evaluation protocols, inter-annotator agreement, or potential biases.
3. THREAD’s effectiveness appears to depend on the assumption that documents are reason
The paper is well written and raises a good question. The idea of the LU is reasonable.
Objectively speaking, most RAG objects are chunks, and this paper does not change that basic background, so the beginning of the article is very misleading. It is recommended not to exaggerate the motivation with respect to chunks. The last sentence of contribution #3 does not seem to be supported by any experiments.
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Database Systems and Queries · Data Management and Algorithms
Methods: Balanced Selection
