DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond
Cong Yao

TL;DR
DocXChain is an open-source, modular toolchain that automates the conversion of unstructured documents into structured data, supporting various document parsing tasks and easily integrating with other systems for complex applications.
Contribution
It introduces a flexible, comprehensive toolchain for document parsing that combines basic capabilities with pipelines, enabling easier integration and application in real-world scenarios.
Findings
Provides a complete set of document parsing capabilities
Supports integration with tools like LangChain and ChatGPT
Open-source and publicly available for community use
Abstract
In this report, we introduce DocXChain, a powerful open-source toolchain for document parsing, which is designed and developed to automatically convert the rich information embodied in unstructured documents, such as text, tables and charts, into structured representations that are readable and manipulable by machines. Specifically, basic capabilities, including text detection, text recognition, table structure recognition and layout analysis, are provided. Upon these basic capabilities, we also build a set of fully functional pipelines for document parsing, i.e., general text reading, table parsing, and document structurization, to drive various applications related to documents in real-world scenarios. Moreover, DocXChain is concise, modularized and flexible, such that it can be readily integrated with existing tools, libraries or models (such as LangChain and ChatGPT), to construct…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHandwritten Text Recognition Techniques · Natural Language Processing Techniques · Topic Modeling
MethodsSparse Evolutionary Training
