Scalable Construction of a Lung Cancer Knowledge Base: Profiling Semantic Reasoning in LLMs
Cesar Felipe Martínez Cisneros, Jesús Ulises Quiroz Bautista, Claudia Anahí Guzmán Solano, Bogdan Kaleb García Rivera, Iván García Pacheco, Yalbi Itzel Balderas Martínez, Kolawole John Adebayoc, Ignacio Arroyo Fernández

TL;DR
This paper introduces a scalable pipeline for constructing a lung cancer knowledge base using OpenIE and NER, significantly improving biomedical NLP performance for LLM fine-tuning.
Contribution
It presents a novel, scalable method for building a domain-specific biomedical knowledge base leveraging OpenIE and NER, enhancing LLM fine-tuning.
Findings
Improved semantic coherence in LLM outputs.
Enhanced performance on biomedical NLP tasks.
Demonstrated effectiveness of OpenIE-derived datasets.
Abstract
The integration of Large Language Models (LLMs) into biomedical research offers new opportunities for domain-specific reasoning and knowledge representation. However, their performance depends heavily on the semantic quality of training data. In oncology, where precision and interpretability are vital, scalable methods for constructing structured knowledge bases are essential for effective fine-tuning. This study presents a pipeline for developing a lung cancer knowledge base using Open Information Extraction (OpenIE). The process includes: (1) identifying medical concepts with the MeSH thesaurus; (2) filtering open-access PubMed literature with permissive licenses (CC0); (3) extracting (subject, relation, object) triplets using an OpenIE method; and (4) enriching triplet sets with Named Entity Recognition (NER) to ensure biomedical relevance. The resulting triplet sets provide a…
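Steps (3) and (4) of the pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Triplet` type, the `BIOMEDICAL_TERMS` lexicon, and the `filter_triplets` helper are all hypothetical, and a simple substring lookup stands in for a real NER model. The example triplets are made up for illustration only.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """A (subject, relation, object) triplet, as an OpenIE system might emit."""
    subject: str
    relation: str
    obj: str


# Hypothetical stand-in for a biomedical NER model: a tiny term lexicon.
# A real pipeline would use a trained NER tagger instead.
BIOMEDICAL_TERMS = {"egfr", "lung cancer", "gefitinib", "nsclc"}


def mentions_entity(text: str, lexicon: set) -> bool:
    """Return True if any lexicon term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in lexicon)


def filter_triplets(triplets, lexicon=BIOMEDICAL_TERMS):
    """Keep triplets whose subject or object mentions a biomedical entity."""
    return [
        t for t in triplets
        if mentions_entity(t.subject, lexicon) or mentions_entity(t.obj, lexicon)
    ]


# Illustrative input only; not data from the paper.
triplets = [
    Triplet("EGFR mutations", "are associated with", "lung cancer"),
    Triplet("the authors", "thank", "the reviewers"),
    Triplet("Gefitinib", "inhibits", "EGFR"),
]

kept = filter_triplets(triplets)
for t in kept:
    print(t.subject, "|", t.relation, "|", t.obj)
```

The filter keeps a triplet when either argument mentions a recognized entity, which is one plausible reading of "ensuring biomedical relevance"; requiring both arguments to match would be a stricter alternative.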
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Text Readability and Simplification
