Chi-square-based scoring function for categorization of MEDLINE citations
Andrej Kastrin, Borut Peterlin, Dimitar Hristovski

TL;DR
This paper introduces a simple chi-square-based scoring method for classifying MEDLINE citations related to genetics, demonstrating comparable accuracy to machine learning techniques in biomedical text categorization.
Contribution
The study presents a novel, straightforward chi-square scoring approach for biomedical document categorization, validated against machine learning methods with promising results.
Findings
Achieved 87% predictive accuracy on MEDLINE citations
Performed comparably to machine learning algorithms like SVM, decision trees, and Naive Bayes
Implemented in a literature-based discovery system for gene disambiguation
Abstract
Objectives: Text categorization has been used in biomedical informatics for identifying documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood that a MEDLINE citation covers a genetically relevant topic. Methods: Our procedure requires construction of a genetic and a nongenetic domain document corpus. We used MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared frequencies of MeSH descriptors between the two corpora using the chi-square test. A MeSH descriptor was considered to be a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all the citations, with the highest score given to those…
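The descriptor comparison described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract is cut off before it states how per-descriptor statistics are combined into a citation score, so summing signed chi-square values is an assumption made here for illustration only. The function names (`chi_square_2x2`, `descriptor_weights`, `score_citation`) and the toy corpora are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    No continuity correction is applied (an assumption; the source does not say).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Map each MeSH descriptor to a signed chi-square weight.

    Each document is a list of descriptor strings. The sign is positive when the
    descriptor's relative frequency is greater in the genetic corpus, as in the
    abstract; using the signed statistic as a weight is an assumption.
    """
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    if n_gen == 0 or n_non == 0:
        raise ValueError("both corpora must contain at least one document")
    # Count documents (not occurrences) that carry each descriptor.
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a, c = gen_counts[desc], non_counts[desc]
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        positive = a / n_gen > c / n_non
        weights[desc] = chi2 if positive else -chi2
    return weights


def score_citation(descriptors, weights):
    """Sum the signed weights of a citation's descriptors (assumed combination rule)."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))


# Toy corpora, invented for illustration only.
genetic = [["Mutation", "Humans"], ["Mutation", "Genotype"], ["Genotype", "Humans"]]
nongenetic = [["Humans", "Surgery"], ["Surgery", "Aged"], ["Humans", "Aged"]]

weights = descriptor_weights(genetic, nongenetic)
print(score_citation(["Mutation", "Humans"], weights))
print(score_citation(["Surgery", "Humans"], weights))
```

In this toy setup a descriptor that is equally frequent in both corpora (here "Humans") gets a weight of zero, so it does not move a citation's score in either direction. Descriptors missing from both corpora are also treated as zero, which is another simplification.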
