Finding People's Professions and Nationalities Using Distant Supervision - The FMI@SU "goosefoot" team at the WSDM Cup 2017 Triple Scoring Task
Valentin Zmiycharov (1), Dimitar Alexandrov (1), Preslav Nakov (2), Ivan Koychev (1), Yasen Kiprov (1) ((1) Sofia University "St. Kliment Ohridski", (2) Qatar Computing Research Institute)

TL;DR
This paper presents a system that uses distant supervision over data crawled from Wikipedia, DeletionPedia, and DBpedia, combined in a linear regression model, to score the relevance of profession and nationality triples. It ranked 1st on Kendall's Tau in the WSDM Cup 2017 Triple Scoring task.
Contribution
The paper introduces a distant supervision method that combines task-specific word embeddings, TF-IDF weights, and role occurrence order in a linear regression model to score profession and nationality triples.
Findings
Ranked 1st on Kendall's Tau out of 21 participating teams
Ranked 7th on Average score difference and 9th on Accuracy
Drew its distant supervision signal from data crawled from Wikipedia, DeletionPedia, and DBpedia
Abstract
We describe the system that our FMI@SU student team built for participating in the Triple Scoring task at the WSDM Cup 2017. Given a triple from a "type-like" relation, profession or nationality, the goal is to produce a score, on a scale from 0 to 7, that measures the relevance of the statement expressed by the triple: e.g., how well does the profession of an Actor fit for Quentin Tarantino? We propose a distant supervision approach using information crawled from Wikipedia, DeletionPedia, and DBpedia, together with task-specific word embeddings, TF-IDF weights, and role occurrence order, which we combine in a linear regression model. The official evaluation ranked our submission 1st on Kendall's Tau, 7th on Average score difference, and 9th on Accuracy, out of 21 participating teams.
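As a rough illustration of the pipeline the abstract outlines, the sketch below combines a few per-triple features linearly, maps the result onto the 0 to 7 scale, and computes three evaluation measures. Everything specific here is an assumption made for illustration and not taken from the paper: the feature names, the weights, the bias, the tolerance of 2 used for accuracy, and the tau-a form of Kendall's Tau (the official task may define tie handling and normalization differently). The weights are hand-set rather than fitted by regression.

```python
from itertools import combinations


def to_score(raw, lo=0, hi=7):
    """Round a raw model output and clip it to the task's 0-7 scale.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    return max(lo, min(hi, int(round(raw))))


def linear_score(features, weights, bias=0.0):
    """Weighted sum of features (hypothetical names), mapped to 0-7."""
    raw = bias + sum(weights[name] * value for name, value in features.items())
    return to_score(raw)


def average_score_difference(pred, gold):
    """Mean absolute difference between predicted and reference scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def accuracy_within(pred, gold, tolerance=2):
    """Share of predictions within `tolerance` of the reference score.

    The tolerance of 2 is an assumption, not stated in the abstract.
    """
    return sum(abs(p - g) <= tolerance for p, g in zip(pred, gold)) / len(gold)


def kendall_tau_a(pred, gold):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(gold)), 2):
        sign = (pred[i] - pred[j]) * (gold[i] - gold[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(gold) * (len(gold) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical, hand-set weights; a real system would fit them by regression.
WEIGHTS = {"tfidf": 4.0, "embedding_sim": 2.0, "first_role": 1.0}

# Made-up feature values for two professions of one person.
director = linear_score(
    {"tfidf": 0.9, "embedding_sim": 0.8, "first_role": 1.0}, WEIGHTS
)
actor = linear_score(
    {"tfidf": 0.3, "embedding_sim": 0.4, "first_role": 0.0}, WEIGHTS
)
print(director, actor)
```

With these made-up inputs the director triple receives a higher score than the actor triple, which is the kind of ordering that a rank-based measure such as Kendall's Tau rewards even when the absolute scores are off.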
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies
