SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums

Tsvetomila Mihaylova (1), Georgi Karadjov (2), Pepa Atanasova (3), Ramy Baly (4), Mitra Mohtarami (4), Preslav Nakov (5) ((1) Instituto de Telecomunicações, Lisbon, Portugal; (2) SiteGround Hosting EOOD, Bulgaria; (3) University of Copenhagen, Denmark; (4) MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA; (5) Qatar Computing Research Institute, HBKU)

arXiv:1906.01727·cs.CL·June 6, 2019


TL;DR

This paper introduces SemEval-2019 Task 8, focusing on fact checking in community question answering forums through two subtasks: classifying question types and verifying answer correctness, with system evaluations and baseline comparisons.

Contribution

It presents a new benchmark dataset and evaluation framework for fact checking in community QA forums, along with analysis of system performances.

Findings

01

All systems improved over the baseline in Subtask A.

02

No system beat the majority class baseline in Subtask B, but several came very close to it.

03

The dataset and leaderboard are publicly available for future research.

Abstract

We present SemEval-2019 Task 8 on Fact Checking in Community Question Answering Forums, which features two subtasks. Subtask A is about deciding whether a question asks for factual information vs. an opinion/advice vs. just socializing. Subtask B asks to predict whether an answer to a factual question is true, false or not a proper answer. We received 17 official submissions for subtask A and 11 official submissions for Subtask B. For subtask A, all systems improved over the majority class baseline. For Subtask B, all systems were below a majority class baseline, but several systems were very close to it. The leaderboard and the data from the competition can be found at http://competitions.codalab.org/competitions/20022

Tables

Table 1: Subtask A: Distribution of the factuality labels for the questions.

Label	Train	Dev	Test
Factual	311	62	299
Opinion	563	126	167
Socializing	244	51	487
Total	1118	239	953

Table 2: Subtask B: Distribution of the factuality labels for the answers.

Label	Train	Dev	Test
True	166	29	34
False	135	31	45
NonFactual	194	52	231
Total	495	112	310

Table 3: Subtask A: Results for question classification based on the official submissions, evaluated on the test set. (Some teams did not submit system description papers, and thus we have no citations for their systems.)

Team ID	Affiliation	Accuracy	F1	AvgRec
Fermi Syed et al. (2019)	IIIT Hyderabad, Microsoft, Teradata	0.840	0.718₂	0.735₃
TMLab Niewiński et al. (2019)	Samsung R&D Institute, Warsaw, Poland	0.834	0.725₁	0.764₁
SolomonLab Gupta et al. (2019)	Samsung R&D Institute India, Bangalore	0.831	0.709₄	0.728₄
ColumbiaNLP Chakrabarty and Muresan (2019)	Columbia University, Department Of Computer Science and Data Science Institute	0.828	0.645₇	0.662₉
DOMLIN Stammbach et al. (2019)	Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Saarbrucken, Germany	0.823	0.710₃	0.755₂
BLCU_NLP Xie et al. (2019)	Beijing Language and Culture University, Beijing, China	0.820	0.696₅	0.723₅
pjetro	Warsaw University of Technology	0.790	0.661₆	0.698₆
LP0606		0.768	0.637₈	0.679₈
PP08		0.766	0.637₉	0.684₇
AUTOHOME-ORCA Lv et al. (2019)	Autohome Inc., Beijing, China and Beijing University of Posts and Telecommunications, Beijing, China	0.745	0.583₁₀	0.596₁₁
DUTH Bairaktaris et al. (2019)	Democritus University of Thrace, Xanthi, Greece	0.711	0.563₁₁	0.604₁₀
cococold		0.702	0.543₁₂	0.594₁₂
nothing		0.702	0.543₁₂	0.594₁₂
chchao		0.630	0.454₁₃	0.523₁₃
CodeForTheChange Avvaru and Pandey (2019)	International Institute of Information Technology, Hyderabad, Teradata and Qubole	0.630	0.442₁₄	0.513₁₄
Tuefact Juhasz et al. (2019)	University of Tübingen, Tübingen, Germany	0.599	0.360₁₅	0.348₁₅
Reem06		0.549	0.263₁₆	0.343₁₆
Majority Class Baseline		0.450	0.009	0.333

Table 4: Subtask B: Results for answer classification based on the official submissions, evaluated on the test set.

Team ID	Affiliation	Accuracy	F1	AvgRec	MAP
AUTOHOME-ORCA	Autohome Inc., Beijing, China and Beijing University of Posts and Telecommunications, Beijing, China	0.815	0.511₂	0.512₂	0.155₇
ColumbiaNLP	Columbia University, Department Of Computer Science and Data Science Institute	0.791	0.524₁	0.635₁	0.134₈
DOMLIN	Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Saarbrucken, Germany	0.718	0.402₃	0.445₃	0.267₃
SolomonLab	Samsung R&D Institute India, Bangalore	0.686	0.375₄	0.403₄	0.333₂
CodeForTheChange	International Institute of Information Technology, Hyderabad, Teradata and Qubole	0.654	0.325₅	0.326₅	0.156₆
BLCU_NLP	Beijing Language and Culture University, Beijing, China	0.611	0.296₆	0.317₆	0.222₄
LP0606		0.548	0.271₇	0.341₇	0.121₉
PP08		0.548	0.271₇	0.341₇	0.121₉
Tuefact	University of Tübingen, Tübingen, Germany	0.527	0.260₈	0.347₈	0.571₁
cococold		0.439	0.133₉	0.241₉	0.208₅
nothing		0.439	0.133₉	0.241₉	0.208₅
Majority Class Baseline		0.830	0.285	0.333	0.156


Code & Models

Repositories

tsvm/factcheck-cqa (Official)


Full text

SemEval-2019 Task 8:

Fact Checking in Community Question Answering Forums

Tsvetomila Mihaylova,1 Georgi Karadjov,2 Pepa Atanasova,3

Ramy Baly,4 Mitra Mohtarami,4 Preslav Nakov5

1 Instituto de Telecomunicações, Lisbon, Portugal, 2 SiteGround Hosting EOOD, Bulgaria

3 University of Copenhagen, Denmark

4 MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA

5 Qatar Computing Research Institute, HBKU

{tsvetomila.mihaylova, georgi.m.karadjov}@gmail.com,

[email protected], {baly,mitram}@mit.edu, [email protected]


1 Overview

The current coverage of the political landscape in both the press and in social media has led to an unprecedented situation. Like never before, a statement in an interview, a press release, a blog note, or a tweet can spread almost instantaneously. The speed of proliferation leaves little time for double-checking claims against the facts, which has proven critical in politics, e.g., during the 2016 presidential campaign in the USA, which was dominated by fake news in social media and by false claims.

Investigative journalists and volunteers have been working hard to get to the root of a claim and to present solid evidence in favor of or against it. Manual fact-checking is very time-consuming, and thus automatic methods have been proposed to speed up the process, e.g., there has been work on checking the factuality/credibility of a claim, of a news article, or of an information source (Ba et al., 2016; Zubiaga et al., 2016; Ma et al., 2016; Castillo et al., 2011; Baly et al., 2018).

The process starts when a document is made public. First, an intrinsic analysis is carried out in which check-worthy text fragments are identified. Then, other documents that might support or rebut a claim in the document are retrieved from various sources. Finally, by comparing a claim against the retrieved evidence, a system can determine whether the claim is likely true or likely false (or unsure, if no strong enough evidence either way could be found). For instance, Ciampaglia et al. (2015) do this using a knowledge graph derived from Wikipedia. The outcome could then be presented to a human expert for final judgement. (Footnote 1: At present, fully automatic methods for fact checking still lag behind in terms of quality, and thus also of credibility in the eyes of the users, compared to what high-quality manual checking by reputable sources can achieve, which means that a final double-checking by a human expert is needed.)

For our two subtasks, we explore factuality in the context of Community Question Answering (cQA) forums. Forums such as StackOverflow, Yahoo! Answers, and Quora are very popular these days, as they represent effective means for communities around particular topics to share information. However, the information shared by the users is not always correct or accurate. There are multiple factors explaining the presence of incorrect answers in cQA forums, e.g., misunderstanding of the question, ignorance or maliciousness of the responder. Also, as a result of our dynamic world, the truth is time-sensitive: something that was true yesterday may be false today. Moreover, forums are often barely moderated and thus lack systematic quality control.

Here we focus on checking the factuality of questions and answers in cQA forums. This aspect was ignored in recent cQA tasks (Ishikawa et al., 2010; Nakov et al., 2015, 2016a, 2017a), where an answer is considered Good if it addresses the question, irrespective of its veracity, accuracy, etc.

Figure 1 presents an excerpt of an example from the Qatar Living Forum, with one question and three answers selected from a longer thread. According to SemEval-2016 Task 3 (Nakov et al., 2016a), all three answers would be considered Good since they are formally answering the question. Nevertheless, $a_{1}$ contains false information, while $a_{2}$ and $a_{3}$ are correct, as can be established from an official government website (http://portal.moi.gov.qa/wps/portal/MOIInternet/departmentcommittees/visasentrypermeits/).

Checking the veracity of answers in a cQA forum is a hard problem, which requires putting together aspects of language understanding, modelling the context, integrating several information sources, using world knowledge and complex inference, among others. Moreover, high-quality automatic fact-checking would offer a better experience to users of cQA systems, e.g., the user could be presented with veracity scores, where low scores would warn the user not to completely trust the answer or to double-check it.

2 Related Work

Fact-checking of answers has not been studied before in the context of community Question Answering, apart from our own recent work (Mihaylova et al., 2018). Yet, in the context of cQA and general QA, there has been work on credibility assessment, which has been modelled primarily at the feature level, with the goal of improving Good answer identification. Notable exceptions are Nakov et al. (2017b) and Mihaylov et al. (2018), where credibility was a task in its own right. However, credibility is different from veracity (our focus here), as it is a subjective perception of whether a statement is credible, rather than actually truthful.

Jurczyk and Agichtein (2007) modelled author authority using link analysis. Agichtein et al. (2008) looked for high-quality answers using PageRank and HITS, in addition to intrinsic content quality, e.g., punctuation and typos, syntactic and semantic complexity, and grammaticality.

Lita et al. (2005) studied three qualitative dimensions for answers: source credibility (e.g., does the document come from a government website), sentiment analysis, and contradiction compared to other answers. Su et al. (2010) looked for verbs and adjectives that cast doubt. Banerjee and Han (2009) used language modelling to validate the reliability of an answer’s source. Jeon et al. (2006) focused on non-textual features such as click counts, answer activity level, and copy counts. Pelleg et al. (2016) curated social media content using syntactic, semantic, and social signals. Unlike this research, we (i) target factuality rather than credibility, (ii) address it as a task in its own right, and (iii) work on a specialised dataset.

Information credibility was also studied in social computing. Castillo et al. (2011) modeled user reputation. Canini et al. (2011) analyzed the interaction of content and social network structure. Morris et al. (2012) studied how Twitter users judge truthfulness. Lukasik et al. (2015) used temporal patterns to detect rumors, and Zubiaga et al. (2016) focused on conversations.

Other authors have queried the Web to gather support for accepting or refuting a claim (Popat et al., 2016; Karadzhov et al., 2017b). In social media, there has been research targeting the user, e.g., finding malicious users (Mihaylov and Nakov, 2016; Mihaylova et al., 2018; Mihaylov et al., 2018), sockpuppets (Maity et al., 2017), Internet water army (Chen et al., 2013), and seminar users (Darwish et al., 2017).

Finally, there has been work on credibility, trust, and expertise in news communities (Mukherjee and Weikum, 2015). Dong et al. (2015) proposed that a trustworthy source is one that contains very few false claims. Recent work has also focused on evaluating the factuality of reporting of entire news outlets (Baly et al., 2018, 2019). (Footnote 3: Knowing the reliability of a medium is important when fact-checking a claim [Popat et al., 2017; Nguyen et al., 2018] and when solving article-level tasks such as “fake news” and click-bait detection [Hardalov et al., 2016; Karadzhov et al., 2017a; Pan et al., 2018; Pérez-Rosas et al., 2018].) However, none of this work was about QA or cQA.

3 Subtasks and Data Description

SemEval-2019 Task 8 has two subtasks:

• Subtask A: Given a question from a cQA forum, predict whether this question asks for factual information vs. opinion/advice vs. just socializing.

• Subtask B: Given a factual question from a cQA forum, together with its answer thread, predict whether each answer provides true vs. false vs. non-factual information as a response to the question.

3.1 Data and Resources

We retrieved the data from the Qatar Living web forum (http://www.qatarliving.com). We then cleaned it and we annotated it with the labels described in Sections 3.2 and 3.3.

For subtask A, we annotated the questions using Amazon Mechanical Turk (http://www.mturk.com/). To ensure high quality of the annotation, we went through all annotations and manually double-checked them.

For subtask B, we did not use an external annotation service, but instead we annotated all the data ourselves. Each answer was processed by three independent annotators, and we made sure we had proof for the label from reliable sources on the Web. Then, the annotations were consolidated after a discussion until agreement was achieved for each example.

All data is freely available under a Creative Commons Attribution 3.0 Unported (CC BY 3.0) license, and is accessible on the competition’s website (http://competitions.codalab.org/competitions/20022).

In addition to the provided annotated data, we also allowed the participants to use unlabelled data from the Qatar Living forum (http://alt.qcri.org/semeval2016/task3/data/uploads/QL-unannotated-data-subtaskA.xml.zip), as well as additional external resources, which they had to mention explicitly in their submissions.

Note that the class distribution in the training, development, and test sets differs, especially for Subtask B. The reason is the way the data was prepared: the three datasets were prepared in stages because of the very time-consuming annotation process.

For each dataset annotation stage, we had to choose between releasing all the available annotated data and releasing sets with a similar label distribution. In the end, we decided to release all the available data, although we were aware that this would result in sets with different distributions and, in some cases, unbalanced categories.
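The difference in class distribution across the splits can be read off Table 2 directly. The following is a minimal sketch that turns the Subtask B counts into per-split label shares; the counts are copied from Table 2, while the names `COUNTS`, `SPLITS`, and `label_shares` are illustrative and not part of the task's released tooling.

```python
# Label counts copied from Table 2 (Subtask B); order: train, dev, test.
COUNTS = {
    "True": (166, 29, 34),
    "False": (135, 31, 45),
    "NonFactual": (194, 52, 231),
}
SPLITS = ("train", "dev", "test")


def label_shares(counts):
    """Return {split: {label: share}} with shares rounded to 3 decimals."""
    shares = {}
    for i, split in enumerate(SPLITS):
        total = sum(c[i] for c in counts.values())
        shares[split] = {
            label: round(c[i] / total, 3) for label, c in counts.items()
        }
    return shares


shares = label_shares(COUNTS)
for split in SPLITS:
    print(split, shares[split])
```

With these counts, NonFactual accounts for roughly 39% of the training answers but roughly 75% of the test answers.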

3.2 Training Data for Subtask A

To create the dataset for the task, we chose to augment a pre-existing dataset for cQA with factuality annotations; this allowed us to stress the difference between (a) distinguishing a good vs. a bad answer, and (b) distinguishing a factually-true vs. a factually-false one. In particular, we added annotations for factuality to the CQA-QL-2016 dataset from SemEval-2016 Task 3 on Community Question Answering (Nakov et al., 2016a).

In CQA-QL-2016, the data is organized in question–answer threads (from the Qatar Living forum). Each question has a subject, a body, and meta information: question ID, date and time of posting, user name and ID, and category (e.g., Computers and Internet and Moving to Qatar).

We analyzed the forum questions and we defined three categories, related to their factuality. We then annotated the questions using Amazon Mechanical Turk. The three factuality categories are as follows:

• Factual: The question asks for factual information, which can be answered by checking various information sources, and it is not ambiguous (e.g., “What is Ooredoo customer service number?”).

• Opinion: The question asks for an opinion or advice, not for a fact (e.g., “Can anyone recommend a good Vet in Doha?”).

• Socializing: Not a real question, but rather socializing/chatting. This can also mean expressing an opinion or sharing some information, without really asking anything of general interest (e.g., “What was your first car?”).

Table 1 shows the distribution of the question labels in the training, development, and test datasets. Overall, these datasets contain 1118, 239, and 953 annotated questions, respectively.

3.3 Training Data for Subtask B

For subtask B, we annotated for veracity the answers to the questions with a Factual label for subtask A. Note that in CQA-QL-2016, each answer has a subject, a body, meta information (answer ID, user name, and ID), the question that it answers, and a judgement about how well it answers the question of its thread (Good, Bad, or Potentially Useful).

We annotated the Good answers for factuality based on the assumption that a Good answer means it provides factual information, whether it is true or false. All Bad and Potentially Useful answers are automatically considered as Non-Factual. The factuality labels are described as follows:

• Factual – True: The answer is True and can be proven with an external resource. (Q: “I wanted to know if there were any specific shots and vaccinations I should get before coming over [to Doha].”; A: “Yes there are; though it varies depending on which country you come from. In the UK; the doctor has a list of all countries and the vaccinations needed for each.”). (Footnote 7: The answer is factually true and this can be seen at http://wwwnc.cdc.gov/travel/destinations/traveler/none/qatar)

• Factual – False: The answer gives a factual response, but it is False and this can be proven using an external resource. (Q: “Can I bring my pitbulls to Qatar?”; A: “Yes you can bring it but be careful this kind of dog is very dangerous.”). (Footnote 8: The answer is incorrect since pitbulls are included in the list of breeds banned in Qatar. See http://canvethospital.com/pet-relocation/banned-dog-breed-list-qatar-2015/)

• Factual – Partially True: The answer contains more than one claim, and only some of these claims could be manually verified. (Q: “I will be relocating from the UK to Qatar […] is there a league or TT clubs / nights in Doha?”; A: “Visit Qatar Bowling Center during thursday and friday and you’ll find people playing TT there.”). (Footnote 9: The place mentioned in the answer has table tennis, but we do not know on which days. See http://www.qatarbowlingfederation.com/bowling-center/)

• Factual – Conditionally True: The answer is True in some cases, and False in others, depending on some conditions that the answer does not mention. (Q: “My wife does not have NOC from Qatar Airways; but we are married now so can i bring her legally on my family visa as her husband?”; A: “Yes you can.”). (Footnote 10: This answer can be true, but this depends upon some conditions. See http://www.onlineqatar.com/info/dependent-family-visa.aspx)

• Factual – Responder Unsure: The person giving the answer is not sure about the veracity of his/her statement. (e.g., “Possible only if government employed. That’s what I heard.”)

• Non-Factual: The answer does not provide factual information in response to the question; it can be an opinion or advice that cannot be verified. (e.g., “Its better to buy a new one.”).

We further discarded answers whose factuality was so time-sensitive that it makes no sense to check whether the statements are true or false (e.g., “It is Friday tomorrow.”, “It was raining last week.”).

Moreover, many answers are arguably somewhat time-sensitive, e.g., “There is an IKEA in Doha.” is true only after IKEA opened, but not before that. In such cases, we just used the present situation as a point of reference. We further discarded the answers for which the annotators could not find any information.

Ultimately, we consolidated the above fine-grained labels into the following coarse-grained labels, which we used for subtask B:

• Factual – True: Contains answers with proven true, non-contradictory statements. This includes the answers with the label Factual – True from above. This label is used for answers one can trust as a true statement.

• Factual – False: Contains answers with statements that are proven to be false or not completely true. This includes answers with the following fine-grained factuality labels: Factual – False, Factual – Partially True, Factual – Conditionally True, Factual – Responder Unsure. We also use this label for answers that contain a statement for which the person giving the answer expresses uncertainty in the claim.

• Non-Factual: These are either non-factual statements or statements that could be factual, but no information about them could be found, i.e., we could find no way to check whether the statement was true or false. This category also includes some statements that have been incorrectly annotated as a Good answer. It also includes the very time-sensitive statements described before, such as “It is Friday tomorrow.”. The Bad and the Potentially Useful answers from CQA-QL-2016 also fall in this category.
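The consolidation described above amounts to a small lookup table plus one rule for answers that were not judged Good in CQA-QL-2016. The sketch below is illustrative only: the label strings follow the fine-grained list given earlier in this section, and the names `FINE_TO_COARSE` and `coarse_label` are assumptions, not identifiers from the released data or scripts.

```python
# Fine-grained label -> coarse-grained Subtask B label, as described in the text.
FINE_TO_COARSE = {
    "Factual - True": "Factual - True",
    "Factual - False": "Factual - False",
    "Factual - Partially True": "Factual - False",
    "Factual - Conditionally True": "Factual - False",
    "Factual - Responder Unsure": "Factual - False",
    "Non-Factual": "Non-Factual",
}


def coarse_label(cqa_judgement, fine_label=None):
    """Map a CQA-QL-2016 judgement and a fine-grained label to a coarse label.

    Bad and Potentially Useful answers are treated as Non-Factual without
    any further factuality annotation; only Good answers carry a fine label.
    """
    if cqa_judgement != "Good":
        return "Non-Factual"
    return FINE_TO_COARSE[fine_label]


print(coarse_label("Good", "Factual - Conditionally True"))
print(coarse_label("Potentially Useful"))
```

Keeping the mapping in one dictionary makes it easy to see that only the Factual – True label survives as a positive class, while every partially or conditionally correct answer is folded into Factual – False.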

As we have mentioned above, we have annotated the answers to the Factual questions selected from the Qatar Living forum. We targeted very high quality annotation, and thus we did not use crowd-sourcing, as a pilot experiment showed that the task was very difficult and that it was not possible to guarantee that Turkers would do all the necessary verification and gather evidence from trusted sources. Instead, all examples were first annotated independently by three of us, and then we carefully discussed each example to come up with a final label. The distribution of the labels in the training, development, and test datasets is shown in Table 2. (Footnote 11: Although not very big, our dataset is larger than datasets used for similar problems, e.g., Ma et al. (2015) experimented with 226 rumors for rumor detection, and Popat et al. (2016) used 100 Wiki hoaxes for credibility assessment of textual claims.)

3.4 Evaluation

Both subtasks are three-way classification problems. In subtask A, the questions were to be classified as Factual, Opinion, or Socializing. Similarly, in subtask B there were also three target categories for the answers: Factual - True, Factual - False, and Non-Factual.

We further scored the submissions based on Accuracy, macro-F1, and average recall (AvgRec). (Footnote 12: Average recall has some attractive properties and has been used in previous SemEval tasks, e.g., Nakov et al. (2016b); Rosenthal et al. (2017).) For subtask B, we also report mean average precision (MAP), where the Factual - True instances were considered to be positive, and the remaining ones were negative. The official evaluation measure for both subtasks was Accuracy.
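To make the three classification measures concrete, here is a minimal sketch that computes Accuracy, macro-F1, and AvgRec for a three-way task in plain Python. It is not the official scorer; the function names and the toy labels are illustrative assumptions. MAP is left out because it needs a ranking of the answers, and the text above does not specify how that ranking is built.

```python
def confusion_counts(gold, pred, label):
    """Return (true positives, false positives, false negatives) for one label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    return tp, fp, fn


def evaluate(gold, pred, labels):
    """Compute accuracy, macro-averaged F1, and average per-class recall."""
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    f1s, recalls = [], []
    for label in labels:
        tp, fp, fn = confusion_counts(gold, pred, label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    return {
        "accuracy": accuracy,
        "macro_f1": sum(f1s) / len(labels),
        "avg_rec": sum(recalls) / len(labels),
    }


# Toy example: a system that always predicts the same class.
LABELS = ["True", "False", "NonFactual"]
gold = ["True", "False", "NonFactual", "NonFactual"]
pred = ["NonFactual"] * 4
scores = evaluate(gold, pred, LABELS)
print(scores)
```

In the toy example, a constant prediction gets half of the instances right, yet its AvgRec is only 1/3, because two of the three classes have zero recall. This is why macro-averaged measures are informative alongside Accuracy when the labels are unbalanced.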

4 Participants and Results

We received 17 official submissions for Subtask A and 11 official submissions for Subtask B. Below we report the evaluation results.

Table 3 presents the results for subtask A on question classification. The results are based on the official submissions in the evaluation phase. In this subtask, all of the submitted systems managed to improve over the majority class baseline, and several teams achieved similarly good results. Whenever several teams achieve the same result with respect to the main evaluation measure, i.e., Accuracy, we rank them according to the F1 score, and then by AvgRec if a tie remains.
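The ranking rule with its two tie-breakers can be expressed as a sort on a tuple key. The sketch below uses four rows copied from Table 3; the names `RESULTS` and `rank` are illustrative.

```python
# (team, accuracy, f1, avg_rec): four rows copied from Table 3.
RESULTS = [
    ("CodeForTheChange", 0.630, 0.442, 0.513),
    ("TMLab", 0.834, 0.725, 0.764),
    ("chchao", 0.630, 0.454, 0.523),
    ("Fermi", 0.840, 0.718, 0.735),
]


def rank(results):
    """Sort by Accuracy, then F1, then AvgRec, all in descending order."""
    return sorted(results, key=lambda r: (r[1], r[2], r[3]), reverse=True)


ranked = rank(RESULTS)
for position, (team, acc, f1, avg_rec) in enumerate(ranked, start=1):
    print(position, team, acc, f1, avg_rec)
```

Here chchao and CodeForTheChange have the same Accuracy (0.630), so the higher F1 of chchao (0.454 vs. 0.442) places it first, which matches their order in Table 3.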

Table 4 presents the results based on the evaluation phase on the test set for predicting answer factuality labels. This subtask was more difficult, as the majority class baseline was very high due to label imbalance. No team managed to improve over that baseline, but several teams had results that were very close to it.
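A majority class baseline is straightforward to sketch. The version below assumes that the majority label is chosen on the training labels and then predicted for every test instance; the text does not state how the organizers selected it, and the toy labels are made up for illustration.

```python
from collections import Counter


def majority_label(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the training-set majority label."""
    label = majority_label(train_labels)
    return sum(1 for g in test_labels if g == label) / len(test_labels)


# Toy data (not the task data): the test set is more skewed than the training set.
train = ["NonFactual"] * 5 + ["True"] * 3 + ["False"] * 2
test = ["NonFactual"] * 8 + ["True", "False"]
acc = baseline_accuracy(train, test)
print(majority_label(train), acc)
```

When one class dominates the test set, as in this toy split, the constant prediction already scores high on Accuracy, which is the effect described above for Subtask B.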

5 Discussion

In the evaluation phase of the competition, the participants had to specify one official submission and were allowed up to two contrastive submissions. In the post-evaluation phase, they could upload an unlimited number of contrastive submissions. Below, we will only discuss the official submissions. The contrastive submissions, the ablation studies, and the experiments with different techniques are described by the participants in their respective system description papers.

The best system for Subtask A was by team Fermi (IIIT Hyderabad). They used Google’s Universal Sentence Encoder (Cer et al., 2018) and XGBoost (Chen and Guestrin, 2016).

The best system for Subtask B was by team AUTOHOME-ORCA (Autohome Inc. and Beijing University of Posts and Telecommunications), who used BERT (Devlin et al., 2019).

They achieved their best results by using an ensemble, and by also using question meta-information (category and subject) in addition to the question and the answer text. They concatenated the category, the subject and the body of the question into the first part separated by [SEP]. The replier’s username and statement were concatenated as the second part. The two parts separated by [SEP] were fed into the BERT model for answer classification. Then, based on the sequential outputs of the BERT model, variants such as average-pooling and bi-LSTM were used to produce the final results. To tackle the problem of insufficient training data, they further used data augmentation based on translation with Google Translate: in particular, they performed consecutive English-Chinese and Chinese-English translation to generate more synthetic training data.
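The two-part input described above can be sketched as plain string formatting. This is a hypothetical reconstruction, not the team's code: it assumes the fields inside each part are joined with spaces and that the two parts follow the usual BERT sentence-pair layout `[CLS] A [SEP] B [SEP]`. In practice a BERT tokenizer would add the special tokens itself; they are written out here only to show the layout. The function name and the example thread are made up.

```python
def build_bert_input(category, subject, body, username, answer):
    """Format a question/answer pair in the BERT sentence-pair layout.

    First part: question category, subject, and body.
    Second part: replier's username and the answer text.
    Empty fields are skipped.
    """
    first = " ".join(
        part.strip() for part in (category, subject, body) if part.strip()
    )
    second = " ".join(part.strip() for part in (username, answer) if part.strip())
    return f"[CLS] {first} [SEP] {second} [SEP]"


example = build_bert_input(
    category="Visas",
    subject="Family visa",
    body="Can I sponsor my wife?",
    username="user42",
    answer="Yes you can.",
)
print(example)
```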

Overall, the submitted systems for the two subtasks used a number of pre-processing steps to clean the text of the question and of the answer. As shown by the DOMLIN team, the pre-processing of the data turns out to be crucial. They reported up to 5% improvement in terms of accuracy when cleaning the unannotated forum data before fine-tuning a BERT model. Common preprocessing steps included removing or replacing URLs, numbers, punctuation, symbols, and HTML tags, as well as spell-checking and expansion of contractions. DUTH also used lemmatization and stopword removal.
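A few of the preprocessing steps listed above can be sketched with the Python standard library. This is an illustrative pipeline under stated assumptions, not any team's actual code: the placeholder tokens `_url_` and `_num_`, the tiny contraction table, and the order of the steps are all choices made for this example. Spell-checking, lemmatization, and stopword removal are omitted.

```python
import html
import re

# A deliberately tiny, illustrative contraction table.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")
PUNCT_RE = re.compile(r"[^\w\s]")  # underscores count as word characters


def clean(text):
    """Strip HTML, replace URLs and numbers, expand contractions, drop punctuation."""
    text = html.unescape(text)          # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)        # remove HTML tags
    text = URL_RE.sub(" _url_ ", text)  # replace URLs with a placeholder
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = NUM_RE.sub(" _num_ ", text)  # replace numbers with a placeholder
    text = PUNCT_RE.sub(" ", text)      # placeholders survive this step
    return " ".join(text.split())       # normalize whitespace


cleaned = clean("<p>Can't find it, see http://example.com or call 4444 1234!</p>")
print(cleaned)
```

Replacing URLs and numbers with placeholders, rather than deleting them, keeps the information that a link or a number was present, which may itself be a useful signal for factuality.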

The submitted systems used a wide range of strategies for training their models. A sizable part of the systems used manually crafted features such as linguistic, syntactic, stylistic, and semantic features. Moreover, the systems used task-specific information such as answer ranking and rating. ColumbiaNLP also computed an average cosine similarity of one answer with respect to the other answers in the thread for subtask B, assuming that bad answers would differ substantially from the remaining answers.
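The thread-level similarity feature mentioned for ColumbiaNLP can be sketched as follows. The text does not say which text representation the team used, so this example substitutes simple bag-of-words count vectors; the function names and the toy thread are illustrative assumptions.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words count vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def avg_similarity_to_others(answers):
    """For each answer, average its cosine similarity to all other answers."""
    vectors = [bow(a) for a in answers]
    scores = []
    for i, v in enumerate(vectors):
        others = [cosine(v, w) for j, w in enumerate(vectors) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


thread = ["yes you can", "yes you can", "buy a new car"]
sims = avg_similarity_to_others(thread)
print(sims)
```

In the toy thread, the third answer shares no words with the other two and therefore gets an average similarity of zero, which is the kind of outlier the feature is meant to expose.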

While some of the approaches used character and word $n$-gram information, the teams also used word- and sentence-level embeddings. CodeForTheChange evaluated different classification algorithms fed with Skip-Thought vectors, and ultimately found that neural networks performed best for both subtasks with either concatenation or averaging over the vectors of the available texts.

Fermi evaluated different embedding models (InferSent, Concatenated Power Mean Word Embedding, Lexical Vectors, ELMo, and the Universal Sentence Encoder), which were used in subtask A to feed an XGBoost classifier. ColumbiaNLP used ULMFiT, but performed additional unsupervised tuning of the language model on questions, answers and question-answer pairs from the Qatar Living Forum. TMLab’s system used the Universal Sentence Encoder.

A common neural network architecture was LSTM, where YNU-HPCC combined LSTM with an attention mechanism. TueFact used comment chain embeddings. Other machine learning algorithms that participants tried include Random Forest, Adaboost, Perceptron, and SVM, inter alia.

While for question classification (subtask A), all the necessary information was contained in the question text and in the metadata, subtask B required additional resources. Most teams used the provided additional unannotated forum data in order to pre-train their language models or to extract more data with weak supervision (DOMLIN). Furthermore, several teams used other means for data augmentation such as SQuAD (BLCU_NLP) or external Web information (SolomonLab).

6 Conclusion

We have described SemEval-2019 Task 8 on Fact Checking in Community Question Answering Forums. We received 17 and 11 submissions for Subtasks A and B, respectively. Overall, subtask A (question classification) was easier and all submitted systems managed to improve over the majority class baseline. However, Subtask B (answer classification) proved to be much more challenging, and no team managed to improve over the majority class baseline, even though several teams came very close. For this latter subtask, using external resources and preprocessing proved to be crucial.

Acknowledgments

This research is part of the Tanbih project (http://tanbih.qcri.org/), which aims to limit the effect of “fake news”, propaganda and media bias by making users aware of what they are reading. The project is developed in collaboration between the Qatar Computing Research Institute (QCRI), HBKU and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

Bibliography


1. Agichtein et al. (2008). Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. 2008. Finding high-quality content in social media. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, pages 183–194, Palo Alto, CA, USA.
2. Avvaru and Pandey (2019). Adithya Avvaru and Anupam Pandey. 2019. CodeForTheChange at SemEval-2019 task 8: Skip-thoughts for fact checking in community question answering. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’19, Minneapolis, MN, USA.
3. Ba et al. (2016). Mouhamadou Lamine Ba, Laure Berti-Equille, Kushal Shah, and Hossam M Hammady. 2016. VERA: A platform for veracity estimation over web data. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 159–162, Montreal, Canada.
4. Bairaktaris et al. (2019). Anastasios Bairaktaris, Symeon Symeonidis, and Avi Arampatzis. 2019. DUTH at SemEval-2019 task 8: Part-of-speech features for question classification. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’19, Minneapolis, MN, USA.
5. Baly et al. (2018). Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, James Glass, and Preslav Nakov. 2018. Predicting factuality of reporting and bias of news media sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, pages 3528–3539, Brussels, Belgium.
6. Baly et al. (2019). Ramy Baly, Georgi Karadzhov, Abdelrhman Saleh, James Glass, and Preslav Nakov. 2019. Multi-task ordinal regression for jointly predicting the trustworthiness and the leading political ideology of news media. In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’19, Minneapolis, MN, USA.
7. Banerjee and Han (2009). Protima Banerjee and Hyoil Han. 2009. Answer credibility: A language modeling approach to answer validation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’09, pages 157–160, Boulder, CO, USA.
8. Canini et al. (2011). Kevin R. Canini, Bongwon Suh, and Peter L. Pirolli. 2011. Finding credible information sources in social networks based on content and social structure. In Proceedings of the IEEE International Conference on Privacy, Security, Risk, and Trust, and the IEEE International Conference on Social Computing, SocialCom/PASSAT ’11, pages 1–8, Boston, MA, USA.