TL;DR
This study evaluates methods for matching vaccination-related webpages to their original research articles, demonstrating that simple ranking tools can effectively identify credible sources and potentially reduce misinformation online.
Contribution
The paper introduces and tests a tool that matches vaccination webpages to their source research articles, comparing different ranking approaches including CCA, and finds simple methods are quite effective.
Findings
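The best baseline (TF-IDF with a document frequency threshold) ranked the correct source article first for 26.2% of test webpages.
Augmenting T-SVD-based methods with CCA generally improved their results, but no CCA-based approach outperformed the best baseline.
The best baseline ranked the correct source article within the top 50 candidates for 51.5% of test webpages.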
Table 1

| Feature representation & reduction methods | Median rank (IQR) | Recall@1 | Recall@50 |
|---|---|---|---|
| Threshold parameters | | | |
| Binary | 238.5 (1-9154) | 0.251 | 0.417 |
| TF | 427.5 (5-10075.25) | 0.188 | 0.368 |
| TF-IDF | 41 (1-799.25) | 0.262 | 0.515 |
| T-SVD (100 components) | | | |
| Binary | 8858 (1198-34252.25) | 0.049 | 0.097 |
| TF | 38491.5 (4968.75-104229.25) | 0.046 | 0.077 |
| TF-IDF* | 2768 (203.5-24884.5) | 0.07 | 0.168 |
| T-SVD (200 components) | | | |
| Binary | 5522.5 (495-27377.5) | 0.073 | 0.144 |
| TF | 36429 (3924.75-99717) | 0.054 | 0.089 |
| TF-IDF* | 1513 (84.75-15572.25) | 0.097 | 0.225 |
| T-SVD (400 components) | | | |
| Binary | 3211.5 (188-21040.25) | 0.098 | 0.184 |
| TF | 31220 (2967.25-96203.5) | 0.066 | 0.1 |
| TF-IDF* | 720 (36-9674.25) | 0.126 | 0.276 |
| T-SVD (800 components) | | | |
| Binary | 1606 (41.75-15311.75) | 0.133 | 0.263 |
| TF | 29421 (2245.25-92871.5) | 0.069 | 0.117 |
| TF-IDF* | 385.5 (13-6211.25) | 0.15 | 0.335 |
| T-SVD (1600 components) | | | |
| Binary | 824.5 (9-12704.5) | 0.173 | 0.331 |
| TF | 29519.5 (1597.5-93890) | 0.077 | 0.13 |
| TF-IDF* | 219 (6-4145.75) | 0.174 | 0.371 |

Table 2

| Method (CCA dimensions) | Median rank (IQR) | Recall@1 | Recall@50 |
|---|---|---|---|
| 100 T-SVD components | | | |
| No CCA* | 2768 (203.5-24884.5) | 0.07 | 0.168 |
| 50 | 318 (23-3381) | 0.099 | 0.319 |
| 100 | 475 (20-4635.5) | 0.101 | 0.317 |
| 200 T-SVD components | | | |
| No CCA* | 1513 (84.75-15572.25) | 0.097 | 0.225 |
| 50 | 322.5 (20-2940) | 0.093 | 0.314 |
| 100 | 200 (10-1982.75) | 0.133 | 0.358 |
| 200 | 253.5 (11-4198) | 0.14 | 0.355 |
| 400 T-SVD components | | | |
| No CCA* | 720 (36-9674.25) | 0.126 | 0.276 |
| 50 | 575 (60-5055.5) | 0.051 | 0.234 |
| 100 | 268.5 (15-2696.5) | 0.103 | 0.325 |
| 200 | 185.5 (7-2506.75) | 0.14 | 0.38 |
| 400 | 270 (11-5581) | 0.136 | 0.349 |
| 800 T-SVD components | | | |
| No CCA* | 385.5 (13-6211.25) | 0.150 | 0.335 |
| 50 | 3806.5 (279.75-28002.75) | 0.017 | 0.122 |
| 100 | 1100 (29-15787) | 0.031 | 0.21 |
| 200 | 409 (27-10816) | 0.084 | 0.296 |
| 400 | 291.5 (15-9859) | 0.113 | 0.345 |
| 800 | 1437 (34-34434.75) | 0.075 | 0.272 |
| 1600 T-SVD components | | | |
| No CCA* | 219 (6-4145.75) | 0.174 | 0.371 |
| 50 | 58164.5 (19678-117859.25) | 0.0 | 0.009 |
| 100 | 47806 (14104.5-110966.5) | 0.001 | 0.023 |
| 200 | 37414.5 (7236.75-92341.25) | 0.004 | 0.037 |
| 400 | 30554.5 (3454-91052.25) | 0.005 | 0.06 |
| 800† | NA | NA | NA |
| 1600† | NA | NA | NA |
Recommending research articles to consumers of online vaccination information
Eliza Harrison
Centre for Health Informatics
Australian Institute of Health Innovation
Macquarie University
Paige Martin
Centre for Health Informatics
Australian Institute of Health Innovation
Macquarie University
Didi Surian
Centre for Health Informatics
Australian Institute of Health Innovation
Macquarie University
Adam G. Dunn
Discipline of Biomedical Informatics and Digital Health
Faculty of Medicine and Health
The University of Sydney
Abstract
Online health communications often provide biased interpretations of evidence and have unreliable links to the source research. We tested the feasibility of a tool for matching webpages to their source evidence. From 207,538 eligible vaccination-related PubMed articles, we evaluated several approaches using 3,573 unique links to webpages from Altmetric. We evaluated methods for ranking the source articles for vaccine-related research described on webpages, comparing simple baseline feature representation and dimensionality reduction approaches to those augmented with canonical correlation analysis (CCA). Performance measures included the median rank of the correct source article; the percentage of webpages for which the source article was correctly ranked first (recall@1); and the percentage ranked within the top 50 candidate articles (recall@50). While augmenting baseline methods using CCA generally improved results, no CCA-based approach outperformed a baseline method, which ranked the correct source article first for over one quarter of webpages and in the top 50 for more than half. Tools to help people identify evidence-based sources for the content they access on vaccination-related webpages are potentially feasible and may support the prevention of bias and misrepresentation of research in news and social media.
Keywords: research communications; news media; information retrieval; vaccination
1 Background
The communication of health and medical research online provides a critical resource for the public. More than three-quarters of the UK public report an interest in biomedical research, with 42% having actively sought out content relating to medical or health research in 2015 [18]. Nearly all searches for health information take place online via search engines [3, 18, 9, 11]. Internet searches are a common way for people to engage with health research and the communication of health research on news websites and other forums, and have the potential to alter health beliefs and decisions [39].
The communication of health research in news and social media is associated with several challenges. Studies with fewer participants and of lower methodological rigour are more common in news media [14, 35], and research from authors with conflicts of interest tends to receive more attention in news and social media [12]. As many as half of all news reports manipulate or sensationalise study results to emphasise the benefits of experimental treatments [41].
Despite issues with the reliability of health information online, most people trust what they encounter [10, 11], and are inconsistent in their efforts to validate health information using appropriate sources [8, 11], likely because they find it difficult to do so. Where attempts to assess the credibility of health information are made, the visibility and accessibility of sources such as scientific research articles are important criteria by which users assess the quality of online health communications [8, 11]. Individuals are also subject to order-effect biases that impact their perception of the evidence presented by online communications of health research [25], and tend to believe information that aligns with their current knowledge of a health topic [11].
The representation of medical research in the public domain is particularly important in relation to vaccination, where vocal critics actively seek to erode trust in the safety and effectiveness of vaccines and immunisation programs. In 2019, the World Health Organisation listed vaccine hesitancy (the reluctance or refusal to vaccinate) as one of the ten most significant threats to global health [40]. There is a clear risk that the misrepresentation of scientific evidence and amplification of misinformation by social media may be major contributing factors to further outbreaks of vaccine-preventable diseases in future [23].
The rise of vaccine hesitancy as a global public health issue is in part driven by the increased pervasiveness of anti-vaccination sentiment in search engine results [21] and the mainstream news media [24], as well as the growth of social media as a platform for the provision of a diverse range of information sources to the public [38]. Discussion of the safety and efficacy of vaccines is a common theme in news reports and low-quality information is common [6]. On webpages specifically advocating against vaccination, the majority cite safety risks including illness, damage, or death [2, 20].
To be able to identify biases and misrepresentation in the communication of health research online, we need to be able to quickly identify the original source literature for that research. While existing services such as Altmetric (https://www.altmetric.com/) can be used to identify links to scientific source material using Digital Object Identifiers (DOIs), Uniform Resource Locators (URLs), or other identifiers such as PubMed IDs (PMIDs), in most cases these identifiers must be embedded in hyperlinks to enable their tracking. Other media services that offer more complete tracking of media mentions of research tend to be for-profit subscription services that support organisations wanting to keep track of their research outputs. These services are source-centric—they start with a research article and track the media that references it—and may not easily support use cases where a member of the public is interested in accessing the source research that underpins the information on webpages communicating health-related research to the public.
Our aim was to evaluate methods for automatically identifying source literature by recommending articles for webpages communicating vaccination research to the public. To do this, we made use of a large set of reported links between vaccination-related webpages and the scientific literature they reference tracked by Altmetric.
2 Methods
2.1 Study data
The study data comprised a set of research articles from PubMed linked to a set of webpages via Altmetric. To construct the corpus of research articles from PubMed, we retrieved all articles from PubMed by searching for “vaccine”, automatically expanded to include searches for the plural form and “vaccine” as a Medical Subject Heading (MeSH) term. Title and abstract text for each article were extracted using the National Center for Biotechnology Information (NCBI) E-Utilities Application Programming Interface (API) (https://www.ncbi.nlm.nih.gov/books/NBK25501/). Any PubMed articles that did not include at least 100 words after concatenating title and abstract were excluded from the analysis, and the remaining 207,538 articles formed the PubMed corpus (Figure 1). The search was conducted in July 2018.
We then used the Altmetric API to identify the set of research communications that linked to one or more of the articles in the PubMed corpus. We defined research communications to include news articles, blogs and non-social media posts that discuss the outcomes of vaccine-related research. We crawled each URL to access the web articles, and concatenated contiguous blocks of text from each webpage to form the basis of the data used in the following analyses. Text from the set of webpages was accessed in July 2018. Webpages were excluded if they did not include at least 100 words of text, as were any identified as non-English using the Google Code language-detection library (https://code.google.com/p/language-detection/). We also excluded web articles with significant amounts of exact duplicate text. This was common where articles were published on multiple online platforms owned by a single entity, often with only minor changes in title, content, or formatting. To remove these duplicates, we identified webpages for which the longest common substring between any two records linked to a PMID was greater than 50% of the total length of the longer of the two webpages. We then randomly selected webpages such that no PMID was mapped to more than one of any set of similar webpages. Note that after selecting unique examples of linked webpages and research articles, no two webpages had a longest common substring overlap of more than 10% of the total length.
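The duplicate check described above can be sketched as follows. This is a minimal illustration using Python's standard `difflib`, not the authors' implementation; the function names and example strings are hypothetical, and only the 50% threshold comes from the text.

```python
from difflib import SequenceMatcher


def longest_common_substring_ratio(a: str, b: str) -> float:
    """Longest common substring length divided by the length of the longer text."""
    if not a or not b:
        return 0.0
    # autojunk=False turns off difflib's heuristic that ignores very frequent
    # characters, which would otherwise distort matches on long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b))


def is_near_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag two webpages as duplicates when the overlap exceeds the threshold."""
    return longest_common_substring_ratio(a, b) > threshold


original = "The new vaccine trial reported strong results in adults."
syndicated = "Syndicated: " + original
unrelated = "Completely different text about influenza season."
print(is_near_duplicate(original, syndicated))  # the syndicated copy is flagged
print(is_near_duplicate(original, unrelated))   # the unrelated page is not
```

`SequenceMatcher` is quadratic in the worst case, so a corpus-scale run would likely need a faster substring method (for example a suffix automaton); that choice is outside what the text specifies.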
The resulting dataset included 207,538 research articles, of which 4,333 had known links to one or more of 8,458 distinct webpages (Figure 1). There were 1,934 articles that were referenced on two or more webpages, with one article referenced by 98 distinct webpages. Conversely, there were 1,418 webpages that referenced 2 or more articles, one of which had known links to 68 of the articles in the PubMed corpus. To generate a final set of reported links for which no webpage linked to more than one PubMed article in the final corpus and vice versa, we first selected any article and webpage pairs for which the corresponding PMID and URL were both present only once in the dataset (1:1 links). For each of the remaining articles, we instead selected the linked webpage with the greatest number of words and not yet present in the final corpus. This resulted in a final set of 3,573 PMID-URL pairs of individually linked articles and webpages, which we refer to as the known links set.
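The two-pass selection of known links can be illustrated with a short sketch. It assumes the links are available as `(pmid, url)` tuples and that a word count is known for each URL; the function name, the toy identifiers, and the order in which the remaining articles are processed (sorted by PMID) are assumptions, since the text does not specify them.

```python
from collections import Counter


def select_known_links(links, word_counts):
    """Return a dict pmid -> url in which every PMID and URL appears at most once."""
    pmid_freq = Counter(pmid for pmid, _ in links)
    url_freq = Counter(url for _, url in links)

    final, used_urls = {}, set()
    # Pass 1: keep pairs whose PMID and URL each occur only once (1:1 links).
    for pmid, url in links:
        if pmid_freq[pmid] == 1 and url_freq[url] == 1:
            final[pmid] = url
            used_urls.add(url)

    # Pass 2: for each remaining article, pick the longest webpage not yet used.
    remaining = {}
    for pmid, url in links:
        if pmid not in final:
            remaining.setdefault(pmid, []).append(url)
    for pmid in sorted(remaining):  # processing order is an assumption
        candidates = [u for u in remaining[pmid] if u not in used_urls]
        if candidates:
            best = max(candidates, key=lambda u: word_counts[u])
            final[pmid] = best
            used_urls.add(best)
    return final


links = [("1", "a"), ("2", "b"), ("2", "c"), ("3", "c"), ("3", "d")]
word_counts = {"a": 100, "b": 200, "c": 500, "d": 300}
known_links = select_known_links(links, word_counts)
print(known_links)
```

In this toy example, article "2" takes webpage "c" because it is the longest candidate, which leaves webpage "d" for article "3".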
2.2 Feature extraction and dimensionality reduction
To generate a term-based vector representation of each of the linked articles and webpages, we pre-processed each document by removing punctuation and words consisting entirely of numeric characters. We then used the remaining words to construct a vocabulary of terms common to both corpora (terms that existed in at least one research article and at least one webpage).
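A minimal sketch of this pre-processing step is given below. Lower-casing, whitespace tokenisation, and replacing punctuation with spaces are assumptions made for illustration; the text only states that punctuation and purely numeric words were removed. The function names are hypothetical.

```python
import string

# Map every ASCII punctuation character to a space (an assumed choice; the
# text only says that punctuation was removed).
_PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text):
    """Lower-case, strip punctuation, and drop purely numeric words."""
    words = text.lower().translate(_PUNCTUATION_TO_SPACE).split()
    return [w for w in words if not w.isdigit()]


def shared_vocabulary(articles, webpages):
    """Sorted terms that occur in at least one article and at least one webpage."""
    article_terms = {t for doc in articles for t in tokenize(doc)}
    webpage_terms = {t for doc in webpages for t in tokenize(doc)}
    return sorted(article_terms & webpage_terms)


articles = ["HPV vaccine efficacy in 2018: a randomised trial."]
webpages = ["New trial shows the HPV vaccine works, study of 2018 says"]
vocabulary = shared_vocabulary(articles, webpages)
print(vocabulary)
```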
Each article or webpage was then represented as a vector of numeric values based on one of three standard vector representations: binary, term frequency (TF), and term frequency-inverse document frequency (TF-IDF). Binary vectors were generated by recording the presence (value = 1) or absence (value = 0) of vocabulary terms in each document. The TF vector representation was defined as a count of the number of times each word appeared in the document. The TF-IDF score is given by the log-transformed TF value multiplied by the log-transformed inverse of the proportion of documents in which the feature was present. In contrast to term frequency, TF-IDF weights vary depending on how common the term is across the entire corpus, based on the assumption that words appearing more often in fewer documents (like the name of a specific vaccine or the outcomes measured in a research study) are likely to be more informative, while those that appear often across many documents (like “and”, “the”, or “vaccination”) are less informative [37, 32, 34].
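The three representations can be sketched in plain Python as follows. The exact weighting used in the study is not given in the text, so the form below, (1 + ln tf) × ln(N / df) with no smoothing or normalisation, is an assumption chosen to match the verbal description; `build_vectors` is a hypothetical helper.

```python
import math
from collections import Counter


def build_vectors(token_lists, vocabulary):
    """Return (binary, tf, tfidf) vectors for each tokenised document."""
    n_docs = len(token_lists)
    counts = [Counter(tokens) for tokens in token_lists]
    # Document frequency: number of documents containing each term.
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocabulary}

    binary, tf, tfidf = [], [], []
    for c in counts:
        binary.append([1 if c[t] > 0 else 0 for t in vocabulary])
        tf.append([c[t] for t in vocabulary])
        tfidf.append([
            # Assumed weighting: log-scaled TF times log of inverse document proportion.
            (1 + math.log(c[t])) * math.log(n_docs / doc_freq[t]) if c[t] > 0 else 0.0
            for t in vocabulary
        ])
    return binary, tf, tfidf


docs = [["vaccine", "vaccine", "hpv"], ["vaccine", "measles"]]
vocab = ["hpv", "measles", "vaccine"]
binary, tf, tfidf = build_vectors(docs, vocab)
print(binary[0], tf[0], tfidf[0])
```

In the toy corpus, "vaccine" appears in every document, so its TF-IDF weight is zero, while "hpv" (present in one of two documents) keeps a positive weight.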
In information retrieval methods, sparse representations of documents may be less useful for measuring document similarity or finding documents relevant to a search. This is expected in particular for short documents. To address issues of sparsity, dimensionality reduction methods either remove features that are expected to be less useful or transform the vector space representation into fewer dimensions.
We evaluated the use of two approaches. The first was a simple feature reduction method that uses threshold parameters. Features were removed by applying a maximum document frequency limit of 0.85 to the combined corpora vocabulary. As a result, those terms common to more than 85% of articles and webpages in the corpus were excluded from the term-based vector representation.
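The threshold step amounts to a document frequency filter, sketched below under the assumption that documents are already tokenised. The helper name and the toy corpus are hypothetical; the 0.85 limit comes from the text.

```python
def apply_max_document_frequency(token_lists, vocabulary, max_df=0.85):
    """Keep terms that appear in at most `max_df` of the documents."""
    n_docs = len(token_lists)
    doc_sets = [set(tokens) for tokens in token_lists]
    kept = []
    for term in vocabulary:
        document_frequency = sum(1 for s in doc_sets if term in s)
        if document_frequency / n_docs <= max_df:
            kept.append(term)
    return kept


# Ten toy documents: "the" is in all ten, "vaccine" in nine, "hpv" in two.
docs = [["the", "vaccine"]] * 7 + [["the", "vaccine", "hpv"]] * 2 + [["the"]]
kept_terms = apply_max_document_frequency(docs, ["hpv", "the", "vaccine"])
print(kept_terms)
```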
For the second dimensionality reduction approach we used truncated singular value decomposition (T-SVD). T-SVD works in a similar way to singular value decomposition (SVD) by decomposing a matrix into a product of matrices that contain singular vectors and singular values. The singular values can be used to understand the amount of variance in the data captured by the singular vectors. T-SVD allows more efficient computation than SVD since T-SVD approximates the decomposition by only considering a select few components, specified as an argument to the algorithm [13].
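To make the reduction concrete, the sketch below projects a small document-term matrix onto its leading singular vectors with NumPy. For clarity it computes a full SVD and then truncates, which gives the same projection as a truncated solver but without the efficiency gain described above; a practical pipeline would more likely use a dedicated routine such as scikit-learn's `TruncatedSVD`. The function name and toy matrix are illustrative assumptions.

```python
import numpy as np


def truncated_svd(X, n_components):
    """Project rows of X onto the top `n_components` right singular vectors."""
    # Full SVD for illustration only; singular values are returned in descending order.
    _, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]
    reduced = X @ components.T
    return reduced, components, singular_values[:n_components]


X = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5]])
reduced, components, singular_values = truncated_svd(X, n_components=2)
print(reduced.shape, singular_values)
```

Multiplying `reduced` by `components` gives the best rank-2 approximation of `X`; the direction with the smallest singular value is the one discarded.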
2.3 Ranking methods
We used cosine similarity as a standard measure of similarity between webpages and PubMed articles. For each webpage, we calculated the cosine similarity to all 205,037 articles in the test portion of the final document corpus to produce a ranked list.
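A minimal sketch of the ranking step is shown below, assuming each document is already a numeric vector over the same features. The identifiers (`pmid_a` and so on) and function names are hypothetical, and a real run over roughly 205,000 articles would use sparse matrix operations rather than Python lists.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_articles(webpage_vector, article_vectors):
    """Return article ids ordered from most to least similar to the webpage."""
    scored = sorted(
        article_vectors.items(),
        key=lambda item: cosine_similarity(webpage_vector, item[1]),
        reverse=True,
    )
    return [article_id for article_id, _ in scored]


webpage = [1, 1, 0]
articles = {"pmid_a": [1, 1, 0], "pmid_b": [1, 0, 0], "pmid_c": [0, 0, 1]}
ranking = rank_articles(webpage, articles)
rank_of_correct = ranking.index("pmid_a") + 1  # 1-based rank of the known source
print(ranking, rank_of_correct)
```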
We expected that there would be consistent differences between the language style used in article titles and abstracts, compared to that used in online research communications. For example, we expected that communications would replace technical jargon with simpler synonyms. Canonical correlation analysis (CCA) [15] is an algorithm designed to identify linear combinations of maximally correlated variables between complex, multivariate datasets. CCA captures and maps the correlations between two sets of variables into a single space, and thus the comparison for ranking can be made using a standard similarity measure. CCA has been used for joint dimensionality reduction across different spaces (e.g., text and images, text and text, etc.) [28, 33]. As a result, the CCA approach could be used to learn the alignment between the terms used in the articles and the terms used to describe the same concepts in research communications presented online. To test the CCA approach, we added it as an extra process in the pipeline, using training data to construct a transform (a matrix that may modify the number of features), and then applying that transform to the testing data before calculating the distance (Figure 2).
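The idea of learning a transform on paired training data and then applying it to unseen data can be sketched with a small linear CCA in NumPy. The text does not say which CCA implementation or settings were used, so everything here is an assumption for illustration: the whitening-plus-SVD formulation, the ridge term `reg`, the function names, and the synthetic data.

```python
import numpy as np


def _inverse_sqrt(cov):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T


def fit_cca(articles, webpages, n_dims, reg=1e-3):
    """Fit linear CCA on paired rows (row i of each matrix is one known link)."""
    a_mean, w_mean = articles.mean(axis=0), webpages.mean(axis=0)
    A, W = articles - a_mean, webpages - w_mean
    n = A.shape[0]
    # `reg` is an assumed ridge term that keeps the covariances invertible.
    cov_aa = A.T @ A / (n - 1) + reg * np.eye(A.shape[1])
    cov_ww = W.T @ W / (n - 1) + reg * np.eye(W.shape[1])
    cov_aw = A.T @ W / (n - 1)
    white_a, white_w = _inverse_sqrt(cov_aa), _inverse_sqrt(cov_ww)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, corr, Vt = np.linalg.svd(white_a @ cov_aw @ white_w, full_matrices=False)
    return {
        "article_mean": a_mean,
        "webpage_mean": w_mean,
        "article_proj": white_a @ U[:, :n_dims],
        "webpage_proj": white_w @ Vt[:n_dims].T,
        "correlations": corr[:n_dims],
    }


def transform(model, articles, webpages):
    """Map both document types into the shared CCA space."""
    return ((articles - model["article_mean"]) @ model["article_proj"],
            (webpages - model["webpage_mean"]) @ model["webpage_proj"])


# Synthetic paired data: both views are noisy linear functions of the same latent topics.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
articles = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
webpages = latent @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(200, 4))
model = fit_cca(articles, webpages, n_dims=2)
a_proj, w_proj = transform(model, articles, webpages)
print(model["correlations"])
```

After the transform, articles and webpages live in the same low-dimensional space, so the cosine similarity ranking can be applied to the projected vectors unchanged.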
2.4 Experiments and outcome measures
While standard document similarity methods typically do not need to be constructed on one set of data and tested on another, the CCA approach learns an alignment between articles and webpages based on a set of training data, and its ability to generalise to unseen data is best tested on a separate dataset. To examine the effect of adding CCA to the pipeline, we constructed training and testing sets by randomly assigning each PMID-URL pair. The resulting training dataset comprised 70% or 2,501 of the known links, with the remaining 30% of PMID-URL pairs allocated to the testing set. To replicate the work of searching a large corpus or database for relevant scientific publications, we also added the 203,965 eligible articles not already captured in either the training or testing datasets, resulting in a testing set of 1,072 linked articles and webpages plus the set of 203,965 articles with no linked webpages.
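The construction of the training and testing sets can be sketched as below. The counts are taken from the text; the placeholder identifiers, the fixed seed, and rounding the 70% share to the nearest integer are assumptions.

```python
import random


def split_known_links(pairs, unlinked_pmids, train_fraction=0.7, seed=0):
    """Randomly split PMID-URL pairs and build the candidate pool used for testing."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    train, test = shuffled[:n_train], shuffled[n_train:]
    # Test-time candidates: articles from the test pairs plus all unlinked articles.
    candidates = [pmid for pmid, _ in test] + list(unlinked_pmids)
    return train, test, candidates


pairs = [(f"pmid{i}", f"url{i}") for i in range(3573)]      # placeholder known links
unlinked = [f"unlinked{i}" for i in range(203965)]          # placeholder unlinked articles
train, test, candidates = split_known_links(pairs, unlinked)
print(len(train), len(test), len(candidates))
```

With these sizes the split reproduces the figures reported above: 2,501 training pairs, 1,072 testing pairs, and 205,037 candidate articles per test webpage.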
The set of experiments was split into two phases. In the first phase, we examined how differences in the vector space representations might affect the performance of the ranking methods, comparing the binary, TF, and TF-IDF representations in combination with either threshold or T-SVD feature reduction. In the second, we tested the effect of transforming the best performing feature representation using CCA.
The success of each of these systems in correctly linking research articles to the webpages that reference them is indicated by the final rank of the correct PubMed article for each of the 1,072 webpages tested. Based on the similarity between each webpage and source article we calculated the number of PubMed articles a user would be required to read to locate the known links for at least half of all webpages, equivalent to the median rank of the correct source article. As a second metric we determined the number of webpages for which the correct PubMed article was ranked first out of all possible 205,037 articles in the testing set, or the proportion of known links correctly identified by each system (i.e. recall@1). We also calculated the proportion of links ranked within the top 50 PubMed articles in the testing set as an indicator of the capacity of each system to return the correct PubMed article within the first page of query results (i.e. recall@50). Finally, we plotted recall@k for all values between 1 and the total number of PubMed articles to visualise the proportion of known links which can be identified after having read the top k ranked source articles.
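Given the rank of the correct source article for each test webpage, the outcome measures reduce to a few lines. The sketch below uses hypothetical function names and made-up ranks purely to show the calculation.

```python
from bisect import bisect_right
from statistics import median


def recall_at_k(ranks, k):
    """Proportion of webpages whose correct article is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def summarise(ranks):
    """Median rank, recall@1 and recall@50 for a list of 1-based ranks."""
    return {
        "median_rank": median(ranks),
        "recall@1": recall_at_k(ranks, 1),
        "recall@50": recall_at_k(ranks, 50),
    }


def recall_curve(ranks, ks):
    """Recall@k for many values of k, using one sort and binary search."""
    ordered = sorted(ranks)
    return [bisect_right(ordered, k) / len(ordered) for k in ks]


ranks = [1, 1, 3, 40, 51, 900, 12000, 2]  # illustrative values only
summary = summarise(ranks)
curve = recall_curve(ranks, [1, 50, 12000])
print(summary, curve)
```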
All methods and experiments were developed using Python 3.6, the code for which is available on GitHub (https://github.com/evidence-surveillance/web2pubmed).
3 Results
Among the 207,538 articles that were returned by the search and met the inclusion criteria for the analysis, 4,333 had one or more links to webpages recorded by Altmetric and were also eligible for inclusion in study analyses. The most popular article was used as source information on 98 webpages, while 22% (2,535 of 11,319 known links) were used as source information on one webpage (Figure 3). To construct a representative dataset in which no article or webpage was represented more than once, we selected a final set of 3,573 PMID-URL pairs.
Within this final set of 3,573 articles and webpages with known PMID-URL links and 203,965 additional articles with no known links, we identified 39,948 terms used at least once in both the set of webpages and the set of articles. Where we applied threshold parameters (limiting the vocabulary to exclude terms used in more than 85% of corpus documents), this vocabulary was reduced to 39,942 terms, representing the greatest number of features used in the following analyses. For experiments instead using the T-SVD method of feature reduction, the number of features (components) retained in the dataset varied between 100 and 1,600.
Of the methods of representing the text of articles and webpages, we observed that TF-IDF consistently produced the highest performance (Table 1). Regardless of the feature reduction approach used, experiments using the TF-IDF representation of document text outperformed the binary and TF representations.
Of the two feature reduction methods, the threshold approach outperformed the T-SVD approach for all outcome measures (Table 1). However, because the performance improved roughly linearly as the number of T-SVD components was increased, the results suggest that the number of features used may be a more important factor than the choice of feature reduction method. Overall, the highest performance was achieved using TF-IDF to represent the text as term features and the threshold to reduce the number of features. In the testing dataset, the method ranked the correct source article first for more than one in four webpages and placed the correct source article in the top 50 ranked candidate articles for more than half of the webpages.
The addition of CCA was expected to improve the performance of the method by finding an alignment between the terms used in the webpages and articles rather than exact matches between terms. We found that adding CCA to the process improved the performance for experiments where the number of T-SVD components was relatively low (Table 2). However, as we increased the number of T-SVD components above 400, the improvements gained from adding CCA started to diminish. The best performance with CCA was achieved for the experiment that used 400 T-SVD components transformed into 200 feature dimensions by the trained CCA model, where the correct source article was placed within the top 50 ranked candidates for 38.0% of the webpages (Figure 4). As the number of feature dimensions was increased further, the approach failed because the CCA did not converge, owing to the sparsity of the feature space. Overall, the results show that we were able to identify a maximum performance within the parameter space for which the CCA approach could be used, but that no CCA-based configuration outperformed the simpler approach that used thresholds rather than T-SVD and did not use CCA (Figure 5).
4 Discussion
In this study we evaluated methods that could be used as part of tools to support the identification of missing links between online research communications and the source literature they use. We used vaccination research as an example application domain where there are common problems with bias and misrepresentation in subsequent news and media coverage. We started with the assumptions that many webpages are not reliably connected to the research on which they are based, and that readers may not have the time or expertise to construct a search query to identify relevant articles in bibliographic databases. We tested methods that seek to circumvent the need for expert construction of search queries and instead automatically recommend articles that are likely to be relevant. While the use of a CCA-based approach did not outperform our baseline methods, the results suggest that such tools are likely feasible.
4.1 Methods for automatic recommendations from text
We tested two standard information retrieval methods and found that the simpler approach using a TF-IDF representation and a maximum document frequency limit outperformed a more sophisticated approach of transforming the feature space using CCA. While we know of no previous studies that have developed tools for the same purpose, the structure of the problem is common. The combination of TF-IDF and cosine distance has previously been used to identify missing links between trial registrations on ClinicalTrials.gov and articles in PubMed reporting trial results [7]. Similarly, the use of TF-IDF has been shown to facilitate the detection of similarities between patent documents and scientific publications [27]. These results were consistent with ours: increasing the number of SVD components improved the accuracy, but the best performance was achieved without the use of SVD.
There are a range of other more complex approaches that could be applied to a problem of this structure: the identification of missing links between two distinct sets of documents that may be matched using similarity of content and a relatively sparse bipartite graph connecting the two sets of documents. These might include alternative feature representations like pre-trained language models, word embedding, or both [1, 16, 29, 31]; as well as other algorithms for recommendation or ranking related to collaborative filtering [17, 22], and learning-to-rank methods [19, 26].
An expert might take an alternative approach to manually identifying source articles for online research communications, making use of specific information including the names of authors, institutions, or journals. Rule-based approaches that make use of this information may yield improvements. Other similar approaches might make use of the date of publication extracted from webpages and articles in bibliographic databases, under the assumption that online communications of research tend to be reported soon after the research is published.
4.2 Implications and future applications
The results indicate that it is likely feasible to build a tool that could be used to help find missing links between health research communications and source literature for the purpose of checking the veracity of the communications and identifying biases. One way to operationalise this type of tool would be to develop browser plugins that automatically augment webpages with a list of recommended relevant peer-reviewed research. Hyperlinks might be added to the terms or phrases that most contribute to the recommendation based on the weights of the terms that contribute to the similarity.
A further application relates to the automatic detection of distortion or bias in research communications. Checklist tools such as QIMR [42] or DISCERN [4, 5] are designed to be used to manually evaluate the credibility of health information and health research communications, but little work has been done to use these checklists as the basis for automatically estimating the credibility of webpages [36]. We know of no studies that have attempted to automatically compare the text of research communications with the abstract or full text of research articles to detect specific differences that might be indicative of misrepresentation or distortion of research conclusions. For example, tools able to identify scenarios where studies of association are written as causation in communications would be of clear benefit, particularly when discussing vaccination [21, 30].
Tools extending the work we present here could also be used to help educate non-experts on when it is appropriate to search for source articles when reading research communications online, and to train them on how to construct useful search queries. First, the distances to the top-ranked articles might be suggestive of whether the text on a webpage is based on any form of peer-reviewed research. This could be used to indicate a common practice in anti-vaccine blogs where writers provide circular links within a network of other blogs that are all equally disconnected from clinical evidence. Second, the tool could be used to show users a search query that is automatically generated from the text of research communications for use with bibliographic databases like PubMed, educating users on how to search bibliographic databases for clinical evidence.
4.3 Limitations
This study had several limitations. First, while the use of Altmetric helped us to quickly construct a large dataset of reported links, the dataset might be a biased sample of research communications. Communications that include hyperlinks to journal webpages, PubMed, or link to articles using their DOIs may be of higher quality or may be targeted at specialised audiences. Other research communications not using hyperlinks were not included in the dataset, and these may be different to those tracked by Altmetric. Testing the approaches on a more general set of examples before deployment would be necessary. Second, there are a wide range of alternative approaches to feature representation and recommender systems. While we discuss the potential advantages of some of these approaches above, we are at present only able to speculate on which of them are likely to perform best as part of a tool or service aimed at improving the detection of distortion in research communications online. Finally, while vaccination is an important application domain, we did not test what might happen if we had selected a much broader sample of webpages and articles, or if we constructed models specifically designed to find missing links for individual fields or topics of research. It is possible that more general or more specific datasets may influence the performance of the methods we tested.
It is also worth noting that for this dataset, after excluding terms not present in both the PubMed and webpage corpora, very few of the remaining terms were also common to more than 85% of PubMed articles and webpages. As such, the threshold had a minimal impact on the dimensionality of the dataset used for subsequent analyses.
5 Conclusion
The results indicate the feasibility of tools designed to support the identification of missing links between health research communications and the scientific literature on which they are based. Such tools have the potential to help people better discern the veracity and quality of what they read online. While standard feature representation and document similarity methods were moderately successful in this task, further investigation is warranted.
References
- [1] A. L. Beam, B. Kompa, I. Fried, N. P. Palmer, X. Shi, T. Cai, and I. S. Kohane. Clinical Concept Embeddings Learned from Massive Sources of Medical Data. CoRR, abs/1804.0, 2018.
- [2] S. J. Bean. Emerging and continuing trends in vaccine opposition website content. Vaccine, 29(10):1874–1880, 2011. doi: 10.1016/j.vaccine.2011.01.003.
- [3] S. Castell, A. Charlton, M. Clemence, N. Pettigrew, S. Pope, A. Quigley, J. Navin Shah, and T. Silman. Public Attitudes to Science 2014. Technical report, Ipsos MORI Social Research Institute, 2014.
- [4] D. Charnock and S. Shepperd. Learning to DISCERN online: applying an appraisal tool to health websites in a workshop setting. Health Education Research, 19(4):440–446, 2004. doi: 10.1093/her/cyg046.
- [5] D. Charnock, S. Shepperd, G. Needham, and R. Gann. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. Journal of Epidemiology and Community Health, 53(2):105–11, 1999.
- [6] S. C. Cooper Robbins, C. Pang, and J. Leask. Australian Newspaper Coverage of Human Papillomavirus Vaccination, October 2006–December 2009. Journal of Health Communication, 17(2):149–159, 2012. doi: 10.1080/10810730.2011.585700.
- [7] A. G. Dunn, E. Coiera, and F. T. Bourgeois. Unreported links between trial registrations and published articles were identified using document similarity measures in a cross-sectional analysis of ClinicalTrials.gov. Journal of Clinical Epidemiology, 95(Mar):94–101, 2018. doi: 10.1016/j.jclinepi.2017.12.007.
- [8] G. Eysenbach. How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews. BMJ, 324(7337):573–577, 2002. doi: 10.1136/bmj.324.7337.573.
