Utilizing Social Media Data for Pharmacovigilance: A Review

Abeed Sarker (1), Rachell Ginn (2), Azadeh Nikfarjam (2), Karen O’Connor (1), Karen Smith (3), Swetha Jayaraman (4), Tejaswi Upadhaya (4), Graciela Gonzalez (1)

(1) Department of Biostatistics, Epidemiology and Informatics
Perelman School of Medicine
University of Pennsylvania
Philadelphia, PA, USA

(2) Department of Biomedical Informatics
Arizona State University
Scottsdale, AZ, USA

(3) Department of Biomedical Informatics
Arizona State University
Tempe, AZ, USA

(4) Rueckert-Hartman College for Health Professions
Regis University
Denver, CO, USA



Automatic monitoring of Adverse Drug Reactions (ADRs), defined as adverse patient outcomes caused by medications, is a challenging research problem that is currently receiving significant attention from the medical informatics community. In recent years, user-posted data on social media, primarily due to its sheer volume, has become a useful resource for ADR monitoring. Research on social media data has progressed with a variety of data sources and techniques, making it difficult to compare distinct systems and their performance. In this paper, we perform a methodical review to characterize the different approaches to ADR detection/extraction from social media, and their applicability to pharmacovigilance. In addition, we present a potential systematic pathway to ADR monitoring from social media.


We identified studies describing approaches for ADR detection from social media from the Medline, Embase, Scopus and Web of Science databases, and the Google Scholar search engine. Studies that met our inclusion criteria were those that attempted to extract ADR information posted by users on any publicly available social media platform. We categorized the studies according to different characteristics such as primary ADR detection approach, size of corpus, data source(s), availability, and evaluation criteria.


Twenty-two studies met our inclusion criteria, with fifteen (68%) published within the last two years. However, publicly available annotated data is still scarce: we found only six studies that made their annotations publicly available, which makes system performance comparisons difficult. In terms of algorithms, supervised classification techniques for detecting posts that contain ADR mentions and lexicon-based approaches for extracting ADR mentions from text have been the most popular.


Our review suggests that interest in the utilization of the vast amounts of available social media data for ADR monitoring is increasing. In terms of sources, both health-related and general social media data have been used for ADR detection; while health-related sources tend to contain higher proportions of relevant data, the volume of data from general social media websites is significantly higher. There is still a very limited amount of annotated data publicly available, and, as indicated by the promising results obtained by recent supervised learning approaches, there is a strong need to make such data available to the research community.

Quick Downloads

ADR binary classification data [1] (10,822 instances)
Georgetown ADR data set [2] (247 instances)
Spanish ADR corpus [3] (400 instances)
ADR full annotations [4] (1,784 instances)

Lexicons and knowledge bases:

HLP Lab ADR Lexicon


[1] Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J Biomed Inform 2015;53:196–207.
[2] Yates A, Goharian N. ADRTrace: detecting expected and unexpected adverse drug reactions from user reviews on social media sites. In: Proceedings of the 35th European conference on advances in information retrieval; 2013. p. 816–9.
[3] Segura-Bedmar I, Revert R, Martinez P. Detecting drugs and adverse events from Spanish health social media streams. In: Proceedings of the 5th international workshop on health text mining and information analysis (LOUHI); 2014. p. 106–15.
[4] Nikfarjam A, Sarker A, O’Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc 2015;22(2).