Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task 2019

(For the Shared Task, please follow this link; for SMM4H’18, please follow this link.)

*** New: Proceedings ***


The total number of users of social media continues to grow worldwide, resulting in the generation of vast amounts of data. Popular social networking sites such as Facebook, Twitter and Instagram dominate this sphere. According to estimates, 500 million tweets and 4.3 billion Facebook messages are posted every day [1]. According to the latest Pew Research Report [2], nearly half of adults worldwide and two-thirds of all American adults (65%) use social networking. The report states that of the total users, 26% have discussed health information, and, of those, 30% changed behavior based on this information and 42% discussed current medical conditions. Advances in automated data processing, machine learning and NLP present the possibility of utilizing this massive data source for biomedical and public health applications, if researchers address the methodological challenges unique to this medium.

This workshop aims to provide a forum for ACL community members to present and discuss NLP advances specific to social media use in the particularly challenging area of health-related research and applications. It follows on the success of a session and accompanying workshop on the topic hosted at the Pacific Symposium on Biocomputing (PSB) in 2016, at the AMIA Annual Conference in 2017, and at EMNLP in 2018. The workshop attracts researchers interested in automatic methods for the collection, extraction, representation, analysis, and validation of social media data for health informatics. It serves as a unique forum to discuss novel approaches to text and data mining methods that are applicable to social media data and may prove invaluable for health monitoring and surveillance.

Although social media text mining research for health applications is still very much in its infancy, the domain has seen a surge in interest in recent years. Numerous studies have been published of late in this realm, including studies on pharmacovigilance [3], identifying user behavioral patterns [4], identifying user social circles with common experiences (like drug abuse) [5], monitoring malpractice [6], and tracking infectious/viral disease spread [7,8]. Population/public health topics are most commonly addressed, although different social networks may be suitable for specific targeted tasks. For example, while Twitter data has been utilized for surveillance and content analysis, a significant portion of research using Facebook has focused on communication rather than lexical content processing [9,10]. For health monitoring and surveillance research from social media, the most common topic has been influenza surveillance [11,12]. From the perspective of informatics and NLP, proposed techniques have typically been in the areas of data collection (e.g., keywords and queries) [13,14], text classification [15,16] and information extraction [17].

While innovative approaches have been proposed, there is still a lot of progress to be made in this domain. Effective utilization of the health-related knowledge contained in social media will require a joint effort by the research community, bringing together researchers from distinct fields including NLP, machine learning, data science, biomedical informatics, medicine, pharmacology and public health. The knowledge gaps among researchers in these communities need to be reduced by community sharing of data and the development of novel applied systems. Our workshop attempts to achieve precisely these goals and to appeal to researchers who apply NLP methods in those domain areas and who might not typically consider ACL/EMNLP/NAACL. Topics of interest include, but are not limited to:

  • Methods for the automatic detection and extraction of health-related concept mentions in social media
  • Mapping of health-related mentions in social media to standardized vocabularies
  • Deriving health-related trends from social media
  • Information retrieval methods for obtaining relevant social media data
  • Geographic or demographic data inference from social media discourse
  • Virus spread monitoring using social media
  • Mining health-related discussions in social media
  • Drug abuse and alcoholism incidence monitoring through social media
  • Disease incidence studies using social media
  • Sentinel event detection using social media
  • Semantic methods in social media analysis
  • Classifying health-related messages in social media
  • Automatic analysis of social media messages for disease surveillance and patient education
  • Methods for validation of social media-derived hypotheses and datasets

Important Dates (tentative)

  • Paper submission deadline: April 29, 2019 (originally April 26)
  • Notification of acceptance: May 27, 2019 (originally May 24)
  • Camera-ready version due: June 3, 2019
  • Workshop date: August 2, 2019

Submission Guidelines

All workshop papers must be original and not simultaneously submitted to another journal or conference. All papers focusing on natural language processing of social media texts for health-related tasks are welcome.

  • Full Papers must be 4-8 pages long, with unlimited references.
  • Extended Abstracts must be 2-4 pages long, with unlimited references.
  • System Descriptions (shared task participants only) may be at most 2 pages long, with unlimited references. Accepted system descriptions will also be included in the workshop proceedings.

Please follow the standard submission guidelines of ACL 2019, available here: ACL call
All submissions must be through Softconf. Submission link: Softconf SMM4H

Organizing Committee

Graciela Gonzalez-Hernandez, Ph.D., is an Associate Professor at the Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, and serves as PI of an NLM-funded R01 project for assessing the value of social media postings for pharmacovigilance. Her research focuses on knowledge discovery, specifically through NLP and knowledge integration. Her lab has produced top-performing systems such as BANNER, for gene and disease entity recognition, and ADRMine, for automated extraction of adverse drug reactions from social media data. She served as a member of the National Library of Medicine Standing Review Committee (BLIRC) from 2009 through 2013.

Davy Weissenbacher, Ph.D., is a Research Associate at the Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania. His research interests include knowledge representation, logic, machine learning and computational linguistics with a focus on biomedical and clinical text processing. While participating in research projects led by renowned universities and research institutes, including the University of Manchester, the French Institute for Research in Computer Science and Automation, and Arizona State University, he applied his expertise in Information Extraction to help solve problems in domains as varied as Education Sciences, Public Health and Genomics. He co-organized and coordinated SMM4H at EMNLP 2018.

Michael Paul, Ph.D., is an Assistant Professor in the Department of Information Science at the University of Colorado-Boulder, with affiliate positions in the Department of Computer Science and the Institute of Cognitive Science. His research interests include machine learning and natural language processing with applications to social media analysis and public health monitoring. He has contributed to a state-of-the-art influenza surveillance and forecasting system that uses social media messages, with data available at the website. Recently, he was a co-organizer of the Social Media Mining for Public Health Monitoring and Surveillance Session at PSB 2016 and the 1st Joint Workshop on Health Intelligence (W3PHIAI) at AAAI 2017, and was co-organizer of an NSF-funded workshop on ontology induction for behavioral science.

Abeed Sarker, Ph.D., is a Research Associate at the Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania. His research experience and interests are in the areas of NLP, machine learning, data science, and social media mining, and particularly their intersection with healthcare and public health monitoring and surveillance. He has been involved with a variety of renowned organizations including CSIRO-Australia, Canon Information Systems Research Australia (CISRA), and the Centre for Language Technology, Macquarie University. He co-organized the ALTA 2011, PSB 2016, AMIA 2017, and EMNLP 2018 SMM4H shared tasks.

Ari Z. Klein, Ph.D., is a Postdoctoral Researcher at the Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania. He received a PhD in Linguistics from Carnegie Mellon University. Broadly, Ari is interested in NLP for biomedical and clinical applications, and his recent work has involved curating a variety of annotated datasets for supervised classification.

Arjun Magge, MS, is a PhD student and a graduate research assistant in the Biodesign Center for Environmental Health Engineering and the School of Health Solutions at Arizona State University. He is also a member of the Health Language Processing Lab at the University of Pennsylvania. His primary interests are in the design and implementation of NLP pipelines for information extraction. He has developed and implemented NLP methods on a variety of biomedical texts such as scientific literature, clinical notes, drug labels and health-related social media texts for applications in public health and clinical informatics.

Ashlynn R. Daughton, MPH, is a PhD student in the Department of Information Science at the University of Colorado Boulder and a research assistant in the Division of Analytics, Intelligence and Technology at Los Alamos National Laboratory. Ashlynn received a Master's in Public Health from Boston University in 2015 and a Bachelor of Arts in Molecular and Cell Biology from the University of California, Berkeley in 2012. Her research involves modeling and forecasting infectious diseases using computational models and digital data sources including social media, search queries, and Wikipedia.

Karen O’Connor, MS, is a Staff Scientist in the Health Language Processing Lab at the Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania. Her research experience and interests are in the area of social media mining for public health. She has worked as a senior annotator on various projects in this area, including the detection and extraction of adverse drug reactions in social media and health forum posts, and the detection of pregnancy outcomes from users’ timelines in social media. Her expertise is in the development of annotation schemas for corpus development.

Program Committee

Nigel Collier, University of Cambridge, UK
Larry Hunter, University of Colorado, USA
Hongfang Liu, Mayo Clinic Rochester, USA
Pierre Zweigenbaum, French National Center for Scientific Research, France
Cecile Paris, CSIRO, Australia
Kirk Roberts, University of Texas Houston, USA
Robert Leaman, US National Library of Medicine, USA
Azadeh Nikfarjam, Nuance Communications, USA
Ehsan Emadzadeh, Google Inc., USA
Yutaka Sasaki, Toyota Technological Institute, Japan
Anne-Lyse Minard, Université d’Orléans, France
Thierry Poibeau, French National Center for Scientific Research, France
Kevin Cohen, University of Colorado, USA
Nicolas Turenne, French National Institute for Agricultural Research, France


  1. Team Gwava. How Much Data is Created on the Internet each Day? 2016. [Online]. Available: [Accessed: 03-Jan-2017].
  2. Pew Research Center. Social Media Fact Sheet. 2017. [Online]. Available: [Accessed: 03-Mar-2017].
  3. Leaman R, Wojtulewicz L, Sullivan R, Skariah A, Yang J, Gonzalez G. Towards Internet-Age Pharmacovigilance: Extracting Adverse Drug Reactions from User Posts to Health-Related Social Networks. In Proc. BioNLP. 2010:117–125.
  4. Struik LL, Baskerville NB. The Role of Facebook in Crush the Crave, a Mobile- and Social Media-Based Smoking Cessation Intervention: Qualitative Framework Analysis of Posts. J Med Internet Res. 2014;16(7):e170.
  5. Hanson CL, Cannon B, Burton S, Giraud-Carrier C. An exploration of social circles and prescription drug abuse through Twitter. J Med Internet Res. 2013;15(9):e189.
  6. Nakhasi A, Passarella RJ, Bell SG, Paul MJ, Dredze M, Pronovost PJ. Malpractice and Malcontent: Analysing Medical Complaints in Twitter. AAAI Technical Report. Information Retrieval and Knowledge Discovery in Biomedical Text. Johns Hopkins University. 2012.
  7. Broniatowski DA, Paul MJ, Dredze M. National and local influenza surveillance through Twitter: Analysis of the 2012-2013 influenza epidemic. PLoS One. 2013;8(12):e83672.
  8. Paul MJ, Dredze M. You are what you Tweet: Analyzing Twitter for public health. In Proc. AAAI-ICWSM. 2011:265-272.
  9. Kite J, Foley BC, Grunseit AC, Freeman B. Please Like Me: Facebook and Public Health Communication. PLoS One. 2016;11(9):e0162765.
  10. Platt T, Platt J, Thiel DB, Kardia SLR. Facebook Advertising Across an Engagement Spectrum: A Case Example for Public Health Communication. JMIR Public Health Surveill. 2016;2(1):e27.
  11. Broniatowski DA, Dredze M, Paul MJ, Dugas A. Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital: A Retrospective Observational Study. JMIR Public Health Surveill. 2015;1(1):e5.
  12. Sharpe JD, Hopkins RS, Cook RL, Striley CW. Evaluating Google, Twitter, and Wikipedia as Tools for Influenza Surveillance Using Bayesian Change Point Analysis: A Comparative Analysis. JMIR Public Health Surveill. 2016;2(2):e161.
  13. Pimpalkhute P, Patki A, Nikfarjam A, Gonzalez G. Phonetic spelling filter for keyword selection in drug mention mining from social media. AMIA Summits Transl Sci Proc. 2014;2014:90-5.
  14. Bernardo TM, Rajic A, Young I, Robiadek K, Pham MT, Funk JA. Scoping review on search queries and social media for disease surveillance: a chronology of innovation. J Med Internet Res. 2013;15(7):e147.
  15. Aphinyanaphongs Y, Lulejian A, Brown DP, Bonneau P, Krebs P. Text classification for automatic detection of e-cigarette use and use for smoking cessation from Twitter: A feasibility pilot. Pac Symp Biocomput. 2016;21:480–91.
  16. Aramaki E, Maskawa S, Morita M. Twitter catches the flu: detecting influenza epidemics using Twitter. In Proc. EMNLP. 2011:1568–1576.
  17. Nikfarjam A, Sarker A, O’Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc. 2015;22(3):671–681.