Social Media Mining for Health Applications (#SMM4H) Shared Task 2020

Call For Participation – Shared Task

The Social Media Mining for Health Applications (#SMM4H) Shared Task involves natural language processing (NLP) challenges of using social media data for health research, including informal, colloquial expressions and misspellings of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts. For each of the five tasks below, participating teams will be provided with a set of annotated tweets for developing systems, followed by a three-day window during which they will run their systems on unlabeled test data and upload the predictions of their systems to CodaLab. Information about registration, data access, paper submissions, and presentations can be found below.

Task 1: Automatic classification of tweets that mention medications

This binary classification task involves distinguishing tweets that mention a medication or dietary supplement (annotated as “1”) from those that do not (annotated as “0”). For this task, we follow the FDA’s definitions of a drug product and of a dietary supplement. These definitions, along with concrete examples, can be found in the guidelines we followed when annotating the data for Task 1. In 2018, this task was organized using a data set that contained an artificially balanced distribution of the two classes. This year, the data set [1] represents the natural, highly imbalanced distribution of the two classes among tweets posted by 112 women during pregnancy [2], with only approximately 0.2% of the tweets mentioning a medication. Training and evaluating classifiers on this year’s data set will more closely model the detection of tweets that mention medications in practice.

  • Training data: 69,272 tweets (181 “positive” tweets; 69,091 “negative” tweets)
  • Test data: 29,687 tweets
  • Evaluation metric: F1-score for the “positive” class (i.e., tweets that mention medications)
  • Contact information: Davy Weissenbacher
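
The evaluation metric above can be illustrated with a minimal sketch. It is not the official CodaLab scorer. The function name `f1_for_class` and the toy labels are assumptions made for illustration, and only the class counts come from the training data description.

```python
def f1_for_class(gold, pred, positive=1):
    """F1-score for one class, computed from paired gold and predicted labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        # No true positives: F1 is defined as 0 here to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (not shared task data).
toy_gold = [0, 0, 1, 0, 1, 0, 0, 1]
toy_pred = [0, 1, 1, 0, 0, 0, 0, 1]
toy_f1 = f1_for_class(toy_gold, toy_pred)

# Class counts from the Task 1 training data: 181 positive, 69,091 negative.
train_gold = [1] * 181 + [0] * 69091
all_negative = [0] * len(train_gold)  # a baseline that never predicts "1"
baseline_accuracy = sum(
    g == p for g, p in zip(train_gold, all_negative)
) / len(train_gold)
baseline_f1 = f1_for_class(train_gold, all_negative)

print(f"toy F1: {toy_f1:.3f}")
print(f"all-negative baseline: accuracy={baseline_accuracy:.4f}, F1={baseline_f1:.1f}")
```

On a class distribution this skewed, a system that always predicts “0” is correct on more than 99% of tweets yet scores an F1 of zero, which is why accuracy is not used as the metric. Passing a different value for `positive` gives the F1-score for any other single class.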

Task 2: Automatic classification of multilingual tweets that report adverse effects

This binary classification task involves distinguishing tweets that report an adverse effect (AE) of a medication (annotated as “1”) from those that do not (annotated as “0”), taking into account subtle linguistic variations between AEs and indications (i.e., the reason for using the medication). This classification task has been organized for every past #SMM4H Shared Task, but only for tweets posted in English. This year, this task also includes distinct sets of tweets posted in French and Russian.

  • English
    • Training data: 25,672 tweets (2,374 “positive” tweets; 23,298 “negative” tweets)
    • Test data: ~5,000 tweets
    • Contact information: Arjun Magge
  • French
    • Training data: 2,426 tweets (39 “positive” tweets; 2,387 “negative” tweets)
    • Test data: 607 tweets
    • Contact information: Anne-Lyse Minard
  • Russian
    • Training data: 7,612 tweets (666 “positive” tweets; 6,946 “negative” tweets)
    • Test data: 1,903 tweets
    • Contact information: Elena Tutubalina
    • We thank Yandex.Toloka for supporting the shared task and providing credits for data annotation in Russian.
  • Evaluation metric: F1-score for the “positive” class (i.e., tweets that report AEs)

Task 3: Automatic extraction and normalization of adverse effects in English tweets

This task, organized for the first time in 2019, is an end-to-end task that involves extracting the span of text containing an adverse effect (AE) of a medication from tweets that report an AE, and then mapping the extracted AE to a standard concept ID in the MedDRA vocabulary (preferred terms). The training data includes tweets that report an AE (annotated as “1”) and those that do not (annotated as “0”). For each tweet that reports an AE, the training data contains the span of text containing the AE, the character offsets of that span of text, and the MedDRA ID of the AE. For some of the tweets that do not report an AE, the training data contains the span of text containing an indication (i.e., the reason for using the medication) and the character offsets of that span of text, allowing participants to develop techniques for disambiguating AEs and indications.

  • Training data: 2,376 tweets (1,212 “positive” tweets; 1,155 “negative” tweets)
  • Test data: ~1,000 tweets
  • Evaluation metric: F1-score for the “positive” class (i.e., the correct AE spans and MedDRA IDs for tweets that report AEs)
  • Contact information: Arjun Magge
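
As a rough sketch of how end-to-end predictions for this task might be scored, the example below counts a prediction as correct only when the tweet ID, both character offsets, and the concept ID all match a gold annotation exactly. Exact matching is an assumption, since this call does not say whether overlapping spans also count. The function name `extraction_f1`, the tuple layout, and the placeholder IDs are hypothetical; the IDs are not real MedDRA codes.

```python
def extraction_f1(gold, pred):
    """F1 over (tweet_id, start, end, concept_id) tuples, using exact matching."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # predictions that match a gold annotation in every field
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (tweet ID, start offset, end offset, concept ID).
gold = [
    ("t1", 10, 18, "ID_A"),
    ("t2", 0, 9, "ID_B"),
    ("t3", 5, 12, "ID_C"),
]
pred = [
    ("t1", 10, 18, "ID_A"),  # correct span and correct concept
    ("t2", 0, 9, "ID_X"),    # correct span, wrong concept: counts as an error
    ("t4", 3, 7, "ID_D"),    # AE predicted for a tweet with no gold AE
]

score = extraction_f1(gold, pred)
print(f"end-to-end F1: {score:.3f}")
```

Under this strict scheme, a correctly extracted span mapped to the wrong concept lowers both precision and recall, so extraction and normalization errors compound.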

Task 4: Automatic characterization of chatter related to prescription medication abuse in tweets

This new, multi-class classification task involves tweets that mention at least one prescription opioid, benzodiazepine, atypical antipsychotic, central nervous system stimulant, or GABA analogue. Systems must distinguish tweets that report potential abuse/misuse (annotated as “A”) from those that report non-abuse/-misuse consumption (annotated as “C”), merely mention the medication (annotated as “M”), or are unrelated (annotated as “U”) [3].

  • Training data: 13,172 tweets
  • Test data: 3,271 tweets
  • Evaluation metric: F1-score for the “potential abuse/misuse” class
  • Contact information: Abeed Sarker

Task 5: Automatic classification of tweets reporting a birth defect pregnancy outcome

This new, multi-class classification task involves distinguishing three classes of tweets that mention birth defects: “defect” tweets refer to the user’s child and indicate that he/she has the birth defect mentioned in the tweet (annotated as “1”); “possible defect” tweets are ambiguous about whether someone is the user’s child and/or has the birth defect mentioned in the tweet (annotated as “2”); “non-defect” tweets merely mention birth defects (annotated as “3”) [4,5].

  • Training data: 18,397 tweets (953 “defect” tweets; 956 “possible defect” tweets; 16,488 “non-defect” tweets)
  • Test data: 4,602 tweets
  • Evaluation metric: micro-averaged F1-score for the “defect” and “possible defect” classes
  • Contact information: Ari Klein
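
The micro-averaged metric above pools true positives, false positives, and false negatives across the “defect” (1) and “possible defect” (2) classes before computing precision and recall. The sketch below shows that pooling under the standard definition of micro-averaging. It is not the official scorer, and the function name `micro_f1` and the toy labels are assumptions made for illustration.

```python
def micro_f1(gold, pred, classes=(1, 2)):
    """Micro-averaged F1 over the given classes; other classes are not scored."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for c in classes:
        for g, p in zip(gold, pred):
            if p == c and g == c:
                tp += 1
            elif p == c:
                fp += 1  # predicted class c, but the gold label differs
            elif g == c:
                fn += 1  # gold label is c, but another class was predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (not shared task data).
gold = [1, 2, 3, 3, 1, 2, 3, 3]
pred = [1, 3, 3, 2, 2, 2, 3, 3]

score = micro_f1(gold, pred)
print(f"micro-averaged F1 over classes 1 and 2: {score:.3f}")
```

Correctly labeled “non-defect” tweets add nothing to the score. A “defect” tweet predicted as “possible defect” is counted twice, as a false negative for class 1 and as a false positive for class 2.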

Important Dates

Training data available: January 15, 2020 (may be sooner for some tasks)
Test data available: June 1, 2020
System predictions for test data due: June 4, 2020 (23:59 CodaLab server time)
System description paper submission deadline: July 1, 2020
Notification of acceptance of system description papers: September 1, 2020
Camera-ready papers due: October 1, 2020
Workshop: December 12 or 13, 2020 (TBA)

* All deadlines, except for system predictions (see above), are 23:59 UTC-12 (“anywhere on Earth”).

Registration and Data Access Information

To participate in a task, register (for free) on CodaLab (click “Participate”) and then contact Ivan Flores, providing (1) your name, (2) your institutional affiliation, (3) your email address, (4) the names and email addresses of your team members, (5) your CodaLab team name, and (6) the task(s) in which you would like to participate. Student registrants are required to provide the name and email address of a faculty team member who has agreed to serve as their advisor/mentor for developing their system and writing their system description (see below). By registering for a task, participants agree to run their system on the test data and upload at least one set of predictions to CodaLab. Teams may upload up to three sets of predictions per task. By receiving access to the annotated tweets, participants agree to Twitter’s Terms of Service and may not redistribute any portion of the data.

Paper Submission and Presentation Information

Participating teams are required to submit a paper describing the system(s) they ran on the test data. The system description may consist of up to two pages, plus unlimited references. Teams participating in more than one task are required to describe their system(s) for each task, and are permitted one additional page per additional task. Sample system descriptions can be found on pages 89-136 of the #SMM4H 2019 proceedings. Accepted system descriptions will be included in the #SMM4H 2020 proceedings. We encourage, but do not require, at least one author of each accepted system description to register for the #SMM4H 2020 Workshop, co-located with COLING 2020, and present their system as a poster. Select participants, as determined by the program committee, will be invited to extend their system description to up to four pages, plus unlimited references, and present their system orally. All paper submissions must follow the COLING guidelines and be submitted as a PDF using the Softconf START Conference Manager.


Organizers

Graciela Gonzalez-Hernandez, University of Pennsylvania, USA
Davy Weissenbacher, University of Pennsylvania, USA
Ari Z. Klein, University of Pennsylvania, USA
Karen O’Connor, University of Pennsylvania, USA
Abeed Sarker, Emory University, USA
Arjun Magge, Arizona State University, USA
Elena Tutubalina, Kazan Federal University, Russia
Zulfat Miftahutdinov, Kazan Federal University, Russia
Ilsear Alimova, Kazan Federal University, Russia
Martin Krallinger, Barcelona Supercomputing Center, Spain
Anne-Lyse Minard, Université d’Orléans, France

Program Committee

Olivier Bodenreider, US National Library of Medicine, USA
Kevin Cohen, University of Colorado School of Medicine, USA
Robert Leaman, US National Library of Medicine, USA
Diego Molla, Macquarie University, Australia
Zhiyong Lu, US National Library of Medicine, USA
Azadeh Nikfarjam, Apple, USA
Thierry Poibeau, French National Center for Scientific Research, France
Kirk Roberts, University of Texas Health Science Center at Houston, USA
Yutaka Sasaki, Toyota Technological Institute, Japan
H. Andrew Schwartz, Stony Brook University, USA
Nicolas Turenne, French National Institute for Agricultural Research, France
Karin Verspoor, University of Melbourne, Australia
Pierre Zweigenbaum, French National Center for Scientific Research, France


References

  1. Weissenbacher D, Sarker A, Klein A, O’Connor K, Magge A, Gonzalez-Hernandez G. Deep neural networks ensemble for detecting medication mentions in tweets. Journal of the American Medical Informatics Association 2019;ocz156.
  2. Sarker A, Chandrashekar P, Magge A, Cai H, Klein A, Gonzalez G. Discovering cohorts of pregnant women from social media for safety surveillance and analysis. Journal of Medical Internet Research 2017;19(10):e361.
  3. O’Connor K, Sarker A, Perrone J, Gonzalez-Hernandez G. Monitoring prescription medication abuse from Twitter: an annotated corpus and annotation guidelines for reproducible machine learning research. Journal of Medical Internet Research Preprints 13/08/2019:15861.
  4. Klein AZ, Sarker A, Cai H, Weissenbacher D, Gonzalez-Hernandez G. Social media mining for birth defects research: a rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter. Journal of Biomedical Informatics 2018;87:68-78.
  5. Klein AZ, Sarker A, Weissenbacher D, Gonzalez-Hernandez G. Towards scaling Twitter for digital epidemiology of birth defects. npj Digital Medicine 2019;2:96.