Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task 2019


Shared Task

*** New: the system description submission deadline has been extended from April 26 to April 29, 2019 ***

The SMM4H shared tasks pose NLP challenges in social media mining for health monitoring and surveillance. They require processing imbalanced, noisy, real-world, and often highly creative language from social media. Systems must handle the many linguistic variations and semantic complexities in the ways people express medication-related concepts and outcomes. Past research has shown that automated systems frequently underperform on social media text because of novel or creative phrases, misspellings, and frequent idiomatic, ambiguous, and sarcastic expressions. The tasks will thus serve as a process of discovering and verifying which approaches work best for social media data.

Similar to the first three runs of the shared tasks, the data will include annotated collections of posts on Twitter. The training data is already prepared and will be available to the teams registering to participate. This year, we will standardize the competition platform, utilizing Codalab competitions.

Task 1: Automatic classification of adverse effect mentions in tweets

Systems designed for this sub-task should distinguish tweets reporting an adverse effect (AE) from those that do not, taking into account subtle linguistic differences between adverse effects and indications (the reason to use the medication). This is a rerun of the popular classification task organized in 2016, 2017, and 2018.

Data

  • Training data: 25,672 tweets (2,374 positive and 23,298 negative)
  • Evaluation data: approximately 5,000 tweets.
  • Evaluation metric: F-score for the ADR/positive class.
  • Test data: April 16, 0:00 UTC (Free registration is required, please contact Davy Weissenbacher)
  • Final run: April 20, 0:00 UTC
  • Codalab link
  • Baseline system: TBA

For each tweet, the publicly available data set contains: (i) the user ID, (ii) the tweet ID, and (iii) the binary annotation indicating the presence or absence of ADRs, as shown below. The evaluation data will contain the same information, but without the classes. Participating teams should submit their results in the same format as the training set (shown below).

Tweet ID            User ID      Class
354256195432882177  54516759     0
352456944537178112  1267743056   1
332479707004170241  273421529    0
340660708364677120  135964180    1
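The Task 1 metric can be sketched in a few lines of Python. This is not the official evaluation script, only a minimal illustration of F1 computed for the positive (ADR) class alone; the function name is ours.

```python
# Minimal sketch of the Task 1 metric: F1-score for the ADR/positive class
# only (illustrative; the official Codalab scorer may differ in details).

def f1_positive(gold, pred, positive=1):
    """F1-score restricted to the positive (ADR) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = [0, 1, 0, 1]
pred = [0, 1, 1, 1]
print(round(f1_positive(gold, pred), 4))  # 0.8
```

Note that with a 2,374/23,298 class split, plain accuracy would be dominated by the negative class, which is why the positive-class F1 is used.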

Task 2: Extraction of Adverse Effect mentions

As a follow-up to Task 1, this task involves identifying the text span of the reported ADRs and distinguishing ADRs from similar non-ADR expressions. ADRs are multi-token, descriptive expressions, so this sub-task requires advanced named entity recognition (NER) approaches. The data for this sub-task includes 2,000+ tweets fully annotated for mentions of ADRs and Indications. This set contains a subset of the tweets from Task 1 tagged as hasADR, plus an equal number of noADR tweets. Some tweets in the noADR subset were annotated for mentions of Indications to allow participants to develop techniques for dealing with this confusion class.

Data

  • Training data: 2,367 (1,212 positive and 1,155 negative)
  • Evaluation data: 1,000 (~500 positive, ~500 negative)
  • Evaluation metric: Strict and Relaxed F1-score, Precision and Recall
  • Test data: April 16, 0:00 UTC (Free registration is required, please contact Davy Weissenbacher)
  • Final run: April 20, 0:00 UTC
  • Codalab link
  • Baseline system: TBA

For each tweet, the publicly available data set contains: (i) the tweet ID, (ii) the start and (iii) end of the span, (iv) the annotation indicating an ADR or not, and (v) the text covered by the span in the tweet. The evaluation data will contain only the tweet IDs.

Tweet ID            Begin   End     Class   Text
346575368041410561  106     120     ADR     gaining weight
345922687161491456  27      34      ADR     sweatin
343961812334686208  -       -       noADR   -
345167138966876161  -       -       noADR   -
342380499802681344  118     139     ADR     difficult to come off
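The difference between the strict and relaxed F1-scores can be sketched as follows, treating Begin/End as character offsets as in the sample rows above. This is an illustration of one common convention (strict = exact boundaries, relaxed = any character overlap), not the organizers' official matching rules.

```python
# Sketch of strict vs. relaxed span-level F1 for Task 2 (illustrative only;
# the official evaluation script may apply different matching rules).
# Spans are (begin, end) character-offset pairs, treated as half-open.

def overlaps(a, b):
    """True if two half-open character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def span_f1(gold, pred, strict=True):
    match = (lambda g, p: g == p) if strict else overlaps
    tp_pred = sum(1 for p in pred if any(match(g, p) for g in gold))
    tp_gold = sum(1 for g in gold if any(match(g, p) for p in pred))
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = [(106, 120), (27, 34)]
pred = [(106, 120), (25, 34)]             # second span starts two chars early
print(span_f1(gold, pred, strict=True))   # 0.5
print(span_f1(gold, pred, strict=False))  # 1.0
```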

Task 3: Normalization of adverse drug reaction mentions (ADR)

Task 3 is an end-to-end task: the objective is to detect tweets mentioning an ADR and to map the extracted colloquial mentions of ADRs in the tweets to standard concept IDs in the MedDRA vocabulary (lower-level terms). This requires understanding the semantics of the ADR expressions in order to map them to standard concept IDs, and is likely to require a semi-supervised approach to successfully disambiguate ADRs.

Data

  • Training data: 2,367 (1,212 positive and 1,155 negative)
  • Evaluation data: 1,000 (~500 positive, ~500 negative)
  • Evaluation metric: Strict and Relaxed F1-score, Precision and Recall
  • Test data: April 16, 0:00 UTC (Free registration is required, please contact Davy Weissenbacher)
  • Final run: April 20, 0:00 UTC
  • Codalab link

For each ADR mention, the publicly available data set contains: (i) the tweet ID, (ii) the start and (iii) end of the span, (iv) the annotation indicating an ADR or not, (v) the text covered by the span in the tweet, and (vi) the corresponding ID of the preferred term in the MedDRA vocabulary. The evaluation data will contain only the tweet IDs.

Tweet ID            Begin   End     Class   Text                   MedDRA ID
346575368041410561  106     120     ADR     gaining weight         10047899
345922687161491456  27      34      ADR     sweatin                10020642
343961812334686208  -       -       noADR   -                      -
345167138966876161  -       -       noADR   -                      -
342380499802681344  118     139     ADR     difficult to come off  10048010
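A common starting point for the normalization step is a lexicon lookup, sketched below. The two MedDRA IDs are taken from the sample rows above; everything else (the tiny lexicon, the g-dropping repair) is a toy illustration, and a real system would need the full MedDRA lower-level-term vocabulary plus fuzzier matching for colloquial mentions.

```python
# Toy lexicon-lookup baseline for Task 3 normalization (illustrative only).
# The two IDs below come from the sample rows above; a real system needs the
# full MedDRA lower-level-term vocabulary and fuzzier matching to handle
# colloquial mentions such as "sweatin".

LEXICON = {
    "gaining weight": "10047899",
    "sweating": "10020642",
}

def normalize(mention):
    """Map an extracted ADR mention to a MedDRA ID, or None if unknown."""
    key = mention.lower().strip()
    if key in LEXICON:
        return LEXICON[key]
    # crude repair for g-dropping ("sweatin" -> "sweating"); illustrative only
    return LEXICON.get(key + "g")

print(normalize("sweatin"))  # 10020642
```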

Task 4: Generalizable identification of personal health experience mentions

Motivation

In health research using social media, researchers often need to build individual classifiers to identify health mentions of a particular disease in a particular context. Classification models that can generalize to different health contexts would be greatly beneficial to researchers in these fields (e.g., [1]), as they would allow researchers to more easily apply existing tools and resources to new problems. Motivated by this, the task will test tweet classification methods across diverse health contexts: the test data will include a health context very different from the training data, measuring the ability of tweet classifiers to generalize across health contexts.

Task and Data

The binary classification task is to classify whether a tweet contains a personal mention of one’s health (for example, sharing one’s own health status or opinion), as opposed to a more general discussion of the health issue, or an unrelated mention of the word. The shared task will involve multiple tweet datasets annotated for personal health mentions across different health issues. The training data will include data from one disease domain (influenza) across two contexts (being sick and getting vaccinated), both annotated for personal mentions (the user is personally sick or the user has been personally vaccinated). The test data will include an additional disease domain beyond influenza in a different context than being sick or being vaccinated.

The training data provided are Twitter data in two influenza contexts: the first is labeled with instances of flu infection [2], and the second with individuals who received a flu vaccine [3]. The first dataset originally had 4,516 tweets, but it was collected several years ago, so only ~1,000 are still available to download; approximately 54% of this dataset is labeled with the positive class. The second dataset originally had ~10,000 tweets, of which ~9,800 are still available; approximately 30% of it is labeled with the positive class. Note that in addition to the class imbalances, there is substantially more data available in the second dataset than in the first, which creates an imbalance of domain settings.

The testing data will include both a domain similar to the training data (influenza) and an entirely separate domain, in order to help test the generalizability of the classifiers. Testing data from the influenza domain will also contain data collected years after the original training data, to test the generalizability of the classifiers to future data [4]. The test data will have a class balance similar to the training sets (~30-60% positive).

Each dataset includes (i) the tweet ID, and (ii) the binary annotation. The evaluation data will contain the same information, but without the class labels. Participating teams should submit their results in the same format as the training set (shown below).

Tweet ID    Label
6004314210  0
6003319713  0
5991525204  0
5989718714  0
5986621813  0

Classification models

We realize that participants might have access to additional datasets that could improve model performance, for example other annotated health datasets beyond the training data provided. Participants are welcome to use additional data, but we ask that they clearly describe any additional data used in their paper.

Evaluation

Each submitted model will be evaluated using three F1-scores: F1 on the held-out influenza data, F1 on the second, undisclosed context, and the overall F1-score.
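Assuming each test tweet carries a domain tag (the names "flu" and "other" below are placeholders, since the second context is undisclosed), the three reported scores can be sketched as:

```python
# Sketch of the Task 4 evaluation: one F1 per domain plus an overall F1
# (illustrative only; domain names are placeholders, not the official tags).

def f1(gold, pred):
    """F1-score for the positive class over paired label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def evaluate(rows):
    """rows: list of (domain, gold_label, predicted_label) triples."""
    scores = {}
    for domain in sorted({d for d, _, _ in rows}):
        gold = [g for d, g, _ in rows if d == domain]
        pred = [p for d, _, p in rows if d == domain]
        scores[domain] = f1(gold, pred)
    scores["overall"] = f1([g for _, g, _ in rows], [p for _, _, p in rows])
    return scores

rows = [("flu", 1, 1), ("flu", 0, 0), ("other", 1, 0), ("other", 1, 1)]
print(evaluate(rows))
```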

Data

  • Training data: 10,876 tweets
  • Evaluation data: TBA
  • Evaluation metric: Accuracy, F1-score
  • Test data: April 15, 0:00 UTC (Free registration is required, please contact Davy Weissenbacher)
  • Final run: April 19, 0:00 UTC
  • Codalab link

References

  1. Payam Karisani, Eugene Agichtein. Did You Really Just Have a Heart Attack? Towards Robust Detection of Personal Health Mentions in Social Media. 2018. https://arxiv.org/abs/1802.09130
  2. Alex Lamb, Michael J. Paul, Mark Dredze. Separating fact from fear: Tracking flu infections on Twitter. 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), Atlanta. June 2013.
  3. Xiaolei Huang, Michael C. Smith, Michael J. Paul, Dmytro Ryzhkov, Sandra C. Quinn, David A. Broniatowski, Mark Dredze. Examining patterns of influenza vaccination in social media. AAAI Joint Workshop on Health Intelligence (W3PHIAI), San Francisco. February 2017.
  4. Xiaolei Huang, Michael J. Paul. Examining temporality in document classification. Association for Computational Linguistics (ACL), Melbourne, Australia. July 2018.