SMM4H 2023 – Task 1: Binary classification of English tweets self-reporting a COVID-19 diagnosis

To facilitate the use of Twitter data for monitoring personal experiences of COVID-19 in real time and on a large scale, this binary classification task involves automatically distinguishing tweets that self-report a COVID-19 diagnosis (annotated as “1”), for example, a positive test, clinical diagnosis, or hospitalization, from those that do not (annotated as “0”). By this definition, a tweet that merely states that the user has experienced COVID-19 would not be considered a self-reported diagnosis.

The training data include the Tweet ID, the text of the Tweet Object, and the annotated binary label. System predictions for the validation and test data should be submitted through CodaLab. Submissions should be formatted as a ZIP file containing a single TSV file named prediction_task1.tsv with only two columns, separated by a tab: the tweet_id column first and the label column second. The TSV file should sit at the top level of the ZIP file, not inside a folder, and the ZIP file should not contain any other files or folders.
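
The submission format described above can be sketched in Python as follows. This is a minimal illustration, not an official tool: the function name write_submission, the archive name submission.zip, and the optional header row (tweet_id, label) are assumptions. The task description does not say whether a header row is expected, so that is worth checking against the CodaLab instructions once they are posted.

```python
import csv
import io
import zipfile


def write_submission(predictions, zip_path="submission.zip", include_header=True):
    """Write (tweet_id, label) pairs to prediction_task1.tsv at the root of a ZIP.

    The function name, default archive name, and header row are assumptions
    made for illustration; only the TSV file name and column order come from
    the task description.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    if include_header:
        # Assumption: a header row naming the two columns is accepted.
        writer.writerow(["tweet_id", "label"])
    for tweet_id, label in predictions:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        # tweet_id first, label second, separated by a tab.
        writer.writerow([tweet_id, label])

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # A bare file name (no directory prefix) keeps the TSV at the top
        # level of the archive, and nothing else is added to the ZIP.
        archive.writestr("prediction_task1.tsv", buffer.getvalue())
    return zip_path
```

For example, write_submission([("1234567890", 1), ("1234567891", 0)]) would produce a ZIP whose only member is prediction_task1.tsv. Passing tweet IDs as strings avoids any loss of precision for long numeric IDs.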

  • Training data: 7,600 tweets
  • Validation data: 400 tweets
  • Test data: 10,000 tweets
  • Evaluation metric: F1-score for the “positive” class (i.e., tweets that self-report a COVID-19 diagnosis)
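
For checking a system against the validation labels, the F1-score for the “positive” class can be computed along these lines. This is a sketch under the standard definition (harmonic mean of precision and recall for label 1); the function name positive_class_f1 is hypothetical, and the official scoring script may handle edge cases differently.

```python
def positive_class_f1(gold, predicted, positive=1):
    """F1-score for the positive class, given parallel lists of labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    # True positives: positive tweets correctly predicted as positive.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    # False positives: non-positive tweets predicted as positive.
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    # False negatives: positive tweets predicted as non-positive.
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    if tp == 0:
        # Assumption: F1 is reported as 0 when precision or recall is undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
gold_labels = [1, 0, 1, 1, 0]
system_labels = [1, 1, 0, 1, 0]
score = positive_class_f1(gold_labels, system_labels)
```

In the toy example, precision and recall are both 2/3, so the F1-score is also 2/3. Tweets correctly labeled “0” do not contribute to the score, which is why a system that predicts “0” for every tweet scores 0 despite potentially high accuracy.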

Contact: Ari Klein, University of Pennsylvania, USA

CodaLab: TBA