This new binary classification task involves automatically distinguishing tweets that self-report potential cases of COVID-19 (annotated as “1”) from those that do not (annotated as “0”). “Potential case” tweets include those indicating that the user or a member of the user’s household was denied testing for COVID-19, is symptomatic of it, was directly exposed to presumptive or confirmed cases, or has had experiences that pose a higher risk of exposure. “Other” tweets are related to COVID-19 and may discuss topics such as testing, symptoms, traveling, or social distancing, but do not indicate that the user or a member of the user’s household may be infected.
- Training data: 7,181 tweets
- Test data: 10,000 tweets
Register your team here : https://forms.gle/1qs3rdNLDxAph88n6
After your registration is approved, you will be invited to join the Google group for the task. The link to the dataset is available in the Google group's banner. If you do not receive the invite, please request to join the Google group under your team name using the link below.
Google groups : https://groups.google.com/g/smm4h21-task-5
Link to Codalab : https://competitions.codalab.org/competitions/28766
Evaluation Period for Task 5 :
Test Dataset Release | 28th Feb 2021 12:00am UTC |
Predictions Due | 2nd Mar 2021 11:59pm UTC (3:59pm PST) |

Submission format: Please use the format below for submission. Submissions should contain two columns, tweet_id and label, separated by tabs. All other columns will be ignored. Predictions for each task should be contained in a single .tsv (tab-separated values) file. This file (and only this file) should be compressed into a .zip file. Please upload this zip file as your submission.
tweet_id | label |
35656 | 0 |
34637 | 1 |
56844 | 0 |
12735 | 1 |
05745 | 0 |
24677 | 0 |
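The packaging steps above can be sketched in Python as follows. This is a minimal illustration, not an official submission tool. The function name `build_submission_zip` and the default file name `predictions.tsv` are hypothetical, and the header row is included on the assumption that it is expected, as the example table suggests. Tweet IDs are written as strings so that leading zeros (as in 05745) are preserved.

```python
import csv
import io
import zipfile


def build_submission_zip(predictions, zip_path, tsv_name="predictions.tsv"):
    """Write (tweet_id, label) pairs to one TSV file inside a zip archive.

    Assumption: a header row (tweet_id, label) is expected, as in the
    example table. The function and file names are illustrative only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["tweet_id", "label"])
    for tweet_id, label in predictions:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        # Keep the ID as text so leading zeros are not lost.
        writer.writerow([str(tweet_id), label])

    # The archive contains the TSV file and nothing else.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(tsv_name, buffer.getvalue())
    return zip_path


if __name__ == "__main__":
    build_submission_zip([("35656", 0), ("34637", 1), ("05745", 0)], "submission.zip")
```

Building the TSV in memory and writing it with `writestr` keeps stray files out of the archive, which matches the requirement that the zip contain only the predictions file.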
Evaluation Metric : F1-score for the “potential case” class
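For local validation, the F1-score for the positive (“1”) class can be computed with the standard definition, the harmonic mean of precision and recall. The sketch below is an assumption about how the metric is calculated; the organizers' actual scoring script is not shown here, and the function name `f1_positive_class` is hypothetical.

```python
def f1_positive_class(gold, predicted, positive=1):
    """Standard F1 for one class: 2 * P * R / (P + R).

    Illustrative only; not the official scoring script.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    # With no true positives, precision and recall are zero or undefined;
    # return 0.0 in that case.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold_labels = [1, 1, 0, 0, 1]
    system_labels = [1, 0, 0, 1, 1]
    print(f1_positive_class(gold_labels, system_labels))
```

Because only the “potential case” class is scored, correctly predicted “0” tweets do not raise the score; they matter only insofar as avoiding false positives improves precision.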
Contact information: Ari Klein (ariklein@pennmedicine.upenn.edu)