Expanding on an #SMM4H 2022 task on classifying Spanish tweets that self-report COVID-19 symptoms, this task focuses on detecting and extracting COVID-19 symptoms in tweets written specifically in Latin American Spanish. The task covers both personal self-reports and third-party mentions of symptoms, in an effort to generalize the identification of disease symptoms in Latin American Spanish to both colloquial and formal language domains. The dataset consists of tweets annotated by medical doctors who are native Latin American Spanish speakers; the annotations include a label indicating whether the tweet mentions a symptom and the character offsets of each symptom. In addition, participants will be provided with the dataset for the aforementioned #SMM4H 2022 task, along with BERT-like language models pretrained on Latin American Spanish tweets. The task involves NER offset detection and classification: participants must find the beginning and end of each symptom mention. The evaluation metric is the strict F1-score for identifying the character offsets of COVID-19 symptoms.
Dataset annotation guidelines: adapted from the annotation guidelines of the 2022 SocialDisNER SMM4H shared task (available at https://zenodo.org/record/6983041).
- Training data: 6,021 tweets
- Validation data: 1,979 tweets
- Test data: 2,150 tweets
- Evaluation metric: Strict F1-score
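The official scoring script is not described here, so the following is only a minimal sketch of how a strict F1-score is commonly computed for span extraction. It assumes that "strict" means a predicted span counts as correct only when its tweet ID, both character offsets, and its label all match a gold annotation exactly. The function name `strict_f1` and the tuple layout are hypothetical and chosen for illustration.

```python
def strict_f1(gold, pred):
    """Return (precision, recall, F1) under exact-match span scoring.

    Each annotation is a (tweet_id, begin, end, label) tuple. This layout is
    an assumption for illustration, not the official scorer's input format.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)  # exact matches only
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [("25", 131, 139, "síntoma"), ("25", 198, 201, "síntoma")]
# The second prediction starts one character too early, so it does not count.
pred = [("25", 131, 139, "síntoma"), ("25", 197, 201, "síntoma")]
precision, recall, f1 = strict_f1(gold, pred)
```

Under this assumption, a span that overlaps a gold symptom but misses either boundary by a single character is scored as both a false positive and a false negative, which is why careful boundary detection matters for this metric.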
Submission format: tab-separated file with a header row, in the same format as the validation set.
tweet_id | begin | end | label | span |
25 | 131 | 139 | síntoma | dolor de cabeza |
25 | 198 | 201 | síntoma | nauseas |
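As a rough sketch of how such a file could be produced, the snippet below writes predictions as tab-separated text with a header row using Python's standard `csv` module. The column names are taken from the example above; the helper name `write_submission`, and the assumption that the validation file uses exactly this column order with no extra quoting, are illustrative and should be checked against the released validation set.

```python
import csv
import io

# Column names as shown in the example; the exact order in the validation
# file is assumed to match.
FIELDS = ["tweet_id", "begin", "end", "label", "span"]


def write_submission(rows, handle):
    """Write prediction rows as tab-separated text with a header row."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"tweet_id": "25", "begin": 131, "end": 139,
     "label": "síntoma", "span": "dolor de cabeza"},
    {"tweet_id": "25", "begin": 198, "end": 201,
     "label": "síntoma", "span": "nauseas"},
]

# An in-memory buffer is used here; for a real submission, pass a file opened
# with open(path, "w", encoding="utf-8", newline="") instead.
buffer = io.StringIO()
write_submission(rows, buffer)
submission_text = buffer.getvalue()
```

Opening the output file with UTF-8 encoding keeps accented labels such as `síntoma` intact.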
Contact: Juan Banda, Georgia State University, USA (jbanda@gsu.edu)
CodaLab: TBA