
The 9th Social Media Mining for Health Research and Applications (#SMM4H) Workshop, co-located with ACL 2024, serves as a unique venue for bringing together researchers interested in developing and sharing NLP methods that enable the systematic use of social media (SM) data for health research. The #SMM4H 2024 workshop and shared tasks have a special focus on Large Language Models (LLMs) and Generalizability for Social Media NLP. A variety of LLMs and their emerging capabilities promise the creation of a generalist artificial agent (Moor et al., 2023; Qin et al., 2023) capable of transferring knowledge acquired during training on massive corpora to solve unseen tasks on user-generated data. We seek to motivate such progress, benefiting from the ‘wisdom of the masses’ as reflected in SM, particularly in the realm of personal health.
NLP topics of interest to our workshop include:
• Information retrieval methods for obtaining relevant SM data
• Annotation schemes and evaluation techniques for health-related texts in SM
• Classifying health-related texts in SM
• Methods for the automatic detection, extraction, and normalization of health-related concept mentions in SM data
• Semantic methods in SM analysis
• Domain adaptation and transfer learning techniques for health-related texts in SM
Submit workshop papers here: OpenReview (https://openreview.net/group?id=aclweb.org/ACL/2024/Workshop/SMM4H)
Important Dates (tentative)
Organizers
Graciela Gonzalez-Hernandez, Cedars-Sinai Medical Center, USA
Dongfang Xu, Cedars-Sinai Medical Center, USA
Ivan Flores, Cedars-Sinai Medical Center, USA
Davy Weissenbacher, Cedars-Sinai Medical Center, USA
Ari Z. Klein, University of Pennsylvania, USA
Karen O'Connor, University of Pennsylvania, USA
Abeed Sarker, Emory University, USA
Yao Ge, Emory University, USA
Juan M. Banda, Stanford Health Care, USA
Raul Rodriguez-Esteban, Roche Pharmaceuticals, Switzerland
Lucia Schmidt, Roche Pharmaceuticals, Switzerland
Vishakha Sharma, Roche Diagnostics, USA
Lisa Raithel, Technical University of Berlin, Germany
Pierre Zweigenbaum, Université Paris-Saclay, France
Roland Roller, German Research Center for Artificial Intelligence, Germany
Philippe Thomas, German Research Center for Artificial Intelligence, Germany
Eiji Aramaki, NAIST, Japan
Shuntaro Yada, NAIST, Japan
In 2024, #SMM4H is also organizing 7 shared tasks: participants will be provided with annotated training and validation data to develop their systems, followed by 7 days during which they will run their systems on unlabeled test data and upload their predictions to CodaLab. The individual CodaLab site for each task can be found below. Teams may upload up to 2 sets of predictions per task.
Please use this form (https://forms.gle/7w4si27uJrCMiTyL8) to register. When your registration is approved, you will be invited to a Google group, where the data sets will be made available. Registered teams are required to submit a paper describing their systems. System descriptions may consist of up to 4 pages and must follow the ACL formatting. Teams participating in multiple tasks are permitted an additional page. Sample system descriptions can be found in past proceedings (under the Past events tab). In order for accepted system descriptions to be included in the proceedings, at least one author must register for and present at the #SMM4H 2024 Workshop.
Submit system description papers here: OpenReview (https://openreview.net/group?id=aclweb.org/ACL/2024/Workshop/SMM4H)
Adverse Drug Events (ADEs), also often known as Adverse Drug Reactions (ADRs), are negative side effects related to a drug. Mining ADEs from social media is one of the most studied topics in Social Media Pharmacovigilance, as it could help detect ADEs early, directly from user-generated content. In this task, we focus on identifying standardized ADEs in tweets. Our reference for ADE concepts is MedDRA, a rich and highly specific medical terminology.
To enable novel approaches to ADE mining, and in contrast to previous iterations of this task (SMM4H-2022 and SMM4H-2023), our evaluation this year will consider both 1) the text spans and 2) the MedDRA IDs (preferred term IDs) of ADEs in tweets. We will provide participants with about 18,000 labeled tweets for training and about 14,000 tweets for testing. More details about the submission format and evaluation metrics will be available on the SMM4H-2024 CodaLab site.
Here are some data examples:
Tweet:
SMM4H2022uCZV2SRsCe4vzjFm @USER__________ have to go to a doc now to see why i'm still gaining. stupid paxil made me gain like 50 pounds ?? and now i have to lose it
Annotation:
SMM4H2022uCZV2SRsCe4vzjFm ADE 61 68 gaining 10047896
SMM4H2022uCZV2SRsCe4vzjFm ADE 91 110 gain like 50 pounds 10047896
For testing, participants will only be required to provide the text spans and ADE preferred term IDs (ptIDs), e.g.,
Submission:
SMM4H2022uCZV2SRsCe4vzjFm gaining 10047896
SMM4H2022uCZV2SRsCe4vzjFm gain like 50 pounds 10047896
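The relationship between the training annotation and the submission line can be sketched in a few lines of Python. This is only an illustration: it assumes the fields are tab-separated, that the offsets are 0-based character positions into the tweet text (excluding the tweet ID) with an exclusive end, and the names `AdeAnnotation`, `parse_annotation`, and `to_submission_line` are hypothetical helpers, not part of any official tooling.

```python
from typing import NamedTuple


class AdeAnnotation(NamedTuple):
    tweet_id: str
    start: int  # assumed: inclusive character offset into the tweet text
    end: int    # assumed: exclusive character offset
    span: str
    pt_id: str  # MedDRA preferred term ID


def parse_annotation(line: str) -> AdeAnnotation:
    """Parse one training annotation line (assumed to be tab-separated)."""
    tweet_id, _label, start, end, span, pt_id = line.rstrip("\n").split("\t")
    return AdeAnnotation(tweet_id, int(start), int(end), span, pt_id)


def to_submission_line(ann: AdeAnnotation) -> str:
    """Keep only the fields that appear in the submission example."""
    return "\t".join([ann.tweet_id, ann.span, ann.pt_id])


# Tweet text from the example above, without the leading tweet ID.
tweet = (
    "@USER" + "_" * 10 + " have to go to a doc now to see why i'm still gaining. "
    "stupid paxil made me gain like 50 pounds ?? and now i have to lose it"
)
lines = [
    "SMM4H2022uCZV2SRsCe4vzjFm\tADE\t61\t68\tgaining\t10047896",
    "SMM4H2022uCZV2SRsCe4vzjFm\tADE\t91\t110\tgain like 50 pounds\t10047896",
]

annotations = [parse_annotation(line) for line in lines]
# Check that each (start, end) pair really selects the annotated span.
offsets_ok = [tweet[a.start:a.end] == a.span for a in annotations]
submission = [to_submission_line(a) for a in annotations]
```

Under these assumptions, slicing the tweet with the annotated offsets reproduces both spans, which is a cheap sanity check to run on the training data before building a system.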
Contact: Dongfang Xu, Cedars-Sinai Medical Center, USA (Dongfang.Xu@cshs.org)
Google Group: smm4h-2024-task-1@googlegroups.com
CodaLab: TBA
The task targets both the extraction of drug and disorder/body function mentions (Subtask 2a) and the extraction of relations between those entities (joint Named Entity Detection and Relation Extraction, Subtask 2b). The task is set up in a cross-lingual few/zero-shot scenario: Training data consist mostly of Japanese and German data plus four French documents. The submitted systems will be evaluated on Japanese, German, and, finally, on French data.
The data provided originates from different social media sources (online patient fora and X/Twitter) and is available in the aforementioned three languages: German, French, and Japanese. The German (training and test) data is from an online patient forum, whereas the Japanese documents are from X (training) and a patient forum (test). The French data, finally, is a translation of German documents from the same patient forum as the German data. The documents do not overlap.
All data are annotated with the same annotation guidelines, with a focus on the detection and extraction of adverse drug reactions (negative medical side effects), modeled by associating medication mentions with disorder (medical signs and symptoms) and body function mentions. The relation distribution is imbalanced (i.e., the number of "treatment_for" relations is much lower than the number of "caused" relations), adding to the difficulty of the task.
The participants are expected to submit multi-lingual systems (FR + DE + JA) for one or both of the following tasks:
1. Named entity recognition of the entities "drug", "disorder" and "function"
2. Joint Named Entity and Relation Extraction of the entities "drug", "disorder" and "function" and the relations
- Drug → treatment_for → disorder/function
- Drug → caused → disorder/function
We will provide the participants with Japanese and German data containing entity and relation annotations in brat format. Further, we will provide a small set of French documents to allow few-shot approaches for both subtasks.
Training Data:
- German: 70 documents, collected from a German patient forum
- Japanese: 392 documents, collected from X (Twitter)
- French: 4 documents, collected from a German forum and translated to French (distinct from the German data)
Example:
Text: She took infliximab but she became red all over.
Annotation and submission format:
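As a rough illustration of brat standoff notation, the sketch below builds annotation lines for the example sentence. The choice of spans ("infliximab" as a drug, "red all over" as a disorder) and the resulting "caused" relation are assumptions made for illustration, not the organizers' gold annotation; `entity_line` and `relation_line` are hypothetical helpers.

```python
text = "She took infliximab but she became red all over."


def entity_line(ent_id: str, label: str, mention: str) -> str:
    """Build a brat text-bound annotation: ID<TAB>TYPE START END<TAB>TEXT."""
    start = text.index(mention)  # first occurrence; raises ValueError if absent
    end = start + len(mention)   # brat end offsets are exclusive
    return f"{ent_id}\t{label} {start} {end}\t{mention}"


def relation_line(rel_id: str, label: str, arg1: str, arg2: str) -> str:
    """Build a brat relation annotation: ID<TAB>TYPE Arg1:ID Arg2:ID."""
    return f"{rel_id}\t{label} Arg1:{arg1} Arg2:{arg2}"


# Illustrative spans only, chosen by hand for this sketch.
ann_lines = [
    entity_line("T1", "drug", "infliximab"),
    entity_line("T2", "disorder", "red all over"),
    relation_line("R1", "caused", "T1", "T2"),
]
print("\n".join(ann_lines))
```

In brat, these lines would live in a `.ann` file next to a `.txt` file holding the raw text; the exact entity boundaries expected by the task are defined by the annotation guidelines, not by this sketch.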
Evaluation:
Submissions will be ranked by non-weighted macro F1-score, Precision, and Recall. Our evaluation script is a modified version of "brateval" and can be found here: https://github.com/Erechtheus/brateval.
For Subtask 2a, we will use exact match of entities for calculating the above-mentioned scores. (In the evaluation script, this corresponds to the parameters "-span-match exact".)
For Subtask 2b, joint entity and relation extraction, note that both entity boundaries and types, as well as relation types and arguments, have to match exactly. (In the evaluation script, this corresponds to the parameters "-type-match exact -span-match exact".)
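To make the effect of exact matching concrete, here is a minimal sketch of precision, recall, and F1 over sets of annotations. It is not the brateval implementation; the function name `exact_match_prf`, the tuple representation, and the offsets (hypothetical spans for the example sentence) are assumptions for illustration.

```python
from typing import Hashable, Iterable, Tuple


def exact_match_prf(
    gold: Iterable[Hashable], pred: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Precision, recall, F1 where an item counts only if it matches exactly."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Entities as (type, start, end); the second predicted span is too short.
gold_entities = [("drug", 9, 19), ("disorder", 35, 47)]
pred_entities = [("drug", 9, 19), ("disorder", 35, 38)]
entity_scores = exact_match_prf(gold_entities, pred_entities)

# Relations as (type, arg1 entity, arg2 entity): a wrong argument boundary
# makes the whole relation count as a miss.
gold_relations = [("caused", ("drug", 9, 19), ("disorder", 35, 47))]
pred_relations = [("caused", ("drug", 9, 19), ("disorder", 35, 38))]
relation_scores = exact_match_prf(gold_relations, pred_relations)
```

In this toy case the entity scores are all 0.5, while the relation scores drop to 0.0 because one argument boundary is off, which shows why the joint subtask is stricter. A macro average would then be taken over per-type scores computed in this way.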
We explicitly encourage new and creative approaches to both subtasks.
More details and example annotations can be found in the respective CodaLab challenge description.
Contact: Lisa Raithel, Technical University of Berlin, Germany (raithel@tu-berlin.de)
Google Group: smm4h-2024-task-2@googlegroups.com
CodaLab: TBA
Social anxiety disorder (SAD) is an anxiety disorder whose onset occurs mostly in early adolescence and which may affect up to 12% of the population at some point in their lives. About one-third of people with SAD report experiencing symptoms for 10 years before seeking treatment. However, people do turn to social media outlets, such as Reddit, to discuss their symptoms and to share or ask other users what may help alleviate them. As has been found with other anxiety disorders, being outdoors in green or blue spaces may be beneficial for relieving symptoms, but scant research exists on the effect of these spaces on SAD. In order to qualitatively assess the effects of outdoor spaces, posts that mention these locations and the user’s sentiment towards them must be identified for further study.
This task is a multi-class classification task that categorizes posts mentioning one or more predetermined keywords related to outdoor spaces into one of four categories: 1) positive effect, 2) neutral or no effect, 3) negative effect, or 4) unrelated, where the keyword mention does not reference an actual outdoor space. Details for each class can be found in the associated annotation guidelines (TBA). Each post has exactly one category. The task has 3,000 annotated posts, which were downloaded from the r/socialanxiety subreddit and filtered first to include only users between the ages of 12 and 25, and then for the mention of one of 80 keywords related to green or blue spaces. 80% of the data will be made available for training and validation, and 20% will be held out for evaluation. The evaluation metric for this task is the micro-averaged F1-score over all 4 classes. The data consist of annotated collections of Reddit posts, shared as CSV files with 4 fields: post_id, keyword, text, label. The training data is already prepared and will be available to teams that register to participate. The testing data will be released when the evaluation phase starts.
· Training data: 1800 posts
· Validation data: 600 posts
· Testing data: 600 posts
· Evaluation metric: macro-averaged F1-score
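Since the description mentions both micro- and macro-averaged F1, the following sketch computes both for a four-class problem so the difference is visible. The label strings and the toy gold/predicted lists are invented for illustration; the actual label encoding in the CSV files may differ.

```python
from collections import Counter
from typing import Sequence, Tuple


def micro_macro_f1(
    gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]
) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t: int, p: int, n: int) -> float:
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, every class weighted equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Hypothetical label names and toy data.
labels = ["positive", "neutral", "negative", "unrelated"]
gold = ["positive", "positive", "neutral", "negative", "unrelated", "unrelated"]
pred = ["positive", "positive", "neutral", "neutral", "unrelated", "positive"]
micro, macro = micro_macro_f1(gold, pred, labels)
```

With one label per post, micro-averaged F1 equals plain accuracy (4 of 6 correct here), whereas macro-averaged F1 is pulled down by the class that is never predicted correctly, so the two metrics can rank systems differently on imbalanced data.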
Contact: Karen O'Connor, University of Pennsylvania (karoc@pennmedicine.upenn.edu)
Google Group: TBA
CodaLab: TBA
Substance use, both prescription and illicit, has become a significant public health concern, leading to addiction, overdose, and associated health issues. Understanding the clinical and social impacts of nonmedical substance use is essential for improving the treatment of substance use disorder. It helps healthcare professionals develop more effective interventions and medications to address addiction. By studying these impacts, researchers can develop more effective prevention and education programs to reduce the occurrence of nonmedical substance use and its associated clinical and social consequences. In this named entity recognition task, we focus on two entity types: clinical impacts and social impacts. Instances in the clinical impacts category describe the clinical effects, consequences, or impacts of substance use on individuals' health, physical condition, or mental well-being. Instances in the social impacts category describe the societal, interpersonal, or community-level effects, consequences, or impacts of nonmedical substance use. These impacts may include social relationships, community dynamics, or broader social issues. In this task, 27.8% of posts contain words or phrases marked as clinical or social impacts. Systems designed for this task need to detect these impacts, with their specific spans, in text data derived from Reddit and automatically distinguish between clinical impacts and social impacts. Specifically, we anticipate that the strategies will involve leveraging Large Language Models (LLMs).
Training data: 843 posts
Validation data: 259 posts
Test data: 278 posts
Evaluation metric: F1-score
Participants of this task must sign a data use agreement (DUA) confirming that the data will not be redistributed.
Data Examples
Text= “In PA at a 28 day detox/rehab they used methadone to get me off of bupe.”
Submission Format
Please submit using the same format as Data Example.
Contact: Yao Ge, Emory University, USA (yao.ge@emory.edu)
Google Group: smm4h24-task-4-extraction-of-the-clinical-and-social-impacts@googlegroups.com
CodaLab: https://codalab.lisn.upsaclay.fr/competitions/16648
Many children are diagnosed with disorders that can impact their daily life and can last throughout their lifetime. For example, in the United States, 17% of children are diagnosed with a developmental disability, and 8% of children are diagnosed with asthma. Meanwhile, sources of data for assessing the association of these outcomes with pregnancy exposures remain limited. This binary classification task involves automatically distinguishing tweets, posted by users who had reported their pregnancy on Twitter, that report having a child with attention-deficit/hyperactivity disorder (ADHD), autism spectrum disorders (ASD), delayed speech, or asthma (annotated as "1"), from tweets that merely mention a disorder (annotated as "0"). Sample tweets are shown in the table below. This task enables the use of Twitter on a large scale not only for epidemiologic studies, but, more generally, to explore parents' experiences and directly target support interventions. The training, validation, and test sets contain 7398 tweets, 389 tweets, and 1947 tweets, respectively. The evaluation metric is the F1-score for the class of tweets that report having a child with a disorder.
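The metric named above, F1 for the positive class only, can be sketched as follows. The function name and the toy labels are assumptions for illustration; the official scoring script may differ in details such as label types.

```python
from typing import Sequence


def positive_class_f1(
    gold: Sequence[str], pred: Sequence[str], positive: str = "1"
) -> float:
    """F1 computed only for the positive class of a binary task."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: two true positives, one false positive, one false negative.
gold = ["1", "0", "1", "0", "1"]
pred = ["1", "1", "0", "0", "1"]
score = positive_class_f1(gold, pred)
```

Correctly labeling the "0" tweets does not raise this score directly; only true positives, false positives, and false negatives for class "1" enter the formula.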
Contact: Ari Klein, University of Pennsylvania, USA (ariklein@pennmedicine.upenn.edu)
Google Group: smm4h-2024-task-5@googlegroups.com
CodaLab: TBA
Because social media is used by patients in every aspect of their daily lives, its analysis presents a promising way to understand the patient’s perspective on their disease journey, their unmet medical needs and their disease burden. Social media listening (SML) can, therefore, foster progress in our understanding of diseases and influence the development of new therapies.
Advancing the utility of social media data for research applications requires methods for automatically detecting demographic information about social media study populations, including users' age. Automatically identifying the self-reported exact age of social media users, rather than their age groups, which is the standard approach, would enable the large-scale use of social media data for applications that do not align with predefined age groupings of extant models, including health applications such as linking specific age-related risk factors in observational studies.
In this task, we focus on the automatic extraction of self-reported ages in posts of two social media platforms: Twitter (now X) and Reddit.
Training Data:
8,800 tweets (SMM4H22)
100k unlabeled Reddit posts from r/AskDocs containing two-digit numbers (the proportion of age reports appears to be high)
Validation Data:
2,200 tweets (SMM4H22)
1,000 Reddit dry eye disease posts (SMM4H22)
Testing Data:
2,200 tweets (SMM4H22)
2,000 Reddit dry eye disease posts (SMM4H22)
12,482 Reddit social anxiety posts (ages only in the 13 to 25 range)
Evaluation metric: F1-score on the positive class (‘1’). The micro average will also be calculated afterwards.
Table 1 provides sample training data, which include the ID, the source type (Twitter or Reddit), the text and the annotated binary class.
Entries were annotated as "1" if the user's exact age could be determined from the text at the time the entry was posted, and as "0" otherwise.
- In the first entry, the user's exact age is explicitly stated.
- Although the second entry does not explicitly state the user's age, it can be inferred from the fact that the user reports turning 20 the day after posting.
- The third entry does not specify when the user will be 21, but it was annotated as "1" under the assumption that the entry is referring to the user's next birthday.
- The fourth entry was annotated as "0" because it is ambiguous about whether the user was 21 when the entry was posted, or whether the user is referring to a future age.
- The fifth entry was also annotated as "0" because it is ambiguous whether the user was 18 when the tweet was posted, or whether the user is referring to an age further in the past.
- The sixth entry was annotated as "0" because it does not refer to the age of the user, but rather the user's brother.
- The seventh and eighth entries were annotated as "1" because the users express their age according to the Reddit convention, in which the age is followed by a letter indicating their gender (25m, 27f).
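The Reddit convention mentioned in the last item lends itself to a simple pattern match. The sketch below is a rule-based illustration only, not a proposed system: the regular expression, the helper name, and the example strings are all invented here, and a pattern like this would miss most of the cases discussed in the other items.

```python
import re
from typing import List, Tuple

# Two digits immediately followed by "m" or "f" as a standalone token,
# e.g. "25m" or "(27F)". The trailing \b keeps "25mg" from matching.
AGE_GENDER = re.compile(r"\b(\d{2})([mf])\b", re.IGNORECASE)


def reddit_style_ages(text: str) -> List[Tuple[int, str]]:
    """Return (age, gender letter) pairs written in the Reddit shorthand."""
    return [(int(age), gender.lower()) for age, gender in AGE_GENDER.findall(text)]


# Invented example strings, not taken from the task data.
first = reddit_style_ages("25m, dry eyes since last winter")
second = reddit_style_ages("I (27F) have had this for years")
third = reddit_style_ages("took 25mg yesterday, my brother is 21")
```

A rule like this could serve as a weak labeling signal for the unlabeled Reddit posts, but deciding whether a number is the poster's own current age still requires the contextual judgment described in the annotation notes.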
Contact:
Ana Lucia Schmidt, Roche (lucia.schmidt@roche.com)
Ari Klein, University of Pennsylvania, USA (ariklein@pennmedicine.upenn.edu)
Google Group: smm4h-2024-task-5@googlegroups.com
CodaLab: TBD
The current widespread adoption of LLMs, like ChatGPT, for data annotation tasks has the NLP field at odds. Some researchers are embracing them because of their decent performance in many types of annotation tasks and certain domains, while others are more skeptical because of the potential underlying biases and hallucinations of these models. It will become of paramount importance to be able to identify which data was annotated by LLMs and which by humans. In this task, we provide two datasets, one annotated by human domain experts and the other annotated by an LLM (GPT-4). The task at hand involves the detection and extraction of COVID-19 symptoms in tweets written specifically in Latin American Spanish. The task includes both personal self-reports and third-party mentions of symptoms, in an effort to generalize the identification of various disease symptoms in Latin American Spanish to both colloquial and formal language domains. The domain-expert annotated dataset was used in Task 3 of Social Media Mining for Health 2023 and consists of a total of 10,150 tweets, which will be released in full. An equally sized dataset of non-overlapping tweets was annotated using GPT-4 with some prompt engineering. This dataset was not curated by any domain experts and was only verified to make sure the spans are marked correctly; no LLM-generated annotations have been corrected, improved, or modified in any way. The test set for this task will be the unreleased test set from Social Media Mining for Health 2023 and its LLM-annotated counterpart.
Evaluation metric: Classification accuracy (human or machine)
Baseline: Our current baseline system achieves an 82% classification accuracy for detection of human and LLM annotated tweets.
Rules of engagement: The systems designed for this task are encouraged to leverage LLMs. However, any instruction tuning needs to be documented and presented as part of your solution. The same goes for any prompt engineering performed. Traditional systems are also invited to participate.
Sub-task b) In this subtask, participants are asked to perform symptom detection on an additional testing set of 2,113 tweets annotated by human domain experts. The idea is to determine whether machine-annotated data boosts the performance of NLP systems when used to augment human-annotated data, or whether it is sufficient on its own. Other approaches or systems, such as instruction-tuned or prompt-engineered LLMs, can also be used.
Submission Format for main task:
Tab-separated file with headers corresponding to tweet ID and class: human or machine
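A minimal sketch of writing such a file with Python's standard `csv` module is given below. The header names `tweet_id` and `class`, the example IDs, and the helper name are assumptions; the exact header strings expected by the scoring system are not specified here.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple

ALLOWED_CLASSES = {"human", "machine"}


def write_submission(rows: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write (tweet ID, class) pairs as a tab-separated file with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["tweet_id", "class"])  # assumed header names
    for tweet_id, label in rows:
        if label not in ALLOWED_CLASSES:
            raise ValueError(f"unexpected class {label!r} for tweet {tweet_id}")
        writer.writerow([tweet_id, label])


# Hypothetical tweet IDs, written to an in-memory buffer for demonstration.
buffer = io.StringIO()
write_submission([("1001", "human"), ("1002", "machine")], buffer)
tsv_text = buffer.getvalue()
```

To produce a file on disk, the same function can be called with a handle from `open(path, "w", newline="", encoding="utf-8")`.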
Submission format for subtask b:
Tab-separated file with headers
Important Note: Please name your submission with the type of data approach you are using.
For example: LLM, or Human + Machine, etc.
Contact: Juan M. Banda (jmbanda@stanford.edu)
Google Group: TBA
CodaLab: TBD
