Social Media Mining for Health/Health Real-World Data (#SMM4H-HeaRD) 2026 Workshop and Shared Tasks


Workshop

The Social Media Mining for Health Applications and Health Real-World Data (#SMM4H-HeaRD) Workshop provides an interdisciplinary forum for presenting and discussing advances in natural language processing (NLP), machine learning, and artificial intelligence at the intersection of health and web-based data. Now in its 11th edition, the #SMM4H-HeaRD 2026 Workshop will be held online and co-located with the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026): https://2026.aclweb.org/

This year’s edition will emphasize privacy-preserving data sharing and real-world evaluation, particularly in the context of health-related text from sources such as social media, electronic health records, and biomedical literature. In alignment with the ACL 2026 Theme Track on the “Explainability of NLP Models”, we also invite work that improves the transparency, interpretability, and trustworthiness of models applied in high-stakes clinical and public health contexts.


Topics of interest to our Workshop include:

  • Creation and sharing of open, privacy-preserving health NLP datasets
  • Synthetic data generation, weak supervision, and resource-constrained learning
  • Scalable and interpretable evaluation methods, including LLM-as-a-judge
  • Explainable and interpretable methods for clinical NLP
  • Evaluation and quality assessment of free-text model justifications and rationales
  • Information extraction and classification from biomedical literature
  • Multilingual and efficient approaches for large-scale analysis of social media data
  • Text-to-text generation for clinical and biomedical applications


Submit workshop papers here: TBA


Important Dates

Submission deadline: April 24, 2026

Notification of acceptance: May 15, 2026

Camera-ready papers due: May 25, 2026

Workshop: July 2-3, 2026

Workshop Chair

Graciela Gonzalez-Hernandez, Cedars-Sinai Medical Center, USA

Organizers

Graciela Gonzalez-Hernandez, Cedars-Sinai Medical Center, USA

Guillermo Lopez-Garcia, Cedars-Sinai Medical Center, USA

Dongfang Xu, Cedars-Sinai Medical Center, USA

Ivan Flores, Cedars-Sinai Medical Center, USA

Jose Miguel Acitores, Cedars-Sinai Medical Center, USA

Jacob Berkowitz, Cedars-Sinai Medical Center, USA

Nicholas Tatonetti, Cedars-Sinai Medical Center, USA

Ari Z. Klein, University of Pennsylvania, USA

Abeed Sarker, Emory University, USA

Sumon Kanti Dey, Emory University, USA

Ahmad Rezaie Mianroodi, Dalhousie University, Vector Institute, Canada

Frank Rudzicz, Dalhousie University, Vector Institute, Canada

Lisa Raithel, Technische Universität Berlin, Germany

Roland Roller, DFKI, Germany

Philippe Thomas, DFKI, Germany

Farnaz Zaidi, Paul Ehrlich Institute, Germany

Yu Zhai, Hong Kong Polytechnic University, Hong Kong

Tomohiro Nishiyama, Nara Institute of Science and Technology, Japan

Pierre Zweigenbaum, LISN, CNRS, Université Paris-Saclay, France

Elena Tutubalina, AIRI, Kazan Federal University, Russia

Salvador Lima-López, Barcelona Supercomputing Center, Spain

Judith Rosell, Barcelona Supercomputing Center, Spain

Martin Krallinger, Barcelona Supercomputing Center, Spain

Shared Task 

The Social Media Mining for Health Applications and Health Real-World Data (#SMM4H-HeaRD) shared tasks address natural language processing (NLP), machine learning, and artificial intelligence challenges involved in analyzing and modeling health-related content from social media and other real-world data sources. Now in its 11th edition, the #SMM4H-HeaRD 2026 Workshop, co-located with the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), expands its mission by organizing eight shared tasks across diverse domains and data modalities.


For each of the eight tasks listed below, participating teams will be provided with annotated training and validation datasets. A 5-day evaluation phase will follow, during which teams will run their systems on unlabeled test data and submit their predictions via CodaLab.


To participate, please complete the registration form: https://forms.gle/oE9gfaNxFw2f6gyX6. Once your registration is approved, you will be added to a task-specific Google Group that will be used for all communications, data releases, and important updates. Registered teams are required to submit a paper describing their systems. In order for accepted system descriptions to be included in the proceedings, at least one author must register for and present at the #SMM4H-HeaRD 2026 Workshop (online).


Submit system description papers here: TBA


Important Dates

Training and validation data available: January 11, 2026

System predictions for validation data due: April 4, 2026

Test data available: April 10, 2026

System predictions for test data due: April 14, 2026

Submission deadline for system description papers: April 24, 2026

Notification of acceptance: May 15, 2026

Camera-ready papers due: May 25, 2026

Workshop: July 2-3, 2026


Task 1 - Detection of Adverse Drug Events in Multilingual and Multi-platform Social Media Posts

Adverse Drug Events (ADEs) are negative medical side effects associated with a drug. Extracting ADE mentions from user-generated text has gained significant attention in research, as it can help detect crowd signals from online discussions. Leveraging multilingual methods to analyze ADE reports across languages and borders further enhances this effort.


For this shared task, we provide user-generated messages from patient forums, X, drug reviews, etc., each labeled according to the presence of an ADE. A message with a positive label (1) contains at least one mention of an ADE, while a message with a negative label (0) does not.


Task

This is a binary classification task. Given a user-generated post, participants must develop a system that predicts whether the post contains a mention of an Adverse Drug Event (ADE). The system should output either 1 (positive, ADE mentioned) or 0 (negative, no ADE mentioned).


Data

The dataset consists of user-generated social media messages, in which mentions of medications and medical symptoms can be highly variable and sometimes ambiguous. Moreover, the dataset is imbalanced, with the positive class (posts mentioning an ADE) representing only a small fraction of the data.

Participants will receive the following data:

  • Training / Validation:
      • German: 1,482 + 634 (train + dev) documents, collected from a German patient forum. The data are based on the KEEPHA corpus.
      • French: 977 + 419 documents, collected from a German patient forum and translated into French. The French training data is distinct from the German training data, but is likewise drawn from the KEEPHA corpus.
      • Russian: 10,754 + 2,670 documents, collected from a site with user reviews about drugs in Russian. These sets are based on the RuDReC dataset.
      • English: 17,974 + 902 documents, collected from X.
      • Mandarin: 2,248 + 379 documents, collected from a Mandarin health forum.
      • Japanese: 14,208 + 3,045 documents, collected from X. This corpus is based on the data studied in this publication.
  • Test:
      • German: 1,105 documents from a German patient forum (KEEPHA corpus).
      • French: 1,104 documents from a German patient forum, translated into French. The French test data is the translated version of the German test data (KEEPHA corpus).
      • Russian: 9,293 documents collected from a site with user reviews about drugs in Russian.
      • English: 11,712 documents collected from X.
      • Mandarin: 1,144 documents, collected from a Mandarin health forum ("120ask.com").
      • Japanese: 3,045 documents, collected from X.


Data Format

The data are provided in CSV format. Each document includes an ID, the label, the original file name, the origin of the data, the language, the split identifier and the type of the document, as shown below:

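For instance, a single row might look like the following (the column names and values here are assumptions for illustration, based on the description above; the released files define the authoritative schema):

```
id,label,file_name,origin,language,split,type
10342,1,post_00017.txt,patient_forum,de,train,forum_post
```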

Free registration is required to access the training and evaluation data.

Evaluation

Submissions will be ranked based on the unweighted macro F1-score, Precision, and Recall across all languages. Our evaluation script and a baseline model will be published online.
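As a minimal sketch of how such an evaluation is typically computed (assuming scikit-learn; the official evaluation script may differ):

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold labels and system predictions (1 = ADE mentioned, 0 = no ADE)
y_true = [1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0]

# Unweighted macro-averaging gives both classes equal weight,
# which matters for this heavily imbalanced dataset.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(f"Macro P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```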

Task Organizers


Contact: Lisa Raithel (raithel@tu-berlin.de)

Google group: https://groups.google.com/g/smm4h-heard-2026-task-1

CodaLab: TBA

Task 2 - Detection of Insomnia in Clinical Notes

Insomnia is a prevalent sleep disorder that severely impacts sleep quality and overall health. It is associated with a range of serious outcomes, including psychiatric conditions, increased work absenteeism, and a heightened risk of accidents. Despite its prevalence and clinical relevance, insomnia remains largely underdiagnosed. There is thus a pressing need to develop effective methods for identifying patients at risk using real-world clinical text.

This shared task aims to support the development of NLP systems that can automatically detect insomnia using clinical notes from the MIMIC-III dataset. The task is framed as a two-subtask classification challenge, designed to evaluate both predictive accuracy and clinical reasoning capabilities in models.

We have developed a comprehensive set of rules (Insomnia Rules) to guide the identification of patients potentially suffering from insomnia. These rules incorporate both direct and indirect symptoms of insomnia and include information about commonly prescribed hypnotic medications. For this task, we have curated an annotated corpus of clinical notes from the MIMIC-III database, adhering to the Insomnia Rules during the annotation process. Each note is annotated with a binary label indicating the patient’s overall insomnia status ("yes" or "no") and with individual rule-level labels based on the content of the note. Additionally, to enhance the explainability of participating NLP systems, we provide textual evidence from the clinical notes that supports the classification of each individual rule component. This ensures that system predictions can be interpreted and effectively justified.

Participants are encouraged to use large language models (LLMs) to tackle the Insomnia detection task. This shared task serves as an exceptional benchmark to assess the reasoning capabilities of LLMs in medicine, applying a realistic set of diagnostic guidelines to real-world clinical data.

Task

This text classification shared task is divided into two distinct subtasks:

  • Subtask 1: Binary text classification. Predict whether the clinical note indicates that the patient is likely to have insomnia ("yes" or "no"). Evaluation is based on F1-score, treating "yes" as the positive class.
  • Subtask 2: Multi-label text classification + Explanation. For each clinical note, participants must predict the presence or absence of five insomnia-related criteria: Definition 1, Definition 2, Rule A, Rule B, and Rule C. In addition, for Definition 1, Definition 2, Rule B, and Rule C, participants must generate a free-text explanation grounded in the content of the note that justifies each assigned label. These explanations are intended to reflect the clinical reasoning supporting each classification decision. Label predictions will be evaluated using micro-averaged F1. The quality of explanations will be assessed using ROUGE and BERTScore to measure alignment with expert-annotated reference rationales, as well as via an LLM-as-a-judge framework guided by scoring rubrics.
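For the explanation metrics, a minimal sketch of how ROUGE and BERTScore can be computed (assuming the rouge-score and bert-score packages; the official evaluation configuration may differ, and both texts below are hypothetical):

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The note documents difficulty falling asleep and nightly zolpidem use."   # hypothetical expert rationale
candidate = "Patient reports trouble falling asleep; zolpidem is prescribed nightly."  # hypothetical system explanation

# ROUGE-L measures longest-common-subsequence overlap with the reference rationale.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BERTScore measures semantic similarity via contextual embeddings.
P, R, F1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-L F1: {rouge_l:.3f}, BERTScore F1: {F1.item():.3f}")
```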

Corpus

This shared task utilizes a corpus of clinical notes derived from the MIMIC-III Database. Participants must complete the necessary training and sign a data usage agreement to access the MIMIC-III Clinical Database (v1.4). Once access is granted, they must run the text_mimic_notes.py script to retrieve the clinical notes and associated patient information using the provided note IDs, as detailed in the README file.


Annotations and Submission Format

For each subtask, ground truth annotations are provided in JSON format. Participants are required to submit their system outputs following the same format as the ground truth annotations provided by the organizers. To illustrate the submission format, a sample of the training set annotations is available in data/training. Additionally, the complete set of Insomnia Rules used for annotating the corpus can be found in the resources/Insomnia_Rules.md file.

Below, we provide two samples of notes annotated with Subtask 1 information:

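As a minimal, hypothetical stand-in (the field names and values are assumptions; the files in data/training define the authoritative format):

```json
[
  {"note_id": "12345", "insomnia": "yes"},
  {"note_id": "67890", "insomnia": "no"}
]
```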

Task Organizer: Guillermo Lopez-Garcia, Cedars-Sinai Medical Center, USA (Guillermo.LopezGarcia@cshs.org)

Baseline system: https://doi.org/10.1101/2025.06.02.25328701

GitHub repository: https://github.com/guilopgar/SMM4H-HeaRD-2026-Task-2-Insomnia

Google group: https://groups.google.com/g/smm4h-heard-2026-task-2

CodaLab: TBA

Task 3 - Estimating Flu Vaccine Effectiveness from Social Media Posts

Influenza vaccine effectiveness (VE) estimation plays a critical role in public health decision-making by quantifying the real-world impact of vaccination campaigns and guiding policy adjustments. Current approaches to VE estimation are constrained by limited population representation, selection bias, and delayed reporting. To mimic the "test-negative design" used by the CDC, we first identify users who have publicly posted reports on X (formerly known as Twitter) of having been tested for flu, as well as their test results. Then, from all their publicly available posts, we extract whether they have been vaccinated or not. We place each user into one of four possible flu-test-and-vaccine categories, essential for calculating the odds ratio (OR) as defined by the equation below:

OR = (Vac-Pos / Vac-Neg) ÷ (Unvac-Pos / Unvac-Neg)

  • Vac-Pos: Vaccinated and flu-positive
  • Vac-Neg: Vaccinated and flu-negative
  • Unvac-Pos: Unvaccinated and flu-positive
  • Unvac-Neg: Unvaccinated and flu-negative
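A minimal sketch of this computation from the four user counts (the counts here are hypothetical; converting OR to a VE estimate as (1 - OR) × 100% is the standard test-negative-design formula, though the task's exact analysis may differ):

```python
def odds_ratio(vac_pos: int, vac_neg: int, unvac_pos: int, unvac_neg: int) -> float:
    """Odds ratio under the test-negative design."""
    return (vac_pos / vac_neg) / (unvac_pos / unvac_neg)

# Hypothetical numbers of users in each flu-test-and-vaccine category
or_estimate = odds_ratio(vac_pos=120, vac_neg=480, unvac_pos=300, unvac_neg=600)
ve_percent = (1 - or_estimate) * 100  # estimated vaccine effectiveness
print(f"OR = {or_estimate:.2f}, estimated VE = {ve_percent:.0f}%")
```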

Task

This shared task is divided into two subtasks:

  • Subtask 1: Flu Vaccination Status Classification. Categorizes tweets based on the user's self-reported vaccination status:
      • Currently-Vaccinated: The user explicitly states receiving a flu shot in the current flu season.
      • Currently-Unvaccinated: The user explicitly states that they have not received, or do not plan to receive, a flu shot. Statements reflecting past behaviors are included if they imply a continuous pattern and allow inference about the user's current vaccination status.
      • Previously-Vaccinated: The user mentions receiving a flu shot during a previous flu season without indicating any recent vaccination updates.
      • Possibly-Vaccinated: The user indicates intentions or considerations to receive a flu shot but provides no clear evidence of actual vaccination.
      • Other: The tweet references flu vaccination without specifying the user's personal vaccination status, including mentions of other individuals' vaccination statuses (such as their child), advocacy for flu shots, or general vaccine-related information.
  • Subtask 2: Flu Test Result Classification. Categorizes tweets based on the user's self-reported flu test outcome:
      • Currently-Positive: The user explicitly reports a recent positive flu test result or diagnosis.
      • Currently-Negative: The user explicitly reports a recent negative flu test result or a diagnosis excluding the flu.
      • Previously-Positive: The user mentions testing positive for flu or receiving a flu diagnosis in a previous season, with no indication of their current flu test status.
      • Previously-Negative: The user mentions testing negative or not being diagnosed with flu in a previous season, without referencing their current flu test status.
      • Other: The tweet references flu testing without specifying the user's personal test status, including references to another individual's test results, encouragement for others to get tested, general discussions about flu tests, or ambiguous statements where the user's test status remains unclear.

Corpus

This shared task utilizes a corpus of tweets collected from X (formerly Twitter). Participants must register for the shared task and sign a data usage agreement to access the corpus. Once access is granted, participants will be added to the shared task Google Group and provided with a download link to the dataset.

Annotations and Submission Format

For each subtask, ground truth annotations are provided in CSV format. Participants are required to submit their system outputs using the same format as the ground truth files supplied by the organizers. To illustrate the expected submission structure, a sample submission file will be released along with the training and validation sets.

Additionally, the annotation guidelines for determining flu vaccination status and flu test result categories from tweets will be included in a separate reference file. Below, we provide two samples of tweets annotated with Subtask 1 and Subtask 2 information:

Subtask 1: Flu Vaccination Status Classification

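An illustrative example (hypothetical, not drawn from the dataset):

"Finally got my flu shot at the pharmacy this afternoon." → Currently-Vaccinated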

Subtask 2: Flu Test Result Classification

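An illustrative example (hypothetical, not drawn from the dataset):

"Felt terrible all week, but the flu test at urgent care came back negative." → Currently-Negative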

Task Organizer: Dongfang Xu, Cedars-Sinai Medical Center, USA (Dongfang.Xu@cshs.org)

Baseline system: https://www.medrxiv.org/content/10.1101/2025.03.26.25324701v1

Google group: https://groups.google.com/g/smm4h-heard-2026-task-3

CodaLab: TBA

Task 4 - Generation of Realistic Structured Medical Notes from Dialogues

Physicians spend substantial time documenting clinical encounters, contributing to burnout and limiting direct patient care. Automated systems that can reliably map doctor–patient dialogues to structured notes are critical for scalable clinical documentation support. Progress in this area has been hindered by the lack of large, diverse, and privacy-preserving datasets.

MedSynth directly addresses this gap with a fully synthetic corpus of doctor–patient dialogues paired with SOAP-format clinical notes, spanning a broad range of diagnoses and clinical presentations. This shared task invites participants to build and evaluate models on Dialogue-to-Note (Dial2Note), using MedSynth as a benchmark resource.

Task Overview: Dialogue-to-Note Generation (Dial2Note)

Given a synthetic doctor–patient dialogue, systems must generate the corresponding clinical note in SOAP format (Subjective, Objective, Assessment, Plan).

Corpus

The shared task uses the MedSynth dataset:

  • Over 10,000 synthetic dialogue–note pairs.
  • Coverage of more than 2,000 unique ICD-10 codes.
  • Notes provided in standardized SOAP structure.
  • Dialogues reflecting realistic primary-care style encounters.
  • Generated via a controlled pipeline to ensure privacy (no real patient data).

The MedSynth dataset, code, and documentation are released by the authors and can be found in the paper. Participants must adhere to the dataset license and usage guidelines.

Data Splits

  • Training set: 85 percent of MedSynth dialogue–note pairs.
  • Development set: The remaining 15 percent held-out pairs for validation, hyperparameter tuning, and ablation.
  • Test set: Newly generated dialogue–note pairs held out for final evaluation; only dialogues released; reference notes kept hidden.

Participants must submit predictions for the test inputs; gold-standard outputs remain hidden and are used by organizers for official scoring.

Evaluation

Systems are first ranked using the automatic metrics listed below. The top five teams are then re-evaluated using an LLM-as-a-judge setup to refine rankings. Finally, the top three teams from this stage are assessed by medical experts to determine the official final rankings.

Dial2Note

  • Primary metrics: BLEU, ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and Medcon. An average of these metrics will be used for ranking.
  • Aspects taken into account for LLM-as-a-judge and human evaluation:
      • Hallucination
      • Critical Omissions
      • Professional Tone
      • Logical Structure
      • Adherence to the SOAP format
      • Section Relevance

The details of the LLM-as-a-judge and human evaluation might change. We will provide updates on the task page.
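As a rough sketch of how the n-gram metrics can be computed and averaged (assuming the sacrebleu and rouge-score packages; the notes are hypothetical, and METEOR and Medcon, as well as the official weighting, are not reproduced here):

```python
import sacrebleu
from rouge_score import rouge_scorer

reference = "S: Patient reports three days of sore throat. O: Temp 38.1 C. A: Likely viral pharyngitis. P: Supportive care."  # hypothetical gold note
prediction = "S: Sore throat for three days. O: Temperature 38.1 C. A: Viral pharyngitis suspected. P: Rest and fluids."      # hypothetical system note

# Corpus BLEU (sacrebleu expects a list of hypotheses and a list of reference streams)
bleu = sacrebleu.corpus_bleu([prediction], [[reference]]).score / 100.0

# ROUGE-1/2/L F-measures against the reference note
scores = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True).score(reference, prediction)
rouges = [scores[k].fmeasure for k in ("rouge1", "rouge2", "rougeL")]

# Simple unweighted average of the metrics computed here
print(sum([bleu] + rouges) / (1 + len(rouges)))
```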

Input and Submission Format

All data will be distributed in CSV format. Each instance includes:

  • id: Unique identifier.
  • dialogue: list or text of doctor–patient turns.
  • note: SOAP-structured clinical note.

Participants must submit JSON files with the following structure:

  • Dial2Note: For each id, provide note_pred containing the generated SOAP note.
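For illustration, a submission entry might look like the following (the exact layout is an assumption based on the description above; the official sample files are authoritative):

```json
{"id": "dialogue_0001", "note_pred": "S: ...\nO: ...\nA: ...\nP: ..."}
```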

Baseline Systems

Organizers will provide baseline implementations on MedSynth for the Dial2Note task. Participants are encouraged to compare against these baselines.

Ethical and Practical Considerations

  • All data are synthetic and contain no real patient information.
  • Systems developed in this task are for research purposes only and must not be used directly in clinical care without rigorous validation.


Contact and Resources

Contact: Ahmad Rezaie Mianroodi, Dalhousie University and Vector Institute, Canada (ahmad.rm@dal.ca)

MedSynth paper: MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs

Dataset & code: GitHub and HuggingFace

Google group: https://groups.google.com/g/smm4h-heard-2026-task-4

CodaLab: TBA

Task 5 - Detection of Patient Metadata in SARS-CoV-2 Sequencing Articles

Studies have highlighted that patient metadata, which are crucial for understanding the transmission patterns of viruses, are often missing from SARS-CoV-2 genome sequences in online databases such as GISAID and GenBank, limiting their utility for genomic epidemiology. The COVID-19 pandemic brought this two-fold problem to attention: (1) data exist in published articles but are disconnected from digital resources, and (2) manually extracting data from the unstructured text of thousands of articles would be too inefficient for timely public health responses to virus outbreaks.

This binary classification task involves automatically distinguishing, among PubMed articles that report generating SARS-CoV-2 sequences, sentences that report patient metadata associated with the sequences (annotated as "1") from sentences that do not (annotated as "0"). Patient metadata include age, sex, race/ethnicity, symptoms, disease severity, viral load, duration of infection, lab results, vital signs, treatments, hospitalization, outcomes, comorbidities, risk factors, vaccination status, place of residence, geographic location of sample collection, and travel history. In particular, reports of patient metadata must be distinguished from sentences in which patient metadata were not clearly associated with sequences, were associated with sequences in previous studies, or were merely reported as being collected.

The training, validation, and test sets contain 15,504 (70%), 2,214 (10%), and 4,429 (20%) sentences, respectively: 2,944 (13.3%) sentences report patient metadata and 19,203 (86.7%) do not. The evaluation metric is the F1-score for the class that reports patient metadata. Benchmark classifiers achieved F1-scores of 0.776 by fine-tuning the BiomedBERT-Large-Abstract pre-trained model and 0.558 by prompting the Llama-3-70B large language model (LLM). Participants are encouraged to further experiment with LLM prompting to address this gap in performance.

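To make the distinction concrete, here are two illustrative examples (hypothetical, not drawn from the task data):

  • "Whole-genome sequencing was performed on nasopharyngeal samples from 12 hospitalized patients, 8 of whom required mechanical ventilation." → 1
  • "Reference SARS-CoV-2 genome sequences were downloaded from GISAID." → 0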

Submission Details:

TBA

Contact: Ari Klein, University of Pennsylvania, USA (ariklein@pennmedicine.upenn.edu)

Baseline system: https://doi.org/10.1101/2025.04.25.25326298

Google group: https://groups.google.com/g/smm4h-heard-2026-task-5

CodaLab: TBA

Task 6 - Predicting TNM Staging from TCGA Pathology Reports

The TNM staging system is a globally recognized standard for describing the extent of cancer spread. It is defined by three components: T (size and extent of the primary tumor), N (spread to nearby lymph nodes), and M (presence of distant metastasis). Accurate TNM staging plays a crucial role in determining prognosis and guiding treatment decisions.

Pathology reports from The Cancer Genome Atlas (TCGA) contain rich, unstructured textual descriptions of tumor characteristics. Automatically extracting and predicting TNM stage components from these reports can support large-scale cancer research and help standardize clinical data.

Task

This is a multi-label text classification shared task. Given a pathology report from TCGA, participants must develop a system that predicts the TNM stage of the patient’s cancer. The system should output three independent labels:

  • T: Primary tumor stage (T1, T2, T3, T4)
  • N: Lymph node involvement (N0, N1, N2, N3)
  • M: Metastasis status (M0, M1)

Each label should be predicted independently, and the final output should be a combination of the three (e.g., T2 N1 M0). Evaluation will be conducted using label-level F1 scores, with additional metrics for overall TNM prediction accuracy.

Secondary task: Explainability

As an independent secondary task, we will evaluate models' ability to generate an explanation for the selected answer. Important aspects include citing supporting evidence from the text and demonstrating correct reasoning behind the chosen labels. While this task is optional, we highly encourage participants to include it in their work, either as part of their main model or as a separate one.

Corpus

The dataset consists of de-identified free-text pathology reports from The Cancer Genome Atlas (TCGA). These reports vary in length, structure, and terminology, and may include domain-specific abbreviations, measurement values, and narrative descriptions. The available data were extracted from PDFs by Kefeli et al. (2023).

The dataset is moderately imbalanced, with certain TNM categories (e.g., M1) appearing much less frequently than others (e.g., M0). Reports may contain irrelevant sections or ambiguous wording, requiring robust natural language processing techniques to ensure accurate predictions.

The corpus is split into training, validation, and test sets, with TNM labels derived from the structured clinical metadata available in TCGA.

Annotations and Submission Format

Ground truth annotations are provided in CSV format, with each entry containing the text of the pathology report and the corresponding TNM labels. Participants must submit their system outputs following the same CSV structure as the provided annotations.

The training data is available at github.com/tatonetti-lab/tcga-path-reports, and labels can be found at github.com/tatonetti-lab/tnm-stage-classifier/tree/main/TCGA_Metadata. Detailed TNM staging definitions and guidelines for annotation can be found at the National Cancer Institute's cancer.gov.

Evaluation

Systems will be evaluated based on:

  • F1 score for each label (T, N, and M), treating each category as the positive class in a one-vs-rest setting.
  • AUROC (Area Under the Receiver Operating Characteristic Curve) calculated in a one-vs-rest manner for each label.
  • Exact-match accuracy for complete TNM stage predictions.
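A minimal sketch of the label-level F1 and exact-match scoring (assuming scikit-learn and hypothetical predictions; AUROC requires predicted probabilities and is omitted here, and the official script may differ):

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted stages for three reports
gold = [("T2", "N1", "M0"), ("T3", "N0", "M0"), ("T1", "N0", "M1")]
pred = [("T2", "N0", "M0"), ("T3", "N0", "M0"), ("T1", "N1", "M1")]

# Per-label macro F1, treating each category as positive in a one-vs-rest setting
for i, name in enumerate(("T", "N", "M")):
    y_true = [g[i] for g in gold]
    y_pred = [p[i] for p in pred]
    print(name, f1_score(y_true, y_pred, average="macro"))

# Exact-match accuracy over the complete T N M prediction
exact = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print("Exact match:", exact)
```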

We will employ LLM-as-a-judge to evaluate the secondary task. We will provide more detailed information in the coming months.

This shared task provides a realistic benchmark for assessing the ability of NLP systems to extract structured cancer staging information from complex, real-world pathology reports.

Data Samples

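An illustrative example of the expected input-output pairing (hypothetical, not an actual TCGA report):

  • Report excerpt: "Invasive ductal carcinoma, 2.3 cm in greatest dimension; two of twelve axillary lymph nodes positive for metastatic carcinoma; no evidence of distant metastasis." → T2 N1 M0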

Contact: Jacob Berkowitz, Cedars-Sinai Medical Center, USA (Jacob.Berkowitz2@cshs.org) & Jose Miguel Acitores, Cedars-Sinai Medical Center, USA (Jose.AcitoresCortina@cshs.org)

Baseline system: Kefeli, J., Berkowitz, J., Acitores Cortina, J.M. et al. Generalizable and automated classification of TNM stage from pathology reports with external validation. Nat Commun 15, 8916 (2024). https://doi.org/10.1038/s41467-024-53190-9

Google group: https://groups.google.com/g/smm4h-heard-2026-task-6

CodaLab: TBA

Task 7 - Extraction of Social and Clinical Impacts of Substance Use from Social Media Posts

Task

Nonmedical opioid use has underreported clinical and social consequences. This task focuses on detecting span boundaries for two categories of self-reported impacts in first-person social media narratives: ClinicalImpacts (e.g., withdrawal, depression) and SocialImpacts (e.g., job loss, legal issues). The objective is to accurately identify and extract these spans to understand how individuals describe the personal effects of opioid use in their own words. Participants are expected to evaluate their models using both strict and relaxed F1-scores for span detection, as sketched below. Among benchmark pretrained models, DeBERTa (large) achieved the highest relaxed F1-score of 0.61. Participants are encouraged to explore better modeling strategies to enhance span detection performance.
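To make the two evaluation modes concrete, here is a minimal sketch of strict versus relaxed span matching (assuming relaxed matching means any character overlap with a gold span of the same type; the task's official definition may differ):

```python
def strict_match(pred, gold):
    """Exact boundary and type agreement: (start, end, type) must be identical."""
    return pred == gold

def relaxed_match(pred, gold):
    """Same type and any character overlap between the two spans."""
    (ps, pe, pt), (gs, ge, gt) = pred, gold
    return pt == gt and ps < ge and gs < pe

# Hypothetical spans given as (start, end, label) character offsets
gold = (30, 39, "ClinicalImpacts")
pred = (28, 39, "ClinicalImpacts")

print(strict_match(pred, gold))   # False: boundaries differ
print(relaxed_match(pred, gold))  # True: overlapping span, same label
```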

Example

I am a recovering addict and I overdosed at 19 (I’m 25 now) and I was charged with disorderly conduct (only thing on my record).

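Plausible span annotations for this example (illustrative only; the released annotations are authoritative):

  • "overdosed" → ClinicalImpacts
  • "charged with disorderly conduct" → SocialImpacts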

Dataset


Submission Format

Please submit predictions using the same format as the data example above.

Contact: Abeed Sarker, Emory University, USA (abeed.sarker@emory.edu) & Sumon Kanti Dey, Emory University, USA (sumon.kanti.dey@emory.edu)

Baseline system: https://arxiv.org/pdf/2508.19467

Google group: TBA

CodaLab: TBA

Task 8 - Multilingual Clinical Entity Annotation Projection and Extraction

The MultiClinAI (Multilingual Clinical Entity Annotation Projection and Extraction) shared task challenges participants to build systems that automatically produce multilingual versions of Gold Standard corpora, projecting annotations from a seed language (in our case, Spanish) to six target languages (Czech, Dutch, English, Italian, Romanian, and Swedish). Three clinical concept types will be used: diseases, symptoms, and procedures. Participants are then challenged to create systems for clinical concept extraction in the seven languages of the task. Participants are free to build their systems in any way they want (e.g., monolingual or multilingual models, word alignment or generative models, single-label or multi-label, …), and the use of creative solutions is encouraged. Each language will be evaluated independently using standard precision, recall, and F1 scores.


Task

MultiClinAI is divided into two different sub-tasks:

  • Sub-task 1: MultiClinNER (Multilingual Comparable Clinical Entity Recognition). This subtask focuses on the implementation and comparative evaluation of multilingual clinical entity recognition systems for seven different languages — English, Spanish, Dutch, Italian, Romanian, Swedish, and Czech — by comparing automatically extracted entity mentions against those manually validated by domain experts. Given a collection of clinical case reports in these seven languages, participating teams will be required to return, for each report, the corresponding character offsets of entity mentions for three different entity types: diseases, symptoms & signs and clinical procedures. Both native and automatically translated clinical case reports will be included in the test set collections. As training data, manually validated entity mentions for each language and entity type will be provided.
  • Sub-task 2: MultiClinCorpus (Multilingual Comparable Clinical Corpus Generation). This subtask will cover the automatic generation of comparable multilingual corpora. Given a collection of plain text documents and manually annotated entity mentions for a seed language (Spanish), together with the corresponding translated versions of the texts in different target languages (English, Italian, Dutch, Swedish, Romanian and Czech), participating teams have to return the exact character offsets of all corresponding equivalent entity mentions in the target languages. As training data, manually mapped corresponding entity mentions in target languages, revised and validated by experts, will be released.

In both sub-tasks, teams can submit results for any target language. Submitting for all languages is not mandatory.
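For illustration, predictions for either sub-task might be returned as document-level character offsets in a tab-separated layout like the one below (this layout is an assumption for illustration; the Task Website defines the official submission format):

```
document_id	start	end	type
report_001	112	135	DISEASE
report_001	201	214	PROCEDURE
```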

Data

Three different corpora will be used for the task: SpaCCC, CardioCCC and OnaCCC. Each corpus was annotated in a different context, but with similar methodologies, with the same four labels: diseases, symptoms, procedures and medications. Medication annotations will be released together with the other entities, but they will not be evaluated as part of MultiClinAI.

For more details about the datasets, their creation process and screenshots, please refer to the Task Website.

Task Website: https://temu.bsc.es/MultiClinAI

Contact: Salvador Lima-López, Barcelona Supercomputing Center, Spain (salvador.limalopez@gmail.com)

Google group: TBA

CodaLab: TBA

Workshop Program 


Past events 

SMM4H-HeaRD 2025 (proceedings)

SMM4H 2024 (proceedings)

SMM4H 2023 (overview paper)

SMM4H 2022 (proceedings)

SMM4H 2021 (proceedings)

SMM4H 2020 (proceedings)

SMM4H 2019 (proceedings)

SMM4H 2018 (proceedings)

SMM4H 2017 (proceedings)

SMM4H 2016 (proceedings)