SenSegMed

The 9th BioCreative workshop seeks to attract researchers interested in developing and evaluating automatic methods for extracting medically relevant information from clinical data, and aims to bring together the medical NLP community with healthcare researchers and practitioners. This year, BioCreative IX will be co-located with IJCAI 2025 in Montreal, Canada, on August 16-22, 2025. Our HLP lab organizes Track 2: sentence segmentation of real clinical notes from the MIMIC-III corpus.

To join this shared task, please register for Track 2 (SenSegMed) through the BioCreative IX Shared Task Registration Form. Upon registration, participants will gain access to the full training and validation sets.

Registration Form: Please fill in this form

Submit shared task papers here: chairingtool

Important Dates

Team registration: March-April, 2025

Test set available: May 27, 2025

Testing predictions & Evaluation results: June 1, 2025

Paper submission deadline: June 3, 2025

Notification of acceptance: June 6, 2025

Camera-ready papers due: June 12, 2025

Workshop@IJCAI: TBD; 1.5 days on Aug 16-22, 2025

Task Description

Sentence segmentation is a fundamental linguistic task widely used as a pre-processing step in many NLP applications. While modern LLMs and sparse attention mechanisms in transformer networks have reduced the necessity of sentence-level inputs in some NLP tasks, many models are still designed and tested for shorter sequences. The need for sentence segmentation is particularly pronounced in clinical notes, as most clinical NLP tasks depend on this information for annotation and model training.

Participants will detect sentence boundaries (spans) for MIMIC-III clinical notes, which contain fragmented and incomplete sentences, complex graphemic devices (e.g., abbreviations, acronyms), and markups.

Task

Given a clinical note, participants should develop a system that splits the note into sentences. The system should output the text span of each sentence.
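To make the idea of "text spans" concrete, here is a minimal sketch of a rule-based baseline that returns one (start, end) character-offset pair per detected sentence. The function names, the boundary rule, and the offset convention (0-based, end-exclusive) are assumptions for illustration only; they are not the organizers' method or the official submission format, which is described in the materials provided upon registration.

```python
import re
from typing import List, Tuple

# Hypothetical boundary rule (an assumption, not the task definition):
# whitespace that follows '.', '!' or '?', or any run of line breaks.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n+")


def _append_trimmed(note: str, start: int, end: int,
                    spans: List[Tuple[int, int]]) -> None:
    """Shrink [start, end) past surrounding whitespace; keep it if non-empty."""
    while start < end and note[start].isspace():
        start += 1
    while end > start and note[end - 1].isspace():
        end -= 1
    if start < end:
        spans.append((start, end))


def split_into_spans(note: str) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive character spans, one per sentence."""
    spans: List[Tuple[int, int]] = []
    start = 0
    for match in _BOUNDARY.finditer(note):
        _append_trimmed(note, start, match.start(), spans)
        start = match.end()
    # Whatever follows the last boundary is the final sentence.
    _append_trimmed(note, start, len(note), spans)
    return spans


if __name__ == "__main__":
    example = "Pt seen today. Denies pain.\nPlan: d/c home"
    for s, e in split_into_spans(example):
        print(s, e, repr(example[s:e]))
```

Because the spans index into the original note, the text itself is never modified. A rule this simple will wrongly split on abbreviations such as "Dr." and will miss boundaries that carry no punctuation, which is exactly the kind of difficulty clinical notes pose.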

Data

We collected a subset of clinical notes from the MIMIC-III corpus and manually annotated sentence boundaries without changing the original structure of the notes. MIMIC-III contains de-identified clinical notes from 38,597 distinct patients admitted to the Beth Israel Deaconess Medical Center between 2001 and 2012. It covers 15 note types, including discharge summary, physician note, radiology report, and social work, among others. We randomly sampled 6 notes for each note type for annotation, yielding 90 notes in total. We stratified the notes into training, development, and test sets of 57, 15, and 18 notes, respectively.

The sample annotations are available here. Each annotation file is named using the ROW_ID of the corresponding clinical note in MIMIC-III and includes information on sentence boundaries and types. There are two types of text chunks: Sentence and Unstructured. The distinction between these types is detailed in our annotation guidelines. Registered participants will have access to the full dataset and the annotation guidelines.

Organizers

Graciela Gonzalez-Hernandez, Cedars-Sinai Medical Center, USA

Dongfang Xu, Cedars-Sinai Medical Center, USA
