
The 9th BioCreative workshop seeks to attract researchers interested in developing and evaluating automatic methods for extracting medically relevant information from clinical data, and aims to bring together the medical NLP community with healthcare researchers and practitioners. This year, BioCreative IX will be colocated with IJCAI 2025 in Montreal, Canada, on August 16-22, 2025. Our HLP lab organizes Track 2: sentence segmentation of real clinical notes using MIMIC-III clinical notes.
To join this shared task, please register for Track 2 (SenSegMed) through the BioCreative IX Shared Task Registration Form. Upon registration, participants will gain access to the full training and validation sets.
Registration Form: Please fill in this form
Submit shared task papers here: chairingtool
Team registration: March-April, 2025
Test set available: May 27, 2025
Testing predictions & Evaluation results: June 1, 2025
Paper submission deadline: June 3, 2025
Notification of acceptance: June 6, 2025
Camera-ready papers due: June 12, 2025
Workshop@IJCAI: TBD; 1.5 days on Aug 16-22, 2025
Sentence segmentation is a fundamental linguistic task widely used as a pre-processing step in many NLP applications. While modern LLMs and sparse attention mechanisms in transformer networks have reduced the necessity of sentence-level inputs in some NLP tasks, many models are still designed and tested for shorter sequences. The need for sentence segmentation is particularly pronounced in clinical notes, as most clinical NLP tasks depend on this information for annotation and model training.
Participants will detect sentence boundaries (spans) for MIMIC-III clinical notes, which contain fragmented and incomplete sentences, complex graphemic devices (e.g., abbreviations, acronyms), and markups.
Task
Given a clinical note, participants should develop a system that splits the note into sentences. The system should output the text span of each sentence.
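To illustrate what "outputting text spans" can look like, here is a minimal rule-based sketch in Python. It is not the organizers' baseline or the official output format: the function name `sentence_spans`, the boundary rule (sentence-final punctuation followed by whitespace, or a blank line), and the choice of half-open character offsets are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Assumed boundary rule (illustrative only): whitespace that follows
# ., ! or ?, or a blank line between two blocks of text.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n\s*\n")


def _add_span(note: str, start: int, end: int, spans: List[Tuple[int, int]]) -> None:
    """Trim surrounding whitespace and keep the span if anything is left."""
    while start < end and note[start].isspace():
        start += 1
    while end > start and note[end - 1].isspace():
        end -= 1
    if start < end:
        spans.append((start, end))


def sentence_spans(note: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) character offsets, one per sentence."""
    spans: List[Tuple[int, int]] = []
    start = 0
    for match in _BOUNDARY.finditer(note):
        _add_span(note, start, match.start(), spans)
        start = match.end()
    _add_span(note, start, len(note), spans)  # trailing text without a boundary
    return spans


note = "Pt seen today. No acute distress.\n\nPlan: continue meds"
spans = sentence_spans(note)
sentences = [note[s:e] for s, e in spans]
```

Because the offsets index into the unmodified note, the original structure of the text is preserved. A rule this simple will split incorrectly after abbreviations such as "Dr." and will miss boundaries in fragmented text, which is the kind of difficulty this task targets.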
Data
We collected a subset of clinical notes from the MIMIC-III corpus and manually annotated sentence boundaries without changing the original structure of the notes. MIMIC-III contains de-identified clinical notes from 38,597 distinct patients admitted to the Beth Israel Deaconess Medical Center between 2001 and 2012. It covers 15 note types, including discharge summary, physician note, radiology report, and social work, among others. We randomly sampled 6 notes for each note type for annotation, yielding 90 notes in total. We stratified the notes into training, development, and test sets of 57, 15, and 18 notes, respectively.
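The per-type sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function `sample_per_type`, the `(row_id, note_type)` input pairs, the seed, and the toy data are hypothetical, and the sketch does not attempt to reproduce the 57/15/18 split, whose exact procedure is not specified here.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sample_per_type(
    notes: Iterable[Tuple[int, str]], per_type: int = 6, seed: int = 0
) -> Dict[str, List[int]]:
    """Randomly pick `per_type` note IDs from each note type."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_type: Dict[str, List[int]] = defaultdict(list)
    for row_id, note_type in notes:
        by_type[note_type].append(row_id)
    # Iterate in sorted order so the result does not depend on input order.
    return {t: rng.sample(by_type[t], per_type) for t in sorted(by_type)}


# Toy stand-in for the corpus: 15 made-up note types with 20 notes each.
toy_notes = [(i, f"type_{i % 15}") for i in range(300)]
picked = sample_per_type(toy_notes)
total = sum(len(ids) for ids in picked.values())
```

With 15 note types and 6 notes per type, the sample contains 90 notes, matching the total stated above.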
The sample annotations are available here. Each annotation file is named using the ROW_ID of the corresponding clinical note in MIMIC-III and includes information on sentence boundaries and types. There are two types of text chunks: Sentence and Unstructured. The distinction between these types is detailed in our annotation guidelines. Registered participants will have access to the full dataset and annotation guidelines.
Graciela Gonzalez-Hernandez, Cedars-Sinai Medical Center, USA
Dongfang Xu, Cedars-Sinai Medical Center, USA