NIH Sponsored – Research Advancement for Diversity Scholars (RADS) at the UPenn Health Language Processing Center

Parent Grant Overview

In the parent grant, the overall goal is to develop novel natural language processing (NLP) methods to leverage social media (SM) data for specific pharmacovigilance (PV) efforts that are hindered by known drawbacks of traditional pharmacovigilance approaches. We focus on methods to facilitate the use of SM data for exploring (a) factors affecting medication adherence and persistence among the general population (Aim 1), and (b) possible associations between medications taken during pregnancy and pregnancy outcomes (Aim 2). We also address a major roadblock for case-control studies based on this data: finding controls (Aim 3). Below, we list the aims of the parent grant.

Specific Aim 1. Develop and evaluate NLP methods to identify non-adherence and non-persistence and related information from SM data. This includes methods to dynamically collect a cohort of SM users who stopped taking or switched medications, did not fill a prescription, or altered their treatment, and methods to extract information from the user’s timeline (publicly available postings over time) and conversation threads (postings by the user and others in reply to a posting of interest) relevant to (a) an expressed reason for these actions, (b) dosage/duration of treatment, (c) concomitant treatments, and (d) diagnosed health conditions.

Specific Aim 2. Develop and evaluate NLP methods to identify medication use during pregnancy and pregnancy outcomes from SM data. Work under this aim includes the development and evaluation of NLP methods to dynamically collect a cohort of SM users who report a pregnancy, and methods to extract information from the user’s timeline to (a) distinguish when mention of a medication indicates possible intake of it, (b) infer the estimated pregnancy timeframe (beginning and end of pregnancy), and (c) extract or infer pregnancy outcomes from those postings (including at least live birth, fetal death, hemorrhage, miscarriage, low birth weight, pre-term birth, and reported congenital malformations). For Aims 1 and 2, manual annotations will enable comparison of SM findings to systematic reviews and other established sources (FAERS and CDC data) for the medications or conditions highlighted in each aim.

Evaluation hypothesis: the collected data from SM will match data provided directly by consented users and manually annotated postings with at least a moderate level of agreement (Kappa 0.41 – 0.6).
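To illustrate how an agreement threshold of this kind could be checked, the following is a minimal sketch of Cohen's kappa for two sets of labels over the same items. The function names and the toy labels are hypothetical and are not part of the parent grant's software; the 0.41 cutoff is the lower bound of the moderate range quoted above.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both sources.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each source's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both sources used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def is_at_least_moderate(kappa, threshold=0.41):
    """True when kappa reaches the lower bound of the moderate range (assumed cutoff)."""
    return kappa >= threshold


# Toy example (made-up labels): 1 = information present, 0 = absent.
nlp_labels = [1, 1, 0, 0]
annotator_labels = [1, 0, 0, 0]
kappa = cohen_kappa(nlp_labels, annotator_labels)
print(round(kappa, 2), is_at_least_moderate(kappa))
```

In practice an established implementation would likely be preferred over hand-written code; the sketch only makes the calculation explicit.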

Specific Aim 3. Develop and evaluate methods for automatic selection of control groups. Work under this aim addresses the challenge faced when information from SM is to be used for epidemiological studies.

Access to longitudinal SM data (timelines), as proposed in Aims 1 and 2, enables case-control studies as a suitable epidemiological model, since the NLP methods can find persons with a condition (or another outcome variable) of interest within a larger cohort. However, finding a suitable control group among the vast number of other SM users not exhibiting the condition is a challenge. We propose a novel adaptation of biased topic modeling to find control subjects using information in the timelines. Evaluation hypothesis: identified users will be labeled by experts to be suitable control subjects to a moderate level of agreement (Kappa 0.41 – 0.6).


This program supports diversity students for a 2-year Research Experience at the Health Language Processing (HLP) Center at the University of Pennsylvania. Given the current pandemic, we will be running the experience fully remotely the first year, with flexible scheduling as outlined in the sections ahead, and with a mix of in-person and remote components for the second year. Targeted participants are talented diversity high school, undergraduate and PhD students from institutions across the country. In general, students participate in vertical research teams led by a post-doctoral research associate already involved in the parent grant (with no additional funding from the supplement). Teams include 1 PhD student, 1 undergraduate student, and 1-2 high school students. We have budgeted for 3 teams in this supplement. Whenever possible, the diversity supplement participants will work alongside other students funded internally by the University of Pennsylvania or by other sources, to enhance the experience. Each team is assigned a project or task relevant to the aims of the parent grant. High school and undergraduate students contribute primarily as annotators, unless their skillset is such that they can contribute to code development or data analysis. We include a task that is more oriented to data analysis to accommodate those without a programming background. All students participate in enrichment activities (lectures, seminars, discussion groups, and scientific writing sessions) and present their work to peers, mentors, and the PI.

Mentorship towards a career in biomedical research

Dr Gonzalez-Hernandez conceived and ran the pilot of the proposed experience in the Summer of 2019, funded directly from the parent grant with partial support from the Institute of Biomedical Informatics at Penn. The idea behind the experience was to expose students from public universities and local high schools to science at an Ivy League school, giving them meaningful participation in research at one of the top schools of medicine. Through it, we aimed to boost their interest in health science and biomedical informatics, and show them that such an experience was within their reach.

A total of 12 participants (3 High School students, 6 Undergraduates, 2 PhDs and 1 Research Associate) attended. We have included some of their evaluations in the Appendix. After the 8 weeks, 3 of the undergraduate students continued working remotely for the parent grant as annotators, 1 paper was published and presented at the AMIA Summit in March of 2020 (Klein 2020), and another two papers are in preparation. All of the undergraduate students decided to apply to MS programs in the health domain, a few expressly stating that they had decided on this route given the summer experience at the HLP Center. For example, Aiden McRobbie-Johnson stated in December of 2019 (email included in Appendix A):

Overall, the experience is designed to allow students to learn the fundamentals of biomedical informatics, both in theory and practice. The experience includes mentoring and activities geared towards manuscript preparation throughout the Fall 2020, Spring 2021, Fall 2021 and Spring 2022 semesters, with a more intense summer experience in the Summer of 2021. The summer research experience includes enrichment experiences: one weekly lecture of 1 to 1.5 hours with discussion, covering topics relevant to BMI and applicable to the tasks, one-on-one meetings with their mentors, writing exercises, and presentations of their work during seminars.

Lectures include an overview of Biomedical Informatics, fundamentals of Research Methods and of Natural Language Processing, as well as guidance on writing and scientific presentations. All of the reviews received from the 2019 participants were positive about this aspect (see attached emails). Here are some highlights, in their own words:

Logistics of the experience

The driving objective for each year is for each team to complete work leading to a manuscript submission, under the guidance of the PI and the post-doctoral mentor, to AMIA or another suitable venue. There are specific writing objectives each week for all participants, according to their skills.

A total of 3.5 months FTE (600 hours) per year will be covered, for each of the 2 academic years.  

First Year Timeline

Starting the first or second week of September 2020, the experience will run remotely at 10-15 hours/week through the end of the spring semester, with a winter break in December-January to match the academic calendar of the students’ home institutions.

All participants log into a progress meeting every week with their team leader. Every 4 weeks, a general meeting with the PI serves as an opportunity for direct mentorship and polishing presentation skills, as each team presents their monthly progress. Additional meetings are scheduled as needed with individuals or teams. Participants are also invited to attend the monthly HLP Center Seminar.

Second Year Timeline

The second year will kick off with an on-campus experience in June and July of 2021 (8 weeks), where the students get to experience first-hand the ongoing work at Penn, not only at the HLP Center, but at other centers and the Medical School. They work directly with their teams and mentors, have discussion and analysis sessions, and can contribute to the approach and definition of their second year tasks. They get 8 additional lectures on more advanced topics, and participate in the research tasks more actively. After going back to their home institutions, they continue working remotely as in Year 1, at 10-15 hours/week. No additional lectures are planned for the at-home portion of the second year, just the weekly progress meetings and the monthly general meetings. All participants make a presentation remotely at the end of the experience, which wraps up in May of 2022 with submission of the second manuscript.

First Year Tasks:

Task S20.a. Develop and evaluate a pipeline to detect expressions of diagnosed health conditions. We currently have an initial corpus annotated for expressions of diagnosed health conditions. Additional work is needed to develop regular expressions that reduce data imbalance, followed by a classifier. The summer scholars are expected to contribute additional annotations, error analysis, regular expressions, and deployment of classifiers to critically compare advances led by current personnel in the project. Relevant to Aims 1, 2, and 3 in the parent grant.
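As an illustration of how regular expressions can reduce data imbalance before classification, the sketch below keeps only posts that match a first-person diagnosis pattern, so that a larger share of what reaches annotators and the classifier is likely to be relevant. The pattern, function name, and example posts are hypothetical and are not the expressions used in the project; a real pattern set would be developed and evaluated against the annotated corpus.

```python
import re

# Hypothetical pattern: first-person statements such as "I was diagnosed with ..."
# or "I've been recently diagnosed with ...". Both straight and curly apostrophes
# are accepted because social media text contains either.
DIAGNOSIS_PATTERN = re.compile(
    r"\b(?:i\s+(?:was|got|am)|i(?:['’]ve|\s+have)\s+been)\s+"
    r"(?:\w+\s+)?diagnosed\s+with\b",
    re.IGNORECASE,
)


def prefilter_posts(posts):
    """Return only the posts that match the diagnosis pattern.

    This is a high-recall filter applied before classification; it does not
    decide whether a post truly reports a diagnosis.
    """
    return [post for post in posts if DIAGNOSIS_PATTERN.search(post)]


# Made-up example posts.
example_posts = [
    "I was diagnosed with asthma last year",
    "I've been recently diagnosed with diabetes",
    "Asthma awareness week starts today",
    "My sister was diagnosed with lupus",
]

for post in prefilter_posts(example_posts):
    print(post)
```

Posts that pass the filter would still need a classifier (and annotation) to separate true self-reports from other uses of the same wording.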

Task S20.b. Develop and evaluate a classifier for reported speech expressions. In analyzing timelines, we have discovered that a significant portion of the tweets of interest (say, when detecting outcomes of pregnancy, or reports of medication non-adherence) is reported speech (quoting a newspaper article, or something said by another person, or repeating song lyrics). We have an annotated corpus and a preliminary rule-based approach ready. Participants will pick up the project to make it more efficient, training a machine-learning-based classifier and comparing the results to our existing approach. Relevant to Aim 1 and Aim 2 of the parent grant.
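One simple way to compare a rule-based approach with a trained classifier is to score both against the same annotated test set using precision, recall, and F1 for the "reported speech" class. The following is a minimal sketch under that assumption; the function name, label encoding (1 = reported speech), and toy predictions are hypothetical, and the metric choice is an assumption rather than the project's stated evaluation protocol.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Precision, recall, and F1 for one class, given gold and predicted labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == positive and p == positive for g, p in pairs)
    false_pos = sum(g != positive and p == positive for g, p in pairs)
    false_neg = sum(g == positive and p != positive for g, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels for five test tweets; both systems are scored on the same gold set.
gold_labels = [1, 1, 0, 0, 1]
system_outputs = {
    "rule-based": [1, 0, 0, 1, 1],
    "machine learning": [1, 1, 0, 0, 0],
}

for name, predictions in system_outputs.items():
    p, r, f1 = precision_recall_f1(gold_labels, predictions)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```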

Task S20.c. Re-deploy a case-control study focusing on medication intake reported by women who had a child with a congenital malformation. We have advanced our approach to find controls (in general, women whose pregnancy outcome is ‘normal’: a baby born after 37 weeks of gestation and weighing more than 5 pounds 8 oz). We have prior work (Golder 2019) where we used a looser definition of a control (namely, a woman who does not mention having a baby with a birth defect). This task focuses on a new analysis of the birth defects data with the new set of controls (four times the size of the prior one, and with the stricter conditions in place). It requires annotations and data analysis. Relevant to Aim 2 and Aim 3 of the parent grant.
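The stricter control definition above can be written as a simple check once gestational age and birth weight have been extracted from a timeline. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that "born after 37 weeks" means at least 37 weeks of gestation, which is an interpretation rather than something stated in the text.

```python
OUNCES_PER_POUND = 16
MIN_GESTATION_WEEKS = 37  # assumption: "after 37 weeks" read as >= 37 weeks
MIN_WEIGHT_OUNCES = 5 * OUNCES_PER_POUND + 8  # 5 pounds 8 oz expressed in ounces


def is_control_candidate(gestation_weeks, weight_pounds, weight_ounces=0):
    """True when a reported birth meets the 'normal' outcome criteria.

    The weight must be strictly more than 5 pounds 8 oz, as in the definition.
    """
    total_ounces = weight_pounds * OUNCES_PER_POUND + weight_ounces
    return gestation_weeks >= MIN_GESTATION_WEEKS and total_ounces > MIN_WEIGHT_OUNCES


# Made-up examples: (gestation weeks, pounds, ounces).
examples = [(39, 7, 2), (36, 7, 2), (39, 5, 8)]
for weeks, pounds, ounces in examples:
    print(weeks, pounds, ounces, is_control_candidate(weeks, pounds, ounces))
```

A user who satisfies this check would still need to pass the other control-selection steps (for example, expert review), since the check only encodes the two outcome thresholds.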

Second Year Tasks:

Planning of the second year tasks begins internally around April. All tasks for year 1 will be finalized by May of 2021, and will be relevant to the advances in the parent grant. Participants are expected at this point to be better versed in the scientific process, and will be mentored in formulating their own questions and follow-up tasks. Tentatively, we contemplate, for year 2:

Task S21.a. Develop and evaluate a classifier for expressions of non-adherence for specific medication classes (for example, statins, diabetes, HIV PrEP). Relevant to Aim 1.

Task S21.b. Develop and evaluate a classifier for reports of pregnancy loss (miscarriage and stillbirth). Relevant to Aim 2.

Task S21.c. Analysis of data and tuning of topic modeling approaches to automatically classify discussions during different pregnancy trimesters. Relevant to Aim 2.

Diversity Scholar Profiles & Expectations

We aim to select and fund 3 high school, 2 undergraduate and 3 PhD students from diverse institutions and backgrounds who are eligible for funding under a diversity supplement. A GPA of 3.5 and above is required, as well as a statement of interest and a letter of recommendation.  All students will contribute to advancing ongoing work under the parent grant, and participate in manuscript preparation specific to their tasks.

For High Schoolers, the idea is to expose them early to the area of health research, and have them participate and witness the scientific process first hand. High school students are trained as annotators, and that is their main tangible deliverable through the experience, while also participating in team discussions and “brainstorming” sessions with the undergraduates and PhDs as their project progresses. They are asked to write about the datasets that they annotate, and to be on the lookout for anything out of the ordinary in the data. They are guided directly through the process by the undergraduate student in their team, as well as the team mentor, and receive instruction in scientific writing.

The target undergraduate profile is students pursuing a degree in Computer Science, Biomedical Informatics, Health Sciences or Liberal Arts who are considering a career in health-related domains or a medical career. Coursework in statistics and biology is preferred. Undergraduate students are also trained as annotators, and that is their primary deliverable, although they are also asked to write a significant part of the “methods” section and conduct error analysis with guidance. If their skillset includes programming, they will be expected to complete software modules commensurate with their experience. Each undergraduate student serves as annotation supervisor for 1 or 2 high schoolers, whom they mentor and train. Undergraduate students work under the mentorship of a PhD student, and participate more actively in the specific research project assigned to their team.

The target PhD students are those early in their program, pursuing a degree in Computer Science, Biomedical Informatics, or related fields, who want to explore the application of their skills and the focus of their research to the health domain. PhD students will expand or assist in the development of machine learning and language processing methods developed for the specific task to which they are assigned. They are expected to develop leadership skills by managing the day-to-day work of the undergraduate students. They work under the guidance of the team leaders (in-house staff scientists, one per task) and Dr Gonzalez-Hernandez, and have an active role in manuscript preparation. Depending on their demonstrated writing and leadership skills, they could also be encouraged to be main authors of an additional manuscript.


We have already processed applications for year 1. As stated before, and due to the University campus closure for the pandemic, the first year experience will be conducted remotely. The HLP Center team is prepared to do this, and has some experience already both through our own work through the coronavirus pandemic, and having worked remotely with 3 of the participants of the Summer 2019 Experience throughout the year.

Overall Timeline

Below we show an anticipated timeline for this supplement. The total duration is 24 months.

Year 1 (months 1-12)
Task | Aims | Activity | Months
S20.a | 1, 2, 3 | Develop and evaluate a pipeline to detect expressions of diagnosed health conditions | 1-3
S20.a | 1, 2, 3 | Ongoing tasks for manuscript | 4-10
S20.a | 1, 2, 3 | Data & software release | 10-12
S20.b | 1, 2 | Develop and evaluate a classifier for reported speech expressions | 1-3
S20.b | 1, 2 | Ongoing tasks for manuscript | 4-10
S20.b | 1, 2 | Data & software release | 10-12
S20.c | 2, 3 | Re-deploy a case-control study focusing on medication intake reported by women who had a child with a congenital malformation |
S20.c | 2, 3 | Ongoing tasks for manuscript | 4-10
S20.c | 2, 3 | Data & software release | 10-12

Year 2 (months 1-12)
Task | Aims | Activity | Months
S21.a | 1 | Develop and evaluate a classifier for expressions of non-adherence for specific medication classes (e.g., statins, diabetes, HIV PrEP) | 1-3
S21.a | 1 | Post-summer & manuscript | 4-10
S21.a | 1 | Data & software release | 10-12
S21.b | 2 | Develop and evaluate a classifier for reports of pregnancy loss (miscarriage and stillbirth) | 1-3
S21.b | 2 | Post-summer & manuscript | 4-10
S21.b | 2 | Data & software release | 10-12
S21.c | 2 | Analysis of data and tuning of topic modeling approaches to automatically classify discussions during different pregnancy trimesters |
S21.c | 2 | Post-summer & manuscript | 4-10
S21.c | 2 | Data & software release | 10-12


Golder S, Chiuve S, Weissenbacher D, Klein A, O’Connor K, Bland M, Malin M, Bhattacharya M, Scarazzini L, Gonzalez-Hernandez G. Pharmacoepidemiologic evaluation of birth defects from health-related postings in social media during pregnancy. Drug Saf. 2019;42(3):389-400. doi:10.1007/s40264-018-0731-6.

Klein AZ, Gebreyesus A, Gonzalez-Hernandez G. Automatically Identifying Comparator Groups on Twitter for Digital Epidemiology of Pregnancy Outcomes. AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:317-325. PMID: 32477651; PMCID: PMC7233041.

Appendix A.

Participant Reviews & Other Communications, Summer 2019.