Davy Weissenbacher, Abeed Sarker, Ari Klein, Karen O’Connor, Arjun Magge Ranganatha, Graciela Gonzalez-Hernandez
Department of Biostatistics, Epidemiology and Informatics
Perelman School of Medicine
University of Pennsylvania
Philadelphia, PA, USA
Objective: Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step in incorporating Twitter data into pharmacoepidemiological research is to automatically recognize medication mentions in tweets. Because lexical searches for medication names suffer from low recall due to misspellings or ambiguity with common words, we propose a more advanced method to recognize them.
Methods: We present Kusuri, an ensemble learning classifier that identifies tweets mentioning drug products and dietary supplements. Kusuri (“medication” in Japanese) is composed of two modules. First, four different classifiers (lexicon-based, spelling-variant-based, pattern-based, and a weakly trained neural network) are applied in parallel to discover tweets potentially containing medication names. Second, an ensemble of deep neural networks, encoding morphological, semantic, and long-range dependencies of important words in the tweets, makes the final decision.
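The two-module design described above can be illustrated with a minimal sketch. This is not the Kusuri implementation: the lexicon entries, misspellings, regular expression, suffix heuristic (standing in for the weakly trained neural network), and the two stand-in scorers (standing in for the deep neural networks) are all hypothetical. Combining the second-stage scores by averaging with a 0.5 threshold is also an assumption, since the abstract does not say how the ensemble votes.

```python
import re
from typing import Callable, List, Sequence

Detector = Callable[[str], bool]   # stage 1: cheap, recall-oriented filter
Scorer = Callable[[str], float]    # stage 2: returns a probability-like score

# Hypothetical toy resources, for illustration only.
LEXICON = {"ibuprofen", "prozac", "melatonin"}
VARIANTS = {"ibuprophen", "prozak"}
PATTERN = re.compile(r"\b(?:took|taking|prescribed)\s+(?:my\s+)?\w+", re.I)


def tokens(tweet: str) -> List[str]:
    """Lowercase the tweet and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", tweet.lower())


def lexicon_detector(tweet: str) -> bool:
    return any(t in LEXICON for t in tokens(tweet))


def variant_detector(tweet: str) -> bool:
    return any(t in VARIANTS for t in tokens(tweet))


def pattern_detector(tweet: str) -> bool:
    return PATTERN.search(tweet) is not None


def weak_model_detector(tweet: str) -> bool:
    # Stand-in for the weakly trained network: a crude suffix heuristic.
    return any(t.endswith(("statin", "cillin")) for t in tokens(tweet))


def is_candidate(tweet: str, detectors: Sequence[Detector]) -> bool:
    """Stage 1: keep the tweet if ANY detector fires (union of filters)."""
    return any(d(tweet) for d in detectors)


def ensemble_decision(tweet: str, scorers: Sequence[Scorer],
                      threshold: float = 0.5) -> bool:
    """Stage 2: average the scores and compare to a threshold (assumed rule)."""
    mean = sum(s(tweet) for s in scorers) / len(scorers)
    return mean >= threshold


def classify(tweet: str, detectors: Sequence[Detector],
             scorers: Sequence[Scorer]) -> bool:
    return is_candidate(tweet, detectors) and ensemble_decision(tweet, scorers)


DETECTORS = [lexicon_detector, variant_detector, pattern_detector,
             weak_model_detector]
# Stand-in scorers; a real system would call trained models here.
SCORERS = [
    lambda t: 0.9 if lexicon_detector(t) or variant_detector(t) else 0.2,
    lambda t: 0.6 if pattern_detector(t) else 0.3,
]

if __name__ == "__main__":
    for tweet in ["I took ibuprophen for my headache",
                  "I took my dog to the park",
                  "Lovely weather today"]:
        print(tweet, "->", classify(tweet, DETECTORS, SCORERS))
```

In this toy setup, the second tweet passes the first stage because the pattern filter fires, and is then rejected by the second stage. That is the intended division of labor: the first module favors recall, and the second restores precision.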
Results: On a balanced (50-50) corpus of 15,005 tweets, Kusuri achieved performance close to that of human annotators, with an F1-score of 93.7%, the best achieved thus far on this corpus. On a corpus of all tweets posted by 112 Twitter users (98,959 tweets, of which only 0.26% mention medications), Kusuri obtained an F1-score of 78.8%. To the best of our knowledge, Kusuri is the first system to achieve this score on such an extremely imbalanced dataset.
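To see why F1-score rather than accuracy is the informative metric on the imbalanced corpus, consider a trivial baseline that labels every tweet as negative. The sketch below uses only the corpus size and positive rate reported above. The positive count is an approximation derived from the rounded 0.26% figure, not a number taken from the paper, and the function name is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


TOTAL_TWEETS = 98_959
# Approximate count of medication tweets, derived from the rounded 0.26% rate.
POSITIVES = round(TOTAL_TWEETS * 0.0026)

# Baseline: predict "no medication" for every tweet.
true_negatives = TOTAL_TWEETS - POSITIVES
baseline_accuracy = true_negatives / TOTAL_TWEETS
_, _, baseline_f1 = precision_recall_f1(tp=0, fp=0, fn=POSITIVES)

if __name__ == "__main__":
    print(f"approx. positive tweets: {POSITIVES}")
    print(f"baseline accuracy:       {baseline_accuracy:.4f}")
    print(f"baseline F1:             {baseline_f1:.4f}")
```

The baseline is correct on more than 99.7% of tweets yet finds no medication mentions at all, so its F1-score is zero. Accuracy alone would therefore say little about a system evaluated on this corpus.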
Keywords: Social Media, Pharmacovigilance, Drug Name Detection, Ensemble Learning, Text Classification
Demonstration (coming soon)
Paper on arXiv