Anahita Davoudi (1), Ari Z. Klein (1), Abeed Sarker (2), Graciela Gonzalez-Hernandez (1)
(1) Department of Biostatistics, Epidemiology and Informatics
Perelman School of Medicine
University of Pennsylvania
Philadelphia, PA, USA
(2) Department of Biomedical Informatics
Emory University School of Medicine
Atlanta, GA, USA
With the increasing use of social media data for health-related research, the credibility of the information from this source has been questioned, as the posts may originate from automated accounts or “bots”. While automatic bot detection approaches have been proposed, none have been evaluated on users posting health-related information. In this paper, we extend an existing bot detection system and customize it for health-related research. Using a dataset of Twitter users, we first show that the system, which was designed for political bot detection, underperforms when applied to health-related Twitter users. We then incorporate additional features and a statistical machine learning classifier to significantly improve bot detection performance. Our approach obtains an F1-score of 0.7 for the “bot” class, an improvement of 0.339. Our approach is customizable and generalizable for bot detection in other health-related social media cohorts.
Natural language processing; Machine learning; Social media mining; Bot detection; Bots.