Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task

The total number of social media users continues to grow worldwide, resulting in the generation of vast amounts of data. Popular social networking sites such as Facebook, Twitter and Instagram dominate this sphere. By some estimates, 500 million tweets and 4.3 billion Facebook messages are posted every day.1 According to the latest Pew Research Report,1 nearly half of adults worldwide and two-thirds of all American adults (65%) use social networking. The report states that of the total users, 26% have discussed health information, and, of those, 30% changed behavior based on this information and 42% discussed current medical conditions. Advances in automated data processing, machine learning and NLP present the possibility of utilizing this massive data source for biomedical and public health applications, if researchers address the methodological challenges unique to this medium.

This workshop aims to provide a forum for ACL community members to present and discuss NLP advances specific to social media use in the particularly challenging area of health applications, following on the success of a session and accompanying workshop on the topic hosted at the Pacific Symposium on Biocomputing (PSB) in 2016 and at the AMIA Annual Conference in 2017 (to be held Nov 4, 2017). The workshop seeks to attract researchers interested in automatic methods for the collection, extraction, representation, analysis, and validation of social media data for health informatics. It serves as a unique forum to discuss novel approaches to text and data mining methods that are applicable to social media data and may prove invaluable for health monitoring and surveillance.

Although social media text mining research for health applications is still very much in its infancy, the domain has seen a surge in interest in recent years. Numerous studies have been published of late in this realm, including studies on pharmacovigilance,3 identifying user behavioral patterns,4 identifying user social circles with common experiences (like drug abuse),5 monitoring malpractice,6 and tracking infectious/viral disease spread.7,8 Population/public health topics are most commonly addressed, although different social networks may be suitable for specific targeted tasks. For example, while Twitter data has been utilized for surveillance and content analysis, a significant portion of research using Facebook has focused on communication rather than lexical content processing.9,10 For health monitoring and surveillance research from social media, the most common topic has been influenza surveillance.11,12 From the perspective of informatics and NLP, proposed techniques have typically been in the areas of data collection (e.g., keywords and queries),13,14 text classification,15,16 and information extraction.17 While innovative approaches have been proposed, there is still a lot of progress to be made in this domain.

Effective utilization of the health-related knowledge contained in social media will require a joint effort by the research community, bringing together researchers from distinct fields including NLP, machine learning, data science, biomedical informatics, medicine, pharmacology and public health. The knowledge gaps among researchers in these communities need to be reduced by community sharing of data and the development of novel applied systems. Our workshop attempts to achieve precisely these goals and to appeal to researchers who apply NLP methods in those domain areas and who might not typically consider ACL/EMNLP/NAACL. Topics of interest include, but are not limited to:

  • Methods for the automatic detection and extraction of health-related concept mentions in social media
  • Mapping of health-related mentions in social media to standardized vocabularies
  • Deriving health-related trends from social media
  • Information retrieval methods for obtaining relevant social media data
  • Geographic or demographic data inference from social media discourse
  • Virus spread monitoring using social media
  • Mining health-related discussions in social media
  • Drug abuse and alcoholism incidence monitoring through social media
  • Disease incidence studies using social media
  • Sentinel event detection using social media
  • Semantic methods in social media analysis
  • Classifying health-related messages in social media
  • Automatic analysis of social media messages for disease surveillance and patient education
  • Methods for validation of social media-derived hypotheses and datasets

 

Shared Task

Workshop Organizers

Graciela Gonzalez-Hernandez, Ph.D., is an Associate Professor at the Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, and serves as PI of an NLM-funded R01 project for assessing the value of social media postings for pharmacovigilance. Her research interests focus on knowledge discovery, specifically through NLP and knowledge integration. Her lab has produced top-performing systems such as BANNER, for gene and disease entity recognition, and ADRMine, for automated extraction of adverse drug reactions from social media data. She served as a member of the National Library of Medicine Standing Review Committee (BLIRC) from 2009 through 2013. Email: gragon@pennmedicine.upenn.edu

Abeed Sarker, Ph.D. is a Research Associate at the Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania. His research experience and interests are in the areas of NLP, machine learning, data science, and social media mining, particularly their intersection with healthcare and public health monitoring and surveillance. He has worked for a number of renowned organizations in multiple countries, including CSIRO-Australia, Canon Information Systems Research Australia (CISRA), the Centre for Language Technology, Macquarie University, Arizona State University, and the University of Pennsylvania. He co-organized the ALTA 2011, PSB 2016, and AMIA 2017 SMM4H shared tasks. Email: abeed@pennmedicine.upenn.edu

Pierre Zweigenbaum, Ph.D. is the head of the ILES group (Information, Langue Écrite et Signée) at LIMSI-CNRS (Computer Sciences Laboratory for Mechanics and Engineering Sciences, French National Center for Scientific Research). He graduated from École Polytechnique, Paris, France, in 1980, from École Nationale Supérieure des Télécommunications (ENST) in 1982, received an ENST Ph.D. in Computer Science in 1985, and the “Habilitation à diriger des Recherches” in 1998 from Paris-North University. His research interests are in Natural Language Processing, from morphology to question answering, with a focus on specific domains (medicine and medical terminology) and multilingual issues (alignment in parallel and comparable corpora). He has been involved in organizing multiple shared tasks in the past, notably the CLEF eHealth shared tasks, and is Editor of the IMIA Yearbook on Medical Informatics. Email: pz@limsi.fr

Nigel Collier, Ph.D., is PI and Director of Research in Computational Linguistics at the Department of Theoretical and Applied Linguistics at the University of Cambridge. He received his PhD from the University of Manchester in 1996 for his work on neural network modeling for machine translation. He was awarded a Toshiba Fellowship in 1996 and then joined the University of Tokyo as a research associate. After becoming faculty at the National Institute of Informatics (NII) in 2000, Nigel led the BioCaster research programme (2006 to 2012) for multilingual news surveillance and served as technical advisor to the Global Health Security Action Group’s working group on Risk Management and Communication. He was awarded a Marie Curie fellowship at the European Bioinformatics Institute from 2012 to 2014, where he continued his investigation into biomedical text mining for scientific texts. His research interests are in computational linguistics and biomedical knowledge discovery. He is the author of over 100 peer-reviewed articles and conference papers. Nigel currently leads the EPSRC-funded Semantic Interpretation of Personal Health messages (SIPHS) project, which investigates biomedical concept encoding of laymen’s terms in social media for real-world applications such as digital disease surveillance. Email: nhc30@cam.ac.uk

Cecile Paris, Ph.D. is a Science Leader of the Knowledge Discovery and Management group at CSIRO Australia. She received her PhD in Artificial Intelligence (more specifically in Natural Language Processing and User Modelling) in 1987 from Columbia University (New York) and her Bachelor's degree from the University of California, Berkeley. Her research has focused on Natural Language and User Modelling throughout her career. She joined the Information Sciences Institute (ISI), a research laboratory in Los Angeles, CA, where she stayed until 1996, working on knowledge-based systems. She then moved to the UK (ITRI, at the University of Brighton), where she researched multilingual generation systems. She joined CSIRO in late 1996, creating the Natural Language Processing team (now called “Language and Social Computing”, currently led by Dr Stephen Wan). Cecile was elected a Fellow of the Australian Academy of Technology & Engineering (ATSE) in 2016. In recent years, Cecile has done research in social media analytics and social computing. More generally, she is interested both in facilitating communication with computers and in unlocking information. The application domains for her work currently include government services, health, electronic business applications, knowledge management, surveillance, e-Science, and emergency situation assessment. Email: Cecile.Paris@data61.csiro.au

Michael Paul, Ph.D. is an Assistant Professor in the Department of Information Science at the University of Colorado-Boulder, with affiliate positions in the Department of Computer Science and the Institute of Cognitive Science. His research interests include machine learning and natural language processing with applications to social media analysis and public health monitoring. He has contributed to a state-of-the-art influenza surveillance and forecasting system that uses social media messages, with data available at HealthTweets.org. Recently, he was a co-organizer of the Social Media Mining for Public Health Monitoring and Surveillance Session at PSB 2016 and the 1st Joint Workshop on Health Intelligence (W3PHIAI) at AAAI 2017, and was co-organizer of an NSF-funded workshop on ontology induction for behavioral science. Email: mpaul@colorado.edu

Davy Weissenbacher, Ph.D. is a Research Associate at the Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania. His research interests include knowledge representation, logic, machine learning and computational linguistics with a focus on biomedical and clinical text processing. While participating in research projects led by renowned universities and research institutes, including the University of Manchester, the French Institute for Research in Computer Science and Automation, and Arizona State University, he applied his expertise in Information Extraction to help solve problems in domains as varied as Education Sciences, Public Health and Genomics. Email: dweissen@pennmedicine.upenn.edu

Azadeh Nikfarjam, Ph.D. is a Postdoctoral Researcher at the Department of Biomedical Informatics, Stanford University. Her research specializes in health-related information extraction from social media and other sources. She completed her Ph.D. at Arizona State University, under the supervision of Graciela Gonzalez. She was a co-organizer of the PSB shared task and session on social media mining in 2016 and the SMM4H AMIA Workshop in 2017. Email: azadehn@stanford.edu

Masoud Rouhizadeh, Ph.D., is a Postdoctoral Researcher at the Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania. He has experience in various interdisciplinary areas of NLP, collaborating with top clinical scientists in the analysis of social media for exploring pregnancy-related clinical outcomes including medication side-effects. He is also working on detecting signals relevant to mental health, particularly autism, depression, dementia, and empathy. In addition to clinical NLP, he works on problems such as text normalization, verb-predicate clustering, and dependency analysis of pronouns. He was the co-chair of the 4th Pacific Northwest Regional NLP Workshop (NW-NLP 2016) in Seattle, WA, and an Organizing Committee member of Interspeech 2012 and ACL 2011 in Portland, OR. Email: mrou@pennmedicine.upenn.edu

Ari Z. Klein, Ph.D. is a Postdoctoral Researcher at the Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania. Broadly, Ari is interested in NLP for biomedical and clinical applications, and his recent work has involved curating a variety of annotated datasets for supervised classification. Ari presented and published a paper in the proceedings of the BioNLP 2017 workshop, on an annotated corpus that is being used for training, development, and testing in an AMIA 2017 shared task. Ari has also reviewed papers for this shared task, as well as papers for an AMIA 2017 workshop that members of his lab have organized (SMM4H), the general AMIA conference, and EMNLP. Email: ariklein@pennmedicine.upenn.edu

Karen O’Connor, MS is a Staff Scientist in the Health Language Processing Lab at the Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania. Her research experience and interests are in the area of social media mining for public health. She has worked as senior annotator on various projects in this area, including the detection and extraction of adverse drug reactions in social media and health forum posts, and the detection of pregnancy outcomes from users’ timelines in social media. Her expertise is in the development of annotation schemas for corpus development. Email: karoc@pennmedicine.upenn.edu