HLP Web Scraping Bots

To collect text data from websites that do not provide APIs for data mining, we deploy web scraping bots. Before building a bot for a website, we carefully read its robots.txt and respect the constraints described in that file.
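As an illustration of the robots.txt check described above, the sketch below uses Python's standard `urllib.robotparser` module. It is not our production code: the robots.txt content, the `example.com` URLs, and the helper name `build_parser` are assumptions made up for this example so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

BOT_NAME = "UPennHLPBot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether the bot may fetch a given URL before requesting it.
allowed = parser.can_fetch(BOT_NAME, "https://example.com/forum/thread-1")
blocked = parser.can_fetch(BOT_NAME, "https://example.com/private/data")

# Crawl-delay in seconds, or None if the file does not set one.
delay = parser.crawl_delay(BOT_NAME)
```

Against a live site, one would typically call `set_url()` and `read()` on the parser instead of passing the text in directly.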

Scraping Rate

If a website does not explicitly specify a scraping rate limit, our bots will collectively send fewer than 10 requests per second to that website.
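One simple way to stay under such a limit is to enforce a minimum gap between consecutive requests. The sketch below is an assumption about how this could be done, not a description of our deployment: the class name `RateLimiter` and its parameters are hypothetical, and it covers only bots that share one limiter object within a single process.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between requests (hypothetical helper)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        # With max_per_second=9, requests are at least 1/9 s apart,
        # which keeps the sustained rate below 10 requests per second.
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next request may be sent; return the send time."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            # Never report a time earlier than the allowed slot.
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.min_interval
        return now
```

Each bot would call `wait()` on the shared limiter immediately before sending a request. Bots running in separate processes or on separate machines would need some shared state to coordinate, which this sketch does not cover.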

Bot Header

As a courtesy to the websites, each scraping bot identifies itself in its request header with the bot name, a link to this page, and the name of the project it serves.

For instance, a bot dedicated to the project studying WhatToExpect.com will carry the header “UPennHLPBot (https://healthlanguageprocessing.org/hlp-web-scraping-bots/; project: WhatToExpect.com: A case study of online birth club forums)”.
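The header string in the example above can be assembled from the bot name, the page URL, and the project title. The sketch below shows one way to do that; sending the string as the `User-Agent` header is an assumption of this example, and the function names and the `example.com` target URL are hypothetical.

```python
from urllib.request import Request

BOT_NAME = "UPennHLPBot"
INFO_URL = "https://healthlanguageprocessing.org/hlp-web-scraping-bots/"


def build_bot_header(project: str) -> str:
    """Return the identification string for a given project title."""
    return f"{BOT_NAME} ({INFO_URL}; project: {project})"


def build_request(url: str, project: str) -> Request:
    """Create a request carrying the identification string.

    Assumption: the string is sent as the User-Agent header.
    Constructing the Request does not contact the network.
    """
    return Request(url, headers={"User-Agent": build_bot_header(project)})


header = build_bot_header(
    "WhatToExpect.com: A case study of online birth club forums"
)
request = build_request("https://example.com/", 
                        "WhatToExpect.com: A case study of online birth club forums")
```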

Projects Involving Scraping Bots