To fetch text data from websites that do not provide APIs for data mining, we deploy web scraping bots. Before building a bot for a website, we carefully read its robots.txt file and respect the constraints it describes.
If a website does not explicitly specify a scraping rate limit, our bots will collectively send it fewer than 10 requests per second.
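As an illustration of the two rules above, here is a minimal Python sketch of how a bot might check robots.txt and choose a delay between requests. It is not the lab's actual implementation. The helper names (`load_rules`, `min_interval`, `Throttle`) are hypothetical, the robots.txt content is passed in as lines so the example runs offline, and the fallback of 9 requests per second is one assumed way to stay under the stated limit of 10.

```python
import threading
import time
from urllib.robotparser import RobotFileParser

BOT_NAME = "UPennHLPBot"
DEFAULT_MAX_RPS = 10  # policy: stay below 10 requests per second


def load_rules(robots_lines):
    """Parse robots.txt content given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def min_interval(parser, user_agent=BOT_NAME):
    """Return the minimum number of seconds to wait between requests."""
    delay = parser.crawl_delay(user_agent)
    if delay is not None:
        return float(delay)
    rate = parser.request_rate(user_agent)
    if rate is not None:
        return rate.seconds / rate.requests
    # Assumption: with no site-specified limit, use 9 requests per second,
    # which is strictly below the default ceiling of 10.
    return 1.0 / (DEFAULT_MAX_RPS - 1)


class Throttle:
    """Block callers so that requests are spaced at least `interval` apart."""

    def __init__(self, interval):
        self.interval = interval
        self._next_allowed = 0.0
        self._lock = threading.Lock()  # lets several threads share one limit

    def wait(self):
        with self._lock:
            now = time.monotonic()
            if now < self._next_allowed:
                time.sleep(self._next_allowed - now)
            self._next_allowed = max(time.monotonic(), self._next_allowed) + self.interval


# Example robots.txt content (made up for illustration).
rules = load_rules(["User-agent: *", "Crawl-delay: 2", "Disallow: /private/"])
allowed = rules.can_fetch(BOT_NAME, "https://example.com/private/page")
interval = min_interval(rules)
```

A bot would call `Throttle(interval).wait()` before each request and skip any URL for which `can_fetch` returns `False`. Because the limit applies to all bots collectively, bots running in separate processes would need some shared coordination that this single-process sketch does not cover.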
As a courtesy to website operators, each scraping bot identifies itself by including the following in its header:
- A common identifier “UPennHLPBot”.
- The URL of this page, “https://healthlanguageprocessing.org/hlp-web-scraping-bots/”.
- A project identifier.
For instance, a bot dedicated to the project studying WhatToExpect.com will carry the header “UPennHLPBot (https://healthlanguageprocessing.org/hlp-web-scraping-bots/; project: WhatToExpect.com: A case study of online birth club forums)”.
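The header format above can be sketched in Python as follows. This is only an illustration: the function name `build_bot_header` is hypothetical, and sending the string as the HTTP `User-Agent` header is an assumption, since the text above does not name the header field. The request object is built but never sent.

```python
from urllib.request import Request

BOT_NAME = "UPennHLPBot"
POLICY_URL = "https://healthlanguageprocessing.org/hlp-web-scraping-bots/"


def build_bot_header(project):
    """Combine the common identifier, the policy URL, and the project identifier."""
    return f"{BOT_NAME} ({POLICY_URL}; project: {project})"


header = build_bot_header(
    "WhatToExpect.com: A case study of online birth club forums"
)

# Assumption: the identifier is carried in the User-Agent request header.
# example.com is a placeholder target; nothing is fetched here.
request = Request("https://example.com/", headers={"User-Agent": header})
```

Keeping the common identifier and policy URL as constants means every project's bot differs only in the project identifier, so site operators can recognize all of them by the shared “UPennHLPBot” prefix.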