Find Part 1 of the blog series here: https://deepsense.ca/2022/02/09/blog-building-your-own-dataset-for-machine-learning/
Watch Amit’s Discovery Session on how to build an image-based dataset: https://www.youtube.com/watch?v=3TT4hySJVvA
Here are a few of the most relevant libraries for web scraping:
- Requests: a Python library for making HTTP requests such as GET and POST. It is much simpler and easier to use than the built-in urllib library. However, it only retrieves the raw response and does not execute JavaScript, so it cannot handle websites whose content is rendered by JavaScript in the browser.
- BeautifulSoup: perhaps the most widely used Python library for web scraping, it parses the retrieved HTML and XML documents. It relies on an underlying parser, and the lxml library is a common choice.
- Selenium: originally designed for automated testing of web applications, it was later adapted for web scraping. The Selenium WebDriver can handle JavaScript-driven websites and perform actions such as clicking on the page, filling in forms and scrolling. Although it is beginner friendly and able to handle dynamic web content, this comes at the price of being slower and less suitable for large-scale projects.
- Scrapy: a web scraping framework intended as a full-fledged solution that does the heavy lifting for you. It provides spiders that crawl websites and extract data, and it supports pipelines for processing the extracted items. Another benefit is that Scrapy is asynchronous, meaning it can have multiple HTTP requests in flight at the same time, which saves time and makes the process more efficient.
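To make the division of labour between the first two libraries concrete, here is a minimal sketch: Requests downloads a static page and BeautifulSoup parses it. The function names `fetch_html` and `extract_links`, the sample HTML and the file names in it are made up for illustration, and the built-in `html.parser` backend is used so that lxml is not required.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a static page with Requests and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_links(html: str) -> list[tuple[str, str]]:
    """Parse HTML with BeautifulSoup and return (link text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


# Offline demonstration on a hand-written snippet (hypothetical content),
# so the parsing step can be tried without touching any real website.
SAMPLE_HTML = """
<html><body>
  <h1>Datasets</h1>
  <a href="/cats.zip">Cat images</a>
  <a href="/dogs.zip">Dog images</a>
  <a>no href here</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
print(links)
```

On a real page the two steps would be chained as `extract_links(fetch_html(url))`, once you have confirmed that the site permits scraping. Pages that build their content with JavaScript would likely return little useful HTML this way, which is where Selenium comes in.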
Legal aspects of web scraping
Web scraping is not inherently illegal (Hunter, 2021; Rocha, 2015). However, not all websites allow users to extract their data with a script. Most big social media sites usually only allow access from browsers such as Mozilla Firefox or Chrome, and block scripts that try to access the site. In Canada, web scraping is used by some government institutions, such as Statistics Canada (2021), but it has also been the subject of publicized legal battles, such as the one between Mongohouse and the Toronto Real Estate Board (TREB) (Lifshitz, 2019). Depending on how and why web scraping is performed, it can fall on either side of the law. In the case of Mongohouse vs TREB, Mongohouse’s entire business was centred around the scraping and unauthorized distribution of TREB MLS information for commercial purposes (Lifshitz, 2019). Hunter (2021) states that “The Federal Court made it clear in its ruling that the unauthorized web scraping of third-party content without explicit consent is illegal”.
Scraping the web legally
Generally, the first step in determining whether a particular site allows scraping is to read its terms and conditions or other relevant documentation. Many websites also have a robots.txt file, which specifies which programs (user agents) are allowed and which parts of the website may or may not be scraped. Some websites also provide a sanctioned alternative to scraping, such as an API. If the website already has an API, then it most likely discourages scraping. Whenever it is unclear whether scraping is allowed, the best course of action is to contact the administrators and ask.
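The robots.txt check can be automated with Python's standard library. The sketch below is illustrative only: the robots.txt content, the user agent names `MyScraper` and `BadBot`, and the example.com URLs are all made up, and the rules are parsed from a string so the example runs without network access. Keep in mind that robots.txt is only one signal and does not replace reading the terms and conditions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A generic scraper may fetch public pages but not the /private/ section.
public_ok = parser.can_fetch("MyScraper", "https://example.com/images/cat.jpg")
private_ok = parser.can_fetch("MyScraper", "https://example.com/private/data.csv")

# A user agent that the site has singled out is blocked everywhere.
badbot_ok = parser.can_fetch("BadBot", "https://example.com/images/cat.jpg")

print(public_ok, private_ok, badbot_ok)
```

For a live site, the same parser can load the file itself: call `parser.set_url(...)` with the address of the site's robots.txt, then `parser.read()`, before asking `can_fetch`.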
References:
Rocha, R. (2015, October 15). On the ethics of web scraping. Retrieved from: https://robertorocha.info/on-the-ethics-of-web-scraping/.
Lifshitz, R. L. (2019, May 13). Federal Court makes clear: Website scraping is illegal. Retrieved from: https://www.canadianlawyermag.com/news/opinion/federal-court-makes-clear-website-scraping-is-illegal/276128.
Hunter, M. (2021, April 6). How to Legally Scrape the Web for Your Next Data Science Project. Retrieved from: https://medium.com/modern-programmer/how-to-legally-scrape-the-web-for-your-next-data-science-project-8a19250e5f9b.
Statistics Canada. (2021, November 15). Web scraping. Retrieved from: https://www.statcan.gc.ca/en/our-data/where/web-scraping.