Daily Reading 17
Web Scraping
### What are the key differences between scraping static and dynamic websites?
A static website has all of its text and relevant assets present in the HTML delivered on the initial page load, so simply waiting for the page to finish loading is all that is required to access any information you may want to obtain via scraping. A dynamic website, by contrast, renders or loads additional content with JavaScript after the initial load, often in response to user actions or background requests, so the information you want is not necessarily present in the HTML from the start.
Scraping a dynamic website therefore requires some way to execute the page’s JavaScript and emulate the actions of a user making further requests against the already loaded page. In Python, this on-page JavaScript execution can be achieved with a browser-automation library such as Pyppeteer, Selenium, or Playwright, or by outsourcing the rendering to a web scraping API service.
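For contrast, a static page can often be scraped with a plain HTTP request and an HTML parser, no browser automation needed. A minimal sketch of that case, using the requests and BeautifulSoup libraries and a hypothetical URL:

```python
import requests
from bs4 import BeautifulSoup

# Static page: everything we need is already in the initial HTML response
resp = requests.get("https://example.com/articles")  # hypothetical URL
soup = BeautifulSoup(resp.text, "html.parser")

# Pull the text of every <h2> heading on the page
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)
```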
### Explain at least three techniques or best practices that can be employed to avoid getting blocked while scraping websites.
The core reason scrapers get blocked is that scraping puts undue load and stress on the target’s servers, so the ethical techniques for avoiding a block are the ones that respect the web host and their resources.
- Respect robots.txt.
robots.txt is a text file that many websites publish, listing which pages the site owner permits crawlers to access, which are off limits, and sometimes a crawl delay indicating how frequently you may make requests without drawing their ire. Flagrant disregard of robots.txt means you have no respect for the website owner, and the feeling will soon be mutual. (A sketch after this list shows one way to check it programmatically.)
- SLOW DOWN
Your scraper could potentially run extremely fast. Doing so, however, means your IP is sending enormous numbers of requests for every page of the site at once, and in seconds it can make as many requests as the site might receive from human visitors in a day. Space your requests out with a delay, as in the sketch below the list.
- DON’T SCRAPE THINGS FROM BEHIND A LOGIN
This should be obvious. If something requires a login, only specific people are supposed to see it; often, exactly one human user is meant to see what sits behind the login barrier. Building a scraper that gets past that barrier requires supplying cookies and credentials that identify you anyway, so it becomes incredibly obvious who is doing the scraping in the first place. If massive amounts of data are being scraped rapidly from the login session run by Alice, then it’s rather clear who is running the scraper.
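As a rough sketch of the first two practices, the snippet below checks robots.txt with Python’s built-in urllib.robotparser and pauses between requests; the site URL, user-agent name, and paths are hypothetical.

```python
import time
import urllib.robotparser

import requests

BASE = "https://example.com"   # hypothetical target site
AGENT = "my-polite-scraper"    # hypothetical user-agent name

# Fetch and parse the site's robots.txt
rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{BASE}/robots.txt")
rp.read()

# Honor a declared crawl-delay if present, otherwise default to 2 seconds
delay = rp.crawl_delay(AGENT) or 2

for path in ["/articles/1", "/articles/2", "/articles/3"]:
    url = BASE + path
    if not rp.can_fetch(AGENT, url):
        continue  # robots.txt disallows this path, so skip it
    resp = requests.get(url, headers={"User-Agent": AGENT})
    # ... parse resp.text here ...
    time.sleep(delay)  # slow down: space requests out instead of hammering the server
```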
### What is Playwright, and how does it assist in web scraping tasks? Provide an example of a use case where Playwright would be particularly beneficial.
Playwright is a Python library that allows you to automate web browser tasks without the need for human interaction. Unlike scraping approaches that make bare HTTP requests without rendering anything, Playwright convincingly emulates being a real user, because, well, Playwright drives a real browser and performs browser actions the same way a real user would. Because Playwright can interact with a web page in this way, if you can programmatically write out the steps needed to reveal more information from a page, you can collect info from dynamic web pages, which is something a simple request-based scraper cannot do.
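A case where Playwright is particularly beneficial is a page that hides results behind a “Load more” button or infinite scroll. A minimal sketch using Playwright’s sync API; the URL, button text, and CSS class are hypothetical:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")  # hypothetical dynamic page

    # Act like a user: click the button that loads additional products
    page.click("text=Load more")
    page.wait_for_selector(".product-card")    # wait for the new items to render

    # Collect the text of every product card now present on the page
    items = page.locator(".product-card").all_text_contents()
    print(items)

    browser.close()
```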
### Describe the purpose of using Xpath in web scraping, and provide an example of an Xpath expression to select a specific HTML element from a webpage.
XPath is a query language for selecting elements in an HTML or XML document by describing a path through the document tree, and it is supported by many scraping tools, including Playwright. You provide an XPath expression describing the element you want to interact with, and Playwright will look on the page for a matching element; upon finding it, you can ask Playwright to perform specific actions on it, such as clicking or reading its text, and this is the primary method for automatically performing actions via Playwright. For example, the expression //div[@class='product']//span[@class='price'] selects every span with class "price" nested inside a div with class "product".
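A brief sketch of using that XPath expression as a Playwright selector (selectors beginning with // are treated as XPath); the page and element attributes are hypothetical:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/shop")  # hypothetical page

    # XPath: every <span class="price"> nested inside a <div class="product">
    prices = page.locator("//div[@class='product']//span[@class='price']").all_text_contents()
    print(prices)

    browser.close()
```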