PageFetcher

class scraper_toolkit.components.PageFetcher.PageFetcher(domain: str)

Fetch URLs and return web pages’ HTML as strings.

Parameters: domain – Prefix to be added to scraped URLs missing the domain.

static get_full_url(domain: str, suffix: str) → str

Return a complete URL given a domain and a suffix. This also works when the provided suffix is already the complete URL.

Parameters:
  • domain – The domain of the target page URL.
  • suffix – The URL of the target page, with or without the domain prefix.
Returns:

The complete URL.
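The behaviour described for get_full_url can be illustrated with a minimal sketch. The function name full_url_sketch and the joining rule (a prefix check followed by a single-slash join) are assumptions made for illustration; the library's actual implementation may differ.

```python
def full_url_sketch(domain: str, suffix: str) -> str:
    """Hypothetical stand-in for the documented get_full_url behaviour."""
    # Assumption: a suffix that already starts with the domain is complete.
    if suffix.startswith(domain):
        return suffix
    # Otherwise join the two parts with exactly one slash between them.
    return domain.rstrip("/") + "/" + suffix.lstrip("/")


print(full_url_sketch("https://example.com", "/about"))
print(full_url_sketch("https://example.com", "https://example.com/about"))
```

Both calls print https://example.com/about, which matches the documented promise that a suffix may be passed with or without the domain prefix.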

get_html(url: str = None) → str

Fetch the page HTML from the given URL.

Parameters: url – URL of target page.
Returns: HTML as a string.
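As a rough illustration of what fetching a page's HTML as a string involves, here is a sketch built only on the standard library. The name fetch_html_sketch, the timeout parameter, and the UTF-8 fallback are all assumptions; this page does not say which HTTP client PageFetcher uses internally.

```python
from urllib.request import urlopen


def fetch_html_sketch(url: str, timeout: float = 10.0) -> str:
    """Hypothetical stand-in: fetch a page and return its HTML as a string."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access, so it is left commented out):
# html = fetch_html_sketch("https://example.com")
```

Decoding with errors="replace" keeps the sketch from raising on pages whose bytes do not match the declared charset; a stricter caller might prefer to let the error surface.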

Return a list of every href URL found on the page at target_url.

Parameters: target_url – URL of page to search for href links.
Returns: List of every discovered href link on the page.
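The link-discovery step can be sketched with the standard library's html.parser. Note the assumptions: HrefCollector and collect_hrefs_sketch are hypothetical names, and the sketch starts from an HTML string rather than a target_url, so the fetching step is left out.

```python
from html.parser import HTMLParser
from typing import List


class HrefCollector(HTMLParser):
    """Hypothetical helper that records every href attribute it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def collect_hrefs_sketch(html: str) -> List[str]:
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


sample = '<a href="/a">A</a><p>text</p><a href="https://example.com/b">B</a>'
print(collect_hrefs_sketch(sample))
```

The collected values are returned exactly as written in the page, so relative links such as /a would still need the domain added before they can be fetched.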
static select_elements_from_html(html: str, selector: str)

Return a list of HTML elements from the given html that match the provided CSS selector.

Parameters:
  • html – HTML of the page to parse.
  • selector – CSS selector for target elements.
Returns:

List of HTML elements matching the CSS selector.
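To make the idea of selector-based matching concrete, the following sketch handles a deliberately small subset of CSS: "tag", ".class", and "tag.class". Everything here is an assumption for illustration (the names SimpleSelectorMatcher and select_elements_sketch, the restricted selector grammar, and the (tag, attributes) return shape); the real method presumably relies on a full CSS selector engine and returns richer element objects.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

Element = Tuple[str, Dict[str, Optional[str]]]


class SimpleSelectorMatcher(HTMLParser):
    """Hypothetical matcher for 'tag', '.class', or 'tag.class' selectors only."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        tag, _, css_class = selector.partition(".")
        self.tag = tag or None              # None means "any tag"
        self.css_class = css_class or None  # None means "any class"
        self.matches: List[Element] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.tag is not None and tag != self.tag:
            return
        if self.css_class is not None and self.css_class not in classes:
            return
        self.matches.append((tag, attr_map))


def select_elements_sketch(html: str, selector: str) -> List[Element]:
    matcher = SimpleSelectorMatcher(selector)
    matcher.feed(html)
    return matcher.matches


sample = '<ul><li class="item new"><a href="/a">A</a></li><li>B</li></ul>'
print(select_elements_sketch(sample, "li.item"))
```

Combinators, attribute selectors, and pseudo-classes are out of scope for this sketch, which is why a dedicated HTML parsing library is the usual choice for real scraping work.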

Yield the HTML of every page linked from target_url, where the links are located by the given CSS selector.

Parameters:
  • selector – CSS selector for elements containing an href attribute.
  • target_url – URL to search for links. If none is provided, the domain URL will be used.
Returns:

Generator for HTML as strings, fetched from the selected links.
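Because this method returns a generator, pages are fetched lazily, one per iteration. The sketch below shows that pattern under stated assumptions: iter_linked_html_sketch, its fetch callback, and the fake_pages dictionary are hypothetical, and the links are passed in directly instead of being located with a CSS selector.

```python
from typing import Callable, Iterable, Iterator


def iter_linked_html_sketch(
    links: Iterable[str],
    domain: str,
    fetch: Callable[[str], str],
) -> Iterator[str]:
    """Hypothetical generator: yield the HTML behind each link, one at a time."""
    for link in links:
        # Complete relative links with the domain before fetching.
        if link.startswith(domain):
            url = link
        else:
            url = domain.rstrip("/") + "/" + link.lstrip("/")
        yield fetch(url)


# A fake fetcher keeps the example offline; a real one would issue HTTP requests.
fake_pages = {
    "https://example.com/a": "<h1>A</h1>",
    "https://example.com/b": "<h1>B</h1>",
}
pages = iter_linked_html_sketch(
    ["/a", "https://example.com/b"], "https://example.com", fake_pages.__getitem__
)
print(next(pages))  # only the first page has been fetched at this point
print(list(pages))  # the remaining pages are fetched as the generator is drained
```

Passing the fetcher in as a callable is a design choice made for this sketch so that it can run without network access; it also means a caller can stop iterating early and avoid fetching pages it does not need.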