PageFetcher

class scraper_toolkit.components.PageFetcher.PageFetcher(domain: str)

Fetch URLs and return web pages’ HTML as strings.

Parameters:	domain – Prefix to be added to scraped URLs missing the domain.

static get_full_url(domain: str, suffix: str) → str

Return a complete URL given a domain and a suffix. If the suffix is already a complete URL, the result is still a single complete URL.

Parameters:	domain – The domain of the target page URL.
	suffix – The URL of the target page, with or without the domain prefix.
Returns:	The complete URL.
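
The behaviour described above can be illustrated with a minimal sketch. The function name full_url_sketch and the scheme check are assumptions made for illustration only; this is not the library's implementation, which may handle edge cases differently.

```python
def full_url_sketch(domain: str, suffix: str) -> str:
    """Hypothetical stand-in for get_full_url, for illustration only."""
    # Assumption: a suffix that already carries a scheme is a complete URL.
    if suffix.startswith(("http://", "https://")):
        return suffix
    # Otherwise join the two parts with exactly one slash between them.
    return domain.rstrip("/") + "/" + suffix.lstrip("/")


print(full_url_sketch("https://example.com", "/about"))
print(full_url_sketch("https://example.com", "https://example.com/about"))
```

Both calls print https://example.com/about, which is the property the method's description promises: the caller does not need to know whether a scraped link was relative or absolute.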

get_html(url: str = None) → str

Fetch the page HTML from the given URL.

Parameters:	url – URL of target page.
Returns:	HTML as a string.

get_links_from_page(target_url: str = None) → Iterable[str]

Return a list of every href URL found on the page at target_url.

Parameters:	target_url – URL of page to search for href links.
Returns:	List of every discovered href link on the page.
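
As a rough idea of what collecting href values from a page involves, here is a sketch that uses only the standard library's html.parser. The names HrefCollector and extract_hrefs are hypothetical, and the sketch works on an HTML string rather than fetching a URL; the library itself may rely on a different parser.

```python
from html.parser import HTMLParser
from typing import List


class HrefCollector(HTMLParser):
    """Hypothetical helper that records every href attribute it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html: str) -> List[str]:
    """Return every href value in the given HTML, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


sample = '<a href="/a">A</a><p>text</p><a href="https://example.com/b">B</a>'
print(extract_hrefs(sample))  # ['/a', 'https://example.com/b']
```

Note that the collected values can be relative (such as /a), which is where a helper like get_full_url becomes useful.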

static select_elements_from_html(html: str, selector: str)

Return a list of HTML elements from the given html that match the provided CSS selector.

Parameters:	html – HTML of the page to parse.
	selector – CSS selector for target elements.
Returns:	List of HTML elements matching the CSS selector.

select_links_from_page(selector: str, target_url: str = None) → Iterable[str]

Yield the HTML of every page linked from target_url, where the links are the elements matched by the given CSS selector.

Parameters:	selector – CSS selector for elements containing an href attribute.
	target_url – URL to search for links. If none is provided, the domain URL will be used.
Returns:	Generator for HTML as strings, fetched from the selected links.
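
Because the return value is a generator, linked pages are presumably fetched one at a time as the caller iterates, rather than all up front. The sketch below shows that pattern in isolation. The names iter_linked_html and fake_fetch are hypothetical, and fake_fetch stands in for a real network request so the example runs offline.

```python
from typing import Callable, Iterable, Iterator, List


def iter_linked_html(links: Iterable[str], fetch: Callable[[str], str]) -> Iterator[str]:
    """Hypothetical sketch: yield the fetched HTML for each link, lazily."""
    for link in links:
        yield fetch(link)


fetched: List[str] = []


def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP request; records which URLs were asked for.
    fetched.append(url)
    return f"<html>{url}</html>"


pages = iter_linked_html(["/first", "/second"], fake_fetch)
print(fetched)      # [] because nothing is fetched until iteration starts
print(next(pages))  # <html>/first</html>
print(fetched)      # ['/first']
```

In practice this means a caller can stop iterating early and avoid requesting the remaining pages.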