PageFetcher¶

class scraper_toolkit.components.PageFetcher.PageFetcher(domain: str)[source]¶
Fetch URLs and return web pages’ HTML as strings.

Parameters: domain – Prefix to be added to scraped URLs missing the domain.
static get_full_url(domain: str, suffix: str) → str[source]¶
Return a complete URL given a domain and suffix, even if the provided suffix is already the complete URL.

Parameters:
- domain – The domain of the target page URL.
- suffix – The URL of the target page, with or without the domain prefix.
Returns: The complete URL.
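The joining rule described above can be sketched with the standard library. This is a simplified stand-in for illustration, not the toolkit's actual implementation:

```python
from urllib.parse import urlparse

def get_full_url(domain: str, suffix: str) -> str:
    """Join domain and suffix; return suffix unchanged if already complete."""
    # A suffix that already carries a scheme (e.g. "https://") is a
    # complete URL and is returned as-is.
    if urlparse(suffix).scheme:
        return suffix
    # Otherwise prepend the domain, normalizing the slash between the parts.
    return domain.rstrip("/") + "/" + suffix.lstrip("/")
```
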
get_html(url: str = None) → str[source]¶
Fetch the page HTML from the given URL.

Parameters: url – URL of the target page.
Returns: HTML as a string.
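A minimal fetch along these lines can be written with `urllib.request`; the `get_html` below is an illustrative sketch, not the library's code:

```python
from urllib.request import urlopen

def get_html(url: str) -> str:
    """Fetch a page and decode its body to a string."""
    with urlopen(url) as response:
        # Use the charset advertised by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```
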
get_links_from_page(target_url: str = None) → Iterable[str][source]¶
Return a list of every href URL found on target_url.

Parameters: target_url – URL of the page to search for href links.
Returns: List of every discovered href link on the page.
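The extraction half of this method can be sketched with the standard library's `html.parser`; the `links_from_html` helper below is a hypothetical name operating on already-fetched HTML, not the toolkit's implementation:

```python
from html.parser import HTMLParser

class _HrefCollector(HTMLParser):
    """Collect the href attribute of every tag that carries one."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)

def links_from_html(html: str) -> list[str]:
    """Return every href URL found in an HTML string."""
    parser = _HrefCollector()
    parser.feed(html)
    return parser.links
```
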
static select_elements_from_html(html: str, selector: str)[source]¶
Return a list of HTML elements from the given html that match the provided CSS selector.

Parameters:
- html – HTML of the page to parse.
- selector – CSS selector for target elements.
Returns: List of HTML elements matching the CSS selector.
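Full CSS selection normally requires a parsing library (e.g. BeautifulSoup's `select`). As a self-contained illustration of the idea, the sketch below handles only bare tag-name selectors and returns the text of matching elements; `select_tag_texts` is a hypothetical, simplified stand-in:

```python
from html.parser import HTMLParser

class _TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag name."""
    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.depth = 0              # > 0 while inside a matching element
        self.texts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.texts.append("")   # start a new matched element
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data

def select_tag_texts(html: str, tag: str) -> list[str]:
    """Return the text content of every element matching a tag-name selector."""
    parser = _TagTextCollector(tag)
    parser.feed(html)
    return parser.texts
```
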
select_links_from_page(selector: str, target_url: str = None) → Iterable[str][source]¶
Yield the HTML of all pages linked from target_url, located by the given CSS selector.

Parameters:
- selector – CSS selector for elements containing an href attribute.
- target_url – URL to search for links. If none is provided, the domain URL will be used.
Returns: Generator yielding HTML as strings, fetched from the selected links.
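The pipeline this method describes — fetch the page, select linking elements, lazily fetch each linked page — can be sketched as a generator. As above, this is a simplified stand-in (the selector is treated as a bare tag name), not the toolkit's implementation:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class _SelectedHrefs(HTMLParser):
    """Collect href attributes from tags whose name matches the selector."""
    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def select_links_from_page(selector: str, target_url: str):
    """Yield the HTML of every page linked from target_url via matching tags."""
    with urlopen(target_url) as response:       # fetch the listing page
        page = response.read().decode("utf-8")
    parser = _SelectedHrefs(selector)
    parser.feed(page)
    for href in parser.hrefs:                   # lazily fetch each linked page
        with urlopen(href) as linked:
            yield linked.read().decode("utf-8")
```

Returning a generator rather than a list means linked pages are fetched one at a time, only as the caller consumes them.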
-
static