ScraperProject

class scraper_toolkit.ScraperProject.ScraperProject(domain: str)[source]

Handles page fetching, HTML parsing, and data export for a web scraping project.

Parameters: domain – Prefix to be added to scraped URLs missing the domain.
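The documentation does not specify exactly how the domain prefix is applied, but the documented idea can be sketched with the standard library's urllib.parse.urljoin (the helper name apply_domain is hypothetical, not part of scraper_toolkit):

```python
from urllib.parse import urljoin

def apply_domain(domain: str, url: str) -> str:
    """Prepend the project domain to relative URLs; leave absolute URLs as-is."""
    return urljoin(domain, url)

# A relative path gains the domain prefix:
print(apply_domain("https://example.com", "/products/1"))
# → https://example.com/products/1

# An already-absolute URL passes through unchanged:
print(apply_domain("https://example.com", "https://other.org/page"))
# → https://other.org/page
```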
add_selector(selector: Union[str, Selector], attribute: str = None, name: str = None, post_processing: Callable = None)[source]

Add the given selector to loaded CSS selectors.

Parameters:
  • selector – CSS selector as a string or a Selector type object.
  • attribute – HTML attribute of the matched element to store.
  • name – Optional name for the parsed attribute, useful for creating the header row when exporting as a CSV file.
  • post_processing – Optional function called on the parsed attribute before it is stored. Useful for cleaning up and splitting data.
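A post_processing callable receives the parsed attribute and returns the value to store. A minimal sketch of such a cleanup function (split_price and the span.price selector are hypothetical examples, not part of scraper_toolkit):

```python
def split_price(raw: str) -> float:
    """Example post-processing callable: strip whitespace and a currency
    symbol, drop thousands separators, and convert the result to a float."""
    return float(raw.strip().lstrip("$").replace(",", ""))

print(split_price("  $1,299.99 "))  # → 1299.99

# Assumed usage with the documented add_selector signature:
# project.add_selector("span.price", name="price", post_processing=split_price)
```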
add_selectors(selectors: List[Selector])[source]

Add multiple CSS selectors to loaded selectors.

Parameters: selectors – List of Selector objects.
export_to_csv(csv_path: pathlib.Path, encoding: str = 'UTF-8', write_header: bool = True)[source]

Export parsed data to a CSV file.

Parameters:
  • csv_path – Path of the location to save the CSV file.
  • encoding – CSV file encoding. Default is UTF-8.
  • write_header – If true, write a header row to the CSV file using the “name” keys in the provided data.
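The internals of export_to_csv are not shown here, but the documented behavior — one CSV row per parsed dictionary, with an optional header row built from the selector "name" keys — can be sketched with the standard library's csv.DictWriter (the sample data and file name are hypothetical):

```python
import csv
from pathlib import Path

# Hypothetical parsed rows, shaped as parse() describes:
# one dictionary per matched element, keyed by selector name.
parsed = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget", "price": "24.50"},
]

def export_rows(rows, csv_path: Path, encoding: str = "UTF-8",
                write_header: bool = True) -> None:
    with csv_path.open("w", newline="", encoding=encoding) as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        if write_header:
            writer.writeheader()  # header row from the "name" keys
        writer.writerows(rows)

export_rows(parsed, Path("products.csv"))
```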
fetch(url: str = None) → str[source]

Fetch HTML from the page at the given URL.

Parameters: url – URL of the target page.
Returns: HTML page as a string.
parse()[source]

Parse the fetched HTML with the loaded CSS selectors and append each matching element to self.parsed as a dictionary.
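How parse matches CSS selectors internally is not documented here; the following stdlib sketch only illustrates the documented output shape of self.parsed — a list of dictionaries, one per matched element — using html.parser for a single hard-coded class match (TitleCollector and the sample HTML are hypothetical):

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text of <h2 class="title"> elements as dictionaries,
    mirroring the documented shape of self.parsed."""

    def __init__(self):
        super().__init__()
        self.parsed = []        # one dict per matched element
        self._in_match = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_match = True

    def handle_data(self, data):
        if self._in_match:
            self.parsed.append({"title": data.strip()})
            self._in_match = False

html = '<h2 class="title">First</h2><p>intro</p><h2 class="title">Second</h2>'
collector = TitleCollector()
collector.feed(html)
print(collector.parsed)  # → [{'title': 'First'}, {'title': 'Second'}]
```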