ScraperProject

class scraper_toolkit.ScraperProject.ScraperProject(domain: str)[source]

Handles page fetching, HTML parsing, and data export for a web scraping project.

Parameters: domain – Prefix to be added to scraped URLs missing the domain.
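The documentation does not specify exactly how the domain prefix is applied, but the documented idea can be sketched with the standard library's urllib.parse.urljoin (the helper name apply_domain is hypothetical, not part of scraper_toolkit):

```python
from urllib.parse import urljoin

def apply_domain(domain: str, url: str) -> str:
    """Prepend the project domain to relative URLs; leave absolute URLs as-is."""
    return urljoin(domain, url)

# A relative path gains the domain prefix:
print(apply_domain("https://example.com", "/products/1"))
# → https://example.com/products/1

# An already-absolute URL passes through unchanged:
print(apply_domain("https://example.com", "https://other.org/page"))
# → https://other.org/page
```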
add_selector(selector: Union[str, Selector], attribute: str = None, name: str = None, post_processing: Callable = None)[source]

Add the given selector to loaded CSS selectors.

Parameters:
  • selector – CSS selector as a string or a Selector type object.
  • attribute – HTML attribute of the matched element to store.
  • name – Optional name for the parsed attribute, useful for creating the header row when exporting as a CSV file.
  • post_processing – Optional function called on the parsed attribute before it is stored. Useful for cleaning up and splitting data.
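A post_processing callable receives the parsed attribute and returns the value to store. A minimal sketch of such a cleanup function (split_price and the span.price selector are hypothetical examples, not part of scraper_toolkit):

```python
def split_price(raw: str) -> float:
    """Example post-processing callable: strip whitespace and a currency
    symbol, drop thousands separators, and convert the result to a float."""
    return float(raw.strip().lstrip("$").replace(",", ""))

print(split_price("  $1,299.99 "))  # → 1299.99

# Assumed usage with the documented add_selector signature:
# project.add_selector("span.price", name="price", post_processing=split_price)
```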
add_selectors(selectors: List[Selector])[source]

Add multiple CSS selectors to loaded selectors.

Parameters: selectors – List of Selector objects.
export_to_csv(csv_path: pathlib.Path, encoding: str = 'UTF-8', write_header: bool = True)[source]

Export parsed data to a CSV file.

Parameters:
  • csv_path – Path of the location to save the CSV file.
  • encoding – CSV file encoding. Default is UTF-8.
  • write_header – If true, write a header row to the CSV file using the “name” keys in the provided data.
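The internals of export_to_csv are not shown here, but the documented behavior — one CSV row per parsed dictionary, with an optional header row built from the selector "name" keys — can be sketched with the standard library's csv.DictWriter (the sample data and file name are hypothetical):

```python
import csv
from pathlib import Path

# Hypothetical parsed rows, shaped as parse() describes:
# one dictionary per matched element, keyed by selector name.
parsed = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget", "price": "24.50"},
]

def export_rows(rows, csv_path: Path, encoding: str = "UTF-8",
                write_header: bool = True) -> None:
    with csv_path.open("w", newline="", encoding=encoding) as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        if write_header:
            writer.writeheader()  # header row from the "name" keys
        writer.writerows(rows)

export_rows(parsed, Path("products.csv"))
```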
fetch(url: str = None) → str[source]

Fetch HTML from the page at the given URL.

Parameters: url – URL of the target page.
Returns: HTML page as a string.
parse()[source]

Parse the fetched HTML with the loaded CSS selectors and append each matching element to self.parsed as a dictionary.
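How parse matches CSS selectors internally is not documented here; the following stdlib sketch only illustrates the documented output shape of self.parsed — a list of dictionaries, one per matched element — using html.parser for a single hard-coded class match (TitleCollector and the sample HTML are hypothetical):

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text of <h2 class="title"> elements as dictionaries,
    mirroring the documented shape of self.parsed."""

    def __init__(self):
        super().__init__()
        self.parsed = []        # one dict per matched element
        self._in_match = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_match = True

    def handle_data(self, data):
        if self._in_match:
            self.parsed.append({"title": data.strip()})
            self._in_match = False

html = '<h2 class="title">First</h2><p>intro</p><h2 class="title">Second</h2>'
collector = TitleCollector()
collector.feed(html)
print(collector.parsed)  # → [{'title': 'First'}, {'title': 'Second'}]
```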