Clients

HTTP clients designed for easy tool building

class webtoolbox.clients.Retriever(log_name='Retriever', **kwargs)

Fast, asynchronous URL retriever

Usage is simple:
  1. Create a Retriever, optionally providing a log_name for logging and any of the kwargs accepted by tornado.httpclient.AsyncHTTPClient
  2. Add response processor callbacks to retriever.response_processors. Each will be called with (request, response)
  3. Load one or more URLs using retriever.queue_urls (your response processors may add more as desired)
  4. Call retriever.run(), which will block until all URLs have been processed
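
The queue-and-callback flow in the steps above can be sketched without any networking. The class below is a hypothetical stand-in, not webtoolbox.clients.Retriever: SketchRetriever, its fetch argument, and the demo function are assumptions made for illustration, and the loop runs synchronously rather than on Tornado's event loop. It only shows how processors receive (request, response) and may queue further URLs while run() drains the queue.

```python
from collections import deque


class SketchRetriever:
    """Hypothetical stand-in that mimics the queue/callback flow.

    It is NOT webtoolbox.clients.Retriever: it performs no HTTP requests and
    runs synchronously, handing each queued URL to a caller-supplied function.
    """

    def __init__(self, fetch):
        self.fetch = fetch               # assumed: callable(url) -> response
        self.response_processors = []    # callbacks taking (request, response)
        self._pending = deque()
        self._seen = set()

    def queue(self, url):
        # Ignore URLs that were already queued so processors can re-add freely.
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def run(self):
        # Drain the queue; processors may add more URLs while this loop runs.
        while self._pending:
            url = self._pending.popleft()
            response = self.fetch(url)
            for processor in self.response_processors:
                processor(url, response)


def demo():
    # A fake three-page "site": each page maps to the links it contains.
    pages = {"/a": ["/b", "/c"], "/b": ["/a"], "/c": []}
    retriever = SketchRetriever(fetch=lambda url: pages[url])
    visited = []

    def record_and_follow(request, response):
        visited.append(request)
        for link in response:
            retriever.queue(link)

    retriever.response_processors.append(record_and_follow)
    retriever.queue("/a")
    retriever.run()
    return visited
```

Calling demo() visits each page once, even though /b links back to /a, because queue() skips URLs it has already seen.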
queue(url, **kwargs)
Queue up a list of URLs to retrieve
class webtoolbox.clients.Spider(log_name='Spider', **kwargs)

Retriever-based Spider

Starts with an initial list of URLs and crawls them asynchronously, providing results to header_processors, html_processors and tree_processors which implement additional functionality.

check_site demonstrates the HTML processor feature to report HTML validation errors from pytidylib.

allowed_hosts
This will be automatically populated from the initial batch of URLs
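
One plausible way to derive such a host set from the starting URLs is sketched below using only the standard library. The function names hosts_from_urls and is_allowed are hypothetical and are not part of Spider; the sketch also assumes that hosts are compared case-insensitively and that a port, if present, counts as part of the host.

```python
from urllib.parse import urlparse


def hosts_from_urls(urls):
    """Collect the lower-cased network locations of the starting URLs."""
    return {urlparse(url).netloc.lower() for url in urls}


def is_allowed(url, allowed_hosts):
    """Return True if the URL's network location is in the allowed set."""
    return urlparse(url).netloc.lower() in allowed_hosts


# Example: two starting URLs on the same host collapse to one entry.
start_urls = ["http://example.org/", "http://Example.org/about"]
allowed = hosts_from_urls(start_urls)
```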
guess_charset(response)
Does the ugly business of attempting to figure out how to decode the response into a Unicode string
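
The source does not describe how guess_charset works internally. As an illustration of the general problem, the sketch below tries the Content-Type header first, then a <meta> declaration near the top of the body, then a default. The names guess_charset_sketch and decode_body, the 2048-byte scan window, and the UTF-8 fallback are all assumptions rather than the library's behaviour.

```python
import re

# charset parameter inside a Content-Type header value
HEADER_CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)
# charset declared in a <meta> tag (either HTML5 or http-equiv style)
META_CHARSET_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def guess_charset_sketch(content_type, body, default="utf-8"):
    """Return a lower-cased charset name for a response body (bytes)."""
    # 1. Prefer the charset parameter of the Content-Type header.
    if content_type:
        match = HEADER_CHARSET_RE.search(content_type)
        if match:
            return match.group(1).lower()
    # 2. Otherwise look for a <meta> declaration in the first 2048 bytes.
    match = META_CHARSET_RE.search(body[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Fall back to the assumed default.
    return default


def decode_body(content_type, body):
    """Decode bytes to str, replacing undecodable bytes."""
    charset = guess_charset_sketch(content_type, body)
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # The declared charset is unknown to Python; fall back to UTF-8.
        return body.decode("utf-8", errors="replace")
```

A production version would likely also consider byte-order marks and statistical detection; those are left out to keep the sketch short.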
header_processors
Header processors will be called with (URL, HTTP Headers)
html_processors
HTML processors will be called with unprocessed HTML as a UTF-8 string
log
Logger used to report progress & errors
process_page(request, response)

Callback used to process a URL after it’s been retrieved

Rough sequence:
  1. Process errors and redirects
  2. Process non-HTML content
  3. Convert retrieved HTML to UTF-8
  4. Process HTML through the defined html_processors
  5. Create an lxml tree
  6. Convert all links to absolute URLs
  7. Queue any unseen URLs for retrieval
  8. Pass lxml tree to tree_processors
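
Steps 6 and 7 of the sequence above (make links absolute, then queue the unseen ones) can be sketched as follows. This is an illustration only: it uses the standard library's html.parser in place of the lxml tree that process_page builds, and the names LinkCollector and new_links, the fragment stripping, and the host check are assumptions rather than the spider's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (stdlib stand-in for an lxml tree walk)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def new_links(page_url, html, allowed_hosts, seen):
    """Return absolute, on-site URLs from html that are not yet in seen.

    The seen set is updated in place so later pages skip the same URLs.
    """
    collector = LinkCollector()
    collector.feed(html)
    fresh = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc not in allowed_hosts:
            continue  # off-site (or non-HTTP, e.g. mailto:) link
        if absolute in seen:
            continue  # already queued or processed
        seen.add(absolute)
        fresh.append(absolute)
    return fresh
```

Each URL returned by new_links would then be handed to the spider's queue method, which is how a crawl that starts from a handful of URLs eventually covers the whole site.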
queue(url)
Add a URL to the queue to be retrieved
run(urls)

Start the spider with the provided list of URLs

Block until the spider has crawled the entire site

site_structure
All URLs processed by this spider, as a URL-keyed list of URLStatus elements
URLs whose path matches this regular expression won’t be followed:
skip_media
If true, don’t retrieve media files (e.g. <img>, <object>, <embed>)
skip_resources
If true, don’t process non-media components (e.g. CSS stylesheets)
tree_processors
Tree processors will be called with the full lxml tree, which can be
