Spider Python Reference Documentation

Spider

Current Version: 11.5.0

Chilkat.Spider

Crawl a website domain, manage URL queues, filter links, cache pages, and inspect fetched HTML.

Chilkat.Spider crawls pages within a single initialized domain. It maintains internal lists of unspidered, spidered, failed, and outbound URLs; fetches one page at a time; extracts same-domain links for continued crawling; collects outbound links separately; respects robots.txt; supports wildcard include and avoid patterns; can use a dedicated cache directory; and exposes the last fetched HTML, title, meta description, keywords, redirect information, cache status, and last-modified date.

Domain-limited crawling

Initialize the spider with a domain and crawl only URLs that belong to that same domain.

URL queue management

Add starting URLs, crawl the next queued URL, skip selected URLs, and inspect unspidered, spidered, failed, and outbound URL lists.

Filtering rules

Add wildcard avoid patterns, must-match patterns, and outbound-link avoid patterns to control which URLs are crawled or collected.

Page metadata

Read the last fetched HTML, page title, meta description, meta keywords, final redirect URL, and last-modified timestamp.

Caching and redirects

Fetch pages from a Chilkat-managed cache, update the cache, detect when a page was redirected, and inspect whether the last page came from cache.

Network controls

Configure user-agent, proxy settings, connect/read timeouts, IPv6 preference, maximum URL length, maximum response size, and abort support.

Common pattern: Call Initialize with the target domain, add one or more starting URLs with AddUnspidered, optionally add filtering patterns, then repeatedly call CrawlNext. After each successful crawl, inspect LastUrl, LastHtml, page metadata, redirect state, and the URL-list counts to decide whether to continue, skip, recrawl, or process the page.
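The pattern above can be sketched as a small helper. This is a minimal illustration rather than Chilkat's own sample code: the function name crawl_domain and its parameters are hypothetical, and the spider argument is assumed to be a chilkat2.Spider instance (or any object exposing the same Initialize, AddAvoidPattern, AddUnspidered, CrawlNext, SleepMs, NumUnspidered, LastUrl, and LastHtmlTitle members named in this reference). The number of attempts is bounded so the loop ends even if some URLs fail.

```python
def crawl_domain(spider, domain, start_urls, avoid_patterns=(),
                 max_attempts=25, delay_ms=0):
    """Crawl up to max_attempts queued URLs; return (url, title) pairs.

    `spider` is assumed to behave like a chilkat2.Spider object.
    """
    spider.Initialize(domain)

    # Optional filtering: wildcard patterns for URLs that should not be crawled.
    for pattern in avoid_patterns:
        spider.AddAvoidPattern(pattern)

    # Seed the queue of unspidered URLs.
    for url in start_urls:
        spider.AddUnspidered(url)

    pages = []
    for _ in range(max_attempts):
        if spider.NumUnspidered == 0:
            break  # nothing left to crawl
        if spider.CrawlNext():
            # Only read the "last page" properties after a successful crawl.
            pages.append((spider.LastUrl, spider.LastHtmlTitle))
        if delay_ms > 0:
            spider.SleepMs(delay_ms)  # pause between requests
    return pages
```

With Chilkat installed, a call might look like crawl_domain(chilkat2.Spider(), "www.example.com", ["https://www.example.com/"]), where the domain and URL are placeholders.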

Scope note: Spider is intended for controlled, domain-oriented crawling and link discovery. Use Chilkat.Http or Chilkat.Rest when the task is a single HTTP request, API call, form post, or download rather than a crawl of discovered links.

Object Creation

obj = chilkat2.Spider()
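Once the object exists, the network and cache properties are typically set before crawling begins. The helper below is a hedged sketch: configure_spider and its parameters are hypothetical names, the property names are taken from the property index of this reference, and the units of ConnectTimeout, ReadTimeout, and MaxResponseSize are not stated in this section, so check each property's own entry before choosing values.

```python
def configure_spider(spider, user_agent=None, cache_dir=None,
                     connect_timeout=None, read_timeout=None,
                     max_response_size=None):
    """Apply optional settings to a Spider-like object and return it.

    Only values that are explicitly passed are assigned, so the
    library's defaults stay in effect for everything else.
    """
    if user_agent is not None:
        spider.UserAgent = user_agent
    if connect_timeout is not None:
        spider.ConnectTimeout = connect_timeout
    if read_timeout is not None:
        spider.ReadTimeout = read_timeout
    if max_response_size is not None:
        spider.MaxResponseSize = max_response_size
    if cache_dir is not None:
        # Assumption: enabling both flags gives read-through caching,
        # i.e. serve cached pages when present and store newly fetched ones.
        spider.CacheDir = cache_dir
        spider.FetchFromCache = True
        spider.UpdateCache = True
    return spider
```

After a crawl, LastFromCache, WasRedirected, and FinalRedirectUrl can be read to see how the last page was actually obtained.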

Properties

AbortCurrent

bool AbortCurrent

Introduced in version 9.5.0.58

When set to True, causes the currently running method to abort. Methods that always finish quickly (i.e., that involve no lengthy file operations or network communications) are not affected. When the abort occurs, this property is reset to False. If no method is running when the property is set, it is automatically reset to False when the next method is called. Both synchronous and asynchronous method calls can be aborted. (A synchronous method call can be aborted by setting this property from a separate thread.)
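As an illustration of aborting a synchronous call from a separate thread, the sketch below uses a standard-library timer to set AbortCurrent after a deadline. The helper name abort_after is hypothetical, and spider is assumed to be a chilkat2.Spider object (any object with a writable AbortCurrent attribute works for the purposes of the sketch).

```python
import threading

def abort_after(spider, seconds):
    """Start a timer that sets spider.AbortCurrent = True after `seconds`.

    Returns the timer so the caller can cancel it once the method
    being guarded has returned.
    """
    def request_abort():
        spider.AbortCurrent = True

    timer = threading.Timer(seconds, request_abort)
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer
```

A typical use would be to call timer = abort_after(spider, 30), then spider.CrawlNext(), then timer.cancel(). Cancelling matters: if the timer fired after the call had already returned, the property would stay True until the next method call resets it.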

Properties

AbortCurrent

AvoidHttps

CacheDir

ChopAtQuery

ConnectTimeout

DebugLogFilePath

Domain

FetchFromCache

FinalRedirectUrl

LastErrorHtml

LastErrorText

LastErrorXml

LastFromCache

LastHtml

LastHtmlDescription

LastHtmlKeywords

LastHtmlTitle

LastMethodSuccess

LastModDateStr

LastUrl

MaxResponseSize

MaxUrlLen

NumAvoidPatterns

NumFailed

NumOutboundLinks

NumSpidered

NumUnspidered

PreferIpv6

ProxyDomain

ProxyLogin

ProxyPassword

ProxyPort

ReadTimeout

UpdateCache

UserAgent

VerboseLogging

Version

WasRedirected

WindDownCount

Methods

AddAvoidOutboundLinkPattern

AddAvoidPattern

AddMustMatchPattern

AddUnspidered

CanonicalizeUrl

ClearFailedUrls

ClearOutboundLinks

ClearSpideredUrls

CrawlNext

CrawlNextAsync (1)

FetchRobotsText

FetchRobotsTextAsync (1)

GetAvoidPattern

GetBaseDomain

GetFailedUrl

GetOutboundLink

GetSpideredUrl

GetUnspideredUrl

GetUrlDomain

Initialize

LoadTaskCaller

RecrawlLast

RecrawlLastAsync (1)

SkipUnspidered

SleepMs
