Add Crawl-delay Directive Support from robots.txt #1707
Summary
This PR adds support for respecting Crawl-delay directives from robots.txt files. When enabled, the crawler will automatically wait the specified delay between requests to the same domain, improving compliance with website policies and reducing the risk of being rate-limited or blocked.
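For illustration, here is a minimal usage sketch of the new flag. AsyncWebCrawler, CrawlerRunConfig, and arun_many() are existing crawl4ai names and respect_crawl_delay is the parameter introduced by this PR; the check_robots_txt setting and the exact call shape are assumptions, not part of this change.

```python
# Minimal usage sketch; assumes respect_crawl_delay is exposed on
# CrawlerRunConfig as described in this PR. Exact defaults may differ.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        respect_crawl_delay=True,  # honor Crawl-delay from robots.txt (new in this PR)
        check_robots_txt=True,     # assumed prerequisite: robots.txt is fetched and checked
    )
    urls = ["https://example.com/a", "https://example.com/b"]
    async with AsyncWebCrawler() as crawler:
        # With the flag enabled, requests to the same domain are spaced by its Crawl-delay
        results = await crawler.arun_many(urls=urls, config=config)
        for result in results:
            print(result.url, result.success)

asyncio.run(main())
```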
Motivation
Many websites specify Crawl-delay directives in their robots.txt to indicate how long crawlers should wait between requests. Respecting this directive helps the crawler stay within each site's stated policies and avoid being rate-limited or blocked.
Changes
New Feature:
respect_crawl_delay configuration parameter
Files Modified:
async_configs.py - Added respect_crawl_delay parameter to CrawlerRunConfig
models.py - Added crawl_delay field to DomainState dataclass
utils.py - Added get_crawl_delay() method to RobotsParser
async_dispatcher.py - Enhanced RateLimiter to support crawl-delay (see the sketch after this list)
async_webcrawler.py - Wired up respect_crawl_delay in arun_many()
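To make the mechanism concrete, the sketch below shows one way a Crawl-delay value can be read from robots.txt (here via Python's standard urllib.robotparser) and enforced per domain. It does not reproduce the code added to utils.py or async_dispatcher.py; fetch_crawl_delay() and DomainThrottle are hypothetical names used only for illustration.

```python
# Standalone sketch of the crawl-delay idea; not this PR's implementation.
import asyncio
import time
import urllib.robotparser
from typing import Dict, Optional
from urllib.parse import urlparse

def fetch_crawl_delay(robots_url: str, user_agent: str = "*") -> Optional[float]:
    """Return the Crawl-delay advertised for user_agent, or None if absent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else None

class DomainThrottle:
    """Remembers the last request time per domain and sleeps to honor its delay."""

    def __init__(self) -> None:
        self._last_request: Dict[str, float] = {}

    async def wait(self, url: str, delay: Optional[float]) -> None:
        if not delay:
            return
        domain = urlparse(url).netloc
        now = time.monotonic()
        last = self._last_request.get(domain)
        if last is not None:
            remaining = delay - (now - last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self._last_request[domain] = time.monotonic()
```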
Files Added:
test_crawl_delay.py - Comprehensive test suite
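As an illustration of the kind of case such a suite can cover, here is a minimal pytest sketch that checks Crawl-delay parsing with the standard-library parser; it is not taken from the PR's test file.

```python
# Illustrative test only; not from tests/general/test_crawl_delay.py.
import urllib.robotparser

def test_crawl_delay_directive_is_parsed():
    parser = urllib.robotparser.RobotFileParser()
    parser.parse([
        "User-agent: *",
        "Crawl-delay: 5",
        "Allow: /",
    ])
    assert parser.crawl_delay("*") == 5
```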
Documentation Updates
Running Tests
python -m pytest tests/general/test_crawl_delay.py -v
Checklist: