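class GetTextCB(Callback):
    # Runs after the page at index `idx` has been visited.
    async def after_visit(self, crawler, idx):
        if crawler.pages[idx].url == 'https://fastcore.fast.ai/':
            # Look up the heading by XPath; the result is indexed below, so it can hold several matches.
            loc = await crawler.pages[idx].find_ele('//span[contains(text(), "Welcome to fastcore")]')
            if loc:
                assert await loc[0].get_text() == "Welcome to fastcore"

# Crawl both URLs with two concurrent pages and the callback attached.
C = Crawl(2, ['https://solveit.fast.ai/', 'https://fastcore.fast.ai/'], [GetTextCB()])
await C.run(headless=False)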
crawler
Crawler with callbacks
Web Crawler with Callback System
This crawler implements a flexible web scraping system with callback hooks for extensibility, inspired by fastai’s callback system.
## Architecture

The crawler operates with two main callback hooks, `before_visit` and `after_visit`, with an `ord` parameter controlling execution order.
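The hook mechanism described above can be sketched with stand-in classes. This is a minimal illustration, not the library's implementation: `MiniCrawler`, `SetupCB`, `LogCB`, and the `log` list are hypothetical names, and the rule that lower `ord` values run first is an assumption borrowed from how fastai orders its callbacks.

```python
import asyncio

class Callback:
    """Stand-in base class: both hooks do nothing unless overridden."""
    ord = 0  # assumption of this sketch: lower values run first

    async def before_visit(self, crawler, idx): pass
    async def after_visit(self, crawler, idx): pass

class LogCB(Callback):
    ord = 1
    async def before_visit(self, crawler, idx):
        crawler.log.append(f"log:before:{idx}")
    async def after_visit(self, crawler, idx):
        crawler.log.append(f"log:after:{idx}")

class SetupCB(Callback):
    ord = 0
    async def before_visit(self, crawler, idx):
        crawler.log.append(f"setup:before:{idx}")

class MiniCrawler:
    """Hypothetical crawler that only demonstrates hook dispatch."""
    def __init__(self, cbs):
        self.cbs = sorted(cbs, key=lambda cb: cb.ord)  # ordered execution
        self.log = []

    async def _call(self, hook, idx):
        # Invoke the named hook on every callback, in `ord` order.
        for cb in self.cbs:
            await getattr(cb, hook)(self, idx)

    async def visit(self, idx):
        await self._call("before_visit", idx)
        self.log.append(f"visit:{idx}")  # placeholder for page navigation
        await self._call("after_visit", idx)

crawler = MiniCrawler([LogCB(), SetupCB()])
asyncio.run(crawler.visit(0))
print(crawler.log)
```

Although `LogCB` is passed first, `SetupCB` runs before it because of its smaller `ord`. Each callback receives the crawler itself, which is what gives callbacks access to shared state.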
## Key Features

- Parallel Processing:
  - Configurable number of pages (`np`) for concurrent processing
  - Efficient browser resource management
- URL Management:
  - Input: List of URLs to visit (`to_visit`)
  - Tracks progress through callback-accessible sets:
    - `visited`: Already processed URLs
    - `unvisited`: Pending URLs
    - `visit_window`: Current batch of URLs (size = `np`)
- Callback System:
  - Extensible through custom callbacks
  - Ordered execution (`ord`)
  - Full access to crawler state
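The URL bookkeeping listed above can be illustrated with a small batching loop. This is a sketch under assumptions: `fake_visit` and `crawl` are hypothetical helpers, no browser is involved, and `unvisited` is kept as a list here (rather than a set) so that batches come out in a predictable order.

```python
import asyncio

async def fake_visit(url):
    """Placeholder for loading a page in a browser tab."""
    await asyncio.sleep(0)
    return url

async def crawl(to_visit, np=2):
    unvisited = list(dict.fromkeys(to_visit))  # de-duplicate, keep order
    visited = set()
    windows = []  # recorded only to show how the batches were formed
    while unvisited:
        # Take up to `np` pending URLs as the current batch.
        visit_window, unvisited = unvisited[:np], unvisited[np:]
        windows.append(visit_window)
        # Process the whole batch concurrently.
        done = await asyncio.gather(*(fake_visit(u) for u in visit_window))
        visited.update(done)
    return visited, unvisited, windows

urls = ["https://a.example/", "https://b.example/", "https://c.example/"]
visited, unvisited, windows = asyncio.run(crawl(urls, np=2))
print(windows)
```

With three URLs and `np=2`, the loop forms one batch of two and one batch of one, and ends with every URL in `visited` and nothing left in `unvisited`.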
Crawl
Crawl (np:int=1, to_visit:Optional[List[str]]=None, cbs=None)
Initialize self. See help(type(self)) for accurate signature.
Parameter | Type | Default | Details
---|---|---|---
np | int | 1 | Number of pages processed concurrently
to_visit | Optional | None | List of URLs to visit
cbs | NoneType | None | Callbacks to run during the crawl
callback
Extract the text of elements matching a given XPath, as `GetTextCB` does in the example above.
To traverse all webpages within the same domain, use `TraveseSameDomainCB`.
TraveseSameDomainCB
TraveseSameDomainCB (url)
Callback that helps traverse all links available within the same domain.
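url = 'https://solveit.fast.ai/'
# Start from a single URL with five concurrent pages; the callback queues same-domain links.
C = Crawl(5, [url], [TraveseSameDomainCB(url)])
await C.run(headless=True)
# Every visited URL stays on the starting domain, and nothing is left pending.
assert all([domain(i)==domain(url) for i in C.visited])
assert len(C.unvisited) == 0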
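The same-domain check in the assertions above relies on a `domain` helper whose definition is not shown here. A minimal sketch of how such a helper and the link filtering could work, using only the standard library, follows; both `domain` as written and `same_domain_links` are hypothetical stand-ins, not the library's code.

```python
from urllib.parse import urljoin, urlparse

def domain(url):
    """Host part of a URL, lower-cased (hypothetical stand-in)."""
    return urlparse(url).netloc.lower()

def same_domain_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep those on the same host."""
    out = []
    for href in hrefs:
        absolute = urljoin(base_url, href).split("#")[0]  # drop fragments
        if domain(absolute) == domain(base_url) and absolute not in out:
            out.append(absolute)
    return out

links = same_domain_links(
    "https://solveit.fast.ai/",
    ["/privacy", "terms", "https://fastcore.fast.ai/", "/privacy#top"],
)
print(links)
```

Relative links are resolved before the comparison, so `/privacy` and `terms` are kept, the `fastcore.fast.ai` link is dropped, and the fragment variant of `/privacy` is treated as a duplicate.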
Crawl a URL and save it as Markdown
ToMDCB
ToMDCB (base_dir='PW')
Callback that saves each visited page as a Markdown file under `base_dir`.
url = 'https://solveit.fast.ai/'
C = Crawl(3, [url], [TraveseSameDomainCB(url), ToMDCB()])
await C.run(headless=False)
writing to fn=Path('PW/solveit_fast_ai/index.md')
writing to fn=Path('PW/solveit_fast_ai_privacy/index.md')
writing to fn=Path('PW/solveit_fast_ai_course_info/index.md')
writing to fn=Path('PW/solveit_fast_ai_terms/index.md')
writing to fn=Path('PW/solveit_fast_ai_learn_more/index.md')
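The log lines above suggest that each URL is turned into a directory name under `base_dir`, with an `index.md` inside. One way such a mapping could be written is sketched below; `url_to_md_path` is a hypothetical helper, the naming rule is inferred from the log output only, and the `/privacy` URL in the usage lines is assumed rather than taken from the crawl.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

def url_to_md_path(url, base_dir="PW"):
    """Map a URL to <base_dir>/<slug>/index.md (inferred naming rule)."""
    parts = urlparse(url)
    raw = parts.netloc + parts.path
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^0-9A-Za-z]+", "_", raw).strip("_")
    return Path(base_dir) / slug / "index.md"

print(url_to_md_path("https://solveit.fast.ai/"))
print(url_to_md_path("https://solveit.fast.ai/privacy"))
```

Giving every page its own directory keeps pages with similar names apart and leaves room for page assets next to `index.md`.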