from pw.crawler import *

class GetTextCB(Callback):
    # Hook that runs after each page has been visited
    async def after_visit(self, crawler, idx):
        if crawler.pages[idx].url == 'https://fastcore.fast.ai/':
            # Look up the welcome heading with an XPath query
            loc = await crawler.pages[idx].find_ele('//span[contains(text(), "Welcome to fastcore")]')
            if loc:
                assert await loc[0].get_text() == "Welcome to fastcore"

C = Crawl(2, ['https://solveit.fast.ai/', 'https://fastcore.fast.ai/'], [GetTextCB()])
# Top-level await works in a notebook or IPython session
await C.run(headless=False)
pw
playwright
Web Crawler with Callback System
This crawler implements a flexible web scraping system with callback hooks for extensibility, inspired by fastai’s callback system. The library is motivated by AnswerDotAI's playwrightnb. It works by running extraction on multiple pages in a single browser window.
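The callback idea can be illustrated with a minimal, browser-free sketch. Everything here (`MiniCrawl`, `RecordCB`, the `before_visit` hook, the semaphore-based concurrency limit) is a hypothetical stand-in written for illustration and is not the pw implementation; only the `after_visit(self, crawler, idx)` signature mirrors the example in this README.

```python
import asyncio


class Callback:
    """Base class: every hook is a no-op unless a subclass overrides it."""
    async def before_visit(self, crawler, idx): pass
    async def after_visit(self, crawler, idx): pass


class RecordCB(Callback):
    """Example callback that records which URLs were visited."""
    def __init__(self):
        self.seen = []

    async def after_visit(self, crawler, idx):
        self.seen.append(crawler.urls[idx])


class MiniCrawl:
    """Hypothetical stand-in for a crawler; it never opens a browser."""
    def __init__(self, n_workers, urls, cbs=()):
        self.n_workers, self.urls, self.cbs = n_workers, list(urls), list(cbs)

    async def _fire(self, event, idx):
        # Call the named hook on every registered callback, in order.
        for cb in self.cbs:
            await getattr(cb, event)(self, idx)

    async def _visit(self, idx, sem):
        async with sem:  # cap how many pages are handled at once
            await self._fire("before_visit", idx)
            await asyncio.sleep(0)  # placeholder for loading the page
            await self._fire("after_visit", idx)

    async def run(self):
        sem = asyncio.Semaphore(self.n_workers)
        await asyncio.gather(*(self._visit(i, sem) for i in range(len(self.urls))))


def demo():
    cb = RecordCB()
    crawl = MiniCrawl(2, ["https://solveit.fast.ai/", "https://fastcore.fast.ai/"], [cb])
    asyncio.run(crawl.run())
    return sorted(cb.seen)
```

The point of the design is that the crawl loop stays fixed while behaviour is added by passing callback objects, so extraction logic lives in small classes that can be tested separately.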
Developer Guide
If you are new to using nbdev, here are some useful pointers to get you started.
Install pw in Development mode
# make sure pw package is installed in development mode
$ pip install -e .
# make changes under nbs/ directory
# ...
# compile to have changes apply to pw
$ nbdev_prepare
Usage
Installation
Install latest from the GitHub repository:
$ pip install git+https://github.com/tripathysagar/pw.git
or from conda
$ conda install -c tripathysagar pw
or from pypi
$ pip install pw
Documentation
Documentation is hosted on this GitHub repository’s pages. Package-manager-specific guidelines are available on conda and pypi respectively.
How to use
A few more examples are in crawler.