pw

playwright

Web Crawler with Callback System

This crawler implements a flexible web scraping system with callback hooks for extensibility, inspired by fastai’s callback system. The library is motivated by AnswerDotAI’s playwrightnb. It works by running extraction on multiple pages in a single browser window.
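To make the callback idea concrete, here is a minimal sketch of hook dispatch over several pages at once. It is not pw's implementation: `ToyCrawl`, `ToyPage`, `CollectCB`, and the `before_visit` hook are hypothetical names invented for illustration, and page loading is faked so the example runs without a browser.

```python
import asyncio


class Callback:
    """Base class: every hook is a no-op, so subclasses override only what they need."""

    async def before_visit(self, crawler, idx):  # hypothetical hook name
        pass

    async def after_visit(self, crawler, idx):
        pass


class ToyPage:
    """Stand-in for a browser page; holds a URL and fake extracted text."""

    def __init__(self, url):
        self.url = url
        self.text = None


class ToyCrawl:
    """Toy crawler (an assumption, not pw's Crawl) that fires hooks around each visit."""

    def __init__(self, urls, cbs=()):
        self.pages = [ToyPage(u) for u in urls]
        self.cbs = list(cbs)

    async def _fire(self, hook, idx):
        # Call the named hook on every callback, in registration order.
        for cb in self.cbs:
            await getattr(cb, hook)(self, idx)

    async def _visit(self, idx):
        await self._fire("before_visit", idx)
        await asyncio.sleep(0)  # placeholder for real navigation
        self.pages[idx].text = f"content of {self.pages[idx].url}"
        await self._fire("after_visit", idx)

    async def run(self):
        # Schedule all visits concurrently, loosely mirroring several pages in one window.
        await asyncio.gather(*(self._visit(i) for i in range(len(self.pages))))


class CollectCB(Callback):
    """Example callback: records the text of every visited page, keyed by URL."""

    def __init__(self):
        self.seen = {}

    async def after_visit(self, crawler, idx):
        page = crawler.pages[idx]
        self.seen[page.url] = page.text


def demo():
    cb = CollectCB()
    crawl = ToyCrawl(["https://example.com/a", "https://example.com/b"], [cb])
    asyncio.run(crawl.run())
    return cb.seen


if __name__ == "__main__":
    print(demo())
```

Because the base class defines every hook as a no-op, a callback such as `CollectCB` only needs to override the one hook it cares about, and the crawler itself never has to know what any callback does.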

Developer Guide

If you are new to nbdev, here are some useful pointers to get you started.

Install pw in Development mode

# make sure pw package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to pw
$ nbdev_prepare

Usage

Installation

Install latest from the GitHub repository:

$ pip install git+https://github.com/tripathysagar/pw.git

or from conda:

$ conda install -c tripathysagar pw

or from PyPI:

$ pip install pw

Documentation

Documentation is hosted on this repository’s GitHub Pages site. Package-manager-specific guidelines are available on conda and PyPI.

How to use

A few other examples are in crawler.

from pw.crawler import *

class GetTextCB(Callback):
    # after_visit hook: idx selects the page in crawler.pages
    async def after_visit(self, crawler, idx):
        if crawler.pages[idx].url == 'https://fastcore.fast.ai/':
            # XPath lookup for the span containing the welcome text
            loc = await crawler.pages[idx].find_ele('//span[contains(text(), "Welcome to fastcore")]')
            if loc:
                assert await loc[0].get_text() == "Welcome to fastcore"

C = Crawl(2, ['https://solveit.fast.ai/', 'https://fastcore.fast.ai/'], [GetTextCB()])
# Top-level await works in a notebook or IPython session
await C.run(headless=False)
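The example above uses a bare `await`, which works in a notebook or IPython session. In a plain Python script a coroutine has to be driven by an event loop instead, for example with `asyncio.run`. A minimal sketch of that pattern follows; `fake_run` is a hypothetical stand-in for `C.run(...)` so the snippet runs without pw or a browser.

```python
import asyncio


async def fake_run(headless: bool = True) -> str:
    """Hypothetical stand-in for `C.run(...)`; does no real crawling."""
    await asyncio.sleep(0)
    return f"finished (headless={headless})"


def main() -> str:
    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and closes the loop afterwards.
    return asyncio.run(fake_run(headless=True))


if __name__ == "__main__":
    print(main())
```

In a real script, the call to `fake_run` would presumably be replaced by building the crawler and awaiting its `run` coroutine inside an `async def`.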