helper

Auxiliary helper functions

View MD in notebook


Extract table to dataframe


table2df

 table2df (table:playwright.async_api._generated.Locator)

Given an HTML table element, extracts the table and converts it to a pandas DataFrame.

async with setup_browser(n=1) as obj:
    if obj.is_valid:
        page = obj.pages[0]
        await page.goto("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
        await page.wait()
        ele = await page.find_ele('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]') 
        assert len(ele) != 0

        df = await ele[0].table2df()
        assert len(df) != 0

df.head()
   Rank                   Country  Companies
0     1  United States of America         22
1     2                     China         11
2     3                   Germany          4
3     4            United Kingdom          2
4     4               Switzerland          2
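
The row-extraction step behind a table-to-dataframe helper can be illustrated with the standard library alone. The sketch below is not the implementation of `table2df`; `TableRows` and `table_to_records` are hypothetical names, and it assumes a simple table whose first row is the header. The resulting list of dicts is the kind of input `pandas.DataFrame` accepts directly.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, None outside <tr>
        self._cell = None  # text chunks of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Hypothetical helper: treat the first row as the header, the rest as records."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


sample = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Companies</th></tr>
  <tr><td>1</td><td>United States of America</td><td>22</td></tr>
  <tr><td>2</td><td>China</td><td>11</td></tr>
</table>
"""
records = table_to_records(sample)
```

All cell values come back as strings here; type conversion is left to whatever consumes the records.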

Extract html object to md


h2md

 h2md (ele:Union[playwright.async_api._generated.Page,playwright.async_api._generated.Locator])

Convert the HTML of `ele` to markdown using `HTML2Text`.

async with setup_browser(n=1) as obj:
    if obj.is_valid:
        page = obj.pages[0]
        await page.goto("https://example.com/")
        await page.wait()        
        print_md(await page.h2md())

Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information…

async with setup_browser(n=1) as obj:
    if obj.is_valid:
        page = obj.pages[0]
        await page.goto("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
        await page.wait()
        ele = await page.find_ele('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]') 
        print_md(await ele[0].h2md())

Breakdown by country

Rank | Country | Companies
1 | United States of America | 22
2 | China | 11
3 | Germany | 4
4 | United Kingdom | 2
4 | Switzerland | 2
6 | Japan | 1
6 | France | 1
6 | Italy | 1
6 | India | 1
6 | Netherlands | 1
6 | South Korea | 1
6 | Saudi Arabia | 1
6 | Singapore | 1
6 | Taiwan | 1

Domain helpers


domain

 domain (url:str)

Extract the domain (i.e. the netloc) from a URL.

urls = ['https://fast.ai/getting_started.html', 'https://fast.ai/getting_started.html#copyright', 'https://fast.ai/getting_started.html#year=2008-09&quarter=quarter1?a=3']
assert domain("") == ""
assert domain(urls[0]) == 'fast.ai'
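
For readers unfamiliar with the term, the netloc is the host part that `urllib.parse` reports for a URL. A minimal sketch of the same behaviour, assuming nothing about how `domain` is actually written (`domain_sketch` is a hypothetical name):

```python
from urllib.parse import urlparse


def domain_sketch(url: str) -> str:
    """Return the netloc of `url`; an empty string has an empty netloc."""
    return urlparse(url).netloc


empty = domain_sketch("")
host = domain_sketch("https://fast.ai/getting_started.html#copyright")
```

Note that the netloc also includes a port or credentials when the URL carries them, so it is not always a bare hostname.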

is_same_resource

 is_same_resource (url1:str, url2:str)

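Takes in two URLs and checks whether the two URLs have any query params. In the assertions below, URLs that differ only by a plain fragment (`#copyright`) count as the same resource, while a fragment that carries `key=value` parameters does not.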

assert is_same_resource(*urls[:-1])
assert not is_same_resource(*urls[1:])
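
One way to reproduce the two assertions above is to drop plain fragments before comparing and keep fragments that look like parameters. This is a guess at the intent, not the library's implementation; `is_same_resource_sketch` and the "fragment contains `=`" rule are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def _canonical(url: str) -> str:
    """Rebuild `url`, keeping its fragment only if it looks like key=value data."""
    parts = urlsplit(url)
    # Assumption: a fragment such as "copyright" is just an in-page anchor,
    # while one containing "=" changes what the page shows.
    fragment = parts.fragment if "=" in parts.fragment else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, fragment))


def is_same_resource_sketch(url1: str, url2: str) -> bool:
    """Compare two URLs after normalising away plain anchors."""
    return _canonical(url1) == _canonical(url2)


urls = [
    "https://fast.ai/getting_started.html",
    "https://fast.ai/getting_started.html#copyright",
    "https://fast.ai/getting_started.html#year=2008-09&quarter=quarter1?a=3",
]
same = is_same_resource_sketch(*urls[:-1])
different = is_same_resource_sketch(*urls[1:])
```

Because the query string is kept in the canonical form, two URLs with different `?key=value` parts also compare as different under this sketch.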

url2fn

 url2fn (url:str)

*Takes in a URL and returns a filename by dropping the scheme and substituting the remaining non-alphanumeric characters with `_`.*

[url2fn(i) for i in urls]
['fast_ai_getting_started_html',
 'fast_ai_getting_started_html_copyright',
 'fast_ai_getting_started_html_year_2008_09_quarter_quarter1_a_3']
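
The outputs above are consistent with stripping the scheme and collapsing every run of non-alphanumeric characters into a single underscore. A minimal sketch under that assumption (`url2fn_sketch` is a hypothetical name, and the real function may treat edge cases differently):

```python
import re


def url2fn_sketch(url: str) -> str:
    """Turn a URL into a filename-safe string."""
    # Drop a leading scheme such as "https://".
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    # Replace each run of characters outside [0-9a-zA-Z] with one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", without_scheme)
    # Avoid leading or trailing underscores, e.g. from a trailing slash.
    return name.strip("_")


urls = [
    "https://fast.ai/getting_started.html",
    "https://fast.ai/getting_started.html#copyright",
    "https://fast.ai/getting_started.html#year=2008-09&quarter=quarter1?a=3",
]
names = [url2fn_sketch(u) for u in urls]
```

Distinct URLs can map to the same name under this scheme (for example `a.b/c` and `a.b_c`), which is worth keeping in mind if the filenames are used as cache keys.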