helper

Auxiliary helper functions

View MD in `notebook`

print_md

 print_md (s:str)

Given a string, display it as rendered Markdown in the notebook.

Extract table to dataframe

table2df

 table2df (table:playwright.async_api._generated.Locator)

Given an HTML table element, extract the table and convert it to a pandas DataFrame.

async with setup_browser(n=1) as obj:
    if obj.is_valid:
        page = obj.pages[0]
        await page.goto("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
        await page.wait()
        ele = await page.find_ele('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]') 
        assert len(ele) != 0

        df = await ele[0].table2df()
        assert len(df) != 0

df.head()

	Rank	Country	Companies
0	1	United States of America	22
1	2	China	11
2	3	Germany	4
3	4	United Kingdom	2
4	4	Switzerland	2
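The implementation of `table2df` is not shown on this page. As a rough illustration of the underlying idea (turning table markup into rows keyed by the header), here is a minimal standard-library sketch. The names `TableRows` and `table_html_to_records` are hypothetical and not part of this library, and the sample markup is hand-written from the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Hypothetical helper: collect the text of every <th>/<td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built
        self._cell = None   # text fragments of the cell currently being built

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_html_to_records(html: str) -> list:
    """Return one dict per body row, using the first row as the header."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


sample = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Companies</th></tr>
  <tr><td>1</td><td>United States of America</td><td>22</td></tr>
  <tr><td>2</td><td>China</td><td>11</td></tr>
</table>
"""
records = table_html_to_records(sample)
print(records[0])
```

A list of dicts like `records` can be passed directly to `pandas.DataFrame(records)` to obtain a dataframe. All cell values stay strings in this sketch.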

Extract html object to md

h2md

 h2md (ele:Union[playwright.async_api._generated.Page,playwright.async_api._generated.Locator])

Convert the HTML of a page or locator to Markdown using `HTML2Text`.

async with setup_browser(n=1) as obj:
    if obj.is_valid:
        page = obj.pages[0]
        await page.goto("https://example.com/")
        await page.wait()        
        print_md(await page.h2md())

Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information…

async with setup_browser(n=1) as obj:
    if obj.is_valid:
        page = obj.pages[0]
        await page.goto("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
        await page.wait()
        ele = await page.find_ele('//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]') 
        print_md(await ele[0].h2md())

Breakdown by country Rank | Country | Companies
1 | United States of America | 22
2 | China | 11
3 | Germany | 4
4 | United Kingdom | 2
4 | Switzerland | 2
6 | Japan | 1
6 | France | 1
6 | Italy | 1
6 | India | 1
6 | Netherlands | 1
6 | South Korea | 1
6 | Saudi Arabia | 1
6 | Singapore | 1
6 | Taiwan | 1

Domain helpers

domain

 domain (url:str)

Extract the domain (i.e., the netloc) from a URL.

urls = ['https://fast.ai/getting_started.html', 'https://fast.ai/getting_started.html#copyright', 'https://fast.ai/getting_started.html#year=2008-09&quarter=quarter1?a=3']
assert domain("") == ""
assert domain(urls[0]) == 'fast.ai'
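For reference, the same behaviour can be reproduced with the standard library's `urllib.parse.urlparse`. This is a minimal sketch under the assumption that `domain` returns the parsed netloc unchanged. The name `domain_sketch` is hypothetical, and the actual implementation may differ.

```python
from urllib.parse import urlparse


def domain_sketch(url: str) -> str:
    """Return the netloc component of `url`, or "" if there is none."""
    return urlparse(url).netloc


print(domain_sketch("https://fast.ai/getting_started.html"))
print(repr(domain_sketch("")))
```

Note that the netloc includes any port or `user:password@` prefix present in the URL, so this sketch does not strip them.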

is_same_resource

 is_same_resource (url1:str, url2:str)

Takes two URLs and checks whether either of them has any query parameters.

assert is_same_resource(*urls[:-1])
assert not is_same_resource(*urls[1:])
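The logic behind `is_same_resource` is not shown here, so the following is only a guess that happens to satisfy the two assertions above. It assumes that two URLs count as the same resource when they share scheme, host, and path, and neither carries query-like parameters, including `key=value` text placed after the `#`. The names `has_query_like_part` and `is_same_resource_sketch` are hypothetical.

```python
from urllib.parse import urlsplit


def has_query_like_part(url: str) -> bool:
    """True if the URL has a query string or query-like text in its fragment."""
    parts = urlsplit(url)
    # A '?' that appears after '#' is parsed as part of the fragment,
    # so the fragment is inspected as well (assumption of this sketch).
    return bool(parts.query) or "?" in parts.fragment or "=" in parts.fragment


def is_same_resource_sketch(url1: str, url2: str) -> bool:
    """Compare scheme, host and path; any query-like part means 'different'."""
    a, b = urlsplit(url1), urlsplit(url2)
    same_location = (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)
    return same_location and not (has_query_like_part(url1) or has_query_like_part(url2))


urls = [
    "https://fast.ai/getting_started.html",
    "https://fast.ai/getting_started.html#copyright",
    "https://fast.ai/getting_started.html#year=2008-09&quarter=quarter1?a=3",
]
print(is_same_resource_sketch(*urls[:-1]))
print(is_same_resource_sketch(*urls[1:]))
```

Under these assumptions a plain anchor such as `#copyright` does not change the resource, while parameter-style fragments do.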

url2fn

 url2fn (url:str)

*Takes a URL and returns a filename by substituting its special characters with `_`.*

[url2fn(i) for i in urls]

['fast_ai_getting_started_html',
 'fast_ai_getting_started_html_copyright',
 'fast_ai_getting_started_html_year_2008_09_quarter_quarter1_a_3']
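One way to obtain filenames like the ones above is to drop the scheme and collapse every run of non-alphanumeric characters into a single underscore. This is a sketch under that assumption, not necessarily how `url2fn` is written. The name `url2fn_sketch` is hypothetical.

```python
import re


def url2fn_sketch(url: str) -> str:
    """Turn a URL into a filename-safe string made of letters, digits and '_'."""
    # Drop a leading scheme such as "https://" (assumption of this sketch).
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    # Replace each run of other characters with one underscore, then trim the ends.
    return re.sub(r"[^0-9a-zA-Z]+", "_", without_scheme).strip("_")


urls = [
    "https://fast.ai/getting_started.html",
    "https://fast.ai/getting_started.html#copyright",
    "https://fast.ai/getting_started.html#year=2008-09&quarter=quarter1?a=3",
]
print([url2fn_sketch(u) for u in urls])
```

Because different separators all map to `_`, distinct URLs can collide on the same filename (for example `a/b` and `a.b`), which is worth keeping in mind when the result is used as a cache key.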
