Lecture 4
Extracting structured data from web pages using Python, BeautifulSoup, and XPath techniques
This work is licensed under CC BY-NC-SA 4.0
© Way-Up 2025
Html
Content displayed by your web browser is contained in html (Hyper Text Markup Language) files
html is a markup language, close to xml, but with fewer constraints
what you saw on the previous slide is coded like this:
<section class="slide">
  <h2>How to handle Data present in the Web? (2)</h2>
  <p>Html</p>
  <ul>
    <li>Content displayed from your web browser is contained in <code>html</code> (Hyper Text Markup Language) files</li>
    <li><code>html</code> is a markup language, close to xml, but with less constraints</li>
    <li>the browser is highly <b>fault tolerant</b>, not many websites would display at all if the browser was strictly enforcing standards</li>
  </ul>
</section>
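To see that fault tolerance in practice, here is a minimal sketch (not from the slides): it feeds deliberately broken HTML to BeautifulSoup, which still builds a usable tree, much like a browser would.

from bs4 import BeautifulSoup

# deliberately malformed HTML: none of the tags are ever closed
broken = "<html><body><p>first paragraph<p>second paragraph"

soup = BeautifulSoup(broken, "html.parser")
print(soup.get_text())          # the text is still recoverable
print(len(soup.find_all("p")))  # both <p> elements were reconstructed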
Exercise: working on a manual extract
from lxml import etree
import requests
from bs4 import BeautifulSoup as bs

def getContent(link: str) -> str:
    # download the page and normalise its markup through BeautifulSoup
    webPage = requests.get(link)
    return str(bs(webPage.content, "html.parser"))

content: str = getContent("https://en.wikipedia.org/wiki/List_of_international_airports_by_country")
# parse the cleaned markup into an lxml tree so it can be queried with XPath
html: etree.ElementBase = etree.HTML(content)
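As a quick sanity check (not part of the original exercise), you can verify that the tree was built by querying an element you know exists; this assumes the article title sits in the page's single h1, which holds for Wikipedia at the time of writing:

# hypothetical check: grab the article title, assuming one <h1> on the page
title = html.xpath("//h1")[0]
print("".join(title.itertext()))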
Each extracted row should follow the format [Region, SubRegion, Country, City, Airport name, IATA Code], for example:
['Americas', 'Caribbean', 'Cuba', 'Holguín', 'Frank País Airport', 'HOG']
# RELATIVE_ROOT (the XPath prefix of the page's content area) is defined
# elsewhere in the exercise
allNodes: list[etree.ElementBase] = html.xpath(RELATIVE_ROOT + "/*")
length: int = len(allNodes)
# ... (loop setup elided; h2Count is maintained in the elided code)
for i in range(length):
    node: etree.ElementBase = allNodes[i]
    # an <h2> node marks a region heading
    if node.tag == 'h2' and h2Count != 0:
        currentRegion = node.xpath("./span[@class='mw-headline'][1]")[0].text
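For context, here is one possible way to complete the traversal. It is a minimal sketch that assumes h2, h3 and h4 headings mark Region, SubRegion and Country respectively, and that each table lists City, Airport name and IATA code in its first three columns; the live page may differ, so adjust accordingly.

from lxml import etree

def extractRows(html: etree.ElementBase, relativeRoot: str) -> list[list[str]]:
    # walk the content area, remembering the most recent headings,
    # and emit one row per airport table line
    region = subRegion = country = ""
    rows: list[list[str]] = []
    for node in html.xpath(relativeRoot + "/*"):
        headline = node.xpath(".//span[@class='mw-headline'][1]")
        if node.tag == 'h2' and headline:
            region = headline[0].text        # assumption: h2 = Region
        elif node.tag == 'h3' and headline:
            subRegion = headline[0].text     # assumption: h3 = SubRegion
        elif node.tag == 'h4' and headline:
            country = headline[0].text       # assumption: h4 = Country
        elif node.tag == 'table':
            # assumption: columns are City | Airport name | IATA code
            for tr in node.xpath(".//tr[td]"):
                cells = ["".join(td.itertext()).strip() for td in tr.xpath("./td")]
                if len(cells) >= 3:
                    rows.append([region, subRegion, country,
                                 cells[0], cells[1], cells[2]])
    return rows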
Longer answer: beware of the automation cost; verify that it is worth it!
Besides web scraping, you can often get data from structured sources like REST APIs, which are designed for machine-to-machine communication.
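As an illustration, querying a REST API typically returns JSON that maps directly onto Python structures; the endpoint and parameters below are placeholders, so substitute the documented URL of the API you actually want to query.

import requests

# hypothetical endpoint, used purely for illustration
response = requests.get("https://api.example.com/airports",
                        params={"country": "Cuba"}, timeout=10)
response.raise_for_status()
airports = response.json()   # the JSON body becomes dicts and lists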
Merging dataframes
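A minimal sketch of the idea using pandas; the column names and passenger figures below are made up for the example.

import pandas as pd

airports = pd.DataFrame({"iata": ["HOG", "HAV"],
                         "airport": ["Frank País Airport", "José Martí Airport"]})
traffic = pd.DataFrame({"iata": ["HOG", "HAV"],
                        "passengers": [400_000, 4_000_000]})  # illustrative values

# an inner join on the shared key keeps only rows present in both frames
merged = pd.merge(airports, traffic, on="iata", how="inner")
print(merged)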