AI & Data

Web Scraping & HTML Parsing

Lecture 4

Extracting structured data from web pages using Python, BeautifulSoup, and XPath techniques

How to handle Data present in the Web? (1)

Web Scraping = using data available through regular web pages

How to handle Data present in the Web? (2)

Html

How to handle Data present in the Web? (3)

Html - example

What you saw in the previous slide is coded like this:

<section class="slide">
    <h2>How to handle Data present in the Web? (2)</h2>
    <p>Html</p>
    <ul>
        <li>Content displayed by your web browser is contained in <code>html</code> (Hyper Text Markup Language) files</li>
        <li><code>html</code> is a markup language, close to xml, but with fewer constraints</li>
        <li>the browser is highly <b>fault tolerant</b>: not many websites would display at all if the browser strictly enforced the standards</li>
    </ul>
</section>
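The "fewer constraints" point can be checked with a small sketch. It uses only the Python standard library; the `SLOPPY` markup, the `TagCollector` class and the `parses_as_xml` helper are made up for this illustration. The lenient html parser accepts unclosed `<li>` tags, while a strict xml parser rejects the same text.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sloppy markup: the <li> tags are never closed.
SLOPPY = "<ul><li>first<li>second</ul>"


class TagCollector(HTMLParser):
    """Records every opening tag the lenient html parser encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(markup: str) -> bool:
    """Return True only if the markup is well-formed xml."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = TagCollector()
collector.feed(SLOPPY)
collector.close()

print(collector.tags)         # the html parser reads all three tags without complaint
print(parses_as_xml(SLOPPY))  # the strict xml parser refuses the same text
```

Browsers go further than this parser: they also repair the tree, which is one reason the DOM can differ from the html that was sent.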

How to handle Data present in the Web? (3)

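Difference between DOM and html

The browser parses the html into the DOM (Document Object Model), repairing invalid markup along the way, and scripts can then modify the result. This is why there is almost always a difference between what you see in your browser and what is fetched from the html page (this is really important for web scraping).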
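A minimal sketch of the gap between fetched html and the DOM, assuming BeautifulSoup is installed. The `RAW_HTML` page below is invented for the illustration: in a browser, its script would add one `<li>` to the list, but a scraper that only parses the fetched html never runs that script.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the list is empty in the html and filled by JavaScript.
RAW_HTML = """
<html><body>
<ul id="airports"></ul>
<script>
  const li = document.createElement("li");
  li.textContent = "CDG";
  document.getElementById("airports").appendChild(li);
</script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
airport_list = soup.find(id="airports")

# The parser sees the static markup only; the script is kept as plain text.
items = airport_list.find_all("li")
print(len(items))  # no <li> exists until a browser executes the script
```

When the data you need only appears in the DOM, the usual options are to look for the underlying data source the script calls, or to drive a real browser.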

How to handle Data present in the Web? (4)

DOM vs html: example

How to get Data present in the Web?

Exercise: working on a manual extract

How to get Data present in the Web?

Exercise: fetch data from a web page and convert it to an Orange Data Table
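from lxml import etree
import requests
from bs4 import BeautifulSoup as bs


def getContent(link: str) -> str:
    """Fetch a page and return its html, normalized by BeautifulSoup."""
    webPage = requests.get(link, timeout=10)
    webPage.raise_for_status()  # fail early on HTTP errors (4xx/5xx)
    return str(bs(webPage.content, "html.parser"))


content: str = getContent("https://en.wikipedia.org/wiki/List_of_international_airports_by_country")
# Parse the normalized html into an lxml tree that can be queried with XPath
html: etree._Element = etree.HTML(content)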
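Once the page is an lxml tree, XPath can pull table cells out as rows. The sketch below runs on a small handwritten table, not on the real Wikipedia page: the `SAMPLE` markup, its `wikitable` class and its column names are assumptions, so the expressions will likely need adjusting for the live page. Converting `header` and `rows` to an Orange Data Table is left to the exercise.

```python
from lxml import etree

# Handwritten sample standing in for one table of the fetched page.
SAMPLE = """
<table class="wikitable">
  <tr><th>City</th><th>Airport</th><th>IATA</th></tr>
  <tr><td>Paris</td><td>Charles de Gaulle Airport</td><td>CDG</td></tr>
  <tr><td>Lyon</td><td>Lyon-Saint Exupery Airport</td><td>LYS</td></tr>
</table>
"""

root = etree.HTML(SAMPLE)

# Column names: text of every <th> in the table.
header = [
    th.xpath("string()").strip()
    for th in root.xpath("//table[@class='wikitable']//tr/th")
]

# Data rows: every <tr> that has at least one <td> child.
rows = [
    [td.xpath("string()").strip() for td in tr.xpath("./td")]
    for tr in root.xpath("//table[@class='wikitable']//tr[td]")
]

print(header)
for row in rows:
    print(row)
```

Using `string()` rather than `text()` collects the text of nested elements too, which matters when cells contain links.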

How to get Data present in the Web? (2)

Limits of direct html fetching

When to use scraping?

Short answer: when you cannot do otherwise

Longer answer: beware of the automation cost, and verify that it is worth it!

Automation: Theory vs Reality

Other web data you can fetch

REST and XML

Besides web scraping, you can often get data from structured sources like REST APIs, which are designed for machine-to-machine communication.
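As a rough sketch of what consuming a REST API looks like, the code below fetches JSON with the standard library and flattens it into rows. The field names (`id`, `name`, `email`), the `SAMPLE_RESPONSE` payload and both helper functions are assumptions made for the illustration; a real API documents its own URLs and fields.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode its JSON body (the URL is supplied by the caller)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def users_to_rows(users: list[dict]) -> list[list]:
    """Flatten user records into rows; the three field names are assumed."""
    return [[user["id"], user["name"], user["email"]] for user in users]


# Hypothetical response body, used here so the example runs offline.
SAMPLE_RESPONSE = """
[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Linus", "email": "linus@example.com"}
]
"""

rows = users_to_rows(json.loads(SAMPLE_RESPONSE))
print(rows)
```

Unlike scraped html, the response is already structured, so no XPath or tag hunting is needed: the keys of each record play the role of column names.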

Aggregating different sources with merge and concatenation

Exercise: Build an Orange Data table from those two tables (users and posts)

Merging dataframes
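A minimal sketch of merge versus concatenation, assuming pandas is used for the dataframes. The `users` and `posts` contents and the `userId` key are invented for the illustration. Merging joins two tables on a shared key (adding columns), while concatenation stacks tables that share the same columns (adding rows).

```python
import pandas as pd

# Hypothetical tables: each post points to its author through userId.
users = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Linus"]})
posts = pd.DataFrame(
    {"id": [10, 11, 12], "userId": [1, 1, 2], "title": ["first", "second", "third"]}
)

# Merge: attach the author's columns to every post (a left join on the key).
merged = posts.merge(
    users,
    left_on="userId",
    right_on="id",
    how="left",
    suffixes=("_post", "_user"),  # both tables have an "id" column
)

# Concatenation: append more posts that have the same columns.
more_posts = pd.DataFrame({"id": [13], "userId": [2], "title": ["fourth"]})
all_posts = pd.concat([posts, more_posts], ignore_index=True)

print(merged)
print(all_posts)
```

A left join keeps every post even when its author is missing from `users`; the author columns are then filled with missing values instead of the row being dropped.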
