Data Formats
Session 3
Parsing and extracting data from XML documents using XPath and Python libraries
2026 WayUp
eXtensible Markup Language
XML is a hierarchical data format designed for structured information exchange between systems
Data organized as a tree of nodes with parent-child relationships
Elements (tags), text content, and attributes
Ideal for representing nested, detailed data with metadata
Visualizing hierarchical relationships
Every XML document forms a tree with a single root element and nested children
| Root | breakfast_menu |
| Child | food |
| Grandchild | name, price, calories |
| Text | "Belgian Waffles" |
Understanding the syntax
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with...</description>
<calories>650</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade bread...</description>
<calories>450</calories>
</food>
</breakfast_menu>
<breakfast_menu> - Root element
<food> - Child element (repeatable)
Belgian Waffles - Text node
$5.95 - Text node
Query language for XML navigation
XPath is a query language for selecting nodes from an XML document using path expressions
XPath traces a path from the root element to the target node
Each level in the hierarchy is separated by /
Use [...] to filter results with conditions
| XPath Expression | Meaning |
|---|---|
/breakfast_menu/food | All food elements from root |
./food/name | Relative path to name elements |
//name | All name elements anywhere |
/food[price>5] | Food items with price over $5 |
Using xml.etree.ElementTree
Python includes xml.etree.ElementTree for basic XML parsing without external dependencies
import xml.etree.ElementTree as xmlReader
# Parse XML from file
tree = xmlReader.parse('menu.xml')
root = tree.getroot()
# Iterate through elements
for food in root.findall('food'):
name = food.find('name').text
price = food.find('price').text
print(f"{name}: {price}")
lxml for advanced queries.
Enhanced XPath support
The lxml library provides extensive XPath 1.0 support, making complex queries much easier
from lxml import etree
# Parse XML document
tree = etree.parse('menu.xml')
root = tree.getroot()
# Use powerful XPath queries
expensive_items = root.xpath('//food[number(substring-after(price, "$")) > 5]')
for item in expensive_items:
name = item.find('name').text
price = item.find('price').text
print(f"{name}: {price}")
pip install lxml
Building a data list from XML
Extract all food items into a list of [name, price] pairs for analysis
Use ./ for relative paths from current element
from lxml import etree
tree = etree.parse('menu.xml')
root = tree.getroot()
# Get all food elements
elems = root.findall('./food')
# Extract name and price from each
data = [
[elem.find("./name").text, elem.find("./price").text]
for elem in elems
]
print(data)
# Output: [['Belgian Waffles', '$5.95'], ['French Toast', '$4.50'], ...]
Key syntax patterns
/breakfast_menu/food
Starts with / → query from root
./food/name
Starts with ./ → query from current node
//name
Starts with // → find anywhere in document
//food[starts-with(./name/text(), 'Be')]
Use [...] to filter results
Converting XML to Orange-compatible format
Orange can work with "raw" Python data through its Table.from_list API
from Orange.data import *
from lxml import etree
# Parse XML
tree = etree.parse('menu.xml')
root = tree.getroot()
# Extract data
data = []
for food in root.findall('./food'):
name = food.find('./name').text
price = float(food.find('./price').text.replace('$', ''))
calories = int(food.find('./calories').text)
data.append([name, price, calories])
# Define Orange domain
name_var = StringVariable('name')
price_var = ContinuousVariable('price')
calories_var = ContinuousVariable('calories')
domain = Domain([name_var, price_var, calories_var])
# Create Orange table
table = Table.from_list(domain, data)
Practice with the Python Script widget
Import the menu.xml file through Orange's Python Script widget and convert it to a data table
From the Data category in Orange
Use lxml or xml.etree to read the file
Store result in out_data variable
Connect to Data Table widget to view
out_data variable is automatically passed to connected widgets.
Download this archive with ML publications in XML format
Extract one XML, parse year/month/title, add "topic" = "ML"
Iterate with glob.glob('*.xml'), combine into single dataset
Load into Orange via Python Script, visualize publication trends
1
XML is hierarchical. Think in trees with parent-child relationships, not flat tables.
2
XPath is powerful. Learn XPath syntax to efficiently navigate complex XML structures.
3
lxml over native. Use lxml for production work—it has better XPath support and performance.
4
Orange integration is straightforward. Convert XML to lists, then use Table.from_list().