Lecture 5
Parsing and extracting data from XML documents using XPath and Python libraries for data analysis
This work is licensed under CC BY-NC-SA 4.0
© Way-Up 2025
XML = eXtensible Markup Language
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>Two of our famous Belgian Waffles with...</description>
    <calories>650</calories>
  </food>
</breakfast_menu>
<breakfast_menu>: the root element
<food>: an element
import xml.etree.ElementTree as xmlReader
# read from xml
tree = xmlReader.parse('menu.xml')
from lxml import etree
# read from xml
tree = etree.parse('menu.xml')
root = tree.getroot()
print(root)
lxml supports XPath more fully than xml.etree.ElementTree, which implements only a limited subset, so it is more convenient for queries
elems = root.findall('./food')
data = [[elem.find("./name").text,
         elem.find("./price").text]
        for elem in elems]
print(data)
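The extraction above can be tried without a file on disk. What follows is a minimal sketch using the standard library's ElementTree; it assumes the menu is held in a string (`MENU_XML`) and wraps the list comprehension in a helper (`extract_rows`). Both names are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# Inline copy of the menu shown earlier (assumption: same structure as menu.xml).
MENU_XML = """
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>Two of our famous Belgian Waffles with...</description>
    <calories>650</calories>
  </food>
</breakfast_menu>
"""


def extract_rows(root):
    """Return one [name, price] row per <food> child of root (hypothetical helper)."""
    return [[food.findtext("./name"), food.findtext("./price")]
            for food in root.findall("./food")]


# fromstring() returns the root element directly, so no getroot() call is needed.
root = ET.fromstring(MENU_XML.strip())
rows = extract_rows(root)

print(root.tag)
print(rows)
```

Note that every extracted value is a string, including the price with its currency sign, so numeric columns would still need converting before analysis.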
/: the query looks for data starting from the document root
./: the path is taken relative to the current element
./food[starts-with(./name/text(), 'Be')]: selects the <food> elements whose name starts with 'Be'
Exercise: load this XML file from your preferred Python environment, then do the same in Orange (using the Python Script widget)
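The three path forms listed above can be compared side by side. The sketch below assumes lxml is installed (it is a third-party package) and uses an inline menu in which the second entry, "French Toast", is made up purely to give the predicate something to filter out.

```python
from lxml import etree  # third-party package: pip install lxml

# Inline menu; the second <food> is an illustrative entry, not from menu.xml.
MENU_XML = b"""<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price></food>
  <food><name>French Toast</name><price>$4.50</price></food>
</breakfast_menu>"""

root = etree.fromstring(MENU_XML)
first_food = root[0]  # the first <food> element

# Absolute path: evaluated from the document root, even when called on a child.
absolute = first_food.xpath("/breakfast_menu/food/name/text()")

# Relative path: evaluated from the element the query is called on.
relative = first_food.xpath("./name/text()")

# Predicate: keep only the <food> elements whose name starts with 'Be'.
matches = root.xpath("./food[starts-with(./name/text(), 'Be')]")
names = [food.findtext("name") for food in matches]

print(absolute)
print(relative)
print(names)
```

Called on the same `<food>` element, the absolute query still sees every name in the document, while the relative one sees only that element's own name.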
from Orange.data import ContinuousVariable, DiscreteVariable, Domain, Table

data = [
    ['green', 4, 1.2, 'apple'],
    ['orange', 5, 1.1, 'orange'],
    ['yellow', 4, 1.0, 'peach']
]
# Sort the distinct values so the category order is reproducible (a set is unordered).
color = DiscreteVariable('color', values=sorted({row[0] for row in data}))
calories = ContinuousVariable('calories')
fiber = ContinuousVariable('fiber')
fruit = DiscreteVariable('fruit', values=sorted({row[3] for row in data}))
# Three feature columns plus 'fruit' as the class (target) variable.
domain = Domain([color, calories, fiber], class_vars=fruit)
table = Table.from_list(domain, data)