Parse an XML file into a pandas DataFrame
| informatics | datascience | XML | pandas |
XML is a markup language often used to represent and distribute data structures which can be often difficult to create using more standard tabular formats.
Basically, the XML format is similar to HTML (which is another markup language, indeed), in that data are organized in elements, which define the type of information exposed, and each element contains the actual data in the form of content or attributes.
The XML Wikipedia page offers an extensive overview of all the details and technicalities of this format, but the key concepts are simple. Each piece of information is delimited by a specific tag, like this:
<data>
<student name="John">
<email>[email protected]</email>
<grade>A</grade>
<age>16</age>
</student>
<student name="Alice">
<email>[email protected]</email>
<grade>B</grade>
<age>17</age>
</student>
<student name="Bob">
<email>[email protected]</email>
<grade>C</grade>
<age>16</age>
</student>
<student name="Hannah">
<email>[email protected]</email>
<grade>A</grade>
<age>17</age>
</student>
</data>
In this example, each student is represented by a <student>
element, which has a name
attribute containing the name of a specific student. Each of these elements has then sub-elements defined by the <email>
, <grade>
and <age>
tags; between these tags the actual data content referring to the given student is present. Let’s say this data is saved in an XML file called “students.xml”.
We can think of this structure as a pandas DataFrame in which each student represents an observation, with its name
attribute being the main identifier and the sub-elements being other features of the observation. A tabular representation of these data would be like this:
name | grade | age | |
---|---|---|---|
John | [email protected] | A | 16 |
Alice | [email protected] | B | 17 |
Bob | [email protected] | C | 16 |
Hannah | [email protected] | A | 17 |
So we want to find a way to convert XML-structured data to a more functional table.
Given the structure of XML files, we can represent them as a tree, and this is the approach used by the xml.etree.ElementTree Python module. The parsing of our “students.xml” file starts at the root of the tree, namely the <data>
element, which contains the entire data structure.
import xml.etree.ElementTree as et
xtree = et.parse("students.xml")
xroot = xtree.getroot()
Now we can iterate through each node of the tree, which means we will get each student element and grab its name
attribute and all of its sub-elements to build our dataframe.
for node in xroot:
s_name = node.attrib.get("name")
s_mail = node.find("email").text
s_grade = node.find("grade").text
s_age = node.find("age").text
In order to get the name
attribute, we use the attrib.get()
function, while the text content of each element can be retrieved using the find()
function of nodes.
Each iteration will return a set of data that can be thought as an observation in a pandas DataFrame; we can build this procedure as follows:
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("students.xml")
xroot = xtree.getroot()
df_cols = ["name", "email", "grade", "age"]
out_df = pd.DataFrame(columns = df_cols)
for node in xroot:
s_name = node.attrib.get("name")
s_mail = node.find("email").text if node is not None else None
s_grade = node.find("grade").text if node is not None else None
s_age = node.find("age").text if node is not None else None
out_df = out_df.append(pd.Series([s_name, s_mail, s_grade, s_age],
index = df_cols),
ignore_index = True)
The downside to this approach is that you need to know the structure of the XML file in advance, and you have to hard-code column names accordingly.
We can try to convert this code to a more useful and versatile function, without having to hard-code any values.
import pandas as pd
import xml.etree.ElementTree as et
def parse_XML(xml_file, df_cols):
"""Parse the input XML file and store the result in a pandas DataFrame
with the given columns. The first element of df_cols is supposed to be
the identifier variable, which is an attribute of each node element in
the XML data; other features will be parsed from the text content of
each sub-element. """
xtree = et.parse(xml_file)
xroot = xtree.getroot()
out_df = pd.DataFrame(columns = df_cols)
for node in xroot:
res = []
res.append(node.attrib.get(df_cols[0]))
for el in df_cols[1:]:
if node is not None:
res.append(node.find(el).text)
else:
res.append(None)
out_df = out_df.append(pd.Series(res, index = df_cols), ignore_index = True)
return out_df
If we apply our function to the “students.xml” file using parse_XML("students.xml", ["name", "email", "grade", "age"])
, the result is precisely the table we saw above.
This is a more efficient implementation of the XML parsing function, although we still need to know the basic structure of the input XML document we want to parse.