Optimizing XML parsing in Python

At work we consume a lot of SOAP web services; it's probably a GDS caprice. The most common type of request is a flight availability search.
Due to the sheer size of the responses, parsing them can be time-consuming and taxing on our servers. It's about time we took a Paretofied approach and optimized this critical 20-ish percent of our code. So, with the goal of optimizing the parse function, here are examples of the steps we took (using Python >= 3.4 and lxml).

Metrics, Profiling

Our parse function interface is more or less like this:

def iparse(xml_string):
    """Parse the string xml_string. Return a list of result tuples."""

We need to profile our code to get a good idea of where it spends most of its time. We also need an XML file for testing. Here is code that generates a (hypothetical) XML response and dumps it to stdout:

``` python gen_xml.py
from lxml import etree as ET

def generate_test_xml(n_results=250):
    root = ET.Element("{http://www.w3.org/1999/xhtml}Body")
    response = ET.SubElement(root, "{http://www.w3.org/1999/xhtml}Response")
    for i in range(n_results):
        parent = ET.SubElement(response, "{http://www.w3.org/1999/xhtml}ResponseItem", ResponseItemID=str(i))
        for j in range(100):
            child = ET.SubElement(parent, "{http://www.w3.org/1999/xhtml}ResponseItemDatum", SubItemID=str(j), ResponseItemID=str(i))
            child2 = ET.SubElement(child, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild2", SubItemID=str(j + 1), ResponseItemID=str(i))
            child2.text = str(j * i)
            child3 = ET.SubElement(child2, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3", SubItemID=str(j + 2), ResponseItemID=str(i))
            child3.text = str(j * i)
            child4 = ET.SubElement(child3, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild4", SubItemID=str(j + 3), ResponseItemID=str(i))
            child4.text = str(j * i)
            child5 = ET.SubElement(child3, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild5", SubItemID=str(j + 3), ResponseItemID=str(i))
            child5.text = str(j * i)
    return root


if __name__ == '__main__':
    print(ET.tostring(generate_test_xml(), encoding="unicode"))
```

Then we can generate a simple XML file like so:

    python gen_xml.py > demo.xml

Assuming there is a module named `parsers` on our `PYTHONPATH`, we can test different parser implementation
functions within that module with this small script:
``` python test.py
import argparse
import sys
import logging
from time import time
import parsers

logger = logging.getLogger("parser.test")
logger.setLevel(logging.INFO)

ch = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter('[%(asctime)s] %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)

def test_parser(parser):
    with open("demo.xml") as f:
        s = f.read()
    now = time()
    logger.info("evaluating parser `{}`".format(parser.__name__))
    results = parser(s)
    logger.info("parsing completed in {0:0.5f} sec.".format(time() - now))
    logger.info("correctly parsed {0} results.".format(len(results)))

def main():
    parser = argparse.ArgumentParser(description='Test A Parser Implementation')
    parser.add_argument('-p','--parser_function', default="parsers.parse",
                        help='Parser function to measure')

    opts = parser.parse_args()
    fn_name = opts.parser_function.split(".")[-1]
    test_parser(getattr(parsers, fn_name))
    return 0

if __name__ == '__main__':
    sys.exit(main())
```

We run it like so:

    PYTHONPATH=`pwd` python test.py -p parsers.parse

Diving in

Initial parse performance

The default implementation does the following: it removes the namespaces from the XML
string and then proceeds to parse and locate the elements of interest using standard
ElementTree operations (findall, find, etc.). It serves as the baseline to test the new implementations against.

def parse(xml):
    root = strip_namespaces(xml)
    results = []
    for subelement in root.findall(".//ResponseItemDatum"):
        i = subelement.find(".//ResponseItemDatumChild3").get("SubItemID")
        j = subelement.find(".//ResponseItemDatumChild4").get("SubItemID")
        results.append((i, j))
    return results


[2014-04-29 14:30:07,288] evaluating parser `parse`
[2014-04-29 14:30:10,631] parsing completed in 3.34278 sec.
[2014-04-29 14:30:10,631] correctly parsed 25000 results.

The culprit

Some code profiling is in order. For this, guppy proved really useful.
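(For a quick, dependency-free alternative, the standard library's cProfile gives a similar per-function breakdown. A minimal sketch, with hypothetical stand-ins for our parse and namespace-stripping functions:)

``` python
import cProfile
import io
import pstats


def strip_namespaces(s):
    # Hypothetical stand-in for the real, XSLT-based namespace stripper.
    return s.replace("ns:", "")


def parse(s):
    # Hypothetical parse function whose hot spots we want to see.
    return [strip_namespaces(s) for _ in range(1000)]


pr = cProfile.Profile()
pr.enable()
parse("<ns:a/>" * 100)
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())  # the 'cumtime' column points at the expensive calls
```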
After running the code through guppy's profiler, we discovered the culprit. More than 50% of parsing time was spent in the strip_namespaces function. We refactored the code to use namespaces, and lo and behold:

def parse4(xml):
    root = ET.fromstring(xml)
    results = []
    for subelement in root.iterfind(".//{http://www.w3.org/1999/xhtml}ResponseItemDatum"):
        chd = subelement.find(".//{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3")
        i = chd.get("SubItemID")
        j = chd.find("{http://www.w3.org/1999/xhtml}ResponseItemDatumChild4").get("SubItemID")
        results.append((i, j))
    return results


[2014-04-29 15:17:09,212] evaluating parser `parse4`
[2014-04-29 15:17:10,357] parsing completed in 1.14476 sec.

The function strip_namespaces removed namespaces by applying an XSLT transformation

def strip_namespaces(xml_with_namespaces):
    xslt = '''<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/|comment()|processing-instruction()">
        <xsl:copy><xsl:apply-templates/></xsl:copy>
      </xsl:template>
      <xsl:template match="*">
        <xsl:element name="{local-name()}">
          <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
      </xsl:template>
      <xsl:template match="@*">
        <xsl:attribute name="{local-name()}">
          <xsl:value-of select="."/>
        </xsl:attribute>
      </xsl:template>
    </xsl:stylesheet>'''
    xslt_doc = ET.parse(io.BytesIO(xslt.encode("latin-1")))
    transform = ET.XSLT(xslt_doc)
    dom = ET.parse(io.BytesIO(xml_with_namespaces.encode("latin-1")))
    return transform(dom)

Performing an XSLT transformation is an expensive operation. When processing typical SOAP responses
(less than 100 KiB), the convenience of using
simple, namespace-free ElementPath queries outweighs the negligible performance penalty at that size. In our case, however, the responses are big, so it is acceptable to sacrifice some code clarity for performance.
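For contrast, namespace-aware queries need no stripping pass at all; even the standard library's ElementTree accepts a prefix mapping. A minimal sketch with a made-up document and namespace:

``` python
import xml.etree.ElementTree as ET

NS = {"o": "urn:example"}  # hypothetical prefix -> namespace mapping
xml = """<Response xmlns="urn:example">
  <Item id="1"/><Item id="2"/>
</Response>"""

root = ET.fromstring(xml)
# No XSLT pass: the prefix in the path is resolved via the mapping.
ids = [el.get("id") for el in root.findall("o:Item", NS)]
print(ids)  # ['1', '2']
```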

Using pure XPath

The next profiler run revealed another bottleneck: the ElementPath selection operations themselves. After a little study of lxml, the solution came in the form of precompiled, pure-XPath expressions.

XP_NS = {"o": "http://www.w3.org/1999/xhtml"}
XP_ALL_ELEMS = "o:Response/o:ResponseItem/o:ResponseItemDatum"
XP_CHD_ELEM = "o:ResponseItemDatumChild2/o:ResponseItemDatumChild3"
XP_CHD_ELEM2 = "o:ResponseItemDatumChild4"
find_all_elems = ET.XPath(XP_ALL_ELEMS, namespaces=XP_NS)
find_child_elem = ET.XPath(XP_CHD_ELEM, namespaces=XP_NS)
find_child_elem2 = ET.XPath(XP_CHD_ELEM2, namespaces=XP_NS)

def parse5(xml):
    root = ET.fromstring(xml)
    results = []
    for subelement in find_all_elems(root):
        chd = find_child_elem(subelement)[0]
        i = chd.get("SubItemID")
        j = find_child_elem2(chd)[0].get("SubItemID")
        results.append((i, j))
    return results


[2014-04-29 15:20:04,464] evaluating parser `parse5`
[2014-04-29 15:20:05,060] parsing completed in 0.59525 sec.
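The pattern generalizes: compile each expression once with `ET.XPath`, at module import time, and reuse the resulting callable across documents. A standalone sketch with a made-up two-level document:

``` python
from lxml import etree as ET

NS = {"o": "urn:example"}  # hypothetical namespace
xml = b"""<root xmlns="urn:example">
  <item id="1"><child id="a"/></item>
  <item id="2"><child id="b"/></item>
</root>"""

# Compile once, outside any per-document loop; ET.XPath returns a
# callable that evaluates the expression against an element.
find_items = ET.XPath("o:item", namespaces=NS)
find_child = ET.XPath("o:child", namespaces=NS)

root = ET.fromstring(xml)
pairs = [(item.get("id"), find_child(item)[0].get("id"))
         for item in find_items(root)]
print(pairs)  # [('1', 'a'), ('2', 'b')]
```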

Using iterparse

Finally, we tried a streaming approach. iterparse hands us each element as soon as it has been fully parsed, which lets us clear elements we are done with and keep the memory footprint low.

def parse6(xml):

    XP_NS = {"o": "http://www.w3.org/1999/xhtml"}
    XP_CHD_ELEM2 = "o:ResponseItemDatumChild4"

    find_child_elem2 = ET.XPath(XP_CHD_ELEM2, namespaces=XP_NS)
    results = []
    XP_CHD_ELEM_TAG = "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3"
    l_elem = None

    for _, elem in ET.iterparse(io.BytesIO(xml.encode("latin-1")), events=("end",)):
        l_elem = elem
        tag = elem.tag
        if tag == XP_CHD_ELEM_TAG:
            i = elem.get("SubItemID")
            j = find_child_elem2(elem)[0].get("SubItemID")
            results.append((i, j))
            elem.clear()
    if l_elem is not None:
        l_elem.clear()
    return results


[2014-04-29 15:21:40,416] evaluating parser `parse6`
[2014-04-29 15:21:41,020] parsing completed in 0.60345 sec.
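The same streaming idea is available in the standard library's xml.etree.ElementTree as well. A minimal sketch with a made-up document:

``` python
import io
import xml.etree.ElementTree as ET

xml = b"""<root>
  <item id="1"><child id="a"/></item>
  <item id="2"><child id="b"/></item>
</root>"""

results = []
for _, elem in ET.iterparse(io.BytesIO(xml), events=("end",)):
    if elem.tag == "item":
        # The "end" event fires after all children are parsed,
        # so the subtree is complete and safe to query.
        results.append((elem.get("id"), elem.find("child").get("id")))
        elem.clear()  # drop the subtree we no longer need
print(results)  # [('1', 'a'), ('2', 'b')]
```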

Final results and conclusion

From 3.34 down to 0.6 seconds! We made parsing roughly 5.6 times faster.
The factors that contributed most to the speedup were the removal of all XSLT
transformations, combined with the exclusive use of precompiled,
pure-XPath expressions for all element-finding operations. Using iterparse plus element clearing
nicely reduced the memory footprint for a negligible performance penalty.

Published by pgk

