Optimizing XML parsing in Python

At work we consume a lot of SOAP web services; it's probably a GDS thing. The most common type of request is a flight availability search.

Due to the sheer response size, they can be taxing to parse. It's about time we took a Paretofied approach and optimized this critical 20-ish percent of our code. So, with the goal of optimizing the parse function, here are the steps taken (using Python >= 3.4 and lxml).

Metrics, Profiling

Our parsing function interface looks like this

def iparse(xml_string):
    """
    Parse the string xml_string. Return a list of result tuples
    """

We need to profile our code to get a good idea where it spends most of its time. We also need an XML file for testing. Here is code that generates a (hypothetical) XML Response and dumps it to stdout

from lxml import etree as ET


def generate_test_xml(n_results=250):
    root = ET.Element("{http://www.w3.org/1999/xhtml}Body")
    response = ET.SubElement(root, "{http://www.w3.org/1999/xhtml}Response")
    for i in range(n_results):
        parent = ET.SubElement(response, "{http://www.w3.org/1999/xhtml}ResponseItem", ResponseItemID=str(i))
        for j in range(100):
            child = ET.SubElement(parent, "{http://www.w3.org/1999/xhtml}ResponseItemDatum", SubItemID=str(j), ResponseItemID=str(i))
            child2 = ET.SubElement(child, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild2", SubItemID=str(j + 1), ResponseItemID=str(i))
            child2.text = str(j * i)
            child3 = ET.SubElement(child2, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3", SubItemID=str(j + 2), ResponseItemID=str(i))
            child3.text = str(j * i)
            child4 = ET.SubElement(child3, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild4", SubItemID=str(j + 3), ResponseItemID=str(i))
            child4.text = str(j * i)
            child5 = ET.SubElement(child3, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild5", SubItemID=str(j + 3), ResponseItemID=str(i))
            child5.text = str(j * i)

    ET.dump(root)

if __name__ == '__main__':
    generate_test_xml()

Then we can generate a simple XML file like so:

 python gen_xml.py > demo.xml

Assuming there is a module named parsers on our PYTHONPATH, we can test different parser implementations within that module with this small script:

import argparse
import sys
import logging
from time import time
import parsers

logger = logging.getLogger("parser.test")
logger.setLevel(logging.DEBUG)

ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('[%(asctime)s] %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)


def test_parser(parser):
    with open("demo.xml") as f:
        s = f.read()
    now = time()
    logger.info("evaluating parser `{}`".format(parser.__name__))
    results = parser(s)
    logger.info("parsing completed in {0:0.5f} sec.".format(time() - now))
    logger.info("correctly parsed {0} results.".format(len(results)))


def main():
    parser = argparse.ArgumentParser(description='Test A Parser Implementation')
    parser.add_argument('-p','--parser_function', default="parsers.parse",
                        help='Parser function to measure')

    opts = parser.parse_args()
    fn_name = opts.parser_function.split(".")[-1]
    parser = getattr(parsers, fn_name)
    test_parser(parser)
    return 0

if __name__ == '__main__':
    main()

We run it like so:

PYTHONPATH=`pwd` python test.py -p parsers.parse

Diving in

Initial parse performance

The default implementation does the following: it removes the namespaces from the XML string and then proceeds to parse and locate the elements of interest using standard ElementTree operations (findall, find, etc.). It will serve as the baseline to test the new implementations against.

def parse(xml):
    root = strip_namespaces(xml)
    results = []
    for subelement in root.findall(".//ResponseItemDatum"):
        i = subelement.find(".//ResponseItemDatumChild3").get("SubItemID")
        j = subelement.find(".//ResponseItemDatumChild4").get("SubItemID")
        results.append((i,j))
    return results

results:

[2014-04-29 14:30:07,288] evaluating parser `parse`
[2014-04-29 14:30:10,631] parsing completed in 3.34278 sec.
[2014-04-29 14:30:10,631] correctly parsed 25000 results.

The culprit

Some code profiling is in order. For this, guppy proved really useful.
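We used guppy here, but a quick call-level breakdown is also available from the standard library. A minimal sketch with cProfile (a stand-in for the guppy run, not the exact profiling we did):

import cProfile
import pstats

import parsers

with open("demo.xml") as f:
    xml = f.read()

# Profile one parse run and write the raw stats to a file.
cProfile.run("parsers.parse(xml)", "parse.prof")

# Print the ten most expensive entries by cumulative time.
pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(10)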
After running the code through guppy's profiler, we discovered the culprit: more than 50% of parsing time was spent in the strip_namespaces function. We refactored the code to keep the namespaces and lo and behold:

def parse4(xml):
    root = ET.fromstring(xml)
    results = []
    for subelement in root.iterfind(".//{http://www.w3.org/1999/xhtml}ResponseItemDatum"):
        chd = subelement.find(".//{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3")
        i = chd.get("SubItemID")
        j = chd.find("{http://www.w3.org/1999/xhtml}ResponseItemDatumChild4").get("SubItemID")
        results.append((i,j))
    return results

results:

[2014-04-29 15:17:09,212] evaluating parser `parse4`
[2014-04-29 15:17:10,357] parsing completed in 1.14476 sec.

The strip_namespaces function removed namespaces by applying an XSLT transformation:

def strip_namespaces(xml_with_namespaces):
    xslt = '''
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:template match="/|comment()|processing-instruction()">
        <xsl:copy>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*">
        <xsl:attribute name="{local-name()}">
            <xsl:value-of select="."/>
        </xsl:attribute>
    </xsl:template>
    </xsl:stylesheet>
    '''
    xslt_doc = ET.parse(io.BytesIO(xslt.encode("latin-1")))
    transform = ET.XSLT(xslt_doc)
    dom = ET.parse(io.BytesIO(xml_with_namespaces.encode("latin-1")))
    return transform(dom)

Performing XSLT transformations is an expensive operation. When processing typical SOAP responses (less than 100KiB), the convenience of writing simple ElementPath queries sans namespaces outweighs the negligible performance penalty at that size. In our case, however, the responses are big, so it is acceptable to sacrifice some code clarity for performance.
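To make the trade-off concrete, here is the same lookup in both styles (a sketch; xml is the response string from earlier):

stripped = strip_namespaces(xml)  # XSLT pass from above
kept = ET.fromstring(xml)

# Without namespaces, the ElementPath query stays short and readable:
stripped.findall(".//ResponseItemDatum")

# With namespaces kept, every tag must be fully qualified:
kept.findall(".//{http://www.w3.org/1999/xhtml}ResponseItemDatum")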

Using pure XPath

The next profiler run revealed another bottleneck: the ElementPath selection operations themselves. After a bit of studying lxml, the solution came in the form of precompiled pure-XPath expressions.

XP_NS = {"o": "http://www.w3.org/1999/xhtml"}
XP_ALL_ELEMS = "o:Response/o:ResponseItem/o:ResponseItemDatum"
XP_CHD_ELEM = "o:ResponseItemDatumChild2/o:ResponseItemDatumChild3"
XP_CHD_ELEM2 = "o:ResponseItemDatumChild4"
find_all_elems = ET.XPath(XP_ALL_ELEMS, namespaces=XP_NS)
find_child_elem = ET.XPath(XP_CHD_ELEM, namespaces=XP_NS)
find_child_elem2 = ET.XPath(XP_CHD_ELEM2, namespaces=XP_NS)

def parse5(xml):
    root = ET.fromstring(xml)
    results = []
    for subelement in find_all_elems(root):
        chd = find_child_elem(subelement)[0]
        i = chd.get("SubItemID")
        j = find_child_elem2(chd)[0].get("SubItemID")
        results.append((i,j))
    return results

results:

[2014-04-29 15:20:04,464] evaluating parser `parse5`
[2014-04-29 15:20:05,060] parsing completed in 0.59525 sec.

Using iterparse

Finally, we can avoid building the whole tree up front: lxml's iterparse processes the document incrementally, and clearing elements as we finish with them keeps the memory footprint down on these large responses.

def parse6(xml):

    XP_NS = {"o": "http://www.w3.org/1999/xhtml"}
    XP_CHD_ELEM2 = "o:ResponseItemDatumChild4"

    find_child_elem2 = ET.XPath(XP_CHD_ELEM2, namespaces=XP_NS)
    results = []
    XP_CHD_ELEM_TAG = "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3"
    l_elem = None

    # "end" events fire once an element and all of its children have been
    # fully parsed, so subtree lookups below are safe.
    for _, elem in ET.iterparse(io.BytesIO(xml.encode("latin-1")), events=("end",)):
        l_elem = elem
        tag = elem.tag
        if tag == XP_CHD_ELEM_TAG:
            i = elem.get("SubItemID")
            j = find_child_elem2(elem)[0].get("SubItemID")
            results.append((i,j))
            # Drop the subtree we just consumed to keep memory bounded.
            elem.clear()
    if l_elem is not None:
        # Clear whatever element we saw last (the root on a full parse).
        l_elem.clear()
    return results

results:

[2014-04-29 15:21:40,416] evaluating parser `parse6`
[2014-04-29 15:21:41,020] parsing completed in 0.60345 sec.

Final results and conclusion

From 3.34 down to 0.6 seconds: we made parsing roughly 5.6x faster. The factors that contributed the most were the removal of all XSLT transformations, combined with the exclusive use of precompiled pure-XPath expressions for all element-finding operations. Using iterparse plus element clearing reduced the memory footprint at a negligible performance penalty.
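For reference, the timings collected above, all on the same 25,000-result test document:

parse   (XSLT namespace stripping + ElementPath)  3.34 sec
parse4  (namespaces kept, ElementPath)            1.14 sec
parse5  (precompiled XPath expressions)           0.60 sec
parse6  (iterparse + element clearing)            0.60 sec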

Published by pgk
