Optimizing XML parsing in Python

At work we consume a lot of SOAP web services; it's probably a GDS thing. The most common type of request is a flight availability search.

Due to the sheer response size, they can be taxing to parse. It's about time we took a Paretofied approach and optimized this critical 20-ish percent of our code. So, with the goal of optimizing the parse function, here are the steps taken (using Python >= 3.4 and lxml).

Metrics, Profiling

Our parsing function interface looks like this

def iparse(xml_string):
    """
    Parse the string xml_string. Return a list of result tuples
    """

We need to profile our code to get a good idea where it spends most of its time. We also need an XML file for testing. Here is code that generates a (hypothetical) XML Response and dumps it to stdout

from lxml import etree as ET


def generate_test_xml(n_results=250):
    root = ET.Element("{http://www.w3.org/1999/xhtml}Body")
    response = ET.SubElement(root, "{http://www.w3.org/1999/xhtml}Response")
    for i in range(n_results):
        parent = ET.SubElement(response, "{http://www.w3.org/1999/xhtml}ResponseItem", ResponseItemID=str(i))
        for j in range(100):
            child = ET.SubElement(parent, "{http://www.w3.org/1999/xhtml}ResponseItemDatum", SubItemID=str(j), ResponseItemID=str(i))
            child2 = ET.SubElement(child, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild2", SubItemID=str(j + 1), ResponseItemID=str(i))
            child2.text = str(j * i)
            child3 = ET.SubElement(child2, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3", SubItemID=str(j + 2), ResponseItemID=str(i))
            child3.text = str(j * i)
            child4 = ET.SubElement(child3, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild4", SubItemID=str(j + 3), ResponseItemID=str(i))
            child4.text = str(j * i)
            child5 = ET.SubElement(child3, "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild5", SubItemID=str(j + 3), ResponseItemID=str(i))
            child5.text = str(j * i)

    ET.dump(root)

if __name__ == '__main__':
    generate_test_xml()

Then we can generate a simple XML file like so:

 python gen_xml.py > demo.xml

Assuming there is a module named parsers on our PYTHONPATH, we can test different parser implementations within that module with this small script:

import argparse
import sys
import logging
from time import time
import parsers

logger = logging.getLogger("parser.test")
logger.setLevel(logging.DEBUG)

ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('[%(asctime)s] %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)


def test_parser(parser):
    with open("demo.xml") as f:
        s = f.read()
    now = time()
    logger.info("evaluating parser `{}`".format(parser.__name__))
    results = parser(s)
    logger.info("parsing completed in {0:0.5f} sec.".format(time() - now))
    logger.info("correctly parsed {0} results.".format(len(results)))


def main():
    parser = argparse.ArgumentParser(description='Test A Parser Implementation')
    parser.add_argument('-p','--parser_function', default="parsers.parse",
                        help='Parser function to measure')

    opts = parser.parse_args()
    fn_name = opts.parser_function.split(".")[-1]
    parser = getattr(parsers, fn_name)
    test_parser(parser)
    return 0

if __name__ == '__main__':
    main()

We run it like so:

PYTHONPATH=`pwd` python test.py -p parsers.parse

Diving in

Initial parse performance

The default implementation does the following: it removes the namespaces from the XML string and then proceeds to parse and locate the elements of interest using standard ElementTree operations (findall, find, etc.). It will serve as the baseline to test the new implementations against.

def parse(xml):
    root = strip_namespaces(xml)
    results = []
    for subelement in root.findall(".//ResponseItemDatum"):
        i = subelement.find(".//ResponseItemDatumChild3").get("SubItemID")
        j = subelement.find(".//ResponseItemDatumChild4").get("SubItemID")
        results.append((i,j))
    return results

results:

[2014-04-29 14:30:07,288] evaluating parser `parse`
[2014-04-29 14:30:10,631] parsing completed in 3.34278 sec.
[2014-04-29 14:30:10,631] correctly parsed 25000 results.

The culprit

Some code profiling is in order. For this, guppy proved really useful.
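We used guppy here, but a quick call-level breakdown is also available from the standard library. A minimal sketch with cProfile (a stand-in for the guppy run, not the exact profiling we did):

import cProfile
import pstats

import parsers

with open("demo.xml") as f:
    xml = f.read()

# Profile one parse run and write the raw stats to a file.
cProfile.run("parsers.parse(xml)", "parse.prof")

# Print the ten most expensive entries by cumulative time.
pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(10)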
After running the code through guppy's profiler, we discovered the culprit: more than 50% of parsing time was spent in the strip_namespaces function. We refactored the code to keep the namespaces and lo and behold:

def parse4(xml):
    root = ET.fromstring(xml)
    results = []
    for subelement in root.iterfind(".//{http://www.w3.org/1999/xhtml}ResponseItemDatum"):
        chd = subelement.find(".//{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3")
        i = chd.get("SubItemID")
        j = chd.find("{http://www.w3.org/1999/xhtml}ResponseItemDatumChild4").get("SubItemID")
        results.append((i,j))
    return results

results:

[2014-04-29 15:17:09,212] evaluating parser `parse4`
[2014-04-29 15:17:10,357] parsing completed in 1.14476 sec.

The strip_namespaces function removed namespaces by applying an XSLT transformation:

def strip_namespaces(xml_with_namespaces):
    xslt = '''
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:template match="/|comment()|processing-instruction()">
        <xsl:copy>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*">
        <xsl:attribute name="{local-name()}">
            <xsl:value-of select="."/>
        </xsl:attribute>
    </xsl:template>
    </xsl:stylesheet>
    '''
    xslt_doc = ET.parse(io.BytesIO(xslt.encode("latin-1")))
    transform = ET.XSLT(xslt_doc)
    dom = ET.parse(io.BytesIO(xml_with_namespaces.encode("latin-1")))
    return transform(dom)

Performing XSLT transformations is an expensive operation. When processing typical SOAP responses (less than 100KiB), the convenience of writing simple ElementPath queries sans namespaces outweighs the negligible performance penalty at that size. In our case, however, the responses are big, so it is acceptable to sacrifice some code clarity for performance.
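To make the trade-off concrete, here is the same lookup in both styles (a sketch; xml is the response string from earlier):

stripped = strip_namespaces(xml)  # XSLT pass from above
kept = ET.fromstring(xml)

# Without namespaces, the ElementPath query stays short and readable:
stripped.findall(".//ResponseItemDatum")

# With namespaces kept, every tag must be fully qualified:
kept.findall(".//{http://www.w3.org/1999/xhtml}ResponseItemDatum")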

Using pure XPath

The next profiler run revealed another bottleneck: the ElementPath selection operations themselves. After a bit of studying lxml, the solution came in the form of precompiled pure-XPath expressions.

XP_NS = {"o": "http://www.w3.org/1999/xhtml"}
XP_ALL_ELEMS = "o:Response/o:ResponseItem/o:ResponseItemDatum"
XP_CHD_ELEM = "o:ResponseItemDatumChild2/o:ResponseItemDatumChild3"
XP_CHD_ELEM2 = "o:ResponseItemDatumChild4"
find_all_elems = ET.XPath(XP_ALL_ELEMS, namespaces=XP_NS)
find_child_elem = ET.XPath(XP_CHD_ELEM, namespaces=XP_NS)
find_child_elem2 = ET.XPath(XP_CHD_ELEM2, namespaces=XP_NS)

def parse5(xml):
    root = ET.fromstring(xml)
    results = []
    for subelement in find_all_elems(root):
        chd = find_child_elem(subelement)[0]
        i = chd.get("SubItemID")
        j = find_child_elem2(chd)[0].get("SubItemID")
        results.append((i,j))
    return results

results:

[2014-04-29 15:20:04,464] evaluating parser `parse5`
[2014-04-29 15:20:05,060] parsing completed in 0.59525 sec.

Using iterparse

Finally, we can avoid building the whole tree up front: lxml's iterparse processes the document incrementally, and clearing elements as we finish with them keeps the memory footprint down on these large responses.

def parse6(xml):

    XP_NS = {"o": "http://www.w3.org/1999/xhtml"}
    XP_CHD_ELEM2 = "o:ResponseItemDatumChild4"

    find_child_elem2 = ET.XPath(XP_CHD_ELEM2, namespaces=XP_NS)
    results = []
    XP_CHD_ELEM_TAG = "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3"
    l_elem = None

    # "end" events fire once an element and all of its children have been
    # fully parsed, so subtree lookups below are safe.
    for _, elem in ET.iterparse(io.BytesIO(xml.encode("latin-1")), events=("end",)):
        l_elem = elem
        tag = elem.tag
        if tag == XP_CHD_ELEM_TAG:
            i = elem.get("SubItemID")
            j = find_child_elem2(elem)[0].get("SubItemID")
            results.append((i,j))
            # Drop the subtree we just consumed to keep memory bounded.
            elem.clear()
    if l_elem is not None:
        # Clear whatever element we saw last (the root on a full parse).
        l_elem.clear()
    return results

results:

[2014-04-29 15:21:40,416] evaluating parser `parse6`
[2014-04-29 15:21:41,020] parsing completed in 0.60345 sec.

Final results and conclusion

From 3.34 down to 0.6 seconds: we made parsing roughly 5.6x faster. The factors that contributed the most were the removal of all XSLT transformations, combined with the exclusive use of precompiled pure-XPath expressions for all element-finding operations. Using iterparse plus element clearing reduced the memory footprint at a negligible performance penalty.
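For reference, the timings collected above, all on the same 25,000-result test document:

parse   (XSLT namespace stripping + ElementPath)  3.34 sec
parse4  (namespaces kept, ElementPath)            1.14 sec
parse5  (precompiled XPath expressions)           0.60 sec
parse6  (iterparse + element clearing)            0.60 sec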

Published by pgk
