We consume lots of SOAP. It's probably a GDS thing.
Due to increased response sizes, our parsing times went up. Fun times! Here are the steps we took to optimize XML parsing, using Python >= 3.4 and lxml.
Metrics, Profiling
Our parse function interface is more or less like this:

```python
def iparse(xml_string):
    """Parse the string xml_string. Return a list of result tuples."""
```
We need to profile our code to find bottlenecks, and we need big XML files to test with. Here is code that generates a (hypothetical) XML response and dumps it to stdout:
```python
from lxml import etree as ET


def generate_test_xml(n_results=250):
    root = ET.Element("{http://www.w3.org/1999/xhtml}Body")
    response = ET.SubElement(root, "{http://www.w3.org/1999/xhtml}Response")
    for i in range(n_results):
        parent = ET.SubElement(response,
                               "{http://www.w3.org/1999/xhtml}ResponseItem",
                               ResponseItemID=str(i))
        for j in range(100):
            child = ET.SubElement(parent,
                                  "{http://www.w3.org/1999/xhtml}ResponseItemDatum",
                                  SubItemID=str(j), ResponseItemID=str(i))
            child2 = ET.SubElement(child,
                                   "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild2",
                                   SubItemID=str(j + 1), ResponseItemID=str(i))
            child2.text = str(j * i)
            child3 = ET.SubElement(child2,
                                   "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3",
                                   SubItemID=str(j + 2), ResponseItemID=str(i))
            child3.text = str(j * i)
            child4 = ET.SubElement(child3,
                                   "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild4",
                                   SubItemID=str(j + 3), ResponseItemID=str(i))
            child4.text = str(j * i)
            child5 = ET.SubElement(child3,
                                   "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild5",
                                   SubItemID=str(j + 3), ResponseItemID=str(i))
            child5.text = str(j * i)
    ET.dump(root)


if __name__ == '__main__':
    generate_test_xml()
```
Then we can generate a sample XML file like so:

```shell
python gen_xml.py > demo.xml
```

Assuming there is a module named `parsers` in our `PYTHONPATH`, we can test different parser implementation functions within that module with this small script:

```python
import argparse
import sys
import logging
from time import time

import parsers

logger = logging.getLogger("parser.test")
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('[%(asctime)s] %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)


def test_parser(parser):
    with open("demo.xml") as f:
        s = f.read()
    now = time()
    logger.info("evaluating parser `{}`".format(parser.__name__))
    results = parser(s)
    logger.info("parsing completed in {0:0.5f} sec.".format(time() - now))
    logger.info("correctly parsed {0} results.".format(len(results)))


def main():
    parser = argparse.ArgumentParser(description='Test A Parser Implementation')
    parser.add_argument('-p', '--parser_function', default="parsers.parse",
                        help='Parser function to measure')
    opts = parser.parse_args()
    fn_name = opts.parser_function.split(".")[-1]
    parser = getattr(parsers, fn_name)
    test_parser(parser)
    return 0


if __name__ == '__main__':
    main()
```
By running it like so
```shell
PYTHONPATH=`pwd` python test.py -p parsers.parse
```
Diving in
Initial parse performance
The existing implementation removes XML namespaces, then parses the document and locates elements of interest using standard ElementTree operations (`findall`, `find`, etc.):
```python
def parse(xml):
    root = strip_namespaces(xml)
    results = []
    for subelement in root.findall(".//ResponseItemDatum"):
        i = subelement.find(".//ResponseItemDatumChild3").get("SubItemID")
        j = subelement.find(".//ResponseItemDatumChild4").get("SubItemID")
        results.append((i, j))
    return results
```
Results:

```
[2014-04-29 14:30:07,288] evaluating parser `parse`
[2014-04-29 14:30:10,631] parsing completed in 3.34278 sec.
[2014-04-29 14:30:10,631] correctly parsed 25000 results.
```
Profiling
After profiling the code with guppy, we discovered that more than 50% of parsing time was spent in the namespace-stripping function. We refactored the code to work with namespaces instead:
```python
def parse4(xml):
    root = ET.fromstring(xml)
    results = []
    for subelement in root.iterfind(".//{http://www.w3.org/1999/xhtml}ResponseItemDatum"):
        chd = subelement.find(".//{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3")
        i = chd.get("SubItemID")
        j = chd.find("{http://www.w3.org/1999/xhtml}ResponseItemDatumChild4").get("SubItemID")
        results.append((i, j))
    return results
```
Results:

```
[2014-04-29 15:17:09,212] evaluating parser `parse4`
[2014-04-29 15:17:10,357] parsing completed in 1.14476 sec.
```
The function `strip_namespaces` removed namespaces by applying an XSLT transformation:

```python
def strip_namespaces(xml_with_namespaces):
    xslt = '''
    '''
    xslt_doc = ET.parse(io.BytesIO(xslt.encode("latin-1")))
    transform = ET.XSLT(xslt_doc)
    dom = ET.parse(io.BytesIO(xml_with_namespaces.encode("latin-1")))
    return transform(dom)
```
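The stylesheet string is not reproduced here; a typical namespace-stripping XSLT (an illustrative sketch, not necessarily the exact stylesheet we used) recreates every element and attribute under its local name, dropping the namespace:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no"/>

  <!-- Recreate each element under its local (namespace-free) name -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Recreate each attribute under its local name -->
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

  <!-- Copy text, comments and processing instructions unchanged -->
  <xsl:template match="text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
```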
Performing XSLT transformations turned out to be a heavy operation. When processing typical SOAP responses (< 100 KB), the convenience of simple, namespace-free ElementPath queries outweighs the performance penalty. With bigger responses, however, we want something that performs much better.
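As an aside, hot spots like this are easy to locate with the standard library's profiler as well; a minimal harness might look like the sketch below (`profile_parser` and its `top` parameter are our own illustrative names; the post itself used guppy):

```python
import cProfile
import io
import pstats


def profile_parser(parser, xml_string, top=10):
    """Run `parser` under cProfile and print the `top` costliest calls."""
    pr = cProfile.Profile()
    pr.enable()
    results = parser(xml_string)
    pr.disable()
    report = io.StringIO()
    # Sorting by cumulative time surfaces expensive helpers
    # (e.g. a namespace-stripping function) immediately.
    pstats.Stats(pr, stream=report).sort_stats("cumulative").print_stats(top)
    print(report.getvalue())
    return results
```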
Using pure XPath
The next profiler run revealed another bottleneck: the ElementPath selection operations themselves. After more digging in lxml, the solution came in the form of precompiled pure-XPath expressions.
```python
XP_NS = {"o": "http://www.w3.org/1999/xhtml"}
XP_ALL_ELEMS = "o:Response/o:ResponseItem/o:ResponseItemDatum"
XP_CHD_ELEM = "o:ResponseItemDatumChild2/o:ResponseItemDatumChild3"
XP_CHD_ELEM2 = "o:ResponseItemDatumChild4"

find_all_elems = ET.XPath(XP_ALL_ELEMS, namespaces=XP_NS)
find_child_elem = ET.XPath(XP_CHD_ELEM, namespaces=XP_NS)
find_child_elem2 = ET.XPath(XP_CHD_ELEM2, namespaces=XP_NS)


def parse5(xml):
    root = ET.fromstring(xml)
    results = []
    for subelement in find_all_elems(root):
        chd = find_child_elem(subelement)[0]
        i = chd.get("SubItemID")
        j = find_child_elem2(chd)[0].get("SubItemID")
        results.append((i, j))
    return results
```
Results:

```
[2014-04-29 15:20:04,464] evaluating parser `parse5`
[2014-04-29 15:20:05,060] parsing completed in 0.59525 sec.
```
Using iterparse
```python
def parse6(xml):
    XP_NS = {"o": "http://www.w3.org/1999/xhtml"}
    XP_CHD_ELEM2 = "o:ResponseItemDatumChild4"
    find_child_elem2 = ET.XPath(XP_CHD_ELEM2, namespaces=XP_NS)
    results = []
    XP_CHD_ELEM_TAG = "{http://www.w3.org/1999/xhtml}ResponseItemDatumChild3"
    l_elem = None
    for _, elem in ET.iterparse(io.BytesIO(xml.encode("latin-1")), events=("end",)):
        l_elem = elem
        tag = elem.tag
        if tag == XP_CHD_ELEM_TAG:
            i = elem.get("SubItemID")
            j = find_child_elem2(elem)[0].get("SubItemID")
            results.append((i, j))
            elem.clear()
    if l_elem is not None:
        l_elem.clear()
    return results
```
Results:

```
[2014-04-29 15:21:40,416] evaluating parser `parse6`
[2014-04-29 15:21:41,020] parsing completed in 0.60345 sec.
```
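For reference, the same streaming pattern can be sketched with the standard library alone, without lxml's compiled XPath (`parse_stream` is an illustrative name; element names match the demo document above):

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/1999/xhtml}"
TAG_CHILD3 = NS + "ResponseItemDatumChild3"
TAG_CHILD4 = NS + "ResponseItemDatumChild4"


def parse_stream(xml):
    """Stream-parse, collecting (Child3, Child4) SubItemID pairs."""
    results = []
    for _, elem in ET.iterparse(io.BytesIO(xml.encode()), events=("end",)):
        if elem.tag == TAG_CHILD3:
            # At the "end" event, all of Child3's children have been parsed.
            child4 = elem.find(TAG_CHILD4)
            if child4 is not None:
                results.append((elem.get("SubItemID"), child4.get("SubItemID")))
            elem.clear()  # free the subtree we no longer need
    return results
```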
Final results and conclusion
From 3.34 seconds down to 0.6: parsing is now more than five times faster. The factors that contributed most were removing the XSLT transformation and exclusively using precompiled XPath expressions for element lookups. Finally, iterparse combined with element clearing reduced memory usage at a negligible performance cost. Happy servers! :)