Validating XML using lxml in Python


When working with XML documents, we often need to validate a document against a predefined schema. These schemas usually come as XSD (XML Schema Definition) files. While there are commercial and open source applications that can do this validation, doing it in Python is more flexible and a good learning experience.

Prerequisites

You need Python installed (I’ll be using Python 3, but the code should work in Python 2 with minimal modifications). You’ll also need the lxml package to handle schema validation. You can install it using pip:

$ pip install lxml

Or if you’re a conda user:

$ conda install lxml

Importing and using lxml

For XML schema validation, we need the etree module from the lxml package. Let’s also import StringIO from the io module for passing strings as files to etree, as well as sys for reading the command line arguments.

from lxml import etree
from io import StringIO
import sys

I prefer passing the file names as command line arguments to the Python script, as it keeps the file handling simple:

filename_xml = sys.argv[1]
filename_xsd = sys.argv[2]
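If the script is started with fewer than two arguments, indexing sys.argv raises an IndexError. A minimal sketch of a guard follows; the helper name get_filenames is hypothetical and not part of the original script:

```python
def get_filenames(argv):
    """Return (xml_path, xsd_path) from an argv-style list.

    Hypothetical helper for illustration; argv[0] is the script name.
    """
    if len(argv) != 3:
        # Passing a string to SystemExit prints it and exits with status 1
        raise SystemExit(
            'Usage: python validation.py <path_to_xml_file> <path_to_xsd_file>'
        )
    return argv[1], argv[2]


# Demonstration with an explicit list; the file names are made up
filename_xml, filename_xsd = get_filenames(['validation.py', 'doc.xml', 'schema.xsd'])
```

In the actual script you would pass sys.argv instead of the explicit list.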

Let’s open and read both files:

# open and read schema file
with open(filename_xsd, 'r') as schema_file:
    schema_to_check = schema_file.read()

# open and read xml file
with open(filename_xml, 'r') as xml_file:
    xml_to_check = xml_file.read()

Parsing XML and XSD files

We can parse XML files and schemas using the etree.parse() function, and we can build a schema object for validation using etree.XMLSchema(). As schemas usually arrive well-formed and correctly formatted, I skipped error checking here for the schema parsing.

xmlschema_doc = etree.parse(StringIO(schema_to_check))
xmlschema = etree.XMLSchema(xmlschema_doc)
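If you would rather check the schema as well, the sketch below shows one way to do it. It assumes a hypothetical helper named load_schema and a tiny in-memory schema made up for the demonstration. etree.parse() raises etree.XMLSyntaxError when the file is not well-formed XML, and etree.XMLSchema() raises etree.XMLSchemaParseError when the document is well-formed but not a usable XSD.

```python
from io import BytesIO
from lxml import etree


def load_schema(schema_source):
    """Parse an XSD from a file name or file-like object.

    Hypothetical helper; returns an etree.XMLSchema, or None on failure.
    """
    try:
        schema_doc = etree.parse(schema_source)
        return etree.XMLSchema(schema_doc)
    except etree.XMLSyntaxError as err:
        # The schema file is not well-formed XML
        print('Schema is not well-formed XML:', err)
    except etree.XMLSchemaParseError as err:
        # Well-formed XML, but not a valid XSD
        print('Schema is not a valid XSD:', err)
    return None


# Tiny in-memory schema, used here only for demonstration
demo_xsd = b'''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note" type="xs:string"/>
</xs:schema>'''

xmlschema = load_schema(BytesIO(demo_xsd))
```

Because etree.parse() also accepts a file name, load_schema(filename_xsd) should work without reading the file into a string first.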

Next comes parsing the actual XML document. I usually do error checking here to catch syntax errors and documents that are not well-formed. lxml raises an etree.XMLSyntaxError exception if it finds errors in the XML document, and the exception carries an error_log. We can write this log to a file to check the incorrect lines and tags:

# parse xml
try:
    doc = etree.parse(StringIO(xml_to_check))
    print('XML well formed, syntax ok.')

# check for file IO error
except IOError:
    print('Invalid File')
    sys.exit(1)

# check for XML syntax errors
except etree.XMLSyntaxError as err:
    print('XML Syntax Error, see error_syntax.log')
    with open('error_syntax.log', 'w') as error_log_file:
        error_log_file.write(str(err.error_log))
    sys.exit(1)

# any other unexpected error
except Exception:
    print('Unknown error, exiting.')
    sys.exit(1)
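Instead of dumping the whole log as one string, you can also walk through its entries. The sketch below uses a deliberately malformed snippet made up for the demonstration and reads the line, column and message attributes of each log entry:

```python
from io import BytesIO
from lxml import etree

# Deliberately malformed: the closing tag does not match the opening tag
broken_xml = b'<note><to>Alice</to></nite>'

messages = []
try:
    etree.parse(BytesIO(broken_xml))
except etree.XMLSyntaxError as err:
    for entry in err.error_log:
        # Each entry holds the position and text of one parser message
        messages.append(
            'line %d, column %d: %s' % (entry.line, entry.column, entry.message)
        )

print('\n'.join(messages))
```

This makes it easy to format the errors however you like before writing them to a file.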

Validating with Schema

As the final step, we can validate our XML document against the XSD schema using the assertValid method of etree.XMLSchema. This method takes our parsed XML document (the variable doc above) and tries to validate it against the schema definitions. If validation fails, it raises an etree.DocumentInvalid exception that carries an error_log object, as above. We can also write this log to a file to check for invalid tags or values.

# validate against schema
try:
    xmlschema.assertValid(doc)
    print('XML valid, schema validation ok.')

except etree.DocumentInvalid as err:
    print('Schema validation error, see error_schema.log')
    with open('error_schema.log', 'w') as error_log_file:
        error_log_file.write(str(err.error_log))
    sys.exit(1)

# any other unexpected error
except Exception:
    print('Unknown error, exiting.')
    sys.exit(1)
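If you prefer a plain True/False result over an exception, etree.XMLSchema also offers a validate method, and the schema object keeps the errors of the most recent run in its error_log. A minimal sketch, using a made-up one-element schema and two made-up documents:

```python
from io import BytesIO
from lxml import etree

# Demonstration schema: a single <age> element holding an integer
demo_xsd = b'''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="age" type="xs:integer"/>
</xs:schema>'''
xmlschema = etree.XMLSchema(etree.parse(BytesIO(demo_xsd)))

good_doc = etree.parse(BytesIO(b'<age>42</age>'))
bad_doc = etree.parse(BytesIO(b'<age>forty-two</age>'))

good_ok = xmlschema.validate(good_doc)
bad_ok = xmlschema.validate(bad_doc)

# error_log reflects the most recent validation run (bad_doc here)
bad_errors = [entry.message for entry in xmlschema.error_log]
print(good_ok, bad_ok, len(bad_errors))
```

This variant is handy when you want to validate many documents in a loop and collect the results rather than stop at the first failure.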

You can save this script (e.g. as ‘validation.py’) and use it with:

$ python validation.py <path_to_xml_file> <path_to_xsd_file>

Any errors will be written to ‘error_syntax.log’ or ‘error_schema.log’ (in the directory you run the script from), with the line number, error type and a description of each error. You can use these to correct your XML document and then validate it again with this script.

lxml is quite an extensive and flexible package to handle and process XML and related files. Check the sources below for tutorials, references and more information.

Sources