How to parse XML documents using Streaming API for XML (StAX)

The Streaming API for XML (StAX) provides the XMLStreamReader interface, a low-level but very efficient cursor-like API for reading XML documents. With it we iterate over the events in an XML document and extract information about each one. Once we are done with the current event, we move to the next one and continue. An event can be, for example, the start of an element, the end of an element, or character data.

XMLStreamReader is a pull API, in contrast to SAX, which is a push API. This means that the programmer explicitly decides when to fetch the next event from the XML document and can prepare for it beforehand. In my opinion a pull interface is more straightforward to use and leads to smaller, more readable event-handling code.

Well-formedness and validation

XMLStreamReader takes care of most tasks related to parsing XML documents, such as expanding entity references, unescaping special characters, and handling XML namespaces. It also checks whether the XML document is well-formed and throws an exception when it is not. Unfortunately, it does not validate the document against an XML Schema, so that has to be done in some other way.
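
One such "other way" is the standard javax.xml.validation API. The following is only a minimal sketch: the class name SchemaCheck and the tiny inline schema are made up for illustration and are not part of the example from this post. It validates from a StreamSource for simplicity; javax.xml.transform.stax.StAXSource also exists if you would rather hand a StAX reader to the validator.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaCheck {

    // Hypothetical, deliberately tiny schema used only for this illustration.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='year' type='xs:int'/>"
      + "</xs:schema>";

    /** Returns true if the XML text is valid against XSD, false otherwise. */
    static boolean isValid(String xml) throws IOException {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Also reached if the schema itself cannot be compiled.
            return false;
        }
    }
}
```

Because validation reads the whole input, a document that arrives as a one-shot stream would need to be buffered or reopened before it is parsed a second time.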

General XMLStreamReader workflow

To start reading a document we first have to create an instance of XMLStreamReader using XMLInputFactory:

XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLStreamReader reader = inputFactory.createXMLStreamReader(inputStream);

Once we have the reader we can start iterating over events:

while (reader.hasNext()) {
  int eventType = reader.next();
  switch (eventType) {
    case XMLStreamReader.START_ELEMENT:
      // handle start element (attributes are read here as well)
      break;
    case XMLStreamReader.CHARACTERS:
      // handle character data
      break;
    // ...
  }
}

The hasNext method checks whether there is another event to process. If there is, we move to it using the next method and check its type. Besides START_DOCUMENT, which is the state of the reader before the first call to next, the event types include:

  • START_ELEMENT
  • ATTRIBUTE
  • NAMESPACE
  • END_ELEMENT
  • CHARACTERS
  • CDATA
  • COMMENT
  • SPACE
  • END_DOCUMENT
  • PROCESSING_INSTRUCTION
  • ENTITY_REFERENCE
  • DTD
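
To see which of these events actually show up for a given document, a small tracing helper can be handy. This is just a sketch; EventTrace, nameOf and trace are names invented here, and only the most common event types are given readable names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class EventTrace {

    /** Maps an event type constant to a readable name (common types only). */
    static String nameOf(int eventType) {
        switch (eventType) {
            case XMLStreamConstants.START_DOCUMENT: return "START_DOCUMENT";
            case XMLStreamConstants.START_ELEMENT:  return "START_ELEMENT";
            case XMLStreamConstants.END_ELEMENT:    return "END_ELEMENT";
            case XMLStreamConstants.CHARACTERS:     return "CHARACTERS";
            case XMLStreamConstants.CDATA:          return "CDATA";
            case XMLStreamConstants.COMMENT:        return "COMMENT";
            case XMLStreamConstants.SPACE:          return "SPACE";
            case XMLStreamConstants.END_DOCUMENT:   return "END_DOCUMENT";
            default:                                return "OTHER(" + eventType + ")";
        }
    }

    /** Returns the names of all events seen while reading the given XML text. */
    static List<String> trace(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            List<String> names = new ArrayList<>();
            // State of the reader before the first call to next().
            names.add(nameOf(reader.getEventType()));
            while (reader.hasNext()) {
                names.add(nameOf(reader.next()));
            }
            return names;
        } finally {
            reader.close();
        }
    }
}
```

Calling EventTrace.trace on a document with attributes is a quick way to check how your parser reports them.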

Depending on the event type, we can call different methods of XMLStreamReader to retrieve information about the event. For example, for START_ELEMENT we can call getName, getLocalName, hasName, getPrefix, isAttributeSpecified, getElementText and the various getAttributeXXX and getNamespaceXXX methods. For END_ELEMENT the list is less impressive because we cannot access information about attributes, and for CHARACTERS or CDATA we are essentially limited to the getTextXXX methods.
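
A few of the START_ELEMENT accessors can be sketched as follows. The class and method names (StartElementInfo, describeRoot, rootText) are hypothetical helpers written for this illustration; the reader methods they call are the standard ones named above.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StartElementInfo {

    /** Describes the root element as "{namespace}localName attr=value ...". */
    static String describeRoot(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag(); // advance to the root START_ELEMENT
            StringBuilder out = new StringBuilder();
            out.append('{').append(reader.getNamespaceURI()).append('}')
               .append(reader.getLocalName());
            // Namespace declarations are not counted as attributes here.
            for (int i = 0; i < reader.getAttributeCount(); i++) {
                out.append(' ').append(reader.getAttributeLocalName(i))
                   .append('=').append(reader.getAttributeValue(i));
            }
            return out.toString();
        } finally {
            reader.close();
        }
    }

    /** Returns the text content of a text-only root element. */
    static String rootText(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag();
            // Reads up to the matching END_ELEMENT; fails if a child element appears.
            return reader.getElementText();
        } finally {
            reader.close();
        }
    }
}
```

For elements that contain only text, getElementText can stand in for a hand-written loop over CHARACTERS and CDATA events.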

When there are no more events to process, we should close the reader to release any resources used by it:

reader.close();

Closing the reader does not close the underlying file or stream, so that has to be done separately.
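
One way to handle both is to let try-with-resources own the stream and close the reader in a finally block. The sketch below assumes the method is allowed to take ownership of the stream it is given; CloseBoth and countStartElements are illustrative names only.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CloseBoth {

    /** Counts START_ELEMENT events, then closes both the reader and the stream. */
    static int countStartElements(InputStream source)
            throws IOException, XMLStreamException {
        try (InputStream in = source) { // try-with-resources closes the stream
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                int count = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        count++;
                    }
                }
                return count;
            } finally {
                reader.close(); // releases parser resources, not the stream
            }
        }
    }
}
```

XMLStreamReader does not implement AutoCloseable, which is why the reader itself cannot go into the try-with-resources header.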

Example

To fully understand the flexibility of this approach let’s look at a simple XML document (created in my previous post):

<?xml version="1.0" encoding="UTF-8"?>
<!--Describes list of books-->
<books xmlns="http://example.com/books">
  <book language="English">
    <authors>
      <author>Mark Twain</author>
    </authors>
    <title><![CDATA[The Adventures of Tom Sawyer]]></title>
    <category>FICTION</category>
    <year>1876</year>
  </book>
  <book language="English">
    <authors>
      <author>Niklaus Wirth</author>
    </authors>
    <title><![CDATA[The Programming Language Pascal]]></title>
    <category>PASCAL</category>
    <year>1971</year>
  </book>
  <book language="English">
    <authors>
      <author>O.-J. Dahl</author>
      <author>E. W. Dijkstra</author>
      <author>C. A. R. Hoare</author>
    </authors>
    <title><![CDATA[Structured Programming]]></title>
    <category>PROGRAMMING</category>
    <year>1972</year>
  </book>
</books>

and at the code which parses it and returns a list of books:

package com.example.staxread;

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class BooksReader {

    public List<Book> readFromXML(InputStream is) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLStreamReader reader = null;
        try {
            reader = inputFactory.createXMLStreamReader(is);
            return readDocument(reader);
        } finally {
            if (reader != null) {
                reader.close();
            }
        }
    }

    private List<Book> readDocument(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            int eventType = reader.next();
            switch (eventType) {
                case XMLStreamReader.START_ELEMENT:
                    String elementName = reader.getLocalName();
                    if (elementName.equals("books"))
                        return readBooks(reader);
                    break;
                case XMLStreamReader.END_ELEMENT:
                    break;
            }
        }
        throw new XMLStreamException("Premature end of file");
    }

    private List<Book> readBooks(XMLStreamReader reader) throws XMLStreamException {
        List<Book> books = new ArrayList<>();
        
        while (reader.hasNext()) {
            int eventType = reader.next();
            switch (eventType) {
                case XMLStreamReader.START_ELEMENT:
                    String elementName = reader.getLocalName();
                    if (elementName.equals("book"))
                        books.add(readBook(reader));
                    break;
                case XMLStreamReader.END_ELEMENT:
                    return books;
            }
        }
        throw new XMLStreamException("Premature end of file");
    }

    private Book readBook(XMLStreamReader reader) throws XMLStreamException {
        Book book = new Book("", "", null, "", 0);
        book.setLanguage(reader.getAttributeValue(null, "language"));
        
        while (reader.hasNext()) {
            int eventType = reader.next();
            switch (eventType) {
                case XMLStreamReader.START_ELEMENT:
                    String elementName = reader.getLocalName();
                    if (elementName.equals("authors"))
                        book.setAuthors(readAuthors(reader));
                    else if (elementName.equals("title"))
                        book.setTitle(readCharacters(reader));
                    else if (elementName.equals("category"))
                        book.setCategory(readCategory(reader));
                    else if (elementName.equals("year"))
                        book.setYear(readInt(reader));
                    break;
                case XMLStreamReader.END_ELEMENT:
                    return book;
            }
        }
        throw new XMLStreamException("Premature end of file");
    }

    private List<String> readAuthors(XMLStreamReader reader) throws XMLStreamException {
        List<String> authors = new ArrayList<>();
        while (reader.hasNext()) {
            int eventType = reader.next();
            switch (eventType) {
                case XMLStreamReader.START_ELEMENT:
                    String elementName = reader.getLocalName();
                    if (elementName.equals("author"))
                        authors.add(readCharacters(reader));
                    break;
                case XMLStreamReader.END_ELEMENT:
                    return authors;
            }
        }
        throw new XMLStreamException("Premature end of file");

    }
    
    private String readCharacters(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder result = new StringBuilder();
        while (reader.hasNext()) {
            int eventType = reader.next();
            switch (eventType) {
                case XMLStreamReader.CHARACTERS:
                case XMLStreamReader.CDATA:
                    result.append(reader.getText());
                    break;
                case XMLStreamReader.END_ELEMENT:
                    return result.toString();
            }
        }
        throw new XMLStreamException("Premature end of file");
    }
    
    private Category readCategory(XMLStreamReader reader) throws XMLStreamException {
        String characters = readCharacters(reader);
        try {
            return Category.valueOf(characters);
        } catch (IllegalArgumentException e) {
            // Keep the original exception as the cause.
            throw new XMLStreamException("Invalid category " + characters, e);
        }
    }
    
    private int readInt(XMLStreamReader reader) throws XMLStreamException {
        String characters = readCharacters(reader);
        try {
            // parseInt returns a primitive int directly, avoiding needless boxing.
            return Integer.parseInt(characters);
        } catch (NumberFormatException e) {
            throw new XMLStreamException("Invalid integer " + characters, e);
        }
    }
}

The most important thing is that we create a separate method for reading each kind of XML element and call one method from another depending on the name of the element that has just started. The call structure thus mirrors the hierarchical structure of the XML document.
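
One caveat of this pattern: each method returns on the first END_ELEMENT it sees, so an unexpected child element (say, a hypothetical isbn inside book) would end the method too early. A possible remedy, sketched here under that assumption and not part of the original code, is a helper that consumes an unknown element together with everything nested in it. SkipUnknown, skipElement and readYear are names made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SkipUnknown {

    /** Expects the reader on START_ELEMENT; consumes up to the matching END_ELEMENT. */
    static void skipElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0 && reader.hasNext()) {
            int eventType = reader.next();
            if (eventType == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (eventType == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
        if (depth > 0) {
            throw new XMLStreamException("Premature end of file");
        }
    }

    /** Reads the year child of a root book element, ignoring any other children. */
    static String readYear(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag(); // move to the root START_ELEMENT
            String year = null;
            while (reader.hasNext()) {
                int eventType = reader.next();
                if (eventType == XMLStreamConstants.START_ELEMENT) {
                    if (reader.getLocalName().equals("year")) {
                        year = reader.getElementText();
                    } else {
                        skipElement(reader); // unknown child: consume it entirely
                    }
                } else if (eventType == XMLStreamConstants.END_ELEMENT) {
                    return year; // can only be the end of the root element
                }
            }
            throw new XMLStreamException("Premature end of file");
        } finally {
            reader.close();
        }
    }
}
```

Since every child is either read completely or skipped completely, the only END_ELEMENT the loop itself can see is the one that closes the parent.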

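When reading a book in the readBook method we extract the language attribute before we start iterating over events. This has to happen while the reader is still positioned on the START_ELEMENT event of the book element: in an ordinary parse, attributes are reported as part of START_ELEMENT rather than as separate ATTRIBUTE events, and the attribute accessors are no longer usable once we have moved on to the next event.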

Additionally, the parser may emit several CHARACTERS or CDATA events in a row, so we have to concatenate their text in the readCharacters method. It is possible to force the parser to emit a single event with the concatenated text by setting the javax.xml.stream.isCoalescing property on XMLInputFactory before creating the reader. The factory also supports several other properties which you may find useful.
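
The effect of the coalescing property can be observed with a small experiment. This is a sketch only; CoalescingDemo and textChunks are hypothetical names, and how many separate events appear without coalescing depends on the parser implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CoalescingDemo {

    /** Collects the text of every CHARACTERS or CDATA event as a separate chunk. */
    static List<String> textChunks(String xml, boolean coalescing)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // IS_COALESCING is the constant for "javax.xml.stream.isCoalescing";
        // it must be set before the reader is created.
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            List<String> chunks = new ArrayList<>();
            while (reader.hasNext()) {
                int eventType = reader.next();
                if (eventType == XMLStreamConstants.CHARACTERS
                        || eventType == XMLStreamConstants.CDATA) {
                    chunks.add(reader.getText());
                }
            }
            return chunks;
        } finally {
            reader.close();
        }
    }
}
```

With coalescing enabled, mixed content such as a<![CDATA[b]]>c inside one element should arrive as a single chunk; without it, the chunks still concatenate to the same text, which is what readCharacters relies on.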

Conclusion

The Streaming API for XML provides a very good, though still low-level, parser for reading XML documents. It is perfect when you want fine control over parsing or want to process only a few parts of the XML document. Additionally, it is much easier to use than a SAX parser and does not require building a DOM tree; instead, data can be read directly into your own data structures.

You can find the source code for this example on GitHub.

About Robert Piasecki

Husband and father, Java software developer, Linux and open-source fan.

1 Response to How to parse XML documents using Streaming API for XML (StAX)

  1. Kat says:

    Hi,

    Thank u for posting these. They were really helpful for my current task.
    I’d like to ask one question, why the readCharacter method, returns nothing if the line “XMLStreamReader.CHARACTERS:” is deleted for the line:

    case XMLStreamReader.CHARACTERS:
    case XMLStreamReader.CDATA:
    result.append(reader.getText());
    break;
    case XMLStreamReader.END_ELEMENT:
    return result.toString();
