JAVA: Resolving org.xml.sax.SAXParseException: Content is not allowed in prolog

Parsing an RSS feed can be tricky.  Your code has to gracefully handle all sorts of strange corner cases, everything from malformed XML to an unexpected byte sequence at the very start of the feed.  I recently worked on a problem that dealt with the latter: I was trying to parse an RSS feed in Java, and kept hitting an org.xml.sax.SAXParseException: Content is not allowed in prolog.  Strictly speaking, the XML prolog is everything before the root element, including the <?xml declaration itself; the parser raises this error when it finds stray content there, such as characters ahead of the opening <?xml declaration.  I dug into it a little further, and discovered that many UTF-8 encoded files begin with a three-byte UTF-8 byte-order mark (BOM).  When dealing with a UTF-8 encoded RSS feed, this three-byte pattern (0xEF 0xBB 0xBF) at the start of the document can cause all sorts of interesting XML parsing problems, including a SAXParseException: Content is not allowed in prolog.  The solution is to use a quick-and-dirty regular expression to clean up the start of the XML before feeding it into a parser.
First, I wanted to confirm my suspicion about the UTF-8 Byte-order mark.  I used wget to download the feed in question (http://www.hp.com/hpinfo/stories.xml) and opened it up using khexedit.  Sure enough, the first three bytes are EF BB BF:

[Image: rss-feed-extra-bytes.png, hex view of the feed showing the leading bytes EF BB BF]
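The same check can be done without a hex editor. Below is a minimal sketch in Java, assuming the feed has already been saved to a local file whose path is passed as the first argument. BomCheck and hasUtf8Bom are hypothetical names used for illustration, not part of any library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the original post.
final class BomCheck {

    /** Returns true if the given bytes begin with the UTF-8 BOM (EF BB BF). */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a feed saved locally, for example with wget.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasUtf8Bom(data)
            ? "Feed starts with a UTF-8 BOM"
            : "No UTF-8 BOM found");
    }
}
```

The `& 0xFF` masks are needed because Java bytes are signed, so 0xEF would otherwise compare as a negative number.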

Because these extra three bytes are present in the prolog, you might see an exception that looks something like this when trying to parse the XML:

Caused by: org.jdom.input.JDOMParseException: Error on line 1:
Content is not allowed in prolog.
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:851)
at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:178)
... 188 more
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
... 190 more
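To see the failure in isolation, here is a rough reproduction. It is only a sketch: it uses the JDK's built-in DocumentBuilder rather than the JDOM and ROME classes in the trace above, and it assumes the BOM has already been decoded into the character U+FEFF at the front of a String. PrologRepro and parseError are hypothetical names, and the feed content is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical reproduction for illustration; not part of the original post.
final class PrologRepro {

    /** Returns the parser's error message, or null if the XML parsed cleanly. */
    static String parseError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up minimal feed, used only to exercise the parser.
        String feed = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>";
        System.out.println("clean feed:    " + parseError(feed));
        System.out.println("BOM + feed:    " + parseError("\uFEFF" + feed));
    }
}
```

With the default JDK parser, the second call should report the same "Content is not allowed in prolog" condition, since the parser sees a non-whitespace character before the XML declaration.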

As I mentioned, a quick-and-dirty solution to this problem is to build a regular expression to strip off any junk in the prolog before feeding the XML into a parser.  Here's an example that strips off any non-word characters in the prolog:

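import java.util.regex.Matcher;
import java.util.regex.Pattern;

String xml = "<?xml ...";
// Drop any run of non-word characters (such as a BOM) that precedes the first '<'.
Matcher junkMatcher = Pattern.compile("^([\\W]+)<").matcher(xml.trim());
xml = junkMatcher.replaceFirst("<");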

String.replaceFirst(), available since Java 1.4, does the same thing a little more cleanly:

String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
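For anyone who wants to try the one-liner above on real input, here is a small self-contained sketch. PrologCleaner and stripPrologJunk are hypothetical names, and the sample feed is made up. The two test strings model a BOM decoded correctly as UTF-8 (the single character U+FEFF) and a BOM mis-decoded as ISO-8859-1 (the three characters U+00EF, U+00BB, U+00BF).

```java
// Hypothetical wrapper around the regular expression shown above.
final class PrologCleaner {

    /** Strips leading non-word characters that precede the first '<'. */
    static String stripPrologJunk(String xml) {
        return xml.trim().replaceFirst("^([\\W]+)<", "<");
    }

    public static void main(String[] args) {
        // Made-up minimal feed, used only for the demonstration.
        String feed = "<?xml version=\"1.0\"?><rss/>";

        String utf8Decoded = "\uFEFF" + feed;               // BOM read as UTF-8
        String latin1Decoded = "\u00EF\u00BB\u00BF" + feed; // BOM read as ISO-8859-1

        System.out.println(stripPrologJunk(utf8Decoded).equals(feed));   // true
        System.out.println(stripPrologJunk(latin1Decoded).equals(feed)); // true
        System.out.println(stripPrologJunk(feed).equals(feed));          // true
    }
}
```

Both cases work because Java's default \W matches anything outside [a-zA-Z_0-9], which includes all of those characters. The flip side is that junk containing letters or digits before the first '<' would not be removed by this pattern.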

Note that calling String.trim() on the XML isn't good enough on its own, because trim() only removes leading and trailing whitespace (characters at or below U+0020), and the byte-order mark isn't whitespace.  Once I got rid of the UTF-8 byte-order mark, my XML parser handled the feed with no issues.  Hope this helps someone else out there struggling with SAXParseExceptions and RSS feeds.
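The trim() limitation is easy to verify. The sketch below assumes the BOM has been decoded as the single character U+FEFF; TrimVsBom and stripLeadingBom are hypothetical names. It also shows a narrower alternative to the regular expression that removes only a leading U+FEFF and nothing else.

```java
// Hypothetical demonstration; not part of the original post.
final class TrimVsBom {

    static final char BOM = '\uFEFF';

    /** Removes one leading U+FEFF if present; otherwise returns the input unchanged. */
    static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // Made-up minimal feed with a BOM in front.
        String xml = BOM + "<?xml version=\"1.0\"?><rss/>";

        // trim() leaves the BOM in place because U+FEFF is above U+0020.
        System.out.println(xml.trim().charAt(0) == BOM);              // true

        // A targeted strip removes it.
        System.out.println(stripLeadingBom(xml).startsWith("<?xml")); // true
    }
}
```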




About this Entry

This page contains a single entry by Mark Kolich published on February 2, 2009 7:00 AM.
