Because these extra three bytes are present in the prolog, you might see an exception that looks something like this when trying to parse the XML:
Caused by: org.jdom.input.JDOMParseException: Error on line 1: Content is not allowed in prolog.
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:851)
    at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:178)
    ... 188 more
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
    ... 190 more
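If you want to confirm that a byte-order mark really is the culprit before changing anything, you can look at the first three bytes of the raw feed. The UTF-8 BOM is the byte sequence EF BB BF. Here is a minimal sketch of that check; the class and method names (BomCheck, hasUtf8Bom) are made up for illustration, and it assumes you have the undecoded bytes of the feed available and are on Java 7 or later for StandardCharsets:

```java
import java.nio.charset.StandardCharsets;

public class BomCheck {
    // The UTF-8 encoding of U+FEFF, the byte-order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the raw bytes begin with the UTF-8 byte-order mark. */
    public static boolean hasUtf8Bom(byte[] data) {
        if (data == null || data.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: the same document with and without a BOM.
        byte[] withBom = "\uFEFF<?xml version=\"1.0\"?><rss/>".getBytes(StandardCharsets.UTF_8);
        byte[] withoutBom = "<?xml version=\"1.0\"?><rss/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(hasUtf8Bom(withBom));    // true
        System.out.println(hasUtf8Bom(withoutBom)); // false
    }
}
```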
As I mentioned, a quick-and-dirty solution to this problem is to build a regular expression to strip off any junk in the prolog before feeding the XML into a parser. Here's an example that strips off any non-word characters that come before the opening < of the document:
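import java.util.regex.Matcher;
import java.util.regex.Pattern;

String xml = "<?xml ...";
// Match a run of non-word characters at the very start, followed by "<"
Matcher junkMatcher = Pattern.compile("^([\\W]+)<").matcher(xml.trim());
// Replace the junk (and the "<" it ran into) with a bare "<"
xml = junkMatcher.replaceFirst("<");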
String.replaceFirst(), also available as of Java 1.4, lets you do the same thing a little more cleanly:
String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
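The regex above removes every non-word character it finds before the opening <. If the BOM is the only junk you expect, a narrower alternative is to check for it explicitly. The sketch below assumes the feed was decoded as UTF-8, in which case the three BOM bytes show up in the String as the single character U+FEFF; if the bytes were decoded with some other charset, the junk will look different and this check won't match. The names BomStripper and stripBom are hypothetical, and String.isEmpty() needs Java 6 or later:

```java
public class BomStripper {
    // How a correctly decoded UTF-8 BOM appears inside a Java String.
    private static final char BOM = '\uFEFF';

    /** Removes a single leading U+FEFF, if present; otherwise returns the input unchanged. */
    public static String stripBom(String xml) {
        if (xml != null && !xml.isEmpty() && xml.charAt(0) == BOM) {
            return xml.substring(1);
        }
        return xml;
    }

    public static void main(String[] args) {
        // Hypothetical sample document with a leading BOM.
        String xml = "\uFEFF<?xml version=\"1.0\"?><rss/>";
        // trim() leaves the BOM alone because U+FEFF is above U+0020.
        System.out.println(xml.trim().charAt(0) == BOM);       // true
        System.out.println(stripBom(xml).startsWith("<?xml")); // true
    }
}
```

Unlike the regex, this leaves everything else in the document untouched, so it won't hide other kinds of junk in the prolog.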
Note that calling String.trim() on the XML isn't good enough, because trim() only strips leading and trailing whitespace (characters at or below U+0020), and the byte-order mark isn't one of them. Once I got rid of the UTF-8 byte-order mark, my XML parser handled the feed with no issues. Hope this helps someone else out there struggling with SAXParseExceptions and RSS feeds.

