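Does SAX fail to handle escaped entities for special characters correctly?
I have just started learning to parse XML with SAX and have run into two problems.
The complete program is as follows: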
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SimpeXmlHandler extends DefaultHandler {
    // Text delivered by the most recent characters() call (overwritten each time)
    private String str = null;
    // Name of the most recently opened element
    private String element = null;

    public void startElement(String namespaceURI, String localName,
            String fullName, Attributes attributes) throws SAXException {
        element = fullName;
        // Print the "id" attribute if the element has one
        for (int i = 0; i < attributes.getLength(); i++) {
            String qName = attributes.getQName(i);
            if (qName.equals("id")) {
                System.out.println("id=" + attributes.getValue(qName).trim());
                break;
            }
        }
    }

    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        // Note: this checks the last opened element, not the qName being closed
        if (str != null) {
            if (element.equalsIgnoreCase("title")) {
                System.out.println("title=" + str);
            } else if (element.equalsIgnoreCase("href")) {
                System.out.println("href=" + str);
            } else if (element.equalsIgnoreCase("content")) {
                System.out.println("content=" + str);
            }
        }
    }

    public void characters(char[] chars, int start, int length)
            throws SAXException {
        str = new String(chars, start, length).trim();
    }
}
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SimpleXmlTest {
    public static void main(String[] args) throws Exception {
        SimpeXmlHandler handler = new SimpeXmlHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        SAXParser parser = factory.newSAXParser();
        XMLReader xmlReader = parser.getXMLReader();
        xmlReader.setContentHandler(handler);
        InputSource source = new InputSource("config/sample.xml");
        xmlReader.parse(source);
    }
}
Running SimpleXmlTest on the following XML file:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<articles>
<article id="00001">
<title>titleValue</title>
<href>hrefValue</href>
<publishtime>timeValue</publishtime>
<content>contentValue</content>
<tag>0</tag>
</article>
</articles>
</root>
gives this result:
id=00001
title=titleValue
href=hrefValue
content=contentValue
I then changed the content of the XML file to the following:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<articles>
<article id="00001">
<title>titleValue</title>
<href>hrefValue</href>
<publishtime>timeValue</publishtime>
<content>start&gt;end</content>
<tag>0</tag>
</article>
</articles>
</root>
Running the program now gives this result:
id=00001
title=titleValue
href=hrefValue
content=end
Next, I changed the content of the XML file to the following:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<articles>
<article id="00001">
<title>titleValue</title>
<href>hrefValue</href>
<publishtime>timeValue</publishtime>
<content>start>end</content>
<tag>0</tag>
</article>
</articles>
</root>
Running the program again gives this result:
id=00001
title=titleValue
href=hrefValue
content=start>end
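It seems that SAX automatically converts "&gt;" into ">", and that this is what causes the wrong output.
Normally, using special characters such as ">", "<", and "&" directly in an XML file causes errors, so escaped entities such as "&gt;", "&lt;", and "&amp;" are used instead.
In my program, however, the behavior appears to be exactly the opposite: the escaped form gives the wrong result and the unescaped ">" gives the right one.
How can this be explained?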
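One way to investigate this is to record every characters() callback separately. The SAX ContentHandler contract allows a parser to deliver the text of a single element in several chunks, and an entity reference is a common place for a parser to split. The sketch below is only a diagnostic aid: the class name CharactersChunkDemo and the helper chunks() are made up for illustration, and how the text is split depends on the parser in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic class, not part of the original program.
class CharactersChunkDemo {

    // Parses the XML string and returns the text of each characters() call, in order.
    static List<String> chunks(String xml) throws Exception {
        final List<String> calls = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                calls.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        return calls;
    }

    public static void main(String[] args) throws Exception {
        // Escaped form: the parser may report the text in more than one chunk.
        System.out.println(chunks("<content>start&gt;end</content>"));
        // Literal ">" is legal in element content and may arrive as one chunk.
        System.out.println(chunks("<content>start>end</content>"));
    }
}
```

If the escaped form prints more than one chunk, then a handler that overwrites its field on every characters() call keeps only the last chunk, which would be consistent with the "content=end" output above. Whatever the split, the chunks concatenated together should always give "start>end".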
Separately, I changed the content of the XML file to the following:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<articles>
<article id="00001">
<title>titleValue</title>
<href>hrefValue</href>
<publishtime>timeValue</publishtime>
<content>contentValue</content>
<!-- <tag>0</tag> -->
</article>
</articles>
</root>
Running the test program then gives this result:
id=00001
title=titleValue
href=hrefValue
content=contentValue
content=
content=
content=
I cannot understand the last three "content=" lines.
I hope someone can clear this up for me. Thanks!
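A possible reading of both symptoms, offered as a sketch rather than a confirmed diagnosis: the handler keeps only the last characters() chunk, and endElement tests the last opened element instead of the element being closed. With the tag element commented out, "content" remains the last opened element, so the closing tags of article, articles, and root would each print "content=" followed by the trimmed whitespace. The variant below accumulates text in a StringBuilder and checks the qName passed to endElement. The class name BufferedXmlHandler, the lines list, and the parse() helper are hypothetical names introduced for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reworked handler; collects output lines instead of printing directly.
class BufferedXmlHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        text.setLength(0); // discard whitespace seen between elements
        String id = attributes.getValue("id");
        if (id != null) {
            lines.add("id=" + id.trim());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append rather than overwrite: text may arrive in several chunks.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Decide based on the element that is actually closing.
        if (qName.equalsIgnoreCase("title") || qName.equalsIgnoreCase("href")
                || qName.equalsIgnoreCase("content")) {
            lines.add(qName + "=" + text.toString().trim());
        }
        text.setLength(0);
    }

    // Convenience helper: parse an XML string and return the collected lines.
    static List<String> parse(String xml) throws Exception {
        BufferedXmlHandler handler = new BufferedXmlHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><articles><article id=\"00001\">"
                + "<title>titleValue</title><href>hrefValue</href>"
                + "<content>start&gt;end</content>\n<!-- <tag>0</tag> -->\n"
                + "</article></articles></root>";
        for (String line : parse(xml)) {
            System.out.println(line);
        }
    }
}
```

This relies on the default, non-namespace-aware SAXParserFactory, where qName carries the plain tag name, as in the original program.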