CS639 Class 12
Midterm Tues. Apr. 2
Pa2 due Sun, Mar 24, end of break
Tomcat working at home? Everyone present said yes.
Working on reading XML, Chap. 6
Handling Attributes in SAX
Attributes are delivered with the startElement callback
public void startElement(String uri,
                         String localName,
                         String qName,
                         Attributes attributes)
    throws SAXException
SAX's startElement() has a parameter of type Attributes (pg. 279)
package org.xml.sax;
public interface Attributes
{
public int getLength ();
public String getQName(int index);
public String getURI(int index);
public String getLocalName(int index);
public int getIndex(String uri, String localPart);
public int getIndex(String qualifiedName);
public String getType(String uri, String localName); // <- the promised type for an attribute
public String getType(String qualifiedName);
public String getType(int index);
public String getValue(String uri, String localName); // <- the one we usually want
public String getValue(String qualifiedName);
public String getValue(int index);
}
This Attributes object contains a set of attributes of no particular order (an element has a set of attributes)
Still, individual attributes can be accessed by an index number in this representation, a somewhat surprising system.
We can use public String getValue(String uri, String localName) to search for the attributes we want, and get their values. This is foolproof, since the URI is a unique id of the NS.
One job of a parser is determining types of attribute values—see getType above. Clearly the parser needs access to a DTD or schema to do this right. It reports CDATA if it doesn’t know. See possible types on pg. 34. If an enumeration, the type is reported as NMTOKEN.
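To see the getType behavior described above, here is a minimal sketch, assuming the JDK's built-in parser obtained through SAXParserFactory. The class name AttributeTypes and the tiny item documents are made up for illustration. It walks the attributes by index and records the type string reported for each one, once without a DTD and once with an internal DTD that declares an enumeration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not from the book.
class AttributeTypes {
    /** Maps each attribute's qualified name to the type string the parser reports. */
    static Map<String, String> typesOf(String xml) {
        Map<String, String> types = new LinkedHashMap<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // The index is only a way to iterate; the order carries no meaning.
                    for (int i = 0; i < atts.getLength(); i++) {
                        types.put(atts.getQName(i), atts.getType(i));
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return types;
    }

    public static void main(String[] args) {
        String noDtd = "<item size='big'/>";
        String withDtd = "<!DOCTYPE item [<!ELEMENT item EMPTY>"
                + "<!ATTLIST item size (small|big) #REQUIRED>]><item size='big'/>";
        System.out.println("no DTD:   " + typesOf(noDtd));
        System.out.println("with DTD: " + typesOf(withDtd));
    }
}
```

Without a DTD the parser has nothing to go on and should report CDATA; with the enumeration declared it should report NMTOKEN, as stated above.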
Recall from last week that an attribute usually belongs to no namespace and has no prefix, but is specified in the schema that goes with its element, unless it is a "global attribute" like those used by XInclude, that can be attached to any element, and does have a prefix, and is in its namespace. Clear?
When we use a global attribute, we always use a prefix. Recall that a default namespace only affects unprefixed element names.
Recall that XLink is a standard for linking an xml document element to other internet resources (like an html link). Here is an example from page 281. As in the case of the example from pg. 31, we can drop the xlink:type attribute since it turns out not to be in use today (at least in SVG)
(In this example, the xlink:type="simple" line is the crossed-out part that can be dropped.)
<magazine xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:type="simple"
xlink:href="http://www.thenation.com/">
The Nation
</magazine>
Element name magazine is defined in some application namespace, and the magazine element has an XLink to that magazine’s website.
xmlns:xlink = … shows the URI for the XLink namespace, and its prefix (xlink) in this doc
href is a name in the XLink namespace, (and a global attribute by the XLink schema)
xlink:href is a qualified name, or qname
the attribute value is http://www.thenation.com/, the URL we want to follow
Example 6.9, page 281, is a spider program that crawls an XML document's XLinks, and so must find and process the href attributes in the document, then any href attributes in the linked document, and so on, assuming the linked documents are in XML.
How should one go about finding the links, i.e. those particular attributes? The first thought is to search for attributes with qualified name "xlink:href", but the prefix is arbitrary (except in cases ruled by DTDs), and the local name "href" might belong to another namespace
For a certain element, there is only one instance of an attribute with a specific local name and a specific URI—that’s the best way to identify what we want. We want to determine the string value of this attribute, to get the URL to follow in the spider operation.
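A minimal sketch of that lookup (not the book's spider), assuming the XLink namespace URI is http://www.w3.org/1999/xlink and using the JDK parser via SAXParserFactory. The class name XLinkCollector is made up. The test document deliberately uses the prefix xl instead of xlink, and also carries a plain href attribute in no namespace, to show why matching on URI plus local name is the safe choice.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not the spider of Ex. 6.9.
class XLinkCollector {
    // Assumed XLink namespace URI.
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Returns the values of all href attributes that are in the XLink namespace. */
    static List<String> hrefs(String xml) {
        List<String> found = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for URI/local-name lookups
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // Look up by namespace URI + local name, never by prefix.
                    String href = atts.getValue(XLINK_NS, "href");
                    if (href != null) {
                        found.add(href);
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<magazine xmlns:xl='" + XLINK_NS + "'"
                + " xl:href='http://www.thenation.com/' href='not-an-xlink'>"
                + "The Nation</magazine>";
        System.out.println(hrefs(xml));
    }
}
```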
The spider of Ex. 6.9, pg 281, puts the URLs it collects in a queue to remember them for later processing
In the endDocument() method there is an attempt to dequeue a URL from the queue. If it is successful, that document is parsed.
The code in endDocument is very ugly, because after processing a few docs you have a stack of calls to the same parser object. But we’re not interested in rewriting this program.
Receiving characters: you need to append pieces of text together
Note: SAX may call characters several times to deliver the contents of a single node. We should collect the text into a StringBuffer or other container that is specific to a given part of the document tree.
The Example on page 233 is OK because it outputs all its characters which naturally appends them to their predecessors
Ex 6.10 on pages 285-287 is an example of this character buffering. It’s reading the message on pg. 142 with …<double>55</double>…
startElement: if element name is "double" set up a new StringBuffer
characters: if buffer != null
buffer.append(text, start, length)
endElement: if element is "double" convert buffer to result and make buffer null. buffer serves double duty as a storage space and a flag
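The three steps above can be sketched as follows. This is not the code of Ex. 6.10, only a small stand-in under the same idea: the class name DoubleReader is made up, it uses StringBuilder rather than StringBuffer, and it keeps only the last double element it sees.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the buffering pattern, not the book's example.
class DoubleReader {
    static class Handler extends DefaultHandler {
        private StringBuilder buffer;   // null means "not inside <double>"
        double result = Double.NaN;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (localName.equals("double")) {
                buffer = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] text, int start, int length) {
            // May be called several times for one text node, so append.
            if (buffer != null) {
                buffer.append(text, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (localName.equals("double") && buffer != null) {
                result = Double.parseDouble(buffer.toString().trim());
                buffer = null;          // buffer doubles as the "inside" flag
            }
        }
    }

    /** Returns the value of the last double element, or NaN if there is none. */
    static double parseDouble(String xml) {
        Handler handler = new Handler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.result;
    }

    public static void main(String[] args) {
        System.out.println(parseDouble("<params><double>55</double></params>"));
    }
}
```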
We can skip the processing instruction and namespace sections of Chapter 6
What the ContentHandler Doesn't Tell You (page 303)
The ContentHandler interface gives you most of the information you really need from a document. Some of its omissions are handled by other callback interfaces
Some of the things the ContentHandler does not deal with but are available through other interfaces
- comments, unskipped entities and CDATA sections, all of which are available through the LexicalHandler interface
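Note: the text of CDATA sections is still delivered by characters(). The LexicalHandler's startCDATA() and endCDATA() callbacks additionally let you know which text came from CDATA sections.
- ELEMENT, ATTLIST and parsed ENTITY declarations from the DTD, all of which are reported through the DeclHandler interface (DTDHandler itself reports only notations and unparsed entities)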
- validity errors and other nonfatal errors which are reported through the ErrorHandler interface
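As an illustration of the LexicalHandler item in the list above, here is a small sketch. It assumes the JDK parser accepts the standard http://xml.org/sax/properties/lexical-handler property, and it extends DefaultHandler2 (from org.xml.sax.ext) so that only the callbacks of interest need to be written. The class names LexicalDemo and Collector are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class.
class LexicalDemo {
    /** Collects comment text and the text that arrived inside CDATA sections. */
    static class Collector extends DefaultHandler2 {
        final StringBuilder comments = new StringBuilder();
        final StringBuilder cdataText = new StringBuilder();
        private boolean inCdata = false;

        @Override
        public void comment(char[] ch, int start, int length) {
            comments.append(ch, start, length);
        }

        @Override
        public void startCDATA() { inCdata = true; }

        @Override
        public void endCDATA() { inCdata = false; }

        @Override
        public void characters(char[] ch, int start, int length) {
            // CDATA text still arrives here; the flag tells us where it came from.
            if (inCdata) {
                cdataText.append(ch, start, length);
            }
        }
    }

    static Collector scan(String xml) {
        Collector collector = new Collector();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(collector);
            // A LexicalHandler is registered as a property, not with a setter.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", collector);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return collector;
    }

    public static void main(String[] args) {
        Collector c = scan("<a><!-- note --><![CDATA[x < y]]>tail</a>");
        System.out.println("comments: [" + c.comments + "]");
        System.out.println("CDATA text: [" + c.cdataText + "]");
    }
}
```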
Some of the things the ContentHandler does not deal with, and that are not dealt with by any part of SAX (i.e., SAX 2 = book version = Java 6 version)
- the version, encoding attribute from the XML declaration
- insignificant white space in tags and before and after the root element
- the order of attributes
- the type of quotes that surround attributes
- character references
- prenormalized attribute values (see p. 280 on normalization)
- whether an attribute was specified in the instance document or defaulted from a DTD or schema (see more below)
- whether empty elements are represented as <name></name> or <name />
- What XML Schema type goes with various attribute values and element contents (not mentioned in Harold)
The only common use case for most of the above information would be an XML editor.
Correction: The last point was true until Java 5, when a new schema-handling package was added to the JDK related to the newer DOM release known as DOM3. This new schema support was also made to fit in with SAX. See http://www.ibm.com/developerworks/library/x-javaxmlvalidapi/
See validate-ns/TypeLister.java for a program from that article, slightly modified to work with our examples. It prints out XML schema type names where available for both element content and attribute values (in my version). Anonymous types don’t have type names, so for example, book of book5.xml gets output as
b:book: typename #AnonType_book, typename ns: http://schemas.cs.umb.edu/book.
Example of an attribute default, using DTD, pg 36:
rate CDATA "0.0"
sets a default of “0.0” for rate, and type CDATA.
The corresponding XML Schema is on pg.39:
<xsd:attribute name="rate" type="xsd:decimal"/>
setting no default, but we can add a default easily, to be
<xsd:attribute name="rate" default="0.0"
type="xsd:decimal"/>.
However, the type is still reported as CDATA via Attributes, i.e., it's mapped into the DTD type system.
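For the DTD case, a short sketch of how the default shows up in a handler. It assumes the JDK parser, which reads the internal DTD subset even when validation is off. The class name DefaultedAttribute and the item element are made up. The element is written without a rate attribute, yet the handler still finds one.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class.
class DefaultedAttribute {
    /** Returns the value of the rate attribute seen on the root element, or null. */
    static String rateOf(String xml) {
        String[] rate = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (rate[0] == null) {
                        // Nothing here says whether the value was typed in or defaulted.
                        rate[0] = atts.getValue("rate");
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return rate[0];
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE item [<!ELEMENT item EMPTY>"
                + "<!ATTLIST item rate CDATA \"0.0\">]><item/>";
        System.out.println("rate = " + rateOf(xml));
    }
}
```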
Chap. 7: more on SAX: mainly additional exceptions, features (both needed to capture validation problems)
Handling the existence of multiple parsers--we just use JDK's
Input for parsing: by string URL or "InputSource"…
An InputSource object is used by both parse() methods of a parser: one takes it explicitly, the other creates one under the covers when given a string.
Note: this class has a “messy” API: obscure rules about what you can call or should call. Not great software engineering.
Example 7.1, pg. 310. The SAX InputSource class (there are missing semi-colons in book interface def.)
package org.xml.sax;
public class InputSource
{
public InputSource();
public InputSource(String systemID);
public InputSource(InputStream byteStream);
public InputSource(Reader characterStream);
public void setPublicId(String publicID);
public String getPublicId();
public void setSystemId(String systemID);
public String getSystemId();
public void setByteStream(InputStream byteStream);
public InputStream getByteStream();
public void setEncoding(String encoding);
public String getEncoding();
public void setCharacterStream(Reader characterStream);
public Reader getCharacterStream();
}
A SystemID is a URL, and should be set to resolve relative URLs such as in DOCTYPEs or XLink hrefs.
An InputSource can be constructed from a string ("System ID" i.e. URL), an InputStream, or a Reader
InputStream is at the top of the byte input hierarchy of Java i/o.
Reader is at the top of the Unicode text input hierarchy of Java i/o.
Rules for InputSource use:
1. Don't set both the bytestream and Unicode text input objects on an InputSource object: it uses one or the other for actual input data.
2. Do set the System ID if what was passed to the constructor was a Unicode or byte stream object, at least if your XML has any relative URLs in it.
If an InputSource has both an InputStream and a System ID, the InputStream is used for the input XML data and the System ID is used to resolve relative URLs. Similarly if an InputSource has both a Reader and a System ID.
Relative URLs in XML
For example, the beginning of book1.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE book SYSTEM "book.dtd">
<book>
A parser would not be able to find the dtd, which is given as a relative URL, without the System ID information, though the parser could still parse the document.
Another example: schemaLocation="book.xsd".
Another way of dealing with the above problem is to use an absolute URL for the SYSTEM value, such as:
http://www.umb.edu/cs639/validate/book.dtd
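To make rule 2 for InputSource concrete, here is a sketch that feeds the parser from a Reader and sets a System ID on the InputSource. The base URL http://example.org/docs/book1.xml is invented, and so is the class name SystemIdDemo. An EntityResolver is installed only to observe which absolute URL the parser asks for when it meets the relative "book.dtd", and to hand back an empty stand-in DTD so nothing is fetched from the network.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical demo class.
class SystemIdDemo {
    /** Returns the system ID the parser requested for the DTD, given a base system ID. */
    static String resolvedDtd(String baseSystemId) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE book SYSTEM \"book.dtd\"><book/>";
        String[] requested = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setEntityResolver((publicId, systemId) -> {
                requested[0] = systemId;
                // Empty stand-in DTD, so no network access is needed.
                return new InputSource(new StringReader(""));
            });
            InputSource in = new InputSource(new StringReader(xml));
            in.setSystemId(baseSystemId); // base for resolving relative URLs
            reader.parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return requested[0];
    }

    public static void main(String[] args) {
        System.out.println(resolvedDtd("http://example.org/docs/book1.xml"));
    }
}
```

With the System ID set, the relative DTD reference should come out resolved against the document's own URL.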
Errors and Exceptions in SAX
There are three levels (pg 315)
- Fatal Error - a well-formedness error. As soon as the parser detects this type of error, it must throw a SAXParseException and stop parsing.
- Error - an error, but not a well-formedness error. The most common variety is a validation error. If a parser detects one of these errors, it may or may not throw a SAXException, and it may or may not continue parsing. In the case of a validation error, it will be able to proceed.
- Warning - not an error in itself, but it may indicate a mistake of some kind in the document.
NOTE: Validity errors don’t cause a throw themselves. We’ll soon see how to get info on them.
SAXParseException is a subclass of SAXException. Previously we saw catch (SAXException e)…after the call to parse, but the actual thrown exception object is of the subclass type SAXParseException in the case of a well-formedness error. Of course this catch catches all subclasses, so it works fine to gain control.
The only kind of XML problem the parser is guaranteed to tell you about through an exception is a well-formedness error
To be informed of other problems, you need to implement the ErrorHandler interface and register this class with the XMLReader
Nested Exceptions
Example 7.4. The SAXException class
package org.xml.sax;
public class SAXException extends Exception
{
public SAXException();
public SAXException(String message);
public SAXException(Exception rootCause);
public SAXException(String message, Exception e);
public String getMessage();
public Exception getException();
public String toString();
}
The only kind of exception you can throw in a ContentHandler interface method is a SAXException, but this may not be the kind you want to throw
In this case you wrap the exception you want to throw inside a SAXException and throw the SAXException instead. This can be done because one of the constructors for a SAXException takes an exception as an argument. You can then use the getException() method of this object to retrieve the original exception.
For example (pg. 317), if we wanted to throw an InvalidKeyException in a handler, the code that catches this exception might look like this:
catch (SAXException e) {
    Exception rootCause = e.getException(); // or more currently: = e.getCause();
    if (rootCause == null) {
        // handle it as an XML problem...
    }
    else {
        if (rootCause instanceof InvalidKeyException) {
            InvalidKeyException ike = (InvalidKeyException) rootCause;
            throw ike;
        }
        else if (rootCause instanceof SomeOtherException) {
            SomeOtherException soe = (SomeOtherException) rootCause;
            throw soe;
        }
        …
    }
}
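These days, we would use getCause() rather than getException() to retrieve a nested exception, but SAXException was defined before getCause() became standard. You can use either method to get it (note that getCause() returns a Throwable, so a cast is needed to assign it to an Exception variable).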
You can have many levels of nested exceptions if you keep catching and throwing the original exception
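Here is a runnable sketch of the whole round trip: a handler wraps an InvalidKeyException in a SAXException, and the code around parse() unwraps it again. The class name WrappedExceptionDemo and the trigger (an element named key) are made up for illustration, and the JDK parser is assumed.

```java
import java.io.StringReader;
import java.security.InvalidKeyException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class.
class WrappedExceptionDemo {
    /** Returns the exception nested in the SAXException thrown by parse, or null. */
    static Exception rootCauseOf(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts)
                        throws SAXException {
                    if (localName.equals("key")) {
                        // Only SAXException may leave this method, so wrap.
                        throw new SAXException(new InvalidKeyException("bad key"));
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return null; // parsed without any exception
        } catch (SAXException e) {
            return e.getException(); // null when it was a plain XML problem
        } catch (Exception e) {
            throw new IllegalStateException("unexpected failure", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootCauseOf("<doc><key/></doc>"));
        System.out.println(rootCauseOf("<doc/>"));
    }
}
```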
The SAXParseException has valuable line # and column # information for parse problems with XML files
Example 7.5. The SAXParseException class
package org.xml.sax;
public class SAXParseException extends SAXException
{
public SAXParseException(String message, Locator locator);
public SAXParseException(String message, Locator locator, Exception e);
public SAXParseException(String message, String publicID,
                         String systemID, int lineNumber, int columnNumber);
public SAXParseException(String message, String publicID,
                         String systemID, int lineNumber, int columnNumber,
                         Exception e);
public String getPublicId();
public String getSystemId();
public int getLineNumber(); // <- very useful info
public int getColumnNumber();
}
The line number and column numbers the parser reports may not always be perfectly accurate, but they are very useful anyway. Recall Counter reports with line and column number info.
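A small sketch of pulling the location out of a SAXParseException, assuming the JDK parser. The class name ParseErrorLocation is made up. A DefaultHandler is registered as the ErrorHandler because its fatalError simply rethrows, which keeps the parser from printing its own message first.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class.
class ParseErrorLocation {
    /** Returns the line number of the first well-formedness error, or -1 if none. */
    static int errorLine(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // DefaultHandler.fatalError rethrows the exception it is given.
            reader.setErrorHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return -1;
        } catch (SAXParseException e) {
            // getColumnNumber(), getSystemId() and getMessage() are also available.
            return e.getLineNumber();
        } catch (Exception e) {
            throw new IllegalStateException("unexpected failure", e);
        }
    }

    public static void main(String[] args) {
        // The mismatched end tag is on the third line of this document.
        System.out.println(errorLine("<a>\n<b>\n</a>"));
    }
}
```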
ErrorHandler Interface: needed for non-fatal error detection, including validation errors
Throwing an exception aborts the parsing process, but not all problems in an XML document necessarily require such a radical step
In particular, validity errors are not signaled by a SAXException exception being thrown, because that would stop parsing. Instead, the ErrorHandler allows us to look at the exception object as just a method argument, not a thrown exception.
If you want your program to be informed of nonfatal errors, then you must register an ErrorHandler object with the XMLReader
Example 7.7. The ErrorHandler interface
package org.xml.sax;
public interface ErrorHandler
{
public void warning(SAXParseException exception)
    throws SAXException;
public void error(SAXParseException exception)
    throws SAXException;
public void fatalError(SAXParseException exception)
    throws SAXException;
}
Use parser.setErrorHandler(handler)--see code on pg. 323
XMLReader ErrorHandler
The DefaultHandler is a convenience class that implements (see pg. 884) the four required SAX callback interfaces:
(they are required for the parser implementation, not required to be used by a program)
- EntityResolver
- DTDHandler
- ContentHandler
- ErrorHandler
We can use this class as the base object of our own handlers, so we don't have to provide do-nothing methods to satisfy the API.
Ex 7.8 on page 322 sets up an ErrorHandler but, as it stands, doesn't turn on validity checking, so we won't get a report of validity problems. (It does report namespace prefix problems, as shown on pg. 324.)
Note that its fatalError method does not itself throw, but its caller does throw after fatalError returns, in order to fulfill the rule about fatal errors aborting parsing.
To get validation errors, we need to turn a "feature" on, as explained next. We could add this to Ex. 7.8 by adding
parser.setFeature("http://xml.org/sax/features/validation", true);
before parse is called. This turns on DTD validation. To turn on XML Schema validation explicitly (it may already be on by default, see pg. 348), also set the Xerces-specific feature:
parser.setFeature("http://apache.org/xml/features/validation/schema", true);
See Ex. 7.9, pg 333 showing the call for DTD validation.
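Putting the ErrorHandler and the validation feature together, here is a minimal sketch (not Ex. 7.8 or 7.9 themselves), assuming the JDK parser. The class name ValidationDemo and the tiny book DTD are made up. Validity errors arrive as arguments to error(), nothing is thrown for them, and parsing runs to the end.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class.
class ValidationDemo {
    /** Parses with DTD validation on and returns the validity errors reported. */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setFeature("http://xml.org/sax/features/validation", true);
            reader.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Record and return: the parser keeps going after this call.
                    errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE book [<!ELEMENT book (title)>"
                + "<!ELEMENT title (#PCDATA)>]>";
        System.out.println(validityErrors(dtd + "<book><title>ok</title></book>"));
        for (String message : validityErrors(dtd + "<book><author>x</author></book>")) {
            System.out.println(message);
        }
    }
}
```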
SAX Features (boolean on or off) and Properties (various string values)
Counter.java uses SAX to parse a document and check well-formedness. If given the flag -v, it will do DTD validation; with -v -s it will do schema validation.
So we can look into it and see how this is done.
Counter.java snippets:
parser = XMLReaderFactory.createXMLReader();
…
Here is the explicit validation setting:
if '-v', then parser.setFeature("http://xml.org/sax/features/validation", true) (the name of the validation feature). See pp. 325-326.
if '-s', then parser.setFeature("http://apache.org/xml/features/validation/schema", true) (a Xerces-specific feature). See pg. 348.
parser.setContentHandler(counter); (counter is a new Counter() object). So Counter IS-A ContentHandler.
All these happen in main.
We look for startElement now. (we must see fElements++; etc, just stats).
As covered previously, validation is a feature that can be turned on. Example 7.9 on page 331 is basically the same as our Counter program. It sets up an ErrorHandler and uses the following line to enable DTD validation:
parser.setFeature("http://xml.org/sax/features/validation", true);
Validation is the most important feature. For XML Schema validation, we need to set an additional Xerces feature, as discussed on pg. 348.
Features can be turned on or off through setFeature() on the parser object, i.e., features are Boolean.
On the other hand, Properties are objects, not boolean variables
Both features and properties are named by absolute URIs.
Features and properties can be read-only, write-only or (rarely) read-write
Standard features that are supported by multiple parsers have names that begin with
http://xml.org/sax/features/
Different parsers also support non-standard, custom features. The names of these features begin with URLs somewhere in the parser vendor's domain
An interesting property is the xml-string, formally http://xml.org/sax/properties/xml-string. It contains the string of text that triggered the current SAX event. This can be used to echo most of the content of an XML document. See pg. 335.
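Since not every parser knows every feature or property name, it helps to probe before relying on one. Here is a sketch, assuming the JDK parser. The class name FeatureProbe and the example.org feature URI are made up. The same pattern works for getProperty(), for instance to check whether xml-string is available.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical demo class.
class FeatureProbe {
    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no parser available", e);
        }
    }

    /** Describes the state of a feature: "true", "false", or why it cannot be read. */
    static String describe(XMLReader reader, String featureUri) {
        try {
            return String.valueOf(reader.getFeature(featureUri));
        } catch (SAXNotRecognizedException e) {
            return "not recognized"; // the parser has never heard of this name
        } catch (SAXNotSupportedException e) {
            return "not supported";  // known name, but cannot be read right now
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        String[] names = {
            "http://xml.org/sax/features/namespaces",
            "http://xml.org/sax/features/validation",
            "http://example.org/no-such-feature" // invented name
        };
        for (String name : names) {
            System.out.println(name + " -> " + describe(reader, name));
        }
    }
}
```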
We are skipping Chap 8 on SAX Filters, so on to Chap. 9 on the DOM
But first, back to Chapter 5 for the intro program. (Chapter 5 introduces both SAX and DOM.)
Ex. 5.5, Pg 235-237 -> First DOM Example, see tree of objects in use, similarity to TestXPath code.
NOTE: This program uses class DOMImplementationImpl (not to be confused with DOMImplementation), also XMLSerializer, not in JDK.
Telltale sign of non-JDKness: look at the imports of “org.apache.xerces.parsers”, vs on page 239 “javax.xml.parsers”.
So skip to pg. 239 for the second version, which does work under JDK. The Transformer part is further explained in pg. 484 as trivial “transformation”, i.e., no real change.
With DOM we can build XML representation in terms of objects as well as read XML files. And use XPath with the object tree.
Example 5.6 on page 239 is a program that uses JAXP, which is in the JDK. We'll look at it next time.