3/2/2006

CS639 Class 12

Midterm Tues. Apr. 2

Pa2 due Sun, Mar 24, end of break

Tomcat working at home? Everyone present said yes.

Working on reading XML, Chap. 6

Handling Attributes in SAX

Attributes are delivered with the startElement callback

public void startElement(String uri,

String localName,

String qName,

Attributes attributes)

throws SAXException

SAX's startElement() has a parameter of type Attributes (pg. 279)

package org.xml.sax;

public interface Attributes

{

public int getLength ();

public String getQName(int index);

public String getURI(int index);

public String getLocalName(int index);

public int getIndex(String uri, String localPart);

public int getIndex(String qualifiedName);

public String getType(String uri, String localName); // <-- the promised type for an attribute

public String getType(String qualifiedName);

public String getType(int index);

public String getValue(String uri, String localName); // <-- the one we usually want

public String getValue(String qualifiedName);

public String getValue(int index);

}

This Attributes object contains a set of attributes in no particular order (an element has a set of attributes, not a sequence).

Still, individual attributes can be accessed by an index number in this representation, a somewhat surprising system.

We can use public String getValue(String uri, String localName) to search for the attributes we want, and get their values. This is foolproof, since the URI is a unique id of the NS.

One job of a parser is determining types of attribute values—see getType above. Clearly the parser needs access to a DTD or schema to do this right. It reports CDATA if it doesn’t know. See possible types on pg. 34. If an enumeration, the type is reported as NMTOKEN.
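To see the type reporting in action, here is a minimal sketch (not from the book). It assumes the JDK's built-in parser, obtained through SAXParserFactory, and a made-up document whose internal DTD subset declares an ID attribute and an enumerated attribute; the class name AttributeTypeLister and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Records the type the parser reports for every attribute it sees. */
class AttributeTypeLister extends DefaultHandler {
    // Hypothetical sample: the internal DTD subset declares id and size, but not note.
    static final String SAMPLE =
        "<!DOCTYPE item [\n"
      + "  <!ELEMENT item (#PCDATA)>\n"
      + "  <!ATTLIST item id ID #REQUIRED\n"
      + "                 size (small|large) 'small'>\n"
      + "]>\n"
      + "<item id='i1' size='large' note='undeclared'>text</item>";

    final Map<String, String> types = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Walk the attributes by index and ask for each one's reported type.
        for (int i = 0; i < atts.getLength(); i++) {
            types.put(atts.getQName(i), atts.getType(i));
        }
    }

    static Map<String, String> typesOf(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        AttributeTypeLister lister = new AttributeTypeLister();
        parser.setContentHandler(lister);
        parser.parse(new InputSource(new StringReader(xml)));
        return lister.types;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(typesOf(SAMPLE));
    }
}
```

The declared ID attribute should come back as ID and the undeclared one as CDATA; per the rule above, the enumerated size attribute is expected to be reported as NMTOKEN.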

Recall from last week that an attribute usually belongs to no namespace and has no prefix, but is specified in the schema that goes with its element, unless it is a "global attribute" like those used by XInclude, that can be attached to any element, and does have a prefix, and is in its namespace. Clear?

When we use a global attribute, we always use a prefix. Recall that a default namespace only affects unprefixed element names.

Recall that XLink is a standard for linking an xml document element to other internet resources (like an html link). Here is an example from page 281. As in the case of the example from pg. 31, we can drop the xlink:type attribute since it turns out not to be in use today (at least in SVG)

<magazine xmlns:xlink="http://www.w3.org/1999/xlink"

~~xlink:type="simple"~~ xlink:href="http://www.thenation.com/"> <!-- crossed-out part needs to be dropped -->

The Nation

</magazine>

Element name magazine is defined in some application namespace, and the magazine element has an XLink to that magazine’s website.

xmlns:xlink = … shows the URI for the XLink namespace, and its prefix (xlink) in this doc

href is a name in the XLink namespace, (and a global attribute by the XLink schema)

xlink:href is a qualified name, or qname

the attribute value is http://www.thenation.com/, the URL we want to follow

Example 6.9, page 281, is a spider program that crawls an xml document's XLinks, and so must find and process the href attributes in the document, then any href attributes in the linked document, and so on, assuming the linked documents are in XML.

How should one go about finding the links, i.e. those particular attributes? The first thought is to search for attributes with qualified name "xlink:href", but the prefix is arbitrary (except in cases ruled by DTDs), and the local name "href" might belong to another namespace.

For a certain element, there is only one instance of an attribute with a specific local name and a specific URI—that’s the best way to identify what we want. We want to determine the string value of this attribute, to get the URL to follow in the spider operation.
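A minimal sketch of this lookup (not the book's Example 6.9): it assumes the standard XLink namespace URI http://www.w3.org/1999/xlink and a namespace-aware parser from the JDK. The class name XLinkFinder and its links list are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the value of every XLink href attribute, whatever prefix the document uses. */
class XLinkFinder extends DefaultHandler {
    // Assumption: the standard XLink namespace URI.
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Look the attribute up by (namespace URI, local name), never by qualified name.
        String href = atts.getValue(XLINK_NS, "href");
        if (href != null) {
            links.add(href);
        }
    }

    static List<String> findLinks(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // required for URI/local-name lookups
        XMLReader parser = factory.newSAXParser().getXMLReader();
        XLinkFinder finder = new XLinkFinder();
        parser.setContentHandler(finder);
        parser.parse(new InputSource(new StringReader(xml)));
        return finder.links;
    }
}
```

Because the lookup ignores the prefix, a document that binds the XLink namespace to x: instead of xlink: is handled the same way, and an unprefixed href attribute (in no namespace) is not mistaken for a link.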

The spider of Ex. 6.9, pg 281, puts the URLs it collects in a queue to remember them for later processing

In the endDocument() method there is an attempt to dequeue a URL from the queue. If it is successful, that document is parsed.

The code in endDocument is very ugly, because after processing a few docs you have a stack of calls to the same parser object. But we’re not interested in rewriting this program.

Receiving characters: you need to append pieces of text together

Note: SAX may call characters() several times to deliver the contents of a single text node. We should collect the text into a StringBuffer or other container that is specific to a given part of the document tree.

The Example on page 233 is OK because it outputs all its characters which naturally appends them to their predecessors

Ex 6.10 on pages 285-287 is an example of this character buffering. It’s reading the message on pg. 142 with …<double>55</double>…

startElement: if element name is "double" set up a new StringBuffer

characters: if buffer != null

buffer.append(text, start, length)

endElement: if element is "double" convert buffer to result and make buffer null. The buffer serves double duty as a storage space and a flag.
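The three steps above can be sketched as a small handler. This is not the book's Ex. 6.10, just an illustration under assumptions: the class name DoubleCollector and the values list are made up, and StringBuilder stands in for StringBuffer (same append call, no synchronization).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the numeric content of every <double> element. */
class DoubleCollector extends DefaultHandler {
    // Non-null only while we are inside a <double> element: storage space and flag at once.
    private StringBuilder buffer;
    final List<Double> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("double".equals(qName)) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] text, int start, int length) {
        // May be called several times for one text node, so append rather than assign.
        if (buffer != null) {
            buffer.append(text, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("double".equals(qName)) {
            values.add(Double.valueOf(buffer.toString().trim()));
            buffer = null;
        }
    }

    static List<Double> collect(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DoubleCollector collector = new DoubleCollector();
        parser.setContentHandler(collector);
        parser.parse(new InputSource(new StringReader(xml)));
        return collector.values;
    }
}
```

The sketch assumes <double> elements are never nested inside each other; a nested case would need a stack of buffers instead of a single field.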

We can skip the processing instruction and namespace sections of Chapter 6

What the ContentHandler Doesn't Tell You (page 303)

The ContentHandler interface gives you most of the information you really need from a document. Some of its omissions are handled by other callback interfaces

Some of the things the ContentHandler does not deal with but are available through other interfaces

- comments, unskipped entities and CDATA sections, all of which are available through the LexicalHandler interface

Note: the text of CDATA sections is still delivered by characters(). The LexicalHandler callbacks startCDATA() and endCDATA() additionally let you know which text came from CDATA sections.

- ELEMENT, ATTLIST and parsed ENTITY declarations from the DTD, all of which are reported through the DeclHandler interface (DTDHandler itself only reports notation and unparsed entity declarations)

- validity errors and other nonfatal errors which are reported through the ErrorHandler interface
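As an illustration of the LexicalHandler item in the list above, here is a minimal sketch. It assumes the JDK parser and uses org.xml.sax.ext.DefaultHandler2, a convenience class that implements LexicalHandler with do-nothing methods; the class name LexicalSpy is hypothetical. A LexicalHandler is registered through a property, not through a setter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Records comments and the text that arrived inside CDATA sections. */
class LexicalSpy extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();
    final StringBuilder cdataText = new StringBuilder();
    private boolean inCData = false;

    @Override
    public void startCDATA() { inCData = true; }

    @Override
    public void endCDATA() { inCData = false; }

    @Override
    public void characters(char[] ch, int start, int length) {
        // CDATA text arrives through the ordinary characters() callback.
        if (inCData) {
            cdataText.append(ch, start, length);
        }
    }

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    static LexicalSpy inspect(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LexicalSpy spy = new LexicalSpy();
        parser.setContentHandler(spy);
        // The LexicalHandler is set as a SAX property.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", spy);
        parser.parse(new InputSource(new StringReader(xml)));
        return spy;
    }
}
```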

Some of the things the ContentHandler does not deal with, and that are not dealt with by any part of SAX (i.e., SAX 2 = book version = Java 6 version)

- the version, encoding attribute from the XML declaration

- insignificant white space in tags and before and after the root element

- the order of attributes

- the type of quotes that surround attributes

- character references

- prenormalized attribute values (see p. 280 on normalization)

- whether an attribute was specified in the instance document or defaulted from a DTD or schema (see more below)

- whether empty elements are represented as <name></name> or <name />

- What XML Schema type goes with various attribute values and element contents (not mentioned in Harold)

The only common use case for most of the above information would be an XML editor.

Correction: The last point was true until Java 5, when a new schema-handling package was added to the JDK related to the newer DOM release known as DOM3. This new schema support was also made to fit in with SAX. See http://www.ibm.com/developerworks/library/x-javaxmlvalidapi/

See validate-ns/TypeLister.java for a program from that article, slightly modified to work with our examples. It prints out XML schema type names where available for both element content and attribute values (in my version). Anonymous types don’t have type names, so for example, book of book5.xml gets output as

b:book: typename #AnonType_book, typename ns: http://schemas.cs.umb.edu/book.

Example of an attribute default, using DTD, pg 36:

rate CDATA "0.0"

sets a default of “0.0” for rate, and type CDATA.

The corresponding XML Schema is on pg.39:

<xsd:attribute name="rate" type="xsd:decimal"/>

setting no default, but we can add a default easily, to be

<xsd:attribute name="rate" default="0.0" type="xsd:decimal"/>.

However, the type is still reported as CDATA via Attributes, i.e., it's mapped into the DTD type system.

Chap. 7: more on SAX: mainly additional exceptions, features (both needed to capture validation problems)

Handling the existence of multiple parsers--we just use JDK's

Input for parsing: by string URL or "InputSource"…

An InputSource object is used by both parse() methods of a parser: one takes it explicitly, the other creates it under the covers when given a string.

Note: this class has a “messy” API: obscure rules about what you can call or should call. Not great software engineering.

Example 7.1, pg. 310. The SAX InputSource class (there are missing semi-colons in book interface def.)

package org.xml.sax;

public class InputSource

{

public InputSource();

public InputSource(String systemID);

public InputSource(InputStream byteStream);

public InputSource(Reader characterStream);

public void setPublicId(String publicID);

public String getPublicId();

public void setSystemId(String systemID);

public String getSystemId();

public void setByteStream(InputStream byteStream);

public InputStream getByteStream();

public void setEncoding(String encoding);

public String getEncoding();

public void setCharacterStream(Reader characterStream);

public Reader getCharacterStream();

}

A SystemID is a URL, and should be set to resolve relative URLs such as in DOCTYPEs or XLink hrefs.

An InputSource can be constructed from a string ("System ID" i.e. URL), an InputStream, or a Reader

InputStream is at the top of the byte input hierarchy of Java i/o.

Reader is at the top of the Unicode text input hierarchy of Java i/o.

Rules for InputSource use:

1. Don't set both the bytestream and Unicode text input objects on an InputSource object: it uses one or the other for actual input data.

2. Do set the System ID if what was passed to the constructor was a Unicode or byte stream object, at least if your XML has any relative URLs in it.

If an InputSource has both an InputStream and a System ID, the InputStream is used for the input XML data and the System ID is used to resolve relative URLs. Similarly if an InputSource has both a Reader and a System ID.

Relative URLs in XML

For example, the beginning of book1.xml:

<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE book SYSTEM "book.dtd">

<book>

A parser would not be able to find the dtd, which is given as a relative URL, without the System ID information, though the parser could still parse the document.

Another example: schemaLocation="book.xsd".

Another way of dealing with the above problem is to use an absolute URL for the SYSTEM value, such as:

http://www.umb.edu/cs639/validate/book.dtd
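Returning to rule 2 above (set the System ID when the data comes from a stream or Reader), here is a minimal sketch. Everything in it is made up for illustration: a temporary directory holds a tiny book.dtd that defaults a lang attribute, the document itself is only a string, and the System ID says where the document "lives" so the relative "book.dtd" can be resolved. It assumes the JDK parser with its default settings, which load external DTDs.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SystemIdDemo {
    /** Parses a document from a Reader; returns the lang attribute defaulted by the DTD. */
    static String parseWithSystemId() throws Exception {
        // Hypothetical setup: a temp directory containing only the DTD.
        Path dir = Files.createTempDirectory("sysid-demo");
        Path dtdFile = dir.resolve("book.dtd");
        String dtd = "<!ELEMENT book (#PCDATA)>\n<!ATTLIST book lang CDATA 'en'>\n";
        Files.write(dtdFile, dtd.getBytes(StandardCharsets.UTF_8));
        dir.toFile().deleteOnExit();      // registered first, so deleted last
        dtdFile.toFile().deleteOnExit();

        String xml = "<?xml version=\"1.0\"?>\n"
                   + "<!DOCTYPE book SYSTEM \"book.dtd\">\n"
                   + "<book>text</book>";

        InputSource in = new InputSource(new StringReader(xml));
        // The Reader supplies the data; the System ID supplies the base URL.
        in.setSystemId(dir.resolve("book1.xml").toUri().toString());

        final String[] lang = new String[1];
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                lang[0] = atts.getValue("lang");   // present only if the DTD was found
            }
        });
        parser.parse(in);
        return lang[0];
    }
}
```

If the setSystemId call is left out, the parser has no base URL for "book.dtd" and will typically look for it relative to the current directory instead.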

Errors and Exceptions in SAX

There are three levels (pg 315)

- Fatal Error - a well-formedness error. As soon as the parser detects this type of error it must throw a SAXParseException and stop parsing

- Error - an error but not a well-formedness error. The most common variety is a validation error. If a parser detects one of these errors, it may or may not throw a SAXException and it may or may not continue parsing. In the case of a validation error, it will be able to proceed.

- Warning - not an error in itself, but it may indicate a mistake of some kind in the document

NOTE: Validity errors don’t cause a throw themselves. We’ll soon see how to get info on them.

SAXParseException is a subclass of SAXException. Previously we saw catch (SAXException e)…after the call to parse, but the actual thrown exception object is of the subclass type SAXParseException in the case of a well-formedness error. Of course this catch catches all subclasses, so it works fine to gain control.

The only kind of XML problem the parser is guaranteed to tell you about through an exception is a well-formedness error

To be informed of other problems, you need to implement the ErrorHandler interface and register this class with the XMLReader

Nested Exceptions

Example 7.4. The SAXException class

package org.xml.sax;

public class SAXException extends Exception

{

public SAXException();

public SAXException(String message);

public SAXException(Exception rootCause);

public SAXException(String message, Exception e);

public String getMessage();

public Exception getException();

public String toString();

}

The only kind of exception you can throw in a ContentHandler interface method is a SAXException, but this may not be the kind you want to throw

In this case you wrap the exception you want to throw inside a SAXException and throw the SAXException instead. This can be done because one of the constructors for a SAXException takes an exception for an argument. You can then use the getException() method of this object to retrieve the original exception

For example (pg. 317), if we wanted to throw an InvalidKeyException in a handler, the code that catches this exception might look like this:

catch (SAXException e) {

Exception rootCause = e.getException(); // or more currently: = e.getCause();

if (rootCause == null) {

// handle it as an XML problem...

} else {

if (rootCause instanceof InvalidKeyException) {

InvalidKeyException ike = (InvalidKeyException) rootCause;

throw ike;

} else if (rootCause instanceof SomeOtherException) {

SomeOtherException soe = (SomeOtherException) rootCause;

throw soe;

}

…

}

}

These days we would retrieve a nested exception with getCause() rather than getException(), but SAXException was defined before exception chaining became standard in Java. You can use either method to get it.

You can have many levels of nested exceptions if you keep catching and throwing the original exception
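Both halves of the wrapping idiom, throwing inside the handler and unwrapping after parse(), can be sketched as follows. This is an illustration, not the book's code: the class KeyChecker and the rule that every key element needs an id attribute are invented, and java.security.InvalidKeyException is used only because the notes mention that exception name.

```java
import java.io.StringReader;
import java.security.InvalidKeyException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class KeyChecker extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Hypothetical rule: every <key> element must carry an id attribute.
        if ("key".equals(qName) && atts.getValue("id") == null) {
            // Only SAXException may leave a callback, so wrap the exception we mean.
            throw new SAXException(new InvalidKeyException("key element has no id"));
        }
    }

    /** Parses xml; rethrows the handler's original exception if one was wrapped. */
    static void check(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new KeyChecker());
        try {
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            Exception rootCause = e.getException();   // getCause() would also work
            if (rootCause instanceof InvalidKeyException) {
                throw (InvalidKeyException) rootCause; // our own problem, unwrapped
            }
            throw e;                                   // a genuine XML problem
        }
    }
}
```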

The SAXParseException has valuable line # and column # information for parse problems with XML files

Example 7.5. The SAXParseException class

package org.xml.sax;

public class SAXParseException extends SAXException

{

public SAXParseException(String message, Locator locator);

public SAXParseException(String message, Locator locator, Exception e);

public SAXParseException(String message, String publicID,

String systemID, int lineNumber, int columnNumber);

public SAXParseException(String message, String publicID,

String systemID, int lineNumber, int columnNumber,

Exception e);

public String getPublicId();

public String getSystemId();

public int getLineNumber(); // <-- very useful info

public int getColumnNumber();

}

The line number and column numbers the parser reports may not always be perfectly accurate, but they are very useful anyway. Recall Counter reports with line and column number info.
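A minimal sketch of pulling the position out of a SAXParseException, assuming the JDK parser; the class name ParseErrorLocation and the "line:column" return format are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ParseErrorLocation {
    /** Returns "line:column" of the first fatal error, or null if xml is well-formed. */
    static String locate(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        try {
            parser.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Catch the subclass to reach the position accessors.
            return e.getLineNumber() + ":" + e.getColumnNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        // <b> is never closed, so the parser gives up at the </a> end tag.
        System.out.println(locate("<a>\n  <b>\n</a>"));
    }
}
```

Catching SAXParseException rather than SAXException is what gives access to getLineNumber() and getColumnNumber(); a plain catch (SAXException e) would need a cast.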

ErrorHandler Interface: needed for non-fatal error detection, including validation errors

Throwing an exception aborts the parsing process, but not all problems in an XML document necessarily require such a radical step

In particular, validity errors are not signaled by a SAXException exception being thrown, because that would stop parsing. Instead, the ErrorHandler allows us to look at the exception object as just a method argument, not a thrown exception.

If you want your program to be informed of nonfatal errors, then you must register an ErrorHandler object with the XMLReader

Example 7.7. The ErrorHandler interface

package org.xml.sax;

public interface ErrorHandler

{

public void warning(SAXParseException exception)

throws SAXException;

public void error(SAXParseException exception)

throws SAXException;

public void fatalError(SAXParseException exception)

throws SAXException;

}

Use parser.setErrorHandler(handler)--see code on pg. 323

[Diagram: XMLReader -> ErrorHandler]

The DefaultHandler is a convenience class that implements (see pg. 884) the four required SAX callback interfaces:

(they are required for the parser implementation, not required to be used by a program)

- EntityResolver

- DTDHandler

- ContentHandler

- ErrorHandler

We can use this class as the base object of our own handlers, so we don't have to provide do-nothing methods to satisfy the API.

Ex 7.8 on page 322 sets up an ErrorHandler, but as it stands, doesn't turn on validity checking, so we won't get a report of validity problems. (But it does report namespace prefix problems, as shown on pg. 324.)

Note that its fatalError method does not itself throw, but its caller does throw after fatalError returns, in order to fulfill the rule about fatal errors aborting parsing.

To get validation errors, we need to turn a "feature" on, as explained next. We could add this to Ex. 7.8 by adding

parser.setFeature("http://xml.org/sax/features/validation", true);

before parse is called. This turns on DTD validation. To turn on XML Schema validation in Xerces (for sure; maybe it's on by default (pg. 348)), also set:

parser.setFeature("http://apache.org/xml/features/validation/schema", true);

See Ex. 7.9, pg 333 showing the call for DTD validation.
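Putting the ErrorHandler and the validation feature together, here is a minimal sketch (not the book's Ex. 7.8 or 7.9). It assumes the JDK parser obtained via SAXParserFactory and a document with an internal DTD subset; the class name ValidityReporter and its fields are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Collects warnings and validity errors instead of letting them abort the parse. */
class ValidityReporter extends DefaultHandler {
    final List<String> problems = new ArrayList<>();
    boolean reachedEnd = false;

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning, line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // The exception is only a method argument here; nothing is thrown.
        problems.add("error, line " + e.getLineNumber() + ": " + e.getMessage());
    }

    // fatalError() is inherited from DefaultHandler, which throws the exception.

    @Override
    public void endDocument() {
        reachedEnd = true;
    }

    static ValidityReporter validate(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", true); // DTD validation
        ValidityReporter reporter = new ValidityReporter();
        parser.setContentHandler(reporter);
        parser.setErrorHandler(reporter);
        parser.parse(new InputSource(new StringReader(xml)));
        return reporter;
    }
}
```

For an invalid but well-formed document the parse should run to endDocument() with a non-empty problems list, which is the point of reporting validity errors through the handler.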

SAX Features (boolean on or off) and Properties (various string values)

Counter.java uses SAX to parse a document and check well-formedness. If given flag -v, it will do DTD validation, then -v -s will do schema validation.

So we can look into it and see how this is done.

Counter.java snippets:

parser = XMLReaderFactory.createXMLReader();

…

Here is the explicit validation setting:

if '-v', then parser.setFeature("http://xml.org/sax/features/validation", true); (name of validation feature) See pp. 325-326.

if '-s', then parser.setFeature("http://apache.org/xml/features/validation/schema", true); See pg. 348.

parser.setContentHandler(counter); (counter is a new Counter() object). So Counter IS-A ContentHandler.

All these happen in main.

We look at startElement now. (We just see fElements++; etc., i.e. just stats.)

As covered previously, validation is a feature that can be turned on. Example 7.9 on page 331 is basically the same as our Counter program. It sets up an ErrorHandler and uses the following line to enable DTD validation:

parser.setFeature("http://xml.org/sax/features/validation", true);

Validation is the most important feature. For XML Schema validation, we need to set an additional Xerces feature, as discussed on pg. 348.

Features can be turned on or off through setFeature() on the parser object, i.e., features are Boolean.

On the other hand, Properties are objects, not boolean variables

Both features and properties are named by absolute URIs.

Features and properties can be read-only, write-only or (rarely) read-write

Standard features that are supported by multiple parsers have names that begin with

http://xml.org/sax/features/

Different parsers also support non-standard, custom features. The names of these features begin with URLs somewhere in the parser vendor's domain

An interesting property is the xml-string, formally http://xml.org/sax/properties/xml-string. It contains the string of text that triggered the current SAX event. This can be used to echo most of the content of an XML document. See pg. 335.
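A small sketch of reading and setting features by their URI names, assuming the JDK parser. The class name FeatureProbe is hypothetical, and the http://example.com/... name in the usage note is deliberately bogus.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureProbe {
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    static XMLReader newParser() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Returns the feature's current value, or null if the parser does not know the name. */
    static Boolean query(XMLReader parser, String featureUri) throws SAXNotSupportedException {
        try {
            return parser.getFeature(featureUri);
        } catch (SAXNotRecognizedException e) {
            return null;   // this parser has no feature of that name
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader parser = newParser();
        System.out.println("validation before: " + query(parser, VALIDATION));
        parser.setFeature(VALIDATION, true);
        System.out.println("validation after:  " + query(parser, VALIDATION));
    }
}
```

Asking for a name the parser has never heard of, e.g. query(parser, "http://example.com/features/no-such-feature"), should give null here, because getFeature signals unknown names with SAXNotRecognizedException.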

We are skipping Chap 8 on SAX Filters, so on to Chap. 9 on the DOM

But first backtrack to Chap 5 for intro part.

Intro to DOM programming

Back to Chapter 5 for intro program. (Chapter 5 intros both SAX and DOM)

Ex. 5.5, Pg 235-237 -> First DOM Example, see tree of objects in use, similarity to TestXPath code.

NOTE: This program uses class DOMImplementationImpl (not to be confused with DOMImplementation), also XMLSerializer, not in JDK.

Telltale sign of non-JDKness: look at the imports of “org.apache.xerces.parsers”, vs on page 239 “javax.xml.parsers”.

So skip to pg. 239 for the second version, which does work under JDK. The Transformer part is further explained in pg. 484 as trivial “transformation”, i.e., no real change.

With DOM we can build XML representation in terms of objects as well as read XML files. And use XPath with the object tree.
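As a preview, here is a minimal sketch using only JDK classes (JAXP). It is not the book's Example 5.5 or 5.6; the class name DomSketch and the book/title element names are made up. It builds a tiny tree of objects, serializes it with the trivial (identity) transformation, and queries it with XPath.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomSketch {
    /** Builds <book><title>...</title></book> as a tree of DOM objects. */
    static Document build(String titleText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element book = doc.createElement("book");
        Element title = doc.createElement("title");
        title.appendChild(doc.createTextNode(titleText));
        book.appendChild(title);
        doc.appendChild(book);
        return doc;
    }

    /** Serializes the tree with an identity transformation, i.e. no real change. */
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Uses XPath on the object tree to fetch the title text. */
    static String titleOf(Document doc) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate("/book/title", doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = build("Processing XML with Java");
        System.out.println(serialize(doc));
        System.out.println(titleOf(doc));
    }
}
```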

Example 5.6 on page 239 is a program that uses JAXP, which is in the JDK. We'll look at it next time.