3/2/2006

CS639 Class 12

Midterm Tues. Mar. 27

Pa2 due Mar. 21, Wed. after break

Tomcat working at home? Everyone present said yes.

Working on reading XML, Chap. 6

Receiving characters: you need to append pieces of text together

Note: SAX may call characters() several times to deliver the contents of a single text node. We should collect the text into a StringBuffer or other container that is specific to a given part of the document tree.

The example on page 233 is OK because it writes out each chunk of characters as it arrives, which naturally appends each chunk to its predecessors.

Ex 6.10 on pages 285-287 is an example of this character buffering.

startElement: if element name is "double" set up a new StringBuffer

characters: if buffer != null

buffer.append(text, start, length)

endElement: if element name is "double", convert buffer to result and set buffer to null. The buffer serves double duty as storage space and as a flag.

We can skip the processing instruction and namespace sections of Chapter 6
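The buffering pattern above can be sketched as a small handler. This is only a sketch, not the book's Ex 6.10: the class name DoubleCollector and its parse helper are made up for illustration, it uses StringBuilder where the notes say StringBuffer, and it gets its parser through the JDK's SAXParserFactory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the value of every <double> element.
class DoubleCollector extends DefaultHandler {
    final List<Double> results = new ArrayList<>();

    // Non-null only while we are inside a <double> element,
    // so it is both the storage space and the "inside" flag.
    private StringBuilder buffer;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("double")) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] text, int start, int length) {
        // May be called several times for one text node, so always append.
        if (buffer != null) {
            buffer.append(text, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("double")) {
            results.add(Double.valueOf(buffer.toString().trim()));
            buffer = null;
        }
    }

    // Convenience helper (an assumption of this sketch): parse XML held in a string.
    static List<Double> parse(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DoubleCollector handler = new DoubleCollector();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.results;
    }
}
```

The conversion happens only in endElement, when we know all the pieces of text have been delivered.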

What the ContentHandler Doesn't Tell You (page 303)

The ContentHandler interface gives you most of the information you really need from a document. Some of its omissions are handled by other callback interfaces

Some of the things the ContentHandler does not deal with but are available through other interfaces

- comments, unskipped entities and CDATA sections, all of which are available through the LexicalHandler interface

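Note: the text of CDATA sections is delivered by characters(), as usual. The LexicalHandler's startCDATA() and endCDATA() callbacks additionally let you know which text came from CDATA sections.

- ELEMENT, ATTLIST and parsed ENTITY declarations from the DTD, all of which are reported through the DeclHandler interface (the DTDHandler interface reports only notation and unparsed entity declarations)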

- validity errors and other nonfatal errors which are reported through the ErrorHandler interface
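As a rough illustration of the LexicalHandler item above, the sketch below separates CDATA text from ordinary text and records comments. The class name CdataSpotter and its scan helper are hypothetical. It relies on DefaultHandler2 (in org.xml.sax.ext), which supplies do-nothing LexicalHandler methods, and on the standard lexical-handler property, which the JDK parser is assumed to support.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: DefaultHandler2 is both a ContentHandler and a LexicalHandler.
class CdataSpotter extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();
    final StringBuilder cdataText = new StringBuilder();
    final StringBuilder otherText = new StringBuilder();
    private boolean inCdata;

    // LexicalHandler callbacks
    @Override public void startCDATA() { inCdata = true; }
    @Override public void endCDATA() { inCdata = false; }
    @Override public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    // ContentHandler callback: the text itself still arrives here.
    @Override public void characters(char[] ch, int start, int length) {
        (inCdata ? cdataText : otherText).append(ch, start, length);
    }

    static CdataSpotter scan(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CdataSpotter handler = new CdataSpotter();
        parser.setContentHandler(handler);
        // A LexicalHandler is registered as a property, not through a setter.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```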

Some of the things the ContentHandler does not deal with, and that are not dealt with by any part of SAX (i.e., SAX 2 = book version = Java 6 version)

- the version and encoding attributes from the XML declaration

- insignificant white space in tags and before and after the root element

- the order of attributes

- the type of quotes that surround attributes

- character references

- prenormalized attribute values (see p. 280 on normalization)

- whether an attribute was specified in the instance document or defaulted from a DTD or schema (see more below)

- whether empty elements are represented as <name></name> or <name />

The only common use case for most of this information would be an XML editor. We will see an example that gets validation errors.

Example of an attribute default, using DTD, pg 36: rate CDATA "0.0" sets a default of “0.0” for rate.

The corresponding XML Schema is on pg. 39: <xsd:attribute name="rate" type="xsd:decimal"/> setting no default, but we can add a default easily, to be <xsd:attribute name="rate" default="0.0" type="xsd:decimal"/>.

Chap. 7: more on SAX: mainly additional exceptions, features (both needed to capture validation problems)

Handling the existence of multiple parsers--we just use JDK's

Input for parsing: by string URL or "InputSource"…

An InputSource object is used by both parse methods of a parser: one takes it explicitly, and the other creates it under the covers when given a string.

Note: this class has a “messy” API: obscure rules about what you can call or should call. Not great software engineering.

Example 7.1, pg. 310. The SAX InputSource class (there are missing semi-colons in book interface def.)

package org.xml.sax;

public class InputSource

{

public InputSource();

public InputSource(String systemID);

public InputSource(InputStream byteStream);

public InputSource(Reader characterStream);

public void setPublicId(String publicID);

public String getPublicId();

public void setSystemId(String systemID);

public String getSystemId();

public void setByteStream(InputStream byteStream);

public InputStream getByteStream();

public void setEncoding(String encoding);

public String getEncoding();

public void setCharacterStream(Reader characterStream);

public Reader getCharacterStream();

}

A SystemID is a URL, and should be set to resolve relative URLs such as in DOCTYPEs or XLink hrefs.

An InputSource can be constructed from a string ("System ID" i.e. URL), an InputStream, or a Reader

InputStream is at the top of the byte input hierarchy of Java i/o.

Reader is at the top of the Unicode text input hierarchy of Java i/o.

Rules for InputSource use:

1. Don't set both the bytestream and Unicode text input objects on an InputSource object: it uses one or the other for actual input data.

2. Do set the System ID if what was passed to the constructor was a Unicode or byte stream object, at least if your XML has any relative URLs in it.

If an InputSource has both an InputStream and a System ID, the InputStream is used for the input XML data and the System ID is used to resolve relative URLs. The same holds if an InputSource has both a Reader and a System ID.
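Rule 2 can be tried out with a small sketch. Assumptions: the class name RelativeDtdDemo, the helper name, and any example.com base URL passed to it are made up, and an EntityResolver stands in for the real DTD so that nothing is fetched over the network. The XML comes from a Reader, and the System ID is only used as the base for resolving the relative DTD reference.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class RelativeDtdDemo {
    // Returns the system ID the parser asked for when it went looking for the DTD.
    static String resolvedDtdUrl(String xml, String baseUrl) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String[] seen = new String[1];

        // Record what the parser requests and hand back an empty stand-in DTD.
        parser.setEntityResolver((publicId, systemId) -> {
            seen[0] = systemId;
            return new InputSource(new StringReader(""));
        });

        InputSource in = new InputSource(new StringReader(xml)); // data comes from the Reader
        in.setSystemId(baseUrl);                                 // base for relative URLs
        parser.parse(in);
        return seen[0];
    }
}
```

With a document that starts with a DOCTYPE naming "book.dtd" and a base URL ending in /docs/book1.xml, the resolver should be asked for the DTD in the same /docs/ directory.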

Relative URLs in XML

For example, the beginning of book1.xml:

<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE book SYSTEM "book.dtd">

<book>

A parser would not be able to find the dtd, which is given as a relative URL, without the System ID information, though the parser could still parse the document.

Another example: schemaLocation="book.xsd".

Another way of dealing with the above problem is to use an absolute URL for the SYSTEM value, such as:

http://www.umb.edu/cs639/validate/book.dtd

Errors and Exceptions in SAX

There are three levels (pg 315)

- Fatal Error - a well-formedness error. As soon as the parser detects this type of error it must throw a SAXParseException and stop parsing.

- Error - an error but not a well-formedness error. The most common variety is a validation error. If a parser detects one of these errors, it may or may not throw a SAXException and it may or may not continue parsing. In the case of a validation error, it will be able to proceed.

- Warning - not an error in itself, but it may indicate a mistake of some kind in the document

NOTE: Validity errors don’t cause a throw themselves. We’ll soon see how to get info on them.

SAXParseException is a subclass of SAXException. Previously we saw catch (SAXException e)…after the call to parse, but the actual thrown exception object is of the subclass type SAXParseException in the case of a well-formedness error. Of course this catch catches all subclasses, so it works fine to gain control.

The only kind of XML problem the parser is guaranteed to tell you about through an exception is a well-formedness error.

To be informed of other problems, you need to implement the ErrorHandler interface and register this class with the XMLReader

Nested Exceptions

Example 7.4. The SAXException class

package org.xml.sax;

public class SAXException extends Exception

{

public SAXException();

public SAXException(String message);

public SAXException(Exception rootCause);

public SAXException(String message, Exception e);

public String getMessage();

public Exception getException();

public String toString();

}

The only kind of checked exception you can throw in a ContentHandler interface method is a SAXException, but this may not be the kind you want to throw.

In this case you wrap the exception you want to throw inside a SAXException and throw the SAXException instead. This can be done because one of the constructors for a SAXException takes an exception as an argument. You can then use the getException() method of this object to retrieve the original exception.

For example (pg. 317), if we wanted to throw an InvalidKeyException in a handler, the code that catches this exception might look like this:

catch (SAXException e) {

Exception rootCause = e.getException(); // or more currently: = e.getCause();

if (rootCause == null) {

// handle it as an XML problem...

} else {

if (rootCause instanceof InvalidKeyException) {

InvalidKeyException ike = (InvalidKeyException) rootCause;

throw ike;

} else if (rootCause instanceof SomeOtherException) {

SomeOtherException soe = (SomeOtherException) rootCause;

throw soe;

}

…

}

}

These days we would retrieve a nested exception with getCause() rather than getException(), but SAXException was defined before getCause() became standard. You can use either method to get it.

You can have many levels of nested exceptions if you keep catching an exception and rethrowing it wrapped inside a new one.
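Here is a minimal end-to-end sketch of the wrapping and unwrapping described above. The class name NestedExceptionDemo and the trigger element <badkey> are invented for illustration; only the SAXException(Exception) constructor and getException() come from the SAX API.

```java
import java.io.StringReader;
import java.security.InvalidKeyException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NestedExceptionDemo {
    static void parse(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                // Hypothetical trigger: a <badkey> element stands for a key problem.
                if (qName.equals("badkey")) {
                    // The callback may only throw SAXException, so wrap the real one.
                    throw new SAXException(new InvalidKeyException("key rejected"));
                }
            }
        });
        try {
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            Exception rootCause = e.getException(); // e.getCause() works too
            if (rootCause instanceof InvalidKeyException) {
                throw (InvalidKeyException) rootCause; // unwrap and rethrow the original
            }
            throw e; // a genuine XML problem
        }
    }
}
```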

The SAXParseException has valuable line # and column # information for parse problems with XML files

Example 7.5. The SAXParseException class

package org.xml.sax;

public class SAXParseException extends SAXException

{

public SAXParseException(String message, Locator locator);

public SAXParseException(String message, Locator locator, Exception e);

public SAXParseException(String message, String publicID,

String systemID, int lineNumber, int columnNumber);

public SAXParseException(String message, String publicID,

String systemID, int lineNumber, int columnNumber,

Exception e);

public String getPublicId();

public String getSystemId();

public int getLineNumber(); // very useful info

public int getColumnNumber();

}

The line number and column numbers the parser reports may not always be perfectly accurate, but they are very useful anyway. Recall Counter reports with line and column number info.
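A small sketch of pulling the location out of a SAXParseException. The class name ParseErrorLocation and its helper are hypothetical, and the exact line and column reported depend on the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParseErrorLocation {
    // Returns the line of the first well-formedness error, or -1 if there is none.
    static int firstErrorLine(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // DefaultHandler.fatalError() simply rethrows; registering it also keeps
        // the JDK parser from printing its own message to System.err.
        parser.setErrorHandler(new DefaultHandler());
        try {
            parser.parse(new InputSource(new StringReader(xml)));
            return -1;
        } catch (SAXParseException e) {
            System.out.println("line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage());
            return e.getLineNumber();
        }
    }
}
```

For example, firstErrorLine("<a>\n<b>\n</a>") should point at the third line, where the mismatched end tag is found.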

ErrorHandler Interface: needed for non-fatal error detection, including validation errors

Throwing an exception aborts the parsing process, but not all problems in an XML document necessarily require such a radical step

In particular, validity errors are not signaled by a SAXException exception being thrown, because that would stop parsing. Instead, the ErrorHandler allows us to look at the exception object as just a method argument, not a thrown exception.

If you want your program to be informed of nonfatal errors, then you must register an ErrorHandler object with the XMLReader

Example 7.7. The ErrorHandler interface

package org.xml.sax;

public interface ErrorHandler

{

public void warning(SAXParseException exception)

throws SAXException;

public void error(SAXParseException exception)

throws SAXException;

public void fatalError(SAXParseException exception)

throws SAXException;

}

Use parser.setErrorHandler(handler)--see code on pg. 323

XMLReader --> ErrorHandler

The DefaultHandler is a convenience class that implements (see pg. 884) the four required SAX callback interfaces:

(they are required for the parser implementation, not required to be used by a program)

- EntityResolver

- DTDHandler

- ContentHandler

- ErrorHandler

We can use this class as the base object of our own handlers, so we don't have to provide do-nothing methods to satisfy the API.

Ex 7.8 on page 322 sets up an ErrorHandler, but as it stands, it doesn't turn on validity checking, so we won't get a report of validity problems. (It does, however, report namespace prefix problems, as shown on pg. 324.)

Note that its fatalError method does not itself throw, but its caller does throw after fatalError returns, in order to fulfill the rule about fatal errors aborting parsing.

To get validation errors, we need to turn a "feature" on, as explained next. We could add this to Ex. 7.8 by adding

parser.setFeature("http://xml.org/sax/features/validation", true);

before parse is called. This turns on DTD validation. To make sure XML Schema validation is turned on as well (it may be on by default, see pg. 348), also set the Xerces schema feature:

parser.setFeature("http://apache.org/xml/features/validation/schema", true);
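To see validity errors arrive as ErrorHandler calls rather than as thrown exceptions, here is a sketch with DTD validation turned on. The class name ValidityErrorCollector and its helper are hypothetical, and the document is assumed to carry its DTD in an internal subset so that no files are needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers nonfatal problems instead of throwing.
class ValidityErrorCollector extends DefaultHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // Validity errors land here; returning normally lets parsing continue.
        problems.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    // fatalError() is inherited from DefaultHandler, which rethrows the exception.

    static List<String> validate(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", true); // DTD validation
        ValidityErrorCollector handler = new ValidityErrorCollector();
        parser.setErrorHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.problems;
    }
}
```

A well-formed but invalid document parses to the end, and the problems list tells us what was wrong.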

SAX Features (boolean on or off) and Properties (various string values)

Counter.java uses SAX to parse a document and check well-formedness. If given the flag -v, it will do DTD validation, and -v -s will do schema validation.

So we can look into it and see how this is done.

Counter.java snippets:

parser = XMLReaderFactory.createXMLReader();

…

Here is the explicit validation setting:

if '-v', then parser.setFeature("http://xml.org/sax/features/validation", true); (this is the name of the validation feature). See pp. 325-326.

if '-s', then parser.setFeature("http://apache.org/xml/features/validation/schema", true); (a Xerces-specific feature). See pg. 348.

parser.setContentHandler(counter); (counter is a new Counter object, so Counter IS-A ContentHandler.)

All these happen in main.

We look at startElement now. (We see fElements++; etc., which just collects statistics.)

As covered previously, validation is a feature that can be turned on. Example 7.9 on page 331 is basically the same as our Counter program. It sets up an ErrorHandler and uses the following line to enable DTD validation:

parser.setFeature("http://xml.org/sax/features/validation", true);

Validation is the most important feature. For XML Schema validation, we need to set an additional Xerces feature, as discussed on pg. 348.

Features can be turned on or off through setFeature() on the parser object, i.e., features are Boolean.

On the other hand, Properties are objects, not boolean variables

Both features and properties are named by absolute URIs.

Features and properties can be read-only, write-only (rarely), or read-write.

Standard features that are supported by multiple parsers have names that begin with

http://xml.org/sax/features/

Different parsers also support non-standard, custom features. The names of these features begin with URLs somewhere in the parser vendor's domain.
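Features can be read back as well as set. The sketch below probes a feature by name and shows what happens with a name the parser does not know. The class name FeatureProbe is hypothetical, and the unknown URI in the usage note is deliberately made up.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureProbe {
    static XMLReader newParser() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    // Describes the current state of a feature without letting the exceptions escape.
    static String probe(XMLReader parser, String name) {
        try {
            return name + " = " + parser.getFeature(name);
        } catch (SAXNotRecognizedException e) {
            return name + " is not recognized";
        } catch (SAXNotSupportedException e) {
            return name + " is recognized, but its value cannot be read right now";
        }
    }
}
```

For example, after parser.setFeature("http://xml.org/sax/features/validation", true), probing that name should report true, while a made-up name such as http://example.com/no-such-feature should be reported as not recognized.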

An interesting property is the xml-string, formally http://xml.org/sax/properties/xml-string. It contains the string of text that triggered the current SAX event. This can be used to echo most of the content of an XML document. See pg. 335.

We are skipping Chap 8 on SAX Filters, so on to Chap. 9 on the DOM

But first backtrack to Chap 5 for intro part.

Intro to DOM programming

Back to Chapter 5 for intro program. (Chapter 5 intros both SAX and DOM)

Ex. 5.5, Pg 235-237 -> First DOM Example, see tree of objects in use, similarity to TestXPath code.

NOTE: This program uses class DOMImplementationImpl (not to be confused with DOMImplementation), also XMLSerializer, not in JDK.

Telltale sign of non-JDKness: look at the imports of “org.apache.xerces.parsers”, vs on page 239 “javax.xml.parsers”.

So skip to pg. 239 for the second version, which does work under the JDK. The Transformer part is further explained on pg. 484 as a trivial "transformation", i.e., no real change.

With DOM we can build XML representation in terms of objects as well as read XML files. And use XPath with the object tree.

Example 5.6 on page 239 is a program that uses JAXP, which is in the JDK. Here is the relevant code section:

try {

// Build the request document

DocumentBuilderFactory builderFactory

= DocumentBuilderFactory.newInstance();

DocumentBuilder builder

= builderFactory.newDocumentBuilder();

Document request = builder.newDocument();

...

}

Steps required to use DOM

1. get a factory object (builderFactory = DocumentBuilderFactory.newInstance())

2. get a document builder object (builder = builderFactory.newDocumentBuilder())

3. get a document object (builder.newDocument())

(Also, using code from pg. 496, we can use the builder object to get a DOMImplementation object, discussed later in this class)
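The three steps, plus the DOMImplementation lookup, fit in a few lines. This is a sketch with a hypothetical class name DomSetup; the JAXP calls are the ones named in the steps above.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;

class DomSetup {
    static Document emptyDocument() throws Exception {
        // 1. get a factory object
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        // 2. get a document builder object
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        // 3. get a document object (empty: it has no root element yet)
        return builder.newDocument();
    }

    static DOMImplementation implementation() throws Exception {
        // The same builder can also hand out a DOMImplementation.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.getDOMImplementation();
    }
}
```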

Here is the tree depicting the XML request message on page 139:

<methodCall>
    <methodName>
        [calculateFibonacci]
    <params>
        <param>
            <value>
                <int>
                    [23]

To create this in DOM we need to create a Document object, then create an Element object, which originally stands by itself. We then attach this Element object to the Document object by appending it as a child.

We keep coming back to the Document object to create objects which are then appended to their parent object. In this way we build up the tree
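The create-then-append pattern can be sketched as follows. This is not the book's Example 5.6: the class name FibonacciRequest and its two methods are made up, and the element names assume the request follows the usual XML-RPC layout (methodCall, methodName, params, param, value, int). The serialize method uses the trivial Transformer mentioned earlier to turn the tree back into text.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class FibonacciRequest {
    static Document build(String index) throws Exception {
        Document request = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        // Each node is created by the Document, then appended to its parent.
        Element methodCall = request.createElement("methodCall");
        request.appendChild(methodCall);

        Element methodName = request.createElement("methodName");
        methodName.appendChild(request.createTextNode("calculateFibonacci"));
        methodCall.appendChild(methodName);

        Element params = request.createElement("params");
        methodCall.appendChild(params);
        Element param = request.createElement("param");
        params.appendChild(param);
        Element value = request.createElement("value");
        param.appendChild(value);
        Element intElement = request.createElement("int");
        intElement.appendChild(request.createTextNode(index));
        value.appendChild(intElement);

        return request;
    }

    static String serialize(Document doc) throws Exception {
        // A Transformer with no stylesheet just copies its input to its output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Calling FibonacciRequest.serialize(FibonacciRequest.build("23")) gives the request as a string, with <int>23</int> nested inside value, param and params.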