CS639 Class 12

Midterm Tues. Mar. 27

Pa2 due Mar. 21, Wed. after break

Tomcat working at home?  Everyone present said yes.

 

Working on reading XML, Chap. 6

 

Receiving characters: you need to append pieces of text together

 

Note:      SAX may call characters() several times to deliver the contents of a single text node, so we should collect the text into a StringBuffer or other container that is specific to a given part of the document tree.

 

The example on page 233 is OK because it writes out each chunk of characters as it arrives, which naturally appends each chunk to its predecessors.

 

Ex 6.10 on pages 285-287 is an example of this character buffering.

 

       startElement: if element name is "double" set up a new StringBuffer

 

       characters: if buffer != null

              buffer.append(text, start, length)

 

       endElement: if element name is "double", convert buffer to result and set buffer to null. The buffer serves double duty as storage space and as a flag: it is non-null only while we are inside a "double" element.
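
The buffering outline above can be sketched as a small handler. This is only a sketch under assumptions, not the book's Ex 6.10: the class name DoubleCollector and the sample document are made up for illustration, a StringBuilder stands in for the StringBuffer, and the parser comes from the JDK's SAXParserFactory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of a <double> element across
// however many characters() calls the parser chooses to make.
class DoubleCollector extends DefaultHandler {
    private StringBuilder buffer;          // null means "not inside <double>"
    private double result = Double.NaN;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("double".equals(qName)) {
            buffer = new StringBuilder();  // start collecting
        }
    }

    @Override
    public void characters(char[] text, int start, int length) {
        if (buffer != null) {
            buffer.append(text, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("double".equals(qName)) {
            result = Double.parseDouble(buffer.toString().trim());
            buffer = null;                 // stop collecting
        }
    }

    // Parses the given XML text and returns the value found in <double>.
    static double parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DoubleCollector handler = new DoubleCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<value><double>3.14</double></value>"));
    }
}
```

The conversion happens only in endElement, once the parser has guaranteed that no more text for that element is coming.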

 

We can skip the processor instruction and namespace sections of Chapter 6

 

What the ContentHandler Doesn't Tell You (page 303)

           

The ContentHandler interface gives you most of the information you really need from a document. Some of its omissions are handled by other callback interfaces

 

Some of the things the ContentHandler does not deal with but are available through other interfaces

 

-           comments, unskipped entities and CDATA sections, all of which are available through the LexicalHandler interface

Note: the text of CDATA sections is still delivered by characters(). The LexicalHandler's startCDATA() and endCDATA() callbacks additionally let you know which text came from CDATA sections.

 

-           ELEMENT, ATTLIST and parsed ENTITY declarations from the DTD, all of which are reported through the DeclHandler interface (the DTDHandler interface reports only notations and unparsed entities)

 

-           validity errors and other nonfatal errors which are reported through the ErrorHandler interface
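
As an illustration of the first item in the list above, here is a minimal sketch of a handler that also implements LexicalHandler so it can tell CDATA text from ordinary text. The class name CdataWatcher and the sample document are assumptions for illustration; the property name http://xml.org/sax/properties/lexical-handler is the standard SAX way to register a LexicalHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: separates text that arrived inside CDATA sections
// from text that arrived as ordinary character data.
class CdataWatcher extends DefaultHandler implements LexicalHandler {
    private boolean inCdata = false;
    final StringBuilder cdataText = new StringBuilder();
    final StringBuilder otherText = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // All text arrives here; the flag says where it came from.
        (inCdata ? cdataText : otherText).append(ch, start, length);
    }

    @Override public void startCDATA() { inCdata = true; }
    @Override public void endCDATA()   { inCdata = false; }

    // The remaining LexicalHandler callbacks are not needed for this sketch.
    @Override public void comment(char[] ch, int start, int length) { }
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }

    static CdataWatcher parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CdataWatcher watcher = new CdataWatcher();
        reader.setContentHandler(watcher);
        // A LexicalHandler is registered as a property, not with a setter.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", watcher);
        reader.parse(new InputSource(new StringReader(xml)));
        return watcher;
    }

    public static void main(String[] args) throws Exception {
        CdataWatcher w = parse("<a>x<![CDATA[<y>]]>z</a>");
        System.out.println("CDATA text: " + w.cdataText + ", other text: " + w.otherText);
    }
}
```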

 

Some of the things the ContentHandler does not deal with, and that are not dealt with by any part of SAX (i.e., SAX 2 = book version = Java 6 version)

 

-           the version and encoding attributes from the XML declaration

 

-           insignificant white space in tags and before and after the root element

 

-           the order of attributes

 

-           the type of quotes that surround attributes

 

-           character references

 

-           prenormalized attribute values (see p. 280 on normalization)

 

-           whether an attribute was specified in the instance document or defaulted from a DTD or schema (see more below)

 

-           whether empty elements are represented as <name></name> or <name />

 

 

The only common use case for most of this information would be an XML editor. We will see an example that gets validation errors.

 

Example of an attribute default, using a DTD, pg. 36:   rate CDATA "0.0"     sets a default of "0.0" for rate.

The corresponding XML Schema is on pg. 39: <xsd:attribute name="rate" type="xsd:decimal"/> sets no default, but we can add one easily: <xsd:attribute name="rate" default="0.0" type="xsd:decimal"/>.
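
A minimal sketch of how a DTD default shows up in SAX: the handler sees rate="0.0" in the Attributes object even though the instance document never mentions it. The element name loan, the class name DefaultedAttribute and the inline DTD are assumptions for illustration; only the rate CDATA "0.0" declaration comes from the notes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the value of the rate attribute on <loan>.
class DefaultedAttribute extends DefaultHandler {
    private String rate;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("loan".equals(qName)) {
            rate = atts.getValue("rate");   // specified or defaulted, it looks the same here
        }
    }

    static String rateOf(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DefaultedAttribute handler = new DefaultedAttribute();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.rate;
    }

    public static void main(String[] args) throws Exception {
        // The DTD lives in the internal subset, so no external file is needed.
        String dtd = "<!DOCTYPE loan [<!ELEMENT loan EMPTY>"
                   + "<!ATTLIST loan rate CDATA \"0.0\">]>";
        System.out.println(rateOf(dtd + "<loan/>"));              // defaulted from the DTD
        System.out.println(rateOf(dtd + "<loan rate=\"5.5\"/>")); // specified in the document
    }
}
```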

 

 

Chap. 7: more on SAX: mainly additional exceptions, features (both needed to capture validation problems)

 

Handling the existence of multiple parsers--we just use JDK's

 

Input for parsing: by string URL or "InputSource"…

 

An InputSource object is used by both parse methods of a parser: one takes it explicitly, and the other creates it under the covers when given a string.

Note: this class has a “messy” API: obscure rules about what you can call or should call. Not great software engineering.

 

Example 7.1, pg. 310. The SAX InputSource class (there are missing semi-colons in book interface def.)

      package org.xml.sax;

 

      public class InputSource

      {

            public InputSource();

            public InputSource(String systemID);

            public InputSource(InputStream byteStream);

            public InputSource(Reader characterStream);

 

            public void             setPublicId(String publicID);

            public String           getPublicId();

            public void             setSystemId(String systemID);

            public String           getSystemId();

            public void             setByteStream(InputStream byteStream);

            public InputStream      getByteStream();

            public void             setEncoding(String encoding);

            public String           getEncoding();

            public void             setCharacterStream(Reader characterStream);

            public Reader           getCharacterStream();

      }

 

 

A SystemID is a URL, and should be set to resolve relative URLs such as in DOCTYPEs or XLink hrefs.

 

An InputSource can be constructed from a string ("System ID" i.e. URL), an InputStream, or a Reader

 

InputStream is at the top of the byte input hierarchy of Java i/o.

 

Reader is at the top of the Unicode text input hierarchy of Java i/o.

 

Rules for InputSource use:

1.      Don't set both the bytestream and Unicode text input objects on an InputSource object: it uses one or the other for actual input data.

 

2.      Do set the System ID if what was passed to the constructor was a Unicode or byte stream object, at least if your XML has any relative URLs in it.

 

If an InputSource has both an InputStream and a System ID, the InputStream is used for the input XML data and the System ID is used to resolve relative URLs.  The same holds if an InputSource has both a Reader and a System ID.
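
The two rules above can be sketched as follows: one stream object for the data, plus a System ID for resolving relative URLs. The class name InputSourceDemo and the example URL are assumptions for illustration.

```java
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.InputSource;

class InputSourceDemo {
    // Builds an InputSource that reads XML from the Reader (rule 1: only one
    // kind of stream is set) and carries a System ID (rule 2) so that a parser
    // can resolve relative URLs found in the document.
    static InputSource fromReader(Reader reader, String systemId) {
        InputSource source = new InputSource(reader);
        source.setSystemId(systemId);
        return source;
    }

    public static void main(String[] args) {
        InputSource source = fromReader(
                new StringReader("<book/>"),
                "http://www.example.com/books/book1.xml");  // hypothetical location
        System.out.println(source.getSystemId());
        System.out.println(source.getByteStream() == null); // no byte stream was set
    }
}
```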

 

Relative URLs in XML

For example, the beginning of book1.xml:

 

      <?xml version="1.0" encoding="ISO-8859-1"?>

      <!DOCTYPE book SYSTEM "book.dtd">

      <book>

 

A parser would not be able to find the DTD, which is given as a relative URL, without the System ID information, though the parser could still parse the document.

Another example: schemaLocation="book.xsd".

 

Another way of dealing with the above problem is to use an absolute URL for the SYSTEM value, such as:

 

                        http://www.umb.edu/cs639/validate/book.dtd
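
To see what "resolving a relative URL against the System ID" means, here is a small sketch using java.net.URI. It only imitates the kind of resolution a parser performs internally. The location of book1.xml is an assumption, chosen so that the result matches the absolute URL above.

```java
import java.net.URI;

class RelativeUrlDemo {
    // Resolves a relative reference (such as the SYSTEM value in a DOCTYPE)
    // against the System ID of the document that contains it.
    static String resolve(String systemId, String relative) {
        return URI.create(systemId).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Hypothetical System ID for book1.xml
        String systemId = "http://www.umb.edu/cs639/validate/book1.xml";
        System.out.println(resolve(systemId, "book.dtd"));
        // prints http://www.umb.edu/cs639/validate/book.dtd
    }
}
```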

                       

 

Errors and Exceptions in SAX

 

There are three levels (pg 315)

 

-           Fatal Error - a well-formedness error. As soon as the parser detects this type of error, it must throw a SAXParseException and stop parsing

 

-           Error - an error, but not a well-formedness error. The most common variety is a validation error. If a parser detects one of these errors, it may or may not throw a SAXException and it may or may not continue parsing. In the case of a validation error, it will be able to proceed.

 

-           Warning - not an error in itself, but it may indicate a mistake of some kind in the document

 

NOTE: Validity errors don’t cause a throw themselves. We’ll soon see how to get info on them.

 

SAXParseException is a subclass of SAXException. Previously we saw catch (SAXException e)…after the call to parse, but the actual thrown exception object is of the subclass type SAXParseException in the case of a well-formedness error.  Of course this catch catches all subclasses, so it works fine to gain control.

 

The only kind of XML problem the parser is guaranteed to tell you about through an exception is a well-formedness error

 

To be informed of other problems, you need to implement the ErrorHandler interface and register an instance of that class with the XMLReader

 

Nested Exceptions

 

Example 7.4. The SAXException class

      package org.xml.sax;

 

      public class SAXException extends Exception

      {

 

            public SAXException();

            public SAXException(String message);

            public SAXException(Exception rootCause);

            public SAXException(String message, Exception e);

 

            public String        getMessage();

            public Exception getException();

            public String        toString();

      }

 

 

 

The only kind of exception you can throw in a ContentHandler interface method is a SAXException, but this may not be the kind you want to throw

 

In this case you wrap the exception you want to throw inside a SAXException and throw the SAXException instead. This can be done because one of the constructors for a SAXException takes an exception as an argument. You can then use the getException() method of this object to retrieve the original exception

 

For example (pg. 317), if we wanted to throw an InvalidKeyException in a handler, the code that catches this exception might look like this:

 

      catch (SAXException e) {

            Exception rootCause = e.getException();        // or more currently: = e.getCause();

            if (rootCause == null) {

                  // handle it as an XML problem...

            } else {

                  if (rootCause instanceof InvalidKeyException) {

                        InvalidKeyException ike = (InvalidKeyException) rootCause;

                        throw ike;

                  } else if (rootCause instanceof SomeOtherException) {

                              SomeOtherException soe = (SomeOtherException) rootCause;

                              throw soe; 

                  }

                 

            }

      }

 

These days we would retrieve a nested exception with getCause() rather than getException(), but SAXException was defined before getCause() became standard. You can use either method to get it.
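
The throwing side of this pattern can be sketched as follows. The class name NestedExceptionDemo, the element name badKey and the exception message are assumptions for illustration; the point is only that the handler wraps its own exception in a SAXException and the caller unwraps it again.

```java
import java.io.StringReader;
import java.security.InvalidKeyException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NestedExceptionDemo extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("badKey".equals(qName)) {
            // startElement may only throw SAXException, so wrap the real problem.
            throw new SAXException(new InvalidKeyException("hypothetical key failure"));
        }
    }

    // Returns the exception wrapped by the handler, or null if parsing succeeded
    // or failed for a purely XML reason.
    static Exception rootCauseOf(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new NestedExceptionDemo());
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getException();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootCauseOf("<doc><badKey/></doc>"));
        System.out.println(rootCauseOf("<doc/>"));
    }
}
```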

 

You can have many levels of nested exceptions if each layer keeps catching an exception and wrapping it in a new one

 

The SAXParseException has valuable line # and column # information for parse problems with XML files

 

            Example 7.5. The SAXParseException class

      package org.xml.sax;

 

      public class SAXParseException extends SAXException

      {

   

            public SAXParseException(String message, Locator locator);

            public SAXParseException(String message, Locator locator, Exception e);

            public SAXParseException(String message, String publicID,

                                                String systemID, int lineNumber, int columnNumber);

            public SAXParseException(String message, String publicID,

                                                String systemID, int lineNumber, int columnNumber,

                                                Exception e);

  

            public String     getPublicId();

            public String     getSystemId();

            public int        getLineNumber();            // <-- very useful info

            public int        getColumnNumber();

      }

 

The line number and column numbers the parser reports may not always be perfectly accurate, but they are very useful anyway. Recall Counter reports with line and column number info.
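
A minimal sketch of using that position information when a well-formedness error is thrown. The class name WellFormednessCheck and the sample documents are assumptions for illustration; the parser is the JDK's, obtained through SAXParserFactory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class WellFormednessCheck {
    // Returns the line number of the first well-formedness error,
    // or -1 if the document parses cleanly.
    static int firstErrorLine(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return -1;
        } catch (SAXParseException e) {
            // The subclass carries the position; a plain SAXException would not.
            System.err.println("line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage());
            return e.getLineNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstErrorLine("<a><b/></a>"));  // well-formed
        System.out.println(firstErrorLine("<a><b></a>"));   // <b> is never closed
    }
}
```

Catching SAXParseException before (or instead of) SAXException is what gives access to getLineNumber() and getColumnNumber().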

 

ErrorHandler Interface: needed for non-fatal error detection, including validation errors

 

Throwing an exception aborts the parsing process, but not all problems in an XML document necessarily require such a radical step

 

In particular, validity errors are not signaled by throwing a SAXException, because that would stop parsing. Instead, the ErrorHandler lets us look at the exception object as an ordinary method argument, not a thrown exception.

 

If you want your program to be informed of nonfatal errors, then you must register an ErrorHandler object with the XMLReader

 

Example 7.7. The ErrorHandler interface

      package org.xml.sax;

 

      public interface ErrorHandler

      {

 

            public void warning(SAXParseException exception)

                  throws SAXException;

            public void error(SAXParseException exception)

                  throws SAXException;

            public void fatalError(SAXParseException exception)

                  throws SAXException;

      }

 

Use parser.setErrorHandler(handler), where parser is an XMLReader and handler is an ErrorHandler--see code on pg. 323

 

The DefaultHandler is a convenience class that implements (see pg. 884) the four required SAX callback interfaces (they are required of the parser implementation; a program is not required to use them):

                        -           EntityResolver

                        -           DTDHandler

                        -           ContentHandler

                        -           ErrorHandler

 

We can use this class as the base class of our own handlers, so we don't have to provide do-nothing methods to satisfy the API.

 

Ex 7.8 on page 322 sets up an ErrorHandler but, as it stands, doesn't turn on validity checking, so we won't get a report of validity problems. (It does report namespace prefix problems, as shown on pg. 324.)

Note that its fatalError method does not itself throw; its caller throws after fatalError returns, in order to fulfill the rule about fatal errors aborting parsing.

 

To get validation errors, we need to turn a “feature” on, as explained next.  We could add this to Ex. 7.8 by adding

parser.setFeature("http://xml.org/sax/features/validation", true); 

before parse is called.  This turns on DTD validation.  To make sure XML Schema validation is turned on as well (it may be on by default, pg. 348), also set the Xerces schema feature:

parser.setFeature("http://apache.org/xml/features/validation/schema", true); 
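
Putting the ErrorHandler and the validation feature together, a minimal sketch might look like this. It is not Ex. 7.8 itself: the class name ValidityReporter and the inline DTD are assumptions for illustration, and only DTD validation is turned on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical ErrorHandler: records problems instead of throwing,
// so parsing continues past validity errors.
class ValidityReporter implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    private static String where(SAXParseException e) {
        return e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage();
    }

    @Override public void warning(SAXParseException e)    { problems.add("warning " + where(e)); }
    @Override public void error(SAXParseException e)      { problems.add("error " + where(e)); }
    // Does not throw; the parser itself throws after this method returns.
    @Override public void fatalError(SAXParseException e) { problems.add("fatal " + where(e)); }

    static List<String> validate(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", true); // DTD validation
        ValidityReporter reporter = new ValidityReporter();
        parser.setErrorHandler(reporter);
        parser.parse(new InputSource(new StringReader(xml)));
        return reporter.problems;
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE a [<!ELEMENT a EMPTY>]>";
        System.out.println(validate(dtd + "<a/>"));          // valid
        System.out.println(validate(dtd + "<a><b/></a>"));   // well-formed but invalid
    }
}
```

The second document is parsed to the end even though it is invalid; the problems arrive as arguments to error(), not as thrown exceptions.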

 

SAX Features (boolean on or off) and Properties (various string values)

 

 

Counter.java uses SAX to parse a document and check well-formedness. Given the flag -v, it also does DTD validation; given -v -s, it does schema validation.

So we can look into it and see how this is done.

 

Counter.java snippets:

parser = XMLReaderFactory.createXMLReader();

Here is the explicit validation setting:

if '-v', then parser.setFeature("http://xml.org/sax/features/validation", true). (name of validation feature)  See pp. 325-326.

if '-s', then parser.setFeature("http://apache.org/xml/features/validation/schema", true); See pg. 348.

 

parser.setContentHandler(counter); (counter is a new Counter() object). So Counter IS-A ContentHandler.

All these happen in main.

 

We look at startElement now (we see fElements++; etc., which just gathers statistics).

 

 

As covered previously, validation is a feature that can be turned on. Example 7.9 on page 331 is basically the same as our Counter program. It sets up an ErrorHandler and uses the following line to enable DTD validation:

 

                        parser.setFeature("http://xml.org/sax/features/validation", true);

 

Validation is the most important feature.  For XML Schema validation, we need to set an additional Xerces feature, as discussed on pg. 348.

 

Features can be turned on or off through setFeature() on the parser object, i.e., features are Boolean.

 

On the other hand, Properties are objects, not boolean variables      

 

Both features and properties are named by absolute URIs.

 

Features and properties can be read-only, write-only or (rarely) read-write

 

Standard features that are supported by multiple parsers have names that begin with

 

                        http://xml.org/sax/features/

 

Different parsers also support non-standard, custom features. The names of these features begin with URLs somewhere in the parser vendor's domain

 

An interesting property is the xml-string, formally http://xml.org/sax/properties/xml-string. It contains the string of text that triggered the current SAX event. This can be used to echo most of the content of an XML document. See pg. 335.
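
A small sketch of querying features by URI, including what happens when a parser does not know the name at all. The class name FeatureProbe and the made-up example.com feature URI are assumptions for illustration.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureProbe {
    // Reports the current value of a feature, or why it cannot be read.
    static String probe(XMLReader reader, String featureUri) {
        try {
            return String.valueOf(reader.getFeature(featureUri));
        } catch (SAXNotRecognizedException e) {
            return "not recognized";   // the parser has never heard of this name
        } catch (SAXNotSupportedException e) {
            return "not supported";    // known name, but not readable right now
        }
    }

    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        System.out.println(probe(reader, "http://xml.org/sax/features/validation"));
        // Hypothetical feature name that no parser defines
        System.out.println(probe(reader, "http://www.example.com/features/no-such-feature"));
    }
}
```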

 

 

We are skipping Chap. 8 on SAX Filters, so on to Chap. 9 on the DOM

Intro to DOM  programming

But first, back to Chapter 5 for the intro program. (Chapter 5 introduces both SAX and DOM.)

 

Ex. 5.5, Pg 235-237 -> First DOM Example, see tree of objects in use, similarity to TestXPath code.

NOTE: This program uses class DOMImplementationImpl (not to be confused with DOMImplementation), also XMLSerializer, not in JDK.

Telltale sign of non-JDKness: look at the imports of “org.apache.xerces.parsers”, vs on page 239 “javax.xml.parsers”.

So skip to pg. 239 for the second version, which does work under JDK.  The Transformer part is further explained in pg. 484 as trivial “transformation”, i.e., no real change.

 

With DOM we can build XML representation in terms of objects as well as read XML files. And use XPath with the object tree.

 

Example 5.6 on page 239 is a program that uses JAXP, which is in the JDK. Here is the relevant code section:

 

            try {      

                  // Build the request document

                        DocumentBuilderFactory builderFactory

                              = DocumentBuilderFactory.newInstance();

                        DocumentBuilder builder

                              = builderFactory.newDocumentBuilder();

                        Document request = builder.newDocument();

                  ...

            }

 

Steps required to use DOM

 

            1.         get a factory object (builderFactory  = DocumentBuilderFactory.newInstance())

 

            2.         get a document builder object (builder = builderFactory.newDocumentBuilder())

 

            3.         get a document object (builder.newDocument())

 

(Also, using code from pg. 496, we can use the builder object to get a DOMImplementation object, discussed later in this class)

 

 

+          here is the tree depicting the XML request message on page 139

 

              <methodCall>
              /          \
     <methodName>      <params>
          |                |
 [calculateFibonacci]   <param>
                           |
                        <value>
                           |
                         <int>
                           |
                          [23]

 

 

           

 

To create this in DOM we need to create a Document object, then create an Element object, which originally stands by itself. We then attach this Element object to the Document object by appending it as a child.

 

We keep coming back to the Document object to create objects which are then appended to their parent object. In this way we build up the tree
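
The create-then-append pattern can be sketched with JAXP as follows. This is a sketch in the spirit of Example 5.6, not a copy of it: the class name FibonacciRequest and the method name build are assumptions for illustration, while the element names come from the tree above.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class FibonacciRequest {
    // Builds the <methodCall> tree for a calculateFibonacci request.
    static Document build(int index) throws ParserConfigurationException {
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document request = builder.newDocument();

        // Every node is created by the Document, then attached to its parent.
        Element methodCall = request.createElement("methodCall");
        request.appendChild(methodCall);

        Element methodName = request.createElement("methodName");
        methodName.appendChild(request.createTextNode("calculateFibonacci"));
        methodCall.appendChild(methodName);

        Element params = request.createElement("params");
        methodCall.appendChild(params);

        Element param = request.createElement("param");
        params.appendChild(param);

        Element value = request.createElement("value");
        param.appendChild(value);

        Element intElement = request.createElement("int");
        intElement.appendChild(request.createTextNode(String.valueOf(index)));
        value.appendChild(intElement);

        return request;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document request = build(23);
        System.out.println(request.getDocumentElement().getTagName());
    }
}
```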