Class 11

Hw 2 is due

Hopefully you’ve tried the tomcat install at home—problems? 

Note: there is another example servlet, servlet2, on the course website. This shows you how to access a local file, needed for pa2.

Good idea to try these out first on UNIX: see directions on forum entry.

Start from pa1b solution, unless your solution is really complete.

 

PA2-related Notes

Eclipse makes writing servlets easier, once we get it working--if you’re still having trouble with it, just use command line tools for now.  We need to use this approach on UNIX anyway.

 

Also note that in pa2 you don't need to use a DTD or Schema for the clients--SAX and XPath work fine without these

 

Servlet programming/debugging Basics

 

Your servlet code is running in the tomcat JVM, somewhat further away than you are probably used to.  What happens to standard output?

 

Good old println-to-System.out

We've set up tomcat to "swallow output" in conf/context.xml (the file to specify default context for webapps), meaning you can output to System.out and see the output in the appropriate tomcat log.

 

In servlet1, in doGet(), see "System.out.println("in doGet"); 

Run it “ant deploy” then “ant test1”

Find the output "in doGet", with timestamp, in logs/localhost.2008-03-6.log

If you are using tomcat run from eclipse, you'll see the output in the Console view, a combination log.

 

You could write HTML or XML to out, but you would lose it in cases of serious errors, so the log is a better way.

 

Later, when servlet development in eclipse is familiar, try out debugging.

 

 

XML Reading APIs - an overview (Chap. 5 surveys the parsers)

 

SAX - "Simple API for XML"

 

·         the gold standard for XML parsing

 

·         is a read only API

 

·         is an event driven API, this is, it uses callbacks

 

·         lightweight - it doesn't create Java objects on its own

 

·         the programmer has to create object if he or she needs them

 

·         this or StAX are the only ways to deal with huge XML documents that won't fit into memory

 

DOM - "Document Object Model"

 

·         the term DOM is used to refer to both the model and the API

 

·         turns XML into a tree of objects

 

·         can write XML, unlike SAX

 

·         resulting DOM tree can support XPath queries, unlike SAX and STAX

 

·         great for small or non-large documents

 

·         can create and then update a tree of objects

 

·         expensive in terms of memory and cpu cycles

 

JAXP - "Java API for XML Parsing"

 

·         is an envelope that manages SAX, DOM and XSLT

 

·         SAX and DOM are both contained in it

 

JDOM (we’re skipping this)

 

·         like DOM it creates a tree of objects

 

·         designed by Java people specifically for Java

 

·         less ugly than DOM, which must be language neutral, in detail, but has gaps which make it less general, especially in mixed content

 

StAX: Streaming XML Parser, newer than our text

 

·         Like SAX, it uses little memory, so is efficient

·         Considered easier to use than SAX: support an iterator on XML, rather than callbacks

·         Has an XML writer associated with it, unlike SAX.

 

 

From Sun’s Web services tutorial:

 

Table 3-1 XML Parser API Feature Summary

Feature

StAX

SAX

DOM

TrAX

API Type

Pull, streaming

Push, streaming

In memory tree

XSLT Rule

Ease of Use

High

Medium

High

Medium

XPath Capability

No

No

Yes

Yes

CPU and Memory Efficiency

Good

Good

Varies

Varies

Forward Only

Yes

Yes

No

No

Read XML

Yes

Yes

Yes

Yes

Write XML

Yes

No

Yes

Yes

Create, Read, Update, Delete

No

No

Yes

No


 

SAX programming

 

To use SAX you create an XMLReader object, to which you register the classes you create implementing the ContentHandler interface, which are the call backs that do the work in the programs we write

 

Note: when we create an XMLReader object, we should use the no-argument version of XMLReaderFactory.createXMLReader( ), which will use the default parser, Xerces as the parser.  Xerxes is the best parser.

 

NOTE: The call to createXMLReader shown on page 213 will not work on our systems--just drop the argument.

 

Example 5.3, pages 230 - 231, is a SAX client that has been created to read the XML generated by the XML-RPC server, shown at the top of page 142. here is the output from this server along with the events SAX generates:

           

C:\XMLJAVA>java FibonacciXMLRPCClient 10

<?xml version="1.0"?> startDocument event

<methodResponse>  startElement event

    <params>                       startElement event

        <param>                    startElement event

            <value>                startElement event

                <double>           startElement event      <--at this callback, set flag "inDouble"

                    55             characters event    <--at this callback, get the "55", knowing we're inside <double>

                      </double>          endElement event       <--at this callback, reset flag "inDouble"

            </value>               endElement event

        </param>                   endElement event

    </params>                      endElement event

</methodResponse>                  endElement event

 

We want to extract the “55” from this XML.  We have to do this by creating the right callback methods—see above ß notes for the plan.  Each callback call provides the relevant details, like the element name. So at the startElement callback for <double> we know we’re processing a <double> start-element.

 

That's the plan.  We get to code the callbacks.  We do it by writing a class that implements ContentHandler, or its subclass DefaultHandler that provides trivial implementations we can override.

 

See pg. 232, for the FibonacciHandler that implements the callbacks startElement, endElement, and characters to do our planned actions at the various events--

 The characters method delivers text via an array of char, which gets printed out

 

The startElement and endElement methods turn on and off the inDouble flag. When the inDouble flag is set to true, the characters method prints the text. 

 

 

Returning to the top-level of the client program in Example 5.3:

 

            XMLReader parser = XMLReaderFactory( );  // use no-argument constructor

 

            creates the parser, when no argument is specified, it uses the default Xerces parser, which is the best

 

We have to tell this parser about the ContentHandler object, FibonacciHandler:

 

            org.xml.sax.ContentHandler  handler = new FibonacciHandler( );

                       // better import statement would make prefix unnecessary

            parser.setContentHandler( handler );

 

Then a new InputSource object named "source" is created from the InputStream from the URLConnection to the server.

 

            InputStream in = connection.getInputStream( );

      InputSource source = new InputSource(in);

            // we don't have to specify the encoding of the input stream since the parser will

            //determine the encoding from the XML document header

 

Then parser.parse(source) is called to do the parsing, creating the events, and thus the callbacks that are handled as planned.

 

            // this call does not return until it finishes parsing the document and issuing the callbacks

 

Whitespace text in XML generates “extra” calls to characters

Whitespace is text and a SAX parser will generate text events (callbacks to characters) for whitespace as well as character text.  These calls are not shown above.

 

 <?xml version="1.0"?> 

<methodResponse>       <--in here have <EOL><space><space><space> between element tags

    <params>                  

 

Here the end-of-line <EOL> is <LF> on UNIX/Linux or <CR><LF> on Windows.  So with the spaces we see 4 or 5 characters of whitespace between these two element tags. The whitespace in the above xml document is an example of "ignorable whitespace". This text can only be truly ignored by the parser if the document has a DTD.  Without a DTD, we have to be ready for the parser to call characters to deliver each piece of whitespace between elements.

 

In a lot of cases we have to deal with this kind of whitespace in our programs, that is, tolerate calls to "characters" for these little bits of whitespace we don't care about.

 

The SAX parser descends this tree in a depth first, right to left traversal, generating events as it progresses

 

The XML tree here, showing the bits of whitespace between the tags:

 

 

 

 

image001.jpg

 

XML tree, showing bits of whitespace reported by SAX via calls to characters. In the example 5.3 program, these calls happen but most don’t produce output from the program because the flag “inDouble” is false, except at the bottom (which really shouldn’t have three separate text nodes).

 

Chapter 6

 

Ex 6.1 never turns on validation, so it is just checking well-formed-ness even with a validating parser and availability DTD/schema.

This program has useful error reports/diagnosis. It tells you the line no, column no and the problem.  We should get back to how this works.

 

Tracking the Element Hierarchy when using SAX, needed for pa2 (some coverage in pa2.html)

 

When working with SAX, you have to track the element hierarchy yourself, you can't ask SAX what the parent of a given element is (unlike DOM)

 

In pa2 getting the full name of the method requires tracking the class hierarchy.

 

Given an xml document

 

       We can find the path of any element using the following approach with a SAX parser:

 

       startDocument: create a stack of type String

       startElement: push the local name onto the stack ßand here get the path from the stack

       endElement:  pop the top String on the stack

       endDocument: the stack should be empty

 

Example 6.7 uses this approach (ignore the JTree GUI)

 

Also, for pa2, note that TestXPath has ancestry-related code (climbs the tree)

 

Back to Ex 6.1 : XMLReader is just an interface, see pg 877.

 

XMLReader parser = XMLReaderFactory.createXMLReader(); // factory method, a static method of XML Reader Factory, that return an object that IS-A XMLReader. (this object implements XMLReader).

 

Parser.parse(args[0]) // args[0] = “http:// www. …”  New form of call to parse, with URL arg

Look at API, pg. 877 in the Appendix

Match this to--

parse(String systemID) // the systemID is the name of the URL document

vs. parse(InputSource input)   which we have previously been using.

Note that both overloaded versions of parse throw IOException, SAXException.

 

Here the IOException is thrown by the parser code for I/O error it sees.

The SAXException can be thrown by the parser or by the callbacks (the parser calls the callbacks, like startElement).

 

Look at the catches in Ex. 6.1, how they report on the different exception cases.

 

DefaultHandler: convenience class, does not add any functionality. This provides do-nothing implementations for the 11 methods of content Handler + other trivial methods for other related interfaces.

 

The do-nothing implementation are live void characters(..) throws SAXException {}; so our overriding methods can throw.

 

Compare the two ways to program the set of callbacks we need: use DefaultHandler, or implement ContentHandler directly:

Case 1: use DefaultHandler:

Ex. 6.5 – the class TextExtractor extends DefaultHandler { … } ß no do-nothing methods.

 

Case 2: implement ContestHandler directly:

Ex. 6.3 – the class TextExtractor implements ContentHandler{…} ßlots of do-nothing methods

 

Suggest use of  DefaultHandler in pa2.

 

 

ContentHandler API

 

Pg 264 – ContentHandler’s core methods:

startElement() , endElement() and start/endDocument(), characters()

….

 

All these methods throw SAXException.

 

On the other hand, an overriding method can throw nothing, or a subclass of the API throw, like the method return value, which also can be a subclass of the original API return type.

 

Content Handler: provides access to the normal “meat” of the doc, not the way it’s expressed exactly in the XML doc. + no comments. The characters provide the text after CDATA markup taken out, character entities resolved & built-in entities.

-         and provides little bits of whitespace we don’t really want.

 

The multiple document example, EX 6.6 is incomplete because it does not have a driver. The driver needs to have a try catch block around the call to parser.parse(), so one bad file name does not abort the entire process

 

 

Handling Attributes in SAX

  

Recall from last week that an attribute usually belongs to no namespace and has no prefix, but is specified in the schema that goes with their element, unless it is a "global attribute" like those used by XInclude, that can be attached to any element, and does have a prefix, and is in its namespace.  Clear?

 

When we use a global attribute, we always use a prefix.

 

XLink is a standard for linking an xml document element to other internet resources (like an html link). Here is an example from page 281. As in the case of the example from pg. 31, we need to drop the xlink:type attribute since it turns out not to be a proper global attribute of XLink (the XLink schema has no <xsd:attribute name=”type”.../> as a direct child to <xsd:schema>)

 

<magazine xmlns:xlink=http://www.w3.org/TR/1999/xlink

      xlink:type="simple" xlink:href="http://www.thenation.com/">   <!—crossed-out part needs to be dropped -->

      The Nation

</magazine>

 

Element name magazine is defined in some application namespace, and the magazine element has an XLink to that magazine’s website.

 

xmlns:xlink = … shows the URI for the XLink namespace, and its prefix (xlink) in this doc 

href is a name in the XLink namespace, (and a global attribute by the XLink schema)

xlink:href is a qualified name, or qname

the attribute value is http://www.thenation.com/, the URL we want to follow  

 

Example 6.9, page 281, is a spider program than crawls an xml document's XLinks, and so must find and process the href attributes in the document, then any href attributes in the linked document, and so on, assuming the linked documents are in XML.

 

How should one go about finding the links, i.e. those particular attributes? The first thought is to search for attributes with names "xlink:href", but the prefix is arbitrary, and the local name "href" might belong to another namespace

 

For a certain element, there is only one instance of an attribute with a specific local name and a specific URI—that’s the best way to identify what we want.  We want to determine the string value of this attribute, to get the URL to follow in the spider operation.

 

Attributes are delivered with the startElement callback

 

SAX's startElement() has a parameter of type Attributes (pg. 279)

 

      package org.xml.sax;

 

public interface Attributes

{

       public int    getLength ();

 

       public String getQName(int index);

       public String getURI(int index);

       public String getLocalName(int index);

       public int getIndex(String uri, String localPart);

       public int getIndex(String qualifiedName);

       public String getType(String uri, String localName); 

       public String getType(String qualifiedName);

       public String getType(int index);

       public String getValue(String uri, String localName);  ß the one we want

       public String getValue(String qualifiedName);

       public String getValue(int index);

}

 

This Attributes object contains a set of attributes of no particular order (an element has a set of attributes)

 

Still, individual attributes can be accessed by an index number in this representation, a somewhat surprising sytem.

 

We can use public String getValue(String uri, String localName) to search for the links we want. This is foolproof, since the URI is a unique id of the NS.

 

The spider of Ex. 6.9, pg 281, puts the URLs it collects in a queue to remember them for later processing

 

In the endDocument() method there is an attempt to dequeue a URL from the stack. If it is successful, this document is parsed

 

The code in endDocument is very ugly, because after processing a few docs you have a stack of calls to the same parser object. But we’re not interested in rewriting this program.