CS639 – 9/3

CS639 – class 14

After Midterm: REST Web Services

Read Webber et al, Chap 1-3: background, we’ve covered the underlying Web technology already.

Also, Chap 11 on SOAP vs REST, etc.

Then start studying Chap 4, the CRUD service for coffee orders, first real example. I’ll supply a Java project for this.

Last time: intro to DOM Programming:

Ex 5.6: Reading and writing no-namespace XML

Ex. 10.5 Writing a document with a default namespace

Handout: Ex. 10.5 edited to write a document with a non-default namespace

Handout: Program to read XML with a namespace and determine that namespace by using rootNode.getNamespaceURI(). (This does assume the namespace is set up at that root element.)

There is another difference between Ex 5.6 and Ex. 10.5: two ways to create the top part of the XML tree. In Ex 5.6, create a Document node, then add a root Element to it. In 10.5, create the Document and root Element together in one call.
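A minimal sketch of the two ways to build the top of the tree, using only JDK classes. The class name TreeTopDemo, the element names and the example.com namespace URI are made up for illustration and are not taken from Ex 5.6 or Ex. 10.5.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TreeTopDemo {
    // Two-step style: make an empty Document node, then create and attach the root Element.
    static Document twoSteps(String rootName) throws ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement(rootName);
        doc.appendChild(root);
        return doc;
    }

    // One-call style: DOMImplementation creates the Document and its root Element together.
    static Document oneCall(String nsUri, String qName) throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DOMImplementation impl = factory.newDocumentBuilder().getDOMImplementation();
        return impl.createDocument(nsUri, qName, null); // null means no DOCTYPE
    }

    public static void main(String[] args) throws Exception {
        Document a = twoSteps("greeting");
        Document b = oneCall("http://example.com/ns", "g:greeting"); // hypothetical namespace
        System.out.println(a.getDocumentElement().getNodeName());     // greeting
        System.out.println(b.getDocumentElement().getNamespaceURI()); // http://example.com/ns
    }
}
```

The one-call form is the natural choice when the root element is in a namespace, since the URI and the qualified name are supplied together.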

Then we started looking at the data models for DOM and XPath

DOM Nodes vs. XPath Nodes

For DOM: 12 kinds of Nodes

For XPath: 7 kinds of Nodes.

See pg. 760 for discussion of XPath vs. DOM

DOM node	Corresp. XPath node
Document node	Root node
Element node*	Element node
Text node* (includes CDATA text if the coalescing feature is on)	Text node (but includes CDATA, entity text)
Attribute node	Attribute node
PI node (skip)*	PI node
Comment node*	Comment node
CDATA node* (missing if coalescing is on)	--- Absorbed into text node
Document type node*
Notation node (skip)
Document fragment node (skip)
Entity node (skip)	--- Resolved entities are in Text nodes
Entity Reference node* (skip)
	Namespace node

*= in official DOM tree, so reported by methods that iterate through child nodes

Note: Namespace Nodes are in the XPath data model but not in DOM (more below). In DOM, namespace information for nodes is available via Node methods getNamespaceURI and getPrefix (if the DOM is made “namespace-aware”). See pg. 904. This setup is convenient for programmers—DOM has digested the whole document and determined from the various xmlns=... attributes what NS pertains to each element and attribute.
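As a rough illustration of how a namespace-aware DOM has already worked out the namespace of every node, the sketch below asks a child element and a prefixed attribute for their namespaces. The XML, the example.com URIs and the class name are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceLookupDemo {
    // Made-up document: a default namespace plus a prefixed one, both declared on the root.
    static final String XML =
        "<grid xmlns='http://example.com/grid' xmlns:m='http://example.com/meta'>"
      + "<cell m:kind='header'>A1</cell></grid>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getNamespaceURI and getPrefix to work
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Element root = parse(XML).getDocumentElement();
        Element cell = (Element) root.getFirstChild();
        // The child declares nothing itself, yet DOM reports the default namespace in scope.
        System.out.println(cell.getNamespaceURI()); // http://example.com/grid
        System.out.println(cell.getPrefix());       // null (no prefix used)
        Attr kind = cell.getAttributeNodeNS("http://example.com/meta", "kind");
        System.out.println(kind.getPrefix() + " -> " + kind.getNamespaceURI());
    }
}
```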

Note: The built-in entity references (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot; and &amp;apos;) are always expanded during parsing, SAX or DOM. pg. 462. These are the only entity references we’re covering. Thus we don’t need to worry about Entity Reference Nodes.

JDK XPath queries are answered using DOM Nodes, so mismatches here could cause problems. It’s important to set the coalescing feature on so text is handled as expected in XPath.

DOM vs. XPath node trees: we’ll assume DOM has coalescing feature on

A simple XML document has the same tree of nodes for DOM, XPath

For example, the tree for the XML of the XML RPC example we looked at earlier. The only difference is root node vs Document node at the top.

Document Nodes (pg. 442)

We have been drawing XML trees without a special node at the top, above the root element node. But both DOM and XPath have such “extra” nodes at the top. Ex 9.2 shows why. It’s possible (even common) to have comments before the root element in the XML document, and other things too. So the Document Node (Root node in XPath) is needed to gather together these nodes along with the root element node to define the whole document.

Ex 9.2 shows a document node with 4 children

Same tree top in XPath: no DOCTYPE, root instead of Document
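To see the Document node gathering up more than just the root element, here is a small sketch that parses a made-up document (not the one in Ex 9.2) with a DOCTYPE and comments outside the root, then lists the node types of the Document's children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DocumentChildrenDemo {
    // Hypothetical document: DOCTYPE, comment, root element, trailing comment.
    static final String XML =
        "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>"
      + "<!-- before the root -->"
      + "<note>hi</note>"
      + "<!-- after the root -->";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        // Walk the children of the Document node itself, not of the root element.
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            String kind;
            switch (n.getNodeType()) {
                case Node.DOCUMENT_TYPE_NODE: kind = "DocumentType"; break;
                case Node.COMMENT_NODE:       kind = "Comment";      break;
                case Node.ELEMENT_NODE:       kind = "Element";      break;
                default:                      kind = "other";        break;
            }
            System.out.println(kind + ": " + n.getNodeName());
        }
    }
}
```

In the corresponding XPath tree the DocumentType child would be absent, leaving the two comments and the root element under the root node.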

Now we should be confident that the DOM and XPath trees are similar enough, at least for any XML we have actually covered, to understand how DOM can support XPath queries.

We need to cover a practical matter related to pa2: if the XML has a namespace, even a default namespace, it can cause the DOM’s XPath support to fail to find elements for us if the DOM has namespace aware turned on, as we have recommended. Since pa2 asks you to add a namespace to the XML such as Grid.xml, and then use XPath in a program for a client, we need to make sure that the DOM underlying the XPath support in use is not namespace aware.

We can understand why this can happen. If we use an XPath such as //method with namespaces in play, the name method in the document is in a namespace, and there’s no way to specify a default namespace for an XPath expression within the expression. Recall that XPath is not a standalone language, but instead expects to be living inside some other software system. That means that every XPath tool has to provide a way to define prefixes for XPath expressions separately from the XPath expression itself. Don’t ask me why this is (I don’t really know.) In XPath 1.0 an unprefixed name in an expression always means a name in no namespace, so //method looks for method elements with no namespace, finds none in a namespace-aware DOM, and the search fails.

Next class we’ll look into how to define prefixes for XPath expressions, but for now, we’ll use DOM with namespace aware turned off for XPath processing. See $cs639/xpath/TestXPathIgnoreNS.java for the code that does this.

Unfortunately, the first XPath tool, $cs639/xpath/TestXPath.java, which is written in the simplest way to tap into the JDK’s XPath support, ends up using a DOM with namespace awareness turned on, so fails to find elements. Try the following to see this:

java TestXPath Grid.xml

//method

… nothing found

java TestXPathIgnoreNS Grid.xml

//method

…lots of results…
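The same effect can be reproduced in a few lines. This is only a sketch, not the code of TestXPath or TestXPathIgnoreNS: the inline XML stands in for Grid.xml, and the class name and namespace URI are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathNamespaceDemo {
    // Stand-in for a Grid.xml that has been given a default namespace (hypothetical URI).
    static final String XML =
        "<grid xmlns='http://example.com/grid'><method>a</method><method>b</method></grid>";

    // Parse with the given namespace-awareness, then count the nodes the XPath selects.
    static int countMatches(String xml, String xpath, boolean nsAware) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(nsAware);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(xpath, doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Namespace-aware DOM: the elements are in a namespace, //method asks for none.
        System.out.println("aware:     " + countMatches(XML, "//method", true));
        // Namespace-unaware DOM: xmlns is just another attribute, so plain names match.
        System.out.println("not aware: " + countMatches(XML, "//method", false));
    }
}
```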

Element Nodes (pg. 443)

Name (returned by Node’s getNodeName) is QNAME like “book:section”.

URI of n.s. is available

Can have element, PI, comments, text, CDATA children

- attributes are not children !!! because they are not in the official DOM tree

The tree of nodes is fully ordered, but the attributes are not, so perhaps this is why they are not part of the official DOM tree.

Attribute Nodes – in DOM, not children of Elements and Element is not a PARENT (mutual). This differs from XPATH, where attributes are not children of elements but an attribute has a parent element node. In DOM, attributes are “owned” by an element. We can ask for an attribute’s owner via getOwnerElement(), pg. 895. This method used in TestXPath.
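A short sketch of the DOM side of this: the attribute is not among the element's children, its parent is null, and the owning element is reached through getOwnerElement(). The XML and the class name are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttrOwnerDemo {
    static final String XML = "<book id='b1'><title>DOM</title></book>"; // made-up sample

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Element book = parse(XML).getDocumentElement();
        Attr id = book.getAttributeNode("id");
        // Only <title> is a child; the id attribute is not in the child list.
        System.out.println(book.getChildNodes().getLength());      // 1
        // In DOM an Attr has no parent node at all.
        System.out.println(id.getParentNode());                    // null
        // The element is reached through the ownership link instead.
        System.out.println(id.getOwnerElement().isSameNode(book)); // true
        // getValue() gives the attribute value as one string.
        System.out.println(id.getValue());                         // b1
    }
}
```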

When we define a NAMESPACE via an ATTRIBUTE, DOM just gives it an attribute node, but XPATH has no attribute node for it.

XPath Namespace Nodes

XPath has namespace nodes, not for where the namespace is declared by an attribute, but to mark that an element has a namespace in scope. So the XPath tree of nodes is littered with these namespace nodes, one on each element for each namespace in scope there. Deep in the tree an element may easily have multiple namespace nodes attached to it. Like attribute nodes, they have the element as parent but are not counted among its children. See example, pp. 758-759. XPath also is holding the info on the actual namespace of each element and attribute, and these can be accessed by XPath functions (see pg. 758, 775-778).

Text Nodes, CDATA Section Nodes

Unlike XPath and SAX, DOM (by default) treats CDATA sections as separate nodes from non-CDATA text content. Important to set coalescing on to capture all text in text nodes. XPath text nodes are as big as possible, i.e., you don’t have to concatenate multiple text nodes together to get text content as we saw in SAX processing. A DOM parser also produces big-as-possible text nodes (pg. 445), but a program can still end up with several adjacent text nodes in the tree by adding them itself.

CDATA Example: <greeting> Hi <![CDATA[<happy>]]>!</greeting>

Cases of SAX, XPath, DOM with coalescing on: characters/text node value: Hi <happy>!

Case DOM with coalescing off: 3 nodes, with text as values:

Text node: Hi

CDATA node: <happy>

Text node: !
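The greeting example above can be checked with a small sketch that parses it twice, once with coalescing off and once with it on. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CoalescingDemo {
    static final String XML = "<greeting> Hi <![CDATA[<happy>]]>!</greeting>";

    // Parse the sample with the requested coalescing setting and return the root element.
    static Element root(boolean coalescing) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(coalescing);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML))).getDocumentElement();
    }

    static void dump(String label, Element greeting) {
        System.out.println(label + ": " + greeting.getChildNodes().getLength() + " child node(s)");
        for (Node n = greeting.getFirstChild(); n != null; n = n.getNextSibling()) {
            // getNodeName() is "#text" for Text nodes and "#cdata-section" for CDATA nodes.
            System.out.println("  " + n.getNodeName() + " = [" + n.getNodeValue() + "]");
        }
    }

    public static void main(String[] args) throws Exception {
        dump("coalescing off", root(false)); // Text, CDATA section, Text
        dump("coalescing on", root(true));   // a single Text node
    }
}
```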

Node Properties: Node is the base interface, Element, Attr, etc are subinterfaces of Node

Note the chart on pg. 450-451. Need to relate this to the Node API on pg. 904. All nodes have a “Name” (Node’s getNodeName()), but for some types it’s very generic, like “#text” for all text nodes. Elements and Attr’s have their prefixed name as name but this depends on the prefix, which is not a universal ID. You can call getLocalName() and getPrefix() and getNamespaceURI() to get the individual parts of the name, if the parsing was done with namespace-aware, as we have promised to do with DOM. Surprisingly, getLocalName() returns null if namespace-aware was not enabled for DOM parsing.
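A quick sketch of the name methods with and without namespace awareness. The element name book:section follows the example used earlier in these notes; the namespace URI and the class name are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NodeNameDemo {
    static final String XML = "<book:section xmlns:book='http://example.com/book'/>";

    // Parse the sample and return its root element.
    static Element root(boolean nsAware) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(nsAware);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML))).getDocumentElement();
    }

    static void show(String label, Element e) {
        System.out.println(label);
        System.out.println("  getNodeName()     = " + e.getNodeName());
        System.out.println("  getLocalName()    = " + e.getLocalName());
        System.out.println("  getPrefix()       = " + e.getPrefix());
        System.out.println("  getNamespaceURI() = " + e.getNamespaceURI());
    }

    public static void main(String[] args) throws Exception {
        show("namespace-aware:", root(true));      // section / book / the URI
        show("not namespace-aware:", root(false)); // the three parts come back null
    }
}
```

In both cases getNodeName() returns book:section, which is why it is not a reliable identifier on its own.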

Node value: (column Value on pg. 450, getNodeValue() in Node API): is null for elements, unlike the “string value” for XPath elements, which give you all the “content” below the element in the tree. There is another Node method, not listed on pp 904-905, getTextContent() which can deliver the content at and below this element (not incl. comments). Suggest adding this method to pg. 905.
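For a concrete comparison of getNodeValue() and getTextContent() on an element, here is a small sketch. The order document is invented for illustration and is not the Chap 4 coffee-order format.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TextContentDemo {
    // Made-up sample: a comment plus two child elements holding text.
    static final String XML =
        "<order><!-- rush --><drink>latte</drink><size>tall</size></order>";

    static Element root() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML))).getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element order = root();
        // The node value of an Element is always null.
        System.out.println(order.getNodeValue());   // null
        // getTextContent() concatenates the text below the element, skipping comments.
        System.out.println(order.getTextContent()); // lattetall
    }
}
```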

Node Parent, Children (columns on pg. 450, getParentNode, getChildNodes, also getFirstChild etc, in the Node API) for the DOM tree connections.

Note there is an error here listing Element as parent of Attr. Need to use Attr’s getOwnerElement to find the element of an attribute.

Also we see here that an attribute can have children, but we can ignore this by using the “value” via getValue, which assembles it together in one string.

DOM Parsers

Getting a parser: Xerces is provided in Java 6-7 via JAXP, as in Example 5.6 and as discussed on pg. 458.

Examples 5.6 and 10.5 are crucial—show most of the tricks we need.

Skip examples 9.3 and 9.4 because these use non JDK classes. Ex. 9.5 does the same thing with JDK classes.

Ex. 9.5 use DOM parser to check well-formedness. We saw same checking with SAXParser - Ex 6.1.
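The idea of a well-formedness check with the JDK's DOM parser can be sketched as follows. This is not the code of Ex. 9.5. The class and method names are hypothetical, and it takes the XML as a String to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {
    // Returns true if the parser can build a DOM tree, false if it reports a parse error.
    static boolean isWellFormed(String xml) throws IOException, ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // A well-formedness error is fatal, so the parser gives up with an exception.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<a><b/></a>"));   // true
        System.out.println(isWellFormed("<a><b></a>"));    // false: <b> is never closed
    }
}
```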

Setting the DOM parser configuration, pp. 461-463

Get a DocumentBuilderFactory as in Ex. 5.6 builderFactory, and act on it:

- setCoalescing (Boolean coalescing) p. 461 Good idea for data-centric apps. Treats text the way XPath does.

In fact, we need to set this to allow DOM’s XPath support (used in TestXPath) to return text from CDATA nodes.

Since TestXPath doesn’t do this, CDATA nodes do not show up in the NodeList returned from the XPath query. Should fix it (will discuss this next time.)

- Ignore Comments, p. 462: could go either way

- namespace Aware: p. 463: If using namespaces, should override this questionable default and make this true. As discussed last time, this default is inconsistent with SAX, which defaults to namespace processing.

- validation (DTD validation) p. 463: note the need for the SAX ErrorHandler here.

Xerces (in JDK) can do schema validation, see DOM Parser: Validating with XML Schema in the J2EE tutorial at Sun.

Get a DocumentBuilderFactory object as in Ex. 5.6

First turn on NS - aware, validation, as above, then schema validation is set up like this— from the same website linked above

Of course, we need a schema associated with the document

- our old way, by linkage in document

or by specifying its location dynamically in the program (XML Schema only I think)

// Needs imports: javax.xml.parsers.DocumentBuilder, javax.xml.parsers.DocumentBuilderFactory,
// org.w3c.dom.Document, org.xml.sax.InputSource
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
// DocumentBuilderFactory takes these JAXP settings via setAttribute (setProperty is the SAXParser method)
factory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
        "http://www.w3.org/2001/XMLSchema");
factory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaSource",
        "file:test2.xsd"); // (only needed if test.xml has no schema linkage itself)
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource("test.xml"));

Also, there is another way to do this now, with a “Schema” object. New with JAXP 1.3, since book was written. This newer way also allows us to check the validity of a DOM tree already in memory, not just one on the way into memory with parsing. If interested, see IBM doc. Also DOMReader1.java in $cs639/dom.
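A minimal sketch of the Schema-object approach, validating a DOM tree that is already in memory. This is not DOMReader1.java. The tiny inline schema, the class name and the method name are made up so the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SchemaObjectDemo {
    // Hypothetical schema: a single <count> element holding an integer.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='count' type='xs:int'/>"
      + "</xs:schema>";

    static boolean isValid(String xml) throws Exception {
        // Compile the schema once into a reusable Schema object.
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        // Parse without validation; the DOM tree is now in memory.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        try {
            // Validate the in-memory tree; with no ErrorHandler set, errors are thrown.
            schema.newValidator().validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<count>3</count>"));     // true
        System.out.println(isValid("<count>three</count>")); // false
    }
}
```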

In SOAP and REST, we have multiple schemas at work, but this can be handled too:

- Set up an array of Strings holding the URIs of the schemas, and pass the array to setAttribute instead of the single file URI.

Ex 9.6 uses a DOM parser to check DTD validity. Fix the caption to the figure. As with SAX, we need to turn on validity checking and provide an ErrorHandler.

Example 9.6

- checks validity w.r.t DTD

- we could make it do schema validation.
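A sketch of DTD validity checking with a DOM parser and an ErrorHandler, in the spirit of Ex 9.6 but not its code. The documents use an internal DTD subset so the example is self-contained. The class name, method name and sample XML are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidityDemo {
    static final String DTD = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
    static final String VALID = DTD + "<note><to>Ann</to></note>";
    static final String INVALID = DTD + "<note><from>Bob</from></note>";

    // Parse with DTD validation on and count the validity errors reported.
    static int countValidityErrors(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        final int[] errors = {0};
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignore warnings here */ }
            // Validity errors are recoverable, so the parser reports them and keeps going.
            public void error(SAXParseException e) { errors[0]++; }
            // Well-formedness errors are fatal; pass them on to the caller.
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid doc errors:   " + countValidityErrors(VALID));
        System.out.println("invalid doc errors: " + countValidityErrors(INVALID));
    }
}
```

Without the ErrorHandler the parse would still finish for the invalid document, which is why validity checking needs one.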

DOM parser API: page 912 -> 5 versions of parse! Ex 9.6 is using the 3rd version of parse. In fact the 2 versions of parse for the SAX parser (p.878) and 5 versions for the DOM parser actually cover the same set of possible types of input data streams of XML (URL, InputStream, Reader), because InputSource is itself a combo of input choices. The File version for DOM isn’t really a new input type because getting data from the file involves an InputStream. Recall that it’s good to specify a URL for the input in addition to an InputStream or Reader so the parser can interpret relative URLs in the document.

Pg. 466: discussion of DOM3 Load and Save, but a non-Sun version. Use JDK docs, FibonacciEx.java example.

Pg. 468: Node Interface—same as on pg. 904-905, with same error.

Ex. 9.11 Walking the tree: There’s no Node iterator across the tree in the Core DOM, so we need to do it by multiple steps as shown here and also (using NodeList) on pg. 482-483
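A minimal recursive walk using only the Core DOM methods getFirstChild() and getNextSibling(). This is a sketch, not the code of Ex. 9.11. The class name, the outline format and the sample XML are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TreeWalkDemo {
    // Visit a node, then recurse into each of its children in document order.
    static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName()).append('\n');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    // Parse the XML and return an indented outline of the whole tree.
    static String outline(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc, 0, out); // start at the Document node, above the root element
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline("<a><b>t</b><c/></a>"));
    }
}
```

Attributes do not show up in this walk, since they are not children in the DOM tree.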

NodeList Interface--you have some experience in pa2 on this.

JAXP Serialization: skip

Next time: advanced XPath, use in database load/extract, etc.

(The XPath examples are already available in $cs639/xpath. See its README if interested.)