Class 11, Mar 7
Hopefully you’ve tried the Tomcat install at home. Any problems?
Note: there is another example servlet, servlet2, on the course website. This shows you how to access a local file, needed for pa2.
Good idea to try these out first on UNIX: see directions on forum entry.
Start from pa1b solution, unless your solution is really complete.
--->We looked at the book example, pg. 146. A servlet producing XML, useful for pa2. Much like servlet2.
Linkage from instance document foo.xml to schema:
· Without namespace: xsi:noNamespaceSchemaLocation with value “URL”
· With namespace: xsi:schemaLocation with value “URI URL”, two URL-syntax strings separated by whitespace, the first for the namespace URI and the second for the schema’s URL.
Note: This attribute (schemaLocation or noNamespaceSchemaLocation) is an example of a global attribute, one with a prefix. The xsi prefix for the XML instance NS is set up with the xmlns:xsi=”URI_of_XMLInstance”. By XML instance we mean the XML document itself, rather than the schema. The XML document needs to point to its schema, which it does with the help of the XSI namespace.
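To make the "global attribute" idea concrete, here is a minimal sketch (not from the book) that parses a tiny instance document and looks up the linkage attribute by namespace URI plus local name rather than by prefix. The element name foo and the schema file name foo.xsd are made up for illustration, and the parser here is non-validating, so foo.xsd is never actually fetched.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Sketch only: "foo" and "foo.xsd" are hypothetical names used for illustration.
class SchemaLinkDemo {
    // The standard XML Schema instance namespace, usually bound to the xsi prefix.
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    // Returns the value of the noNamespaceSchemaLocation attribute on the root element.
    static String schemaHint(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // needed so prefixes are resolved to namespace URIs
            Element root = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            // Look the attribute up by (namespace URI, local name), not by prefix.
            return root.getAttributeNS(XSI, "noNamespaceSchemaLocation");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<foo xmlns:xsi='" + XSI + "'"
                   + " xsi:noNamespaceSchemaLocation='foo.xsd'/>";
        System.out.println(schemaHint(xml));
    }
}
```

Because the lookup uses the namespace URI, the same code works if the document binds that namespace to a prefix other than xsi.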
We looked at web.xml from servlet1 and servlet2 as an example of an XML doc with both a namespace and a schema associated with it.
Last time we looked at the XLink example in the book, pg. 31. In order to get this example to actually work with schemas, we need the Address schema to allow “extra” attributes, AND provide schema linkage for the parser to check that href is really a global attribute in the XLink schema. Since the type attribute is not such an attribute, that attribute needs to be removed from the example. See notes for last class for details.
You can restrict the extra attributes to be in a certain namespace, and “any” elements similarly:
Example from XML Primer: allow an element to have any XHTML elements and attributes:
<element name="htmlExample">
  <complexType>
    <sequence>
      <any namespace="http://www.w3.org/1999/xhtml"
           minOccurs="1"
           maxOccurs="unbounded"
           processContents="skip"/>
    </sequence>
    <anyAttribute namespace="http://www.w3.org/1999/xhtml"/>
  </complexType>
</element>
See examples htmlExample.xml/xsd in $cs639/validate-ns. Note that the <anyAttribute> here is not needed, since there are no global attributes in the XHTML schema.
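As a rough check of how the <any> wildcard behaves, the sketch below validates two small documents against an in-memory copy of the htmlExample schema using the JDK's javax.xml.validation API. This is an illustration under assumptions, not the course's validate-ns setup: the schema is written with an xs: prefix, the <anyAttribute> is left out, and the two test documents are invented.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Sketch only: in-memory schema and documents, invented for illustration.
class AnyWildcardDemo {
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='htmlExample'><xs:complexType><xs:sequence>"
        + "<xs:any namespace='http://www.w3.org/1999/xhtml' minOccurs='1'"
        + " maxOccurs='unbounded' processContents='skip'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Returns true if the document validates against XSD, false on a schema or validation error.
    static boolean isValid(String xml) {
        try {
            Validator v = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)))
                    .newValidator();
            v.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // with no ErrorHandler set, errors surface as exceptions
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Child element in the XHTML namespace: accepted by the wildcard.
        System.out.println(isValid(
            "<htmlExample><p xmlns='http://www.w3.org/1999/xhtml'>hi</p></htmlExample>"));
        // Child element in no namespace: rejected by the wildcard.
        System.out.println(isValid("<htmlExample><p>hi</p></htmlExample>"));
    }
}
```

With processContents="skip" the validator only checks that each child is in the XHTML namespace; it does not look for an XHTML schema to check the child's own content.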
Another example of a standard namespace is the SOAP Namespace, specifically the SOAP Envelope Namespace
This is the request studied above with the standard SOAP envelope around it, from pg. 97:
This shows an example with one default NS and one non-default NS in use.
<?xml …?>
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">   <-- SOAP-ENV is a prefix
  <SOAP-ENV:Body>
    <getQuote xmlns="http://namespaces.cafeconleche.org/xmljava/ch2/">
      <symbol>RHAT</symbol>
    </getQuote>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
Here we see local names Envelope and Body of the SOAP envelope NS.
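One way to see which namespace each element ends up in is to parse the message with a namespace-aware DOM parser and print the namespace URI next to the local name. This is only a sketch; the class and method names are made up, and the XML declaration is dropped from the embedded message.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Sketch only: class and method names are hypothetical.
class SoapNamespaceDemo {
    static final String SOAP =
          "<SOAP-ENV:Envelope xmlns:SOAP-ENV='http://schemas.xmlsoap.org/soap/envelope/'>"
        + "<SOAP-ENV:Body>"
        + "<getQuote xmlns='http://namespaces.cafeconleche.org/xmljava/ch2/'>"
        + "<symbol>RHAT</symbol></getQuote>"
        + "</SOAP-ENV:Body></SOAP-ENV:Envelope>";

    // Returns "{namespaceURI}localName" for the first element with the given local name.
    static String qualifiedName(String localName) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SOAP)));
            // "*" matches an element with this local name in any namespace.
            Element e = (Element) doc.getElementsByTagNameNS("*", localName).item(0);
            return "{" + e.getNamespaceURI() + "}" + e.getLocalName();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Envelope", "Body", "getQuote", "symbol"}) {
            System.out.println(qualifiedName(name));
        }
    }
}
```

Envelope and Body come out in the SOAP envelope namespace (via the prefix), while getQuote and its child symbol come out in the application namespace (via the default namespace declared on getQuote).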
We looked into how the SOAP envelope schema can handle the message part inside, which is designed by the app developers.
It uses <any ...>. This schema is in Appendix B, pp. 969-972. The <any> element is on pg. 971, inside schema element name=”Body”.
We have not yet shown how to validate XML using both the SOAP envelope schema and the app schema for the contents of Body. Need to import one schema into another. That’s a more advanced topic.
Also note there is another SOAP schema in Appendix B, for “encoding”. We will never use this encoding technique. It is now obsolete. Skip pp. 104-end of chapter 2.
We are skipping now to Chapter 5, Reading XML. We should return to the multiple schema case later sometime.
We will see that reading XML in a Java program (or other programming language) is greatly aided by an XML Parser.
XML Parsers, the quick summary: SAX, DOM are the most important APIs. Only other one we need from Chap. 5 is JAXP. StAX is newly important, but not in Ch. 5.
All these are in the current JDK.
Chapter 5: some discussion of which SAX and DOM parser distribution to choose, ending with conclusion that Xerces is best.
-- Luckily, that’s the one in the current JDK, as predicted on pg. 228.
--So we don’t have to agonize over this, or obtain additional jars, just use the JDK
Starting from the basic idea of reading XML…
In order to read XML we must be able to accept files in UTF-8 encoding and turn them into the Unicode that Java uses.
This is accomplished with the following code snippet from page 215 of our text, also in servlet2’s EchoHtml.java:
Reader reader = new InputStreamReader( in, "UTF-8" );
Here in is an InputStream, i.e., a byte stream.
InputStreamReader is a bridge class from byte streams to character streams (as OutputStreamWriter is on the output side). It knows how to decode UTF-8 bytes into the UTF-16 chars used by Java Strings.
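A small self-contained sketch of that decoding step: the byte array below is the UTF-8 encoding of "café", in which é takes two bytes but becomes a single Java char. The class name and the sample bytes are chosen for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch only: decodes an in-memory byte array instead of a network stream.
class Utf8DecodeDemo {
    // Decodes UTF-8 bytes into a Java String by reading through an InputStreamReader.
    static String decode(byte[] utf8Bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) { // read() returns one UTF-16 char at a time
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "café" in UTF-8: c, a, f are one byte each; é is the two bytes C3 A9.
        byte[] bytes = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};
        String s = decode(bytes);
        System.out.println(bytes.length + " bytes -> " + s.length() + " chars"); // 5 bytes -> 4 chars
    }
}
```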
Suppose we were trying to read the XML from page 211-212 of the text, to obtain just the number in the middle:
<?xml version="1.0"?>
<methodResponse>
  <params>
    <param>
      <value><double>28657</double></value>
    </param>
  </params>
</methodResponse>
We could parse the XML "by hand" to extract the single number, 28657, as is done by the code on page 215, which looks for the string "<value><double>". This is very fragile and will break down if there is whitespace between the start tags.
The code on page 215 is ugly and not fully-internationalized code. It would break on supplementary (2-char) Unicode characters.
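To see why char-by-char processing is risky, here is a short sketch showing that one supplementary character occupies two Java chars (a surrogate pair). The character used, U+1D11E (musical G clef), is just a convenient example and does not come from the book.

```java
// Sketch only: demonstrates that a supplementary character is two chars in a String.
class SupplementaryCharDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF, a character outside the 16-bit range.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of actual Unicode characters (code points) in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("chars=" + charCount(CLEF)
                + " codePoints=" + codePointCount(CLEF)); // chars=2 codePoints=1
        // Each half on its own is a surrogate, not a usable character.
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0))); // true
    }
}
```

Code that treats each char as a complete character can split such a pair in the middle; working with whole Strings keeps the two halves together.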
A quick fix to support internationalization is to use String instead of char, as is done in the servlet2 code:
Reader reader = new InputStreamReader(in, "UTF-8"); // as on pg. 215
BufferedReader br = new BufferedReader(reader);     // not named "in": that is already the InputStream
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) // read into String, not char by char
    sb.append(line); // the end-of-line is not preserved, but we don't need it
// servlet2's code uses println of PrintWriter to keep eol's
However the code following this part on pg. 215 is still dependent on an exact “<value><double>” match.
Note: the following is not covered in class
We could improve this code using the Scanner class, which is new in Java 5. Scanners work something like scanf in C: they parse the text looking for items specified by regular expressions. You can construct a Scanner from almost anything.
Scanner s = new Scanner( in, "UTF-8" ); // in is an InputStream, as above.
s.findWithinHorizon( "<value>\\s*<double>(\\d*)</double>\\s*</value>", 0 );
// (\d*) is a "group", like %d in scanf
MatchResult result = s.match( ); // throws IllegalStateException if nothing matched
if ( result.groupCount( ) == 1) {
    String value = result.group( 1 ); // group 1 is the digits; group 0 is the whole match
} else {
    ...
}
where "<value>\\s*<double>(\\d*)</double>\\s*</value>" is a regular expression, really <value>\s*<double>(\d*)</double>\s*</value>, but we had to escape each \ in the String. The parentheses of (\d*) make it a group (of digits) and \s* matches 0 or more whitespace chars
End of skipped-in-class example
But hand parsing is not the way to go. We want a parser that knows XML to do the work for us.
Goals for an XML parser (page 217)
o convert the stream to Unicode (though Java does this for us)
o assemble the different parts of a document divided into multiple entities (see page 25)
o resolve character references (like &#2022;) and built-in entities (like &lt;)
o understand CDATA sections
o check well-formedness constraints
o keep track of namespaces and their scope
o validate the document against its DTD or Schema (not all parsers do this)
o deal with unparsed entities (like images)
o assign types to attributes (like CDATA, ID, IDREF, see list pg. 34)
A CDATA section (not to be confused with CDATA attribute type) contains text that is not parsed in any way. An example from the standard:
<![CDATA[<greeting>Hello world</greeting>]]>
<![CDATA[ is the CDStart markup for a CDATA section and ]]> is the CDEnd markup.
- The content is everything in between. It is interpreted as literal text; markup inside it does not count.
- In content for an element: <foo>blah blah <![CDATA[<greeting>Hello world</greeting>]]></foo>
- The parser delivers character content for element foo as "blah blah <greeting>Hello world</greeting>", or for XPath, this is in the text node below element node "foo".
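The CDATA behavior above can be checked with a few lines of DOM code. This is a minimal sketch using the JDK's built-in parser; the class and method names are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Sketch only: shows the character content a parser reports for an element with a CDATA section.
class CdataDemo {
    // Parses the document and returns the text content of its root element.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent(); // concatenates ordinary text and CDATA text
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<foo>blah blah <![CDATA[<greeting>Hello world</greeting>]]></foo>";
        System.out.println(textOf(xml)); // blah blah <greeting>Hello world</greeting>
    }
}
```

The <greeting> tags arrive as plain characters, not as a child element of foo.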
XML Reading APIs - an overview
SAX - "Simple API for XML"
· the gold standard for XML parsing
· is a read only API
· is an event-driven API, that is, it uses callbacks
· lightweight - it doesn't create Java objects on its own
· the programmer has to create objects if they are needed
· this or StAX are the only ways to deal with huge XML documents that won't fit into memory
DOM - "Document Object Model"
· the term DOM is used to refer to both the model and the API
· turns XML into a tree of objects
· can write XML, unlike SAX
· resulting DOM tree can support XPath queries, unlike SAX and StAX
· great for small to medium-sized documents
· can create and then update a tree of objects
· expensive in terms of memory and cpu cycles
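As an illustration of the DOM approach, the sketch below builds a tree from the methodResponse document and pulls out the number with an XPath query. It is not the book's code; the class name and the XPath expression are my own, and the whole document is held in memory, which is fine at this size.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch only: DOM tree plus an XPath query, using the JDK's built-in implementations.
class DomXPathDemo {
    static final String RESPONSE =
          "<?xml version=\"1.0\"?>\n"
        + "<methodResponse>\n  <params>\n    <param>\n"
        + "      <value><double>28657</double></value>\n"
        + "    </param>\n  </params>\n</methodResponse>";

    // Returns the text of the <double> element, or "" if there is none.
    static String extractDouble(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath()
                    .evaluate("/methodResponse/params/param/value/double", doc)
                    .trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractDouble(RESPONSE)); // 28657
    }
}
```

Unlike the string search for "<value><double>", this does not care whether there is whitespace between the start tags.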
JAXP - "Java API for XML Processing"
· is an envelope that manages SAX, DOM and XSLT
· SAX and DOM are both contained in it
JDOM (we’re skipping this)
· like DOM it creates a tree of objects
· designed by Java people specifically for Java
· less ugly in detail than DOM, which must be language neutral, but has gaps that make it less general, especially for mixed content
StAX: Streaming XML Parser, newer than our text
· Like SAX, it uses little memory, so is efficient
· Considered easier to use than SAX: supports an iterator over the XML, rather than callbacks
· Has an XML writer associated with it, unlike SAX.
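For comparison with the callback style, here is a minimal StAX sketch that iterates over the parse events and stops at the <double> element. It uses the javax.xml.stream API in the JDK; the class and method names are hypothetical and the input is an in-memory string.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Sketch only: pull-style parsing, where our code asks for the next event.
class StaxDemo {
    // Returns the text of the first <double> element, or null if none is found.
    static String extractDouble(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int event = r.next(); // we drive the loop; no callbacks involved
                if (event == XMLStreamConstants.START_ELEMENT
                        && r.getLocalName().equals("double")) {
                    return r.getElementText().trim(); // text up to the matching end tag
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<methodResponse><params><param><value>"
                   + "<double>28657</double></value></param></params></methodResponse>";
        System.out.println(extractDouble(xml)); // 28657
    }
}
```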
From Sun’s Web services tutorial:
SAX programming
To use SAX, you create an XMLReader object and register with it an object of a class you write that implements the ContentHandler interface. Its methods are the callbacks that do the work in the programs we write.
Note: when we create an XMLReader object, we should use the no-argument version of XMLReaderFactory.createXMLReader( ), which will use the default parser, Xerces. Xerces is the best parser.
NOTE: The call to createXMLReader shown on page 213 will not work on our systems--just drop the argument.
Example 5.3, pages 230 - 231, is a SAX client that has been created to read the XML generated by the XML-RPC server, shown at the top of page 142. Here is the output from this server along with the events SAX generates:
C:\XMLJAVA>java FibonacciXMLRPCClient 10
<?xml version="1.0"?>     startDocument event
<methodResponse>          startElement event
  <params>                startElement event
    <param>               startElement event
      <value>             startElement event
        <double>          startElement event   <-- at this callback, set flag "inDouble"
          55              characters event     <-- at this callback, get the "55", knowing we're inside <double>
        </double>         endElement event     <-- at this callback, reset flag "inDouble"
      </value>            endElement event
    </param>              endElement event
  </params>               endElement event
</methodResponse>         endElement event
We want to extract the “55” from this XML. We have to do this by creating the right callback methods (see the <-- notes above for the plan). Each callback call provides the relevant details, like the element name. So at the startElement callback for <double> we know we’re processing a <double> start-element. More next time.
That's the plan. We get to code the callbacks. We do it by writing a class that implements ContentHandler, or one that extends DefaultHandler, a class that implements ContentHandler with trivial (do-nothing) methods we can override.
See pg. 232, for the FibonacciHandler that implements the callbacks startElement, endElement, and characters to do our planned actions at the various events--
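The plan with the "inDouble" flag can be sketched as follows. This is not the book's FibonacciHandler: the class names are hypothetical, the input is an in-memory string rather than a server response, and the XMLReader is obtained through JAXP's SAXParserFactory instead of XMLReaderFactory, which is deprecated in recent JDKs.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: a hypothetical handler illustrating the flag-based plan.
class DoubleHandlerDemo {
    static class DoubleHandler extends DefaultHandler {
        private boolean inDouble = false;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("double")) inDouble = true;   // set flag
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("double")) inDouble = false;  // reset flag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver text in several pieces, so accumulate it.
            if (inDouble) text.append(ch, start, length);
        }

        String result() {
            return text.toString();
        }
    }

    // Parses the XML with SAX and returns the text found inside <double>.
    static String extractDouble(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DoubleHandler handler = new DoubleHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.result();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><methodResponse><params><param>"
                   + "<value><double>55</double></value>"
                   + "</param></params></methodResponse>";
        System.out.println(extractDouble(xml)); // 55
    }
}
```

The handler compares qName because the default SAXParserFactory is not namespace-aware, in which case localName may be empty.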