CS639 Wed., Mar 1: handout on servlet1

PA2 available, Pa1b solution is available for it, servlet1, servlet2

Need to install tomcat at home, see links near pa2 on the class web page

Intro to pa2--set up servlet to serve out PA1 XML. Use the request URL to specify the Java file to process (the REST way).

Ex: http://sf08.cs.umb.edu:xxxxx/pa2/examples.sort.Sortex.xml to get XML describing SortEx.

Then write a client that uses XPath to print out methods.
Write a second client that uses SAX directly.

Deploying Servlets: servlet1, servlet2

The servlet1 project is in $cs639/servlet1, etc.

To try it out:

cd cs639

mkdir servlet1

cp -r $cs639/servlet1/* servlet1

cd servlet1

ant build

ant deploy -> copies to your webapps dir in Tomcat.

ant test1

Looking at servlet1 Handout:

web.xml specifies how the request URI is mapped to the servlet’s .class file, so tomcat knows what to load into its JVM.

This is done by match-up on servlet-name content (here “HelloWorld”) between the servlet and servlet-mapping elements.

So here the URL-ending “/servlet/HelloWorld” specifies “cs639.xml.servlet.HelloWorld”, understood to be under WEB-INF/classes in the webapp’s deployed area.

i.e., $CATALINA_HOME/webapps/servlet1/WEB-INF/classes/cs639/xml/servlet/HelloWorld.class.

The full URL that matches, for my tomcat, is http://sf08.cs.umb.edu:11600/servlet1/servlet/HelloWorld

Here we see host, port, webapp name, and finally /servlet/HelloWorld, which is handled by servlet1’s web.xml.

In fact, longer URLs (with query strings or ;xxx or both) would also match, such as http://sf08.cs.umb.edu:11600/servlet1/servlet/HelloWorld?x=10

The x=10 would never be processed by this servlet, however.
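To make the pieces of that URL concrete, here is a small sketch using java.net.URI. The class and method names (UrlParts, webappName, pathWithinWebapp) are made up for illustration, and treating the first path segment as the webapp name is a simplification that assumes the webapp is not deployed as the root application.

```java
import java.net.URI;

class UrlParts {
    /** Returns the first path segment, taken here to be the webapp name. */
    static String webappName(URI uri) {
        String[] segments = uri.getPath().split("/");
        // segments[0] is "" because the path starts with "/"
        return segments.length > 1 ? segments[1] : "";
    }

    /** Returns the rest of the path; this is what the webapp's web.xml mapping sees. */
    static String pathWithinWebapp(URI uri) {
        // skip the leading "/" plus the webapp name
        return uri.getPath().substring(1 + webappName(uri).length());
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "http://sf08.cs.umb.edu:11600/servlet1/servlet/HelloWorld?x=10");
        System.out.println("host:   " + uri.getHost());          // sf08.cs.umb.edu
        System.out.println("port:   " + uri.getPort());          // 11600
        System.out.println("webapp: " + webappName(uri));        // servlet1
        System.out.println("path:   " + pathWithinWebapp(uri));  // /servlet/HelloWorld
        System.out.println("query:  " + uri.getQuery());         // x=10
    }
}
```

The query string is separate from the path, which is why the URL with ?x=10 still matches the same servlet mapping.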

--->We looked at the book example, pg. 146. A servlet producing XML, useful for pa2. Much like servlet2.

We looked at web.xml from servlet1 and servlet2 as an example of an XML doc with both a namespace and a schema associated with it.

Last time we looked at the XLink example in the book, pg. 31. In order to get this example to actually work with schemas, we need the Address schema to allow “extra” attributes, AND provide schema linkage for the parser to check that href is really a global attribute in the XLink schema. Since the type attribute is not such an attribute, that attribute needs to be removed from the example. See notes for last class for details.

You can restrict the extra attributes to be in a certain namespace, and “any” elements similarly:

Example from XML Primer: allow an element to have any XHTML elements and attributes:

<any namespace="http://www.w3.org/1999/xhtml"

minOccurs="1" maxOccurs="unbounded"

processContents="skip"/>

</sequence>

</complexType>

</element>

See examples htmlExample.xml/xsd in $cs639/validate-ns. Note that the <anyAttribute> here is not needed, since there are no global attributes in the XHTML schema.

Another example of a standard namespace is the SOAP Namespace, specifically the SOAP Envelope Namespace

This is the request studied above with the standard SOAP envelope around it, from pg. 97:

This shows an example with one default NS and one non-default NS in use.

<?xml …?>

<SOAP-ENV:Envelope

xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"> <-- SOAP-ENV is a prefix

<SOAP-ENV:Body>

</getQuote>

</SOAP-ENV:Body>

</SOAP-ENV:Envelope>

Here we see local names Envelope and Body of the SOAP envelope NS.
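As a rough illustration of local names versus prefixes, the following sketch parses a cut-down envelope with a namespace-aware JAXP DOM parser. The message string is an assumed stand-in, not the book's example: in particular, the default namespace URI on getQuote (http://example.com/quotes) is hypothetical, and the class name NamespaceDemo is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceDemo {
    static final String SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    // Assumed sample message; the default namespace on getQuote is hypothetical.
    static final String MESSAGE =
        "<SOAP-ENV:Envelope xmlns:SOAP-ENV='" + SOAP_ENV_NS + "'>"
      + "<SOAP-ENV:Body>"
      + "<getQuote xmlns='http://example.com/quotes'/>"
      + "</SOAP-ENV:Body>"
      + "</SOAP-ENV:Envelope>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    static String describe(Element e) {
        return "local name=" + e.getLocalName()
             + ", prefix=" + e.getPrefix()
             + ", namespace=" + e.getNamespaceURI();
    }

    public static void main(String[] args) {
        Element envelope = parse(MESSAGE).getDocumentElement();
        Element body = (Element) envelope.getFirstChild();
        Element getQuote = (Element) body.getFirstChild();
        System.out.println(describe(envelope)); // prefix SOAP-ENV, SOAP envelope NS
        System.out.println(describe(body));     // prefix SOAP-ENV, SOAP envelope NS
        System.out.println(describe(getQuote)); // prefix null: it uses the default NS
    }
}
```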

We looked into how the SOAP envelope schema can handle the message part inside, which is designed by the app developers.

It uses <any ...>. This schema is in Appendix B, pp. 969-972. The <any> element is on pg. 971, inside schema element name=”Body”.

We have not yet shown how to validate XML using both the SOAP envelope schema and the app schema for the contents of Body. Need to import one schema into another. That’s a more advanced topic.

Also note there is another SOAP schema in Appendix B, for “encoding”. We will never use this encoding technique. It is now obsolete. Skip pp. 104-end of chapter 2.

We are skipping now to Chapter 5, Reading XML. We should return to the multiple schema case later sometime.

Reading XML in Java programs

We will see that reading XML in a Java program (or other programming language) is greatly aided by an XML Parser.

XML Parsers, the quick summary: SAX, DOM are the most important APIs. Only other one we need from Chap. 5 is JAXP. StAX is newly important, but not in Ch. 5.

All these are in the current JDK.

Chapter 5: some discussion of which SAX and DOM parser distribution to choose, ending with conclusion that Xerces is best.

-- Luckily, that’s the one in the current JDK, as predicted on pg. 228.

--So we don’t have to agonize over this, or obtain additional jars, just use the JDK

Starting from the basic idea of reading XML…

In order to read XML we must be able to accept files in UTF-8 encoding and turn them into the Unicode that Java uses.

This is accomplished with the following code snippet from page 215 of our text, also in servlet2’s EchoHtml.java:

Reader reader = new InputStreamReader( in, "UTF-8" );

Here in is an InputStream, i.e., a byte stream.

InputStreamReader is a bridge class from byte streams to character streams (as OutputStreamWriter is on the output side). Here it decodes UTF-8 bytes into the UTF-16 chars that Java Strings use.
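A small sketch of that decoding step, using an in-memory byte array in place of a network stream. The class and method names (DecodeDemo, readUtf8) are made up, and StandardCharsets.UTF_8 (Java 7 and later) is used here instead of the "UTF-8" string so that no UnsupportedEncodingException needs handling.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    /** Reads everything from in, decoding the bytes as UTF-8 into a Java String. */
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c); // each read() returns one UTF-16 code unit
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) takes two bytes in UTF-8: 0xC3 0xA9
        byte[] bytes = {'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9};
        String s = readUtf8(new ByteArrayInputStream(bytes));
        System.out.println(bytes.length + " bytes -> " + s.length() + " chars");
        // prints "5 bytes -> 4 chars"
    }
}
```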

Suppose we were trying to read the XML from page 211-212 of the text, to obtain just the number in the middle:

<?xml version="1.0"?>

<param>

</param>

</params>

</methodResponse>

We could parse the XML "by hand" to extract the single number, 28657, as is done by the code on page 215, which looks for the string "<value><double>". This is very fragile and will break down if there is whitespace between the start tags.

The code on page 215 is ugly and not fully-internationalized code. It would break on supplementary (2-char) Unicode characters.
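To see what "2-char" means here, this short sketch builds a String holding one supplementary character and compares its char count with its code point count. The class and method names are made up, and U+1D11E is just one convenient character outside the Basic Multilingual Plane.

```java
class SupplementaryDemo {
    /** Counts Unicode characters (code points) rather than UTF-16 code units. */
    static int characterCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF is stored as a surrogate pair in a Java String
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("length()    = " + clef.length());         // 2
        System.out.println("code points = " + characterCount(clef));  // 1
        System.out.println("high surrogate first? "
            + Character.isHighSurrogate(clef.charAt(0)));             // true
    }
}
```

Code that handles one char at a time can split such a pair, which is the kind of breakage meant above.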

A quick fix to support internationalization is to use String instead of char, as is done in the servlet2 code:

Reader reader = new InputStreamReader(in, "UTF-8"); // as on pg. 215; in is the InputStream

BufferedReader br = new BufferedReader(reader); // not named "in", which is already the InputStream

StringBuilder sb = new StringBuilder();

String line;

while ((line = br.readLine()) != null) // read into String, not char by char

    sb.append(line); // the end-of-line is not preserved, but we don't need it

// servlet2's code uses println of PrintWriter to keep eol's

However the code following this part on pg. 215 is still dependent on an exact “<value><double>” match.

Note: the following is not covered in class

We could improve this code using the Scanner class, which is new in Java 5. Scanners work something like scanf in C: they parse the text looking for items specified by regular expressions. You can construct a Scanner from almost anything.

Scanner s = new Scanner( in, "UTF-8" ); // in is an InputStream, as above.

// findWithinHorizon returns null if the pattern is not found; horizon 0 means no limit

String found = s.findWithinHorizon( "<value>\\s*<double>(\\d*)</double>\\s*</value>", 0 );

// (\d*) is a "group", like %d in scanf

if ( found != null ) {

    MatchResult result = s.match( );

    String value = result.group( 1 ); // group 0 is the whole match; group 1 is the digits

} else {

...

}

where "<value>\\s*<double>(\\d*)</double>\\s*</value>" is a regular expression, really <value>\s*<double>(\d*)</double>\s*</value>, but we had to escape each \ in the String. The parentheses of (\d*) make it a group (of digits), and \s* matches 0 or more whitespace chars.

End of skipped-in-class example

But hand parsing is not the way to go. We want a parser that knows XML to do the work for us.
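For comparison, here is a sketch of the same extraction done with the JDK's built-in DOM parser through JAXP. The sample document in main is an assumed reconstruction of the XML-RPC response (the text's own listing is not reproduced here), and the class and method names are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ParseResponse {
    /** Returns the number inside the first double element of the document. */
    static double extractDouble(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList doubles = doc.getElementsByTagName("double");
            if (doubles.getLength() == 0) {
                throw new IllegalArgumentException("no <double> element found");
            }
            return Double.parseDouble(doubles.item(0).getTextContent().trim());
        } catch (IllegalArgumentException e) {
            throw e; // includes NumberFormatException for non-numeric content
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample response, with whitespace between the tags
        String xml =
            "<?xml version=\"1.0\"?>\n"
          + "<methodResponse>\n"
          + "  <params>\n"
          + "    <param>\n"
          + "      <value>\n"
          + "        <double>28657</double>\n"
          + "      </value>\n"
          + "    </param>\n"
          + "  </params>\n"
          + "</methodResponse>\n";
        System.out.println(extractDouble(xml)); // prints 28657.0
    }
}
```

Because the parser understands element structure, whitespace between the start tags no longer matters.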

Goals for an XML parser (page 217)

o convert the stream to Unicode (though Java does this for us)

o assemble the different parts of a document divided into multiple entities (see page 25)

o resolve character references (like &#2022;) and built-in entities (like &lt;)

o understand CDATA sections

o check well-formedness constraints

o keep track of namespaces and their scope

o validate the document against its DTD or Schema (not all parsers do this)

o deal with unparsed entities (like images)

o assign types to attributes (like CDATA, ID, IDREF, see list pg. 34)

A CDATA section (not to be confused with CDATA attribute type) contains text that is not parsed in any way. An example from the standard:

<![CDATA[<greeting>Hello world</greeting>]]>

<![CDATA[ is the CDStart markup for a CDATA section and ]]> is the CDEnd markup.

- The content is everything in between. It is interpreted as literal text; markup does not count.

- In content for an element: <foo>blah blah <![CDATA[<greeting>Hello world</greeting>]]></foo>

- The parser delivers character content for element foo as "blah blah <greeting>Hello world</greeting>", or for XPath, this is in the text node below element node "foo".
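The foo example above can be checked with a few lines of JAXP DOM code. This is a sketch with made-up class and method names; it also parses an equivalent document that uses &lt; and &gt; escapes instead of a CDATA section, to show that the parser delivers the same character content either way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {
    /** Returns the character content of the document's root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent concatenates ordinary text and CDATA section content
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String withCdata =
            "<foo>blah blah <![CDATA[<greeting>Hello world</greeting>]]></foo>";
        String withEscapes =
            "<foo>blah blah &lt;greeting&gt;Hello world&lt;/greeting&gt;</foo>";
        System.out.println(rootText(withCdata));
        System.out.println(rootText(withCdata).equals(rootText(withEscapes))); // true
    }
}
```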