CS639 Class 6

Chapter 2 - XML Protocols (includes SOAP, used for WebServices)

Also, read Chap 1 of REST in Practice

There are various “vocabularies” in use for interchanging data.

Google “recipe XML”, find RecipeML, etc.

Table 2.1. XML Schema Data Types – XML itself does not have them. It just has the idea of text, does not even have numbers.

XML Schemas do have Data Types, so useful for this. We can say that quantity should be a non-negative integer for example:

Ex 2.1. shows how you can decorate plain XML with types, an unusual practice in my experience

- wide use will make doc large.

- Big docs need XML Schema.

More normal to use an XML Schema to declare the types that go with the data in the various XML documents.
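
A minimal sketch of the "types live in the schema" idea, using the JDK's javax.xml.validation package. The one-element schema and the class name QuantityCheck are made up for illustration; they are not from the book.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

class QuantityCheck {
    // Hypothetical schema: a <quantity> element must hold a non-negative integer.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='quantity' type='xs:nonNegativeInteger'/>"
      + "</xs:schema>";

    // Returns true if the XML text validates against XSD, false otherwise.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // SAXException for invalid documents, IOException for read problems
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<quantity>3</quantity>"));
        System.out.println(isValid("<quantity>-1</quantity>"));
        System.out.println(isValid("<quantity>lots</quantity>"));
    }
}
```

The XML document itself stays plain text; only the schema knows that quantity is a number.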

Skipping forward to pg. 132 for a topic for pa1b…

Outputting XML from Java (relevant to pa1b)

Simple Example: pg 125

System.out.println(" <fibonacci> "); // the string literal is all ASCII chars

System.out.println(low); // the number is printed as ASCII digits

For ASCII chars, the ASCII code equals the UTF-8 code, so this produces valid UTF-8.

On output (result of println above), Java ends each line with the platform line separator: one LF on UNIX, CR followed by LF on Windows. For XML it does not matter, since a parser treats both as a line end, so this simple code works fine.

Writing out text with multi-byte UTF-8, i.e., non-ASCII chars.

Java uses Unicode Strings (UTF-16) internally, and UTF-8 is just another, more compact encoding of the same Unicode code points, so there is a well-defined transformation.

Pg 133: Shows how to set this up, to get Java to do the Unicode -> UTF-8 translation for us, instead of its default translation to whatever the local default encoding is. But we want to use “UTF-8”, not “8859_1”, which is Latin-1.

The first step is to make an Output Stream:

OutputStream fout = new FileOutputStream("foo.xml");

// Byte stream handler (low level I/O).

BufferedOutputStream bout = new BufferedOutputStream(fout);

// byte stream, buffered (more efficient).

OutputStreamWriter out = new OutputStreamWriter(bout, "UTF-8");

// knows about Unicode, and how to turn it into other encodings

If we write something to out, we can write a String (which is Unicode) and it gets transformed to UTF-8 for us (in bytes)

OutputStreamWriter is a “bridge” from Unicode to bytes. It has methods write, flush, close, and a few others, but gives no help in outputting numbers as text. Note that you don’t need to flush just before close (as done on pg. 134), because close flushes the stream.

We could do one more step (not shown in book) and get the use of println, etc:

PrintWriter out1 = new PrintWriter(out);

out1.println("<Fibonacci>");…
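
The whole chain can be sketched as below. To keep it runnable without touching the file system, a ByteArrayOutputStream stands in for the FileOutputStream; the class name Utf8Out and the element name are made up for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

class Utf8Out {
    // Writes a tiny XML element through the chain
    // byte stream -> buffer -> bridge -> PrintWriter
    // and returns the bytes that would have landed in the file.
    static byte[] write(String value) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for FileOutputStream
        BufferedOutputStream bout = new BufferedOutputStream(sink);
        OutputStreamWriter out = new OutputStreamWriter(bout, StandardCharsets.UTF_8);
        PrintWriter out1 = new PrintWriter(out);
        out1.print("<name>");
        out1.print(value);
        out1.print("</name>");
        out1.close(); // close() flushes every layer of the chain
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // e-acute is one Java char but two UTF-8 bytes
        byte[] bytes = write("caf\u00e9");
        System.out.println(bytes.length + " bytes for 17 chars");
    }
}
```

Passing StandardCharsets.UTF_8 instead of the string "UTF-8" avoids the checked UnsupportedEncodingException; both select the same encoding.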

Note there is a bridge class for input, InputStreamReader, so we can read UTF-8 input and get Java to turn it into Unicode for us.

Why “bridge” here?

Two kinds of file support: byte-level and Unicode-char-based, and bridge converts between.

Ref: Core Java, Vol 1 of 7th ed, or Vol 2 of later editions, chapter on Streams and Files. See the four class hierarchies, two for byte streams (input and output), two for Unicode character handling (input and output), with about ten classes each: what a zoo!

In actual files, or coming in from network, have stream of bytes. Inside Java, have Unicode strings.

In from file/network: bytes -> Unicode via bridge (InputStreamReader)

Out to file/network: Unicode -> bytes via bridge (OutputStreamWriter)
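
A small round trip through both bridges, as a sketch: OutputStreamWriter on the way out, InputStreamReader on the way in. In-memory byte arrays stand in for the file or network, and the class name BridgeRoundTrip is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class BridgeRoundTrip {
    // Out: Unicode String -> UTF-8 bytes, via the output bridge.
    static byte[] toBytes(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // In: UTF-8 bytes -> Unicode String, via the input bridge.
    static String toText(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c); // read() hands back one 16-bit char at a time
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes("na\u00efve");
        System.out.println(bytes.length + " bytes, back to: " + toText(bytes).length() + " chars");
    }
}
```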

This is just a slice of the Java file support jungle.

Moving XML from point A to point B.

HTTP is an obvious way to move XML, for reasons on pg. 65.

Look at Figure 2.1: This URL still works, but the first line no longer has ISO-8859-1, so they have modernized, using default UTF-8.

Still have odd construct of namespace name with URI of DTD—there is a DTD at that URL. No DOCTYPE for normal linkage to the DTD. Luckily our best tools don’t require explicit linkage to use schemas (DTD or XML Schema).

HTTP GET: the most important HTTP verb (others are POST, PUT, DELETE, and a few less common ones)

Look at GET on pg. 66, header, note blank line at end

Response, etc.

HTTP in Java

(pg 69, URL program)

URL Grabber program that takes a URL

URL url = new URL(urlString);

InputStream in = url.openStream();

Provides access to the contents of the document, nice and easy.

URL is a JDK class. Its openStream() method opens a TCP/IP stream connection to the server given in the URL, does an HTTP GET, and is ready to accept reads on this connection to get the data in the contents.

But note that an InputStream is a stream of bytes, not characters, and we know that UTF-8 data has multi-byte characters, so this is not as useful as it seems, unless we know that the data is all ASCII, Latin-1, or something binary.

url.openStream can be done in two steps,

url.openConnection().getInputStream()
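
A sketch of the two-step form. To run without a network it reads a file: URL for a temp file it writes itself (the same calls work for http: URLs); the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class UrlBytes {
    // Reads every byte the URL delivers. Note the result is bytes, not chars.
    static byte[] fetch(URL url) {
        try (InputStream in = url.openConnection().getInputStream()) {
            return in.readAllBytes(); // Java 9+
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a small document to a temp file and returns a file: URL for it,
    // so the sketch does not depend on any server being reachable.
    static URL sampleUrl() {
        try {
            Path tmp = Files.createTempFile("quote", ".xml");
            Files.write(tmp, "<quote>100</quote>".getBytes(StandardCharsets.UTF_8));
            tmp.toFile().deleteOnExit();
            return tmp.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = fetch(sampleUrl());
        System.out.println(data.length + " bytes read");
    }
}
```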

------------------------

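URLConnection also provides access to header info:

getContentType()

getContentEncoding()

So the HTTP header may say the character encoding is UTF-8 or Latin-1, etc. In practice that information arrives as the charset parameter of Content-Type (for example, text/xml; charset=UTF-8); the Content-Encoding header that getContentEncoding() returns describes compression such as gzip, not the character set. Book says it’s better to go by the XML prolog in the document itself, in the contents of the response.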
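
When a server does declare the character encoding in HTTP, it normally does so as the charset parameter of the Content-Type header. A rough sketch of pulling it out of the string that getContentType() returns; charsetOf is a hypothetical helper, not a JDK method, and it ignores the finer points of header syntax.

```java
class ContentTypeCharset {
    // Returns the charset parameter of a Content-Type value, or null if absent.
    // Example input: "text/xml; charset=UTF-8"
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // case-insensitive test for a leading "charset="
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // the value may optionally be quoted
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=UTF-8"));
        System.out.println(charsetOf("text/xml"));
    }
}
```

With a live connection this would be called as charsetOf(conn.getContentType()), where conn is the URLConnection.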

c = in.read() brings in a byte, which gets cast to char when it’s put in the StringBuffer.

Note: URLGrabber is only good for known 8-bit characters: ASCII or Latin-1. Usually OK for English content XML, HTML, except with non-ASCII symbols.
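
To see why the byte-to-char cast breaks on non-ASCII UTF-8, here is a sketch that mimics the cast on an in-memory byte array (the class name ByteCastDemo is made up; this is not the book's program).

```java
import java.nio.charset.StandardCharsets;

class ByteCastDemo {
    // Mimics the read loop: each byte becomes a char on its own.
    static String naive(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            // in.read() returns 0..255, so mask off the sign of the Java byte
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String wrong = naive(utf8);                               // 5 chars
        String right = new String(utf8, StandardCharsets.UTF_8);  // 4 chars
        System.out.println(wrong.length() + " vs " + right.length());
    }
}
```

The two bytes of the e-acute (C3 A9) come out as two separate Latin-1 characters instead of one, which is the familiar garbled-accent effect.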

We’ll get back to this topic when we study parsing. Already mentioned the input bridge class that can translate UTF-8 to Unicode for us.

More on Java Unicode, not covered in detail in class and just for your information:

Java Strings and characters are Unicode: but Unicode itself has grown in size!

Up through Java 1.4: supports older Unicode, where all characters fit in 16-bit codes, so each character fits in one Java char (16 bits).

Java 5, 6: support Unicode 4.0, which has some “supplementary” characters that take two 16-bit chars for one character, or “code point”. Altogether, this expansion adds 16 more planes of 65,536 code points each to the original one, so the code space is 17x its earlier size.

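For example, the double-struck small z (a z with an extra bar) is “U+1D56B”, encoded (UTF-16) by D835 DD6B in two 16-bit chars, where D835 and DD6B are both in special areas of the 16-bit space. (The double-struck capital Z used for the set of integers is U+2124, which still fits in one char.)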

Does this affect what we know about UTF-8? Not really, just more multi-byte characters around.

So char and “one character” are no longer the same in Java! So try to avoid chars, just use Strings, if doing internationalized work.

One character of String s:

int cp = s.codePointAt(i);

if (Character.isSupplementaryCodePoint(cp)) … if you need to know it’s a supplementary char (unusual)
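
A short sketch of char count vs. character count, using the U+1D56B example from above. The JDK method for the supplementary test is Character.isSupplementaryCodePoint; the class name CodePointDemo is made up.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDemo {
    // U+1D56B is outside the 16-bit range, so Java stores it as two chars.
    static final String Z = new String(Character.toChars(0x1D56B));

    // Walks a String one character (code point) at a time, not one char at a time.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 1 for ordinary chars, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a" + Z;
        System.out.println(s.length());            // number of 16-bit chars
        System.out.println(codePoints(s).size());  // number of characters
        System.out.println(Character.isSupplementaryCodePoint(Z.codePointAt(0)));
        System.out.printf("%04X %04X%n", (int) Z.charAt(0), (int) Z.charAt(1));
    }
}
```

This prints 3, 2, true, and D835 DD6B: three chars, but only two characters.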

End of aside

How can we read UTF-8, with its multi-byte chars, possibly leading to 2 chars each in Java?

The hard low-level way: use the above primitive method to read the XML prolog, find the actual encoding, then use the Java “bridge class” that knows how to read bytes, turn them into the right chars: InputStreamReader. It has a constructor that takes a Charset arg (for “UTF-8” or whatever), and then uses that to guide its input.
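
A rough sketch of the hard way, on an in-memory byte array. It assumes the prolog itself is ASCII-compatible, so it does not handle a byte-order mark or UTF-16 input, and the regular expression is a simplification of the real prolog grammar. The class name PrologSniffer is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrologSniffer {
    private static final Pattern ENCODING = Pattern.compile(
        "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // Step 1: look at the first bytes one byte per char (Latin-1) and
    // search the prolog for an encoding declaration.
    static Charset sniff(byte[] data) {
        int n = Math.min(data.length, 200);
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        // No declaration: fall back to UTF-8, the XML default.
        // Charset.forName throws if the JVM does not know the name.
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    // Step 2: decode the whole document with the bridge class.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), sniff(data))) {
            char[] buf = new char[1024];
            int k;
            while ((k = r.read(buf)) != -1) {
                sb.append(buf, 0, k);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><n>\u00e9</n>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(latin1) + ": " + decode(latin1).length() + " chars");
    }
}
```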

The easy way: Luckily the XML parsers know how to do this for us. XPath uses an XML parser, another easy way to read docs for specific information.
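
The easy way, sketched with the JDK's XPath support: hand the parser the raw bytes and let it read the prolog and pick the decoder. The document and the path /quote/symbol are made-up examples.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathRead {
    // Parses the bytes as XML and returns the string value at the given XPath.
    // The parser works out the character encoding from the document itself.
    static String valueAt(byte[] xmlBytes, String path) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(path, new InputSource(new ByteArrayInputStream(xmlBytes)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A Latin-1 document: the e-acute is the single byte E9 here.
        byte[] doc = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
            + "<quote><symbol>caf\u00e9</symbol></quote>")
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(valueAt(doc, "/quote/symbol").length() + " chars");
    }
}
```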

RSS – Really Simple Syndication: skipping this section: will look at Atom, the newer protocol, later

For “news feed”, etc.

An XML application, and can be displayed decently by some newer browsers. Older ones just show it as XML.

Ex 2.5 – ignore this example. This format was an attempt to use RDF, didn’t catch on.

End of skipped section

Requests using Query Strings

Recall that a query string is just after the ? in the URL. In hw1 we saw that we can request google searches this way.

The info in the query string is delivered to the application that serves this URL.

Suppose we are serving out stock quotes this way:

oursite.com/ourservlet?mode=stock&symbol=IBM

query string provided to servlet

As we will see in detail later, the servlet support parses the query string & puts the name = value info into the “request” object as its “parameters”. The request object is available to the Java code of the servlet. Suppose its name is req.

String symbol = req.getParameter("symbol");

// Servlet does lookup of data, sends back XML

out.write("<quote>100</quote>"); // OutputStreamWriter, outputting UTF-8
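
The servlet API is not part of the JDK, so as a stand-in here is a sketch of the parsing step that the servlet support does for us: splitting the query string into name = value parameters. QueryString.parse is a hypothetical helper, and it keeps only the last value if a name repeats.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryString {
    // Splits "mode=stock&symbol=IBM" into a map of decoded names and values.
    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // URLDecoder undoes %XX escapes and turns '+' back into a space (Java 10+ overload)
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse("mode=stock&symbol=IBM");
        System.out.println(params.get("symbol"));
    }
}
```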

As Harold points out, doing requests by query strings is limited:

- better to send in XML to describe what we want, and get back XML (the SOAP way)

- or express resources in URIs, the REST way

Example from pg 81 How POST Works

At pg 82 you see a query string in the content body. It does not appear in the URL window of the browser that the user sees.

Here we see content type application/x-www-form-urlencoded (the format for query strings). See pp. 77-78, but this is not essential, since it is not needed if there is XML in the body, and our tools handle decoding.

POST with XML body ex, pg 98-99

We see content type: text/xml (this is in HTTP header of POST)
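
What such a POST looks like on the wire can be sketched by building the request text by hand. The host, path, and body below are made-up values, and adding charset=utf-8 to the content type is an assumption of this sketch, not something taken from the book's example.

```java
import java.nio.charset.StandardCharsets;

class PostRequest {
    // Builds the text of an HTTP POST that carries an XML body.
    static String build(String host, String path, String xmlBody) {
        // Content-Length counts bytes, not chars, so measure the UTF-8 form.
        int length = xmlBody.getBytes(StandardCharsets.UTF_8).length;
        return "POST " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Content-Type: text/xml; charset=utf-8\r\n"
             + "Content-Length: " + length + "\r\n"
             + "\r\n"            // the blank line that ends the headers
             + xmlBody;
    }

    public static void main(String[] args) {
        System.out.println(build("oursite.com", "/ourservlet",
            "<getQuote><symbol>IBM</symbol></getQuote>"));
    }
}
```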

XML-RPC

XML-RPC is like a prototype for Web RPC. It has good basic ideas, and has XML both ways.

See ex at pg 85-85, DTD and XSD at pp 90-95 (another useful example of XML much like pa1’s)

It handles faults, and it does have the commonly used data types, but it runs out quickly: only 6 fixed data types, with text in ASCII! It should allow UTF-8 for internationalization.

Quick Intro to SOAP

SOAP is the descendant of XML-RPC, and it does allow internationalization. It is better XML, in UTF-8 or …, and it allows any XML in the app part. We need to use namespaces with it, because we need the SOAP vocabulary for the “envelope” and an app-specific vocabulary for the body, enclosed within the envelope.

See ex on pg 97, also an example in REST book, pg. 378, Example 11-2

SOAP envelope: has names Envelope and Body here, in the SOAP Envelope “namespace”, id’d by URI http://schemas.xmlsoap.org/soap/envelope/, which has been given the prefix SOAP-ENV by the xmlns:SOAP-ENV=”…” construct, a special XML attribute for namespaces.

The body of the SOAP message is application-specific XML, with its own namespace id’d by a URI involving the author’s website. Here we see xmlns="…", which says that this namespace is the “default namespace”, so no prefix needs to be (or is allowed to be) used for its names “getQuote” and “symbol”. See pp. 28-32 for an intro to namespaces. We’ll cover this again.
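
A sketch of how the two namespaces show up to a namespace-aware parser. The envelope URI is the SOAP one given above; the application namespace http://example.com/quotes and the message itself are made up for illustration, not copied from the book.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SoapNamespaces {
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String APP_NS = "http://example.com/quotes"; // hypothetical app namespace

    static final String MESSAGE =
        "<SOAP-ENV:Envelope xmlns:SOAP-ENV='" + SOAP_NS + "'>"
      + "<SOAP-ENV:Body>"
      + "<getQuote xmlns='" + APP_NS + "'><symbol>IBM</symbol></getQuote>"
      + "</SOAP-ENV:Body>"
      + "</SOAP-ENV:Envelope>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in the JDK
            return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Namespace URI of the first element with the given local name, or null if none.
    static String namespaceOf(Document doc, String localName) {
        Element e = (Element) doc.getElementsByTagNameNS("*", localName).item(0);
        return e == null ? null : e.getNamespaceURI();
    }

    public static void main(String[] args) {
        Document doc = parse(MESSAGE);
        System.out.println("Envelope -> " + namespaceOf(doc, "Envelope"));
        System.out.println("symbol   -> " + namespaceOf(doc, "symbol"));
    }
}
```

Note that symbol has no prefix and no xmlns of its own, yet it lands in the application namespace: the default namespace declared on getQuote applies to its descendants.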

In current SOAP web services, the application messages contained within the SOAP envelopes are usually described by XML Schemas. So be sure to skip everything with SOAP-ENC (SOAP encoding)—this was an earlier attempt to describe message structure within the message itself.

Note that SOAP does not have to use HTTP for transport. For example, you could send a SOAP request by email (SMTP rather than HTTP) and get a response by email. The SOAP request is self-contained, i.e., it doesn’t rely on HTTP headers or different HTTP verbs; when used over HTTP it just uses POST.

SOAP over HTTP

· uses POST only

· ignores headers except possibly Content-length

· typically uses a single URL “service endpoint” at a particular server, no query string

SOAP web services took over (2000-2005 say) from older technologies that were much less open and portable, a big step forward.

Quick Intro to REST

The idea of REST is to use HTTP directly, rather than reducing it to a carrier of SOAP messages. With REST, we use multiple HTTP verbs:

· GET for reading data (no changes!)

· POST for creating new data items

· PUT for updating old data items

· DELETE for deleting old data items
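
A toy sketch of the verb-to-operation mapping, with an in-memory map standing in for the data store. There is no real HTTP here; the class RestVerbs, the /items/ path, and the choice of status codes returned are assumptions of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

class RestVerbs {
    private final Map<Integer, String> items = new TreeMap<>();
    private int nextId = 1;

    // Dispatches on the HTTP verb and returns a status-like string.
    // The id argument is ignored for POST, since the server picks the new id.
    String handle(String verb, int id, String body) {
        switch (verb) {
            case "GET":     // read only, changes nothing
                return items.containsKey(id) ? "200 " + items.get(id) : "404";
            case "POST":    // create a new item
                items.put(nextId, body);
                return "201 /items/" + nextId++;
            case "PUT":     // update an existing item
                if (!items.containsKey(id)) {
                    return "404";
                }
                items.put(id, body);
                return "200";
            case "DELETE":  // delete an existing item
                return items.remove(id) != null ? "204" : "404";
            default:
                return "405";
        }
    }

    public static void main(String[] args) {
        RestVerbs service = new RestVerbs();
        System.out.println(service.handle("POST", 0, "<quote>100</quote>"));
        System.out.println(service.handle("GET", 1, null));
        System.out.println(service.handle("DELETE", 1, null));
        System.out.println(service.handle("GET", 1, null));
    }
}
```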