CS639 Class 6

The weather.xml file of pg. 754 is encoded in ISO-8859-1. It could equally have been in UTF-8, the default encoding for XML. Note that it has a non-ASCII character, the degree sign, so it can’t be in plain ASCII.

CharSet: ISO-8859-1 (also known as Latin-1) is an 8-bit code that can encode western European languages, and was historically the default encoding for HTML. UTF-8, a variable-length byte encoding of Unicode, is the default encoding for XML.

We’ll look at details of encoding right after XPath.

How TestXPath works.

This code parses the XML with knowledge of the XPath expression and picks out the matching data. It builds a “DOM tree” of nodes in memory and returns info on the resulting nodes to the program, and the program prints out the info that we were looking for.

In particular, we see in TestXPath.java that the nodes come back in a NodeList object:

See JDK documentation for the NodeList. In TestXPath, we see a NodeList named nodes. Then

nodes.item(i) = individual Node.

For a single Node node:

node.getNodeType() – it will tell you if it is an ELEMENT_NODE, …

node.getNodeName() and so on.
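The NodeList calls above can be sketched as follows. This is a minimal illustration, not the TestXPath.java source: the class name NodeListSketch, the inline XML string, and the XPath expression are all made up for the example, and the real weather.xml has a different structure.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListSketch {
    // Hypothetical stand-in document; not the real weather.xml.
    static final String XML =
        "<weather><city name=\"Boston\"><temp>54</temp></city>"
        + "<city name=\"Miami\"><temp>81</temp></city></weather>";

    // Evaluates an XPath expression and describes each matching node.
    static String describe(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i); // one individual Node
                String kind = node.getNodeType() == Node.ELEMENT_NODE ? "element" : "other";
                sb.append(kind).append(' ').append(node.getNodeName())
                  .append(" = ").append(node.getTextContent()).append('\n');
            }
            return sb.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Prints one line per matching <temp> element.
        System.out.print(describe(XML, "/weather/city/temp"));
    }
}
```

The program never walks the document itself; it only names what it wants in the XPath expression and loops over the nodes that come back.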

This way of using XPath is a lot easier than having the program actively involved in the XML parsing. This is the best way to do simple information retrieval from XML that has no schema, and is competitive for XML that does have a schema.

Character Encodings

Most important encodings:

· ASCII 128 chars in 7 bits, usually held in one byte (8 bits), with high bit off

· Latin-1 or ISO-8859-1 256 chars in 8 bits, 128 codes as in ASCII, plus extra codes handling French, German, etc. marks, and other chars

· UTF-8/UTF-16/Unicode: code points of up to 21 bits uncompressed, with the first 128 codes as in ASCII

Let’s look at the non-ASCII character that showed up in weather.xml, the degree sign: To convert weather.xml to UTF-8, we would need to replace all the degree signs in it from their Latin-1 code (1 byte) to their UTF-8 coding (two bytes).

Degree sign ° in LATIN-1 = 0xb0 = binary 1011 0000 (high bit on in byte)

Note that 0x00b0 > 0x007f

Code 0x7f (binary 0111 1111) is the highest ASCII code.

The Unicode code point for this char is U+00B0 (0x00b0), with more binary zeroes on the left if you write it out as a full 21-bit code.

UTF-16, used by Java: code = 0x00b0

Recall that the UTF-8 representation of an ASCII char is just the single 8-bit ASCII code (with high bit off).

An important fact about UTF-8: High bits of bytes are flags for non-ASCII chars and their multi-byte codes

In UTF-8, the only bytes with high bits off are the ones representing ASCII codes. All other chars have at least 2 bytes in their UTF-8 representation, each with the high bit on.

What is the UTF-8 representation of this ° char? It can’t be 00 b0, because the high bit is off in the first byte. It is two bytes long, and both bytes have high bit 1 to mark it a non-ASCII char. We don’t care what the exact bit pattern is, since we can always use tools to create these codes. In fact the UTF-8 encoding is the sequence of two bytes 0xc2 0xb0.
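One way to check these byte values is to let Java do the encoding. The sketch below (the class name DegreeSignBytes and the hex helper are made up for the example) encodes the degree sign in three charsets and prints the bytes in hex.

```java
import java.nio.charset.StandardCharsets;

class DegreeSignBytes {
    // Formats bytes as two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String degree = "\u00b0"; // the degree sign as a Java string
        System.out.println("Latin-1: " + hex(degree.getBytes(StandardCharsets.ISO_8859_1))); // b0
        System.out.println("UTF-8:   " + hex(degree.getBytes(StandardCharsets.UTF_8)));      // c2 b0
        System.out.println("UTF-16:  " + hex(degree.getBytes(StandardCharsets.UTF_16BE)));   // 00 b0
    }
}
```

Both UTF-8 bytes, 0xc2 (1100 0010) and 0xb0 (1011 0000), have the high bit on, as the rule above requires.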

For more information on how to encode any Unicode char, see Character Encodings (1 page).

How to convert Latin-1 to UTF-8

On Linux: iconv --from-code=ISO-8859-1 --to-code=UTF-8 weather.xml>weather1.xml

However, after this the first line still says encoding=ISO-8859-1. You can use ed or vi on Linux to change this.

The result is in $cs639/xpath weather2.xml
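The same two steps (re-encode the bytes, then fix the encoding= value) could also be done in Java. The following is only a sketch under assumptions: the class name Latin1ToUtf8 is made up, the document is small enough to hold in memory, and the prolog spells the encoding exactly as encoding="ISO-8859-1".

```java
import java.nio.charset.StandardCharsets;

class Latin1ToUtf8 {
    // Decodes Latin-1 bytes to a String, fixes the prolog, re-encodes as UTF-8.
    static byte[] convert(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        // Mirrors the manual edit of the first line; assumes this exact spelling.
        text = text.replaceFirst("encoding=\"ISO-8859-1\"", "encoding=\"UTF-8\"");
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical one-line document standing in for weather.xml.
        byte[] in = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>54\u00b0</t>"
            .getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = convert(in);
        System.out.println(in.length + " bytes in, " + out.length + " bytes out");
    }
}
```

For real files, the byte arrays could come from and go to java.nio.file.Files.readAllBytes and Files.write.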

Our Linux hosts, users.cs.umb.edu, etc., have a UTF-8 environment: use the “locale” command to see this.

Displaying UTF-8 text from users on your screen, if logged in from home:

If you are using putty, you can change its settings to display UTF-8 properly:

Change Settings>Window>Translation> Received data assumed to be in which character set: UTF-8

With this setting, the original weather.xml (in Latin-1) is displayed with block chars where degree signs should be.

Creating new UTF-8 text: Word may be the best choice, with its Insert>Symbol pop-up display of Unicode characters to choose from. Eclipse doesn’t seem to have a pop-up to show possible characters to insert, but it properly displayed both versions of the file, that is, the degree sign looks good for both weather.xml and weather2.xml. It must be looking at the first line and seeing the encoding= value.

Windows has a Character Map accessory, where you can select chars by their picture, copy and paste their UTF-8 values into UTF-8 XML.

On Windows: Word (Word 2007) can convert Latin-1 XML to UTF-8 (or another encoding) if you are very careful. Read weather.xml as given in $cs639/xpath or from Harold’s online book, then use “Save as..”, choose Web Options> Encoding> UTF-8, and choose “Word2003 XML format”. The resulting file will have a proper <?xml...?> line and converted character codes, but poor formatting (mostly all on one line). I used eclipse to reformat it, giving the final version of the file as UTF-8 XML. As a Windows text file, it will have CR-LF for end-of-lines, but that’s OK.

I couldn’t get eclipse to convert XML from Latin-1 to UTF-8.

We’ll see that with HTTP, we can deliver text along with attributes such as charset, so the end user doesn’t have to know the details and set things up.

Final note on this: It is important to realize that some characters have multi-byte UTF-8 representations. We can’t just read a byte of UTF-8 and expect it to be a full character. Every multi-byte representation has the high bit on in all of its bytes. If the high bit is off in a byte, that’s a single-byte UTF-8 representation of an ASCII char.
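The high-bit rule can be turned into a small check. The sketch below uses made-up names (Utf8HighBits, isAscii, countCodePoints) and relies on one detail beyond these notes: in well-formed UTF-8, every byte after the first byte of a multi-byte sequence has the form 10xxxxxx, so counting the other bytes counts characters.

```java
import java.nio.charset.StandardCharsets;

class Utf8HighBits {
    // True if every byte has the high bit off, i.e. the data is pure ASCII.
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) != 0) return false;
        }
        return true;
    }

    // Counts characters (code points) in well-formed UTF-8 by skipping
    // continuation bytes, which always look like 10xxxxxx.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) count++; // ASCII byte or first byte of a sequence
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "54\u00b0F".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + data.length);                // 5
        System.out.println("characters: " + countCodePoints(data)); // 4
        System.out.println("pure ASCII: " + isAscii(data));         // false
    }
}
```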

Now we are done with Chapter 1, plus intro to XPath. Next: Chap. 2

Chapter 2 - XML Protocols (includes SOAP, used for WebServices)

Also, read Chap 1 of REST in Practice

There are various “vocabularies” in use for interchanging data. A vocabulary is just a set of names in use, mainly names for XML elements. You can speak of a vocabulary for schemaless XML, but you have a better idea of the vocabulary if you have a schema (DTD or XML Schema).

Table 2.1. XML Schema Data Types – XML itself does not have them. It just has the idea of text, does not even have numbers.

XML Schema does have data types, so it is useful for this. We can say that quantity should be a non-negative integer, for example, using the nonNegativeInteger type.

Ex 2.1. shows how you can decorate plain XML with types, an unusual practice in my experience

- wide use will make doc large.

- Big docs need XML Schema.

More normal to use an XML Schema to declare the types that go with the data in the various XML documents.

Skipping forward to pg. 132 for a topic for pa1b…

Outputting XML from Java (relevant to pa1b) –not covered in class

Simple Example: pg 125

System.out.println(" <fibonacci> "); // here we have only ASCII chars

System.out.println(low); // number in ASCII digits

For ASCII chars, ASCII code = UTF-8 code, so this is producing good UTF-8.

On output (result of println above), newLine from Java has 1 LF on UNIX and a CR and LF on Windows. But, for XML it does not matter, so this simple code works fine.

Writing out text with multi-byte UTF-8, i.e., non-ASCII chars.

Java uses Unicode Strings (UTF-16) internally, and UTF-8 is just compressed Unicode, so there is a well-defined transformation.

Pg 133: Shows how to set this up, to get Java to do the Unicode -> UTF-8 translation for us, instead of its default translation to whatever the local default encoding is. But we want to use “UTF-8”, not “8859_1”, which is Latin-1.

The first step is to make an Output Stream:

OutputStream fout = new FileOutputStream("foo.xml");

// Byte stream handler (low level I/O).

BufferedOutputStream bout = new BufferedOutputStream(fout);

// byte stream, buffered (more efficient).

OutputStreamWriter out = new OutputStreamWriter(bout, "UTF-8");

// knows about Unicode, and how to turn it into other encodings

If we write something to out, we can write a String (which is Unicode) and it gets transformed to UTF-8 for us (in bytes)

OutputStreamWriter is a “bridge” from Unicode to bytes. It has methods write, flush, close, and a few others. It gives no help in outputting numbers as text. Note that you don’t need to flush just before close (as done on pg. 134), because close flushes the stream.

We could do one more step (not shown in book) and get the use of println, etc:

PrintWriter out1 = new PrintWriter(out);

out1.println("<Fibonacci>");…
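Putting the three layers and the PrintWriter together gives something like the sketch below. It is not the book's program: the class name Utf8XmlWriter and the little temp document are made up, and it writes to a ByteArrayOutputStream rather than foo.xml so that it leaves no file behind (a FileOutputStream could be passed in instead).

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

class Utf8XmlWriter {
    // Writes a tiny XML document to the given byte stream as UTF-8.
    static void writeXml(OutputStream sink) {
        BufferedOutputStream bout = new BufferedOutputStream(sink);                  // buffered bytes
        OutputStreamWriter out = new OutputStreamWriter(bout, StandardCharsets.UTF_8); // the bridge
        PrintWriter out1 = new PrintWriter(out);                                      // adds println
        out1.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out1.println("<temp>54\u00b0</temp>"); // the degree sign becomes two bytes
        out1.close(); // close flushes every layer underneath
    }

    public static void main(String[] args) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeXml(bytes);
        System.out.println("wrote " + bytes.size() + " bytes");
    }
}
```

Note that PrintWriter does not throw IOException; a program that cares about write failures would call out1.checkError().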

Note there is a bridge class for input, InputStreamReader, so we can read UTF-8 input and get Java to turn it into Unicode for us.

Why “bridge” here?

Two kinds of file support: byte-level and Unicode-char-based, and bridge converts between.

Ref: Core Java, Vol. 1 of 7th ed., or Vol. 2 of later editions, chapter on Streams and Files. See the four class hierarchies, two for byte streams (input and output), two for Unicode character handling (input and output), with about ten classes each: what a zoo!

In actual files, or coming in from network, have stream of bytes. Inside Java, have Unicode strings.

In from file/network: bytes -> Unicode via bridge (InputStreamReader)

Out to file/network: Unicode -> bytes via bridge (OutputStreamWriter)
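A round trip through both bridges can be sketched like this: OutputStreamWriter on the way out (Unicode to bytes) and InputStreamReader on the way in (bytes to Unicode). The class and method names are made up, and in-memory byte arrays stand in for a file or network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class BridgeRoundTrip {
    // Out: Unicode String -> UTF-8 bytes via OutputStreamWriter.
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // In: UTF-8 bytes -> Unicode String via InputStreamReader.
    static String decode(byte[] utf8) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c); // read() returns one 16-bit char at a time
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "54\u00b0";
        byte[] bytes = encode(original);
        System.out.println(original.length() + " chars -> " + bytes.length + " bytes");
        System.out.println("round trip ok: " + original.equals(decode(bytes)));
    }
}
```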

This is just a slice of the Java file support jungle.

Moving XML from point A to point B.

HTTP is an obvious way to move XML, for reasons on pg. 65.

Look at Figure 2.1: This URL still works, but the first line no longer has ISO-8859-1, so they have modernized, using default UTF-8.

Still have odd construct of namespace name with URI of DTD—there is a DTD at that URL. No DOCTYPE for normal linkage to the DTD. Luckily our best tools don’t require explicit linkage to use schemas (DTD or XML Schema).

HTTP GET: the most important HTTP verb (others are POST, PUT, DELETE, and a few less common ones)

Look at GET on pg. 66, header, note blank line at end

Response, etc.

HTTP in Java

Pg. 69: the URLGrabber program, which takes a URL:

URL url = new URL(urlString);

InputStream in = url.openStream();

Provides access to the contents of the document, nice and easy.

URL is a JDK class. Its openStream() method opens a TCP/IP stream connection to the server given in the URL, does an HTTP GET, and is ready to accept reads on this connection to get the data in the contents.

But note that an InputStream is a stream of bytes, not characters, and we know that UTF-8 data has multi-byte characters, so this is not as useful as it seems, unless we know that the data is all ASCII, Latin-1, or something binary.

url.openStream() can also be done in two steps:

url.openConnection().getInputStream()

------------------------

URLConnection also provides access to header info:

getContentType()

getContentEncoding()

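So the HTTP response may say the character encoding is UTF-8 or Latin-1, etc. It does this with a charset parameter in the Content-Type header, for example text/xml; charset=ISO-8859-1, which getContentType() returns. Note that getContentEncoding() returns the Content-Encoding header, which describes compression such as gzip, not the character set. Book says it’s better to go by the XML prolog in the document itself, in the contents of the response.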
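When the server does state a character set, it appears as a charset parameter inside the Content-Type value, which is what getContentType() returns. The helper below is a hypothetical sketch for pulling that parameter out of such a string; charsetOf is not a JDK method, and the sample header values are made up.

```java
import java.util.Locale;

class ContentTypeCharset {
    // Returns the charset parameter of a Content-Type value, or null if absent.
    // Hypothetical helper; a real program would pass in
    // url.openConnection().getContentType().
    static String charsetOf(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The value may optionally be quoted.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=UTF-8")); // UTF-8
        System.out.println(charsetOf("text/html"));               // null
    }
}
```

A null result means the header gave no charset, in which case the XML prolog is the only source of that information anyway.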

c = in.read() brings in a byte, which gets cast to char when it’s put in the StringBuffer.

Note: URLGrabber is only good for known 8-bit characters: ASCII or Latin-1. Usually OK for English content XML, HTML, except with non-ASCII symbols.
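To see why the byte-to-char cast only works for 8-bit encodings, the sketch below feeds UTF-8 bytes through the same kind of read loop, and then through an InputStreamReader for comparison. The names are made up, and a ByteArrayInputStream stands in for the stream that url.openStream() would return.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteCastPitfall {
    // Mimics the loop described above: each byte is cast straight to a char.
    static String readBytesAsChars(InputStream in) {
        StringBuffer sb = new StringBuffer();
        try {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Decodes the same bytes properly, using the input bridge class.
    static String readAsUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "54\u00b0".getBytes(StandardCharsets.UTF_8); // 35 34 c2 b0
        String naive = readBytesAsChars(new ByteArrayInputStream(utf8));
        String proper = readAsUtf8(new ByteArrayInputStream(utf8));
        System.out.println("naive length:  " + naive.length());  // 4, with a stray U+00C2
        System.out.println("proper length: " + proper.length()); // 3
    }
}
```

The cast treats every byte as a Latin-1 code, so the two-byte degree sign turns into two separate characters.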

We’ll get back to this topic when we study parsing. Already mentioned the input bridge class that can translate UTF-8 to Unicode for us.

More on Java Unicode, not covered in detail in class and just for your information:

Java Strings and characters are Unicode: but Unicode itself has grown in size!

Up through Java 1.4: supports older Unicode, where all characters fit in 16-bit codes, so each character fits in one Java char (16 bits).

Java 5, 6: supports Unicode 4.0, and has some “supplementary” characters that take two 16-bit chars for one character, or “code point”. Altogether, this expansion covers about 17 times as many possible code points as before (17 planes of 65,536 codes instead of one).

For example, the double-struck small z (a lowercase relative of the Z with an extra bar used for the set of integers) is “U+1D56B”, encoded (UTF-16) by D835 DD6B in two 16-bit chars, where D835 and DD6B are both in special areas of the 16-bit space.
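The two 16-bit values can be checked with Character.toChars, which returns the UTF-16 form of a code point. A small sketch, with the class name and the utf16Hex helper made up for the example:

```java
class SurrogatePairDemo {
    // Returns the UTF-16 encoding of a code point as hex, one group per Java char.
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex(0x1D56B)); // D835 DD6B (two chars)
        System.out.println(utf16Hex(0x00B0));  // 00B0 (one char)
        // Both halves of the pair fall in the reserved surrogate ranges.
        char[] pair = Character.toChars(0x1D56B);
        System.out.println(Character.isHighSurrogate(pair[0]) && Character.isLowSurrogate(pair[1])); // true
    }
}
```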

Does this affect what we know about UTF-8? Not really, just more multi-byte characters around (the supplementary ones take four bytes each).

So char and “one character” are no longer the same in Java! So try to avoid chars, just use Strings, if doing internationalized work.

One character of String s:

int cp = s.codePointAt(i);

if (Character.isSupplementaryCodePoint(cp)) … if you need to know it’s a supplementary char (unusual)
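A loop over a whole String by code point might look like the sketch below. The JDK method for the supplementary test is Character.isSupplementaryCodePoint; the class name and the sample string are made up for the example.

```java
class CodePointWalk {
    // Counts the supplementary characters in s, stepping one code point at a time.
    static int countSupplementary(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) n++;
            i += Character.charCount(cp); // 1 for most characters, 2 for supplementary ones
        }
        return n;
    }

    public static void main(String[] args) {
        // 'z', then U+1D56B as a surrogate pair, then the degree sign.
        String s = "z\uD835\uDD6B\u00b0";
        System.out.println("chars: " + s.length());                            // 4
        System.out.println("code points: " + s.codePointCount(0, s.length())); // 3
        System.out.println("supplementary: " + countSupplementary(s));         // 1
    }
}
```

Stepping by Character.charCount(cp) rather than by 1 is what keeps the loop from landing in the middle of a pair.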

End of aside

How can we read UTF-8, with its multi-byte chars, possibly leading to 2 chars each in Java?

The hard low-level way: use the above primitive method to read the XML prolog, find the actual encoding, then use the Java “bridge class” that knows how to read bytes, turn them into the right chars: InputStreamReader. It has a constructor that takes a Charset arg (for “UTF-8” or whatever), and then uses that to guide its input.
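The hard way might be sketched roughly as follows. This is an illustration under stated assumptions, not a complete detector: the names PrologSniffer, declaredCharset and decode are made up, the whole document is assumed to be in memory, the prolog is assumed to be ASCII-compatible (so UTF-16 input and byte-order marks are not handled), and UTF-8 is assumed when no encoding is declared.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrologSniffer {
    private static final Pattern ENCODING =
        Pattern.compile("encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // Reads the start of the document byte by byte (as Latin-1, which is safe
    // for an ASCII-only prolog) and returns the declared encoding, else UTF-8.
    static Charset declaredCharset(byte[] xml) {
        int n = Math.min(xml.length, 200);
        String head = new String(xml, 0, n, StandardCharsets.ISO_8859_1);
        int end = head.indexOf("?>");
        if (head.startsWith("<?xml") && end > 0) {
            Matcher m = ENCODING.matcher(head.substring(0, end));
            if (m.find()) return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8; // the XML default
    }

    // Decodes the whole document with the bridge class, guided by the prolog.
    static String decode(byte[] xml) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(xml), declaredCharset(xml))) {
            char[] buf = new char[1024];
            int k;
            while ((k = r.read(buf)) != -1) sb.append(buf, 0, k);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>54\u00b0</t>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredCharset(latin1));                 // ISO-8859-1
        System.out.println(decode(latin1).endsWith("54\u00b0</t>")); // true
    }
}
```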

The easy way: Luckily the XML parsers know how to do this for us. XPath uses an XML parser, another easy way to read docs for specific information.
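As a small illustration of the easy way, the sketch below hands raw bytes to the JDK's DOM parser and lets it read the prolog itself. The class name and the one-element document are made up; the same text is parsed once as Latin-1 bytes and once as UTF-8 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class ParserHandlesEncoding {
    // Parses raw bytes; the parser works out the encoding from the prolog.
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>54\u00b0</t>"
            .getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><t>54\u00b0</t>"
            .getBytes(StandardCharsets.UTF_8);
        // Different bytes on input, the same Unicode string in Java.
        System.out.println(rootText(latin1).equals(rootText(utf8))); // true
    }
}
```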

RSS – Really Simple Syndication: skipping this section: will look at Atom, the newer protocol, later

For “news feed”, etc.

An XML application, and can be displayed decently by some newer browsers. Older ones just show it as XML.

Ex 2.5 – ignore this example. This format was an attempt to use RDF, didn’t catch on.

End of skipped section