CS636 Class 24 The shared cache, internationalization

Using the shared cache

Shared cache: owned by the EMF (the EntityManagerFactory)
Mutable objects: if there is only one application server and the JPA app is the only app accessing the database, it is OK to cache mutable objects, because the JPA runtime updates the shared cache on committed changes to entities.

Then the cache represents the database state, to the extent it knows it.  It is smart enough to make a private copy of a domain object and put it in the em when a transaction updates an object obtained from the shared cache.
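In JPA 2.0 you can say which entity classes may go in the shared cache. A minimal sketch, assuming <shared-cache-mode>ENABLE_SELECTIVE</shared-cache-mode> in persistence.xml (the Topping entity is hypothetical):

       // With ENABLE_SELECTIVE shared-cache mode in persistence.xml,
       // only entities marked @Cacheable(true) go in the shared cache.
       import javax.persistence.Cacheable;
       import javax.persistence.Entity;
       import javax.persistence.Id;

       @Entity
       @Cacheable(true)    // hypothetical immutable reference data
       public class Topping {
           @Id
           private int id;
           private String name;
           // getters but no setters: the app treats these rows as immutable
           public int getId() { return id; }
           public String getName() { return name; }
       }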

Can’t cache mutable objects in general because there is no mechanism to notify the cache about changes in the database.

However, when the system gets busy enough to worry about caching, the usual first step is to add more app servers while keeping a single database server (and then the trick of letting the shared cache handle mutable objects stops working).  The Java of the app is a heavier load than the database actions it causes, so it is common to have a dozen or more app servers all running against one database.

In some cases, it’s better not to cache immutable objects – if there are millions of them, the shared cache gets bigger & bigger → performance problem

Another approach: we can load the immutable objects before we start the Tx – still serializable.

We end up with a shorter Tx, shorter lock periods, and better performance.
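A sketch of this preloading, with hypothetical entity and query names:

       // Load immutable reference data before the transaction starts;
       // the query runs outside the Tx (auto-commit), so the Tx itself
       // is shorter and holds locks for less time.
       List<Topping> toppings = em.createQuery(
               "select t from Topping t", Topping.class).getResultList();

       em.getTransaction().begin();     // the now-shorter transaction
       // ... use the already-loaded toppings, update mutable entities ...
       em.getTransaction().commit();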

Distributed caches – for multiple app servers: use these only if the database gets overloaded and the database has no other apps. Then the distributed cache is holding the database state for the system (and also keeping the database up to date in case of crashes or reloads).

Example: JBoss Cache, but there are many others.

Internationalization

We want to be able to display web pages in Chinese, Arabic, Russian, etc., so we need to use Unicode for text. Java uses Unicode for String, so we’re OK there.

Look at http://kermitproject.org/utf8.html to see snippets of many languages, all encoded in UTF-8 and displayed by your browser. (I used Chrome in class.)

To fit all these languages’ characters in one coding, Unicode code points are up to 21 bits long.  Java uses 16-bit chars, so it sometimes needs two chars (a “surrogate pair”) to hold one Unicode character, but this is rare in practice.  Being a little sloppy, we say “Java uses Unicode for chars”, but it really uses UTF-16, a slightly compressed Unicode in which some obscure characters need two chars.
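A quick demonstration of the char-vs-character distinction, using the musical G clef U+1D11E, a real character above 0xFFFF:

       // U+1D11E is stored in Java as a surrogate pair: two chars, one character
       String clef = "\uD834\uDD1E";    // UTF-16 surrogate pair for U+1D11E
       System.out.println(clef.length());                          // prints 2
       System.out.println(clef.codePointCount(0, clef.length()));  // prints 1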

Note: HTML5 defaults to UTF-8 encoding, a big improvement over HTML4, which defaulted to the "Latin-1" (AKA ISO 8859-1) encoding. Latin-1 is another encoding in which ASCII bytes qualify as valid characters, but it uses the extra (eighth) bit differently, so it can only encode European (Latin-related) languages (accented chars, etc.).

Our HTML (in the various projects, hopefully all HTML5) is provided in UTF-8, a compressed Unicode encoding in which ASCII chars qualify as one-byte UTF-8 codes (this is possible because ASCII needs only 7 of the 8 bits in a byte, leaving one bit as a flag for non-ASCII).  Non-ASCII chars need multiple bytes in UTF-8.

How this works is pretty impressive.  Here is the compression table for Unicode values up to FFFF (see linked doc for values above FFFF)

From the standard http://www.faqs.org/rfcs/rfc3629.html

    Unicode value range         |        UTF-8 octet sequence
     hexadecimal range(bits)    |              (binary)
   -----------------------------|------------------------------------
   0000-007F (000000000xxxxxxx) | 0xxxxxxx
   0080-07FF (00000xxxxxxxxxxx) | 110xxxxx 10xxxxxx
   0800-FFFF (xxxxxxxxxxxxxxxx) | 1110xxxx 10xxxxxx 10xxxxxx

The first range above covers the pure ASCII characters, each 7 bits long. Those same 7 bits become the UTF-8 value, with a leading 0 bit. That's a nice feature: pure ASCII text qualifies as UTF-8 without any change at all (ASCII bytes always have a leading 0 bit).

The second range encodes non-ASCII codes that cover things like accented chars of European languages, special symbols, etc. These take two bytes of UTF-8 to hold their 11 significant bits.

The third range encodes non-ASCII codes of all the languages of the world, except some added too late to fit (these are in the extended range not shown above). Their 16 significant bits are encoded in 3 bytes of UTF-8.

This encoding preserves binary order: Note how the bitstrings on both sides of the table are in bitstring order (like char strings but using bits).

Also, each byte starts off with 2 bits that say whether this is a leading byte or a trailing byte. If a UTF-8 data stream is broken, the reader can figure out how to restart decoding by looking at these header bits and skipping trailing bytes until it reaches a leading byte, and then restart decoding characters.
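To make the table concrete, here is a sketch of the encoding rules in Java, for values up to FFFF only (a real encoder also handles the extended range):

       // Encode one Unicode value (<= 0xFFFF) as UTF-8 bytes, straight
       // from the three table rows above.
       static byte[] toUtf8(int c) {
           if (c <= 0x7F)                    // first rule: 0xxxxxxx
               return new byte[] { (byte) c };
           if (c <= 0x7FF)                   // second rule: 110xxxxx 10xxxxxx
               return new byte[] {
                   (byte) (0xC0 | (c >> 6)),          // top 5 bits
                   (byte) (0x80 | (c & 0x3F)) };      // low 6 bits
           return new byte[] {               // third rule: 1110xxxx 10xxxxxx 10xxxxxx
               (byte) (0xE0 | (c >> 12)),             // top 4 bits
               (byte) (0x80 | ((c >> 6) & 0x3F)),     // middle 6 bits
               (byte) (0x80 | (c & 0x3F)) };          // low 6 bits
       }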

Example Characters

For example, the euro sign has Unicode value 0x20ac (16 bits, so the third rule applies). Its UTF-8 encoding takes 3 bytes: e2 82 ac. Note that the high bit is on in each of these bytes, marking non-ASCII.  In Java, we can use ‘\u20ac’.

To compare, ASCII ‘A’ has code 0x41, and its UTF-8 representation is just the same one-byte code 0x41, with high bit off.

More examples, showing bits (the | marks split the Unicode bits into the groups that fill successive UTF-8 bytes):

a (0x0061) is coded by the first rule:  0061 = 000000000|1100001, UTF-8 code = 01100001 (8 bits)

© (0x00a9) is coded by the second rule: 00a9 = 00000|00010|101001, UTF-8 code = 11000010 10101001 (16 bits)

€ (0x20ac) is coded by the third rule:  20ac = 0010|000010|101100, UTF-8 code = 11100010 10000010 10101100 (24 bits)

™ (0x2122) is coded by the third rule:  2122 = 0010|000100|100010, UTF-8 code = 11100010 10000100 10100010 (24 bits)
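You can check these encodings from Java itself; a quick sketch:

       // Print the UTF-8 bytes of the euro sign: expect e2 82 ac
       for (byte b : "\u20ac".getBytes(java.nio.charset.StandardCharsets.UTF_8))
           System.out.printf("%02x ", b & 0xFF);   // mask: print byte as unsigned hex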

Using UTF-8 in our JSP pages (so they produce HTML in UTF-8 for sure)

We need to tell the browser that we are using UTF-8 to make our UTF-8 HTML actually work in all browsers (soon this may not be necessary as HTML5 takes over). This is done by putting the Content-Type response header in the response that carries the HTML document back to the browser.  See Murach, pg. 553 for response headers, including Content-Type. We tell JSP to do this as follows, as you can see in many JSPs in our projects. This works for HTML4 pages as well as HTML5.

<%@page contentType="text/html; charset=UTF-8"%>

This also has the effect of telling the “out” object in JSP (the servlet object that outputs the response text) to produce UTF-8-encoded output, an easy conversion for it from Java’s internal Unicode representation.
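In a plain servlet (no JSP), the equivalent is to set the content type on the response object before asking for its writer:

       // Servlet equivalent of the JSP page directive: set the content
       // type (and thus the writer's encoding) before calling getWriter()
       response.setContentType("text/html; charset=UTF-8");
       PrintWriter out = response.getWriter();   // now writes UTF-8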

Note: Normally Java outputs using the local char set of the command environment, often not UTF-8, even though it is using Unicode internally. To override this default behavior, you can specify the char encoding in the various output classes (also in input classes), for example

       // needs: import java.io.*;
       OutputStreamWriter out2 = new OutputStreamWriter(
                            new FileOutputStream("foo.txt"), "UTF-8");
       // wrap in a PrintWriter, so we can use println, etc.
       PrintWriter out = new PrintWriter(out2);

You can also set this response header from pure HTML (esp. useful if it's HTML4) with

       <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

inside the <head> element.  See the W3C doc. (HTML5 also allows the shorter <meta charset="UTF-8">.)

(I’m not sure what happens if a document has both the real response header and the meta tag and they are contradictory.)

To find the right character to use, the "charmap" program of Windows is very nice. Find it in Windows\System32\charmap.exe and make a shortcut for your desktop. We looked up the euro sign, etc. Try typing 20ac in the “Go to Unicode” box.

Also, see the UTF-8 test page at www.fileformat.info.

To put the euro sign in HTML, you can specify it by its Unicode value as in &#8364; (using the decimal value of 0x20ac) or &#x20ac; using the hex value.  At least in some environments, you can use the named entity &euro;, or you can put in the 3 bytes of actual UTF-8. Your editor may switch representations on you.

For easy editing, you need a Unicode-supporting editor.  MS Word, and post-1998 Windows in general, supports Unicode (though not in the command-line environment, which derives from ancient DOS days). Eclipse supports it, at least for HTML and JSP files that declare their char coding as above, although it doesn’t have a UI to show us what various codes look like (that I’ve found, anyway).  For text files in Eclipse, Project Properties > Resource allows setting the encoding.

There’s more work to do for internationalization: we need to translate the app’s results too, including error messages.

Note: we haven’t covered character sets in use in databases—we’re really using ASCII there. But we could use UTF-8.
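For example, with MySQL we could ask the JDBC driver to use UTF-8 on the connection; a sketch, with a hypothetical database name (the table columns must also use a UTF-8-capable character set):

       // Connector/J properties asking for UTF-8 on the wire
       String url = "jdbc:mysql://localhost:3306/pizzadb"
                  + "?useUnicode=true&characterEncoding=UTF-8";
       Connection conn = DriverManager.getConnection(url, user, password);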

Handling multiple natural languages in one app is even more challenging.  You need to be able to detect the user’s language (“locale”) and then switch to that language. Too complicated to cover here.