CS636 Class 24 More on Servlet URLs, caching, internationalization

Servlet URLs

Question on Piazza:

I found the URL pattern of AdminServlet in the web.xml, and "listVariables.jsp" and "logout.jsp" already exist in the project. Why am I unable to browse to "http://localhost:8080/music3/adminController/listVariables.jsp"?

Answer:

You can only browse to URLs implied by the url-mappings of an actual servlet (plus static web pages of course). So for pizza3, which has implemented servlets, we see in web.xml 

<url-pattern>/servlet/SystemTest</url-pattern> for SysTestServlet

<url-pattern>/adminController/*</url-pattern>  for AdminServlet

<url-pattern>*.html</url-pattern>  for DispatcherServlet

 

All of these are "context-relative", which means if this project is in webapps/pizza3, then the URLs you can use at home must satisfy:

 

http://localhost:8080/pizza3/servlet/SystemTest      one particular URL

http://localhost:8080/pizza3/adminController/*            wildcard URL

http://localhost:8080/pizza3/*.html                 another wildcard URL, completely disjoint from the others

 

Not all urls satisfying the wildcards will work, but at least the servlet will get the request and can complain about ones it doesn't understand.  All the URLs in the JSPs satisfy one of the wildcard URLs, usually by being relative URLs.
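The three kinds of pattern above (exact, path prefix with /*, and extension with *.) can be sketched in plain Java. This is only a simplified illustration of the matching idea, not the container's actual code: the class and method names are made up for this example, and a real container also applies precedence rules when more than one pattern matches (exact match first, then the longest path prefix, then the extension).

```java
// Simplified sketch of servlet url-pattern matching (hypothetical helper, not container code).
class UrlPatternSketch {

    // path is the context-relative path, e.g. "/adminController/listVariables.html"
    static boolean matches(String pattern, String path) {
        if (pattern.endsWith("/*")) {
            // prefix pattern: "/adminController/*" matches "/adminController" and anything under it
            String prefix = pattern.substring(0, pattern.length() - 2);
            return path.equals(prefix) || path.startsWith(prefix + "/");
        }
        if (pattern.startsWith("*.")) {
            // extension pattern: "*.html" matches any path ending in ".html"
            return path.endsWith(pattern.substring(1));
        }
        // otherwise the pattern must match the path exactly
        return pattern.equals(path);
    }

    // Strip the context path (for example "/pizza3") from the path part of a URL.
    static String contextRelative(String contextPath, String urlPath) {
        return urlPath.startsWith(contextPath) ? urlPath.substring(contextPath.length()) : urlPath;
    }

    public static void main(String[] args) {
        String path = contextRelative("/pizza3", "/pizza3/adminController/listVariables.jsp");
        // true: the servlet mapped to /adminController/* receives the request,
        // even if it then complains that it does not understand the URL
        System.out.println(matches("/adminController/*", path));
        System.out.println(matches("*.html", "/welcome.html"));
    }
}
```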

 

In music3, only the SystemTest URL will work in the provided code.

Edit Wed., Dec. 6: actually listVariables.html is also implemented, available at http://localhost:8080/music3/adminController/listVariables.html, so if you change the .jsp in your attempted http://localhost:8080/music3/adminController/listVariables.jsp to .html, it should work.


You shouldn't use URLs ending in jsp anywhere except as forwarding URLs in the servlet.  Think of the JSPs as tools of the servlet, not independent pieces of software.


The URLs in JSPs need to point to the servlet. Usually we use relative URLs in a group of related JSPs.  See the URLs in the student pages of pizza3: they all end in .html, so they can point to the DispatcherServlet.


In general, you can use relative URLs (ending in .html or whatever the url-mapping wants) in JSPs to go from place to place inside one servlet's URL-space, defined by its url-mapping. It's harder to go elsewhere.


When a URL in a JSP needs to point outside the servlet's URL-space, you need to use c:url as done in pizza3's sidebar.jsp, or use a site-relative URL.


c:url has a feature called context-relative URLs. For example, in sidebar.jsp of pizza3 we see:

    <c:url var="welcomeURL" value="/welcome.html" />

Since the value starts with /, this is defining a context-relative URL. c:url adds the current webapp name to this to create a site-relative URL

    /pizza3/welcome.html

Then the browser, seeing this site-relative URL (because it starts with /), adds the current site http://localhost:8080, and the full URL becomes http://localhost:8080/pizza3/welcome.html.
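The browser's half of this can be tried out with java.net.URI, which follows the same resolution rules for relative and site-relative URLs. The sketch below is an illustration only: the class name and the logout.html link are hypothetical, and contextToSiteRelative is a simplified stand-in for what c:url does with a context-relative value (the real tag can also add parameters or a session id).

```java
import java.net.URI;

// Sketch of how URLs found in a page get resolved (plain Java, not JSP code).
class UrlResolveSketch {

    // Resolve a link against the URL of the page that contains it, as a browser does.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    // Simplified stand-in for c:url: a value starting with "/" gets the context path prepended.
    static String contextToSiteRelative(String contextPath, String value) {
        return value.startsWith("/") ? contextPath + value : value;
    }

    public static void main(String[] args) {
        String page = "http://localhost:8080/pizza3/adminController/listVariables.html";
        // relative URL: stays inside the same servlet's URL-space
        System.out.println(resolve(page, "logout.html"));
        // context-relative "/welcome.html" becomes site-relative "/pizza3/welcome.html",
        // and the browser then adds the scheme, host and port
        System.out.println(resolve(page, contextToSiteRelative("/pizza3", "/welcome.html")));
    }
}
```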

This use of <c:url> is discussed in Murach on pg. 287. There, the URL being specified is /ch09cart/cart, since the webapp is named ch09cart.


We can directly use site-relative URLs:  you can code "/pizza3/welcome.html" as a URL in a JSP and it will get to the same place.  The only problem with this is you are assuming the project is installed as pizza3, not pizza3test or something. So this way is more rigid.

Caching

Our basic system gets fresh data from the database to create new POJOs on each request.  This is made quite fast by the fact that the database has memory buffers holding all its "warm" data, data that has been accessed in the last few minutes or so. However, the round trip to the database server takes some time (under 1ms if local).

The database buffering system can be called the "database cache". It holds data in memory, avoiding disk reads of the table data. The database carefully manages the data so that it always is using the correct current data. It is important to give the database enough memory to hold all the app's commonly used data. If you have a server with 64GB of memory, you could use 32GB for this buffering system.

Immutable domain objects don't really need new copies on each request. Even with a plain JDBC implementation, we could set up our own app cache for these objects, using the "application" scope: the ServletContext has get/setAttribute just like request and session. Such variables last the lifetime of the webapp (until it is stopped or redeployed), not just one request or session. But there's no need to do this until performance is a problem. Don't forget KISS.
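If performance ever did call for it, such an app cache could look roughly like the sketch below. The class name and the loader function are hypothetical (the loader stands for whatever DAO call fetches the object from the database), and the sketch assumes the cached objects really are immutable. In the webapp, one instance would be created at startup and stored with ServletContext.setAttribute, then fetched with getAttribute on each request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of an application-scope cache for immutable objects (hypothetical names).
class ImmutableCache<K, V> {
    // ConcurrentHashMap because many request threads share the one cache
    private final Map<K, V> map = new ConcurrentHashMap<>();
    // stands for a database lookup, e.g. a DAO method
    private final Function<K, V> loader;

    ImmutableCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Return the cached object, calling the loader only the first time a key is seen.
    V get(K key) {
        return map.computeIfAbsent(key, loader);
    }
}
```

Because nothing ever evicts entries, this only suits a small, fixed set of objects.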

In an enterprise app, we would more probably be using JPA, and it supplies this kind of caching service for us.

Using JPA's shared cache

Shared Cache : owned by the EMF (EntityManagerFactory, the object that gives out the EMs (EntityManagers))
Mutable objects: if only one application server and the JPA app is the only app accessing the database, OK to cache mutable objects, because the JPA runtime updates the shared cache on committed changes to entities.

Then the cache represents the database state to the extent it knows it.  It is smart enough to make a private copy of a domain object and put it in the em when a transaction updates an object gotten from the shared cache.

Can’t cache mutable objects in general because there is no mechanism to notify the cache about changes in the database.

However, when the system gets busy enough to worry about caching, the usual first step is to multiply the app servers, while keeping a single database server (and then the trick of letting the shared cache handle mutable objects stops working).  The Java of the app is a heavier load than the database actions it causes, so it is common to have a dozen or more app servers all running against one database.

In some cases, it’s better not to cache immutable objects – if there are millions of them, the shared cache gets bigger & bigger… → performance problem

Another approach: We can load the immutable objects before we start the Tx. – still serializable

end up with shorter Tx, shorter lock periods, better performance

Distributed caches – for multiple app servers: use only if database gets overloaded, and database has no other apps. Then the distributed cache is holding the database state for the system

Example: JBoss Cache, but many others

Internationalization

We want to be able to display web pages in Chinese, Arabic, Russian, etc., so we need to use Unicode for text. Java uses Unicode for String, so we’re OK there.

Look at http://kermitproject.org/utf8.html to see snippets of many languages, all encoded in UTF-8 and displayed by your browser. (I used Chrome in class.)

To fit all these language characters in one coding, Unicode codes are up to 21 bits long (the largest code is 0x10FFFF).  Java uses 16 bit chars, so sometimes it needs two chars (a "surrogate pair") to hold one Unicode char, but this is rare in practice.  Being a little sloppy, we say “Java uses Unicode for chars”, but it really uses UTF-16, an encoding of Unicode in 16-bit units. Characters with codes above FFFF need four bytes (two chars).
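A short sketch makes the char-versus-character distinction concrete. The class name is made up for this example; the library calls (Character.charCount, Character.toChars, String.codePointCount) are standard Java. U+1D11E, the musical G clef, is used as a sample character above FFFF.

```java
// Sketch: a Java String is a sequence of 16-bit UTF-16 units, not of whole Unicode characters.
class Utf16Sketch {

    // Number of 16-bit chars Java needs to hold one Unicode code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String euro = "\u20ac";                                // fits in one char
        String clef = new String(Character.toChars(0x1D11E));  // above FFFF: needs two chars
        // length() counts chars; codePointCount() counts Unicode characters
        System.out.println(euro.length() + " char(s), " + euro.codePointCount(0, euro.length()) + " character(s)");
        System.out.println(clef.length() + " char(s), " + clef.codePointCount(0, clef.length()) + " character(s)");
    }
}
```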

UTF-8 is the more common encoding outside Java. The 8 stands for the 8 bits in each byte used in the representation: one character takes 1-4 bytes, but for our use, most characters take only one byte (the ASCII characters). Thus this paragraph could be called ASCII or UTF-8.

Note: HTML5 defaults to UTF-8 encoding, a big improvement over HTML4, which defaulted to encoding "Latin-1" AKA "ISO-8859-1".

In both UTF-8 and Latin-1, ASCII characters qualify as valid characters. It only takes 7 bits to encode an ASCII character, and the 8th bit is 0 to fill one byte. In Latin-1, the 8th bit can be a one, so that 128 more codes are available for European (Latin related) languages (accented chars, etc.; the euro sign is not in Latin-1 itself, only in later variants such as ISO-8859-15 and Windows-1252). In UTF-8, if the 8th bit is 0, it's an ASCII char, and if not, the byte is part of a multi-byte char.
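The difference between the two encodings, and why the reader has to know which one was used, can be seen with a small experiment. The class name is hypothetical; the charsets come from java.nio.charset.StandardCharsets. The sample character is e with an acute accent (0x00e9).

```java
import java.nio.charset.StandardCharsets;

// Sketch: the same bytes read as UTF-8 or as Latin-1 (ISO-8859-1).
class Latin1VsUtf8Sketch {

    // Decode bytes with one of the two charsets.
    static String decode(byte[] bytes, boolean asUtf8) {
        return new String(bytes, asUtf8 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Pure ASCII text decodes the same either way
        byte[] ascii = "Hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(ascii, true).equals(decode(ascii, false)));

        // An accented char is 1 byte in Latin-1 but 2 bytes in UTF-8
        String eAcute = "\u00e9";
        System.out.println(eAcute.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println(eAcute.getBytes(StandardCharsets.UTF_8).length);

        // Reading UTF-8 bytes as Latin-1 turns one character into two wrong ones
        System.out.println(decode(eAcute.getBytes(StandardCharsets.UTF_8), false));
    }
}
```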

How this works is pretty impressive.  Here is the encoding table for Unicode values up to FFFF (see linked doc for values above FFFF).

From the standard http://www.faqs.org/rfcs/rfc3629.html

    Unicode value range         |        UTF-8 octet sequence
     hexadecimal range(bits)    |              (binary)
   -----------------------------|------------------------------------
   0000-007F (000000000xxxxxxx) | 0xxxxxxx
   0080-07FF (00000xxxxxxxxxxx) | 110xxxxx 10xxxxxx
   0800-FFFF (xxxxxxxxxxxxxxxx) | 1110xxxx 10xxxxxx 10xxxxxx

The first range above is the pure ASCII characters, all 7 bits long. Those same 7 bits become the UTF-8 value, along with a leading 0 bit. That's a nice feature: A pure ASCII text qualifies as UTF-8 without change at all (as long as the leading bits are 0).

The second range encodes non-ASCII codes that cover things like accented chars of European languages, special symbols, etc. These take two bytes of UTF-8 to hold their 11 significant bits.

The third range encodes non-ASCII codes of all the languages of the world except some added too late to fit (these are in the extended range not shown above). Their 16 significant bits are encoded in 3 bytes of UTF-8.
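The three rows of the table translate almost directly into bit operations. The following is a sketch for illustration (the class name is made up, and it handles only values up to FFFF, ignoring details such as the reserved surrogate range); in real code you would just call String.getBytes with UTF-8, which main uses here as a cross-check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the three table rows as code, for Unicode values 0000-FFFF only.
class Utf8EncodeSketch {

    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0xFFFF) {
            throw new IllegalArgumentException("only 0000-FFFF handled in this sketch");
        }
        if (cp <= 0x7F) {
            // row 1: 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {
            // row 2: 110xxxxx 10xxxxxx (top 5 bits, then low 6 bits)
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            // row 3: 1110xxxx 10xxxxxx 10xxxxxx (top 4 bits, middle 6, low 6)
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x61, 0xa9, 0x20ac, 0x2122 };
        for (int cp : samples) {
            // compare the hand-made encoding with the library's
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches library: "
                    + Arrays.equals(encode(cp), expected));
        }
    }
}
```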

This encoding preserves binary order: Note how the bitstrings on both sides of the table are in bitstring order (like char strings but using bits).

Also, each byte of a multi-byte character starts off with 2 bits that say whether this is a leading byte (11) or a trailing byte (10). If a UTF-8 data stream is broken, the reader can figure out how to restart decoding by looking at these header bits and skipping trailing bytes until it reaches a leading byte (or an ASCII byte, which starts with 0), and then restart decoding characters.
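The restart rule is simple enough to sketch in a few lines. The class and method names below are made up for this example; the only fact relied on is that trailing bytes have the bit pattern 10xxxxxx.

```java
// Sketch: find where decoding can restart in a UTF-8 byte array.
class Utf8ResyncSketch {

    // A trailing byte has the form 10xxxxxx.
    static boolean isTrailing(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Index of the first byte at or after 'from' that starts a character
    // (an ASCII byte 0xxxxxxx or a leading byte 11xxxxxx); data.length if there is none.
    static int nextStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && isTrailing(data[i])) {
            i++; // skip trailing bytes of a character whose start was lost
        }
        return i;
    }

    public static void main(String[] args) {
        // euro sign (e2 82 ac) followed by 'A' (41); pretend we start reading at index 1
        byte[] data = { (byte) 0xe2, (byte) 0x82, (byte) 0xac, 0x41 };
        System.out.println("restart decoding at index " + nextStart(data, 1));
    }
}
```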

Example Characters

For example, the euro sign has Unicode 0x20ac (this is 16 bits; add five more binary 0s on the left to make the full 21 bits). Its UTF-8 encoding takes 3 bytes: e2 82 ac. Note that the high bit is on in these bytes, marking non-ASCII.  In Java, we can use ‘\u20ac’.

To compare, ASCII ‘A’ has code 0x41, and its UTF-8 representation is just the same one-byte code 0x41, with high bit off.

More examples, showing bits:

a (0x0061)  is coded by the first rule:       0061 = 00000000 0|1100001, UTF-8 code = 01100001 (8 bits)

© (0x00a9) is coded by the second rule:  00a9 = 00000|000 10|101001, UTF-8 code = 11000010 10101001  (16 bits)

€ (0x20ac) is coded by the third rule:       20ac = 0010|0000 10|101100, UTF-8 code =  11100010 10000010 10101100 (24 bits)

™ (0x2122) is coded by the third rule:    2122 = 0010|0001 00|100010, UTF-8 code =  11100010 10000100 10100010 (24 bits)
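These byte values can be checked against Java's own encoder. The helper below is a small sketch (the class and method names are made up) that prints the UTF-8 bytes of a string in hex, with the four example characters written as Java escapes.

```java
import java.nio.charset.StandardCharsets;

// Sketch: show the UTF-8 bytes of a string in hex, to check the examples above.
class Utf8HexSketch {

    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF turns the signed byte into 0..255 before formatting
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a, copyright sign, euro sign, trademark sign
        String[] samples = { "a", "\u00a9", "\u20ac", "\u2122" };
        for (String s : samples) {
            System.out.println(utf8Hex(s));
        }
    }
}
```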

Using UTF-8 in our JSP pages (so they produce HTML in UTF-8 for sure)

If we want our pages to work with all browsers (not necessary for pa2), we need to tell the browser that we are using UTF-8 to make our UTF-8 HTML actually work for old browsers (soon this may not be necessary as HTML5 takes over). This is done by putting the content-type response header in the response that carries the HTML document back to the browser.  See Murach, pg. 553 for response headers, including content-type. We tell JSP to do this as follows, as you can see in many jsps in our projects. This will work for HTML4 pages as well as HTML5.

<%@page contentType="text/html; charset=UTF-8"%>

This also has the effect of telling the “out” object in JSP (the object that outputs the response text) in the servlet to produce UTF-8 encoded text, an easy conversion for it from Java’s internal Unicode representation.

Note: Normally Java outputs using the local char set of the command environment, often not UTF-8, even though it is using Unicode internally. To override this default behavior, you can specify the char encoding in the various output classes (also in input classes), for example

    // needs: import java.io.*;
    OutputStreamWriter out2 = new OutputStreamWriter(new FileOutputStream("foo.txt"), "UTF-8");
    // make a PrintWriter, so we can use println, etc.
    PrintWriter out = new PrintWriter(out2);

To find the right character to use, the "charmap" program of Windows is very nice. Find it in Windows/System32/charmap.exe and make an alias for your desktop. We looked up the euro sign, etc. Try typing 20ac in the “Go to Unicode” box.

Also, see the UTF-8 test page at www.fileformat.info.

To put the euro sign in HTML, you can specify it by its Unicode value as in &#8364; (using decimal value of 0x20ac) or &#x20ac; using the hex value.  At least in some environments, you can use &euro; or you can put in the 3 bytes of actual UTF-8. Your editor may switch representations on you.
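If an app needs to generate such numeric references itself, the conversion from a Unicode value is a one-liner each way. This is a minimal sketch with made-up class and method names.

```java
// Sketch: build HTML numeric character references for a Unicode value.
class CharRefSketch {

    // decimal form, e.g. 0x20ac gives the reference with 8364 in it
    static String decimalRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // hex form, using lowercase hex digits after the x
    static String hexRef(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint) + ";";
    }

    public static void main(String[] args) {
        System.out.println(decimalRef(0x20ac) + " " + hexRef(0x20ac));
    }
}
```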

For easy editing, you need a Unicode-supporting editor.  MS Word and post-1998 Windows in general support Unicode (though not in the command line environment, which derives from ancient DOS days). Eclipse supports it, at least for HTML and JSP that declares its char-coding as above, although it doesn’t have a UI to show us what various codes look like (that I’ve found, anyway).  For text files in Eclipse, Project Properties>Resource allows setting the encoding.

There’s more work to do for internationalization: Need to translate app’s results too, error messages. 

Note: we haven’t covered character sets in use in databases—we’re really using ASCII there. But we could use UTF-8.

Handling multiple natural languages in one app is even more challenging.  You need to be able to detect the user’s language (“locale”) and then switch to that language. Too complicated to cover here.