Class 03 – cs639

HW1 is due next class—questions?

PA1 is available. Topic is writing XML from non-XML data.

HTTP: the protocol underlying web access. It is very simple: only one request-response cycle, and then disconnect (at least logically).

1. Client connects to port 80 on host Y.

2. Web server on port 80 accepts.

3. Client sends a GET / HTTP/1.0 request line (1.0 is the HTTP version; a real browser uses 1.1).

Then come optional header lines.

Then one empty line finishes the header.

4. The web server sends back the file contents (preceded by a status line and response headers).

5. The two parties disconnect.
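The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular browser is written: the function names build_get_request and fetch are made up for this example, and the Host and Connection headers are an assumption about what a polite HTTP/1.1 client would add (HTTP/1.1 requires Host).

```python
import socket


def build_get_request(path, host=None, version="1.0"):
    """Return the bytes of a minimal GET request (hypothetical helper)."""
    lines = [f"GET {path} HTTP/{version}"]
    if host is not None:
        lines.append(f"Host: {host}")  # required by HTTP/1.1
    if version == "1.1":
        lines.append("Connection: close")  # ask for one cycle, then disconnect
    # Each line ends in CRLF; one empty line finishes the header.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch(host, path, port=80, version="1.0"):
    """Steps 1-5: connect, send the GET, read the reply, disconnect."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_get_request(path, host, version))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch("www.cs.umb.edu", "/cs639") would need network access; the bytes it returns start with the server's status line and headers, followed by the file contents.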

for example:

If you put the URL for the cs639 class web page in the address bar of your browser, the browser gets from you:

http://www.cs.umb.edu/cs639

Browser picks out the host name from this, www.cs.umb.edu, converts it to the IP address, and uses port 80, the default web server port. It connects to this host and port and sends the HTTP GET command:

GET /cs639 HTTP/1.1

Followed by some header information and a blank line to finish it. Or use HTTP/1.0 and no header lines.

It gets back the contents of index.html from that cs639 directory.

(Locally known as /data/htdocs/cs639. The root of our website is /data/htdocs in our distributed filesystem, available from any of our systems.)

The browser knows HTML, parses the index.html file, and finds links for images.

The browser does another GET for each image. The web server returns the named image.

There is a link to hw1.html in this index.html file:

If the user clicks on this link, the browser requests that resource, at URL “hw1.html”. This is a relative URL: it asks for a resource in the same directory as the .html file that contains the link. The browser remembers that index.html is at /cs639, so it figures out that hw1.html is at local path /cs639/hw1.html on the server. The server is always given the full local path in the GET (the web server is “stateless”, so it is never required to remember what happened previously, as explained more below).

The user clicks on this, and the browser connects to the same web server as the main page and sends:

GET /cs639/hw1.html HTTP/1.0 (or 1.1)

Then the browser gets this data back from the server, disconnects, and displays the page.

Consider the link to the XML Schema document: <a href="http://www.w3.org/TR/xml-schema-0/">. This is a full, or absolute, URL. In this case the browser connects to the IP address of the server www.w3.org on port 80, then sends GET /TR/xml-schema-0/ HTTP/1.1
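Python's standard urljoin follows the same resolution rules a browser uses, so it can illustrate both kinds of link. One assumption to flag: the base URL below is written with a trailing slash, /cs639/, to stand for "the directory index.html came from". Without that slash, the standard rules treat cs639 as a file name and replace it.

```python
from urllib.parse import urljoin

# Assumed base: the directory URL of the page containing the links.
BASE = "http://www.cs.umb.edu/cs639/"

# Relative URL: resolved against the directory of the current page.
relative = urljoin(BASE, "hw1.html")

# Absolute URL: the base is ignored entirely.
absolute = urljoin(BASE, "http://www.w3.org/TR/xml-schema-0/")

# Same relative link, but with a base that lacks the trailing slash.
no_slash = urljoin("http://www.cs.umb.edu/cs639", "hw1.html")
```

Here relative is http://www.cs.umb.edu/cs639/hw1.html, absolute is unchanged, and no_slash comes out as http://www.cs.umb.edu/hw1.html, which is not what we want.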

Other ports can be used for web service, for example 8080 for tomcat, in which case the URL looks like this:

http://example.com:8080/something
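As a rough sketch of how a client might pick the host, port, and path out of a URL, here is a small helper built on the standard urlsplit function. The name connection_target is invented for this example.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # the default web server port


def connection_target(url):
    """Return (host, port, path) for an http URL (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or DEFAULT_HTTP_PORT  # no port in the URL means 80
    path = parts.path or "/"                # no path means the root
    return host, port, path
```

For the class page, connection_target("http://www.cs.umb.edu/cs639") gives the host www.cs.umb.edu, port 80, and path /cs639; for the tomcat-style URL above it gives port 8080.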

The browser is a "universal client"

The browser knows html, its most basic capability. This is enough to do simple interactions, forms, links, etc. Many websites are set up to only need a browser on the client side.

HTTP is a “stateless protocol”

We give the URL, and the browser does the appropriate GET and gets back a resource. Images in the HTML lead to more GETs in separate connections, possibly to different servers. Even when the server is the same, each request is handled separately, in its own connection (at least logically). The web server can be "dumb": it does not have to remember the last connection by this user, and it doesn't need to know about users at all! Each GET is self-contained. This is why people say that HTTP is a stateless protocol. Not having to remember things leads to robustness and universal use. Web servers are so simple to implement that printers and thermostats can have them.
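To make "stateless" concrete, here is a toy request handler, offered as a sketch rather than a real web server. The function handle_request and the in-memory documents dictionary are assumptions for illustration. The point is that the reply depends only on the request text, never on earlier requests.

```python
def handle_request(request_text, documents):
    """Answer one GET using only the request itself (no stored state)."""
    request_line = request_text.split("\r\n", 1)[0]
    parts = request_line.split(" ")
    if len(parts) != 3 or parts[0] != "GET":
        return b"HTTP/1.0 400 Bad Request\r\n\r\n"
    path = parts[1]  # the full local path, supplied by the client
    body = documents.get(path)
    if body is None:
        return b"HTTP/1.0 404 Not Found\r\n\r\n"
    head = f"HTTP/1.0 200 OK\r\nContent-Length: {len(body)}\r\n\r\n"
    return head.encode("ascii") + body


# Hypothetical document store standing in for files under the web root.
docs = {"/cs639/hw1.html": b"<html>hw1</html>"}
first = handle_request("GET /cs639/hw1.html HTTP/1.0\r\n\r\n", docs)
second = handle_request("GET /cs639/hw1.html HTTP/1.0\r\n\r\n", docs)
missing = handle_request("GET /hw1.html HTTP/1.0\r\n\r\n", docs)
```

The two identical requests get identical replies, and the request for /hw1.html fails because the server has no memory of which directory the client was "in".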

The browser has to be a little smarter. For example, it has to remember that the first index.html came from /cs639, so when it follows the relative URL “hw1.html”, it composes “GET /cs639/hw1.html HTTP/1.1”.

A web-based application sometimes has a problem with all these independent request-response cycles and needs to track a user over their many separate HTTP requests. That’s another subject, covered in cs636.

Using < in character data

Last time we noted that < must be a special character in XML, since it delimits the tags.

If we want a < in character content, we represent the < with the “predefined entity reference” &lt;

This makes & special, so we represent & with &amp;

Since > is naturally paired with <, it also has such a rep, &gt;, although it’s not so special as the other two.

Thus if <greeting> holds the string “abc<d>e” we write it in the XML document like this:

<greeting>abc&lt;d&gt;e</greeting>

When an XML parser reads this, it assembles the string “abc<d>e” as element string content.
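The round trip can be checked with Python's standard library, as a quick sketch: escape from xml.sax.saxutils does the writing-side substitution, and ElementTree plays the role of the XML parser.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

text = "abc<d>e"

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
escaped = escape(text)
document = f"<greeting>{escaped}</greeting>"

# The parser turns the references back into the original characters.
parsed_text = ET.fromstring(document).text
```

Here document is <greeting>abc&lt;d&gt;e</greeting> and parsed_text is the original string abc<d>e.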

Question comes up in pa1: is the [] OK in the following, or do we need to quote it somehow?

<paramType>String[]</paramType>

<paramType> is the start tag and </paramType> is the end tag.

All of this together is the element.

“String[]” is the character content of the element—is this [] OK for well-formed XML?

We can find out by using the XML 1.0 Standard, linked from the course web page under “Resources”

Using the XML Standard—it’s readable!

It’s in HTML, with effective use of links.

The rules are written in EBNF, covered in Sec. 6—we went over them.

[abc] a or b or c

[^abc] not (a or b or c)

[^<&]* a sequence of 0 or more occurrences of chars, each not a ‘<’ or ‘&’

…
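For these simple cases, the bracket notation of the standard's EBNF reads the same way as a character class in Python's re module, so the rules above can be tried out directly. This is only an analogy for experimenting; the helper name whole_match is made up here.

```python
import re


def whole_match(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


one_of = whole_match(r"[abc]", "b")         # a or b or c
none_of = whole_match(r"[^abc]", "d")       # not (a or b or c)
run_ok = whole_match(r"[^<&]*", "String[]")  # no '<' or '&' anywhere
run_empty = whole_match(r"[^<&]*", "")       # zero occurrences is allowed
run_bad = whole_match(r"[^<&]*", "a<b")      # contains '<'
```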

To determine if “String[]” is valid content of the element—

1. find the element definition:

element ::= EmptyElemTag | STag content ETag

Our question is about the content part.

2. follow the link from content to its definition:

content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*

Everything in the outer parentheses, from (element through the second CharData?, can be absent because of the *.

CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

[^<&]* is a regular expression of EBNF

[^<&]* ']]>' [^<&]* sequences of chars, each not < or &, with ‘]]>’ somewhere within it.

CharData: sequences of chars, each not a ‘<’ or ‘&’, and also not containing any subsequences of the form ‘]]>’

Element content abc&lt;d&gt;e (example above) parses as CharData Reference CharData Reference CharData by the rule above, since &lt; and &gt; are References, leaving abc, d, and e as simple CharData strings.

So ‘String[]’ is OK as CharData, and all old-style Java types are too. But what about generics such as Pair<String>? This needs the &lt; escape for ‘<’. We can also use &gt; for the > sign, for symmetry.
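The CharData rule can be written as a small check, and the conclusion confirmed with a real parser. This is a sketch: is_chardata is a name invented here, and ElementTree stands in for "an XML parser".

```python
import re
import xml.etree.ElementTree as ET

_NO_LT_OR_AMP = re.compile(r"[^<&]*")


def is_chardata(s):
    """CharData: no '<', no '&', and no ']]>' subsequence."""
    return _NO_LT_OR_AMP.fullmatch(s) is not None and "]]>" not in s


array_ok = is_chardata("String[]")
generic_ok = is_chardata("Pair<String>")

# The parser agrees: String[] needs no quoting.
array_text = ET.fromstring("<paramType>String[]</paramType>").text

# With the '<' escaped, the generic type comes through intact.
generic_text = ET.fromstring("<paramType>Pair&lt;String></paramType>").text

# Unescaped, the parser sees <String> as a start tag and rejects the document.
try:
    ET.fromstring("<paramType>Pair<String></paramType>")
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False
```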

(Or we could use abc&lt;d>e, but it looks funny. Only < really needs quoting in element content.)

(Or we could use abc&#60;d>e, that is, using the “character reference” &#60; for <, built from its character code.)
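A quick way to see that these spellings are equivalent is to parse each one and compare the resulting strings. The hexadecimal form &#x3C; is added here as one more variant; it is not from the notes above, but it is the same character code written in hex.

```python
import xml.etree.ElementTree as ET

variants = [
    "abc&lt;d&gt;e",  # both signs as entity references
    "abc&lt;d>e",     # only '<' quoted
    "abc&#60;d>e",    # decimal character reference for '<'
    "abc&#x3C;d>e",   # hexadecimal character reference for '<'
]

texts = [ET.fromstring(f"<greeting>{v}</greeting>").text for v in variants]
```

Every entry of texts is the same string, abc<d>e.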

Attribute values also can use &lt;, etc., and need &quot; or &apos; in some cases

Similar rules about quoting < and & hold for attribute values, but there quotes are used as delimiters, and thus need quoting if the same quote is used in the string within the delimiters: &quot; for " and &apos; for '.

Luckily, with the choice of two quotes, currency="USD" or currency='USD', we usually don’t need these.
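Python's quoteattr makes the same choice automatically: it picks whichever quote delimiter avoids escaping, and falls back to &quot; only when the value contains both kinds of quote. The sketch below uses a made-up element name, price, around the currency attribute from the example.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET


def price_element(currency):
    """Build an empty element with a safely quoted attribute (hypothetical)."""
    return f"<price currency={quoteattr(currency)}/>"


def round_trip(currency):
    """Write the attribute, parse it back, and return the parsed value."""
    return ET.fromstring(price_element(currency)).get("currency")


plain = price_element("USD")
with_double = price_element('5" disk')
tricky = "a<b & \"x\" 'y'"
```

Here plain uses double quotes, with_double switches to single-quote delimiters, and round_trip(tricky) returns the tricky string unchanged.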

XML as sequence of Unicode characters, in UTF-8 or other encoding

The text of an XML document (markup and content) has even more basic requirements set out on pg. 20 of text and near the start of the standard. Basically, an XML document is a sequence of Unicode characters, with some (strange) characters disallowed. Unicode is “encoded” into 8-bit bytes by well-known mappings. Some characters (ex. Asian chars) take multiple bytes.

The default encoding for XML is UTF-8, which matches ASCII across the 128 ASCII characters (codes 0-127). Many examples in the text have encoding="ISO-8859-1" in the XML declaration. This is the “Latin-1” character set, which has more than 128 characters but no more than 256, so each character fits in one 8-bit byte, and it again matches ASCII for codes 0-127. We should come back to this later.
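A few byte counts illustrate the difference between the two encodings. This is a small sketch; the sample characters and the name element are chosen here for illustration only.

```python
import xml.etree.ElementTree as ET


def byte_length(ch, encoding):
    """Number of bytes one character takes in the given encoding."""
    return len(ch.encode(encoding))


E_ACUTE = "\u00e9"  # e with acute accent, a Latin-1 character above code 127
ZHONG = "\u4e2d"    # a Chinese character, outside Latin-1

# The same text stored two ways; only the Latin-1 file needs a declaration.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\xe9</name>'
utf8_doc = ("<name>caf" + E_ACUTE + "</name>").encode("utf-8")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
```

An ASCII letter takes one byte either way. The accented e takes one byte in ISO-8859-1 and two in UTF-8, the Chinese character takes three bytes in UTF-8 and cannot be encoded in ISO-8859-1 at all, and both documents parse to the same string.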

End of lines (eols): It’s OK to store XML using LF (UNIX 1-byte eol) or CRLF (Windows 2-byte eol), but the XML processor is expected to convert each CRLF to a single LF as its first step. So in programs, eols appear as LFs, and in fact no CRs are left, since any remaining lone CRs are also replaced with LFs.
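The normalization rule is short enough to write out. This is a sketch of the rule as described above, not the code of any actual parser, and normalize_eols is a name made up here.

```python
def normalize_eols(text):
    """Turn CRLF into LF first, then any leftover lone CR into LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "line1\r\nline2\rline3\nline4"
normalized = normalize_eols(mixed)
```

The order matters: replacing lone CRs first would turn each CRLF into two LFs.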

XML Coverage

We’ll cover carefully:

- elements

- attributes

- comments

- XML declarations

We will worry less about:

- PIs

- Entities (except the built-in entities like &lt; needed to make XML work)

- Namespaces – related to integrating XML from multiple sources (we will cover this later)

Read Chap. 1 to pg. 41, skipping PIs, entities, and also namespaces for now.