July 1997, no. 25
It is a pleasure to take over as coordinator of the ICE Project, and to help bring to fruition Sidney Greenbaum's goal of creating computerized corpora of the many different varieties of English that have evolved around the world.
Even though I will be coordinator of ICE, Gerry Nelson has graciously agreed to answer any technical questions about ICE that you may have. Gerry can be reached by e-mail (firstname.lastname@example.org) or by snail mail at the Survey of English Usage (University College London, Gower St., London, WC1E 6BT, England).
This newsletter contains a discussion of a number of recent developments
in the ICE Project:
Department of English
University of Massachusetts at Boston
100 Morrissey Blvd.
Boston, MA 02125-3393
If you have any questions or comments about anything in this newsletter or about the ICE Project in general, don't hesitate to contact me.
ICE Meeting, 18th Annual ICAME Meeting, Chester, England, 23 May 1997
In attendance: Bas Aarts (London), Jan Aarts (Nijmegen), Doug Arnold
(Essex), Peter Collins (Sydney), Sylviane Granger (Louvain), Knut Hofland
(Bergen), John Kirk (Belfast), William Kretzschmar (Athens, GA), Christian
Mair (Freiburg), Charles Meyer (Boston), Gerry Nelson (London), Nelleke
Oostdijk (Nijmegen), Pam Peters (New South Wales), Andrea Sand (Freiburg),
Josef Schmied (Chemnitz)
A fully tagged and parsed version of ICE-GB is nearing completion, and should be available by the autumn of 1997. Also to be released soon is a series of CD-ROMs containing digitized sound files for the spoken part of ICE-GB. ICE-GB will be made available on CD-ROM along with ICECUP 3. The CD-ROM will be distributed either by the Survey of English Usage or by the Norwegian Computing Centre for the Humanities, which also distributes other corpora.
Each ICE team will receive a free copy of the tagged and parsed version of ICE-GB, and there is also the possibility of obtaining additional copies at a discount.
ICE Tree and ICECUP 3
ICE Tree is now available from the Survey of English Usage. This program can be used to edit the output of ICE texts that have been syntactically parsed. A demonstration copy can be downloaded from the Survey's Web site.
ICECUP 3 will be released with ICE-GB. This program facilitates the analysis of ICE texts by enabling various kinds of searches, the generation of KWIC concordances, the selection of sub-corpora for analysis, and the stripping of markup from ICE texts for easier viewing.
Both of these programs are described in detail in the book on ICE that Sidney Greenbaum edited: Comparing English Worldwide (OUP, 1996).
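To make the two text-handling operations mentioned above more concrete, here is a minimal sketch in Python of stripping markup and building a KWIC (key word in context) concordance. It is not ICECUP's implementation. It assumes only that structural markup appears as angle-bracket tags such as <#> and <$A>, and the function names strip_markup and kwic are hypothetical.

```python
import re

# Assumption: ICE structural markup appears as angle-bracket tags, e.g. <#> or <$A>.
TAG = re.compile(r"<[^<>]*>")

def strip_markup(text):
    """Remove angle-bracket tags and collapse the whitespace left behind."""
    return " ".join(TAG.sub(" ", text).split())

def kwic(text, keyword, width=3):
    """Return (left, key, right) triples for each occurrence of keyword.

    Matching ignores case and surrounding punctuation; `width` is the
    number of context words kept on each side.
    """
    words = strip_markup(text).split()
    rows = []
    for i, word in enumerate(words):
        if word.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows

# Invented sample text, for illustration only.
sample = "<$A> <#> I saw the corpus yesterday <$B> <#> Which corpus did you see"

if __name__ == "__main__":
    for left, key, right in kwic(sample, "corpus"):
        print(f"{left:>25} [{key}] {right}")
```

Aligning the key word in a fixed column, as the print loop does, is what lets the contexts be compared at a glance.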
Minimal Markup for ICE Texts
As ICE teams continue to collect texts for their respective components
of the ICE Corpus, it becomes increasingly important to think about the
kinds of markup to include in their components. Essentially, there are
three types of markup:
Listed below is the minimal set of structural ICE tags needed to ensure intelligibility and required both for input to a tagger and for analysis of the corpus by ICECUP.
Changed name or word
The markup listed above represents a much more limited number of structural ICE tags than Sidney Greenbaum recommended in a February 1996 paper he distributed to all ICE teams ("Reduced Markup for ICE Corpora"). In addition, depending upon what you include in your corpus, you may need to add other markup, such as tags to set off editorial comments, unclear words, and corrected spellings. And if you intend to parse your component of ICE, you'll need to insert various kinds of normalizations, particularly in the spoken texts.
The reason I'm recommending that we allow for the minimal set of markup above is that I know that many ICE teams have had considerable difficulty obtaining funds to support the creation of their components of ICE, and by minimizing the amount of markup necessary for ICE texts, the amount of time needed to insert structural markup is greatly reduced. I would also emphasize that ICE teams are still free to tag their components of ICE with as many structural tags as they wish to.
I would appreciate hearing from ICE teams concerning their views on the set of minimal markup I'm recommending above, and whether they think that other structural tags are necessary for ICE texts.
The ICE CD-ROM
A Proposal from Gerry Nelson:
The Survey has acquired software and hardware which enable us to digitise sound recordings from audio tape and to produce CD-ROMs. It is proposed that we produce one CD for each ICE team, which will contain:
- 20 digitised spoken texts (40,000 words)
- the corresponding orthographic transcriptions
- a copy of ICECUP for retrieval of lexical items
The sound files will be aligned with the transcriptions at text unit level. This allows playback of the sound via ICECUP for each retrieved text unit.
The 20 texts will comprise:
10 Dialogues (6 face-to-face conversations, 4 public dialogues)
10 Monologues (4 unscripted speeches, 4 scripted speeches, 2 "mixed", i.e. news broadcasts)
Note: You should have copyright permission for all texts, for use in non-commercial research.
If you would like to participate in this project:
1. Select 20 recordings from the categories above.
2. Send the recordings on audio tape to Gerry Nelson, together with their orthographic transcriptions on diskette. The only markup you need in these transcriptions is the text unit marker <#> at the start of each "sentence", speaker IDs <$A>, <$B>, etc. at the start of each speaker turn, and markup for overlapping strings. However, if you have already inserted additional markup, there is no need to remove it.
3. If possible, send only that part of the recording which has been transcribed.
The quality of the recordings you send should be as good as possible.
Among conversations, in particular, you should select only the best recordings,
that is, those in which the speech is most clearly audible, and in which
there is a minimum of overlapping.
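As an illustration of the minimal transcription markup described in step 2, the following Python sketch splits a transcription into text units and labels each with its speaker. It is only a sketch under stated assumptions: speaker IDs are taken to be capital letters inside <$...>, the function name text_units is hypothetical, and markup for overlapping strings is not handled.

```python
import re

# Assumptions: <#> opens each text unit; speaker IDs look like <$A>, <$B>, ...
MARKER = re.compile(r"(<\$[A-Z]+>|<#>)")

def text_units(transcript):
    """Split a transcript into (speaker, text) pairs, one per text unit."""
    units = []
    speaker = None
    # re.split with a capturing group keeps the markers in the result list.
    for piece in MARKER.split(transcript):
        piece = piece.strip()
        if not piece:
            continue
        if piece.startswith("<$"):
            speaker = piece[2:-1]          # "<$A>" -> "A"
        elif piece == "<#>":
            units.append([speaker, ""])    # start a new text unit
        elif units:
            # Plain text belongs to the most recently opened text unit.
            units[-1][1] = (units[-1][1] + " " + piece).strip()
    return [tuple(unit) for unit in units]

# Invented sample transcription, for illustration only.
sample = ("<$A> <#> Did you get the tape <#> I sent it on Monday "
          "<$B> <#> Yes it arrived today")

if __name__ == "__main__":
    for number, (who, text) in enumerate(text_units(sample), start=1):
        print(number, who, text)
```

Because every text unit carries its own marker, units can be numbered in order, and that numbering is the level at which the sound files are to be aligned with the transcriptions.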