Using CLAWS to annotate the British National Corpus
Roger Garside, Department of Computing, University of Lancaster
The main steps of the CLAWS annotation process as applied to the BNC are as
follows:
- text-files to be processed are deposited by OUCS (Oxford University
Computing Service) in a "drop" directory on the UCREL computer system,
from which an automatic procedure logs them into UCREL processing directories
overnight, so that they are available to the corpus analysts when they start
work.
- a corpus analyst selects text-files to be processed, and invokes a
procedure to carry out the various steps of the tagging procedure in a work area
allocated to that analyst, monitoring the process and logging the completion and
success or failure of each step. This makes it easy to see how far each
text-file has progressed through processing.
- the first step on a new text-file is to run it through an SGML parser, to
check the validity of the formatting of the file on arrival. This is followed by
a number of filter programs which check for various special SGML situations
which are valid according to the parser but cause the Claws tagging system
problems. Examples of such things are certain formats of quoted SGML attribute
values, and certain empty SGML elements; these are silently corrected in the
source text-file (although the original text-file is archived in this case).
- next the Claws part-of-speech tagging system is run. The text is divided
into separate orthographic units, and a part-of-speech marker is assigned to each
such unit; Claws also segments the text into units which approximate to
sentences. The output from this step is a list of orthographic units, together
with the preferred tag and some information about how it was chosen. The SGML
header information and the more voluminous information associated with SGML tags
are put in a supplementary output file, to be re-incorporated with the main
output at a later step. This tagging phase is discussed in more detail in the
next section.
- the next step is a post-processing program whose task is to reformat the
Claws output into the format required for returning to OUCS. This involves
re-incorporating the information in the supplementary file, deleting the extra
tagging process information, and representing each part-of-speech marker as an
SGML entity. It is at this stage that portmanteau tags are introduced
(see next section). Our original plans were to merge this step into the output
phase of the Claws system, but we decided that it was preferable to retain the
intermediate format with the extra information while manual corrections were
being made to the annotation.
- after this there are a number of further filter programs, which check
various aspects of the current output file (for example, there is an intelligent
differencing program which checks the validity of all differences between
the source text and the output text) or which make various systematic changes to
the output file (such as to ensure the SGML element structure is still valid
after Claws has segmented the text). This culminates in a rerun of the SGML
parser, to ensure the validity of the output file.
- at this stage the automatic processing terminates, and manual processing
commences. All the error reports from the above steps are examined, and where
necessary scripts invoked to correct the annotation. Usually the Claws output is
corrected, but in extremis the final output file or the original source text can
be adjusted. The scripts ensure that steps are rerun as appropriate, and that
the monitor information is kept up-to-date, providing an audit trail of what has
been done. Selected hundred-sentence blocks are manually corrected to check the
validity of the automatic processing.
- the output files are then returned to OUCS. This is at present a manual
procedure, but we expect to automate it at a later stage.
- after the annotated text-files have been sent to OUCS and some further
processing has been performed there, there is a further window of opportunity
within which UCREL can revise the text-file annotation if necessary. We are
planning to use this to ensure consistency between texts tagged at different
times, and to eliminate certain erroneous tagging decisions where this can be
done without disruption to the remainder of the text.
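The monitoring described in the steps above, in which the completion and success or failure of each step is logged for each text-file, might be sketched as follows. This is only an illustration: the `Monitor` class, the `run_pipeline` function, the two placeholder steps and the record layout are all hypothetical and are not taken from the UCREL scripts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a step takes the current text and returns the
# transformed text, raising an exception to signal failure.
Step = Tuple[str, Callable[[str], str]]


@dataclass
class Monitor:
    """Audit trail: one (file, step, status) record per attempted step."""
    records: List[Tuple[str, str, str]] = field(default_factory=list)

    def log(self, file_id: str, step: str, status: str) -> None:
        self.records.append((file_id, step, status))

    def last_step(self, file_id: str) -> Optional[Tuple[str, str, str]]:
        """Return the most recent record for a text-file, or None."""
        for record in reversed(self.records):
            if record[0] == file_id:
                return record
        return None


def run_pipeline(file_id: str, text: str, steps: List[Step],
                 monitor: Monitor) -> Optional[str]:
    """Run the steps in order, logging each one; stop at the first failure."""
    for name, func in steps:
        try:
            text = func(text)
        except Exception as exc:
            monitor.log(file_id, name, f"FAILED: {exc}")
            return None
        monitor.log(file_id, name, "OK")
    return text


# Placeholder steps standing in for the parser and filter programs.
def check_nonempty(text: str) -> str:
    if not text.strip():
        raise ValueError("empty text-file")
    return text


def normalise_whitespace(text: str) -> str:
    return " ".join(text.split())


STEPS: List[Step] = [("validate", check_nonempty),
                     ("filter", normalise_whitespace)]
```

Calling `monitor.last_step(file_id)` after a run shows the last step attempted on that text-file and whether it succeeded, which is the kind of question the audit trail is meant to answer.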
The tagging of the BNC is carried out with a version of the Claws stochastic
part-of-speech tagger (Marshall 1983;
Garside, Leech and Sampson 1987). Its main steps are:
- formatting of the input text into orthographic units, and segmentation
into units approximating to sentences.
- assignment of a list of potential part-of-speech markers to each
orthographic unit, based on a lexicon, suffix-list, and a set of rules to deal
with capitalised words, hyphenated words, etc.
- modification of the potential part-of-speech lists by matching to a
collection of pattern templates, which make use of the original words and the
potential part-of-speech markers already introduced.
- selection of the preferred part-of-speech by calculation of the most
likely part-of-speech sequence, based on probabilities taken from a large corpus
of annotated text and using the well-known Viterbi alignment procedure.
- reformatting and output of the results.
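The tag-assignment and tag-selection steps listed above can be illustrated with a small Viterbi sketch. It assumes a first-order (tag-pair) model; the lexicon entries, transition probabilities and the floor value for unseen tag pairs are invented for illustration and are not Claws resources. Suffix rules, pattern templates and unknown-word handling are omitted, so every input word must appear in the toy lexicon.

```python
import math

# Toy resources, invented for illustration only.
LEXICON = {
    "the": {"AT0": 1.0},
    "dog": {"NN1": 0.9, "VVB": 0.1},
    "barks": {"VVZ": 0.7, "NN2": 0.3},
}
TRANSITIONS = {
    ("<s>", "AT0"): 0.5,
    ("AT0", "NN1"): 0.6, ("AT0", "VVB"): 0.01,
    ("NN1", "VVZ"): 0.3, ("NN1", "NN2"): 0.05,
    ("VVB", "VVZ"): 0.02, ("VVB", "NN2"): 0.1,
}
FLOOR = 1e-6  # assumed probability for tag pairs not listed above


def transition(prev_tag, tag):
    """Log probability of `tag` following `prev_tag`."""
    return math.log(TRANSITIONS.get((prev_tag, tag), FLOOR))


def viterbi(words):
    """Return the most likely tag sequence for a list of known words."""
    # best[tag] = (log score of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for word in words:
        new_best = {}
        for tag, p_tag in LEXICON[word].items():
            # Pick the predecessor tag that maximises the path score.
            prev = max(best, key=lambda p: best[p][0] + transition(p, tag))
            score = best[prev][0] + transition(prev, tag) + math.log(p_tag)
            new_best[tag] = (score, best[prev][1] + [tag])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]
```

Only the best path into each candidate tag is kept at each word, so the work grows linearly with the length of the text rather than with the number of possible tag sequences.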
The main modifications we have made to the Claws system (we are currently
using version 16 of the Claws4 system) are:
- earlier versions of Claws used a tagset which evolved from the tagsets
used to tag the Brown and LOB corpora. The current version of this (which we
call the c6 tagset) has 170 to 180 tags or parts-of-speech, and is being used to
annotate the 2 million word core corpus. For the rest of the BNC we are using a
more restricted (c5) tagset of about 65 tags, eliminating some of the finer
distinctions made in the larger tagset (for instance, in distinctions between
various classes of common noun). Claws has been rewritten so as to be
independent of the particular tagset used, and the appropriate tagset is now
read in before the other resources (lexicon, etc.) are read.
In some cases Claws needs to preserve a distinction between certain tags
in order to perform the disambiguation process adequately, where we do not wish
to maintain the distinction in the final output. In this case, we use what we
call process tags, which are mapped onto a smaller set of output tags in
step 5 above.
- earlier versions of Claws chose a single part-of-speech marker for each
orthographic unit, and (in common with other stochastic tagging systems)
operated at an accuracy rate of about 96%. In order to provide more useful
results in a substantial proportion of the residual words which cannot be
successfully tagged, we have introduced
portmanteau tags. A portmanteau tag is used in a situation where there
is insufficient evidence for Claws to make a clear distinction between two tags.
Thus, in the notoriously difficult choice between a past participle and the past
tense of a verb, if there is insufficient probabilistic evidence to choose
between the two, Claws marks the word as VVN-VVD. A set of fifteen such
portmanteau tags has been declared, covering the major pairs of confusable
tags. Experiments have been done to choose a threshold for each portmanteau
tag, involving a trade-off between reducing tagging accuracy and reducing tag
ambiguity.
- a great deal of effort has been required in interfacing the Claws system
to the SGML mark-up of the input text-files, and in ensuring that the addition
of segment markers is consistent with the other SGML mark-up. The resources used
by Claws (lexicon, etc.) have now been translated from using the LOB notation
for accented letters and other special symbols to using a set of SGML entity
names.
- the lexicons used by the Claws system are in a constant state of
improvement. One major change is that we have incorporated a lexicon of some
four to five thousand proper names (mainly place names, but also common personal
names, etc.).
- we have developed the pattern template idea (which we erroneously
call an idiomlist) very extensively in the current version of Claws. We
now have several such template lists, matched at different stages of the tagging
process; each pattern consists of a sequence of required or optional items, each
of which is a regular expression to match an orthographic unit (with specified
restrictions on typographic case) or one of the potential part-of-speech markers
assigned at an earlier stage. We use this to find sequences of orthographic
units which should be treated as a single grammatical unit (for example "according
to"), as foreign expressions (for example "hoi polloi"), or as place names
or other naming expressions (for example "Ann Arbor" and "the Sunday
Times"). We also use this mechanism to catch particular word and tag patterns
which are commonly mis-tagged, in order to supply the correct tag sequence.
- finally, the most recent version of Claws has been modified to deal with
the spoken section of the BNC. There are supplementary lexicons and lists of
pattern templates for spoken data; we have made an attempt to deal with common
patterns of orthography used to represent non-standard speech (such as truncated
words, and, for example, writing having as `avin'); and Claws
looks for vocalized pauses and repetitions of parts of phrases (thus "we erm,
we stopped going" is disambiguated as if it read simply "we stopped going").
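The portmanteau mechanism described in the list above might be sketched as follows. The sketch assumes that the tagger yields a probability for each candidate tag of a word, and that a portmanteau is emitted when the two most likely candidates form a declared pair and the best probability falls below that pair's threshold. The pairs listed, the threshold values, the decision rule and the ordering of the two tags (more likely tag first) are all illustrative assumptions; the text states only that fifteen pairs were declared and that a threshold was chosen experimentally for each.

```python
# Declared confusable pairs with per-pair thresholds (illustrative values).
PORTMANTEAU_THRESHOLDS = {
    frozenset({"VVN", "VVD"}): 0.80,
    frozenset({"AJ0", "VVN"}): 0.75,
    frozenset({"NN1", "VVB"}): 0.85,
}


def choose_tag(tag_probs):
    """Return a single tag, or a portmanteau such as 'VVN-VVD'.

    tag_probs maps each candidate tag for one word to its probability.
    """
    ranked = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)
    best_tag, best_p = ranked[0]
    if len(ranked) > 1:
        second_tag = ranked[1][0]
        pair = frozenset({best_tag, second_tag})
        threshold = PORTMANTEAU_THRESHOLDS.get(pair)
        # Emit a portmanteau only for declared pairs where the evidence
        # for the best tag is weaker than the pair's threshold.
        if threshold is not None and best_p < threshold:
            return f"{best_tag}-{second_tag}"
    return best_tag
```

Raising a threshold produces more portmanteau tags for that pair (fewer outright errors, more ambiguity); lowering it does the reverse, which is the trade-off the experiments were designed to settle.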
The following is an example of a piece of BNC text with c5 part-of-speech
markers (taken from "Captain Pugwash and the Huge Reward"):
<s c="0000002 002" n=00001>
When&AVQ-CJS; Captain&NP0; Pugwash&NP0; retires&VVZ; from&PRP;
active&AJ0; piracy&NN1; he&PNP; is&VBZ; amazed&AJ0-VVN; and&CJC;
delighted&AJ0-VVN; to&TO0; be&VBI; offered&VVN; a&AT0; Huge&AJ0;
Reward&NN1; for&PRP; what&DTQ; seems&VVZ; to&TO0; be&VBI; a&AT0;
simple&AJ0; task&NN1;.&PUN;
<s c="0000005 022" n=00002>
Little&DT0; does&VDZ; he&PNP; realise&VVI; what&DTQ; villainy&NN1;
and&CJC; treachery&NN1; lurk&NN1-VVB; in&PRP; the&AT0; little&AJ0;
town&NN1; of&PRF; Sinkport&NN1-NP0;,&PUN; or&CJC; what&DTQ; a&AT0;
hideous&AJ0; fate&NN1; may&VM0; await&VVI; him&PNP; there&AV0;.&PUN;
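Text in the notation shown above can be read back into word and tag pairs with a short regular expression. The sketch below is based only on this sample: it assumes that each tag is three upper-case letters or digits, optionally joined to a second such tag by a hyphen, and that a word contains no white space and no ampersand. Words that themselves contain SGML entity references (accented letters, for instance) are not handled, and the `<s ...>` segment markers are simply ignored. The function name `parse_tagged` is hypothetical.

```python
import re

# A token is a run of non-space, non-'&' characters followed by &TAG;
# where TAG is a three-character code or a hyphenated portmanteau pair.
TOKEN = re.compile(r"([^\s&]+)&([A-Z0-9]{3}(?:-[A-Z0-9]{3})?);")


def parse_tagged(line):
    """Return a list of (word, [tags]) pairs from one line of tagged text."""
    return [(word, tag.split("-")) for word, tag in TOKEN.findall(line)]
```

A portmanteau tag yields a two-element tag list, so `lurk&NN1-VVB;` becomes `("lurk", ["NN1", "VVB"])`, while punctuation is returned as an ordinary token with its PUN tag.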
AJ0 adjective (unmarked) (e.g. GOOD, OLD)
AJC comparative adjective (e.g. BETTER, OLDER)
AJS superlative adjective (e.g. BEST, OLDEST)
AT0 article (e.g. THE, A, AN)
AV0 adverb (unmarked) (e.g. OFTEN, WELL, LONGER, FURTHEST)
AVP adverb particle (e.g. UP, OFF, OUT)
AVQ wh-adverb (e.g. WHEN, HOW, WHY)
CJC coordinating conjunction (e.g. AND, OR)
CJS subordinating conjunction (e.g. ALTHOUGH, WHEN)
CJT the conjunction THAT
CRD cardinal numeral (e.g. 3, FIFTY-FIVE, 6609) (excl ONE)
DPS possessive determiner form (e.g. YOUR, THEIR)
DT0 general determiner (e.g. THESE, SOME)
DTQ wh-determiner (e.g. WHOSE, WHICH)
EX0 existential THERE
ITJ interjection or other isolate (e.g. OH, YES, MHM)
NN0 noun (neutral for number) (e.g. AIRCRAFT, DATA)
NN1 singular noun (e.g. PENCIL, GOOSE)
NN2 plural noun (e.g. PENCILS, GEESE)
NNN <<PROCESS TAG>> numeral noun, neutral for number (dozen, hundred)
NNN <<PROCESS TAG>> plural numeral noun (hundreds, thousands)
NNS <<PROCESS TAG>> noun of style (e.g. president, governments, Messrs.)
NP0 proper noun (e.g. LONDON, MICHAEL, MARS)
NUL the null tag (for items not to be tagged)
ORD ordinal (e.g. SIXTH, 77TH, LAST)
PNI indefinite pronoun (e.g. NONE, EVERYTHING)
PNP personal pronoun (e.g. YOU, THEM, OURS)
PNQ wh-pronoun (e.g. WHO, WHOEVER)
PNX reflexive pronoun (e.g. ITSELF, OURSELVES)
POS the possessive (or genitive morpheme) 'S or '
PRF the preposition OF
PRP preposition (except for OF) (e.g. FOR, ABOVE, TO)
PUL punctuation - left bracket (i.e. ( or [ )
PUN punctuation - general mark (i.e. . ! , : ; - ? ... )
PUQ punctuation - quotation mark (i.e. ` ' " )
PUR punctuation - right bracket (i.e. ) or ] )
TO0 infinitive marker TO
UNC "unclassified" items which are not words of the English lexicon
VBB the "base forms" of the verb "BE" (except the infinitive), i.e. AM, ARE
VBD past form of the verb "BE", i.e. WAS, WERE
VBG -ing form of the verb "BE", i.e. BEING
VBI infinitive of the verb "BE"
VBN past participle of the verb "BE", i.e. BEEN
VBZ -s form of the verb "BE", i.e. IS, 'S
VDB base form of the verb "DO" (except the infinitive), i.e. DO
VDD past form of the verb "DO", i.e. DID
VDG -ing form of the verb "DO", i.e. DOING
VDI infinitive of the verb "DO"
VDN past participle of the verb "DO", i.e. DONE
VDZ -s form of the verb "DO", i.e. DOES
VHB base form of the verb "HAVE" (except the infinitive), i.e. HAVE
VHD past tense form of the verb "HAVE", i.e. HAD, 'D
VHG -ing form of the verb "HAVE", i.e. HAVING
VHI infinitive of the verb "HAVE"
VHN past participle of the verb "HAVE", i.e. HAD
VHZ -s form of the verb "HAVE", i.e. HAS, 'S
VM0 modal auxiliary verb (e.g. CAN, COULD, WILL, 'LL)
VVB base form of lexical verb (except the infinitive) (e.g. TAKE, LIVE)
VVD past tense form of lexical verb (e.g. TOOK, LIVED)
VVG -ing form of lexical verb (e.g. TAKING, LIVING)
VVI infinitive of lexical verb
VVN past participle form of lexical verb (e.g. TAKEN, LIVED)
VVZ -s form of lexical verb (e.g. TAKES, LIVES)
XX0 the negative NOT or N'T
ZZ0 alphabetical symbol (e.g. A, B, c, d)
Garside, R.G. (1993). The Large-scale Production of
Syntactically-analysed Corpora, Literary and Linguistic Computing, 8:
39-46.
Garside, R.G., Leech, G.N., and Sampson, G.R. (eds)
(1987). The Computational Analysis of English: A Corpus-based Approach.
Longman, London.
Leech, G.N., and Garside, R.G. (1991). Running a Grammar
Factory: the Production of Syntactically Analysed Corpora or `Treebanks'. In
English Computer Corpora: Selected Papers and Research Guide edited by
S. Johansson and A. Stenstrom. Mouton de Gruyter, Berlin.
Marshall, I. (1983). Choice of Grammatical Word-class
without Global Syntactic Analysis: Tagging Words in the LOB Corpus, Computers
and the Humanities, 17: 139-50.