[bnc] Linguistic Annotation of Texts ("tagging") - Designing and Creating the BNC

Linguistic Annotation of Texts ("tagging")

Every one of the 100 million words in the BNC carries a grammatical tag, that is, a label indicating its part of speech. In addition, each text is divided into sentence-like segments. This annotation was carried out at the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University, using the CLAWS4 automatic tagger developed by Roger Garside at Lancaster. Full details of the procedure are given in two articles, 'Using CLAWS to annotate the BNC' and 'CLAWS4: The tagging of the British National Corpus'.

The basic BNC tagset (known as C5) distinguishes 61 categories found in most "traditional" grammars, such as adjectives, articles, adverbs, conjunctions, determiners, nouns, and verbs. Tags are also attached to major punctuation marks, indicating their function.
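As a rough illustration of what word-level tagging produces, the sketch below pairs each token of a short sentence with a C5 tag code. Only five of the 61 codes are listed, the sentence and its tags are hand-made for this example rather than taken from the corpus, and the names `C5_SUBSET`, `tagged` and `describe` are hypothetical.

```python
# Illustrative subset of C5 tag codes; the full tagset has 61 categories.
C5_SUBSET = {
    "AT0": "article",
    "AJ0": "adjective (unmarked)",
    "NN1": "singular common noun",
    "VVZ": "-s form of lexical verb",
    "PUN": "general punctuation mark",
}

# Hypothetical hand-tagged sentence, not drawn from the BNC itself.
tagged = [
    ("The", "AT0"),
    ("old", "AJ0"),
    ("dog", "NN1"),
    ("barks", "VVZ"),
    (".", "PUN"),
]


def describe(tokens):
    """Return one 'word/TAG (description)' string per token."""
    return [f"{word}/{tag} ({C5_SUBSET[tag]})" for word, tag in tokens]


if __name__ == "__main__":
    for line in describe(tagged):
        print(line)
```

Note that the full stop receives its own tag, reflecting the fact that punctuation marks are tagged alongside words.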

This automatic procedure has an error rate of around 1.7%. In addition, about 4.7% of words could not be assigned unambiguously to a single category. To overcome these problems, a 2% sample of the corpus was manually post-edited using an enriched BNC tagset known as C7, which distinguishes over 160 categories; the resulting error rate is much lower (less than 0.3%). This core corpus, the BNC Sampler, is being made available as a distinct subcorpus: it contains the same mixture of text types as the BNC proper, but with a larger proportion (approximately 50%) of spoken texts. Full details of both schemes for linguistic annotation are given in A Brief Users' Guide to the Grammatical Tagging of the BNC and the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging.
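To give a sense of scale, the percentages above can be turned into approximate word counts. The sketch below assumes a round figure of exactly 100 million words and treats the quoted rates as exact, so the results are order-of-magnitude estimates only; the constant and function names are hypothetical.

```python
# Figures quoted in the text, treated here as exact for illustration.
CORPUS_WORDS = 100_000_000   # "100 million words" (a round figure)
ERROR_RATE = 0.017           # around 1.7% of words mis-tagged
AMBIGUOUS_RATE = 0.047       # about 4.7% of words left ambiguous
SAMPLER_FRACTION = 0.02      # 2% sample that was manually post-edited
SAMPLER_ERROR_RATE = 0.003   # upper bound: "less than 0.3%"


def estimate(words, rate):
    """Estimated number of words affected at the given rate."""
    return round(words * rate)


mis_tagged = estimate(CORPUS_WORDS, ERROR_RATE)
ambiguous = estimate(CORPUS_WORDS, AMBIGUOUS_RATE)
sampler_words = estimate(CORPUS_WORDS, SAMPLER_FRACTION)
sampler_errors_max = estimate(sampler_words, SAMPLER_ERROR_RATE)

if __name__ == "__main__":
    print(f"Mis-tagged words (automatic):  ~{mis_tagged:,}")
    print(f"Ambiguously tagged words:      ~{ambiguous:,}")
    print(f"Words in the 2% sample:        ~{sampler_words:,}")
    print(f"Errors remaining in sample:    <{sampler_errors_max:,}")
```

On these assumptions, automatic tagging leaves roughly 1.7 million words mis-tagged and 4.7 million ambiguous across the full corpus, whereas the post-edited sample of about 2 million words contains fewer than 6,000 errors.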
