BNC2 POS-tagging Manual

POS-tagging Error Rates

The purpose of this document is to report on the accuracy of the output of the improved tagging programs.

[ Related documents: Introduction to the Manual | Guidelines to Wordclass Tagging | Automatic tagging of the BNC ]


1. Levels of estimation

Based on the findings from the 50,000-word test sample, the estimated ambiguity and error rates for the BNC are shown below in three different degrees of detail.

(a) First, as a general assessment of accuracy, the estimated rates are given for the whole corpus. (See Table 1 below.)

(b) Secondly, separate estimates of ambiguity rates and error rates are given for each of the 57 word tags in the corpus. This will enable users of the corpus to assign appropriate degrees of reliability to each tag. Some tags are always correct; others are quite often erroneous. For example, the tag VDD stands for a single form of the verb do: the form did. Since the spelling did is unambiguous, the chances of ambiguity or error in the use of the tag VDD are virtually nil. On the other hand, the tag VVB (base finite form of a lexical verb) is not only quite frequent, but also highly prone to ambiguity and error: 15 per cent of the occurrences of VVB are errors, a much higher error rate than for any other tag. (See Table 2 below.)

(c) Thirdly, separate estimates of ambiguity rates and error rates are given for "wrong-tag/right-tag" pairings XXX, YYY, consisting of (i) the actually-occurring erroneous tag XXX, and (ii) the correct tag YYY which should have occurred in its place. However, because the number of possible tag-pairs is large (57² = 3,249), and most of these tag-pairs have few or no errors, only the more common pairings of erroneous tag and correct tag are separately listed, with their estimated probability of occurrence. This list of tag-pairings will help users further, enabling them to estimate not merely the reliability of a tag but also, if that tag is incorrect, the likelihood that the correct tag would have been some other particular tag. In this way, the frequency of grammatical word classes, or of individual words in those classes, can be estimated more accurately for the whole BNC. (See Table 3 below.)
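To illustrate this third use of the tables, a user might adjust raw tag frequencies along the following lines. This is a minimal Python sketch of our own, using figures from Tables 2 and 3 below; it illustrates the idea and is not part of the tagging software.

    # Illustrative sketch: adjusting observed tag frequencies with the
    # tag-pair error estimates of Table 3 (column (5), as fractions).
    MIS_TAG_RATES = {
        ("VVB", "VVI"): 0.098,   # 9.8% of VVB tags are really VVI
        ("VVD", "VVN"): 0.045,   # 4.5% of VVD tags are really VVN
    }

    # Observed single-tag counts in the 50,000-word sample (Table 2, col (b)).
    OBSERVED = {"VVB": 560, "VVD": 970, "VVI": 1211, "VVN": 1086}

    def corrected_count(tag):
        """Estimate the true frequency of `tag`: subtract its expected
        mis-uses, and add its expected occurrences hidden under other tags."""
        n = OBSERVED[tag]
        for (wrong, right), rate in MIS_TAG_RATES.items():
            if wrong == tag:
                n -= rate * OBSERVED[wrong]  # tag used where `right` was correct
            if right == tag:
                n += rate * OBSERVED[wrong]  # tag hidden under `wrong`
        return n

    print(round(corrected_count("VVI")))  # 1211 + 0.098 * 560 = ~1266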

2. Presentation of Ambiguity Rates and Error Rates (fine-grained mode of calculation)

In this section, we examine ambiguities and errors using a "fine-grained" mode of calculation, treating each error as of equal importance to any other error. In section 3, we look at the same data in terms of a "coarse-grained" mode of calculation, ignoring errors and ambiguities which involve subcategories of the same part of speech.

2.1 Overall estimated ambiguity and error rates: based on the 50,000-word sample

As the following table shows, the ambiguity rate varies considerably between written and spoken texts. (However, note that the calculation for speech is based on a small sample of 5,000 words.)

Table 1: Estimated ambiguity and error rates for the whole corpus (fine-grained calculation)

                 Sample tag count   Ambiguity rate (%)   Error rate (%)
Written texts    45,000             3.83%                1.14%
Spoken texts      5,000             3.00%                1.17%
All texts        50,000             3.75%                1.15%

It will be noted that written texts on the whole have a higher ambiguity rate, whereas spoken texts have a slightly greater error rate.

The success of an automatic tagger is sometimes represented in terms of the information-retrieval measures of precision and recall, rather than ambiguity rate and error rate as in Table 1. Precision is the extent to which incorrect tags are successfully discarded from the output. Recall is the extent to which all correct tags are successfully retained in the output of the tagger, allowing, however, for more than one reading to occur for one word (i.e. ambiguous tagging is permitted). According to these measures, the success of the tagging is as follows:

                 Precision   Recall
Written texts    96.17%      98.86%
Spoken texts     97.00%      98.83%
All texts        96.25%      98.85%

However, from now on we will continue to use "ambiguity rate" and "error rate", which appear to us more transparent.
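For reference, the two presentations are related in a simple way: under the definitions used here, precision is 100 minus the ambiguity rate, and recall is 100 minus the error rate, as the figures above confirm. A minimal Python sketch of the conversion (our own illustration, not an official formula):

    # Converting Table 1's rates into the precision/recall presentation.
    def precision_recall(ambiguity_rate, error_rate):
        """Arguments and results are percentages."""
        precision = 100.0 - ambiguity_rate  # surplus second tags not discarded
        recall = 100.0 - error_rate         # correct tags missing from the output
        return precision, recall

    print(precision_recall(3.83, 1.14))  # written texts: (96.17, 98.86)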

2.2 Estimated ambiguity and error rates for each tag (fine-grained mode of calculation)

The estimates for individual tags are again based on the 50,000-word sample, and the ambiguity rate for each tag is based on the number of ambiguity tags which begin with that tag. The table also specifies the estimated likelihood that a given tag, in the first position of an ambiguity tag, is the correct tag.

Table 2: Estimated ambiguity rates and error rates (by tag) (fine-grained calculation)

In Table 2, column (b) shows the overall frequency of particular tags (not including ambiguity tags). Column (c) gives the overall occurrence of ambiguity tags, as well as of particular ambiguity tags, beginning with a given tag. (Ambiguity tags marked * are less "serious", in that they apply to two subcategories of the same part of speech, such as the past tense and past participle of the verb - see 3.1 below.) Column (d) shows which tags are more or less likely to be found as the first part of an ambiguity tag. For example, both NP0 and VVG have an especially high incidence of ambiguity tags. Column (e) tells us, given that we have observed an ambiguity tag, how likely it is that the first tag is correct. Overall, there is more than a 3-to-1 chance that the first tag will be correct; but there are some exceptions, where the chances of the first tag's being correct are much lower: for example, PNI (indefinite pronoun). [Note that columns (f) and (g) exclude errors where the first tag of an ambiguity tag is wrong; contrast Table 5, and Table 6 column (c), in section 3.2 below.]
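In computational terms, the derived columns amount to the following (a minimal Python sketch of our own, using the AJ0 row of the table below):

    # Derived columns of Table 2 for one tag (AJ0 row; our own sketch).
    single_tags = 3412     # col (b): single-tag count
    ambiguity_tags = 338   # col (c): ambiguity-tag count
    first_correct = 282    # col (e): ambiguity tags whose first tag is correct
    errors = 46            # col (f): error count (first tags excluded, see note)

    ambiguity_rate = 100 * ambiguity_tags / (single_tags + ambiguity_tags)  # (d): 9.01%
    first_tag_share = 100 * first_correct / ambiguity_tags                  # (e): 83.43%
    error_rate = 100 * errors / single_tags                                 # (g): 1.35%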

(a) Tag   (b) Single-tag   (c) Ambiguity-tag   (d) Ambiguity rate   (e) 1st tag of ambiguity   (f) Error   (g) Error rate
          count            count               (%)  (c / (b + c))   tag correct (% of all      count       (%)  (f / b)
                                                                    ambiguity tags)

(Both counts are out of the 50,000-word sample.)

AJ0       3412             all 338             9.01%                282 (83.43%)               46          1.35%
            (AJ0-AV0 48) (AJ0-NN1 209) (AJ0-VVD 21) (AJ0-VVG 28) (AJ0-VVN 32)
AJC       142              -                   0.0%                 -                          4           2.82%
AJS       26               -                   0.0%                 -                          2           7.69%
AT0       4351             -                   0.0%                 -                          2           0.05%
AV0       2450             all 45              1.80%                37 (82.22%)                57          2.33%
            (AV0-AJ0 45)
AVP       379              all 44              10.40%               34 (77.27%)                6           1.58%
            (AVP-PRP 44)
AVQ       157              all 10              5.99%                10 (100.00%)               9           5.73%
            (AVQ-CJS 10)
CJC       1915             -                   0.0%                 -                          3           0.16%
CJS       692              all 39              5.34%                30 (76.92%)                18          2.60%
            (CJS-AVQ 26) (CJS-PRP 13)
CJT       236              all 28              10.61%               -                          3           1.27%
            (CJT-DT0 28)
CRD       940              all 1               0.11%                0 (0.00%)                  0           0.00%
            (CRD-PNI 1)
DPS       787              -                   0.0%                 -                          0           0.00%
DT0       1180             all 20              1.67%                16 (80.00%)                19          1.61%
            (DT0-CJT 20)
DTQ       370              -                   0.0%                 -                          0           0.00%
EX0       131              -                   0.0%                 -                          1           0.76%
ITJ       214              -                   0.0%                 -                          2           0.93%
NN0       270              -                   0.0%                 -                          10          3.70%
NN1       7198             all 514             6.66%                395 (76.84%)               86          1.19%
            (NN1-AJ0 130) (NN1-NP0 92)* (NN1-VVB 243) (NN1-VVG 49)
NN2       2718             all 55              1.98%                48 (87.27%)                30          1.10%
            (NN2-VVZ 55)
NP0       1385             all 264             16.01%               224 (84.84%)               31          2.24%
            (NP0-NN1 264)*
ORD       136              -                   0.0%                 -                          0           0.00%
PNI       159              all 8               4.79%                3 (37.50%)                 5           3.14%
            (PNI-CRD 8)
PNP       2646             -                   0.0%                 -                          0           0.00%
PNQ       112              -                   0.0%                 -                          0           0.00%
PNX       84               -                   0.0%                 -                          0           0.00%
POS       217              -                   0.0%                 -                          5           2.30%
PRF       1615             -                   0.0%                 -                          0           0.00%
PRP       4051             all 166             3.94%                154 (92.77%)               24          0.59%
            (PRP-AVP 132) (PRP-CJS 34)
TO0       819              -                   0.0%                 -                          6           0.73%
UNC       158              -                   0.0%                 -                          4           2.53%
VBB       328              -                   0.0%                 -                          1           0.30%
VBD       663              -                   0.0%                 -                          0           0.00%
VBG       37               -                   0.0%                 -                          0           0.00%
VBI       374              -                   0.0%                 -                          0           0.00%
VBN       133              -                   0.0%                 -                          0           0.00%
VBZ       640              -                   0.0%                 -                          4           0.63%
VDB       87               -                   0.0%                 -                          0           0.00%
VDD       71               -                   0.0%                 -                          0           0.00%
VDG       10               -                   0.0%                 -                          0           0.00%
VDI       36               -                   0.0%                 -                          0           0.00%
VDN       20               -                   0.0%                 -                          0           0.00%
VDZ       22               -                   0.0%                 -                          0           0.00%
VHB       150              -                   0.0%                 -                          1           0.67%
VHD       258              -                   0.0%                 -                          0           0.00%
VHG       16               -                   0.0%                 -                          0           0.00%
VHI       119              -                   0.0%                 -                          0           0.00%
VHN       9                -                   0.0%                 -                          0           0.00%
VHZ       116              -                   0.0%                 -                          1           0.86%
VM0       782              -                   0.0%                 -                          3           0.38%
VVB       560              all 84              13.04%               56 (66.67%)                84          15.00%
            (VVB-NN1 84)
VVD       970              all 90              8.49%                62 (58.89%)                50          5.15%
            (VVD-AJ0 11) (VVD-VVN 79)*
VVG       597              all 132             18.11%               112 (84.84%)               9           1.51%
            (VVG-AJ0 83) (VVG-NN1 49)
VVI       1211             -                   0.0%                 -                          7           0.58%
VVN       1086             all 158             12.70%               113 (71.52%)               27          2.49%
            (VVN-AJ0 50) (VVN-VVD 108)*
VVZ       295              all 26              8.10%                14 (53.85%)                11          3.73%
            (VVZ-NN2 26)
XX0       363              -                   0.0%                 -                          0           0.00%
ZZ0       75               -                   0.0%                 -                          3           4.00%

 

2.3 Estimated error rates specifying the incorrect tag and the correct tag (fine-grained calculation)

The next table, Table 3, gives the frequency, as a percentage, of error-prone tag-pairs, where XXX is the incorrect tag and YYY is the correct tag which should have occurred in its place. The third column lists the number of occurrences of the specified error-type, as a frequency count from the sample of 50,000 words. In the fourth column, this is expressed as a percentage of all the tagging errors of word category XXX (Table 2, column (f)). The fifth column answers the question: if tag XXX occurs, what is the likelihood that it is an error for tag YYY? Error-types with fewer than 5 occurrences (i.e. fewer than 1 in 10,000 words) are ignored. Hence Table 3 is not exhaustive: only the more likely error-types are listed. In the second column, we add, where useful, the individual words which trigger these errors.

Table 3: Estimated frequency of selected tag-pairs, where XXX is the incorrect tag, and YYY is the correct one

(1) Incorrect   (2) Correct tag YYY    (3) No. of occurrences   (4) % of all incorrect       (5) % of all tags XXX
tag XXX                                of this error type       uses of tag XXX              (col 3 / Table 2 col (b))
                                                                (col 3 / Table 2 col (f))

AJ0             AV0                    12                       26.1%                        0.4%
                NN1                    12                       26.1%                        0.4%
                NP0                    5                        10.9%                        0.1%
                VVN                    8                        17.4%                        0.2%
AV0             AJ0                    6                        10.5%                        0.2%
                AJC                    8                        14.0%                        0.3%
                DT0                    24                       42.1%                        1.0%
                EX0 (there)            5                        8.8%                         0.2%
                PRP                    5                        8.8%                         0.2%
AVQ             CJS (when, where)      6                        66.7%                        3.8%
CJS             PRP                    10                       55.6%                        1.4%
DT0             AV0                    15                       78.9%                        1.3%
NN1             AJ0                    13                       15.1%                        0.2%
                NN0*                   8                        9.3%                         0.1%
                NP0*                   22                       25.6%                        0.3%
                UNC                    9                        10.5%                        0.2%
                VVI                    13                       15.1%                        0.2%
NN2             NP0*                   14                       46.7%                        0.5%
NP0             NN1*                   10                       32.3%                        0.7%
                NN0*                   5                        16.1%                        0.4%
PRP             AV0                    7                        29.2%                        0.2%
                AVP                    5                        20.8%                        0.1%
TO0             PRP (to)               6                        100.0%                       0.7%
VVB             AJ0                    7                        8.3%                         1.3%
                NN1                    7                        8.3%                         1.3%
                VVI*                   55                       65.5%                        9.8%
VVD             AJ0                    6                        12.0%                        0.6%
                VVN*                   44                       88.0%                        4.5%
VVG             NN1                    9                        100.0%                       1.5%
VVI             NN1                    5                        71.4%                        0.4%
VVN             AJ0                    7                        25.9%                        0.6%
                VVD*                   17                       63.0%                        1.6%
VVZ             NN2                    8                        72.7%                        2.7%

As before, the asterisk * indicates a "less serious" error, in which the erroneous and correct tags belong to the same major category or part of speech. As the table shows, the most frequent specific error types are within the verb category: VVB for VVI (55, or 9.8% of all VVB tags) and VVD for VVN (44, or 4.5% of all VVD tags).
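The derivation of columns (4) and (5) can be sketched for the largest of these error types as follows (a minimal Python illustration of our own, with figures taken from Tables 2 and 3):

    # Columns (4) and (5) of Table 3 for the pair VVB -> VVI (our own sketch).
    vvb_tags = 560      # Table 2, col (b): occurrences of the tag VVB
    vvb_errors = 84     # Table 2, col (f): erroneous uses of VVB
    vvb_for_vvi = 55    # Table 3, col (3): VVB used where VVI was correct

    share_of_vvb_errors = 100 * vvb_for_vvi / vvb_errors  # col (4): ~65.5%
    share_of_all_vvb = 100 * vvb_for_vvi / vvb_tags       # col (5): ~9.8%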

3. A further mode of calculation: ignoring subcategories of the same part of speech

3.1 Presentation of Ambiguity and Error Rates (coarse-grained calculation)

A further way of looking at the ambiguities and errors in the corpus is to make a coarse-grained calculation in counting these phenomena. In the fine-grained measurement, which is the one assumed up to now, each tag is considered to define its own word class, different from all other word classes. Using the coarse-grained calculation, on the other hand, we consider words to belong to different word classes (parts of speech) only when the major category is different. If we consider the pair NN1 (singular common noun) and NP0 (proper noun), the coarse-grained calculation says that the ambiguity tag NN1-NP0 or NP0-NN1 does not show tagging uncertainty, since both the proposed tags agree in categorizing the word as the same part of speech (a noun). So this does not add to the ambiguity rate. Similarly, the coarse-grained view of error is that, if a word is tagged as NN1 when it should be NP0, or vice versa, this is not an error, because both tags are within the noun category. To summarize: in the fine-grained calculation, minor differences of word class count towards the ambiguity and error rates; in the coarse-grained calculation, they do not.
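The test underlying the coarse-grained calculation can be sketched as follows. This is a minimal Python illustration of our own: the category grouping shown covers only the tags discussed in this document, and a full version would map all 57 tags.

    # Coarse-grained test: a tag pair is ignored when both tags share a
    # major category. The mapping is a partial illustration, not the
    # official list.
    MAJOR_CATEGORY = {
        "NN0": "noun", "NN1": "noun", "NN2": "noun", "NP0": "noun",
        "VVB": "verb", "VVD": "verb", "VVG": "verb",
        "VVI": "verb", "VVN": "verb", "VVZ": "verb",
        "AJ0": "adjective", "AV0": "adverb",
    }

    def same_major_category(tag_a, tag_b):
        """True if an ambiguity or error involving these two tags is
        ignored under the coarse-grained calculation."""
        cat_a = MAJOR_CATEGORY.get(tag_a)
        return cat_a is not None and cat_a == MAJOR_CATEGORY.get(tag_b)

    print(same_major_category("NN1", "NP0"))  # True: both nouns
    print(same_major_category("AJ0", "AV0"))  # False: adjective vs adverb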

In this section, the same calculations are made as in section 2, except that errors and ambiguities which are confined within a major category (noun, verb, etc.) are ignored. In practice, most of the errors and ambiguities of this kind come from the difficulty the tagger has in distinguishing NN1 (singular common noun) from NP0 (proper noun), VVD (past tense lexical verb) from VVN (past participle lexical verb), and VVB (finite present tense base form of a lexical verb) from VVI (infinitive lexical verb). Thus the ambiguity tags NN1-NP0, VVD-VVN and their mirror images do not occur in the relevant table (Table 5) below. However, since there are no ambiguity tags for VVB and VVI, the problem of distinguishing these two shows up only in the error calculation. The tables in this section parallel those above: Tables 4 and 5 correspond to Table 1, Table 6 to Table 2, and Table 7 to Table 3:

Table 4: Estimated ambiguity and error rates for the whole corpus (coarse-grained calculation)

                 Sample tag count   Ambiguity rate (%)   Error rate (%)
Written texts    45,000             2.78%                0.69%
Spoken texts      5,000             2.67%                0.87%
All texts        50,000             2.77%                0.71%

It will be noted from Table 4 that this method of calculation reduces the overall ambiguity rate by c.1 per cent, and the overall error rate by c.0.5 per cent. We will not present coarse-grained tables corresponding to Tables 2 and 3 above: these tables would be unchanged from the fine-grained calculation, except that the rows marked with an asterisk (*) would be deleted, and the other calculations changed as necessary.

 

3.2 Different modes of calculation: eliminating ambiguities

Given that the elimination of all errors was beyond our capability within the available time frame and budget, the corpus in its present form, containing ambiguity tags as well as a small proportion of errors, is designed for what we believe will be the most common type of user: one who finds it easier to tolerate ambiguity than error. However, other users may prefer a corpus which contains no ambiguities, even though its error rate is higher. For this latter type of user, the present corpus is easy to reinterpret as a corpus free of ambiguities: simply delete or ignore the second tag of any ambiguity tag, and accept the first tag as the only one. In what follows, we therefore allow two modes of calculation: in addition to the "safer" mode, in which ambiguities are allowed and consequently errors are relatively few, we allow a "riskier" mode in which ambiguities are abolished and errors are more frequent. In fact, if ambiguity tags are eliminated, the overall error rate rises to almost 2 per cent.
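Mechanically, the conversion to the "riskier" corpus is trivial, as the following sketch shows (our own Python illustration; the BNC itself encodes tags in its own markup rather than as plain strings):

    # "Riskier" mode: keep only the first (more probable) tag of each
    # ambiguity tag; single tags pass through unchanged.
    def first_tag(tag):
        """'AJ0-NN1' -> 'AJ0'."""
        return tag.split("-")[0]

    tagged = [("broken", "AJ0-VVN"), ("vase", "NN1")]
    riskier = [(word, first_tag(tag)) for word, tag in tagged]
    print(riskier)  # [('broken', 'AJ0'), ('vase', 'NN1')]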

Table 5: Estimated error rates for the whole corpus after ambiguities have been automatically eliminated (assuming the first part of each ambiguity tag to be correct)

                 Sample tag count   Error rate (%)
Written texts    45,000             2.01%
Spoken texts      5,000             1.92%
All texts        50,000             2.00%

The following table gives an error count (c) for each tag: i.e. the number of errors in the 50,000-word sample where that tag was the erroneous tag. [Cf. the "safer" error count in Table 2, column (f).] In addition, each tag has a correction count (d): i.e. the number of erroneous tags for which that tag was the correct tag. If we subtract the error count (c) from the tag count (b), and add the correction count (d) to the result, we arrive at the "real tag count" (e), representing the number of occurrences of that tag in the corrected sample corpus. Not included in the table is the small number of "multi-word" errors which resulted in two tags being replaced by one (error count), or one tag being replaced by two (correction count), owing to the incorrect non-use or use of multi-word tags. The last column divides the error count by the tag count to give the error rate (as a percentage).
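The two derived columns follow directly from counts (b), (c) and (d), as in this minimal sketch of our own (using the VVB row of Table 6 below):

    # Derived columns of Table 6 for the VVB row (our own sketch).
    tag_count = 644         # col (b)
    error_count = 112       # col (c): VVB used where another tag was correct
    correction_count = 13   # col (d): another tag used where VVB was correct

    real_tag_count = tag_count - error_count + correction_count  # col (e): 545
    error_rate = 100 * error_count / tag_count                   # col (f): 17.39%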

Table 6: Estimated error rates (by tag) after the second tags of ambiguity tags have been automatically eliminated

(a) Tag   (b) Tag count   (c) Error count   (d) Correction count   (e) Real tag count   (f) Error rate (%)
                                                                   (b - c + d)          (c / b) x 100

AJ0       3750            102               (132)                  3780                 2.72%
AJC        142              4                (12)                   150                 2.82%
AJS         26              2                 (0)                    24                 7.69%
AT0       4351              2                 (3)                  4352                 0.05%
AV0       2495             65                (67)                  2497                 2.61%
AVP        423             16                (17)                   424                 3.78%
AVQ        167              9                 (6)                   164                 5.39%
CJC       1915              3                 (1)                  1913                 0.16%
CJS        731             27                 (5)                   709                 3.69%
CJT        264              3                (15)                   276                 1.14%
CRD        940              1                (11)                   950                 0.11%
DPS        787              0                 (0)                   787                 0.00%
DT0       1200             23                (29)                  1206                 1.92%
DTQ        370              0                 (0)                   370                 0.00%
EX0        131              1                 (5)                   135                 0.76%
ITJ        214              2                 (2)                   214                 0.93%
NN0        270             10                (16)                   276                 3.70%
NN1       7712            205               (152)                  7659                 2.66%
NN2       2773             37                (29)                  2765                 1.33%
ORD        136              0                 (2)                   138                 0.00%
NP0       1649             71               (102)                  1680                 4.31%
PNI        167             10                 (1)                   158                 5.99%
PNP       2646              0                 (1)                  2647                 0.00%
PNQ        112              0                 (0)                   112                 0.00%
PNX         84              0                 (1)                    85                 0.00%
POS        217              5                 (6)                   218                 2.30%
PRF       1615              0                 (0)                  1615                 0.00%
PRP       4217             36                (45)                  4226                 0.85%
TO0        819              6                 (1)                   814                 0.73%
UNC        158              4                (29)                   183                 2.53%
VBB        328              1                 (0)                   327                 0.30%
VBD        663              0                 (0)                   663                 0.00%
VBG         37              0                 (0)                    37                 0.00%
VBI        374              0                 (0)                   374                 0.00%
VBN        133              0                 (0)                   133                 0.00%
VBZ        640              4                 (5)                   641                 0.63%
VDB         87              0                 (0)                    87                 0.00%
VDD         71              0                 (0)                    71                 0.00%
VDG         10              0                 (0)                    10                 0.00%
VDI         36              0                 (0)                    36                 0.00%
VDN         20              0                 (0)                    20                 0.00%
VDZ         22              0                 (0)                    22                 0.00%
VHB        150              1                 (0)                   149                 0.67%
VHD        258              0                 (0)                   258                 0.00%
VHG         16              0                 (0)                    16                 0.00%
VHI        119              0                 (1)                   120                 0.00%
VHN          9              0                 (0)                     9                 0.00%
VHZ        116              1                 (0)                   115                 0.86%
VM0        782              3                 (0)                   779                 0.38%
VVB        644            112                (13)                   545                 17.39%
VVD       1060             78                (60)                  1042                 7.36%
VVG        729             29                (29)                   729                 3.98%
VVI       1211              7                (73)                  1277                 0.57%
VVN       1244             72                (87)                  1259                 5.79%
VVZ        321             23                (12)                   310                 7.17%
XX0        363              0                 (0)                   363                 0.00%
ZZ0         75              3                 (4)                    76                 4.00%

It is clear from this table that the amount of error in the tagging of the corpus varies greatly from one tag to another. The most error-prone tag, by a large margin, is VVB, with an error rate of more than 17 per cent, while many of the tags are associated with no errors at all, and well over half the tags have an error rate of less than 1 per cent.

The final table, Table 7, gives figures for the third level of detail, where we itemise individual tag pairs XXX, YYY, where XXX is the incorrect tag, and YYY is the correct one which should have appeared but did not. Only those pairings which account for 5 or more errors are listed. This table differs from Table 3 in that here the second tags of ambiguity tags are not taken into account ("riskier mode"). It will be seen that the errors which occur tend to fall into a relatively small number of major categories.

Table 7: Estimated frequency of selected tag-pairs, where XXX is the incorrect tag, and YYY is the correct one (after the second tags of ambiguity tags have been eliminated automatically) based on the sample of 50,000 words

(1) Incorrect   (2) Correct tag YYY       (3) No. of occurrences   (4) % of all incorrect       (5) % of all tags XXX
tag XXX                                   of this error type       uses of tag XXX              (col 3 / Table 6 col (b))
                                                                   (col 3 / Table 6 col (c))

AJ0             AV0                       22                       21.57%                       0.59%
                NN1                       41                       40.19%                       1.09%
                NP0                       5                        4.90%                        0.13%
                VVG                       14                       13.73%                       0.37%
                VVN                       14                       13.73%                       0.37%
AV0             AJ0                       9                        13.85%                       0.36%
                AJC                       8                        12.31%                       0.32%
                DT0                       26                       40.00%                       1.04%
                EX0 (there)               5                        7.69%                        0.20%
                PRP                       6                        9.23%                        0.24%
AVP             CJT                       6                        37.50%                       1.42%
AVQ             CJS (when, where)         6                        66.67%                       3.59%
CJS             PRP                       15                       55.56%                       2.05%
DT0             AV0 (much, more, etc.)    15                       65.22%                       1.25%
NN1             AJ0                       63                       30.73%                       0.82%
                NN0                       8                        3.90%                        0.10%
                NP0                       74                       36.10%                       0.96%
                UNC                       9                        4.39%                        0.12%
                VVB                       9                        4.39%                        0.12%
                VVG                       13                       6.34%                        0.17%
                VVI                       13                       6.34%                        0.17%
NN2             NP0                       14                       37.84%                       0.50%
                UNC                       9                        24.32%                       0.32%
                VVZ                       10                       27.03%                       0.36%
NN0             UNC                       7                        70.00%                       2.59%
NP0             NN1                       50                       70.42%                       3.03%
                NN2                       5                        7.04%                        0.30%
PNI             CRD (one)                 9                        90.00%                       5.39%
PRP             AV0                       8                        22.22%                       0.19%
TO0             PRP (to)                  6                        100.00%                      0.73%
VVB             AJ0                       7                        6.25%                        1.09%
                NN1                       35                       31.25%                       5.43%
                VVI                       55                       49.11%                       8.54%
                VVN                       5                        4.46%                        0.78%
VVD             AJ0                       14                       17.95%                       1.32%
                VVN                       64                       82.05%                       6.04%
VVG             AJ0                       11                       37.93%                       1.51%
                NN1                       18                       62.07%                       2.47%
VVI             NN1                       5                        71.43%                       0.41%
VVZ             NN2                       20                       86.96%                       6.23%

Some of the error types above are associated with one or two particular words, and where these occur they are listed. For example, the AV0 for EX0 type of error occurs invariably with the single word there.


Appendix:

List of text samples used for the manually-conducted 50,000-word error analysis

Each sample consisted of 2,000 words from each of the BNC texts listed below, except that two samples, one of written and one of spoken English, consisted of 1,000 words only; these are marked "*" in the list. Half-length samples were used in these two cases to keep the proportion of written to spoken data at 90% to 10%, matching the proportions of the BNC as a whole. The BNC text files are cited by the three-character code used in the BNC Users Reference Guide.

Written: Imaginative writing G0S, ADY, H7P, GW0, FSF

Written: Informative writing

Natural Science JXN

Applied Science HWV, CEG

Social Science CLH, EE8, *A6Y

World Affairs A4J, CMT, EE2, EB7

Commerce and finance HGP, B27

Arts C9U, G1N

Belief and Thought CA9

Leisure EX0, ADR, CE4

Spoken: Demographic KBG

Context-governed D8Y, *FXH



Date: 17 March 2000