BNC2 POS-tagging Manual

POS-tagging Error Rates

The purpose of this document is to report on the accuracy of the output of the improved tagging programs.

[ Related documents: Introduction to the Manual | Guidelines to Wordclass Tagging | Automatic tagging of the BNC ]


1. Levels of estimation

Based on the findings from the 50,000-word test sample, the estimated ambiguity and error rates for the BNC are shown below in three different degrees of detail.

(a) First, as a general assessment of accuracy, the estimated rates are given for the whole corpus. (See Table 1 below.)

(b) Secondly, separate estimates of ambiguity rates and error rates are given for each of the 57 word tags in the corpus. This will enable users of the corpus to assign appropriate degrees of reliability to each tag. Some tags are always correct; others are quite often erroneous. For example, the tag VDD stands for a single form of the verb do: the form did. Since the spelling did is unambiguous, the chances of ambiguity or error in the use of the tag VDD are virtually nil. On the other hand, the tag VVB (base finite form of a lexical verb) is not only quite frequent, but also highly prone to ambiguity and error: 15 per cent of the occurrences of VVB are errors, a much higher error rate than for any other tag. (See Table 2 below.)

(c) Thirdly, separate estimates of ambiguity rates and error rates are given for "wrong-tag/right-tag" pairings XXX, YYY, consisting of (i) the actually-occurring erroneous tag XXX, and (ii) the correct tag YYY which should have occurred in its place. However, because the number of possible tag-pairs is large (57² = 3,249), and most of these tag-pairs have few or no errors, only the more common pairings of erroneous tag and correct tag are separately listed, with their estimated probability of occurrence. This list of tag-pairings will help users further, enabling them to estimate not merely the reliability of a tag but also, if that tag is incorrect, the likelihood that the correct tag would have been some other particular tag. In this way, the frequency of grammatical word classes, or of individual words in those classes, can be estimated more accurately for the whole BNC. (See Table 3 below.)
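To illustrate this third use of the tables, a user might adjust raw tag frequencies along the following lines. This is a minimal Python sketch of our own, using figures from Tables 2 and 3 below; it illustrates the idea and is not part of the tagging software.

    # Illustrative sketch: adjusting observed tag frequencies with the
    # tag-pair error estimates of Table 3 (column (5), as fractions).
    MIS_TAG_RATES = {
        ("VVB", "VVI"): 0.098,   # 9.8% of VVB tags are really VVI
        ("VVD", "VVN"): 0.045,   # 4.5% of VVD tags are really VVN
    }

    # Observed single-tag counts in the 50,000-word sample (Table 2, col (b)).
    OBSERVED = {"VVB": 560, "VVD": 970, "VVI": 1211, "VVN": 1086}

    def corrected_count(tag):
        """Estimate the true frequency of `tag`: subtract its expected
        mis-uses, and add its expected occurrences hidden under other tags."""
        n = OBSERVED[tag]
        for (wrong, right), rate in MIS_TAG_RATES.items():
            if wrong == tag:
                n -= rate * OBSERVED[wrong]  # tag used where `right` was correct
            if right == tag:
                n += rate * OBSERVED[wrong]  # tag hidden under `wrong`
        return n

    print(round(corrected_count("VVI")))  # 1211 + 0.098 * 560 = ~1266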

2. Presentation of Ambiguity Rates and Error Rates (fine-grained mode of calculation)

In this section, we examine ambiguities and errors using a "fine-grained" mode of calculation, treating each error as of equal importance to any other error. In section 3, we look at the same data in terms of a "coarse-grained" mode of calculation, ignoring errors and ambiguities which involve subcategories of the same part of speech.

2.1 Overall estimated ambiguity and error rates: based on the 50,000-word sample

As the following table shows, the ambiguity rate varies considerably between written and spoken texts. (However, note that the calculation for speech is based on a small sample of 5,000 words.)

Table 1: Estimated ambiguity and error rates for the whole corpus (fine-grained calculation)

                 Sample tag count   Ambiguity rate (%)   Error rate (%)
Written texts    45,000             3.83%                1.14%
Spoken texts      5,000             3.00%                1.17%
All texts        50,000             3.75%                1.15%

It will be noted that written texts on the whole have a higher ambiguity rate, whereas spoken texts have a slightly greater error rate.

The success of an automatic tagger is sometimes represented in terms of the information-retrieval measures of precision and recall, rather than ambiguity rate and error rate as in Table 1. Precision is the extent to which incorrect tags are successfully discarded from the output. Recall is the extent to which all correct tags are successfully retained in the output of the tagger, allowing, however, for more than one reading to occur for one word (i.e. ambiguous tagging is permitted). According to these measures, the success of the tagging is as follows:

                 Precision   Recall
Written texts    96.17%      98.86%
Spoken texts     97.00%      98.83%
All texts        96.25%      98.85%

However, from now on we will continue to use "ambiguity rate" and "error rate", which appear to us more transparent.
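For reference, the two presentations are related in a simple way: under the definitions used here, precision is 100 minus the ambiguity rate, and recall is 100 minus the error rate, as the figures above confirm. A minimal Python sketch of the conversion (our own illustration, not an official formula):

    # Converting Table 1's rates into the precision/recall presentation.
    def precision_recall(ambiguity_rate, error_rate):
        """Arguments and results are percentages."""
        precision = 100.0 - ambiguity_rate  # surplus second tags not discarded
        recall = 100.0 - error_rate         # correct tags missing from the output
        return precision, recall

    print(precision_recall(3.83, 1.14))  # written texts: (96.17, 98.86)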

2.2 Estimated ambiguity and error rates for each tag (fine-grained mode of calculation)

The estimates for individual tags are again based on the 50,000-word sample, and the ambiguity rate for each tag is based on the number of ambiguity tags which begin with that tag. The table also specifies the estimated likelihood that a given tag, in the first position of an ambiguity tag, is the correct tag.

Table 2: Estimated ambiguity rates and error rates (by tag) (fine-grained calculation)

In Table 2, column (b) shows the overall frequency of particular tags (not including ambiguity tags). Column (c) gives the overall occurrence of ambiguity tags, as well as of particular ambiguity tags, beginning with a given tag. (Ambiguity tags marked * are less "serious", in that they apply to two subcategories of the same part of speech, such as the past tense and past participle of the verb - see 3.1 below.) Column (d) shows which tags are more or less likely to be found as the first part of an ambiguity tag. For example, both NP0 and VVG have an especially high incidence of ambiguity tags. Column (e) tells us, given that we have observed an ambiguity tag, how likely it is that the first tag is correct. Overall, there is more than a 3-to-1 chance that the first tag will be correct; but there are some exceptions, where the chances of the first tag's being correct are much lower: for example, PNI (indefinite pronoun). [Note that columns (f) and (g) exclude errors where the first tag of an ambiguity tag is wrong; contrast Table 5, and Table 6 column (c), in section 3.2 below.]
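In computational terms, the derived columns amount to the following (a minimal Python sketch of our own, using the AJ0 row of the table below):

    # Derived columns of Table 2 for one tag (AJ0 row; our own sketch).
    single_tags = 3412     # col (b): single-tag count
    ambiguity_tags = 338   # col (c): ambiguity-tag count
    first_correct = 282    # col (e): ambiguity tags whose first tag is correct
    errors = 46            # col (f): error count (first tags excluded, see note)

    ambiguity_rate = 100 * ambiguity_tags / (single_tags + ambiguity_tags)  # (d): 9.01%
    first_tag_share = 100 * first_correct / ambiguity_tags                  # (e): 83.43%
    error_rate = 100 * errors / single_tags                                 # (g): 1.35%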

(a) Tag   (b) Single-tag   (c) Ambiguity-tag   (d) Ambiguity rate   (e) 1st tag of ambiguity   (f) Error   (g) Error rate
          count            count               (%)  (c / (b + c))   tag correct (% of all      count       (%)  (f / b)
                                                                    ambiguity tags)

(Both counts are out of the 50,000-word sample.)

AJ0       3412             all 338             9.01%                282 (83.43%)               46          1.35%
            (AJ0-AV0 48) (AJ0-NN1 209) (AJ0-VVD 21) (AJ0-VVG 28) (AJ0-VVN 32)
AJC       142              -                   0.0%                 -                          4           2.82%
AJS       26               -                   0.0%                 -                          2           7.69%
AT0       4351             -                   0.0%                 -                          2           0.05%
AV0       2450             all 45              1.80%                37 (82.22%)                57          2.33%
            (AV0-AJ0 45)
AVP       379              all 44              10.40%               34 (77.27%)                6           1.58%
            (AVP-PRP 44)
AVQ       157              all 10              5.99%                10 (100.00%)               9           5.73%
            (AVQ-CJS 10)
CJC       1915             -                   0.0%                 -                          3           0.16%
CJS       692              all 39              5.34%                30 (76.92%)                18          2.60%
            (CJS-AVQ 26) (CJS-PRP 13)
CJT       236              all 28              10.61%               -                          3           1.27%
            (CJT-DT0 28)
CRD       940              all 1               0.11%                0 (0.00%)                  0           0.00%
            (CRD-PNI 1)
DPS       787              -                   0.0%                 -                          0           0.00%
DT0       1180             all 20              1.67%                16 (80.00%)                19          1.61%
            (DT0-CJT 20)
DTQ       370              -                   0.0%                 -                          0           0.00%
EX0       131              -                   0.0%                 -                          1           0.76%
ITJ       214              -                   0.0%                 -                          2           0.93%
NN0       270              -                   0.0%                 -                          10          3.70%
NN1       7198             all 514             6.66%                395 (76.84%)               86          1.19%
            (NN1-AJ0 130) (NN1-NP0 92)* (NN1-VVB 243) (NN1-VVG 49)
NN2       2718             all 55              1.98%                48 (87.27%)                30          1.10%
            (NN2-VVZ 55)
NP0       1385             all 264             16.01%               224 (84.84%)               31          2.24%
            (NP0-NN1 264)*
ORD       136              -                   0.0%                 -                          0           0.00%
PNI       159              all 8               4.79%                3 (37.50%)                 5           3.14%
            (PNI-CRD 8)
PNP       2646             -                   0.0%                 -                          0           0.00%
PNQ       112              -                   0.0%                 -                          0           0.00%
PNX       84               -                   0.0%                 -                          0           0.00%
POS       217              -                   0.0%                 -                          5           2.30%
PRF       1615             -                   0.0%                 -                          0           0.00%
PRP       4051             all 166             3.94%                154 (92.77%)               24          0.59%
            (PRP-AVP 132) (PRP-CJS 34)
TO0       819              -                   0.0%                 -                          6           0.73%
UNC       158              -                   0.0%                 -                          4           2.53%
VBB       328              -                   0.0%                 -                          1           0.30%
VBD       663              -                   0.0%                 -                          0           0.00%
VBG       37               -                   0.0%                 -                          0           0.00%
VBI       374              -                   0.0%                 -                          0           0.00%
VBN       133              -                   0.0%                 -                          0           0.00%
VBZ       640              -                   0.0%                 -                          4           0.63%
VDB       87               -                   0.0%                 -                          0           0.00%
VDD       71               -                   0.0%                 -                          0           0.00%
VDG       10               -                   0.0%                 -                          0           0.00%
VDI       36               -                   0.0%                 -                          0           0.00%
VDN       20               -                   0.0%                 -                          0           0.00%
VDZ       22               -                   0.0%                 -                          0           0.00%
VHB       150              -                   0.0%                 -                          1           0.67%
VHD       258              -                   0.0%                 -                          0           0.00%
VHG       16               -                   0.0%                 -                          0           0.00%
VHI       119              -                   0.0%                 -                          0           0.00%
VHN       9                -                   0.0%                 -                          0           0.00%
VHZ       116              -                   0.0%                 -                          1           0.86%
VM0       782              -                   0.0%                 -                          3           0.38%
VVB       560              all 84              13.04%               56 (66.67%)                84          15.00%
            (VVB-NN1 84)
VVD       970              all 90              8.49%                62 (58.89%)                50          5.15%
            (VVD-AJ0 11) (VVD-VVN 79)*
VVG       597              all 132             18.11%               112 (84.84%)               9           1.51%
            (VVG-AJ0 83) (VVG-NN1 49)
VVI       1211             -                   0.0%                 -                          7           0.58%
VVN       1086             all 158             12.70%               113 (71.52%)               27          2.49%
            (VVN-AJ0 50) (VVN-VVD 108)*
VVZ       295              all 26              8.10%                14 (53.85%)                11          3.73%
            (VVZ-NN2 26)
XX0       363              -                   0.0%                 -                          0           0.00%
ZZ0       75               -                   0.0%                 -                          3           4.00%

 

2.3 Estimated error rates specifying the incorrect tag and the correct tag (fine-grained calculation)

The next table, Table 3, gives the frequency, as a percentage, of error-prone tag-pairs, where XXX is the incorrect tag and YYY is the correct tag which should have occurred in its place. The third column lists the number of occurrences of the specified error-type, as a frequency count from the sample of 50,000 words. In the fourth column, this is expressed as a percentage of all the tagging errors of word category XXX (Table 2, column (f)). The fifth column answers the question: if tag XXX occurs, what is the likelihood that it is an error for tag YYY? Error-types with fewer than 5 occurrences (i.e. fewer than 1 in 10,000 words) are ignored. Hence Table 3 is not exhaustive: only the more likely error-types are listed. In the second column, we add, where useful, the individual words which trigger these errors.

Table 3: Estimated frequency of selected tag-pairs, where XXX is the incorrect tag, and YYY is the correct one

(1) Incorrect   (2) Correct tag YYY    (3) No. of occurrences   (4) % of all incorrect       (5) % of all tags XXX
tag XXX                                of this error type       uses of tag XXX              (col 3 / Table 2 col (b))
                                                                (col 3 / Table 2 col (f))

AJ0             AV0                    12                       26.1%                        0.4%
                NN1                    12                       26.1%                        0.4%
                NP0                    5                        10.9%                        0.1%
                VVN                    8                        17.4%                        0.2%
AV0             AJ0                    6                        10.5%                        0.2%
                AJC                    8                        14.0%                        0.3%
                DT0                    24                       42.1%                        1.0%
                EX0 (there)            5                        8.8%                         0.2%
                PRP                    5                        8.8%                         0.2%
AVQ             CJS (when, where)      6                        66.7%                        3.8%
CJS             PRP                    10                       55.6%                        1.4%
DT0             AV0                    15                       78.9%                        1.3%
NN1             AJ0                    13                       15.1%                        0.2%
                NN0*                   8                        9.3%                         0.1%
                NP0*                   22                       25.6%                        0.3%
                UNC                    9                        10.5%                        0.2%
                VVI                    13                       15.1%                        0.2%
NN2             NP0*                   14                       46.7%                        0.5%
NP0             NN1*                   10                       32.3%                        0.7%
                NN0*                   5                        16.1%                        0.4%
PRP             AV0                    7                        29.2%                        0.2%
                AVP                    5                        20.8%                        0.1%
TO0             PRP (to)               6                        100.0%                       0.7%
VVB             AJ0                    7                        8.3%                         1.3%
                NN1                    7                        8.3%                         1.3%
                VVI*                   55                       65.5%                        9.8%
VVD             AJ0                    6                        12.0%                        0.6%
                VVN*                   44                       88.0%                        4.5%
VVG             NN1                    9                        100.0%                       1.5%
VVI             NN1                    5                        71.4%                        0.4%
VVN             AJ0                    7                        25.9%                        0.6%
                VVD*                   17                       63.0%                        1.6%
VVZ             NN2                    8                        72.7%                        2.7%

As before, the asterisk * indicates a "less serious" error, in which the erroneous and correct tags belong to the same major category or part of speech. As the table shows, the most frequent specific error types are within the verb category: VVB for VVI (55, or 9.8% of all VVB tags) and VVD for VVN (44, or 4.5% of all VVD tags).
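The derivation of columns (4) and (5) can be sketched for the largest of these error types as follows (a minimal Python illustration of our own, with figures taken from Tables 2 and 3):

    # Columns (4) and (5) of Table 3 for the pair VVB -> VVI (our own sketch).
    vvb_tags = 560      # Table 2, col (b): occurrences of the tag VVB
    vvb_errors = 84     # Table 2, col (f): erroneous uses of VVB
    vvb_for_vvi = 55    # Table 3, col (3): VVB used where VVI was correct

    share_of_vvb_errors = 100 * vvb_for_vvi / vvb_errors  # col (4): ~65.5%
    share_of_all_vvb = 100 * vvb_for_vvi / vvb_tags       # col (5): ~9.8%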

3. A further mode of calculation: ignoring subcategories of the same part of speech

3.1 Presentation of Ambiguity and Error Rates (coarse-grained calculation)

A further way of looking at the ambiguities and errors in the corpus is to make a coarse-grained calculation in counting these phenomena. In the fine-grained measurement, which is the one assumed up to now, each tag is considered to define its own word class, different from all other word classes. Using the coarse-grained calculation, on the other hand, we consider words to belong to different word classes (parts of speech) only when the major category is different. If we consider the pair NN1 (singular common noun) and NP0 (proper noun), the coarse-grained calculation says that the ambiguity tag NN1-NP0 or NP0-NN1 does not show tagging uncertainty, since both the proposed tags agree in categorizing the word as the same part of speech (a noun). So this does not add to the ambiguity rate. Similarly, the coarse-grained view of error is that, if a word is tagged as NN1 when it should be NP0, or vice versa, this is not an error, because both tags are within the noun category. To summarize: in the fine-grained calculation, minor differences of word class count towards the ambiguity and error rates; in the coarse-grained calculation, they do not.
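The test underlying the coarse-grained calculation can be sketched as follows. This is a minimal Python illustration of our own: the category grouping shown covers only the tags discussed in this document, and a full version would map all 57 tags.

    # Coarse-grained test: a tag pair is ignored when both tags share a
    # major category. The mapping is a partial illustration, not the
    # official list.
    MAJOR_CATEGORY = {
        "NN0": "noun", "NN1": "noun", "NN2": "noun", "NP0": "noun",
        "VVB": "verb", "VVD": "verb", "VVG": "verb",
        "VVI": "verb", "VVN": "verb", "VVZ": "verb",
        "AJ0": "adjective", "AV0": "adverb",
    }

    def same_major_category(tag_a, tag_b):
        """True if an ambiguity or error involving these two tags is
        ignored under the coarse-grained calculation."""
        cat_a = MAJOR_CATEGORY.get(tag_a)
        return cat_a is not None and cat_a == MAJOR_CATEGORY.get(tag_b)

    print(same_major_category("NN1", "NP0"))  # True: both nouns
    print(same_major_category("AJ0", "AV0"))  # False: adjective vs adverb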

In this section, the same calculations are made as in section 2, except that errors and ambiguities which are confined within a major category (noun, verb, etc.) are ignored. In practice, most of the errors and ambiguities of this kind come from the difficulty the tagger has in distinguishing NN1 (singular common noun) from NP0 (proper noun), VVD (past tense lexical verb) from VVN (past participle lexical verb), and VVB (finite present tense base form of a lexical verb) from VVI (infinitive lexical verb). Thus the ambiguity tags NN1-NP0, VVD-VVN and their mirror images do not occur in the relevant table (Table 5) below. However, since there are no ambiguity tags for VVB and VVI, the problem of distinguishing these two shows up only in the error calculation. The tables in this section parallel those above: Tables 4 and 5 correspond to Table 1, Table 6 to Table 2, and Table 7 to Table 3:

Table 4: Estimated ambiguity and error rates for the whole corpus (coarse-grained calculation)

                 Sample tag count   Ambiguity rate (%)   Error rate (%)
Written texts    45,000             2.78%                0.69%
Spoken texts      5,000             2.67%                0.87%
All texts        50,000             2.77%                0.71%

It will be noted from Table 4 that this method of calculation reduces the overall ambiguity rate by c.1 per cent, and the overall error rate by c.0.5 per cent. We will not present coarse-grained tables corresponding to Tables 2 and 3 above: these tables would be unchanged from the fine-grained calculation, except that the rows marked with an asterisk (*) would be deleted, and the other calculations changed as necessary.

 

3.2 Different modes of calculation: eliminating ambiguities

Given that the elimination of all errors was beyond our capability within the available time frame and budget, the corpus in its present form, containing ambiguity tags as well as a small proportion of errors, is designed for what we believe will be the most common type of user: one who finds it easier to tolerate ambiguity than error. However, other users may prefer a corpus which contains no ambiguities, even though its error rate is higher. For this latter type of user, the present corpus is easy to reinterpret as a corpus free of ambiguities: simply delete or ignore the second tag of any ambiguity tag, and accept the first tag as the only one. In what follows, we therefore allow two modes of calculation: in addition to the "safer" mode, in which ambiguities are allowed and consequently errors are relatively few, we allow a "riskier" mode in which ambiguities are abolished and errors are more frequent. In fact, if ambiguity tags are eliminated, the overall error rate rises to almost 2 per cent.
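Mechanically, the conversion to the "riskier" corpus is trivial, as the following sketch shows (our own Python illustration; the BNC itself encodes tags in its own markup rather than as plain strings):

    # "Riskier" mode: keep only the first (more probable) tag of each
    # ambiguity tag; single tags pass through unchanged.
    def first_tag(tag):
        """'AJ0-NN1' -> 'AJ0'."""
        return tag.split("-")[0]

    tagged = [("broken", "AJ0-VVN"), ("vase", "NN1")]
    riskier = [(word, first_tag(tag)) for word, tag in tagged]
    print(riskier)  # [('broken', 'AJ0'), ('vase', 'NN1')]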

Table 5: Estimated error rates for the whole corpus after ambiguities have been automatically eliminated (assuming the first part of each ambiguity tag to be correct)

                 Sample tag count   Error rate (%)
Written texts    45,000             2.01%
Spoken texts      5,000             1.92%
All texts        50,000             2.00%

The following table gives an error count (c) for each tag: i.e. the number of errors in the 50,000-word sample where that tag was the erroneous tag. [Cf. the "safer" error count in Table 2, column (f).] In addition, each tag has a correction count (d): i.e. the number of erroneous tags for which that tag was the correct tag. If we subtract the error count (c) from the tag count (b), and add the correction count (d) to the result, we arrive at the "real tag count" (e), representing the number of occurrences of that tag in the corrected sample corpus. Not included in the table is the small number of "multi-word" errors which resulted in two tags being replaced by one (error count), or one tag being replaced by two (correction count), owing to the incorrect non-use or use of multi-word tags. The last column divides the error count by the tag count to give the error rate (as a percentage).
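The two derived columns follow directly from counts (b), (c) and (d), as in this minimal sketch of our own (using the VVB row of Table 6 below):

    # Derived columns of Table 6 for the VVB row (our own sketch).
    tag_count = 644         # col (b)
    error_count = 112       # col (c): VVB used where another tag was correct
    correction_count = 13   # col (d): another tag used where VVB was correct

    real_tag_count = tag_count - error_count + correction_count  # col (e): 545
    error_rate = 100 * error_count / tag_count                   # col (f): 17.39%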

Table 6: Estimated error rates (by tag) after the second tags of ambiguity tags have been automatically eliminated

(a) Tag   (b) Tag count   (c) Error count   (d) Correction count   (e) Real tag count   (f) Error rate (%)
                                                                   (b - c + d)          (c / b) x 100

AJ0       3750            102               (132)                  3780                 2.72%
AJC        142              4                (12)                   150                 2.82%
AJS         26              2                 (0)                    24                 7.69%
AT0       4351              2                 (3)                  4352                 0.05%
AV0       2495             65                (67)                  2497                 2.61%
AVP        423             16                (17)                   424                 3.78%
AVQ        167              9                 (6)                   164                 5.39%
CJC       1915              3                 (1)                  1913                 0.16%
CJS        731             27                 (5)                   709                 3.69%
CJT        264              3                (15)                   276                 1.14%
CRD        940              1                (11)                   950                 0.11%
DPS        787              0                 (0)                   787                 0.00%
DT0       1200             23                (29)                  1206                 1.92%
DTQ        370              0                 (0)                   370                 0.00%
EX0        131              1                 (5)                   135                 0.76%
ITJ        214              2                 (2)                   214                 0.93%
NN0        270             10                (16)                   276                 3.70%
NN1       7712            205               (152)                  7659                 2.66%
NN2       2773             37                (29)                  2765                 1.33%
ORD        136              0                 (2)                   138                 0.00%
NP0       1649             71               (102)                  1680                 4.31%
PNI        167             10                 (1)                   158                 5.99%
PNP       2646              0                 (1)                  2647                 0.00%
PNQ        112              0                 (0)                   112                 0.00%
PNX         84              0                 (1)                    85                 0.00%
POS        217              5                 (6)                   218                 2.30%
PRF       1615              0                 (0)                  1615                 0.00%
PRP       4217             36                (45)                  4226                 0.85%
TO0        819              6                 (1)                   814                 0.73%
UNC        158              4                (29)                   183                 2.53%
VBB        328              1                 (0)                   327                 0.30%
VBD        663              0                 (0)                   663                 0.00%
VBG         37              0                 (0)                    37                 0.00%
VBI        374              0                 (0)                   374                 0.00%
VBN        133              0                 (0)                   133                 0.00%
VBZ        640              4                 (5)                   641                 0.63%
VDB         87              0                 (0)                    87                 0.00%
VDD         71              0                 (0)                    71                 0.00%
VDG         10              0                 (0)                    10                 0.00%
VDI         36              0                 (0)                    36                 0.00%
VDN         20              0                 (0)                    20                 0.00%
VDZ         22              0                 (0)                    22                 0.00%
VHB        150              1                 (0)                   149                 0.67%
VHD        258              0                 (0)                   258                 0.00%
VHG         16              0                 (0)                    16                 0.00%
VHI        119              0                 (1)                   120                 0.00%
VHN          9              0                 (0)                     9                 0.00%
VHZ        116              1                 (0)                   115                 0.86%
VM0        782              3                 (0)                   779                 0.38%
VVB        644            112                (13)                   545                 17.39%
VVD       1060             78                (60)                  1042                 7.36%
VVG        729             29                (29)                   729                 3.98%
VVI       1211              7                (73)                  1277                 0.57%
VVN       1244             72                (87)                  1259                 5.79%
VVZ        321             23                (12)                   310                 7.17%
XX0        363              0                 (0)                   363                 0.00%
ZZ0         75              3                 (4)                    76                 4.00%

It is clear from this table that the amount of error in the tagging of the corpus varies greatly from one tag to another. The most error-prone tag, by a large margin, is VVB, with an error rate of more than 17 per cent, while many of the tags are associated with no errors at all, and well over half the tags have an error rate of less than 1 per cent.

The final table, Table 7, gives figures for the third level of detail, where we itemise individual tag pairs XXX, YYY, where XXX is the incorrect tag, and YYY is the correct one which should have appeared but did not. Only those pairings which account for 5 or more errors are listed. This table differs from Table 3 in that here the second tags of ambiguity tags are not taken into account ("riskier mode"). It will be seen that the errors which occur tend to fall into a relatively small number of major categories.

Table 7: Estimated frequency of selected tag-pairs, where XXX is the incorrect tag, and YYY is the correct one (after the second tags of ambiguity tags have been eliminated automatically) based on the sample of 50,000 words

(1) Incorrect   (2) Correct tag YYY       (3) No. of occurrences   (4) % of all incorrect       (5) % of all tags XXX
tag XXX                                   of this error type       uses of tag XXX              (col 3 / Table 6 col (b))
                                                                   (col 3 / Table 6 col (c))

AJ0             AV0                       22                       21.57%                       0.59%
                NN1                       41                       40.19%                       1.09%
                NP0                       5                        4.90%                        0.13%
                VVG                       14                       13.73%                       0.37%
                VVN                       14                       13.73%                       0.37%
AV0             AJ0                       9                        13.85%                       0.36%
                AJC                       8                        12.31%                       0.32%
                DT0                       26                       40.00%                       1.04%
                EX0 (there)               5                        7.69%                        0.20%
                PRP                       6                        9.23%                        0.24%
AVP             CJT                       6                        37.50%                       1.42%
AVQ             CJS (when, where)         6                        66.67%                       3.59%
CJS             PRP                       15                       55.56%                       2.05%
DT0             AV0 (much, more, etc.)    15                       65.22%                       1.25%
NN1             AJ0                       63                       30.73%                       0.82%
                NN0                       8                        3.90%                        0.10%
                NP0                       74                       36.10%                       0.96%
                UNC                       9                        4.39%                        0.12%
                VVB                       9                        4.39%                        0.12%
                VVG                       13                       6.34%                        0.17%
                VVI                       13                       6.34%                        0.17%
NN2             NP0                       14                       37.84%                       0.50%
                UNC                       9                        24.32%                       0.32%
                VVZ                       10                       27.03%                       0.36%
NN0             UNC                       7                        70.00%                       2.59%
NP0             NN1                       50                       70.42%                       3.03%
                NN2                       5                        7.04%                        0.30%
PNI             CRD (one)                 9                        90.00%                       5.39%
PRP             AV0                       8                        22.22%                       0.19%
TO0             PRP (to)                  6                        100.00%                      0.73%
VVB             AJ0                       7                        6.25%                        1.09%
                NN1                       35                       31.25%                       5.43%
                VVI                       55                       49.11%                       8.54%
                VVN                       5                        4.46%                        0.78%
VVD             AJ0                       14                       17.95%                       1.32%
                VVN                       64                       82.05%                       6.04%
VVG             AJ0                       11                       37.93%                       1.51%
                NN1                       18                       62.07%                       2.47%
VVI             NN1                       5                        71.43%                       0.41%
VVZ             NN2                       20                       86.96%                       6.23%

Some of the error types above are associated with one or two particular words, and where these occur they are listed. For example, the AV0 for EX0 type of error occurs invariably with the single word there.


Appendix:

List of text samples used for the manually-conducted 50,000-word error analysis

Each sample consisted of 2,000 words from each of the BNC texts listed below, except that two samples, one of written and one of spoken English, consisted of 1,000 words only; these are marked "*" in the list. Half-length samples were used in these two cases to keep the proportion of written to spoken data at 90% to 10%, matching the proportions of the BNC as a whole. The BNC text files are cited by the three-character code used in the BNC Users Reference Guide.

Written: Imaginative writing G0S, ADY, H7P, GW0, FSF

Written: Informative writing

Natural Science JXN

Applied Science HWV, CEG

Social Science CLH, EE8, *A6Y

World Affairs A4J, CMT, EE2, EB7

Commerce and finance HGP, B27

Arts C9U, G1N

Belief and Thought CA9

Leisure EX0, ADR, CE4

Spoken: Demographic KBG

Context-governed D8Y, *FXH



Date: 17 March 2000