This chapter will cover evaluation, and the trade-offs between methods that create large models and methods that create small ones (e.g. n-gram tagging vs Brill tagging).
★ Tagger context:
N-gram taggers choose a tag for a token based on its text and the tags of the n-1 preceding tokens. This is a common context to use for tagging, but certainly not the only possible context.
★ Reverse sequential taggers: Since sequential taggers tag tokens in order, one at a time, they can only use the predicted tags to the left of the current token to decide what tag to assign to it. But in some cases, the right context may provide more useful information than the left context. A reverse sequential tagger starts with the last word of the sentence and, proceeding in right-to-left order, assigns tags to words on the basis of the tags it has already predicted to the right. By reversing texts at appropriate times, we can use NLTK's existing sequential tagging classes to perform reverse sequential tagging: reverse the training text before training the tagger, and reverse the text being tagged both before and after tagging.
★ Alternatives to backoff: Create a new kind of tagger that combines several taggers using a new mechanism other than backoff (e.g. voting). For robustness in the face of unknown words, include a regexp tagger, a unigram tagger that removes a small number of prefix or suffix characters until it recognizes a word, or an n-gram tagger that does not consider the text of the token being tagged.
As we experiment with different taggers, it is important to have an objective performance measure. Fortunately, we already have manually verified training data (the original tagged corpus), so we can use that to score the accuracy of a tagger, and to perform systematic error analysis.
Consider the sentence from the Brown Corpus in Table 12.1. The 'Gold Standard' tags from the corpus are given in the second column, while the tags assigned by a unigram tagger appear in the third column. Two mistakes made by the unigram tagger are italicized.
Sentence | Gold Standard | Unigram Tagger |
---|---|---|
The | at | at |
President | nn-tl | nn-tl |
said | vbd | vbd |
he | pps | pps |
will | md | md |
ask | vb | vb |
Congress | np | np |
to | to | to |
increase | vb | *nn*
grants | nns | nns |
to | in | *to*
states | nns | nns |
for | in | in |
vocational | jj | jj |
rehabilitation | nn | nn |
. | . | . |
The tagger correctly tagged 14 out of 16 words, so it gets a score of 14/16, or 87.5%. Of course, accuracy should be judged on the basis of a larger sample of data. NLTK provides a function called tag.accuracy to automate the task. In the simplest case, we can test the tagger using the same data it was trained on:
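As a rough sketch of what this looks like, using the current NLTK API rather than the nltk_lite interface assumed in this chapter (the helper tag_accuracy is our own illustrative name, doing by hand what tag.accuracy automates):

```python
import nltk
from nltk.corpus import brown

def tag_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold-standard tag."""
    correct = total = 0
    for gold_sent in gold_sents:
        words = [word for (word, tag) in gold_sent]
        for (word, guess), (_, gold) in zip(tagger.tag(words), gold_sent):
            correct += (guess == gold)
            total += 1
    return correct / total

train_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(train_sents)
print(tag_accuracy(unigram_tagger, train_sents))   # deceptively high: tested on its own training data
```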
However, testing a language processing system over its training data is unwise. A system which simply memorized the training data would get a perfect score without doing any linguistic modeling. Instead, we would like to reward systems that make good generalizations, so we should test against unseen data, and replace train_sents above with unseen_sents. We can then define the two sets of data as follows:
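For example, we might hold out the last tenth of the corpus as unseen data (the 90/10 split is an arbitrary but common choice):

```python
from nltk.corpus import brown

tagged_sents = list(brown.tagged_sents(categories='news'))
cutoff = int(0.9 * len(tagged_sents))                       # hold out the final 10% as unseen data
train_sents, unseen_sents = tagged_sents[:cutoff], tagged_sents[cutoff:]
```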
Now we train the tagger using train_sents and evaluate it using unseen_sents, as follows:
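Continuing the sketch above (and reusing tag_accuracy and the train/unseen split), this step might look as follows:

```python
import nltk

unigram_tagger = nltk.UnigramTagger(train_sents)     # train on 90% of the corpus
print(tag_accuracy(unigram_tagger, unseen_sents))    # score on the held-out 10%
```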
The accuracy scores produced by this evaluation method are lower, but they give a more realistic picture of the performance of the tagger. Note that the performance of any statistical tagger is highly dependent on the quality of its training set. In particular, if the training set is too small, it will not be able to reliably estimate the most likely tag for each word. Performance will also suffer if the training set is significantly different from the texts we wish to tag.
In the process of developing a tagger, we can use the accuracy score as an objective measure of the improvements made to the system. Initially the accuracy score will rise quickly as we fix obvious shortcomings of the tagger; after a while, however, improvements become harder to find and the gains get smaller.
It is difficult to interpret an accuracy score in isolation. For example, is a person who scores 25% in a test likely to know a quarter of the course material? If the test is made up of 4-way multiple choice questions, then this person has not performed any better than chance. Thus, it is clear that we should interpret an accuracy score relative to a baseline. The choice of baseline is somewhat arbitrary, but it usually corresponds to minimal knowledge about the domain.
In the case of tagging, a possible baseline score can be found by tagging every word with NN, the most likely tag.
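A sketch of this baseline, again reusing tag_accuracy and unseen_sents from above (note that current NLTK gives the Brown tags in upper case, e.g. NN, while this chapter shows them in lower case):

```python
import nltk

default_tagger = nltk.DefaultTagger('NN')            # tag every single word as a noun
print(tag_accuracy(default_tagger, unseen_sents))    # a weak baseline
```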
Unfortunately this is not a very good baseline. There are many high-frequency words which are not nouns. Instead we could use the standard unigram tagger to get a baseline of 75%. However, this does not seem fully legitimate: the unigram tagger's model covers all words seen during training, which hardly counts as 'minimal knowledge'. Instead, let's only permit ourselves to store tags for the most frequent words.
The first step is to identify the most frequent words in the corpus, and for each of these words, identify the most likely tag:
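One way to do this, sketched with current NLTK frequency distributions (the cutoff of 100 words is arbitrary, and likely_tags is our own variable name):

```python
import nltk

word_freq = nltk.FreqDist(word for sent in train_sents for (word, tag) in sent)
word_tag_freq = nltk.ConditionalFreqDist(
    (word, tag) for sent in train_sents for (word, tag) in sent)

most_freq_words = [word for (word, count) in word_freq.most_common(100)]
likely_tags = dict((word, word_tag_freq[word].max()) for word in most_freq_words)
```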
Now we can create a lookup table (a dictionary) which maps words to likely tags, just for these high-frequency words. We can then define a new baseline tagger which uses this lookup table:
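A sketch of such a lookup tagger, reusing likely_tags from above: in current NLTK a dictionary model can be passed to UnigramTagger, and we add a default tagger as backoff for words outside the table (one possible way to assemble it, not the chapter's original listing):

```python
import nltk

baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
print(tag_accuracy(baseline_tagger, unseen_sents))
```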
This, then, would seem to be a reasonable baseline score for a tagger. When we build new taggers, we will only credit ourselves for performance exceeding this baseline.
While the accuracy score is certainly useful, it does not tell us how to improve the tagger. For this we need to undertake error analysis. For instance, we could construct a confusion matrix, with a row and a column for every possible tag, and entries that record how often a word with tag Ti is incorrectly tagged as Tj. Another approach is to analyze the context of the errors, which is what we do now.
Consider the following program, which catalogs all errors along with the tag on the left and their frequency of occurrence.
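A sketch of such a program in plain Python, reusing unigram_tagger and unseen_sents from above; following the description below, each error is keyed by the test-tag bigram and the gold-tag bigram covering the mistagged token and the token to its left:

```python
from collections import defaultdict

errors = defaultdict(int)
for gold_sent in unseen_sents:
    words = [word for (word, tag) in gold_sent]
    test_sent = unigram_tagger.tag(words)
    for i in range(1, len(gold_sent)):
        (t1, t2) = (test_sent[i-1][1], test_sent[i][1])    # test tags: left context, current token
        (g1, g2) = (gold_sent[i-1][1], gold_sent[i][1])    # gold tags: left context, current token
        if t2 != g2:                                       # the current token was mistagged
            errors[((t1, t2), (g1, g2))] += 1
```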
The errors dictionary has keys of the form ((t1,t2),(g1,g2)), where (t1,t2) are the test tags, and (g1,g2) are the gold-standard tags. The values in the errors dictionary are simple counts of how often each error occurred. With some further processing, we construct the list counted_errors containing tuples consisting of counts and errors, and then do a reverse sort to get the most significant errors first:
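One way to do that sorting, continuing the sketch above (the exact output will depend on the corpus and tagger used):

```python
counted_errors = [(count, error) for (error, count) in errors.items()]
counted_errors.sort(reverse=True)                          # most frequent errors first
for (count, ((t1, t2), (g1, g2))) in counted_errors[:10]:
    print(count, 'test:', (t1, t2), 'gold:', (g1, g2))
```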
The fifth line of output records the fact that there were 3 cases where the unigram tagger mistakenly tagged a verb as a noun, following the word to. (We encountered this same mistake for the word increase in the above evaluation table, where the unigram tagger tagged increase as a noun instead of a verb, since it occurred more often in the training data as a noun.) Here, when form appears after the word to, it is invariably a verb. Evidently, the performance of the tagger would improve if it were modified to consider not just the word being tagged, but also the tag of the word on the left. Such taggers are known as bigram taggers, and we consider them in the next section.
[Introduction to NLTK's support for statistical estimation.]
A potential issue with n-gram taggers is the size of their n-gram table (or language model). If tagging is to be employed in a variety of language technologies deployed on mobile computing devices, it is important to strike a balance between model size and tagger performance. An n-gram tagger with backoff may store trigram and bigram tables, large sparse arrays which may have hundreds of millions of entries.
A second issue concerns context. The only information an n-gram tagger considers from prior context is tags, even though words themselves might be a useful source of information. It is simply impractical for n-gram models to be conditioned on the identities of words in the context. In this section we examine Brill tagging, a tagging method which performs very well using models that are only a tiny fraction of the size of n-gram taggers.
Brill tagging is a kind of transformation-based learning. The general idea is very simple: guess the tag of each word, then go back and fix the mistakes. In this way, a Brill tagger successively transforms a bad tagging of a text into a better one. As with n-gram tagging, this is a supervised learning method, since we need annotated training data. However, unlike n-gram tagging, it does not count observations but compiles a list of transformational correction rules.
The process of Brill tagging is usually explained by analogy with painting. Suppose we were painting a tree, with all its details of boughs, branches, twigs and leaves, against a uniform sky-blue background. Instead of painting the tree first then trying to paint blue in the gaps, it is simpler to paint the whole canvas blue, then "correct" the tree section by over-painting the blue background. In the same fashion we might paint the trunk a uniform brown before going back to over-paint further details with even finer brushes. Brill tagging uses the same idea: begin with broad brush strokes then fix up the details, with successively finer changes. Table 12.2 illustrates this process, first tagging with the unigram tagger, then fixing the errors.
Sentence | Gold | Unigram | Replace nn with vb when the previous word is to | Replace to with in when the next tag is nns |
---|---|---|---|---|
The | at | at | | |
President | nn-tl | nn-tl | | |
said | vbd | vbd | | |
he | pps | pps | | |
will | md | md | | |
ask | vb | vb | | |
Congress | np | np | | |
to | to | to | | |
increase | vb | nn | vb | |
grants | nns | nns | | |
to | in | to | to | in |
states | nns | nns | | |
for | in | in | | |
vocational | jj | jj | | |
rehabilitation | nn | nn | | |
In this table we see two rules. All such rules are generated from a template of the following form: "replace T1 with T2 in the context C". Typical contexts are the identity or the tag of the preceding or following word, or the appearance of a specific tag within 2-3 words of the current word. During its training phase, the tagger guesses values for T1, T2 and C, to create thousands of candidate rules. Each rule is scored according to its net benefit: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies. This process is best illustrated by a listing of the output from the NLTK Brill tagger (here run on tagged Wall Street Journal text from the Penn Treebank).
Loading tagged data...
Training unigram tagger: [accuracy: 0.820940]
Training Brill tagger on 37168 tokens...
Iteration 1: 1482 errors; ranking 23989 rules;
    Found: "Replace POS with VBZ if the preceding word is tagged PRP"
    Apply: [changed 39 tags: 39 correct; 0 incorrect]
Iteration 2: 1443 errors; ranking 23662 rules;
    Found: "Replace VBP with VB if one of the 3 preceding words is tagged MD"
    Apply: [changed 36 tags: 36 correct; 0 incorrect]
Iteration 3: 1407 errors; ranking 23308 rules;
    Found: "Replace VBP with VB if the preceding word is tagged TO"
    Apply: [changed 24 tags: 23 correct; 1 incorrect]
Iteration 4: 1384 errors; ranking 23057 rules;
    Found: "Replace NN with VB if the preceding word is to"
    Apply: [changed 67 tags: 22 correct; 45 incorrect]
...
Iteration 20: 1138 errors; ranking 20717 rules;
    Found: "Replace RBR with JJR if one of the 2 following words is tagged NNS"
    Apply: [changed 14 tags: 10 correct; 4 incorrect]
Iteration 21: 1128 errors; ranking 20569 rules;
    Found: "Replace VBD with VBN if the preceding word is tagged VBD"
    [insufficient improvement; stopping]
Brill accuracy: 0.835145
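The net-benefit scoring just described amounts to only a few lines of plain Python. The following is an illustration of the idea, not NLTK's Brill implementation: a candidate rule is represented here as an ordinary function from a tagged sentence to a revised tag sequence, and the example rule mirrors the rule found in iteration 4 above.

```python
def replace_nn_with_vb_after_to(words, tags):
    """Candidate rule: replace NN with VB when the preceding word is 'to'."""
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == 'NN' and words[i-1].lower() == 'to':
            new_tags[i] = 'VB'
    return new_tags

def net_benefit(rule, words, current_tags, gold_tags):
    """Number of tags the rule corrects, minus the number of correct tags it breaks."""
    proposed = rule(words, current_tags)
    fixed = broken = 0
    for (old, new, gold) in zip(current_tags, proposed, gold_tags):
        if new != old:
            if new == gold:
                fixed += 1        # an error has been corrected
            elif old == gold:
                broken += 1       # a previously correct tag has been spoiled
    return fixed - broken
```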
Brill taggers have another interesting property: the rules are linguistically interpretable. Compare this with the n-gram taggers, which employ a potentially massive table of n-grams. We cannot learn much from direct inspection of such a table, in comparison to the rules learned by the Brill tagger.
☼ Try the Brill tagger demonstration, as follows:
from nltk_lite.tag import brill
brill.demo()
◑ Consult the documentation for the demo function, using help(brill.demo). Experiment with the tagger by setting different values for the parameters. Is there any trade-off between training time (corpus size) and performance?
★ Inspect the diagnostic files created by the tagger, rules.out and errors.out. Obtain the demonstration code (nltk_lite/tag/brill.py) and create your own version of the Brill tagger.
(We are grateful to Christopher Maloof for developing NLTK's Brill tagger, and Trevor Cohn for developing NLTK's HMM tagger.)
[Overview of NLTK's HMM tagger.]
An easy way to evaluate a chunk parser is to take some already chunked text, strip off the chunks, rechunk it, and compare the result with the original chunked text. The ChunkScore.score() function takes the correctly chunked sentence as its first argument, and the newly chunked version as its second argument, and compares them. It reports the fraction of actual chunks that were found (recall), the fraction of hypothesized chunks that were correct (precision), and a combined score, the F-measure (the harmonic mean of precision and recall).
A number of different metrics can be used to evaluate chunk parsers. We will concentrate on a class of metrics that can be derived from two sets: the set of chunks produced by the chunk parser, and the set of chunks in the gold-standard corpus.
The evaluation method we will use comes from the field of information retrieval, and considers the performance of a document retrieval system. We will set up an analogy between the correct set of chunks and a user's so-called "information need", and between the set of returned chunks and a system's returned documents. Consider the following diagram.
Figure 12.1: True and False Positives and Negatives
Comparing these two sets gives us four regions: the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Two standard measures are precision, the fraction of guessed chunks that were correct, TP/(TP+FP), and recall, the fraction of correct chunks that were identified, TP/(TP+FN). A third measure, the F measure, is the harmonic mean of precision and recall, i.e. 1/(0.5/Precision + 0.5/Recall).
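These definitions amount to only a few lines of code. Here is a minimal sketch in plain Python, with a small made-up example (80 guessed chunks, 60 of them correct, against 100 gold-standard chunks):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 1 / (0.5 / p + 0.5 / r)

# 80 chunks guessed, 60 correct (so 20 false positives); 100 gold chunks (so 40 missed):
p, r = precision(60, 20), recall(60, 40)
print(p, r, f_measure(p, r))      # prints 0.75  0.6  0.666...
```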
During evaluation of a chunk parser, it is useful to flatten a chunk structure into a tree consisting only of a root node and leaves:
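A sketch of the flattening step, using the current NLTK API; the CoNLL-2000 chunking corpus (also Wall Street Journal text) stands in here for the nltk_lite treebank data used elsewhere in this chapter, and flatten is our own helper name:

```python
import nltk
from nltk.corpus import conll2000

def flatten(chunked_sent):
    """Strip the chunks, keeping only a root node over the (word, tag) leaves."""
    return nltk.Tree('S', chunked_sent.leaves())

gold_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(flatten(gold_sents[0]))
```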
We run a chunker over this flattened data, and compare the resulting chunked sentences with the originals, as follows:
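Continuing the sketch above, with a simple regular-expression NP chunker standing in for whatever chunker we want to evaluate:

```python
import nltk
from nltk.chunk.util import ChunkScore

cp = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>}")       # the chunker under evaluation
gold = gold_sents[0]
guess = cp.parse(flatten(gold).leaves())                # re-chunk the flattened sentence
chunkscore = ChunkScore()
chunkscore.score(gold, guess)                           # compare with the original chunks
print(chunkscore)
```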
ChunkScore is a class for scoring chunk parsers. It can be used to evaluate the output of a chunk parser, using precision, recall, f-measure, missed chunks, and incorrect chunks. It can also be used to combine the scores from the parsing of multiple texts. This is quite useful if we are parsing a text one sentence at a time. The following program listing shows a typical use of the ChunkScore class. In this example, chunkparser is being tested on each sentence from the Wall Street Journal tagged files.
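A sketch of that loop, reusing cp, flatten and gold_sents from the sketches above; the scores accumulate across calls to score():

```python
from nltk.chunk.util import ChunkScore

chunkscore = ChunkScore()
for gold in gold_sents:                                 # one sentence at a time
    chunkscore.score(gold, cp.parse(flatten(gold).leaves()))
print(chunkscore)                                       # overall precision, recall and F-measure
```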
The overall results of the evaluation can be viewed by printing the ChunkScore. Each evaluation metric is also returned by an accessor method: precision(), recall(), f_measure(), missed(), and incorrect(). The missed() and incorrect() methods can be especially useful when trying to improve the performance of a chunk parser. Here are the missed chunks:
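For instance (a sketch, showing only the first ten):

```python
for chunk in chunkscore.missed()[:10]:      # gold-standard chunks the chunker failed to find
    print(chunk)
```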
Here are the incorrect chunks:
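And similarly (again only the first ten):

```python
for chunk in chunkscore.incorrect()[:10]:   # guessed chunks that are not in the gold standard
    print(chunk)
```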
As we saw with tagging, we need to interpret the performance scores for a chunker relative to a baseline. Perhaps the most naive chunking method is to classify every tag in the training data according to whether it occurs more often inside or outside a chunk. We can do this easily using a chunked corpus and a conditional frequency distribution, as shown below:
>>> from nltk_lite.probability import ConditionalFreqDist
>>> from nltk_lite.parse import Tree
>>> import re
>>> cfdist = ConditionalFreqDist()
>>> chunk_data = list(treebank.chunked())
>>> split = len(chunk_data)*9/10
>>> train, test = chunk_data[:split], chunk_data[split:]
>>> for chunk_struct in train:
...     for constituent in chunk_struct:
...         if isinstance(constituent, Tree):
...             for (word, tag) in constituent.leaves():
...                 cfdist[tag].inc(True)
...         else:
...             (word, tag) = constituent
...             cfdist[tag].inc(False)
>>> chunk_tags = [tag for tag in cfdist.conditions() if cfdist[tag].max() == True]
>>> chunk_tags = [re.sub(r'(\W)', r'\\\1', tag) for tag in chunk_tags]
>>> tag_pattern = '<' + '|'.join(chunk_tags) + '>+'
>>> print 'Chunking:', tag_pattern
Chunking: <PRP\$|VBG\|NN|POS|WDT|JJ|WP|DT|\#|\$|NN|FW|PRP|NNS|NNP|LS|PDT|RBS|CD|EX|WP\$|NNPS|JJS|JJR>+
Now, in the evaluation phase we chunk any sequence of those tags:
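The listing above uses the older nltk_lite interface. Here is a self-contained sketch of the whole baseline experiment using the current NLTK API and the CoNLL-2000 corpus: we collect the "chunky" tags, chunk any run of them, and score against the gold standard (the variable names are our own):

```python
import re
import nltk
from nltk.corpus import conll2000
from nltk.chunk.util import ChunkScore

train = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
test = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

# For each tag, count how often it occurs inside vs. outside an NP chunk.
cfdist = nltk.ConditionalFreqDist()
for chunk_struct in train:
    for constituent in chunk_struct:
        if isinstance(constituent, nltk.Tree):          # inside a chunk
            for (word, tag) in constituent.leaves():
                cfdist[tag][True] += 1
        else:                                           # outside any chunk
            (word, tag) = constituent
            cfdist[tag][False] += 1

chunk_tags = [tag for tag in cfdist.conditions() if cfdist[tag].max() == True]
# Escape characters such as $ that are special inside a tag pattern.
tag_pattern = '<' + '|'.join(re.escape(tag) for tag in chunk_tags) + '>+'

# Chunk any sequence of "chunky" tags, and score against the gold standard.
baseline_chunker = nltk.RegexpParser('NP: {%s}' % tag_pattern)
chunkscore = ChunkScore()
for gold in test:
    chunkscore.score(gold, baseline_chunker.parse(gold.leaves()))
print(chunkscore)
```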
Information Extraction is the task of converting unstructured data (e.g., unrestricted text) or semi-structured data (e.g., web pages marked up with HTML) into structured data (e.g., tables in a relational database). For example, let's suppose we are given a text containing the fragment (1), and let's also suppose we are trying to find pairs of entities X and Y that stand in the relation 'organization X is located in location Y'.
(1) | ... said William Gale, an economist at the Brookings Institution, the research group in Washington. |
As a result of processing this text, we should be able to add the pair 〈Brookings Institution, Washington〉 to this relation. As we will see shortly, Information Extraction proceeds on the assumption that we are only looking for specific sorts of information, and these have been decided in advance. This limitation has been a necessary concession to allow the robust processing of unrestricted text.
Potential applications of Information Extraction are many, and include business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning. A particularly important area of current research involves the attempt to extract structured data out of electronically-available scientific literature, most notably in the domains of biology and medicine.
Information Extraction is usually broken down into at least two major steps: Named Entity Recognition and Relation Extraction. Named Entities (NEs) are usually taken to be noun phrases that denote specific types of individuals such as organizations, persons, dates, and so on. Thus, we might use the following XML annotations to mark up the NEs in (1):
(2) | ... said <ne type='PERSON'>William Gale</ne>, an economist at the <ne type='ORGANIZATION'>Brookings Institution</ne>, the research group in <ne type='LOCATION'>Washington</ne>. |
How do we go about identifying NEs? Our first thought might be that we could look up candidate expressions in an appropriate list of names. For example, in the case of locations, we might try using a resource such as the Alexandria Gazetteer. Depending on the nature of our input data, this may be adequate: such a gazetteer is likely to have good coverage of international cities and many locations in the U.S.A., but will probably be missing the names of obscure villages in remote regions. However, a list of names for people or organizations will probably have poor coverage. New organizations, and new names for them, are coming into existence every day, so if we are trying to deal with contemporary newswire or blog entries, say, it is unlikely that we will be able to recognize many of the NEs by using gazetteer lookup.
A second consideration is that many NE terms are ambiguous. Thus May and North are likely to be parts of NEs for DATE and LOCATION, respectively, but could both be part of a PERSON NE; conversely Christian Dior looks like a PERSON NE but is more likely to be of type ORGANIZATION. A term like Yankee will be an ordinary modifier in some contexts, but will be marked as an NE of type ORGANIZATION in the phrase Yankee infielders. To summarize, we cannot reliably detect NEs by looking them up in a gazetteer, and it is also hard to develop rules that will correctly recognize ambiguous NEs on the basis of their context of occurrence. Although lookup may contribute to a solution, most contemporary approaches to Named Entity Recognition treat it as a statistical classification task that requires training data for good performance. This task is facilitated by adopting an appropriate data representation, such as the IOB tags which we saw being deployed in the CoNLL chunk data (Chapter 5). For example, here are a few representative lines from the CoNLL 2002 (conll2002) Dutch training data:
Eddy N B-PER
Bonte N I-PER
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N B-ORG
. Punc O
As noted before, in this representation there is one token per line, each with its part-of-speech tag and its NE tag. When NEs have been identified in a text, we then want to extract relations that hold between them. As indicated earlier, we will typically be looking for relations between specified types of NE. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are NEs of the required types, and α is the string of words that intervenes between X and Y. We can then use regular expressions to pull out just those instances of α that express the relation we are looking for. The following example searches for strings that contain the word in. The expression (?!\b.+ing\b) is a negative lookahead assertion that allows us to disregard strings such as success in supervising the transition of, where in is followed by a gerund.
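Here is one way this search might look with the current NLTK relextract module, run over the IEER newswire sample (an assumption on our part; a different corpus could equally be used, and the pattern is the one described above):

```python
import re
import nltk
from nltk.sem import relextract

# Require 'in' in the filler, but discard fillers like "success in supervising the ..."
IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')

for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(relextract.rtuple(rel))
```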
As you will see, although searching for in does a reasonably good job, there is some inaccuracy in the output which is hard to avoid: there is unlikely to be a simple string-based method of excluding fillers such as secured the most money in the.
As shown above, the conll2002 corpus contains not just NE annotation but also part-of-speech tags. In principle, this allows us to devise patterns which are sensitive to these tags.
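For example, with the Dutch data shown above, we might require the filler to contain the preposition van tagged as Prep. This is a sketch under the assumption that extract_rels builds word/POS fillers for conll2002 data; the choice of relation, corpus section, and relation symbol is ours, for illustration only:

```python
import re
from nltk.corpus import conll2002
from nltk.sem import relextract

# Look for PER...ORG pairs whose filler contains the preposition 'van' tagged Prep.
VAN = re.compile(r'.*\bvan/Prep\b')

for doc in conll2002.chunked_sents('ned.train'):
    for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
        print(relextract.clause(rel, relsym='VAN'))
```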
[To be written]
Sekine -- 140 types of NE.
About this document...
This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2007 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], Version 0.7.5, under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].
This document is Revision: 4518 Wed May 16 20:08:28 EST 2007