
Source Code for Package nltk_lite.chunk

# Natural Language Toolkit: Chunkers
#
# Copyright (C) 2001-2007 University of Pennsylvania
# Author: Steven Bird <sb@csse.unimelb.edu.au>
#         Edward Loper <edloper@gradient.cis.upenn.edu>
# URL: <http://nltk.sf.net>
# For license information, see LICENSE.TXT
#

"""
Classes and interfaces for identifying non-overlapping linguistic
groups (such as base noun phrases) in unrestricted text.  This task is
called X{chunk parsing} or X{chunking}, and the identified groups are
called X{chunks}.  The chunked text is represented using a shallow
tree called a "chunk structure."  A X{chunk structure} is a tree
containing tokens and chunks, where each chunk is a subtree containing
only tokens.  For example, the chunk structure for base noun phrase
chunks in the sentence "I saw the big dog on the hill" is::

  (SENTENCE:
    (NP: <I>)
    <saw>
    (NP: <the> <big> <dog>)
    <on>
    (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the
chunk structure's L{leaves<Tree.leaves>} method.
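
For instance, calling C{leaves()} on the structure above returns the
original flat sequence of tokens; here C{chunk_struct} is a placeholder
name for that structure, and the part-of-speech tags are hypothetical::

    >>> chunk_struct.leaves()
    [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('big', 'JJ'),
     ('dog', 'NN'), ('on', 'IN'), ('the', 'DT'), ('hill', 'NN')]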

This module defines L{ChunkParseI}, a standard interface for
chunking texts; and L{RegexpChunk}, a regular-expression based
implementation of that interface.  It uses the L{tree.chunk} and
L{tree.conll_chunk} methods, which tokenize strings containing chunked
and tagged texts.  It also defines L{ChunkScore}, a utility class for
scoring chunk parsers.

RegexpChunk
===========

C{RegexpChunk} is an implementation of the chunk parser interface
that uses regular expressions over tags to chunk a text.  Its
C{parse} method first constructs a C{ChunkString}, which encodes a
particular chunking of the input text.  Initially, nothing is
chunked.  C{RegexpChunk} then applies a sequence of
C{RegexpChunkRule}s to the C{ChunkString}, each of which modifies
the chunking that it encodes.  Finally, the C{ChunkString} is
transformed back into a chunk structure, which is returned.
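
For example, a parser of this kind might be constructed and applied as
follows.  The constructor arguments shown for C{ChunkRule} and
C{RegexpChunk} are assumptions here; see the C{regexp} submodule for the
exact signatures::

    >>> from nltk_lite import chunk
    >>> rule = chunk.ChunkRule(r'(<DT>|<JJ>|<NN>)+',
    ...                        'Chunk sequences of DT, JJ and NN')
    >>> parser = chunk.RegexpChunk([rule], chunk_node='NP')
    >>> sent = [('the', 'DT'), ('big', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]
    >>> tree = parser.parse(sent)   # a chunk structure with one NP chunk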

C{RegexpChunk} can only be used to chunk a single kind of phrase.
For example, you can use a C{RegexpChunk} to chunk the noun
phrases in a text, or the verb phrases in a text; but you cannot
use it to simultaneously chunk both noun phrases and verb phrases in
the same text.  (This is a limitation of C{RegexpChunk}, not of
chunk parsers in general.)

RegexpChunkRules
----------------

C{RegexpChunkRule}s are transformational rules that update the
chunking of a text by modifying its C{ChunkString}.  Each
C{RegexpChunkRule} defines the C{apply} method, which modifies
the chunking encoded by a C{ChunkString}.  The
L{RegexpChunkRule} class itself can be used to implement any
transformational rule based on regular expressions.  There are
also a number of subclasses, which can be used to implement
simpler types of rules (a brief example follows this list):

    - L{ChunkRule} chunks anything that matches a given regular
      expression.
    - L{ChinkRule} chinks anything that matches a given regular
      expression.
    - L{UnChunkRule} will un-chunk any chunk that matches a given
      regular expression.
    - L{MergeRule} can be used to merge two contiguous chunks.
    - L{SplitRule} can be used to split a single chunk into two
      smaller chunks.
    - L{ExpandLeftRule} will expand a chunk to incorporate new
      unchunked material on the left.
    - L{ExpandRightRule} will expand a chunk to incorporate new
      unchunked material on the right.
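
For instance, the first two rule types might be constructed as follows.
The two-argument form shown (a tag pattern plus a description string) is
an assumption here; see the C{regexp} submodule for the exact signatures::

    >>> from nltk_lite import chunk
    >>> chunk_rule = chunk.ChunkRule(r'(<DT>|<JJ>|<NN>)+',
    ...                              'Chunk sequences of DT, JJ and NN')
    >>> chink_rule = chunk.ChinkRule(r'<VB.*>', 'Chink any verb tag')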

Tag Patterns
~~~~~~~~~~~~

C{RegexpChunkRule}s use a modified version of regular
expression patterns, called X{tag patterns}.  Tag patterns are
used to match sequences of tags.  Examples of tag patterns are::

     r'(<DT>|<JJ>|<NN>)+'
     r'<NN>+'
     r'<NN.*>'

The differences between regular expression patterns and tag
patterns are:

    - In tag patterns, C{'<'} and C{'>'} act as parentheses; so
      C{'<NN>+'} matches one or more repetitions of C{'<NN>'}, not
      C{'<NN'} followed by one or more repetitions of C{'>'}.
    - Whitespace in tag patterns is ignored.  So
      C{'<DT> | <NN>'} is equivalent to C{'<DT>|<NN>'}.
    - In tag patterns, C{'.'} is equivalent to C{'[^{}<>]'}; so
      C{'<NN.*>'} matches any single tag starting with C{'NN'}.

The function L{tag_pattern2re_pattern} can be used to transform
a tag pattern to an equivalent regular expression pattern.
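
For example (assuming, as the C{from regexp import *} at the end of this
module suggests, that the function is exported at the package level)::

    >>> from nltk_lite import chunk
    >>> re_pattern = chunk.tag_pattern2re_pattern(r'(<DT>|<JJ>|<NN>)+')

The exact form of the resulting regular expression is an implementation
detail, so no output is shown here.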

Efficiency
----------

Preliminary tests indicate that C{RegexpChunk} can chunk at a
rate of about 300 tokens/second, with a moderately complex rule set.

There may be problems if C{RegexpChunk} is used with more than
5,000 tokens at a time.  In particular, evaluation of some regular
expressions may cause the Python regular expression engine to
exceed its maximum recursion depth.  We have attempted to minimize
these problems, but it is impossible to avoid them completely.  We
therefore recommend that you apply the chunk parser to a single
sentence at a time.
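
A simple way to follow this recommendation is to chunk a corpus one tagged
sentence at a time; here C{parser} and C{tagged_sentences} are placeholder
names for a chunk parser and an iterable of tagged sentences::

    >>> chunked_corpus = [parser.parse(sent) for sent in tagged_sentences]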

Emacs Tip
---------

If you evaluate the following elisp expression in emacs, it will
colorize C{ChunkString}s when you use an interactive python shell
with emacs or xemacs ("C-c !")::

    (let ()
      (defconst comint-mode-font-lock-keywords
        '(("<[^>]+>" 0 'font-lock-reference-face)
          ("[{}]" 0 'font-lock-function-name-face)))
      (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer,
placing the cursor after the last close parenthesis, and typing
"C{C-x C-e}".  You should evaluate it before running the interactive
session.  The change will last until you close emacs.

Unresolved Issues
-----------------

If we use the C{re} module for regular expressions, Python's
regular expression engine generates "maximum recursion depth
exceeded" errors when processing very large texts, even for
regular expressions that should not require any recursion.  We
therefore use the C{pre} module instead.  But note that C{pre}
does not include Unicode support, so this module will not work
with unicode strings.  Note also that C{pre} regular expressions
are not quite as advanced as C{re} ones (e.g., no leftward
zero-length assertions).

@type CHUNK_TAG_PATTERN: C{regexp}
@var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
     pattern is valid.
"""

import re, types
from nltk_lite.parse import ParseI
from nltk_lite.parse.tree import Tree   # Tree is used by ChunkString below (import path assumed)
from convert import tree2conlltags

##//////////////////////////////////////////////////////
##  Chunk Parser Interface
##//////////////////////////////////////////////////////
class ChunkParseI(ParseI):
    """
    A processing interface for identifying non-overlapping groups in
    unrestricted text.  Typically, chunk parsers are used to find base
    syntactic constituents, such as base noun phrases.  Unlike
    L{ParseI}, C{ChunkParseI} guarantees that the C{parse} method
    will always generate a parse.
    """
    def parse(self, tokens):
        """
        Find the best chunk structure for the given tokens
        and return a tree.

        @param tokens: The list of (word, tag) tokens to be chunked.
        @type tokens: L{list} of L{tuple}
        """
        assert 0, "ChunkParseI is an abstract interface"


##//////////////////////////////////////////////////////
##  Precompiled regular expressions
##//////////////////////////////////////////////////////

CHUNK_TAG_CHAR = r'[^\{\}<>]'
CHUNK_TAG = r'(<%s+?>)' % CHUNK_TAG_CHAR


##//////////////////////////////////////////////////////
##  ChunkString
##//////////////////////////////////////////////////////
class ChunkString(object):
    """
    A string-based encoding of a particular chunking of a text.
    Internally, the C{ChunkString} class uses a single string to
    encode the chunking of the input text.  This string contains a
    sequence of angle-bracket delimited tags, with chunking indicated
    by braces.  An example of this encoding is::

        {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

    C{ChunkString}s are created from tagged texts (i.e., C{list}s of
    C{tokens} whose type is C{TaggedType}).  Initially, nothing is
    chunked.

    The chunking of a C{ChunkString} can be modified with the C{xform}
    method, which uses a regular expression to transform the string
    representation.  These transformations should only add and remove
    braces; they should I{not} modify the sequence of angle-bracket
    delimited tags.

    @type _str: C{string}
    @ivar _str: The internal string representation of the text's
        encoding.  This string representation contains a sequence of
        angle-bracket delimited tags, with chunking indicated by
        braces.  An example of this encoding is::

            {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

    @type _pieces: C{list} of pieces (tagged tokens and chunks)
    @ivar _pieces: The tagged tokens and chunks encoded by this C{ChunkString}.
    @ivar _debug: The debug level.  See the constructor docs.

    @cvar IN_CHUNK_PATTERN: A zero-width regexp pattern string that
        will only match positions that are in chunks.
    @cvar IN_CHINK_PATTERN: A zero-width regexp pattern string that
        will only match positions that are in chinks.
    """
    IN_CHUNK_PATTERN = r'(?=[^\{]*\})'
    IN_CHINK_PATTERN = r'(?=[^\}]*(\{|$))'

    # These are used by _verify
    _CHUNK = r'(\{%s+?\})+?' % CHUNK_TAG
    _CHINK = r'(%s+?)+?' % CHUNK_TAG
    _VALID = re.compile(r'(\{?%s\}?)*?' % CHUNK_TAG)
    _BRACKETS = re.compile('[^\{\}]+')
    _BALANCED_BRACKETS = re.compile(r'(\{\})*$')

    def __init__(self, chunk_struct, debug_level=3):
        """
        Construct a new C{ChunkString} that encodes the chunking of
        the text C{tagged_tokens}.

        @type chunk_struct: C{Tree}
        @param chunk_struct: The chunk structure to be further chunked.
        @type debug_level: int
        @param debug_level: The level of debugging which should be
            applied to transformations on the C{ChunkString}.  The
            valid levels are:
                - 0: no checks
                - 1: full check on to_chunkstruct
                - 2: full check on to_chunkstruct and cursory check after
                  each transformation.
                - 3: full check on to_chunkstruct and full check after
                  each transformation.
            We recommend you use at least level 1.  You should
            probably use level 3 if you use any non-standard
            subclasses of C{RegexpChunkRule}.
        """
        self._top_node = chunk_struct.node
        self._pieces = chunk_struct[:]
        tags = [self._tag(tok) for tok in self._pieces]
        self._str = '<' + '><'.join(tags) + '>'
        self._debug = debug_level

    def _tag(self, tok):
        if type(tok) == types.TupleType:
            return tok[1]
        elif isinstance(tok, Tree):
            return tok.node
        else:
            raise ValueError, 'chunk structures must contain tokens and trees'

    def _verify(self, verify_tags):
        """
        Check to make sure that C{_str} still corresponds to some chunked
        version of C{_pieces}.

        @type verify_tags: C{boolean}
        @param verify_tags: Whether the individual tags should be
            checked.  If this is false, C{_verify} will check to make
            sure that C{_str} encodes a chunked version of I{some}
            list of tokens.  If this is true, then C{_verify} will
            check to make sure that the tags in C{_str} match those in
            C{_pieces}.

        @raise ValueError: if this C{ChunkString}'s internal string
            representation is invalid or not consistent with _pieces.
        """
        # Check overall form
        if not ChunkString._VALID.match(self._str):
            raise ValueError('Transformation generated invalid chunkstring: %s' % self._str)

        # Check that braces are balanced.  If the string is long, we
        # have to do this in pieces, to avoid a maximum recursion
        # depth limit for regular expressions.
        brackets = ChunkString._BRACKETS.sub('', self._str)
        for i in range(1+len(brackets)/5000):
            substr = brackets[i*5000:i*5000+5000]
            if not ChunkString._BALANCED_BRACKETS.match(substr):
                raise ValueError('Transformation generated invalid chunkstring: %s' % substr)

        if verify_tags <= 0: return

        tags1 = (re.split(r'[\{\}<>]+', self._str))[1:-1]
        tags2 = [self._tag(piece) for piece in self._pieces]
        if tags1 != tags2:
            raise ValueError('Transformation generated invalid chunkstring: %s / %s' % (tags1, tags2))

    def to_chunkstruct(self, chunk_node='CHUNK'):
        """
        @return: the chunk structure encoded by this C{ChunkString}.
        @rtype: C{Tree}
        @raise ValueError: If a transformation has generated an
            invalid chunkstring.
        """
        if self._debug > 0: self._verify(1)

        # Use this alternating list to create the chunkstruct.
        pieces = []
        index = 0
        piece_in_chunk = 0
        for piece in re.split('[{}]', self._str):

            # Find the list of tokens contained in this piece.
            length = piece.count('<')
            subsequence = self._pieces[index:index+length]

            # Add this list of tokens to our pieces.
            if piece_in_chunk:
                pieces.append(Tree(chunk_node, subsequence))
            else:
                pieces += subsequence

            # Update index, piece_in_chunk
            index += length
            piece_in_chunk = not piece_in_chunk

        return Tree(self._top_node, pieces)

    def xform(self, regexp, repl):
        """
        Apply the given transformation to this C{ChunkString}'s string
        encoding.  In particular, find all occurrences that match
        C{regexp}, and replace them using C{repl} (as done by
        C{re.sub}).

        This transformation should only add and remove braces; it
        should I{not} modify the sequence of angle-bracket delimited
        tags.  Furthermore, this transformation may not result in
        improper bracketing.  Note, in particular, that bracketing may
        not be nested.

        @type regexp: C{string} or C{regexp}
        @param regexp: A regular expression matching the substring
            that should be replaced.  This will typically include a
            named group, which can be used by C{repl}.
        @type repl: C{string}
        @param repl: An expression specifying what should replace the
            matched substring.  Typically, this will include a named
            replacement group, specified by C{regexp}.
        @rtype: C{None}
        @raise ValueError: If this transformation generated an
            invalid chunkstring.
        """
        # Do the actual substitution
        self._str = re.sub(regexp, repl, self._str)

        # The substitution might have generated "empty chunks"
        # (substrings of the form "{}").  Remove them, so they don't
        # interfere with other transformations.
        self._str = re.sub('\{\}', '', self._str)

        # Make sure that the transformation was legal.
        if self._debug > 1: self._verify(self._debug-2)

    def __repr__(self):
        """
        @rtype: C{string}
        @return: A string representation of this C{ChunkString}.  This
            string representation has the form::

                <ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>

        """
        return '<ChunkString: %s>' % `self._str`

    def __str__(self):
        """
        @rtype: C{string}
        @return: A formatted representation of this C{ChunkString}'s
            string encoding.  This representation will include extra
            spaces to ensure that tags will line up with the
            representation of other C{ChunkString}s for the same text,
            regardless of the chunking.
        """
        # Add spaces to make everything line up.
        str = re.sub(r'>(?!\})', r'> ', self._str)
        str = re.sub(r'([^\{])<', r'\1 <', str)
        if str[0] == '<': str = ' ' + str
        return str


##//////////////////////////////////////////////////////
##  EVALUATION
##//////////////////////////////////////////////////////

from nltk_lite import evaluate

def accuracy(chunker, gold):
    """
    Score the accuracy of the chunker against the gold standard.
    Strip the chunk information from the gold standard and rechunk it using
    the chunker, then compute the accuracy score.

    @type chunker: C{ChunkParseI}
    @param chunker: The chunker being evaluated.
    @type gold: C{tree}
    @param gold: The chunk structures to score the chunker on.
    @rtype: C{float}
    """
    gold_tags = []
    test_tags = []
    for gold_tree in gold:
        test_tree = chunker.parse(gold_tree.flatten())
        gold_tags += tree2conlltags(gold_tree)
        test_tags += tree2conlltags(test_tree)

    # print 'GOLD:', gold_tags[:50]
    # print 'TEST:', test_tags[:50]
    return evaluate.accuracy(gold_tags, test_tags)


# Patched for increased performance by Yoav Goldberg <yoavg@cs.bgu.ac.il>, 2006-01-13
#   -- statistics are evaluated only on demand, instead of at every sentence evaluation
#
# SB: use nltk_lite.evaluate for precision/recall scoring?
#
class ChunkScore(object):
    """
    A utility class for scoring chunk parsers.  C{ChunkScore} can
    evaluate a chunk parser's output, based on a number of statistics
    (precision, recall, f-measure, missed chunks, incorrect chunks).
    It can also combine the scores from the parsing of multiple texts;
    this makes it significantly easier to evaluate a chunk parser that
    operates one sentence at a time.

    Texts are evaluated with the C{score} method.  The results of
    evaluation can be accessed via a number of accessor methods, such
    as C{precision} and C{f_measure}.  A typical use of the
    C{ChunkScore} class is::

        >>> chunkscore = ChunkScore()
        >>> for correct in correct_sentences:
        ...     guess = chunkparser.parse(correct.leaves())
        ...     chunkscore.score(correct, guess)
        >>> print 'F Measure:', chunkscore.f_measure()
        F Measure: 0.823

    @ivar kwargs: Keyword arguments:

        - max_tp_examples: The maximum number of actual examples of true
          positives to record.  This affects the C{correct} member
          function: C{correct} will not return more than this number
          of true positive examples.  This does *not* affect any of
          the numerical metrics (precision, recall, or f-measure).

        - max_fp_examples: The maximum number of actual examples of false
          positives to record.  This affects the C{incorrect} member
          function and the C{guessed} member function: C{incorrect}
          will not return more than this number of examples, and
          C{guessed} will not return more than this number of true
          positive examples.  This does *not* affect any of the
          numerical metrics (precision, recall, or f-measure).

        - max_fn_examples: The maximum number of actual examples of false
          negatives to record.  This affects the C{missed} member
          function and the C{correct} member function: C{missed}
          will not return more than this number of examples, and
          C{correct} will not return more than this number of true
          negative examples.  This does *not* affect any of the
          numerical metrics (precision, recall, or f-measure).

    @type _tp: C{list} of C{Token}
    @ivar _tp: List of true positives
    @type _fp: C{list} of C{Token}
    @ivar _fp: List of false positives
    @type _fn: C{list} of C{Token}
    @ivar _fn: List of false negatives

    @type _tp_num: C{int}
    @ivar _tp_num: Number of true positives
    @type _fp_num: C{int}
    @ivar _fp_num: Number of false positives
    @type _fn_num: C{int}
    @ivar _fn_num: Number of false negatives.
    """
    def __init__(self, **kwargs):
        self._correct = set()
        self._guessed = set()
        self._tp = set()
        self._fp = set()
        self._fn = set()
        self._max_tp = kwargs.get('max_tp_examples', 100)
        self._max_fp = kwargs.get('max_fp_examples', 100)
        self._max_fn = kwargs.get('max_fn_examples', 100)
        self._tp_num = 0
        self._fp_num = 0
        self._fn_num = 0
        self._count = 0

        self._measuresNeedUpdate = False

    def _updateMeasures(self):
        if (self._measuresNeedUpdate):
            self._tp = self._guessed & self._correct
            self._fn = self._correct - self._guessed
            self._fp = self._guessed - self._correct
            self._tp_num = len(self._tp)
            self._fp_num = len(self._fp)
            self._fn_num = len(self._fn)
            self._measuresNeedUpdate = False

    def score(self, correct, guessed):
        """
        Given a correctly chunked sentence, score another chunked
        version of the same sentence.

        @type correct: chunk structure
        @param correct: The known-correct ("gold standard") chunked
            sentence.
        @type guessed: chunk structure
        @param guessed: The chunked sentence to be scored.
        """
        self._correct |= _chunksets(correct, self._count)
        self._guessed |= _chunksets(guessed, self._count)
        self._count += 1
        self._measuresNeedUpdate = True

    def precision(self):
        """
        @return: the overall precision for all texts that have been
            scored by this C{ChunkScore}.
        @rtype: C{float}
        """
        self._updateMeasures()
        div = self._tp_num + self._fp_num
        if div == 0: return 0
        else: return float(self._tp_num) / div

    def recall(self):
        """
        @return: the overall recall for all texts that have been
            scored by this C{ChunkScore}.
        @rtype: C{float}
        """
        self._updateMeasures()
        div = self._tp_num + self._fn_num
        if div == 0: return 0
        else: return float(self._tp_num) / div

    def f_measure(self, alpha=0.5):
        """
        @return: the overall F measure for all texts that have been
            scored by this C{ChunkScore}.
        @rtype: C{float}

        @param alpha: the relative weighting of precision and recall.
            Larger alpha biases the score towards the precision value,
            while smaller alpha biases the score towards the recall
            value.  C{alpha} should have a value in the range [0,1].
        @type alpha: C{float}
        """
        self._updateMeasures()
        p = self.precision()
        r = self.recall()
        if p == 0 or r == 0:    # what if alpha is 0 or 1?
            return 0
        return 1/(alpha/p + (1-alpha)/r)

    def missed(self):
        """
        @rtype: C{list} of chunks
        @return: the chunks which were included in the
            correct chunk structures, but not in the guessed chunk
            structures, listed in input order.
        """
        self._updateMeasures()
        chunks = list(self._fn)
        return [c[1] for c in chunks]  # discard position information

    def incorrect(self):
        """
        @rtype: C{list} of chunks
        @return: the chunks which were included in the
            guessed chunk structures, but not in the correct chunk
            structures, listed in input order.
        """
        self._updateMeasures()
        chunks = list(self._fp)
        return [c[1] for c in chunks]  # discard position information

    def correct(self):
        """
        @rtype: C{list} of chunks
        @return: the chunks which were included in the correct
            chunk structures, listed in input order.
        """
        chunks = list(self._correct)
        return [c[1] for c in chunks]  # discard position information

    def guessed(self):
        """
        @rtype: C{list} of chunks
        @return: the chunks which were included in the guessed
            chunk structures, listed in input order.
        """
        chunks = list(self._guessed)
        return [c[1] for c in chunks]  # discard position information

    def __len__(self):
        self._updateMeasures()
        return self._tp_num + self._fn_num

    def __repr__(self):
        """
        @rtype: C{String}
        @return: a concise representation of this C{ChunkScore}.
        """
        return '<ChunkScoring of '+`len(self)`+' chunks>'

    def __str__(self):
        """
        @rtype: C{String}
        @return: a verbose representation of this C{ChunkScore}.
            This representation includes the precision, recall, and
            f-measure scores.  For other information about the score,
            use the accessor methods (e.g., C{missed()} and
            C{incorrect()}).
        """
        return ("ChunkParse score:\n" +
                ("    Precision: %5.1f%%\n" % (self.precision()*100)) +
                ("    Recall:    %5.1f%%\n" % (self.recall()*100)) +
                ("    F-Measure: %5.1f%%" % (self.f_measure()*100)))

    def _chunk_toks(self, text):
        """
        @return: The list of tokens contained in C{text}.
        """
        return [tok for tok in text if isinstance(tok, AbstractTree)]


# Extract chunks and assign each a unique id: the absolute position of
# the first word of the chunk.
def _chunksets(t, count):
    pos = 0
    chunks = []
    for child in t:
        try:
            chunks.append(((count, pos), tuple(child.freeze())))
            pos += len(child)
        except AttributeError:
            pos += 1
    return set(chunks)


from regexp import *
from convert import *