Classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This task is called chunk parsing or chunking, and the identified groups are called chunks. The chunked text is represented using a shallow tree called a "chunk structure." A chunk structure is a tree containing tokens and chunks, where each chunk is a subtree containing only tokens. For example, the chunk structure for base noun phrase chunks in the sentence "I saw the big dog on the hill" is:
(SENTENCE: (NP: <I>) <saw> (NP: <the> <big> <dog>) <on> (NP: <the> <hill>))
To convert a chunk structure back to a list of tokens, simply use the chunk structure's leaves method.
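As a minimal, library-independent sketch of this idea, a chunk structure can be modeled as a labeled node whose children are either tokens or chunk subtrees. The (label, children) tuple representation and the helper name leaves below are assumptions made for illustration; they are not the toolkit's own tree classes.

```python
# Illustrative only: a chunk structure modeled as (label, children) tuples.
# This is NOT the toolkit's tree class; the names here are hypothetical.

def leaves(node):
    """Return the tokens of a chunk structure in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    tokens = []
    for child in children:
        tokens.extend(leaves(child))
    return tokens

# The base noun phrase chunking of "I saw the big dog on the hill".
chunk_structure = (
    "SENTENCE",
    [
        ("NP", ["I"]),
        "saw",
        ("NP", ["the", "big", "dog"]),
        "on",
        ("NP", ["the", "hill"]),
    ],
)

print(leaves(chunk_structure))
```

Because each chunk is a subtree containing only tokens, the recursion never goes deeper than two levels here; it is written recursively only to keep the helper short.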
The parser.chunk module defines ChunkParseI, a standard interface for chunking texts, and RegexpChunk, a regular-expression based implementation of that interface. It uses the tree.chunk and tree.conll_chunk methods, which tokenize strings containing chunked and tagged texts. It also defines ChunkScore, a utility class for scoring chunk parsers.
parse.RegexpChunk is an implementation of the chunk parser interface that uses regular expressions over tags to chunk a text. Its parse method first constructs a ChunkString, which encodes a particular chunking of the input text. Initially, nothing is chunked. parse.RegexpChunk then applies a sequence of RegexpChunkRules to the ChunkString, each of which modifies the chunking that it encodes. Finally, the ChunkString is transformed back into a chunk structure, which is returned.
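The three steps of this process (encode, apply rules, decode) can be sketched with the standard re module. This is a simplified illustration under stated assumptions, not the module's implementation: the function name chunk_tags, the (pattern, replacement) rule format, and the use of braces to mark chunk boundaries in the encoded string are all chosen for the example.

```python
import re

def chunk_tags(tagged_tokens, rules, chunk_label="NP"):
    """Sketch of the encode / apply-rules / decode cycle.

    tagged_tokens: list of (word, tag) pairs.
    rules: list of (regex, replacement) pairs applied in order; a
           hypothetical stand-in for a sequence of RegexpChunkRules.
    """
    # 1. Encode the tags as a string in which nothing is chunked yet.
    encoded = "".join("<%s>" % tag for _word, tag in tagged_tokens)

    # 2. Apply each rule in turn; each one rewrites the encoding,
    #    using braces to mark chunk boundaries.
    for pattern, replacement in rules:
        encoded = re.sub(pattern, replacement, encoded)

    # 3. Decode the brace-marked string back into a nested structure.
    result, current, index = [], None, 0
    for piece in re.findall(r"[{}]|<[^>]+>", encoded):
        if piece == "{":
            current = []
        elif piece == "}":
            result.append((chunk_label, current))
            current = None
        else:
            token = tagged_tokens[index]
            index += 1
            (result if current is None else current).append(token)
    return result

sentence = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("big", "JJ"),
            ("dog", "NN"), ("on", "IN"), ("the", "DT"), ("hill", "NN")]
np_rules = [
    # Chunk an optional determiner, any adjectives, and one or more nouns.
    (r"((?:<DT>)?(?:<JJ>)*(?:<NN>)+)", r"{\1}"),
    # Chunk each pronoun on its own.
    (r"(<PRP>)", r"{\1}"),
]
chunks = chunk_tags(sentence, np_rules)
print(chunks)
```

In this sketch a chunk is a (label, list) pair and an unchunked token is a (word, tag) pair, so the two can be told apart by the type of the second element.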
RegexpChunk can only be used to chunk a single kind of phrase. For example, you can use a RegexpChunk to chunk the noun phrases in a text, or the verb phrases in a text; but you cannot use it to simultaneously chunk both noun phrases and verb phrases in the same text. (This is a limitation of RegexpChunk, not of chunk parsers in general.)
RegexpChunkRules are transformational rules that update the chunking of a text by modifying its ChunkString. Each RegexpChunkRule defines the apply method, which modifies the chunking encoded by a ChunkString. The RegexpChunkRule class itself can be used to implement any transformational rule based on regular expressions. There are also a number of subclasses, which can be used to implement simpler types of rules.
RegexpChunkRules use a modified version of regular expression patterns, called tag patterns. Tag patterns are used to match sequences of tags. Examples of tag patterns are:
r'(<DT>|<JJ>|<NN>)+'
r'<NN>+'
r'<NN.*>'
The differences between regular expression patterns and tag patterns are:
- In tag patterns, '<' and '>' act as parentheses; so '<NN>+' matches one or more repetitions of '<NN>', not '<NN' followed by one or more repetitions of '>'.
- Whitespace in tag patterns is ignored; so '<DT> | <NN>' is equivalent to '<DT>|<NN>'.
- In tag patterns, '.' is equivalent to '[^{}<>]'; so '<NN.*>' matches any single tag starting with 'NN'.
The function tag_pattern2re_pattern can be used to transform a tag pattern to an equivalent regular expression pattern.
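To make the three differences concrete, here is a rough sketch of such a translation. It is a hypothetical re-implementation written for illustration (the name sketch_tag_pattern_to_regex is an assumption), not the module's tag_pattern2re_pattern, and it does not handle escaped characters such as a literal '\.' in a tag pattern.

```python
import re

def sketch_tag_pattern_to_regex(tag_pattern):
    """Translate a tag pattern into an ordinary regular expression.

    Hypothetical illustration only; escaped characters are not handled.
    """
    # Whitespace is ignored in tag patterns.
    pattern = re.sub(r"\s+", "", tag_pattern)
    # '<' and '>' act as parentheses, so each tag becomes a group and a
    # following '+' or '*' applies to the whole tag.
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # '.' must not match across tag or chunk boundaries. This step is
    # done last because its replacement text itself contains '<' and '>'.
    pattern = pattern.replace(".", "[^{}<>]")
    return pattern

print(sketch_tag_pattern_to_regex("<NN>+"))
print(bool(re.fullmatch(sketch_tag_pattern_to_regex("<NN.*>"), "<NNS>")))
```

The order of the replacements matters: substituting '.' first would introduce extra angle brackets that the grouping step would then rewrite.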
Preliminary tests indicate that RegexpChunk can chunk at a rate of about 300 tokens/second, with a moderately complex rule set.
There may be problems if RegexpChunk is used with more than 5,000 tokens at a time. In particular, evaluation of some regular expressions may cause the Python regular expression engine to exceed its maximum recursion depth. We have attempted to minimize these problems, but it is impossible to avoid them completely. We therefore recommend that you apply the chunk parser to a single sentence at a time.
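One way to follow this recommendation is to split the tagged token stream into sentences before chunking. The sketch below is an assumption-laden illustration: chunk_sentence stands in for whatever callable performs the chunking, and sentences are assumed to end at a token tagged '.', which may not hold for every tag set.

```python
def chunk_by_sentence(tagged_tokens, chunk_sentence, end_tags=(".",)):
    """Apply `chunk_sentence` to one sentence at a time.

    `chunk_sentence` is any callable taking a list of (word, tag) pairs;
    it is a hypothetical stand-in for a chunk parser's parse method.
    Sentences are assumed to end at a token whose tag is in `end_tags`.
    """
    results, sentence = [], []
    for token in tagged_tokens:
        sentence.append(token)
        if token[1] in end_tags:
            results.append(chunk_sentence(sentence))
            sentence = []
    if sentence:  # trailing tokens with no sentence-final tag
        results.append(chunk_sentence(sentence))
    return results

tokens = [("A", "DT"), ("dog", "NN"), ("barked", "VBD"), (".", "."),
          ("It", "PRP"), ("ran", "VBD")]
# Using len as the "chunker" simply reports the size of each sentence.
print(chunk_by_sentence(tokens, len))
```

Each call then sees only a handful of tokens, which keeps the encoded string far below the size at which the recursion problems described above were observed.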
If you evaluate the following elisp expression in emacs, it will colorize ChunkStrings when you use an interactive python shell with emacs or xemacs ("C-c !"):
(let ()
  (defconst comint-mode-font-lock-keywords
    '(("<[^>]+>" 0 'font-lock-reference-face)
      ("[{}]" 0 'font-lock-function-name-face)))
  (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))
You can evaluate this code by copying it to a temporary buffer, placing the cursor after the last close parenthesis, and typing "C-x C-e". You should evaluate it before running the interactive session. The change will last until you close emacs.
If we use the re module for regular expressions, Python's regular expression engine generates "maximum recursion depth exceeded" errors when processing very large texts, even for regular expressions that should not require any recursion. We therefore use the pre module instead. But note that pre does not include Unicode support, so this module will not work with Unicode strings. Note also that pre regular expressions are not quite as advanced as re ones (e.g., no leftward zero-length assertions).
ChunkParseI: A processing interface for identifying non-overlapping groups in unrestricted text.
ChunkScore: A utility class for scoring chunk parsers.
ChunkString: A string-based encoding of a particular chunking of a text.
Score the accuracy of the chunker against the gold standard. Strip the chunk information from the gold standard and rechunk it using the chunker, then compute the accuracy score. Returns a float.
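The procedure described here can be sketched as follows. The source does not say how the accuracy score itself is computed, so the per-token comparison of chunk tags below is an assumption, as are the function name accuracy_sketch, the triple-based gold format, and the callable chunker interface.

```python
def accuracy_sketch(chunker, gold_sentences):
    """Per-token accuracy of `chunker` against gold-standard chunk tags.

    gold_sentences: list of sentences, each a list of
                    (word, pos_tag, chunk_tag) triples.
    chunker: callable mapping a list of (word, pos_tag) pairs to a list
             of predicted chunk tags of the same length (hypothetical).
    """
    correct = total = 0
    for sentence in gold_sentences:
        # Strip the chunk information from the gold standard...
        unchunked = [(word, pos) for word, pos, _chunk in sentence]
        # ...rechunk it with the chunker under test...
        predicted = chunker(unchunked)
        # ...and compare the prediction with the gold chunk tags.
        for (_word, _pos, gold_tag), guess in zip(sentence, predicted):
            correct += (gold_tag == guess)
            total += 1
    return correct / total if total else 0.0

gold = [[("the", "DT", "B"), ("dog", "NN", "I"), ("ran", "VBD", "O")]]

def never_chunks(sentence):
    """A trivial baseline that marks every token as outside any chunk."""
    return ["O"] * len(sentence)

print(accuracy_sketch(never_chunks, gold))
```

A baseline that never chunks anything is a useful sanity check: its score shows how much of the accuracy comes simply from tokens that lie outside every chunk.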
Generated by Epydoc 3.0beta1 on Wed May 16 22:47:17 2007 | http://epydoc.sourceforge.net |