Transform¶

class supar.utils.transform.Transform[source]¶

A Transform object corresponds to a specific data format. It holds several data field instances that specify how the data is preprocessed and numericalized.

training¶

Whether the object is in training mode. If False, data fields not required for prediction are not returned. Default: True.

Type: bool

CoNLL¶

class supar.utils.transform.CoNLL(ID=None, FORM=None, LEMMA=None, CPOS=None, POS=None, FEATS=None, HEAD=None, DEPREL=None, PHEAD=None, PDEPREL=None)[source]¶

The CoNLL object holds ten fields required for CoNLL-X data format [Buchholz & Marsi 2006]. Each field can be bound to one or more Field objects. For example, FORM can contain both Field and SubwordField to produce tensors for words and subwords.

ID¶: Token counter, starting at 1.

FORM¶: Words in the sentence.

LEMMA¶: Lemmas or stems (depending on the particular treebank) of words, or underscores if not available.

CPOS¶: Coarse-grained part-of-speech tags, where the tagset depends on the treebank.

POS¶: Fine-grained part-of-speech tags, where the tagset depends on the treebank.

FEATS¶: Unordered set of syntactic and/or morphological features (depending on the particular treebank), or underscores if not available.

HEAD¶: Heads of the tokens, which are either values of ID or zeros.

DEPREL¶: Dependency relations to the HEAD.

PHEAD¶: Projective heads of tokens, which are either values of ID or zeros, or underscores if not available.

PDEPREL¶: Dependency relations to the PHEAD, or underscores if not available.

classmethod toconll(tokens)[source]¶

Converts a list of tokens to a string in CoNLL-X format. Missing fields are filled with underscores.

Parameters: tokens (list[str] or list[tuple]) – This can be either a list of words, word/pos pairs or word/lemma/pos triples.
Returns: A string in CoNLL-X format.

Examples

>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1       She     _       _       _       _       _       _       _       _
2       enjoys  _       _       _       _       _       _       _       _
3       playing _       _       _       _       _       _       _       _
4       tennis  _       _       _       _       _       _       _       _
5       .       _       _       _       _       _       _       _       _

>>> print(CoNLL.toconll([('She',     'she',    'PRP'),
                         ('enjoys',  'enjoy',  'VBZ'),
                         ('playing', 'play',   'VBG'),
                         ('tennis',  'tennis', 'NN'),
                         ('.',       '_',      '.')]))
1       She     she     PRP     _       _       _       _       _       _
2       enjoys  enjoy   VBZ     _       _       _       _       _       _
3       playing play    VBG     _       _       _       _       _       _
4       tennis  tennis  NN      _       _       _       _       _       _
5       .       _       .       _       _       _       _       _       _
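
The conversion shown in the examples above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name to_conll is hypothetical, and placing the tag of a word/pos pair in the fourth column is an assumption based on where the triples example puts it.

```python
def to_conll(tokens):
    """Render words, (word, pos) pairs or (word, lemma, pos) triples as CoNLL-X text."""
    rows = []
    for i, token in enumerate(tokens, 1):
        if isinstance(token, str):
            form, lemma, pos = token, '_', '_'
        elif len(token) == 2:
            (form, pos), lemma = token, '_'  # assumption: the tag fills column 4
        else:
            form, lemma, pos = token
        # ID, FORM, LEMMA and the tag column, then six fields filled with underscores
        rows.append('\t'.join([str(i), form, lemma, pos] + ['_'] * 6))
    return '\n'.join(rows) + '\n'


print(to_conll([('She', 'she', 'PRP'), ('enjoys', 'enjoy', 'VBZ')]))
```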

classmethod isprojective(sequence)[source]¶

Checks if a dependency tree is projective. This also works for partial annotation.

Besides the obvious crossing arcs, the examples below illustrate two non-projective cases that are hard to detect under partial annotation.

Parameters: sequence (list[int]) – A list of head indices.
Returns: True if the tree is projective, False otherwise.

Examples

>>> CoNLL.isprojective([2, -1, 1])  # -1 denotes un-annotated cases
False
>>> CoNLL.isprojective([3, -1, 2])
False
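
One way to cover both the crossing-arc case and the two partial-annotation cases above is a pairwise comparison of all annotated arcs. The sketch below is an illustration under that assumption, not necessarily how the library does it; the name is_projective is hypothetical.

```python
def is_projective(heads):
    """Pairwise projectivity check; heads[k] is the head of token k+1, -1 if un-annotated."""
    # (head, dependent) pairs for annotated tokens only
    arcs = [(h, d) for d, h in enumerate(heads, 1) if h >= 0]
    for k, (hi, di) in enumerate(arcs):
        for hj, dj in arcs[k + 1:]:
            li, ri = sorted((hi, di))
            lj, rj = sorted((hj, dj))
            # arc i spans the head of its own head
            if li <= hj <= ri and hi == dj:
                return False
            # arc j spans the head of its own head
            if lj <= hi <= rj and hj == di:
                return False
            # the two arcs overlap without one nesting inside the other
            if (li < lj < ri or li < rj < ri) and (li - lj) * (ri - rj) > 0:
                return False
    return True


print(is_projective([2, -1, 1]), is_projective([2, 0, 2]))
```

The check is quadratic in the number of annotated arcs, which is usually acceptable at sentence length.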

classmethod istree(sequence, proj=False, multiroot=False)[source]¶

Checks if the arcs form a valid dependency tree.

Parameters

sequence (list[int]) – A list of head indices.
proj (bool) – If True, requires the tree to be projective. Default: False.
multiroot (bool) – If False, requires the tree to contain only a single root. Default: False.

Returns

True if the arcs form a valid tree, False otherwise.

Examples

>>> CoNLL.istree([3, 0, 0, 3], multiroot=True)
True
>>> CoNLL.istree([3, 0, 0, 3], proj=True)
False
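
The root and cycle conditions behind such a check can be sketched as follows. This is a simplified illustration with a hypothetical name (is_tree); it leaves out the proj option and detects cycles by walking head pointers, which may differ from the library's approach.

```python
def is_tree(heads, multiroot=False):
    """Check root count and acyclicity; heads[k] is the head of token k+1, 0 for a root."""
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        return False
    n_roots = sum(h == 0 for h in heads)
    if n_roots == 0 or (not multiroot and n_roots > 1):
        return False
    for start in range(1, n + 1):
        node, steps = start, 0
        # follow head pointers; a path longer than n tokens must contain a cycle
        while node != 0:
            node = heads[node - 1]
            steps += 1
            if steps > n:
                return False
    return True


print(is_tree([3, 0, 0, 3], multiroot=True), is_tree([3, 0, 0, 3]))
```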

load(data, lang=None, proj=False, max_len=None, **kwargs)[source]¶

Loads the data in CoNLL-X format. Also supports loading data from CoNLL-U files with comments and non-integer IDs.

Parameters

data (list[list] or str) – A list of instances or a filename.
lang (str) – Language code (e.g., en) or language name (e.g., English) for the text to tokenize. None if tokenization is not required. Default: None.
proj (bool) – If True, discards all non-projective sentences. Default: False.
max_len (int) – Sentences exceeding the length will be discarded. Default: None.

Returns

A list of CoNLLSentence instances.
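
To make the CoNLL-U handling and the max_len filter concrete, here is a rough reader in plain Python. It is a sketch under stated assumptions: read_conll is a hypothetical name, comment lines and rows with non-integer IDs (such as 1-2 or 2.1) are simply skipped, and plain lists of columns stand in for CoNLLSentence objects. The library may treat such lines differently, for example by preserving them.

```python
def read_conll(text, max_len=None):
    """Split CoNLL-X/CoNLL-U text into sentences, each a list of column lists."""
    sentences, rows = [], []
    # the extra empty string flushes the last sentence if the text lacks a final blank line
    for line in text.splitlines() + ['']:
        line = line.strip()
        if not line:
            # a blank line ends a sentence; overlong sentences are dropped
            if rows and (max_len is None or len(rows) <= max_len):
                sentences.append(rows)
            rows = []
        elif line.startswith('#'):
            continue  # comment line
        else:
            cols = line.split('\t')
            if cols[0].isdigit():  # keep rows with integer IDs only
                rows.append(cols)
    return sentences
```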

Tree¶

class supar.utils.transform.Tree(WORD=None, POS=None, TREE=None, CHART=None)[source]¶

The Tree object factorizes a constituency tree into four fields, each associated with one or more Field objects.

WORD¶: Words in the sentence.

POS¶: Part-of-speech tags, or underscores if not available.

TREE¶: The raw constituency tree in nltk.tree.Tree format.

CHART¶: The factorized sequence of the binarized tree, traversed in pre-order.

classmethod totree(tokens, root='', special_tokens={'(': '-LRB-', ')': '-RRB-'})[source]¶

Converts a list of tokens to an nltk.tree.Tree. Missing fields are filled with underscores.

Parameters

tokens (list[str] or list[tuple]) – This can be either a list of words or word/pos pairs.
root (str) – The root label of the tree. Default: ''.
special_tokens (dict) – A dict for normalizing special tokens that would otherwise crash tree construction. Default: {'(': '-LRB-', ')': '-RRB-'}.

Returns

An nltk.tree.Tree object.

Examples

>>> print(Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP'))
(TOP ( (_ She)) ( (_ enjoys)) ( (_ playing)) ( (_ tennis)) ( (_ .)))
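
The bracketed string in the example output can be reproduced without nltk, which makes the role of special_tokens easier to see. The helper below is a hypothetical sketch (to_bracket is not part of the library) that returns the bracket string rather than an nltk.tree.Tree.

```python
SPECIAL_TOKENS = {'(': '-LRB-', ')': '-RRB-'}


def to_bracket(tokens, root='', special_tokens=SPECIAL_TOKENS):
    """Build a flat bracketed tree string from words or (word, pos) pairs."""
    leaves = []
    for token in tokens:
        word, pos = (token, '_') if isinstance(token, str) else token
        # raw parentheses would be read as tree brackets, so normalize them first
        word = special_tokens.get(word, word)
        leaves.append(f'( ({pos} {word}))')
    return f"({root} {' '.join(leaves)})"


print(to_bracket(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP'))
```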

classmethod binarize(tree)[source]¶

Conducts binarization over the tree.

First, the tree is transformed to satisfy Chomsky Normal Form (CNF). Here we call chomsky_normal_form() to conduct left-binarization. Second, all unary productions in the tree are collapsed.

Parameters: tree (nltk.tree.Tree) – The tree to be binarized.
Returns: The binarized tree.

Examples

>>> tree = nltk.Tree.fromstring('''
                                (TOP
                                  (S
                                    (NP (_ She))
                                    (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
                                    (_ .)))
                                ''')
>>> print(Tree.binarize(tree))
(TOP
  (S
    (S|<>
      (NP (_ She))
      (VP
        (VP|<> (_ enjoys))
        (S::VP (VP|<> (_ playing)) (NP (_ tennis)))))
    (S|<> (_ .))))
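
The second step, collapsing unary productions, is what produces joined labels such as S::VP in the output above. The sketch below illustrates only that step on a plain nested-tuple tree; the (label, children) representation and the name collapse_unary_chain are assumptions for illustration, and special handling of the root node is omitted.

```python
def collapse_unary_chain(node, join='::'):
    """Merge chains of single-child non-terminals, e.g. S -> VP becomes S::VP.

    A node is (label, children); a pre-terminal is (pos, word) with a str word.
    """
    label, children = node
    if isinstance(children, str):
        return node  # pre-terminal, nothing to collapse
    # keep merging while the only child is itself a non-terminal
    while len(children) == 1 and not isinstance(children[0][1], str):
        child_label, children = children[0]
        label = f'{label}{join}{child_label}'
    return (label, [collapse_unary_chain(child, join) for child in children])


subtree = ('S', [('VP', [('_', 'playing'), ('NP', [('_', 'tennis')])])])
print(collapse_unary_chain(subtree))
```

A non-terminal whose single child is a pre-terminal, such as NP over tennis, is left as it is.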

classmethod factorize(tree, delete_labels=None, equal_labels=None)[source]¶

Factorizes the tree into a sequence. The tree is traversed in pre-order.

Parameters

tree (nltk.tree.Tree) – The tree to be factorized.
delete_labels (set[str]) – A set of labels to be ignored. This is used for evaluation. If it is a pre-terminal label, delete the word along with the brackets. If it is a non-terminal label, just delete the brackets (don't delete children). In EVALB, the default set is: {'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''}. Default: None.
equal_labels (dict[str, str]) – The key-val pairs in the dict are considered equivalent (non-directional). This is used for evaluation. The default dict defined in EVALB is: {'ADVP': 'PRT'}. Default: None.

Returns

The sequence of the factorized tree.

Examples

>>> tree = nltk.Tree.fromstring('''
                                (TOP
                                  (S
                                    (NP (_ She))
                                    (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
                                    (_ .)))
                                ''')
>>> Tree.factorize(tree)
[(0, 5, 'TOP'), (0, 5, 'S'), (0, 1, 'NP'), (1, 4, 'VP'), (2, 4, 'S'), (2, 4, 'VP'), (3, 4, 'NP')]
>>> Tree.factorize(tree, delete_labels={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''})
[(0, 5, 'S'), (0, 1, 'NP'), (1, 4, 'VP'), (2, 4, 'S'), (2, 4, 'VP'), (3, 4, 'NP')]
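
The pre-order traversal and the effect of delete_labels can be sketched on a plain nested-tuple tree. The (label, children) representation and the standalone function below are assumptions for illustration, not the library's code; the logic follows the parameter descriptions above.

```python
def factorize(tree, delete_labels=None, equal_labels=None):
    """Return (left, right, label) spans in pre-order.

    A node is (label, children); a pre-terminal is (pos, word) with a str word.
    """
    def track(node, i):
        label, children = node
        if delete_labels is not None and label in delete_labels:
            label = None
        if equal_labels is not None:
            label = equal_labels.get(label, label)
        if isinstance(children, str):
            # a deleted pre-terminal removes its word, so the position does not advance
            return (i + 1 if label is not None else i), []
        j, spans = i, []
        for child in children:
            j, child_spans = track(child, j)
            spans += child_spans
        # a deleted non-terminal drops only its own span, its children are kept
        if label is not None and j > i:
            spans = [(i, j, label)] + spans
        return j, spans

    return track(tree, 0)[1]


tree = ('TOP', [('S', [
    ('NP', [('_', 'She')]),
    ('VP', [('_', 'enjoys'), ('S', [('VP', [('_', 'playing'), ('NP', [('_', 'tennis')])])])]),
    ('_', '.')])])
print(factorize(tree))
```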

classmethod build(tree, sequence)[source]¶

Builds a constituency tree from the sequence. The sequence is generated in pre-order. While building the tree, the sequence is de-binarized to the original format (i.e., the suffixes |<> are ignored and the collapsed labels are recovered).

Parameters

tree (nltk.tree.Tree) – An empty tree that provides a base for building a result tree.
sequence (list[tuple]) – A list of tuples used for generating a tree. Each tuple consists of the indices of the left/right boundaries and the label of the constituent.

Returns

A result constituency tree.

Examples

>>> tree = Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP')
>>> sequence = [(0, 5, 'S'), (0, 4, 'S|<>'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP|<>'),
                (2, 4, 'S::VP'), (2, 3, 'VP|<>'), (3, 4, 'NP'), (4, 5, 'S|<>')]
>>> print(Tree.build(tree, sequence))
(TOP
  (S
    (NP (_ She))
    (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
    (_ .)))
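
The de-binarization described above can be sketched recursively: each span either covers a single word or is followed in the sequence by exactly two sub-spans. The version below works on plain tuples instead of nltk trees and is an illustration only; build, render and the (label, children) representation are hypothetical.

```python
def build(words, sequence, root='TOP'):
    """Rebuild a (label, children) tree from pre-order spans of a binarized tree."""
    leaves = [('_', word) for word in words]
    spans = iter(sequence)

    def track():
        i, j, label = next(spans)
        # a width-1 span is a leaf; any wider span is followed by its two sub-spans
        children = [leaves[i]] if j == i + 1 else track() + track()
        if label is None or label.endswith('|<>'):
            return children  # artificial node: splice its children into the parent
        labels = label.split('::')  # recover collapsed unary chains
        node = (labels[-1], children)
        for outer in reversed(labels[:-1]):
            node = (outer, [node])
        return [node]

    return (root, track())


def render(node):
    """Format a (label, children) tree as a one-line bracket string."""
    label, children = node
    if isinstance(children, str):
        return f'({label} {children})'
    return f"({label} {' '.join(render(child) for child in children)})"


words = ['She', 'enjoys', 'playing', 'tennis', '.']
sequence = [(0, 5, 'S'), (0, 4, 'S|<>'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP|<>'),
            (2, 4, 'S::VP'), (2, 3, 'VP|<>'), (3, 4, 'NP'), (4, 5, 'S|<>')]
print(render(build(words, sequence)))
```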

load(data, lang=None, max_len=None, **kwargs)[source]¶

Parameters

data (list[list] or str) – A list of instances or a filename.
lang (str) – Language code (e.g., en) or language name (e.g., English) for the text to tokenize. None if tokenization is not required. Default: None.
max_len (int) – Sentences exceeding the length will be discarded. Default: None.

Returns

A list of TreeSentence instances.

SuPar 1.1.4 documentation
