Transform¶
CoNLL¶
- class supar.utils.transform.CoNLL(ID=None, FORM=None, LEMMA=None, CPOS=None, POS=None, FEATS=None, HEAD=None, DEPREL=None, PHEAD=None, PDEPREL=None)[source]¶
The CoNLL object holds the ten fields required by the CoNLL-X data format [Buchholz & Marsi 2006]. Each field can be bound to one or more Field objects. For example, FORM can contain both Field and SubwordField to produce tensors for words and subwords.
- ID¶
Token counter, starting at 1.
- FORM¶
Words in the sentence.
- LEMMA¶
Lemmas or stems (depending on the particular treebank) of words, or underscores if not available.
- CPOS¶
Coarse-grained part-of-speech tags, where the tagset depends on the treebank.
- POS¶
Fine-grained part-of-speech tags, where the tagset depends on the treebank.
- FEATS¶
Unordered set of syntactic and/or morphological features (depending on the particular treebank), or underscores if not available.
- HEAD¶
Heads of the tokens, which are either values of ID or zeros.
- DEPREL¶
Dependency relations to the HEAD.
- PHEAD¶
Projective heads of tokens, which are either values of ID or zeros, or underscores if not available.
- PDEPREL¶
Dependency relations to the PHEAD, or underscores if not available.
- classmethod toconll(tokens)[source]¶
Converts a list of tokens to a string in CoNLL-X format. Missing fields are filled with underscores.
- Parameters
tokens (list[str] or list[tuple]) – This can be either a list of words, word/pos pairs or word/lemma/pos triples.
- Returns
A string in CoNLL-X format.
Examples
>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1	She	_	_	_	_	_	_	_	_
2	enjoys	_	_	_	_	_	_	_	_
3	playing	_	_	_	_	_	_	_	_
4	tennis	_	_	_	_	_	_	_	_
5	.	_	_	_	_	_	_	_	_
>>> print(CoNLL.toconll([('She', 'she', 'PRP'), ('enjoys', 'enjoy', 'VBZ'), ('playing', 'play', 'VBG'), ('tennis', 'tennis', 'NN'), ('.', '_', '.')]))
1	She	she	PRP	_	_	_	_	_	_
2	enjoys	enjoy	VBZ	_	_	_	_	_	_
3	playing	play	VBG	_	_	_	_	_	_
4	tennis	tennis	NN	_	_	_	_	_	_
5	.	_	.	_	_	_	_	_	_
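The column layout in the examples above can be reproduced with a few lines of plain Python. This is only a sketch of the output format, not the library's implementation: `to_conll_sketch` is a hypothetical name, and placing the tag of a word/pos pair in the CPOS column is an assumption made by analogy with the triple example.

```python
def to_conll_sketch(tokens):
    """Format words, (word, pos) pairs or (word, lemma, pos) triples as CoNLL-X rows."""
    lines = []
    for i, token in enumerate(tokens, 1):
        if isinstance(token, str):
            form, lemma, cpos = token, "_", "_"
        elif len(token) == 2:
            # Assumption: the tag of a word/pos pair goes into the CPOS column.
            (form, cpos), lemma = token, "_"
        else:
            form, lemma, cpos = token
        # ID, FORM, LEMMA, CPOS, then six unfilled columns (POS through PDEPREL).
        lines.append("\t".join([str(i), form, lemma, cpos] + ["_"] * 6))
    return "\n".join(lines)


rows = to_conll_sketch([("She", "PRP"), ("enjoys", "VBZ")])
print(rows)
```

Every row has exactly ten tab-separated columns, so the result can be split back into fields with `line.split("\t")`.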
- classmethod isprojective(sequence)[source]¶
Checks if a dependency tree is projective. This also works for partial annotation.
Besides the obvious crossing arcs, the examples below illustrate two non-projective cases which are hard to detect in the scenario of partial annotation.
- Parameters
- Returns
True if the tree is projective, False otherwise.
Examples
>>> CoNLL.isprojective([2, -1, 1])  # -1 denotes un-annotated cases
False
>>> CoNLL.isprojective([3, -1, 2])
False
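To see why both examples are non-projective, note that in each of them an arc spans the head of its own head. The sketch below checks for that situation plus ordinary crossing arcs. It is an illustration under the assumption that heads are 1-based with 0 as the root and -1 for un-annotated tokens; `is_projective_sketch` is a hypothetical helper and may not cover every partially annotated case that the library handles.

```python
def is_projective_sketch(heads):
    """heads[i - 1] is the head of token i; 0 is the root, -1 means un-annotated."""
    arcs = [(h, d) for d, h in enumerate(heads, 1) if h >= 0]
    for h, d in arcs:
        left, right = min(h, d), max(h, d)
        # Case from the examples: the arc strictly covers the head of its own head.
        if h > 0 and left < heads[h - 1] < right:
            return False
        # Ordinary crossing arcs: another arc starts inside this one and ends outside.
        for h2, d2 in arcs:
            left2, right2 = min(h2, d2), max(h2, d2)
            if left < left2 < right < right2:
                return False
    return True


print(is_projective_sketch([2, -1, 1]))
print(is_projective_sketch([2, 0, 2]))
```

In `[2, -1, 1]`, the arc from token 1 to token 3 covers token 2, which is the head of token 1, so token 2 cannot be a descendant of token 1 and the tree cannot be projective.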
- classmethod istree(sequence, proj=False, multiroot=False)[source]¶
Checks if the arcs form a valid dependency tree.
- Parameters
- Returns
True if the arcs form a valid tree, False otherwise.
Examples
>>> CoNLL.istree([3, 0, 0, 3], multiroot=True)
True
>>> CoNLL.istree([3, 0, 0, 3], proj=True)
False
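A minimal sketch of the structural part of such a check is given below: it counts roots and follows head pointers to detect cycles. It assumes fully annotated, 1-based heads with 0 as the root, and it leaves out the projectivity test that `proj=True` adds. `is_tree_sketch` is a hypothetical name, not the library's code.

```python
def is_tree_sketch(heads, multiroot=False):
    """Return True if the head sequence is acyclic and has an acceptable root count."""
    n = len(heads)
    # Every head must be the root (0) or an existing token ID.
    if any(h < 0 or h > n for h in heads):
        return False
    n_roots = sum(h == 0 for h in heads)
    if n_roots == 0 or (n_roots > 1 and not multiroot):
        return False
    # Walk from each token towards the root; revisiting a token means a cycle.
    for start in range(1, n + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


print(is_tree_sketch([3, 0, 0, 3], multiroot=True))
print(is_tree_sketch([3, 0, 0, 3]))
```

With the default `multiroot=False`, the sequence `[3, 0, 0, 3]` is rejected in this sketch because it has two root tokens.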
- load(data, lang=None, proj=False, max_len=None, **kwargs)[source]¶
Loads the data in CoNLL-X format. Loading data from CoNLL-U files with comments and non-integer IDs is also supported.
- Parameters
data (list[list] or str) – A list of instances or a filename.
lang (str) – Language code (e.g., en) or language name (e.g., English) for the text to tokenize. None if tokenization is not required. Default: None.
proj (bool) – If True, discards all non-projective sentences. Default: False.
max_len (int) – Sentences exceeding the length will be discarded. Default: None.
- Returns
A list of CoNLLSentence instances.
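The following sketch shows one way a reader might handle the CoNLL-U features mentioned above: comment lines start with `#`, while multiword-token ranges (such as `1-2`) and empty nodes (such as `1.1`) carry non-integer IDs. It is not the library's loader; `read_conllu_sketch` and the dictionary layout are assumptions made for illustration, and the sketch simply skips non-integer rows.

```python
def read_conllu_sketch(text):
    """Split CoNLL-U text into sentences, keeping only rows with integer IDs."""
    sentences, rows, comments = [], [], []
    # The appended empty string flushes the last sentence.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if not line:
            if rows:
                sentences.append({"comments": comments, "rows": rows})
            rows, comments = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            columns = line.split("\t")
            # Ranges such as "1-2" and empty nodes such as "1.1" are skipped here.
            if columns[0].isdigit():
                rows.append(columns)
    return sentences


sample = "\n".join([
    "# text = vamonos",
    "1-2\tvamonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tvamos\tir\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tnos\tnosotros\tPRON\t_\t_\t1\tobj\t_\t_",
])
sentences = read_conllu_sketch(sample)
print([row[1] for row in sentences[0]["rows"]])
```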
Tree¶
- class supar.utils.transform.Tree(WORD=None, POS=None, TREE=None, CHART=None)[source]¶
The Tree object factorizes a constituency tree into four fields, each associated with one or more Field objects.
- WORD¶
Words in the sentence.
- POS¶
Part-of-speech tags, or underscores if not available.
- TREE¶
The raw constituency tree in nltk.tree.Tree format.
- CHART¶
The factorized sequence of the binarized tree, traversed in pre-order.
- classmethod totree(tokens, root='', special_tokens={'(': '-LRB-', ')': '-RRB-'})[source]¶
Converts a list of tokens to an nltk.tree.Tree. Missing fields are filled with underscores.
- Parameters
- Returns
An nltk.tree.Tree object.
Examples
>>> print(Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP'))
(TOP ( (_ She)) ( (_ enjoys)) ( (_ playing)) ( (_ tennis)) ( (_ .)))
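The flat bracketing in the example can be sketched as plain string building: each token becomes an unlabeled node over a pre-terminal, and brackets are mapped through `special_tokens` so they cannot be confused with tree structure. `to_bracketed_sketch` is a hypothetical helper that returns a string rather than an `nltk.tree.Tree`. Accepting word/pos pairs and replacing only the word are assumptions, since the parameter list is not shown here.

```python
def to_bracketed_sketch(tokens, root="", special_tokens=None):
    """Build a flat bracketed string from words or (word, pos) pairs."""
    if special_tokens is None:
        special_tokens = {"(": "-LRB-", ")": "-RRB-"}
    leaves = []
    for token in tokens:
        # A bare word gets "_" as its missing part-of-speech tag.
        word, pos = (token, "_") if isinstance(token, str) else token
        word = special_tokens.get(word, word)
        leaves.append(f"( ({pos} {word}))")
    return f"({root} {' '.join(leaves)})"


print(to_bracketed_sketch(["She", "enjoys", "playing", "tennis", "."], "TOP"))
```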
- classmethod binarize(tree)[source]¶
Conducts binarization over the tree.
First, the tree is transformed to satisfy Chomsky Normal Form (CNF). Here we call chomsky_normal_form() to conduct left-binarization. Second, all unary productions in the tree are collapsed.
- Parameters
tree (nltk.tree.Tree) – The tree to be binarized.
- Returns
The binarized tree.
Examples
>>> tree = nltk.Tree.fromstring('''
...     (TOP
...       (S
...         (NP (_ She))
...         (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
...         (_ .)))
...     ''')
>>> print(Tree.binarize(tree))
(TOP
  (S
    (S|<>
      (NP (_ She))
      (VP
        (VP|<> (_ enjoys))
        (S::VP (VP|<> (_ playing)) (NP (_ tennis)))))
    (S|<> (_ .))))
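The two steps can be illustrated without NLTK on trees written as nested tuples, where a pre-terminal is a `(tag, word)` pair. The sketch below is not the library's code: `binarize_sketch` is a hypothetical helper whose labeling conventions (`|<>` for auxiliary nodes, `::` for collapsed unary chains, wrapping of pre-terminals that have siblings) are inferred from the example output above.

```python
def is_preterminal(node):
    # A pre-terminal is a (tag, word) pair whose second item is a plain string.
    return len(node) == 2 and isinstance(node[1], str)


def binarize_sketch(node, is_root=True):
    if is_preterminal(node):
        return node
    label, inner, children = node[0], node[0], list(node[1:])
    # Collapse unary chains below the root, joining labels with "::".
    while not is_root and len(children) == 1 and not is_preterminal(children[0]):
        inner = children[0][0]
        label = f"{label}::{inner}"
        children = list(children[0][1:])
    children = [binarize_sketch(child, is_root=False) for child in children]
    if len(children) > 1:
        # Wrap pre-terminals that have siblings in an auxiliary "X|<>" node.
        children = [(f"{inner}|<>", c) if is_preterminal(c) else c for c in children]
    # Left-binarize: fold the two leftmost children until only two remain.
    while len(children) > 2:
        children = [(f"{inner}|<>", *children[:2])] + children[2:]
    return (label, *children)


def to_string(node):
    if is_preterminal(node):
        return f"({node[0]} {node[1]})"
    return f"({node[0]} {' '.join(to_string(child) for child in node[1:])})"


tree = ("TOP", ("S", ("NP", ("_", "She")),
                ("VP", ("_", "enjoys"),
                 ("S", ("VP", ("_", "playing"), ("NP", ("_", "tennis"))))),
                ("_", ".")))
print(to_string(binarize_sketch(tree)))
```

On this input the sketch prints the same bracketing as the example, on a single line.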
- classmethod factorize(tree, delete_labels=None, equal_labels=None)[source]¶
Factorizes the tree into a sequence. The tree is traversed in pre-order.
- Parameters
tree (nltk.tree.Tree) – The tree to be factorized.
delete_labels (set[str]) – A set of labels to be ignored. This is used for evaluation. If it is a pre-terminal label, delete the word along with the brackets. If it is a non-terminal label, just delete the brackets (don't delete children). In EVALB, the default set is: {'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''}. Default: None.
equal_labels (dict[str, str]) – The key-val pairs in the dict are considered equivalent (non-directional). This is used for evaluation. The default dict defined in EVALB is: {'ADVP': 'PRT'}. Default: None.
- Returns
The sequence of the factorized tree.
Examples
>>> tree = nltk.Tree.fromstring('''
...     (TOP
...       (S
...         (NP (_ She))
...         (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
...         (_ .)))
...     ''')
>>> Tree.factorize(tree)
[(0, 5, 'TOP'), (0, 5, 'S'), (0, 1, 'NP'), (1, 4, 'VP'), (2, 4, 'S'), (2, 4, 'VP'), (3, 4, 'NP')]
>>> Tree.factorize(tree, delete_labels={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''})
[(0, 5, 'S'), (0, 1, 'NP'), (1, 4, 'VP'), (2, 4, 'S'), (2, 4, 'VP'), (3, 4, 'NP')]
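The span sequence in the example can be reproduced with a short recursive pre-order traversal over nested tuples, where a pre-terminal is a `(tag, word)` pair. `factorize_sketch` is a hypothetical stand-in, not the library's code. As a simplification it maps each `equal_labels` key to its value, whereas the description above treats the pairs as non-directional.

```python
def factorize_sketch(tree, delete_labels=None, equal_labels=None):
    """Return (start, end, label) spans of a nested-tuple tree in pre-order."""
    delete_labels = delete_labels or set()
    equal_labels = equal_labels or {}

    def visit(node, start):
        label = node[0]
        if len(node) == 2 and isinstance(node[1], str):
            # Pre-terminal: a deleted tag removes the word, so the index stays put.
            return (start if label in delete_labels else start + 1), []
        end, inner = start, []
        for child in node[1:]:
            end, child_spans = visit(child, end)
            inner.extend(child_spans)
        # A deleted or empty non-terminal loses its brackets but keeps its children.
        if label in delete_labels or end == start:
            return end, inner
        return end, [(start, end, equal_labels.get(label, label))] + inner

    return visit(tree, 0)[1]


tree = ("TOP", ("S", ("NP", ("_", "She")),
                ("VP", ("_", "enjoys"),
                 ("S", ("VP", ("_", "playing"), ("NP", ("_", "tennis"))))),
                ("_", ".")))
print(factorize_sketch(tree))
print(factorize_sketch(tree, delete_labels={"TOP"}))
```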
- classmethod build(tree, sequence)[source]¶
Builds a constituency tree from the sequence. The sequence is generated in pre-order. While the tree is being built, the sequence is de-binarized to the original format (i.e., the suffixes |<> are ignored and the collapsed labels are recovered).
- Parameters
- Returns
The resulting constituency tree.
Examples
>>> tree = Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP')
>>> sequence = [(0, 5, 'S'), (0, 4, 'S|<>'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP|<>'),
...             (2, 4, 'S::VP'), (2, 3, 'VP|<>'), (3, 4, 'NP'), (4, 5, 'S|<>')]
>>> print(Tree.build(tree, sequence))
(TOP
  (S
    (NP (_ She))
    (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
    (_ .)))
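The de-binarization described above can be sketched on nested tuples. Because the sequence is in pre-order and binarized, each span wider than one token is followed by exactly two sub-spans. Auxiliary `|<>` nodes are spliced into their parent and `::` labels are expanded back into unary chains. `build_sketch` is a hypothetical helper that takes a word list instead of an `nltk.tree.Tree`. It assumes every span has a string label.

```python
def build_sketch(words, sequence, root="TOP"):
    """Rebuild a nested-tuple tree from pre-order spans of a binarized tree."""
    leaves = [("_", word) for word in words]
    spans = iter(sequence)

    def track():
        i, j, label = next(spans)
        # A span of width one covers a single leaf; wider spans have two sub-spans.
        children = [leaves[i]] if j == i + 1 else track() + track()
        if label.endswith("|<>"):
            return children  # auxiliary node: splice children into the parent
        *outer, innermost = label.split("::")
        node = (innermost, *children)
        for name in reversed(outer):
            node = (name, node)  # restore each collapsed unary level
        return [node]

    return (root, *track())


def to_string(node):
    if len(node) == 2 and isinstance(node[1], str):
        return f"({node[0]} {node[1]})"
    return f"({node[0]} {' '.join(to_string(child) for child in node[1:])})"


sequence = [(0, 5, 'S'), (0, 4, 'S|<>'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP|<>'),
            (2, 4, 'S::VP'), (2, 3, 'VP|<>'), (3, 4, 'NP'), (4, 5, 'S|<>')]
built = build_sketch(['She', 'enjoys', 'playing', 'tennis', '.'], sequence)
print(to_string(built))
```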