Transform#
CoNLL#
- class supar.utils.transform.CoNLL(ID: Optional[Union[Field, Iterable[Field]]] = None, FORM: Optional[Union[Field, Iterable[Field]]] = None, LEMMA: Optional[Union[Field, Iterable[Field]]] = None, CPOS: Optional[Union[Field, Iterable[Field]]] = None, POS: Optional[Union[Field, Iterable[Field]]] = None, FEATS: Optional[Union[Field, Iterable[Field]]] = None, HEAD: Optional[Union[Field, Iterable[Field]]] = None, DEPREL: Optional[Union[Field, Iterable[Field]]] = None, PHEAD: Optional[Union[Field, Iterable[Field]]] = None, PDEPREL: Optional[Union[Field, Iterable[Field]]] = None)[source]#
A CoNLL object holds the ten fields required by the CoNLL-X data format (Buchholz & Marsi, 2006). Each field can be bound to one or more Field objects. For example, FORM can contain both Field and SubwordField to produce tensors for words and subwords.
- ID#
Token counter, starting at 1.
- FORM#
Words in the sentence.
- LEMMA#
Lemmas or stems (depending on the particular treebank) of words, or underscores if not available.
- CPOS#
Coarse-grained part-of-speech tags, where the tagset depends on the treebank.
- POS#
Fine-grained part-of-speech tags, where the tagset depends on the treebank.
- FEATS#
Unordered set of syntactic and/or morphological features (depending on the particular treebank), or underscores if not available.
- HEAD#
Heads of the tokens, which are either values of ID or zeros.
- DEPREL#
Dependency relations to the HEAD.
- PHEAD#
Projective heads of tokens, which are either values of ID or zeros, or underscores if not available.
- PDEPREL#
Dependency relations to the PHEAD, or underscores if not available.
- classmethod toconll(tokens: List[Union[str, Tuple]]) str [source]#
Converts a list of tokens to a string in CoNLL-X format with missing fields filled with underscores.
- Parameters
tokens (List[Union[str, Tuple]]) – This can be either a list of words, word/pos pairs or word/lemma/pos triples.
- Returns
A string in CoNLL-X format.
Examples
>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1       She     _       _       _       _       _       _       _       _
2       enjoys  _       _       _       _       _       _       _       _
3       playing _       _       _       _       _       _       _       _
4       tennis  _       _       _       _       _       _       _       _
5       .       _       _       _       _       _       _       _       _

>>> print(CoNLL.toconll([('She',     'she',    'PRP'),
...                      ('enjoys',  'enjoy',  'VBZ'),
...                      ('playing', 'play',   'VBG'),
...                      ('tennis',  'tennis', 'NN'),
...                      ('.',       '_',      '.')]))
1       She     she     PRP     _       _       _       _       _       _
2       enjoys  enjoy   VBZ     _       _       _       _       _       _
3       playing play    VBG     _       _       _       _       _       _
4       tennis  tennis  NN      _       _       _       _       _       _
5       .       _       .       _       _       _       _       _       _
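The conversion shown above can be approximated in a few lines of plain Python. This is a minimal sketch, not the library's implementation: `to_conll_sketch` is a hypothetical name, and placing the tag of a word/pos pair in the fourth column is an assumption carried over from the triple example.

```python
from typing import List, Tuple, Union


def to_conll_sketch(tokens: List[Union[str, Tuple[str, ...]]]) -> str:
    """Hypothetical helper: format tokens as ten tab-separated CoNLL-X columns."""
    rows = []
    for i, token in enumerate(tokens, 1):
        if isinstance(token, str):      # bare word
            word, lemma, tag = token, "_", "_"
        elif len(token) == 2:           # (word, pos) pair, assumed layout
            (word, tag), lemma = token, "_"
        else:                           # (word, lemma, pos) triple
            word, lemma, tag = token
        # ID, FORM, LEMMA, CPOS, then six unfilled columns
        rows.append("\t".join([str(i), word, lemma, tag] + ["_"] * 6))
    return "\n".join(rows)


plain = to_conll_sketch(["She", "."])
triples = to_conll_sketch([("She", "she", "PRP"), (".", "_", ".")])
```

Each output line has exactly ten columns, so the result can be fed back to any reader that expects CoNLL-X input.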
- classmethod isprojective(sequence: List[int]) bool [source]#
Checks if a dependency tree is projective. This also works for partial annotation.
Besides the obvious crossing arcs, the examples below illustrate two non-projective cases which are hard to detect in the scenario of partial annotation.
- Parameters
sequence (List[int]) – A list of head indices.
- Returns
True if the tree is projective, False otherwise.
Examples
>>> CoNLL.isprojective([2, -1, 1])  # -1 denotes un-annotated cases
False
>>> CoNLL.isprojective([3, -1, 2])
False
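One way to see why both examples return False is a pairwise check over the annotated arcs. The sketch below is an illustration under assumptions, not necessarily the library's code: `is_projective_sketch` is a hypothetical name, and negative heads are simply skipped as un-annotated.

```python
from typing import List


def is_projective_sketch(heads: List[int]) -> bool:
    """Hypothetical pairwise projectivity check that tolerates partial annotation."""
    # keep only annotated tokens as (head, dependent) pairs; positions are 1-based
    arcs = [(head, dep) for dep, head in enumerate(heads, 1) if head >= 0]
    for i, (hi, di) in enumerate(arcs):
        for hj, dj in arcs[i + 1:]:
            (li, ri), (lj, rj) = sorted((hi, di)), sorted((hj, dj))
            # the head of arc i is the dependent of arc j, yet j's head sits
            # inside the span of arc i: an ancestor is covered by its descendant's arc
            if li <= hj <= ri and hi == dj:
                return False
            # the symmetric case with the roles of i and j swapped
            if lj <= hi <= rj and hj == di:
                return False
            # plain crossing: the spans overlap without one containing the other
            if (li < lj < ri or li < rj < ri) and (li - lj) * (ri - rj) > 0:
                return False
    return True


first = is_projective_sketch([2, -1, 1])
second = is_projective_sketch([3, -1, 2])
```

In both documented cases no two arcs visibly cross; the conflict is that a token's head lies inside the span of an arc leaving that token.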
- classmethod istree(sequence: List[int], proj: bool = False, multiroot: bool = False) bool [source]#
Checks if the arcs form a valid dependency tree.
- Parameters
- Returns
True if the arcs form a valid tree, False otherwise.
Examples
>>> CoNLL.istree([3, 0, 0, 3], multiroot=True)
True
>>> CoNLL.istree([3, 0, 0, 3], proj=True)
False
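The two examples can be reproduced with a small stand-alone check. This is a sketch under assumptions: `is_tree_sketch` and `has_crossing_arcs` are hypothetical names, all heads are assumed to be valid indices, and the projectivity test is a simple crossing-arc test that treats 0 as an artificial root position.

```python
from typing import List


def has_crossing_arcs(heads: List[int]) -> bool:
    """True if any two arcs cross; position 0 is the artificial root."""
    arcs = [tuple(sorted((head, dep))) for dep, head in enumerate(heads, 1)]
    return any(l1 < l2 < r1 < r2 for l1, r1 in arcs for l2, r2 in arcs)


def is_tree_sketch(heads: List[int], proj: bool = False, multiroot: bool = False) -> bool:
    """Hypothetical validity check: root count, no cycles, optional projectivity."""
    n_roots = sum(head == 0 for head in heads)
    if n_roots == 0 or (n_roots > 1 and not multiroot):
        return False
    # every token must reach the root by following heads; a walk longer
    # than the sentence means the token is trapped in a cycle
    for dep in range(1, len(heads) + 1):
        node, steps = dep, 0
        while node != 0:
            node, steps = heads[node - 1], steps + 1
            if steps > len(heads):
                return False
    return not (proj and has_crossing_arcs(heads))


multi = is_tree_sketch([3, 0, 0, 3], multiroot=True)
single = is_tree_sketch([3, 0, 0, 3], proj=True)
```

The second call fails on the root count alone, since two tokens have head 0 and `multiroot` is left at its default.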
- load(data: Union[str, Iterable], lang: Optional[str] = None, proj: bool = False, **kwargs) Iterable[supar.utils.transform.CoNLLSentence] [source]#
Loads the data in CoNLL-X format. Also supports loading data from CoNLL-U files with comments and non-integer IDs.
- Parameters
- Returns
A list of CoNLLSentence instances.
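To illustrate what tolerating CoNLL-U comments and non-integer IDs can mean in practice, here is a minimal reader sketch. It is not the library's loader: `read_conll_sketch` is a hypothetical name, and dropping comment lines and rows such as `1-2` or `3.1` entirely is an assumption made for brevity.

```python
from typing import List


def read_conll_sketch(text: str) -> List[List[List[str]]]:
    """Hypothetical reader: split CoNLL-X/CoNLL-U text into sentences of column lists."""
    sentences, rows = [], []
    for line in text.splitlines() + [""]:   # trailing "" flushes the last sentence
        line = line.strip()
        if not line:                        # blank line ends a sentence
            if rows:
                sentences.append(rows)
                rows = []
        elif line.startswith("#"):          # CoNLL-U comment line
            continue
        else:
            columns = line.split("\t")
            if columns[0].isdigit():        # skip ranges (1-2) and empty nodes (3.1)
                rows.append(columns)
    return sentences


sample = (
    "# text = She left\n"
    "1\tShe\t_\t_\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tleft\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tdo\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "2\tn't\t_\t_\t_\t_\t1\tadvmod\t_\t_\n"
)
forms = [[row[1] for row in sentence] for sentence in read_conll_sketch(sample)]
```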
Tree#
- class supar.utils.transform.Tree(WORD: Optional[Union[Field, Iterable[Field]]] = None, POS: Optional[Union[Field, Iterable[Field]]] = None, TREE: Optional[Union[Field, Iterable[Field]]] = None, CHART: Optional[Union[Field, Iterable[Field]]] = None)[source]#
A Tree object factorizes a constituency tree into four fields, each associated with one or more Field objects.
- WORD#
Words in the sentence.
- POS#
Part-of-speech tags, or underscores if not available.
- TREE#
The raw constituency tree in nltk.tree.Tree format.
- CHART#
The factorized sequence of binarized tree traversed in post-order.
- classmethod totree(tokens: List[Union[str, Tuple]], root: str = '', normalize: Dict[str, str] = {'(': '-LRB-', ')': '-RRB-'}) nltk.tree.tree.Tree [source]#
Converts a list of tokens to an nltk.tree.Tree. Missing fields are filled with underscores.
- Parameters
- Returns
An nltk.tree.Tree object.
Examples
>>> Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP').pretty_print()
             TOP
  ____________|____________
 |     |      |       |    |
 _     _      _       _    _
 |     |      |       |    |
She enjoys playing tennis  .

>>> Tree.totree(['(', 'If', 'You', 'Let', 'It', ')'], 'TOP').pretty_print()
          TOP
   ________|____________
  |    |   |   |   |    |
  _    _   _   _   _    _
  |    |   |   |   |    |
-LRB-  If You Let It  -RRB-
- classmethod binarize(tree: nltk.tree.tree.Tree, left: bool = True, mark: str = '*', join: str = '::', implicit: bool = False) nltk.tree.tree.Tree [source]#
Conducts binarization over the tree.
First, the tree is transformed to satisfy Chomsky Normal Form (CNF). Here we call chomsky_normal_form() to conduct left-binarization. Second, all unary productions in the tree are collapsed.
- Parameters
tree (nltk.tree.Tree) – The tree to be binarized.
left (bool) – If True, left-binarization is conducted. Default: True.
mark (str) – A string used to mark newly inserted nodes, working if performing explicit binarization. Default: '*'.
join (str) – A string used to connect collapsed node labels. Default: '::'.
implicit (bool) – If True, performs implicit binarization. Default: False.
- Returns
The binarized tree.
Examples
>>> import nltk
>>> from supar.utils import Tree
>>> tree = nltk.Tree.fromstring('''
...     (TOP
...       (S
...         (NP (_ She))
...         (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
...         (_ .)))
...     ''')
>>> tree.pretty_print()
             TOP
              |
              S
  ____________|________________
 |            VP               |
 |     _______|_____           |
 |    |             S          |
 |    |             |          |
 |    |             VP         |
 |    |        _____|____      |
 NP   |       |          NP    |
 |    |       |          |     |
 _    _       _          _     _
 |    |       |          |     |
She enjoys playing     tennis  .

>>> Tree.binarize(tree).pretty_print()
                 TOP
                  |
                  S
             _____|__________________
            S*                       |
  __________|_____                   |
 |                VP                 |
 |     ___________|______            |
 |    |                S::VP         |
 |    |            ______|_____      |
 NP  VP*         VP*           NP    S*
 |    |           |            |     |
 _    _           _            _     _
 |    |           |            |     |
She enjoys     playing      tennis   .

>>> Tree.binarize(tree, implicit=True).pretty_print()
                 TOP
                  |
                  S
             _____|__________________
                                     |
  __________|_____                   |
 |                VP                 |
 |     ___________|______            |
 |    |                S::VP         |
 |    |            ______|_____      |
 NP                            NP
 |    |           |            |     |
 _    _           _            _     _
 |    |           |            |     |
She enjoys     playing      tennis   .

>>> Tree.binarize(tree, left=False).pretty_print()
             TOP
              |
              S
  ____________|______
 |                   S*
 |             ______|___________
 |            VP                 |
 |     _______|______            |
 |    |            S::VP         |
 |    |        ______|_____      |
 NP  VP*     VP*           NP    S*
 |    |       |            |     |
 _    _       _            _     _
 |    |       |            |     |
She enjoys playing      tennis   .
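The two steps named above, inserting marked nodes and collapsing unary productions, can be sketched on plain nested tuples without nltk. This is a simplified illustration, not the library's implementation: `binarize_sketch` is a hypothetical name, pre-terminals are written as `(tag, word)` and other nodes as `(label, [children])`, only left-binarization is covered, and the marked nodes that the documented output places above pre-terminals are not reproduced.

```python
def binarize_sketch(node, mark="*", join="::"):
    """Hypothetical left-binarization with unary collapse on nested tuples."""
    label, body = node
    if isinstance(body, str):               # pre-terminal (tag, word): leave as is
        return node
    # collapse a unary chain of non-terminals, e.g. S -> VP becomes S::VP
    while len(body) == 1 and isinstance(body[0][1], list):
        child_label, body = body[0]
        label = f"{label}{join}{child_label}"
    children = [binarize_sketch(child, mark, join) for child in body]
    # left-binarization: repeatedly group the two leftmost children
    # under a new node whose label carries the mark
    while len(children) > 2:
        children = [(f"{label}{mark}", children[:2])] + children[2:]
    return (label, children)


tree = ("S", [
    ("NP", [("_", "She")]),
    ("VP", [("_", "enjoys"),
            ("S", [("VP", [("_", "playing"), ("NP", [("_", "tennis")])])])]),
    ("_", "."),
])
binarized = binarize_sketch(tree)
```

The result groups `NP` and `VP` under a new `S*` node and merges the inner `S` and `VP` into `S::VP`, mirroring the shape of the first binarized tree above.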
- classmethod factorize(tree: nltk.tree.tree.Tree, delete_labels: Optional[Set[str]] = None, equal_labels: Optional[Dict[str, str]] = None) List[Tuple] [source]#
Factorizes the tree into a sequence traversed in post-order.
- Parameters
tree (nltk.tree.Tree) – The tree to be factorized.
delete_labels (Optional[Set[str]]) – A set of labels to be ignored. This is used for evaluation. If it is a pre-terminal label, delete the word along with the brackets. If it is a non-terminal label, just delete the brackets (don't delete children). In EVALB, the default set is: {'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''}. Default: None.
equal_labels (Optional[Dict[str, str]]) – The key-val pairs in the dict are considered equivalent (non-directional). This is used for evaluation. The default dict defined in EVALB is: {'ADVP': 'PRT'}. Default: None.
- Returns
The sequence of the factorized tree.
Examples
>>> import nltk
>>> from supar.utils import Tree
>>> tree = nltk.Tree.fromstring('''
...     (TOP
...       (S
...         (NP (_ She))
...         (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
...         (_ .)))
...     ''')
>>> Tree.factorize(tree)
[(0, 1, 'NP'), (3, 4, 'NP'), (2, 4, 'VP'), (2, 4, 'S'), (1, 4, 'VP'), (0, 5, 'S'), (0, 5, 'TOP')]
>>> Tree.factorize(tree, delete_labels={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''})
[(0, 1, 'NP'), (3, 4, 'NP'), (2, 4, 'VP'), (2, 4, 'S'), (1, 4, 'VP'), (0, 5, 'S')]
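The post-order factorization can be illustrated with a short recursive traversal over nested tuples. This is a sketch under assumptions: `factorize_sketch` is a hypothetical name, pre-terminals are written as `(tag, word)` and other nodes as `(label, [children])`, and `delete_labels` only drops non-terminal spans here (deleting pre-terminals together with their words and the `equal_labels` mapping are left out).

```python
def factorize_sketch(node, start=0, delete_labels=None):
    """Hypothetical post-order span extraction; returns (end, spans)."""
    label, body = node
    if isinstance(body, str):           # pre-terminal covers exactly one word
        return start + 1, []
    end, spans = start, []
    for child in body:                  # children first: post-order
        end, child_spans = factorize_sketch(child, end, delete_labels)
        spans.extend(child_spans)
    if not delete_labels or label not in delete_labels:
        spans.append((start, end, label))
    return end, spans


tree = ("TOP", [("S", [
    ("NP", [("_", "She")]),
    ("VP", [("_", "enjoys"),
            ("S", [("VP", [("_", "playing"), ("NP", [("_", "tennis")])])])]),
    ("_", "."),
])])
_, spans = factorize_sketch(tree)
_, spans_without_top = factorize_sketch(tree, delete_labels={"TOP"})
```

On this input the sketch yields the same two sequences as the documented calls, because none of the `_` pre-terminals is in the delete set.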
- classmethod build(tree: nltk.tree.tree.Tree, sequence: List[Tuple], delete_labels: Optional[Set[str]] = None, mark: Union[str, Tuple[str]] = ('*', '|<>'), join: str = '::', postorder: bool = True) nltk.tree.tree.Tree [source]#
Builds a constituency tree from the sequence generated in post-order. During building, the sequence is recovered to the original format, i.e., de-binarized.
- Parameters
tree (nltk.tree.Tree) – An empty tree that provides a base for building a result tree.
sequence (List[Tuple]) – A list of tuples used for generating a tree. Each tuple consists of the indices of the left/right boundaries and the label of the constituent.
delete_labels (Optional[Set[str]]) – A set of labels to be ignored. Default: None.
mark (Union[str, List[str]]) – A string used to mark newly inserted nodes. Non-terminals containing this will be removed. Default: ('*', '|<>').
join (str) – A string used to connect collapsed node labels. Non-terminals containing this will be expanded to unary chains. Default: '::'.
postorder (bool) – If True, enforces that the sequence is sorted in post-order. Default: True.
- Returns
A result constituency tree.
Examples
>>> from supar.utils import Tree
>>> tree = Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP')
>>> Tree.build(tree,
...            [(0, 5, 'S'), (0, 4, 'S*'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP*'),
...             (2, 4, 'S::VP'), (2, 3, 'VP*'), (3, 4, 'NP'), (4, 5, 'S*')]).pretty_print()
             TOP
              |
              S
  ____________|________________
 |            VP               |
 |     _______|_____           |
 |    |             S          |
 |    |             |          |
 |    |             VP         |
 |    |        _____|____      |
 NP   |       |          NP    |
 |    |       |          |     |
 _    _       _          _     _
 |    |       |          |     |
She enjoys playing     tennis  .

>>> Tree.build(tree,
...            [(0, 1, 'NP'), (3, 4, 'NP'), (2, 4, 'VP'), (2, 4, 'S'), (1, 4, 'VP'), (0, 5, 'S')]).pretty_print()
             TOP
              |
              S
  ____________|________________
 |            VP               |
 |     _______|_____           |
 |    |             S          |
 |    |             |          |
 |    |             VP         |
 |    |        _____|____      |
 NP   |       |          NP    |
 |    |       |          |     |
 _    _       _          _     _
 |    |       |          |     |
She enjoys playing     tennis  .
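The de-binarization that relates the two input sequences above can be sketched at the level of spans alone. This is an illustration, not the library's code: `debinarize_spans_sketch` is a hypothetical name, and re-sorting by right boundary and then by widest-last is an assumption that holds for properly nested spans.

```python
def debinarize_spans_sketch(sequence, marks=("*", "|<>"), join="::"):
    """Hypothetical span-level de-binarization followed by a post-order sort."""
    spans = []
    for left, right, label in sequence:
        if any(mark in label for mark in marks):    # node inserted by binarization
            continue
        # 'S::VP' stands for S over VP; the inner label comes first in post-order
        for part in reversed(label.split(join)):
            spans.append((left, right, part))
    # children end no later than their parents, and wider spans come last;
    # the sort is stable, so expanded unary chains keep their inner-first order
    return sorted(spans, key=lambda span: (span[1], -span[0]))


binarized = [(0, 5, 'S'), (0, 4, 'S*'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP*'),
             (2, 4, 'S::VP'), (2, 3, 'VP*'), (3, 4, 'NP'), (4, 5, 'S*')]
recovered = debinarize_spans_sketch(binarized)
```

Applied to the first sequence, the sketch returns the second one, which is why both `build` calls print the same tree.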
AttachJuxtaposeTree#
- class supar.utils.transform.AttachJuxtaposeTree(WORD: Optional[Union[Field, Iterable[Field]]] = None, POS: Optional[Union[Field, Iterable[Field]]] = None, TREE: Optional[Union[Field, Iterable[Field]]] = None, NODE: Optional[Union[Field, Iterable[Field]]] = None, PARENT: Optional[Union[Field, Iterable[Field]]] = None, NEW: Optional[Union[Field, Iterable[Field]]] = None)[source]#
AttachJuxtaposeTree is derived from the Tree class, supporting back-and-forth transformations between trees and AttachJuxtapose actions (Yang & Deng, 2020).
- WORD#
Words in the sentence.
- POS#
Part-of-speech tags, or underscores if not available.
- TREE#
The raw constituency tree in nltk.tree.Tree format.
- NODE#
The target node on each rightmost chain.
- PARENT#
The label of the parent node of each terminal.
- NEW#
The label of each newly inserted non-terminal with a target node and a terminal as juxtaposed children. NUL represents the Attach action.
- classmethod tree2action(tree: nltk.tree.tree.Tree)[source]#
Converts a constituency tree into AttachJuxtapose actions.
- Parameters
tree (nltk.tree.Tree) – A constituency tree in nltk.tree.Tree format.
- Returns
A sequence of AttachJuxtapose actions.
Examples
>>> import nltk
>>> from supar.utils import AttachJuxtaposeTree
>>> tree = nltk.Tree.fromstring('''
...     (TOP
...       (S
...         (NP (_ Arthur))
...         (VP
...           (_ is)
...           (NP (NP (_ King)) (PP (_ of) (NP (_ the) (_ Britons)))))
...         (_ .)))
...     ''')
>>> tree.pretty_print()
                 TOP
                  |
                  S
    ______________|_______________________
   |              VP                      |
   |      ________|___                    |
   |     |            NP                  |
   |     |    ________|___                |
   |     |   |            PP              |
   |     |   |     _______|___            |
   NP    |   NP   |           NP          |
   |     |   |    |        ___|_____      |
   _     _   _    _       _         _     _
   |     |   |    |       |         |     |
Arthur  is  King of      the     Britons  .
>>> AttachJuxtaposeTree.tree2action(tree)
[(0, 'NP', '<nul>'), (0, 'VP', 'S'), (1, 'NP', '<nul>'), (2, 'PP', 'NP'), (3, 'NP', '<nul>'), (4, '<nul>', '<nul>'), (0, '<nul>', '<nul>')]
- classmethod action2tree(tree: nltk.tree.tree.Tree, actions: List[Tuple[int, str, str]], join: str = '::') nltk.tree.tree.Tree [source]#
Recovers a constituency tree from a sequence of AttachJuxtapose actions.
- Parameters
tree (nltk.tree.Tree) – An empty tree that provides a base for building a result tree.
actions (List[Tuple[int, str, str]]) – A sequence of AttachJuxtapose actions.
join (str) – A string used to connect collapsed node labels. Non-terminals containing this will be expanded to unary chains. Default: '::'.
- Returns
A result constituency tree.
Examples
>>> from supar.utils import AttachJuxtaposeTree
>>> tree = AttachJuxtaposeTree.totree(['Arthur', 'is', 'King', 'of', 'the', 'Britons', '.'], 'TOP')
>>> AttachJuxtaposeTree.action2tree(tree,
...                                 [(0, 'NP', '<nul>'), (0, 'VP', 'S'), (1, 'NP', '<nul>'),
...                                  (2, 'PP', 'NP'), (3, 'NP', '<nul>'), (4, '<nul>', '<nul>'),
...                                  (0, '<nul>', '<nul>')]).pretty_print()
                 TOP
                  |
                  S
    ______________|_______________________
   |              VP                      |
   |      ________|___                    |
   |     |            NP                  |
   |     |    ________|___                |
   |     |   |            PP              |
   |     |   |     _______|___            |
   NP    |   NP   |           NP          |
   |     |   |    |        ___|_____      |
   _     _   _    _       _         _     _
   |     |   |    |       |         |     |
Arthur  is  King of      the     Britons  .
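How the action triples rebuild the tree can be followed step by step in plain Python. This is a sketch under assumptions, not the library's implementation: `actions_to_tree_sketch` is a hypothetical name, nodes are nested lists of the form `[label, child, ...]` with bare words as leaves, the `_` pre-terminals and the `TOP` root are left out, and the first action is assumed to carry a parent label.

```python
NUL = "<nul>"


def actions_to_tree_sketch(words, actions):
    """Hypothetical replay of (target, parent, new) actions, one per word."""
    root = None
    for word, (target, parent, new) in zip(words, actions):
        # the incoming word, optionally wrapped in its parent label
        subtree = word if parent == NUL else [parent, word]
        if root is None:
            root = subtree
            continue
        # walk `target` steps down the rightmost chain, remembering the holder
        holder, node = None, root
        for _ in range(target):
            holder, node = node, node[-1]
        if new == NUL:
            node.append(subtree)                # Attach: new rightmost child
        else:
            juxtaposed = [new, node, subtree]   # Juxtapose: new shared parent
            if holder is None:
                root = juxtaposed
            else:
                holder[-1] = juxtaposed
    return root


words = ["Arthur", "is", "King", "of", "the", "Britons", "."]
actions = [(0, "NP", NUL), (0, "VP", "S"), (1, "NP", NUL), (2, "PP", "NP"),
           (3, "NP", NUL), (4, NUL, NUL), (0, NUL, NUL)]
rebuilt = actions_to_tree_sketch(words, actions)
```

The second action is the only one that juxtaposes at the root, which is how `S` comes to dominate the initial `NP`; the last action attaches the period directly to `S`.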
- classmethod action2span(action: torch.Tensor, spans: Optional[torch.Tensor] = None, nul_index: int = - 1, mask: Optional[torch.BoolTensor] = None) torch.Tensor [source]#
Converts a batch of the tensorized action at a given step into spans.
- Parameters
action (Tensor) – [3, batch_size]. A batch of the tensorized action at a given step, containing indices of target nodes, parent and new labels.
spans (Tensor) – Spans generated at previous steps, None at the first step. Default: None.
nul_index (int) – The index for the NUL token, representing the Attach action. Default: -1.
mask (BoolTensor) – [batch_size]. The mask for covering the unpadded tokens.
- Returns
A tensor representing a batch of spans for the given step.
Examples
>>> import torch
>>> from collections import Counter
>>> from supar.utils import AttachJuxtaposeTree, Vocab
>>> from supar.utils.common import NUL
>>> nodes, parents, news = zip(*[(0, 'NP', NUL), (0, 'VP', 'S'), (1, 'NP', NUL),
...                              (2, 'PP', 'NP'), (3, 'NP', NUL), (4, NUL, NUL),
...                              (0, NUL, NUL)])
>>> vocab = Vocab(Counter(sorted(set([*parents, *news]))))
>>> actions = torch.tensor([nodes, vocab[parents], vocab[news]]).unsqueeze(1)
>>> spans = None
>>> for action in actions.unbind(-1):
...     spans = AttachJuxtaposeTree.action2span(action, spans, vocab[NUL])
...
>>> spans
tensor([[[-1,  1, -1, -1, -1, -1, -1,  3],
         [-1, -1, -1, -1, -1, -1,  4, -1],
         [-1, -1, -1,  1, -1, -1,  1, -1],
         [-1, -1, -1, -1, -1, -1,  2, -1],
         [-1, -1, -1, -1, -1, -1,  1, -1],
         [-1, -1, -1, -1, -1, -1, -1, -1],
         [-1, -1, -1, -1, -1, -1, -1, -1],
         [-1, -1, -1, -1, -1, -1, -1, -1]]])
>>> sequence = torch.where(spans.ge(0))
>>> sequence = list(zip(sequence[1].tolist(), sequence[2].tolist(), vocab[spans[sequence]]))
>>> sequence
[(0, 1, 'NP'), (0, 7, 'S'), (1, 6, 'VP'), (2, 3, 'NP'), (2, 6, 'NP'), (3, 6, 'PP'), (4, 6, 'NP')]
>>> tree = AttachJuxtaposeTree.totree(['Arthur', 'is', 'King', 'of', 'the', 'Britons', '.'], 'TOP')
>>> AttachJuxtaposeTree.build(tree, sequence).pretty_print()
                 TOP
                  |
                  S
    ______________|_______________________
   |              VP                      |
   |      ________|___                    |
   |     |            NP                  |
   |     |    ________|___                |
   |     |   |            PP              |
   |     |   |     _______|___            |
   NP    |   NP   |           NP          |
   |     |   |    |        ___|_____      |
   _     _   _    _       _         _     _
   |     |   |    |       |         |     |
Arthur  is  King of      the     Britons  .