Transform#

class supar.utils.transform.Transform[source]#

A Transform object corresponds to a specific data format and holds several data fields that provide instructions for preprocessing, numericalization, and so on.

training#

Sets the object in training mode. If False, some data fields not required for predictions won’t be returned. Default: True.

Type

bool

CoNLL#

class supar.utils.transform.CoNLL(ID: Optional[Union[Field, Iterable[Field]]] = None, FORM: Optional[Union[Field, Iterable[Field]]] = None, LEMMA: Optional[Union[Field, Iterable[Field]]] = None, CPOS: Optional[Union[Field, Iterable[Field]]] = None, POS: Optional[Union[Field, Iterable[Field]]] = None, FEATS: Optional[Union[Field, Iterable[Field]]] = None, HEAD: Optional[Union[Field, Iterable[Field]]] = None, DEPREL: Optional[Union[Field, Iterable[Field]]] = None, PHEAD: Optional[Union[Field, Iterable[Field]]] = None, PDEPREL: Optional[Union[Field, Iterable[Field]]] = None)[source]#

A CoNLL object holds the ten fields required by the CoNLL-X data format (Buchholz & Marsi, 2006). Each field can be bound to one or more Field objects. For example, FORM can contain both Field and SubwordField to produce tensors for words and subwords.

ID#

Token counter, starting at 1.

FORM#

Words in the sentence.

LEMMA#

Lemmas or stems (depending on the particular treebank) of words, or underscores if not available.

CPOS#

Coarse-grained part-of-speech tags, where the tagset depends on the treebank.

POS#

Fine-grained part-of-speech tags, where the tagset depends on the treebank.

FEATS#

Unordered set of syntactic and/or morphological features (depending on the particular treebank), or underscores if not available.

HEAD#

Heads of the tokens, which are either values of ID or zeros.

DEPREL#

Dependency relations to the HEAD.

PHEAD#

Projective heads of tokens, which are either values of ID or zeros, or underscores if not available.

PDEPREL#

Dependency relations to the PHEAD, or underscores if not available.

classmethod toconll(tokens: List[Union[str, Tuple]]) str[source]#

Converts a list of tokens to a string in CoNLL-X format with missing fields filled with underscores.

Parameters

tokens (List[Union[str, Tuple]]) – This can be either a list of words, word/pos pairs or word/lemma/pos triples.

Returns

A string in CoNLL-X format.

Examples

>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1       She     _       _       _       _       _       _       _       _
2       enjoys  _       _       _       _       _       _       _       _
3       playing _       _       _       _       _       _       _       _
4       tennis  _       _       _       _       _       _       _       _
5       .       _       _       _       _       _       _       _       _
>>> print(CoNLL.toconll([('She',     'she',    'PRP'),
                         ('enjoys',  'enjoy',  'VBZ'),
                         ('playing', 'play',   'VBG'),
                         ('tennis',  'tennis', 'NN'),
                         ('.',       '_',      '.')]))
1       She     she     PRP     _       _       _       _       _       _
2       enjoys  enjoy   VBZ     _       _       _       _       _       _
3       playing play    VBG     _       _       _       _       _       _
4       tennis  tennis  NN      _       _       _       _       _       _
5       .       _       .       _       _       _       _       _       _
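The conversion above can be approximated in a few lines of plain Python. This is a minimal sketch, not supar's implementation: the helper name `to_conll` is hypothetical, and placing the tag of a word/pos pair in the fourth column is an assumption inferred from the triple example above.

```python
def to_conll(tokens):
    """Render words, word/pos pairs or word/lemma/pos triples as CoNLL-X rows."""
    rows = []
    for i, token in enumerate(tokens, 1):
        if isinstance(token, str):
            word, lemma, tag = token, '_', '_'
        elif len(token) == 2:
            # Assumption: the tag of a pair goes to the same column as in a triple.
            (word, tag), lemma = token, '_'
        else:
            word, lemma, tag = token
        # ID, FORM, LEMMA, CPOS, then six unavailable fields.
        rows.append('\t'.join([str(i), word, lemma, tag] + ['_'] * 6))
    return '\n'.join(rows)


print(to_conll([('She', 'she', 'PRP'), ('enjoys', 'enjoy', 'VBZ')]))
```

Each row always carries ten tab-separated columns, so downstream readers can rely on the column positions regardless of which input form was used.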
classmethod isprojective(sequence: List[int]) bool[source]#

Checks if a dependency tree is projective. This also works for partial annotation.

Besides the obvious crossing arcs, the examples below illustrate two non-projective cases which are hard to detect in the scenario of partial annotation.

Parameters

sequence (List[int]) – A list of head indices.

Returns

True if the tree is projective, False otherwise.

Examples

>>> CoNLL.isprojective([2, -1, 1])  # -1 denotes un-annotated cases
False
>>> CoNLL.isprojective([3, -1, 2])
False
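To make the two hard cases concrete, here is a minimal pure-Python sketch of a pairwise projectivity test that tolerates partial annotation. The function name `is_projective` is hypothetical and the code is an illustration of the idea, not necessarily how supar implements it. Besides crossing arcs, it rejects any arc that spans the head of its own head.

```python
def is_projective(heads):
    """Return True if the annotated arcs are projective; -1 marks missing heads."""
    # Keep only annotated arcs as (head, dependent) pairs; tokens are 1-indexed.
    arcs = [(h, d) for d, h in enumerate(heads, 1) if h >= 0]
    for i, (hi, di) in enumerate(arcs):
        for hj, dj in arcs[i + 1:]:
            (li, ri), (lj, rj) = sorted((hi, di)), sorted((hj, dj))
            # Arc i spans the head of its own head (arc j attaches i's head).
            if li <= hj <= ri and hi == dj:
                return False
            # Symmetric case: arc j spans the head of its own head.
            if lj <= hi <= rj and hj == di:
                return False
            # Ordinary crossing arcs: the spans overlap without nesting.
            if (li < lj < ri or li < rj < ri) and (li - lj) * (ri - rj) > 0:
                return False
    return True


print(is_projective([2, -1, 1]), is_projective([3, -1, 2]), is_projective([2, 0, 2]))
```

The pairwise loop is quadratic in the number of annotated arcs, which is usually acceptable for sentence-length inputs.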
classmethod istree(sequence: List[int], proj: bool = False, multiroot: bool = False) bool[source]#

Checks if the arcs form a valid dependency tree.

Parameters
  • sequence (List[int]) – A list of head indices.

  • proj (bool) – If True, requires the tree to be projective. Default: False.

  • multiroot (bool) – If False, requires the tree to contain only a single root. Default: False.

Returns

True if the arcs form a valid tree, False otherwise.

Examples

>>> CoNLL.istree([3, 0, 0, 3], multiroot=True)
True
>>> CoNLL.istree([3, 0, 0, 3], proj=True)
False
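The root and cycle conditions can be sketched without any dependencies. The helper `is_tree` below is hypothetical and deliberately omits the projectivity option; it only illustrates what "valid tree" means for a list of head indices, under the assumption that 0 denotes the root.

```python
def is_tree(heads, multiroot=False):
    """Check root count and acyclicity for a list of 1-indexed head indices."""
    n_roots = sum(head == 0 for head in heads)
    if n_roots == 0 or (n_roots > 1 and not multiroot):
        return False
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        # Follow head pointers upward; every token must reach the root 0.
        while node != 0:
            if node in seen or not 0 <= heads[node - 1] <= len(heads):
                return False  # cycle (including self-loops) or out-of-range head
            seen.add(node)
            node = heads[node - 1]
    return True


print(is_tree([3, 0, 0, 3], multiroot=True), is_tree([3, 0, 0, 3]))
```

Walking up from every token costs quadratic time in the worst case; a linear-time alternative would use a strongly-connected-components algorithm to find cycles.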
load(data: Union[str, Iterable], lang: Optional[str] = None, proj: bool = False, **kwargs) Iterable[supar.utils.transform.CoNLLSentence][source]#

Loads the data in CoNLL-X format. Loading data from CoNLL-U files with comments and non-integer IDs is also supported.

Parameters
  • data (Union[str, Iterable]) – A filename or a list of instances.

  • lang (str) – Language code (e.g., en) or language name (e.g., English) for the text to tokenize. None if tokenization is not required. Default: None.

  • proj (bool) – If True, discards all non-projective sentences. Default: False.

Returns

A list of CoNLLSentence instances.
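As an illustration of what tolerating CoNLL-U input involves, the following sketch reads tab-separated lines into sentences. The function `read_conll` is hypothetical, and it assumes that comment lines start with `#` and that rows with non-integer IDs (such as `1-2` or `1.1`) can simply be skipped; the library itself may handle these rows differently, for example by keeping them for later output.

```python
def read_conll(lines):
    """Group CoNLL-X/CoNLL-U rows into sentences, one list of columns per token."""
    sentences, tokens = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith('#'):
            continue  # CoNLL-U comment
        else:
            columns = line.split('\t')
            # Assumption: rows with IDs like "1-2" or "1.1" are ignored.
            if columns[0].isdigit():
                tokens.append(columns)
    if tokens:
        sentences.append(tokens)
    return sentences


sample = (
    "# text = vamonos\n"
    "1-2\tvamonos\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tvamos\tir\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tnos\tnosotros\tPRON\t_\t_\t1\tobj\t_\t_\n"
    "\n"
    "1\tHola\thola\tINTJ\t_\t_\t0\troot\t_\t_\n"
)
print([len(s) for s in read_conll(sample.splitlines())])
```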

Tree#

class supar.utils.transform.Tree(WORD: Optional[Union[Field, Iterable[Field]]] = None, POS: Optional[Union[Field, Iterable[Field]]] = None, TREE: Optional[Union[Field, Iterable[Field]]] = None, CHART: Optional[Union[Field, Iterable[Field]]] = None)[source]#

A Tree object factorizes a constituency tree into four fields, each associated with one or more Field objects.

WORD#

Words in the sentence.

POS#

Part-of-speech tags, or underscores if not available.

TREE#

The raw constituency tree in nltk.tree.Tree format.

CHART#

The factorized sequence of binarized tree traversed in post-order.

classmethod totree(tokens: List[Union[str, Tuple]], root: str = '', normalize: Dict[str, str] = {'(': '-LRB-', ')': '-RRB-'}) nltk.tree.tree.Tree[source]#

Converts a list of tokens to an nltk.tree.Tree. Missing fields are filled with underscores.

Parameters
  • tokens (List[Union[str, Tuple]]) – This can be either a list of words or word/pos pairs.

  • root (str) – The root label of the tree. Default: ''.

  • normalize (Dict) – Keys within the dict in each token will be replaced by the values. Default: {'(': '-LRB-', ')': '-RRB-'}.

Returns

An nltk.tree.Tree object.

Examples

>>> Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP').pretty_print()
             TOP
  ____________|____________
 |    |       |      |     |
 _    _       _      _     _
 |    |       |      |     |
She enjoys playing tennis  .

>>> Tree.totree(['(', 'If', 'You', 'Let', 'It', ')'], 'TOP').pretty_print()
          TOP
   ________|____________
  |    |   |   |   |    |
  _    _   _   _   _    _
  |    |   |   |   |    |
-LRB-  If You Let  It -RRB-
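The flat structure produced here can also be written as a bracketed string, which is sometimes easier to inspect than the drawing. The sketch below is a dependency-free illustration with a hypothetical helper `to_bracketed`; it assumes that normalization is a plain substring replacement, as the parameter description suggests.

```python
def to_bracketed(tokens, root='', normalize=None):
    """Build a flat bracketed tree string from words or word/pos pairs."""
    if normalize is None:
        normalize = {'(': '-LRB-', ')': '-RRB-'}
    leaves = []
    for token in tokens:
        # A bare word gets an underscore as its missing POS tag.
        word, pos = (token, '_') if isinstance(token, str) else token
        for old, new in normalize.items():
            word = word.replace(old, new)
        leaves.append(f'({pos} {word})')
    return f"({root} {' '.join(leaves)})"


print(to_bracketed(['(', 'If', 'You', 'Let', 'It', ')'], 'TOP'))
```

Replacing the brackets matters because raw `(` and `)` tokens would otherwise be indistinguishable from the tree's own bracketing.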

classmethod binarize(tree: nltk.tree.tree.Tree, left: bool = True, mark: str = '*', join: str = '::', implicit: bool = False) nltk.tree.tree.Tree[source]#

Conducts binarization over the tree.

First, the tree is transformed to satisfy Chomsky Normal Form (CNF). Here we call chomsky_normal_form() to conduct left-binarization. Second, all unary productions in the tree are collapsed.

Parameters
  • tree (nltk.tree.Tree) – The tree to be binarized.

  • left (bool) – If True, left-binarization is conducted. Default: True.

  • mark (str) – A string used to mark newly inserted nodes. Only used when performing explicit binarization. Default: '*'.

  • join (str) – A string used to connect collapsed node labels. Default: '::'.

  • implicit (bool) – If True, performs implicit binarization. Default: False.

Returns

The binarized tree.

Examples

>>> import nltk
>>> from supar.utils import Tree
>>> tree = nltk.Tree.fromstring('''
                                (TOP
                                  (S
                                    (NP (_ She))
                                    (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
                                    (_ .)))
                                ''')
>>> tree.pretty_print()
             TOP
              |
              S
  ____________|________________
 |            VP               |
 |     _______|_____           |
 |    |             S          |
 |    |             |          |
 |    |             VP         |
 |    |        _____|____      |
 NP   |       |          NP    |
 |    |       |          |     |
 _    _       _          _     _
 |    |       |          |     |
She enjoys playing     tennis  .
>>> Tree.binarize(tree).pretty_print()
                 TOP
                  |
                  S
             _____|__________________
            S*                       |
  __________|_____                   |
 |                VP                 |
 |     ___________|______            |
 |    |                S::VP         |
 |    |            ______|_____      |
 NP  VP*         VP*           NP    S*
 |    |           |            |     |
 _    _           _            _     _
 |    |           |            |     |
She enjoys     playing       tennis  .
>>> Tree.binarize(tree, implicit=True).pretty_print()
                 TOP
                  |
                  S
             _____|__________________
                                     |
  __________|_____                   |
 |                VP                 |
 |     ___________|______            |
 |    |                S::VP         |
 |    |            ______|_____      |
 NP                            NP
 |    |           |            |     |
 _    _           _            _     _
 |    |           |            |     |
She enjoys     playing       tennis  .
>>> Tree.binarize(tree, left=False).pretty_print()
             TOP
              |
              S
  ____________|______
 |                   S*
 |             ______|___________
 |            VP                 |
 |     _______|______            |
 |    |            S::VP         |
 |    |        ______|_____      |
 NP  VP*     VP*           NP    S*
 |    |       |            |     |
 _    _       _            _     _
 |    |       |            |     |
She enjoys playing       tennis  .
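The two steps, left-binarization followed by collapsing unary chains, can be sketched on nested lists of the form `[label, child, ...]` without nltk. This is a simplified illustration with hypothetical helper names: it keeps the root untouched as in the examples above, but it does not wrap POS-level children in marked nodes, so the `VP*` and `S*` nodes directly above `_` in the drawings are not reproduced.

```python
def is_preterminal(node):
    return len(node) == 2 and isinstance(node[1], str)


def left_binarize(node, mark='*'):
    """Fold children from the left until every node has at most two of them."""
    if is_preterminal(node):
        return node
    label, children = node[0], [left_binarize(c, mark) for c in node[1:]]
    new_label = label if label.endswith(mark) else label + mark
    while len(children) > 2:
        children = [[new_label, children[0], children[1]], *children[2:]]
    return [label, *children]


def collapse_unary(node, join='::'):
    """Merge chains of unary non-terminals into a single joined label."""
    if is_preterminal(node):
        return node
    label, children = node[0], node[1:]
    while len(children) == 1 and not is_preterminal(children[0]):
        label = label + join + children[0][0]
        children = children[0][1:]
    return [label, *[collapse_unary(c, join) for c in children]]


def binarize(tree, mark='*', join='::'):
    tree = left_binarize(tree, mark)
    # The root itself is left alone, so a chain such as TOP -> S survives.
    return [tree[0], *[collapse_unary(c, join) for c in tree[1:]]]


tree = ['TOP', ['S', ['NP', ['_', 'She']],
                     ['VP', ['_', 'enjoys'],
                            ['S', ['VP', ['_', 'playing'], ['NP', ['_', 'tennis']]]]],
                     ['_', '.']]]
binarized = binarize(tree)
print(binarized)
```

Binarizing before collapsing keeps the inserted `S*` node distinct from the collapsed `S::VP` label, which mirrors the order described above.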
classmethod factorize(tree: nltk.tree.tree.Tree, delete_labels: Optional[Set[str]] = None, equal_labels: Optional[Dict[str, str]] = None) List[Tuple][source]#

Factorizes the tree into a sequence traversed in post-order.

Parameters
  • tree (nltk.tree.Tree) – The tree to be factorized.

  • delete_labels (Optional[Set[str]]) – A set of labels to be ignored. This is used for evaluation. If it is a pre-terminal label, delete the word along with the brackets. If it is a non-terminal label, just delete the brackets (don’t delete children). In EVALB, the default set is {'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''}. Default: None.

  • equal_labels (Optional[Dict[str, str]]) – The key-val pairs in the dict are considered equivalent (non-directional). This is used for evaluation. The default dict defined in EVALB is {'ADVP': 'PRT'}. Default: None.

Returns

The sequence of the factorized tree.

Examples

>>> import nltk
>>> from supar.utils import Tree
>>> tree = nltk.Tree.fromstring('''
                                (TOP
                                  (S
                                    (NP (_ She))
                                    (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
                                    (_ .)))
                                ''')
>>> Tree.factorize(tree)
[(0, 1, 'NP'), (3, 4, 'NP'), (2, 4, 'VP'), (2, 4, 'S'), (1, 4, 'VP'), (0, 5, 'S'), (0, 5, 'TOP')]
>>> Tree.factorize(tree, delete_labels={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''})
[(0, 1, 'NP'), (3, 4, 'NP'), (2, 4, 'VP'), (2, 4, 'S'), (1, 4, 'VP'), (0, 5, 'S')]
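The post-order span extraction can be sketched on nested lists of the form `[label, child, ...]` without nltk. The helper below is a hypothetical illustration rather than the library's code; in particular, it treats `equal_labels` as a one-way mapping from key to value, which is one simple way to make the two labels compare as equal.

```python
def factorize(tree, delete_labels=None, equal_labels=None):
    """Return (left, right, label) spans of a nested-list tree in post-order."""
    delete_labels = delete_labels or set()
    equal_labels = equal_labels or {}

    def track(node, i):
        label = node[0]
        label = None if label in delete_labels else equal_labels.get(label, label)
        if len(node) == 2 and isinstance(node[1], str):
            # Pre-terminal: a deleted label removes the word, so i does not advance.
            return (i if label is None else i + 1), []
        j, spans = i, []
        for child in node[1:]:
            j, child_spans = track(child, j)
            spans += child_spans
        # Children first, then the node itself; empty spans are dropped.
        if label is not None and j > i:
            spans.append((i, j, label))
        return j, spans

    return track(tree, 0)[1]


tree = ['TOP', ['S', ['NP', ['_', 'She']],
                     ['VP', ['_', 'enjoys'],
                            ['S', ['VP', ['_', 'playing'], ['NP', ['_', 'tennis']]]]],
                     ['_', '.']]]
print(factorize(tree))
print(factorize(tree, delete_labels={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''}))
```

In this tree the final period still counts toward the span boundaries because its pre-terminal label is `_`, which is not in the deleted set.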
classmethod build(tree: nltk.tree.tree.Tree, sequence: List[Tuple], delete_labels: Optional[Set[str]] = None, mark: Union[str, Tuple[str]] = ('*', '|<>'), join: str = '::', postorder: bool = True) nltk.tree.tree.Tree[source]#

Builds a constituency tree from the sequence generated in post-order. During building, the sequence is de-binarized, i.e., restored to the original format.

Parameters
  • tree (nltk.tree.Tree) – An empty tree that provides a base for building a result tree.

  • sequence (List[Tuple]) – A list of tuples used for generating a tree. Each tuple consists of the indices of the left/right boundaries and the label of the constituent.

  • delete_labels (Optional[Set[str]]) – A set of labels to be ignored. Default: None.

  • mark (Union[str, Tuple[str]]) – A string or a tuple of strings used to mark newly inserted nodes. Non-terminals containing any of them will be removed. Default: ('*', '|<>').

  • join (str) – A string used to connect collapsed node labels. Non-terminals containing this will be expanded to unary chains. Default: '::'.

  • postorder (bool) – If True, enforces that the sequence is sorted in post-order. Default: True.

Returns

A result constituency tree.

Examples

>>> from supar.utils import Tree
>>> tree = Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP')
>>> Tree.build(tree,
               [(0, 5, 'S'), (0, 4, 'S*'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP*'),
                (2, 4, 'S::VP'), (2, 3, 'VP*'), (3, 4, 'NP'), (4, 5, 'S*')]).pretty_print()
             TOP
              |
              S
  ____________|________________
 |            VP               |
 |     _______|_____           |
 |    |             S          |
 |    |             |          |
 |    |             VP         |
 |    |        _____|____      |
 NP   |       |          NP    |
 |    |       |          |     |
 _    _       _          _     _
 |    |       |          |     |
She enjoys playing     tennis  .
>>> Tree.build(tree,
               [(0, 1, 'NP'), (3, 4, 'NP'), (2, 4, 'VP'), (2, 4, 'S'), (1, 4, 'VP'), (0, 5, 'S')]).pretty_print()
             TOP
              |
              S
  ____________|________________
 |            VP               |
 |     _______|_____           |
 |    |             S          |
 |    |             |          |
 |    |             VP         |
 |    |        _____|____      |
 NP   |       |          NP    |
 |    |       |          |     |
 _    _       _          _     _
 |    |       |          |     |
She enjoys playing     tennis  .
load(data: Union[str, Iterable], lang: Optional[str] = None, **kwargs) List[supar.utils.transform.TreeSentence][source]#
Parameters
  • data (Union[str, Iterable]) – A filename or a list of instances.

  • lang (str) – Language code (e.g., en) or language name (e.g., English) for the text to tokenize. None if tokenization is not required. Default: None.

Returns

A list of TreeSentence instances.

AttachJuxtaposeTree#

class supar.utils.transform.AttachJuxtaposeTree(WORD: Optional[Union[Field, Iterable[Field]]] = None, POS: Optional[Union[Field, Iterable[Field]]] = None, TREE: Optional[Union[Field, Iterable[Field]]] = None, NODE: Optional[Union[Field, Iterable[Field]]] = None, PARENT: Optional[Union[Field, Iterable[Field]]] = None, NEW: Optional[Union[Field, Iterable[Field]]] = None)[source]#

AttachJuxtaposeTree is derived from the Tree class, supporting back-and-forth transformations between trees and AttachJuxtapose actions (Yang & Deng, 2020).

WORD#

Words in the sentence.

POS#

Part-of-speech tags, or underscores if not available.

TREE#

The raw constituency tree in nltk.tree.Tree format.

NODE#

The target node on each rightmost chain.

PARENT#

The label of the parent node of each terminal.

NEW#

The label of each newly inserted non-terminal with a target node and a terminal as juxtaposed children. NUL represents the Attach action.

classmethod tree2action(tree: nltk.tree.tree.Tree)[source]#

Converts a constituency tree into AttachJuxtapose actions.

Parameters

tree (nltk.tree.Tree) – A constituency tree in nltk.tree.Tree format.

Returns

A sequence of AttachJuxtapose actions.

Examples

>>> import nltk
>>> from supar.utils import AttachJuxtaposeTree
>>> tree = nltk.Tree.fromstring('''
                                (TOP
                                  (S
                                    (NP (_ Arthur))
                                    (VP
                                      (_ is)
                                      (NP (NP (_ King)) (PP (_ of) (NP (_ the) (_ Britons)))))
                                    (_ .)))
                                ''')
>>> tree.pretty_print()
                TOP
                 |
                 S
   ______________|_______________________
  |              VP                      |
  |      ________|___                    |
  |     |            NP                  |
  |     |    ________|___                |
  |     |   |            PP              |
  |     |   |     _______|___            |
  NP    |   NP   |           NP          |
  |     |   |    |        ___|_____      |
  _     _   _    _       _         _     _
  |     |   |    |       |         |     |
Arthur  is King  of     the     Britons  .
>>> AttachJuxtaposeTree.tree2action(tree)
[(0, 'NP', '<nul>'), (0, 'VP', 'S'), (1, 'NP', '<nul>'),
 (2, 'PP', 'NP'), (3, 'NP', '<nul>'), (4, '<nul>', '<nul>'),
 (0, '<nul>', '<nul>')]
classmethod action2tree(tree: nltk.tree.tree.Tree, actions: List[Tuple[int, str, str]], join: str = '::') nltk.tree.tree.Tree[source]#

Recovers a constituency tree from a sequence of AttachJuxtapose actions.

Parameters
  • tree (nltk.tree.Tree) – An empty tree that provides a base for building a result tree.

  • actions (List[Tuple[int, str, str]]) – A sequence of AttachJuxtapose actions.

  • join (str) – A string used to connect collapsed node labels. Non-terminals containing this will be expanded to unary chains. Default: '::'.

Returns

A result constituency tree.

Examples

>>> from supar.utils import AttachJuxtaposeTree
>>> tree = AttachJuxtaposeTree.totree(['Arthur', 'is', 'King', 'of', 'the', 'Britons', '.'], 'TOP')
>>> AttachJuxtaposeTree.action2tree(tree,
                                    [(0, 'NP', '<nul>'), (0, 'VP', 'S'), (1, 'NP', '<nul>'),
                                     (2, 'PP', 'NP'), (3, 'NP', '<nul>'), (4, '<nul>', '<nul>'),
                                     (0, '<nul>', '<nul>')]).pretty_print()
                TOP
                 |
                 S
   ______________|_______________________
  |              VP                      |
  |      ________|___                    |
  |     |            NP                  |
  |     |    ________|___                |
  |     |   |            PP              |
  |     |   |     _______|___            |
  NP    |   NP   |           NP          |
  |     |   |    |        ___|_____      |
  _     _   _    _       _         _     _
  |     |   |    |       |         |     |
Arthur  is King  of     the     Britons  .
classmethod action2span(action: torch.Tensor, spans: Optional[torch.Tensor] = None, nul_index: int = -1, mask: Optional[torch.BoolTensor] = None) torch.Tensor[source]#

Converts a batch of tensorized actions at a given step into spans.

Parameters
  • action (Tensor) – [3, batch_size]. A batch of the tensorized action at a given step, containing indices of target nodes, parent and new labels.

  • spans (Tensor) – Spans generated at previous steps, None at the first step. Default: None.

  • nul_index (int) – The index for the NUL token, representing the Attach action. Default: -1.

  • mask (BoolTensor) – [batch_size]. The mask for covering the unpadded tokens. Default: None.

Returns

A tensor representing a batch of spans for the given step.

Examples

>>> import torch
>>> from collections import Counter
>>> from supar.utils import AttachJuxtaposeTree, Vocab
>>> from supar.utils.common import NUL
>>> nodes, parents, news = zip(*[(0, 'NP', NUL), (0, 'VP', 'S'), (1, 'NP', NUL),
                                 (2, 'PP', 'NP'), (3, 'NP', NUL), (4, NUL, NUL),
                                 (0, NUL, NUL)])
>>> vocab = Vocab(Counter(sorted(set([*parents, *news]))))
>>> actions = torch.tensor([nodes, vocab[parents], vocab[news]]).unsqueeze(1)
>>> spans = None
>>> for action in actions.unbind(-1):
...     spans = AttachJuxtaposeTree.action2span(action, spans, vocab[NUL])
...
>>> spans
tensor([[[-1,  1, -1, -1, -1, -1, -1,  3],
         [-1, -1, -1, -1, -1, -1,  4, -1],
         [-1, -1, -1,  1, -1, -1,  1, -1],
         [-1, -1, -1, -1, -1, -1,  2, -1],
         [-1, -1, -1, -1, -1, -1,  1, -1],
         [-1, -1, -1, -1, -1, -1, -1, -1],
         [-1, -1, -1, -1, -1, -1, -1, -1],
         [-1, -1, -1, -1, -1, -1, -1, -1]]])
>>> sequence = torch.where(spans.ge(0))
>>> sequence = list(zip(sequence[1].tolist(), sequence[2].tolist(), vocab[spans[sequence]]))
>>> sequence
[(0, 1, 'NP'), (0, 7, 'S'), (1, 6, 'VP'), (2, 3, 'NP'), (2, 6, 'NP'), (3, 6, 'PP'), (4, 6, 'NP')]
>>> tree = AttachJuxtaposeTree.totree(['Arthur', 'is', 'King', 'of', 'the', 'Britons', '.'], 'TOP')
>>> AttachJuxtaposeTree.build(tree, sequence).pretty_print()
                TOP
                 |
                 S
   ______________|_______________________
  |              VP                      |
  |      ________|___                    |
  |     |            NP                  |
  |     |    ________|___                |
  |     |   |            PP              |
  |     |   |     _______|___            |
  NP    |   NP   |           NP          |
  |     |   |    |        ___|_____      |
  _     _   _    _       _         _     _
  |     |   |    |       |         |     |
Arthur  is King  of     the     Britons  .
load(data: Union[str, Iterable], lang: Optional[str] = None, **kwargs) List[supar.utils.transform.AttachJuxtaposeTreeSentence][source]#
Parameters
  • data (Union[str, Iterable]) – A filename or a list of instances.

  • lang (str) – Language code (e.g., en) or language name (e.g., English) for the text to tokenize. None if tokenization is not required. Default: None.

Returns

A list of AttachJuxtaposeTreeSentence instances.