Aspects

Functions that can be applied to sentences to corrupt them in a controlled way. The corrupted sentences can then be used to test the robustness of NLP models.

Base class

class wildnlp.aspects.base.Aspect[source]

Abstract base class. All aspects must implement the __call__ method.

__call__(sentence)[source]

We want to directly call objects of the Aspect class for easy chaining. This function will be applied to sentences.

static _detokenize(tokens)[source]

Join tokens into a sentence, including punctuation and special characters.

Parameters: tokens – List of tokens.
Returns: A sentence as a single string.

static _tokenize(sentence)[source]

Split text into tokens including punctuation and special characters.

Parameters: sentence – A sentence as a single string.
Returns: List of tokens.
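
The round trip between a sentence and its tokens can be illustrated with a minimal sketch. The functions tokenize and detokenize below are hypothetical stand-ins written for this example; they assume a simple regex-based split and are not the library's actual _tokenize and _detokenize implementations, which may treat special characters differently.

```python
import re

def tokenize(sentence):
    """Split text into word tokens and single punctuation/special characters."""
    # \w+ matches a run of word characters; [^\w\s] matches one non-word, non-space character.
    return re.findall(r"\w+|[^\w\s]", sentence)

def detokenize(tokens):
    """Join tokens with spaces, then drop the space before common punctuation."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:])", r"\1", text)

tokens = tokenize("Hello, world!")
print(tokens)
print(detokenize(tokens))
```

Keeping punctuation as separate tokens lets an aspect modify words without touching the surrounding punctuation marks.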

Utility functions

wildnlp.aspects.utils.compose(*functions)[source]

Chains multiple aspects into a single function.

Parameters: functions – One or more callable objects, such as aspects.
Returns: The chained function.

Example:

from wildnlp.aspects.utils import compose
from wildnlp.aspects import Swap, QWERTY

composed_aspect = compose(Swap(), QWERTY())
modified_text = composed_aspect('Text to corrupt')
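
To make the chaining idea concrete, here is a minimal sketch of how such a helper could be written. The name compose_sketch is hypothetical, and the left-to-right application order is an assumption; the source does not state the order in which the library's compose applies its arguments.

```python
from functools import reduce

def compose_sketch(*functions):
    """Return a function that applies each callable in turn, left to right (assumed order)."""
    def chained(sentence):
        # Feed the output of each callable into the next one.
        return reduce(lambda text, func: func(text), functions, sentence)
    return chained

# Plain string methods stand in for aspects here; any callable taking a string works.
shout = compose_sketch(str.strip, str.upper)
print(shout("  text to corrupt "))
```

Because every aspect is a callable that takes and returns a sentence, aspects and ordinary functions can be mixed freely in one chain.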

Articles

class wildnlp.aspects.articles.Articles(swap_probability=0.5, seed=42)[source]

Bases: wildnlp.aspects.base.Aspect

Randomly removes or swaps articles into wrong ones.

Caution

Uses random numbers, default seed is 42.

__init__(swap_probability=0.5, seed=42)[source]
Parameters:
  • swap_probability – Probability of applying a transformation. Defaults to 0.5.
  • seed – Random seed.
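
The effect of swap_probability and seed can be sketched without the library. The function corrupt_articles below is a hypothetical illustration, not the Articles implementation: the article list, the even split between removing and swapping, and the whitespace-based word splitting are all assumptions made for this example.

```python
import random

ARTICLES = ["a", "an", "the"]  # assumed article list for this sketch

def corrupt_articles(sentence, swap_probability=0.5, seed=42):
    """Remove or swap each article with probability swap_probability."""
    rng = random.Random(seed)  # a local generator keeps results reproducible
    out = []
    for word in sentence.split():
        if word.lower() in ARTICLES and rng.random() < swap_probability:
            # Assumption: removal and swapping are equally likely.
            if rng.random() < 0.5:
                continue  # drop the article
            word = rng.choice([a for a in ARTICLES if a != word.lower()])
        out.append(word)
    return " ".join(out)

print(corrupt_articles("The cat sat on a mat", swap_probability=1.0))
```

With a fixed seed the same input always yields the same corrupted output, which is what makes robustness tests repeatable.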

Characters removal

class wildnlp.aspects.remove_char.RemoveChar(char=None, words_percentage=50, characters_percentage=10, seed=42)[source]

Bases: wildnlp.aspects.base.Aspect

Randomly removes characters from words.

Note

You may specify white space as the character to be removed, but it will be processed differently.

Caution

Uses random numbers, default seed is 42.

__init__(char=None, words_percentage=50, characters_percentage=10, seed=42)[source]
Parameters:
  • words_percentage – Percentage of words in a sentence that should be transformed. If greater than 0, at least one word will always be transformed.
  • characters_percentage – Percentage of characters in a word that should be transformed. If greater than 0, at least one character will always be transformed.
  • char – If specified, only that character will be randomly removed. The specified character can also be a white space.
  • seed – Random seed.
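
The "at least one" rule for the percentage parameters can be sketched as follows. Both count_to_transform and remove_chars are hypothetical helpers written for this example; rounding down is an assumption, and the library may round differently.

```python
import math
import random

def count_to_transform(total, percentage):
    """Share of total given by percentage, but never 0 when percentage > 0."""
    if total == 0 or percentage <= 0:
        return 0
    # Assumption: round down, then enforce the minimum of one item.
    return max(1, min(total, math.floor(total * percentage / 100)))

def remove_chars(word, characters_percentage=10, seed=42):
    """Remove randomly chosen characters from a single word."""
    rng = random.Random(seed)
    n = count_to_transform(len(word), characters_percentage)
    drop = set(rng.sample(range(len(word)), n))
    return "".join(ch for i, ch in enumerate(word) if i not in drop)

print(count_to_transform(6, 10))  # 10% of 6 rounds down to 0, so the minimum of 1 applies
print(remove_chars("letter"))
```

The minimum matters for short words: without it, a low percentage would leave most of them untouched.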

Characters swapping

class wildnlp.aspects.swap.Swap(transform_percentage=100, seed=42)[source]

Bases: wildnlp.aspects.base.Aspect

Randomly swaps two characters within a word, excluding punctuation. The two swapped characters may be identical, in which case the word is unchanged: for example, letter remains letter when its two t characters are swapped.

Caution

Uses random numbers, default seed is 42.

__init__(transform_percentage=100, seed=42)[source]
Parameters:
  • transform_percentage – Maximum percentage of words in a sentence that should be transformed.
  • seed – Random seed.
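
The case in which a swap leaves the word unchanged is easy to reproduce. The helpers swap_at and swap_random below are hypothetical and written only for illustration; choosing two distinct positions uniformly at random is an assumption about how the swap is selected.

```python
import random

def swap_at(word, i, j):
    """Swap the characters at positions i and j."""
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def swap_random(word, rng):
    """Swap two distinct, randomly chosen positions (assumed selection rule)."""
    if len(word) < 2:
        return word
    i, j = rng.sample(range(len(word)), 2)
    return swap_at(word, i, j)

print(swap_at("letter", 0, 1))  # different characters: the word changes
print(swap_at("letter", 2, 3))  # both positions hold "t": the word is unchanged
print(swap_random("letter", random.Random(42)))
```

This is why transform_percentage is described as a maximum: some selected words may come out identical to the input.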

Digits2Words

class wildnlp.aspects.digits2words.Digits2Words[source]

Bases: wildnlp.aspects.base.Aspect

Converts numbers into words. Handles floating-point numbers as well.

All numbers will be converted.

Misspelling

class wildnlp.aspects.misspelling.Misspelling(use_homophones=False, seed=42)[source]

Bases: wildnlp.aspects.base.Aspect

Misspells words appearing in the Wikipedia list of commonly misspelled English words (default): https://en.wikipedia.org/wiki/Commonly_misspelled_English_words

If a word has more than one common misspelling, the replacement is selected randomly.

All words that have any misspellings listed will be replaced.

Caution

Uses random numbers, default seed is 42.

__init__(use_homophones=False, seed=42)[source]
Parameters:
  • use_homophones – If True, a list of homophones will be used to replace words.
  • seed – Random seed.
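
The lookup-and-replace behavior can be sketched with a tiny dictionary. The MISSPELLINGS table and the misspell function below are hypothetical; the two entries are illustrative samples, not data loaded from the Wikipedia list, and whitespace splitting is a simplification.

```python
import random

# Hypothetical two-entry sample; the library uses the full Wikipedia list.
MISSPELLINGS = {
    "definitely": ["definately", "definitly"],
    "separate": ["seperate"],
}

def misspell(sentence, seed=42):
    """Replace every listed word; pick randomly when several misspellings exist."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        options = MISSPELLINGS.get(word.lower())
        out.append(rng.choice(options) if options else word)
    return " ".join(out)

result = misspell("I definitely need a separate room")
print(result)
```

The seed only matters for words with more than one listed misspelling; words with a single entry are always replaced the same way.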

Punctuation

class wildnlp.aspects.punctuation.Punctuation(char=', ', add_percentage=0, remove_percentage=100, seed=42)[source]

Bases: wildnlp.aspects.base.Aspect

Randomly adds or removes specified punctuation marks. The implementation guarantees that an added punctuation mark is never appended to an existing one and is never inserted where a mark has just been removed.

With default settings all occurrences of the specified punctuation mark will be removed.

  • Example:

    Sentence, have a comma.
    
    Possible transformations:
    - Sentence have, a comma.
    - Sentence, have, a, comma.
    
    Impossible transformations:
    - Sentence,, have a comma.
    

Caution

Uses random numbers, default seed is 42.

__init__(char=', ', add_percentage=0, remove_percentage=100, seed=42)[source]
Parameters:
  • char – Punctuation mark that will be removed or added to sentences.
  • add_percentage – Maximum percentage of white spaces in a sentence to be prepended with punctuation marks.
  • remove_percentage – Maximum percentage of existing punctuation marks that will be removed.
  • seed – Random seed.
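
The removal half of this aspect can be sketched as below. The function remove_marks is a hypothetical illustration, not the Punctuation implementation: it assumes a single-character mark (a bare comma rather than the default shown in the signature), rounds the share down, and does not cover adding marks.

```python
import random

def remove_marks(sentence, char=",", remove_percentage=100, seed=42):
    """Remove at most remove_percentage percent of the occurrences of char."""
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(sentence) if ch == char]
    # Assumption: round down, and never try to remove more marks than exist.
    n = min(len(positions), len(positions) * remove_percentage // 100)
    drop = set(rng.sample(positions, n))
    return "".join(ch for i, ch in enumerate(sentence) if i not in drop)

print(remove_marks("Sentence, have a comma."))
print(remove_marks("a, b, c, d", remove_percentage=50))
```

With remove_percentage=100 the result does not depend on the seed, since every occurrence is removed.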

QWERTY

class wildnlp.aspects.qwerty.QWERTY(words_percentage=1, characters_percentage=10, seed=42)[source]

Bases: wildnlp.aspects.base.Aspect

Simulates errors made while writing on a QWERTY-type keyboard. Characters are swapped with their neighbors on the keyboard.

Caution

Uses random numbers, default seed is 42.

__init__(words_percentage=1, characters_percentage=10, seed=42)[source]
Parameters:
  • words_percentage – Percentage of words in a sentence that should be transformed. If greater than 0, at least one word will always be transformed.
  • characters_percentage – Percentage of characters in a word that should be transformed. If greater than 0, at least one character will always be transformed.
  • seed – Random seed.
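
Replacing a character with a keyboard neighbor can be sketched with a small lookup table. The NEIGHBORS map and the qwerty_typo function below are hypothetical; the map is a partial, hand-written sample of a QWERTY layout, and the library's own neighbor table may differ.

```python
import random

# Partial, hand-written neighbor map for illustration only.
NEIGHBORS = {
    "a": "qwsz",
    "s": "awedxz",
    "d": "serfcx",
    "e": "wsdr",
    "t": "rfgy",
}

def qwerty_typo(word, index, rng):
    """Replace the character at index with a random keyboard neighbor, if one is known."""
    options = NEIGHBORS.get(word[index].lower())
    if not options:
        return word  # no neighbors listed for this character
    return word[:index] + rng.choice(options) + word[index + 1:]

rng = random.Random(42)
typo = qwerty_typo("test", 0, rng)
print(typo)
```

A full implementation would also pick which words and character positions to alter, following words_percentage and characters_percentage.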

Sentiment words masking

class wildnlp.aspects.sentiment_masking.SentimentMasking(char='*', use_positive=False, seed=42)[source]

Bases: wildnlp.aspects.base.Aspect

This aspect reflects attempts by Internet users to mask profanity or hate speech in online forums in order to evade moderation. Masking replaces a single random character with, for example, an asterisk, and is applied to negative (or, for completeness, positive) words from the Opinion Lexicon: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

All words that are listed will be transformed.

Caution

Uses random numbers, default seed is 42.

__init__(char='*', use_positive=False, seed=42)[source]
Parameters:
  • char – A character that will be used to mask words.
  • use_positive – If True, positive (instead of negative) words will be masked.
  • seed – Random seed.
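
The masking step can be sketched as follows. The NEGATIVE_WORDS set and the mask_words function are hypothetical; the two words are an illustrative sample rather than entries loaded from the Opinion Lexicon, and whitespace splitting is a simplification.

```python
import random

NEGATIVE_WORDS = {"awful", "terrible"}  # illustrative sample, not the real lexicon

def mask_words(sentence, char="*", seed=42):
    """Replace one random character of every listed word with the mask character."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if word.lower() in NEGATIVE_WORDS:
            i = rng.randrange(len(word))
            word = word[:i] + char + word[i + 1:]
        out.append(word)
    return " ".join(out)

masked = mask_words("What an awful day")
print(masked)
```

The masked word keeps its length, so a human reader can still recognize it while a model that relies on exact vocabulary matches may not.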