Test NLP models in the wild!¶
If we want to use NLP models in real life, their performance should be measured with more than just well-established metrics.
This module implements multiple functions (aspects) designed to corrupt text in ways that resemble naturally occurring mistakes.
Quick Start Guide¶
5 minutes is enough to start using the module!
See how easily you can enrich your analysis of an NLP model with a robustness test.
Installation¶
pip install wild-nlp
Loading a dataset¶
from wildnlp.datasets import SampleDataset
dataset = SampleDataset()
dataset.load()
Corrupting a text¶
There are two use cases. You may either apply a corruption to a supported dataset or modify an arbitrary string.
Applying corruption to a supported dataset¶
from wildnlp.aspects import Reverser
modified = dataset.apply(Reverser())
Modifying a string¶
from wildnlp.aspects import Reverser
modified = Reverser()('A string to be modified.')
Note
All instances of classes derived from the Aspect class are callable. You can treat them like any other function.
from wildnlp.aspects import Reverser
reverser_object = Reverser()
modified = reverser_object('A string to be modified.')
# Note that this is the same as
# modified = Reverser()('A string to be modified.')
Full example with multiple corruptors¶
The code¶
from wildnlp.aspects import Reverser, PigLatin
from wildnlp.aspects.utils import compose
from wildnlp.datasets import SampleDataset
# Create a dataset object and load the dataset
dataset = SampleDataset()
dataset.load()
# Create a composed corruptor function.
# Functions will be applied in the same order they appear.
composed = compose(Reverser(), PigLatin())
# Apply the function to the dataset
modified = dataset.apply(composed)
The dataset’s contents¶
["Manning is a leader in applying Deep Learning "
"to Natural Language Processing",
"Manning has coauthored leading textbooks on statistical "
"approaches to Natural Language Processing"]
After applying the aspects¶
>>> print(modified)
['ninnamgay isay aay edaelray inay niylppagay eedpay ninraelgay toay arutanlay gaugnaleay nissecorpgay', 'ninnamgay ahsay erohtuaocday nidaelgay koobtxetsay onay acitsitatslay ehcaorppasay toay arutanlay gaugnaleay nissecorpgay']
Saving the dataset¶
The serialized dataset has exactly the same format as the original dataset before modification.
This means that you don’t have to modify your existing code to test the robustness of your models. Simply modify a dataset, save the modified version, and provide it as input to your existing pipeline instead of the original file.
Note: in this example no file will be saved.
from wildnlp.datasets import SampleDataset
dataset = SampleDataset()
dataset.load()
# Pass the data to be saved together with the target path
dataset.save(dataset.data, '<path_to_file>')
Aspects¶
Functions that can be applied to sentences to corrupt them in a controlled way. The corrupted sentences can then be used to test the robustness of NLP models.
Base class¶
class wildnlp.aspects.base.Aspect
Base, abstract class. All the aspects must implement the __call__ method.
__call__(sentence)
We want to directly call objects of the Aspect class for easy chaining. This function will be applied to sentences.
Utility functions¶
wildnlp.aspects.utils.compose(*functions)
Chains multiple aspects into a single function.
Parameters: functions – One or more callable objects.
Returns: chained function
Example:
from wildnlp.aspects.utils import compose
from wildnlp.aspects import Swap, QWERTY
composed_aspect = compose(Swap(), QWERTY())
modified_text = composed_aspect('Text to corrupt')
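To make the chaining order concrete, here is a minimal sketch of how such a helper could be written. It is not the library's implementation: compose_sketch, reverse_words and shout are hypothetical names, and the only behaviour taken from this documentation is that functions are applied in the order they appear.

```python
from functools import reduce
from typing import Callable

TextFn = Callable[[str], str]


def compose_sketch(*functions: TextFn) -> TextFn:
    """Chain callables left to right: the first argument is applied first."""
    def composed(text: str) -> str:
        # Feed the output of each function into the next one.
        return reduce(lambda acc, fn: fn(acc), functions, text)
    return composed


# Hypothetical stand-ins for aspects: any str -> str callable works.
def reverse_words(text: str) -> str:
    return " ".join(word[::-1] for word in text.split())


def shout(text: str) -> str:
    return text.upper()


pipeline = compose_sketch(reverse_words, shout)
print(pipeline("Text to corrupt"))  # TXET OT TPURROC
```

Because every aspect is a callable taking and returning a string, aspect objects can be passed to such a helper exactly like the plain functions above.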
Articles¶
class wildnlp.aspects.articles.Articles(swap_probability=0.5, seed=42)
Bases: wildnlp.aspects.base.Aspect
Randomly removes articles or swaps them for wrong ones.
Caution
Uses random numbers, default seed is 42.
Characters removal¶
class wildnlp.aspects.remove_char.RemoveChar(char=None, words_percentage=50, characters_percentage=10, seed=42)
Bases: wildnlp.aspects.base.Aspect
Randomly removes characters from words.
Note
You may specify a white space as the character to be removed, but it will be processed differently.
Caution
Uses random numbers, default seed is 42.
__init__(char=None, words_percentage=50, characters_percentage=10, seed=42)
Parameters:
- words_percentage – Percentage of words in a sentence that should be transformed. If greater than 0, at least one word is always transformed.
- characters_percentage – Percentage of characters in a word that should be transformed. If greater than 0, at least one character is always transformed.
- char – If specified, only that character will be randomly removed. The specified character can also be a white space.
- seed – Random seed.
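The "at least one" rule for the percentage parameters can be sketched as follows. This is an illustration only: count_to_transform is a hypothetical helper, and rounding the share down is an assumption, since the documentation does not state how fractional counts are rounded.

```python
import math


def count_to_transform(total: int, percentage: float) -> int:
    """Number of items (words or characters) to transform.

    Assumption: the share is rounded down, then raised to 1 whenever the
    percentage is positive, so that at least one item is always transformed.
    """
    if total <= 0 or percentage <= 0:
        return 0
    return min(total, max(1, math.floor(total * percentage / 100)))


print(count_to_transform(7, 50))  # 3
print(count_to_transform(7, 10))  # 1 (0.7 rounds down to 0, raised to 1)
print(count_to_transform(7, 0))   # 0
```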
Characters swapping¶
class wildnlp.aspects.swap.Swap(transform_percentage=100, seed=42)
Bases: wildnlp.aspects.base.Aspect
Randomly swaps two characters within a word, excluding punctuation. The two swapped characters may be identical, in which case the word is left unchanged; for example, letter can remain letter after swapping.
Caution
Uses random numbers, default seed is 42.
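The following sketch shows why a swap can leave a word unchanged and how a fixed seed makes the result reproducible. It is not the library's code: swap_positions and swap_random are hypothetical helpers, and unlike the real aspect they make no attempt to skip punctuation.

```python
import random


def swap_positions(word: str, i: int, j: int) -> str:
    """Return `word` with the characters at positions i and j exchanged."""
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def swap_random(word: str, rng: random.Random) -> str:
    """Swap two distinct, randomly chosen positions in `word`."""
    if len(word) < 2:
        return word
    i, j = rng.sample(range(len(word)), 2)
    return swap_positions(word, i, j)


print(swap_positions("letter", 2, 3))  # letter (both positions hold "t")
print(swap_positions("letter", 0, 5))  # rettel

# Two generators created with the same seed produce the same swap.
first = swap_random("letter", random.Random(42))
again = swap_random("letter", random.Random(42))
```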
Digits2Words¶
class wildnlp.aspects.digits2words.Digits2Words
Bases: wildnlp.aspects.base.Aspect
Converts numbers into words. Handles floating-point numbers as well.
All numbers will be converted.
Misspelling¶
class wildnlp.aspects.misspelling.Misspelling(use_homophones=False, seed=42)
Bases: wildnlp.aspects.base.Aspect
Misspells words appearing in the Wikipedia list of commonly misspelled English words (default): https://en.wikipedia.org/wiki/Commonly_misspelled_English_words
Tip
You can use homophones instead: https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/Homophones
If a word has more than one common misspelling, the replacement is selected randomly.
All words that have any misspellings listed will be replaced.
Caution
Uses random numbers, default seed is 42.
Punctuation¶
class wildnlp.aspects.punctuation.Punctuation(char=', ', add_percentage=0, remove_percentage=100, seed=42)
Bases: wildnlp.aspects.base.Aspect
Randomly adds or removes specified punctuation marks. The implementation guarantees that an added mark is never appended to an original one and never takes the place of an original one that was removed.
With default settings all occurrences of the specified punctuation mark will be removed.
Example:
Sentence, have a comma.
Possible transformations:
- Sentence have, a comma.
- Sentence, have, a, comma.
Impossible transformations:
- Sentence,, have a comma.
Caution
Uses random numbers, default seed is 42.
__init__(char=', ', add_percentage=0, remove_percentage=100, seed=42)
Parameters:
- char – Punctuation mark that will be removed or added to sentences.
- add_percentage – Max percentage of white spaces in a sentence to be prepended with punctuation marks.
- remove_percentage – Max percentage of existing punctuation marks that will be removed.
- seed – Random seed.
QWERTY¶
class wildnlp.aspects.qwerty.QWERTY(words_percentage=1, characters_percentage=10, seed=42)
Bases: wildnlp.aspects.base.Aspect
Simulates errors made while typing on a QWERTY keyboard. Characters are swapped with their neighbors on the keyboard.
Caution
Uses random numbers, default seed is 42.
__init__(words_percentage=1, characters_percentage=10, seed=42)
Parameters:
- words_percentage – Percentage of words in a sentence that should be transformed. If greater than 0, at least one word is always transformed.
- characters_percentage – Percentage of characters in a word that should be transformed. If greater than 0, at least one character is always transformed.
- seed – Random seed.
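As a rough illustration of neighbor-key substitution, the sketch below replaces one character of a word with an adjacent key. The NEIGHBOURS map is deliberately partial and hypothetical, qwerty_typo is not part of wild-nlp, and the percentage parameters of the real aspect are not modeled.

```python
import random

# Partial map of keys to adjacent letter keys (an assumption for illustration).
NEIGHBOURS = {
    "a": "qwsz",
    "e": "wsdr",
    "o": "iklp",
    "t": "rfgy",
}


def qwerty_typo(word: str, rng: random.Random) -> str:
    """Replace one mapped character with a randomly chosen neighbouring key."""
    candidates = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBOURS]
    if not candidates:
        return word  # nothing in this word is covered by the partial map
    i = rng.choice(candidates)
    replacement = rng.choice(NEIGHBOURS[word[i].lower()])
    return word[:i] + replacement + word[i + 1:]


typo = qwerty_typo("text", random.Random(42))
print(typo)  # "text" with exactly one character replaced by a neighbour
```

A complete version would cover the whole keyboard and preserve the case of the replaced character.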
Sentiment words masking¶
class wildnlp.aspects.sentiment_masking.SentimentMasking(char='*', use_positive=False, seed=42)
Bases: wildnlp.aspects.base.Aspect
This aspect reflects attempts made by Internet users to mask profanity or hate speech in online forums to evade moderation. We perform masking (replacing a single random character with, for example, an asterisk) of negative (or, for completeness, positive) words from the Opinion Lexicon: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
All words that are listed will be transformed.
Caution
Uses random numbers, default seed is 42.
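The masking idea can be sketched in a few lines. This is only an illustration under stated assumptions: NEGATIVE_WORDS is a tiny hypothetical stand-in for the Opinion Lexicon, mask_sentiment is not the library's function, and tokens are split on white space, so words with attached punctuation are not matched.

```python
import random

# Tiny hypothetical lexicon; the real aspect uses the Opinion Lexicon.
NEGATIVE_WORDS = {"awful", "terrible", "hate"}


def mask_sentiment(sentence: str, lexicon: set, char: str = "*", seed: int = 42) -> str:
    """Replace one random character of every lexicon word with `char`."""
    rng = random.Random(seed)
    masked = []
    for token in sentence.split():
        if token.lower() in lexicon:
            i = rng.randrange(len(token))
            token = token[:i] + char + token[i + 1:]
        masked.append(token)
    return " ".join(masked)


masked_sentence = mask_sentiment("I hate this awful movie", NEGATIVE_WORDS)
print(masked_sentence)  # "hate" and "awful" each contain exactly one "*"
```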
Datasets¶
Details regarding the popular datasets for various NLP problems that wild-nlp supports.
Base class¶
CoNLL 2003¶
class wildnlp.datasets.conll.CoNLL(*args, **kwargs)
Bases: wildnlp.datasets.base.Dataset
The CoNLL-2003 shared task data for language-independent named entity recognition. For details see: https://www.clips.uantwerpen.be/conll2003/ner/
apply(aspect, apply_to_ne=False)
Parameters:
- aspect – transformation function
- apply_to_ne – if False, transformation won’t be applied to Named Entities. If True, transformation will be applied only to Named Entities.
Returns: modified dataset in the following form:
[{tokens: array(<tokens>), pos_tags: array(<pos_tags>), chunk_tags: array(<chunk_tags>), ner_tags: array(<ner_tags>)}, ...]
load(path)
Reads a CoNLL dataset file and loads it into an internal data structure in the following form:
[{tokens: array(<tokens>), pos_tags: array(<pos_tags>), chunk_tags: array(<chunk_tags>), ner_tags: array(<ner_tags>)}, ...]
Parameters: path – A path to a file with CoNLL data
Returns: None
save(data, path)
Saves data in the CoNLL format.
Parameters:
- data – list of dictionaries in the following form: [{tokens: array(<tokens>), pos_tags: array(<pos_tags>), chunk_tags: array(<chunk_tags>), ner_tags: array(<ner_tags>)}, ...]
- path – Path to save the file. If the file exists, it will be overwritten.
Returns: None
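To illustrate the per-sentence structure and the apply_to_ne switch, here is a minimal sketch that parses text in the four-column CoNLL-2003 layout (token, POS tag, chunk tag, NER tag). It is not the library's loader: parse_conll and apply_sketch are hypothetical helpers, plain lists are used instead of arrays, and treating every tag other than "O" as a named entity is an assumption.

```python
from typing import Callable, Dict, List

Sentence = Dict[str, List[str]]

SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""


def _empty() -> Sentence:
    return {"tokens": [], "pos_tags": [], "chunk_tags": [], "ner_tags": []}


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into one dictionary per sentence."""
    sentences, current = [], _empty()
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            continue  # document separator, carries no tokens
        if not line:  # a blank line ends the current sentence
            if current["tokens"]:
                sentences.append(current)
                current = _empty()
            continue
        token, pos, chunk, ner = line.split()
        current["tokens"].append(token)
        current["pos_tags"].append(pos)
        current["chunk_tags"].append(chunk)
        current["ner_tags"].append(ner)
    if current["tokens"]:
        sentences.append(current)
    return sentences


def apply_sketch(sentence: Sentence, aspect: Callable[[str], str],
                 apply_to_ne: bool = False) -> Sentence:
    """Transform only named entities, or only the remaining tokens."""
    tokens = [
        aspect(token) if (tag != "O") == apply_to_ne else token
        for token, tag in zip(sentence["tokens"], sentence["ner_tags"])
    ]
    return {**sentence, "tokens": tokens}


parsed = parse_conll(SAMPLE)
print(apply_sketch(parsed[0], str.upper)["tokens"])
# ['EU', 'REJECTS', 'German', 'CALL']
```

Only the tokens change; the tag lists are carried over untouched, which is what allows the modified data to be written back in the original format.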
SNLI¶
class wildnlp.datasets.snli.SNLI(*args, **kwargs)
Bases: wildnlp.datasets.base.Dataset
The SNLI dataset supporting the task of natural language inference. For details see: https://nlp.stanford.edu/projects/snli/
SQuAD¶
class wildnlp.datasets.squad.SQuAD
Bases: wildnlp.datasets.base.Dataset
The SQuAD dataset. For details see: https://rajpurkar.github.io/SQuAD-explorer/
IMDB¶
class wildnlp.datasets.imdb.IMDB(*args, **kwargs)
Bases: wildnlp.datasets.base.Dataset
The IMDB dataset containing movie reviews for sentiment analysis. The dataset consists of 50 000 reviews in two classes, negative and positive. Each review is stored in a separate text file. For details see: http://ai.stanford.edu/~amaas/data/sentiment/
load(path)
Loads the IMDB dataset.
Parameters: path – A path to a single file, a directory containing review files, or a list of paths to such directories.
Returns: None
How to contribute¶
The functions implemented so far by no means exhaust all the possibilities for corrupting a text.
There are also many popular datasets that we still don’t support.
If you’d like to extend the module, you’re more than welcome! :)
General remarks¶
- Unit tests are boring but sometimes also helpful.
- Documenting your work helps others use it.
How to add an Aspect¶
All the aspects inherit from the Aspect class. The only thing you should remember is to implement the __call__ function accepting a single argument (string) and returning a single output (string).
That’s it, we leave the details to you! Remember that natural language is tricky: there are punctuation marks, capitalization, apostrophes, etc. that you may want to leave intact if they’re not the target of your aspect.
Caution
Every aspect must implement the __call__ function, accepting a string as input and outputting a string.
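A minimal sketch of a new aspect might look like this. The Aspect class below is a local stand-in written for this example so that the snippet runs on its own; in the module you would inherit from wildnlp.aspects.base.Aspect instead. Lowercase is a hypothetical aspect, not one shipped with wild-nlp.

```python
from abc import ABC, abstractmethod


class Aspect(ABC):
    """Local stand-in for wildnlp.aspects.base.Aspect (assumed interface)."""

    @abstractmethod
    def __call__(self, sentence: str) -> str:
        ...


class Lowercase(Aspect):
    """Hypothetical aspect imitating users who ignore capitalization."""

    def __call__(self, sentence: str) -> str:
        # String in, string out: the only contract an aspect has to honor.
        return sentence.lower()


print(Lowercase()("A String To Be Modified."))  # a string to be modified.
```

Because the object is callable and maps a string to a string, it can be passed to a dataset's apply method or chained with other aspects.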
How to add a dataset support¶
All the datasets inherit from the Dataset class. Each dataset must implement only three methods:
- load - load a dataset from a file into an iterable internal object.
- apply - iterate through all elements (strings) in a dataset and modify them one by one.
- save - save the dataset in exactly the same format as the original one.
Caution
Saved dataset should have the same format as the original one.
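The three methods can be sketched for a very simple format, a text file with one sentence per line. Everything here is an assumption made for illustration: the Dataset class is a local stand-in for wildnlp.datasets.base.Dataset, LineDataset is hypothetical, and the method signatures follow the ones documented for CoNLL.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Local stand-in for wildnlp.datasets.base.Dataset (assumed interface)."""

    def __init__(self):
        self._data = None

    @property
    def data(self):
        return self._data

    @abstractmethod
    def load(self, path): ...

    @abstractmethod
    def apply(self, aspect): ...

    @abstractmethod
    def save(self, data, path): ...


class LineDataset(Dataset):
    """Hypothetical dataset: a text file with one sentence per line."""

    def load(self, path):
        with open(path, encoding="utf-8") as f:
            self._data = [line.rstrip("\n") for line in f]

    def apply(self, aspect):
        # Modify the sentences one by one, leaving the loaded data untouched.
        return [aspect(sentence) for sentence in self._data]

    def save(self, data, path):
        # Write the same one-sentence-per-line layout that load() expects.
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(sentence + "\n" for sentence in data)


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "original.txt")
    target = os.path.join(tmp, "modified.txt")
    with open(source, "w", encoding="utf-8") as f:
        f.write("First sentence.\nSecond sentence.\n")

    dataset = LineDataset()
    dataset.load(source)
    modified = dataset.apply(str.upper)  # any str -> str callable works
    dataset.save(modified, target)

    with open(target, encoding="utf-8") as f:
        saved = f.read()

print(saved)
```

Since the saved file has the same layout as the input, it can be fed back to load or to an existing pipeline without changes.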