Datasets

Details on the popular datasets for various NLP problems that wild-nlp supports.

Base class

class wildnlp.datasets.base.Dataset(*args, **kwargs)

apply(*args, **kwargs): The method should iterate through texts in the dataset and apply a given aspect to them.

data

Property of the Dataset class.

Returns:	Internal object storing a loaded dataset.

load(*args, **kwargs): The method should handle loading and parsing of a specific dataset.

save(*args, **kwargs): The method should handle saving a dataset in its specific format.
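
The load/apply/save contract above can be illustrated with a small stand-in. This is a minimal sketch, not the wildnlp source: `DatasetSketch` only mirrors the documented method names, and `LinesDataset` (one text per line of a plain-text file) is a hypothetical subclass invented for the example.

```python
from abc import ABC, abstractmethod


class DatasetSketch(ABC):
    """Stand-in mirroring the documented interface; not the wildnlp class."""

    def __init__(self):
        self._data = None

    @property
    def data(self):
        # Internal object storing the loaded dataset.
        return self._data

    @abstractmethod
    def load(self, path):
        """Load and parse a specific dataset."""

    @abstractmethod
    def apply(self, aspect):
        """Apply an aspect (a text -> text callable) to the texts."""

    @abstractmethod
    def save(self, data, path):
        """Write the data back to disk."""


class LinesDataset(DatasetSketch):
    """Hypothetical dataset: one text per line in a plain-text file."""

    def load(self, path):
        with open(path, encoding="utf-8") as f:
            self._data = [line.rstrip("\n") for line in f]

    def apply(self, aspect):
        # Returns the modified texts; the loaded data is left untouched.
        return [aspect(text) for text in self._data]

    def save(self, data, path):
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(data) + "\n")
```

Any callable that maps a string to a string, such as `str.upper`, can stand in for an aspect here.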

CoNLL 2003

class wildnlp.datasets.conll.CoNLL(*args, **kwargs)

Bases: wildnlp.datasets.base.Dataset

The CoNLL-2003 shared task data for language-independent named entity recognition. For details see: https://www.clips.uantwerpen.be/conll2003/ner/

apply(aspect, apply_to_ne=False)

Parameters:	aspect – transformation function.
	apply_to_ne – if False, the transformation is not applied to named entities. If True, the transformation is applied only to named entities.
Returns:	modified dataset in the following form:

[{tokens: array(<tokens>),
  pos_tags: array(<pos_tags>),
  chunk_tags: array(<chunk_tags>),
  ner_tags: array(<ner_tags>)},

  ...,

  ]
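
The effect of apply_to_ne can be sketched on a single record. This is an illustration of the documented behaviour only, not the library's implementation: `apply_to_tokens` is a hypothetical helper, it works token by token, it uses plain lists instead of arrays, and it assumes that non-entity tokens carry the tag "O", as in the CoNLL-2003 files.

```python
def apply_to_tokens(record, aspect, apply_to_ne=False):
    """Return a copy of `record` with `aspect` applied to selected tokens.

    apply_to_ne=False: only tokens outside named entities are transformed.
    apply_to_ne=True:  only tokens inside named entities are transformed.
    """
    tokens = []
    for token, ner_tag in zip(record["tokens"], record["ner_tags"]):
        is_entity = ner_tag != "O"  # assumption: "O" marks non-entities
        tokens.append(aspect(token) if is_entity == apply_to_ne else token)
    # Tags are carried over unchanged; only the tokens are replaced.
    return {**record, "tokens": tokens}


record = {
    "tokens": ["John", "lives", "in", "Paris"],
    "pos_tags": ["NNP", "VBZ", "IN", "NNP"],
    "chunk_tags": ["I-NP", "I-VP", "I-PP", "I-NP"],
    "ner_tags": ["I-PER", "O", "O", "I-LOC"],
}
print(apply_to_tokens(record, str.upper)["tokens"])
print(apply_to_tokens(record, str.upper, apply_to_ne=True)["tokens"])
```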

load(path)

Reads a CoNLL dataset file and loads it into an internal data structure in the following form:

[{tokens: array(<tokens>),
  pos_tags: array(<pos_tags>),
  chunk_tags: array(<chunk_tags>),
  ner_tags: array(<ner_tags>)},

  ...,

  ]

Parameters:	path – A path to a file with CoNLL data
Returns:	None
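
How a CoNLL-2003 file maps onto this structure can be sketched as follows. `parse_conll` is a hypothetical helper, not the library's loader; it assumes the usual file layout (one token per line with four whitespace-separated columns, blank lines between sentences, `-DOCSTART-` lines between documents) and returns plain lists rather than arrays.

```python
def parse_conll(text):
    """Parse CoNLL-2003 formatted text into a list of per-sentence dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split())  # [token, pos, chunk, ner]
    if current:
        sentences.append(current)

    keys = ("tokens", "pos_tags", "chunk_tags", "ner_tags")
    # zip(*rows) transposes rows of columns into columns of rows.
    return [dict(zip(keys, map(list, zip(*rows)))) for rows in sentences]


sample = (
    "-DOCSTART- -X- O O\n"
    "\n"
    "EU NNP I-NP I-ORG\n"
    "rejects VBZ I-VP O\n"
    "\n"
    "Peter NNP I-NP I-PER\n"
)
parsed = parse_conll(sample)
```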

save(data, path)

Saves data in the CoNLL format.

Parameters:	data – list of dictionaries in the following form:

[{tokens: array(<tokens>),
  pos_tags: array(<pos_tags>),
  chunk_tags: array(<chunk_tags>),
  ner_tags: array(<ner_tags>)},

  ...,

  ]

Parameters:	path – Path to save the file. If the file exists, it will be overwritten.
Returns:	None
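
Going the other way, a list of such dictionaries can be serialised back to CoNLL-style text. The sketch below is an assumption about the layout (space-separated columns, one blank line between sentences), and `format_conll` and `save_conll` are hypothetical names rather than wildnlp functions.

```python
def format_conll(data):
    """Render a list of sentence dicts as CoNLL-style text."""
    blocks = []
    for record in data:
        rows = zip(record["tokens"], record["pos_tags"],
                   record["chunk_tags"], record["ner_tags"])
        blocks.append("\n".join(" ".join(row) for row in rows))
    # Sentences are separated by a single blank line.
    return "\n\n".join(blocks) + "\n"


def save_conll(data, path):
    """Write the formatted text to `path`."""
    # Mode "w" truncates an existing file, so it is overwritten.
    with open(path, "w", encoding="utf-8") as f:
        f.write(format_conll(data))
```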

SNLI

class wildnlp.datasets.snli.SNLI(*args, **kwargs)

Bases: wildnlp.datasets.base.Dataset

The SNLI dataset supporting the task of natural language inference. For details see: https://nlp.stanford.edu/projects/snli/

apply(aspect): Modifies premises (sentence1) in the dataset, leaving other data intact.

load(path)

Loads an SNLI dataset.

Parameters:	path – A path to an SNLI data file in JSONL format.
Returns:	None

save(data, path): Saves data in the SNLI format.
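
A minimal sketch of what modifying only the premises means for JSONL data is given below. It uses the standard `json` module on the published SNLI field names (`sentence1`, `sentence2`, `gold_label`); `apply_to_premises` and `to_jsonl` are hypothetical helpers, not part of wildnlp.

```python
import json


def apply_to_premises(jsonl_text, aspect):
    """Parse JSONL text and apply `aspect` to each record's premise."""
    records = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        # Only sentence1 (the premise) changes; other fields stay intact.
        record["sentence1"] = aspect(record["sentence1"])
        records.append(record)
    return records


def to_jsonl(records):
    """Serialise records back to JSONL, one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)
```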

SQuAD

class wildnlp.datasets.squad.SQuAD

Bases: wildnlp.datasets.base.Dataset

The SQuAD dataset. For details see: https://rajpurkar.github.io/SQuAD-explorer/

apply(aspect): Modifies questions in the dataset, leaving other data intact.

load(path)

Loads a SQuAD dataset.

Parameters:	path – A path to a SQuAD data file in JSON format.
Returns:	None

save(data, path): Saves data in the SQuAD format.
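
The question-only modification can be sketched against the nested layout of the published SQuAD files (`data` → `paragraphs` → `qas` → `question`). `apply_to_questions` is a hypothetical helper written for illustration, and it is only assumed that the library walks the structure in a similar way.

```python
import copy


def apply_to_questions(squad, aspect):
    """Return a deep copy of a SQuAD dict with `aspect` applied to questions."""
    result = copy.deepcopy(squad)  # leave the caller's data untouched
    for article in result["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Contexts and answers are not modified.
                qa["question"] = aspect(qa["question"])
    return result


squad = {
    "version": "1.1",
    "data": [{
        "title": "T",
        "paragraphs": [{
            "context": "Paris is in France.",
            "qas": [{
                "id": "q1",
                "question": "Where is Paris?",
                "answers": [{"text": "France", "answer_start": 12}],
            }],
        }],
    }],
}
modified = apply_to_questions(squad, str.lower)
```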

IMDB

class wildnlp.datasets.imdb.IMDB(*args, **kwargs)

Bases: wildnlp.datasets.base.Dataset

The IMDB dataset containing movie reviews for sentiment analysis. The dataset consists of 50,000 reviews in two classes, negative and positive. Each review is stored in a separate text file. For details see: http://ai.stanford.edu/~amaas/data/sentiment/

apply(aspect): Modifies the entire contents of each file in the IMDB dataset.

load(path)

Loads the IMDB dataset.

Parameters:	path – A path to a single file, a directory containing review files, or a list of paths to such directories.
Returns:	None
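
The three accepted forms of path could be normalised into a flat list of review files along these lines. This is a sketch under stated assumptions, not the library's code: `collect_review_files` is a hypothetical helper, and the `.txt` filter assumes the file naming used in the original dataset archive.

```python
import os


def collect_review_files(path):
    """Expand a file path, a directory path, or a list of paths into files."""
    # A single str/PathLike is wrapped so all cases share one loop.
    paths = [path] if isinstance(path, (str, os.PathLike)) else list(path)
    files = []
    for p in paths:
        if os.path.isdir(p):
            files.extend(
                os.path.join(p, name)
                for name in sorted(os.listdir(p))
                if name.endswith(".txt")  # assumption: reviews are .txt files
            )
        else:
            files.append(p)
    return files
```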

save(data, path)

Saves IMDB reviews to separate files with the original names.

Parameters:	path – path to a top directory where files will be saved.
Returns:	None

save_tsv(data, path)

Convenience function for saving IMDB reviews into a single TSV file.

Parameters:	path – Path to a tab separated file.
Returns:	None
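
Writing reviews to a single tab-separated file can be sketched with the standard `csv` module. The column layout here (label, text) is purely an assumption for illustration, since the columns written by save_tsv are not documented above, and `write_reviews_tsv` is a hypothetical helper.

```python
import csv


def write_reviews_tsv(rows, path):
    """Write (label, text) pairs to a tab-separated file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        # Fields containing tabs or newlines are quoted automatically.
        writer.writerows(rows)
```

Using `csv.writer` rather than joining fields with `"\t"` by hand keeps reviews that contain tabs or line breaks readable by `csv.reader` with the same delimiter.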