Datasets

Details on the popular datasets for various NLP problems that wild-nlp supports.

Base class

class wildnlp.datasets.base.Dataset(*args, **kwargs)
apply(*args, **kwargs)

The method should iterate through texts in the dataset and apply a given aspect to them.

data

Property of the Dataset class.

Returns: Internal object storing a loaded dataset.
load(*args, **kwargs)

The method should handle loading and parsing of a specific dataset.

save(*args, **kwargs)

The method should handle saving of a specific dataset.
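
The contract described above can be sketched as an abstract base class. This is a minimal illustration, not the wild-nlp source: SketchDataset and LinesDataset are hypothetical names, and only the method names load, apply and save and the data property come from this page.

```python
from abc import ABC, abstractmethod


class SketchDataset(ABC):
    """Hypothetical stand-in for the Dataset interface described above."""

    def __init__(self):
        self._data = None  # internal object storing the loaded dataset

    @property
    def data(self):
        return self._data

    @abstractmethod
    def load(self, *args, **kwargs):
        """Load and parse a specific dataset."""

    @abstractmethod
    def apply(self, *args, **kwargs):
        """Apply an aspect to the texts in the dataset."""

    @abstractmethod
    def save(self, *args, **kwargs):
        """Write a dataset out."""


class LinesDataset(SketchDataset):
    """Toy subclass (an assumption): one text per line, kept in memory."""

    def load(self, text):
        self._data = text.splitlines()

    def apply(self, aspect):
        # Return modified texts and leave the loaded data untouched.
        return [aspect(line) for line in self._data]

    def save(self, data):
        return "\n".join(data)


dataset = LinesDataset()
dataset.load("first text\nsecond text")
modified = dataset.apply(str.upper)
```

Here any callable that maps a string to a string plays the role of an aspect.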

CoNLL 2003

class wildnlp.datasets.conll.CoNLL(*args, **kwargs)

Bases: wildnlp.datasets.base.Dataset

The CoNLL-2003 shared task data for language-independent named entity recognition. For details see: https://www.clips.uantwerpen.be/conll2003/ner/

apply(aspect, apply_to_ne=False)
Parameters:
  • aspect – transformation function
  • apply_to_ne – if False, the transformation is not applied to named entities. If True, the transformation is applied only to named entities.
Returns:

modified dataset in the following form:

[{tokens: array(<tokens>),
  pos_tags: array(<pos_tags>),
  chunk_tags: array(<chunk_tags>),
  ner_tags: array(<ner_tags>)},

  ...,

  ]
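
The effect of apply_to_ne can be illustrated on a single sentence. This is a sketch under stated assumptions rather than the library's code: apply_to_sentence is a hypothetical helper, a token is treated as a named entity when its NER tag is not "O", and plain lists stand in for the arrays shown above.

```python
def apply_to_sentence(sentence, aspect, apply_to_ne=False):
    """Return a copy of one sentence dict with `aspect` applied to selected tokens.

    Assumption: a token is a named entity when its NER tag is not "O".
    """
    modified = dict(sentence)
    modified["tokens"] = [
        # Transform the token only when its entity status matches apply_to_ne.
        aspect(token) if (tag != "O") == apply_to_ne else token
        for token, tag in zip(sentence["tokens"], sentence["ner_tags"])
    ]
    return modified


sentence = {
    "tokens": ["Germany", "imports", "lamb"],
    "pos_tags": ["NNP", "VBZ", "NN"],
    "chunk_tags": ["I-NP", "I-VP", "I-NP"],
    "ner_tags": ["I-LOC", "O", "O"],
}

skip_entities = apply_to_sentence(sentence, str.upper)
only_entities = apply_to_sentence(sentence, str.upper, apply_to_ne=True)
```

With the default setting only "imports" and "lamb" are transformed; with apply_to_ne=True only "Germany" is.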
load(path)

Reads a CoNLL dataset file and loads it into an internal data structure in the following form:

[{tokens: array(<tokens>),
  pos_tags: array(<pos_tags>),
  chunk_tags: array(<chunk_tags>),
  ner_tags: array(<ner_tags>)},

  ...,

  ]
Parameters: path – A path to a file with CoNLL data
Returns: None
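
How a CoNLL-2003 file maps onto that structure can be sketched with a small parser. The function parse_conll is hypothetical and works on a string rather than a path so that it stays self-contained. It assumes the usual layout of four whitespace-separated columns per token, blank lines between sentences, and -DOCSTART- marker lines, and it uses lists where the library output is shown with arrays.

```python
def parse_conll(text):
    """Parse CoNLL-2003 style text into a list of per-sentence dicts."""
    keys = ("tokens", "pos_tags", "chunk_tags", "ner_tags")
    sentences, rows = [], []
    for line in text.splitlines() + [""]:  # trailing "" flushes the last sentence
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            continue  # document boundary marker, not a token
        if line:
            rows.append(line.split())
        elif rows:
            # Transpose token rows into one list per column.
            columns = zip(*rows)
            sentences.append({key: list(col) for key, col in zip(keys, columns)})
            rows = []
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O

Peter NNP I-NP I-PER
"""

parsed = parse_conll(sample)
```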
save(data, path)

Saves data in the CoNLL format

Parameters:
  • data – list of dictionaries in the following form:

    [{tokens: array(<tokens>),
      pos_tags: array(<pos_tags>),
      chunk_tags: array(<chunk_tags>),
      ner_tags: array(<ner_tags>)},

      ...,

      ]
  • path – path to save the file. If the file exists, it will be overwritten.
Returns: None

SNLI

class wildnlp.datasets.snli.SNLI(*args, **kwargs)

Bases: wildnlp.datasets.base.Dataset

The SNLI dataset supporting the task of natural language inference. For details see: https://nlp.stanford.edu/projects/snli/

apply(aspect)

Modifies premises (sentence1) in the dataset leaving other data intact.
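
The behaviour described above can be sketched on raw JSONL text. The helper modify_premises is hypothetical and is not the library's implementation. Real SNLI records also carry further fields (for example parse trees of each sentence), and how wild-nlp treats those is not stated here, so the sketch touches only sentence1.

```python
import json


def modify_premises(jsonl_text, aspect):
    """Apply `aspect` to the sentence1 field of every JSONL record."""
    records = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        record = json.loads(line)
        record["sentence1"] = aspect(record["sentence1"])
        records.append(record)
    return records


sample = "\n".join([
    json.dumps({"gold_label": "entailment",
                "sentence1": "A man plays a guitar.",
                "sentence2": "A person makes music."}),
    json.dumps({"gold_label": "contradiction",
                "sentence1": "Two dogs run outside.",
                "sentence2": "The dogs are asleep."}),
])

records = modify_premises(sample, str.lower)
```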

load(path)

Loads an SNLI dataset.

Parameters: path – A path to an SNLI data file in JSONL format.
Returns: None
save(data, path)

Saves data in the SNLI format

SQuAD

class wildnlp.datasets.squad.SQuAD

Bases: wildnlp.datasets.base.Dataset

The SQuAD dataset. For details see: https://rajpurkar.github.io/SQuAD-explorer/

apply(aspect)

Modifies questions in the dataset leaving other data intact.
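
As an illustration of what modifying only the questions means, the sketch below walks the nested structure of a SQuAD file (data, then paragraphs, then qas). The helper modify_questions is hypothetical and not the library's code; whether wild-nlp copies the data or modifies it in place is an assumption made here for safety.

```python
import copy


def modify_questions(squad, aspect):
    """Return a deep copy of a SQuAD-style dict with `aspect` applied to questions."""
    modified = copy.deepcopy(squad)
    for article in modified["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa["question"] = aspect(qa["question"])
    return modified


squad = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Warsaw is the capital of Poland.",
            "qas": [{
                "id": "q1",
                "question": "What is the capital of Poland?",
                "answers": [{"text": "Warsaw", "answer_start": 0}],
            }],
        }],
    }],
}

modified = modify_questions(squad, str.upper)
```

Contexts and answers stay intact, so the answer_start offsets remain valid.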

load(path)

Loads a SQuAD dataset.

Parameters: path – A path to a SQuAD data file in JSON format.
Returns: None
save(data, path)

Saves data in the SQuAD format

IMDB

class wildnlp.datasets.imdb.IMDB(*args, **kwargs)

Bases: wildnlp.datasets.base.Dataset

The IMDB dataset containing movie reviews for sentiment analysis. The dataset consists of 50,000 reviews in two classes, negative and positive. Each review is stored in a separate text file. For details see: http://ai.stanford.edu/~amaas/data/sentiment/

apply(aspect)

Modifies contents of the whole files in the IMDB dataset.

load(path)

Loads the IMDB dataset.

Parameters: path – A path to a single file, a directory containing review files, or a list of paths to such directories.
Returns: None
save(data, path)

Saves IMDB reviews to separate files with the original names.

Parameters: path – Path to a top directory where files will be saved.
Returns: None
save_tsv(data, path)

Convenience function for saving IMDB reviews into a single TSV file.

Parameters: path – Path to a tab-separated file.
Returns: None
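
Writing reviews to a single tab-separated file can be sketched with the standard csv module. The column layout that save_tsv actually produces is not documented on this page, so the (label, text) layout below is an assumption, and reviews_to_tsv is a hypothetical helper that writes to any file-like object.

```python
import csv
import io


def reviews_to_tsv(reviews, handle):
    """Write (label, text) pairs as tab-separated rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for label, text in reviews:
        # Collapse tabs and newlines so each review stays on a single row.
        writer.writerow([label, " ".join(text.split())])


reviews = [("pos", "Great film.\nWould watch again."), ("neg", "Dull\tand slow.")]
buffer = io.StringIO()
reviews_to_tsv(reviews, buffer)
tsv_text = buffer.getvalue()
```

Normalising whitespace inside each review is a design choice of this sketch: it keeps one review per line, which makes the file easy to read back line by line.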