InContextLearningDataset#

class composer.datasets.InContextLearningDataset(dataset_uri, tokenizer, max_seq_len, pad_tok_id, num_fewshot, fewshot_random_seed, prompt_string, example_delimiter, continuation_delimiter, destination_path, prelimiter='', context_key='context', answer_key='answer', strip_dataset=True, padding_side='right', tokenize_labels=True, static_keys=None, list_keys=None, tensor_keys=None, padding_size=None, base_batch=None, batch_mapping=None, hf_loading_vars=None, hf_parsing_map=None, generation_kwargs=None)[source]#

A base dataset that constructs batches for in-context learning task evaluations. The dataset format is expected to be a local jsonl file, a cloud link to a jsonl file, or a Hugging Face dataset link. 'context' refers to the input a model will receive before generating an output. For example, the question in a question answering task, the preceding text in a language modeling task, or the document and question regarding the document in a document understanding task. 'example' refers to a loaded dictionary, generally containing a context, an answer, and any other information needed to run the task. 'answer' refers to the desired output of the model.
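
For example, a local jsonl file using the default keys ('context' and 'answer') might contain rows like the following (contents illustrative):

    {"context": "What is the capital of France?", "answer": "Paris"}
    {"context": "What is 2 + 2?", "answer": "4"}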

When creating a new ICL Dataset, it is likely that you will need to reimplement the following methods:

  • construct_context(): Takes a single example dictionary and formulates the context as a string for that eval question.

  • get_answer_from_example(): Takes a single example dictionary and formulates the correct, ground truth answer as a string.

  • tokenize_example(): Tokenizes the example and adds any extra content from the original dictionary that needs to be passed downstream.

  • read_dataset(): Loads the dataset and does basic parsing. If additional parsing must be done, this is a good place to do so (See InContextLearningQATaskDataset.read_dataset())

Additionally, base_batch and batch_mapping must be defined.

  • base_batch (Dict): The base dictionary that the dataset will use to construct a batch. This should contain static values, like generation_kwargs or mode, and empty lists for values that will need to be accumulated from each example. NOTE: Sometimes you will need to set base_batch directly after the init call, e.g. in order to use class variables like self.pad_tok_id or self.max_answer_length. If you manually set generation_kwargs this way, you'll need to call self.update_generation_kwargs() after setting self.base_batch.

  • batch_mapping (Dict): A mapping with keys that are keys in the batch and values that are columns in the loaded dataset. collate_fn will use this mapping to create batches from self.dataset.
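
As an illustration, a minimal subclass might look like the sketch below. The class name, mode string, and batch keys are hypothetical, and a real task dataset will usually also override the methods listed above:

    from composer.datasets import InContextLearningDataset

    class MyICLTaskDataset(InContextLearningDataset):
        # Hypothetical subclass, for illustration only.

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # base_batch is set after super().__init__() because it uses the
            # class variable self.pad_tok_id. Static values sit alongside
            # empty lists that collate_fn fills, one entry per example.
            self.base_batch = {
                'mode': 'generate',  # illustrative static value
                'input_ids': [],
                'labels': [],
                'generation_kwargs': {'pad_token_id': self.pad_tok_id},
            }
            # Since generation_kwargs was set manually here, any user-supplied
            # kwargs must be re-applied via self.update_generation_kwargs()
            # (see update_generation_kwargs() below).
            # batch_mapping maps batch keys to columns of self.dataset.
            self.batch_mapping = {
                'input_ids': self.context_key,
                'labels': self.answer_key,
            }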

Parameters
  • dataset_uri (str) – A local path, a remote path beginning with s3:// or another backend, or a HuggingFace dataset uri prepended with hf://. Alternate backends must be supported by composer.utils.maybe_create_object_store_from_uri(). A local dataset must consist of rows of JSON data points with task dependent fields. The default keys expected are 'context' and 'answer'.

  • tokenizer (PreTrainedTokenizerBase) – The tokenizer used to map between strings and token ids.

  • max_seq_len (int) – The maximum sequence length supported by the model.

  • pad_tok_id (int) – The special token used for padding batches.

  • num_fewshot (int) – The number of complete fewshot examples to prepend before each test example. These are not identical across examples.

  • fewshot_random_seed (int) – Random seed to use for fewshot sampling.

  • prompt_string (str) – Prompt string to put once before all fewshot examples/test examples (e.g. 'Translate English to French.').

  • example_delimiter (str) – Separator inserted before (context, answer) pairs (e.g. '\n') for fewshot sampling and prompting.

  • continuation_delimiter (str) – Separator inserted between context and answer in each example (e.g. '\nA: ').

  • destination_path (str) – Temporary path to store downloaded datasets.

  • prelimiter (str) – Text to be prepended before each context, including fewshot examples (e.g. 'Question: ').

  • context_key (str) – The key in the loaded dataset that contains the context.

  • answer_key (str) – The key in the loaded dataset that contains the answer.

  • strip_dataset (bool) – Boolean for whether to strip whitespace from data. Trailing whitespace can cause degenerative outputs, so unless whitespace should be preserved (for example in code), this should be set to True.

  • padding_side (str) – Side of the content and answer on which to apply padding. Can be either 'right' or 'left'.

  • padding_size (int) – The final size of the tensor after padding. Defaults to max_seq_len.

  • base_batch (Dict) – The base dictionary upon which a batch is created. See above for more details.

  • batch_mapping (Dict) – A mapping of batch keys to dataset columns, used to create batches. See above for more details.

  • hf_loading_vars (Dict) – A dictionary containing keyword arguments to be passed into load_dataset if the dataset is being pulled from HF.

  • hf_parsing_map (Dict) – A dictionary containing a mapping from HF columns to ICL dataset keys. The dictionary should be formatted {icl_key: [hf_key1, hf_key2]}. Column contents will be concatenated with ' ' separating them. If not included, the columns already present in the HF dataset will be loaded.

  • tokenize_labels (bool) – Whether or not the labels should be tokenized. Generally determined by which metric a dataset uses.

  • generation_kwargs (Dict) – A dictionary containing keyword arguments to be passed along to the model's generate function.
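
For reference, direct construction might look like the sketch below (in practice you will usually instantiate a task-specific subclass). The file paths and delimiter strings are illustrative; the tokenizer can be any PreTrainedTokenizerBase:

    from transformers import AutoTokenizer
    from composer.datasets import InContextLearningDataset

    tokenizer = AutoTokenizer.from_pretrained('gpt2')
    dataset = InContextLearningDataset(
        dataset_uri='./my_task.jsonl',         # hypothetical local jsonl file
        tokenizer=tokenizer,
        max_seq_len=1024,
        pad_tok_id=tokenizer.eos_token_id,     # gpt2 has no pad token, so reuse EOS
        num_fewshot=5,
        fewshot_random_seed=1234,
        prompt_string='Answer the following questions.\n',
        example_delimiter='\n',
        continuation_delimiter='\nA: ',
        destination_path='/tmp/icl_task.jsonl',
        prelimiter='Q: ',
    )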

collate_fn(data)[source]#

The function that the dataloader uses to accumulate data into batches.

Parameters

data (List) – List of tokenized datapoints (dicts returned by self.tokenize_example)

Returns

Dict – Dictionary for a single batch
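
In typical use, collate_fn is handed to a torch DataLoader so each yielded batch has the shape of base_batch (a sketch; the batch size is illustrative):

    from torch.utils.data import DataLoader

    dataloader = DataLoader(
        dataset,                        # an InContextLearningDataset, as above
        batch_size=8,                   # illustrative
        collate_fn=dataset.collate_fn,
    )
    batch = next(iter(dataloader))      # dict built from base_batch and batch_mapping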

construct_context(example, preceding_text='', add_answer=False)[source]#

Takes an example and constructs a context, i.e. the input the model reads for this example. Optionally adds the correct answer (for fewshot examples) and handles example delimiters.

Parameters
  • example (Dict) – The example from which to construct the context

  • preceding_text (str) – Any preceding text, used as a check for prepending self.example_delimiter

  • add_answer (bool) – Bool for whether or not to add the answer on the end of the context (e.g. for fewshot examples)

Returns

str – The constructed context. The default output context is formatted as follows: f'{self.prelimiter}{example[self.context_key]}{self.continuation_delimiter}'
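
Concretely, with prelimiter='Q: ', continuation_delimiter='\nA: ', and example_delimiter='\n', the default behavior would be roughly as follows (values illustrative):

    example = {'context': 'What is the capital of France?', 'answer': 'Paris'}

    dataset.construct_context(example)
    # -> 'Q: What is the capital of France?\nA: '

    dataset.construct_context(example, preceding_text='...', add_answer=True)
    # -> '\nQ: What is the capital of France?\nA: Paris'
    #    (example_delimiter is prepended because preceding_text is non-empty)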

get_answer_from_example(example, in_context=False)[source]#

Returns the answer from the example.

Parameters

example (Dict) – The example from which to retrieve the answer

Returns

str – The answer in the example

read_dataset(dataset_uri, destination_path, hf_loading_vars=None, hf_parsing_map=None)[source]#

Reads a dataset and handles parsing it from HuggingFace.

Parameters
  • dataset_uri (str) – A local path, a remote path beginning with s3:// or another backend, or a HuggingFace dataset uri. Alternate backends must be supported by composer.utils.maybe_create_object_store_from_uri().

  • destination_path (str) – A local path where the data will be stored

  • hf_loading_vars (Dict) – If parsing from HuggingFace, keyword args that will be passed into load_dataset

  • hf_parsing_map (Dict) – Dictionary in the form of {icl_key: [hf_col1, hf_col2]} that will map one or more hf columns, in order, to ICL dataset columns

Returns

dataset – A loaded HF dataset
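
For a HuggingFace-hosted dataset, the loading variables and parsing map might be used as in the sketch below; the dataset name and column names are hypothetical, and common_kwargs stands in for the remaining constructor arguments shown earlier:

    common_kwargs = dict(tokenizer=tokenizer, max_seq_len=1024,
                         pad_tok_id=tokenizer.eos_token_id, num_fewshot=0,
                         fewshot_random_seed=1234, prompt_string='',
                         example_delimiter='\n', continuation_delimiter='\nA: ',
                         destination_path='/tmp/icl_hf.jsonl')

    dataset = InContextLearningDataset(
        dataset_uri='hf://org/my_dataset',  # hypothetical HF dataset
        # Passed through to load_dataset:
        hf_loading_vars={'split': 'test', 'name': 'subset_a'},
        # Concatenate 'passage' and 'question' (space-separated) into the ICL
        # 'context' key, and map 'label' to 'answer':
        hf_parsing_map={'context': ['passage', 'question'], 'answer': ['label']},
        **common_kwargs,
    )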

split_batch(batch, microbatch_size)[source]#

Handles splitting a batch into microbatches, including certain specialty columns that must be split in different formats.

Parameters
  • batch (Dict) – Batch of data

  • microbatch_size (int | float) – Size of microbatches

Returns

List – List of chunked batches
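
For example, assuming list- and tensor-valued entries are chunked while static entries are carried into every microbatch, splitting would look roughly like this:

    batch = dataset.collate_fn([dataset[i] for i in range(8)])  # batch of 8 examples
    microbatches = dataset.split_batch(batch, microbatch_size=4)
    # -> list of 2 batch dicts: list/tensor values are chunked into groups
    #    of 4, while static values (e.g. 'mode', 'generation_kwargs') are
    #    repeated unchanged in each microbatch.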

tokenize_example(prompt_and_fewshot, ctxt, example)[source]#

Runs text through the tokenizer and handles special cases.

Parameters
  • prompt_and_fewshot (str) – The collection of the prompt and fewshot examples that belongs before the example's context

  • ctxt (str) – The specific example's derived context

  • example (Dict) – The example as a dictionary. Used for additional processing in inherited classes.

Returns

Dict – Dictionary with the tokenized data

update_generation_kwargs(generation_kwargs)[source]#

Updates self.base_batch with the passed in generation_kwargs. This must be run after self.base_batch is set (for example, if self.base_batch is set after __init__() is run, likely because base_batch needs a class variable like self.pad_tok_id or self.max_answer_length).

Parameters

generation_kwargs (Dict) – Keyword arguments that will be written into base_batch['generation_kwargs']
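
Following the note above, a typical pattern is (key names illustrative):

    # After __init__() has run, class variables like self.pad_tok_id exist,
    # so base_batch can be (re)defined with them:
    dataset.base_batch = {
        'input_ids': [],
        'generation_kwargs': {'pad_token_id': dataset.pad_tok_id},
    }
    # Re-apply user-specified generation kwargs on top of the new base_batch:
    dataset.update_generation_kwargs({'max_new_tokens': 64})  # illustrative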