InContextLearningDataset#

class composer.datasets.InContextLearningDataset(dataset_uri, tokenizer, max_seq_len, pad_tok_id, num_fewshot, fewshot_random_seed, prompt_string, example_delimiter, continuation_delimiter, destination_path, prelimiter='', context_key='context', answer_key='answer', strip_dataset=True, padding_side='right', tokenize_labels=True, static_keys=None, list_keys=None, tensor_keys=None, padding_size=None, base_batch=None, batch_mapping=None, hf_loading_vars=None, hf_parsing_map=None, generation_kwargs=None)[source]#

A base dataset that constructs batches for in-context learning task evaluations. The dataset format is expected to be a local jsonl file, a cloud link to a jsonl file, or a Hugging Face dataset link. 'context' refers to the input a model will receive before generating an output. For example, the question in a question answering task, the preceding text in a language modeling task, or the document and question regarding the document in a document understanding task. 'example' refers to a loaded dictionary, generally containing a context, an answer, and any other information needed to run the task. 'answer' refers to the desired output of the model.
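
For example, a local jsonl file using the default keys ('context' and 'answer') might contain rows like the following (contents illustrative):

    {"context": "What is the capital of France?", "answer": "Paris"}
    {"context": "What is 2 + 2?", "answer": "4"}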

When creating a new ICL Dataset, it is likely that you will need to reimplement the following methods:

  • construct_context(): Takes a single example dictionary and formulates the context as a string for that eval question.

  • get_answer_from_example(): Takes a single example dictionary and formulates the correct, ground truth answer as a string.

  • tokenize_example(): Tokenizes the example and adds any extra content from the original dictionary that needs to be passed downstream.

  • read_dataset(): Loads the dataset and does basic parsing. If additional parsing must be done, this is a good place to do so (See InContextLearningQATaskDataset.read_dataset())

Additionally, base_batch and batch_mapping must be defined.

  • base_batch (Dict): The base dictionary that the dataset will use to construct a batch. This should contain static values, like generation_kwargs or mode, and empty lists for values that will need to be accumulated from each example. NOTE: Sometimes you will need to set base_batch directly after the init call, e.g. in order to use class variables like self.pad_tok_id or self.max_answer_length. If you manually set generation_kwargs this way, you'll need to call self.update_generation_kwargs() after setting self.base_batch.

  • batch_mapping (Dict): A mapping with keys that are keys in the batch and values that are columns in the loaded dataset. collate_fn will use this mapping to create batches from self.dataset.
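
As an illustration, a minimal subclass might look like the sketch below. The class name, mode string, and batch keys are hypothetical, and a real task dataset will usually also override the methods listed above:

    from composer.datasets import InContextLearningDataset

    class MyICLTaskDataset(InContextLearningDataset):
        # Hypothetical subclass, for illustration only.

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # base_batch is set after super().__init__() because it uses the
            # class variable self.pad_tok_id. Static values sit alongside
            # empty lists that collate_fn fills, one entry per example.
            self.base_batch = {
                'mode': 'generate',  # illustrative static value
                'input_ids': [],
                'labels': [],
                'generation_kwargs': {'pad_token_id': self.pad_tok_id},
            }
            # Since generation_kwargs was set manually here, any user-supplied
            # kwargs must be re-applied via self.update_generation_kwargs()
            # (see update_generation_kwargs() below).
            # batch_mapping maps batch keys to columns of self.dataset.
            self.batch_mapping = {
                'input_ids': self.context_key,
                'labels': self.answer_key,
            }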

Parameters
  • dataset_uri (str) – A local path, a remote path beginning with s3:// or another backend, or a HuggingFace dataset uri prepended with hf://. Alternate backends must be supported by composer.utils.maybe_create_object_store_from_uri(). A local dataset must consist of rows of JSON data points with task dependent fields. The default keys expected are 'context' and 'answer'.

  • tokenizer (PreTrainedTokenizerBase) – The tokenizer used to map between strings and token ids.

  • max_seq_len (int) – The maximum sequence length supported by the model.

  • pad_tok_id (int) – The special token used for padding batches.

  • num_fewshot (int) – The number of complete fewshot examples to prepend before each test example. These are not identical across examples.

  • fewshot_random_seed (int) – Random seed to use for fewshot sampling.

  • prompt_string (str) – Prompt string to put once before all fewshot examples/test examples (e.g. 'Translate English to French.').

  • example_delimiter (str) – Separator inserted before (context, answer) pairs (e.g. '\n') for fewshot sampling and prompting.

  • continuation_delimiter (str) – Separator inserted between context and answer in each example (e.g. '\nA: ').

  • destination_path (str) – Temporary path to store downloaded datasets.

  • prelimiter (str) – Text to be prepended before each context, including fewshot examples (e.g. 'Question: ').

  • context_key (str) – The key in the loaded dataset that contains the context.

  • answer_key (str) – The key in the loaded dataset that contains the answer.

  • strip_dataset (bool) – Boolean for whether to strip whitespace from data. Trailing whitespace can cause degenerative outputs, so unless whitespace should be preserved (for example in code), this should be set to True.

  • padding_side (str) – Side of the content and answer on which to apply padding. Can be either 'right' or 'left'.

  • padding_size (int) – The final size of the tensor after padding. Defaults to max_seq_len.

  • base_batch (Dict) – The base dictionary upon which a batch is created. See above for more details.

  • batch_mapping (Dict) – A mapping of batch keys to dataset columns, used to create batches. See above for more details.

  • hf_loading_vars (Dict) – A dictionary containing keyword arguments to be passed into load_dataset if the dataset is being pulled from HF.

  • hf_parsing_map (Dict) – A dictionary containing a mapping from HF columns to ICL dataset keys. The dictionary should be formatted {icl_key: [hf_key1, hf_key2]}. Column contents will be concatenated with ' ' separating them. If not included, the columns already present in the HF dataset will be loaded.

  • tokenize_labels (bool) – Whether or not the labels should be tokenized. Generally determined by which metric a dataset uses.

  • generation_kwargs (Dict) – A dictionary containing keyword arguments to be passed along to the model's generate function.
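
For reference, direct construction might look like the sketch below (in practice you will usually instantiate a task-specific subclass). The file paths and delimiter strings are illustrative; the tokenizer can be any PreTrainedTokenizerBase:

    from transformers import AutoTokenizer
    from composer.datasets import InContextLearningDataset

    tokenizer = AutoTokenizer.from_pretrained('gpt2')
    dataset = InContextLearningDataset(
        dataset_uri='./my_task.jsonl',         # hypothetical local jsonl file
        tokenizer=tokenizer,
        max_seq_len=1024,
        pad_tok_id=tokenizer.eos_token_id,     # gpt2 has no pad token, so reuse EOS
        num_fewshot=5,
        fewshot_random_seed=1234,
        prompt_string='Answer the following questions.\n',
        example_delimiter='\n',
        continuation_delimiter='\nA: ',
        destination_path='/tmp/icl_task.jsonl',
        prelimiter='Q: ',
    )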

collate_fn(data)[source]#

The function that the dataloader uses to accumulate data into batches.

Parameters

data (List) – List of tokenized datapoints (dicts returned by self.tokenize_example)

Returns

Dict – Dictionary for a single batch
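
In typical use, collate_fn is handed to a torch DataLoader so each yielded batch has the shape of base_batch (a sketch; the batch size is illustrative):

    from torch.utils.data import DataLoader

    dataloader = DataLoader(
        dataset,                        # an InContextLearningDataset, as above
        batch_size=8,                   # illustrative
        collate_fn=dataset.collate_fn,
    )
    batch = next(iter(dataloader))      # dict built from base_batch and batch_mapping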

construct_context(example, preceding_text='', add_answer=False)[source]#

Takes an example and constructs a context, i.e. the input the model reads for this example. Optionally adds the correct answer (for fewshot examples) and handles example delimiters.

Parameters
  • example (Dict) – The example from which to construct the context

  • preceding_text (str) – Any preceding text, used as a check for prepending self.example_delimiter

  • add_answer (bool) – Bool for whether or not to add the answer on the end of the context (e.g. for fewshot examples)

Returns

str – The constructed context. The default output context is formatted as follows: f'{self.prelimiter}{example[self.context_key]}{self.continuation_delimiter}'
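
Concretely, with prelimiter='Q: ', continuation_delimiter='\nA: ', and example_delimiter='\n', the default behavior would be roughly as follows (values illustrative):

    example = {'context': 'What is the capital of France?', 'answer': 'Paris'}

    dataset.construct_context(example)
    # -> 'Q: What is the capital of France?\nA: '

    dataset.construct_context(example, preceding_text='...', add_answer=True)
    # -> '\nQ: What is the capital of France?\nA: Paris'
    #    (example_delimiter is prepended because preceding_text is non-empty)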

get_answer_from_example(example, in_context=False)[source]#

Returns the answer from the example.

Parameters

example (Dict) – The example from which to retrieve the answer

Returns

str – The answer in the example

read_dataset(dataset_uri, destination_path, hf_loading_vars=None, hf_parsing_map=None)[source]#

Reads a dataset and handles parsing it from HuggingFace.

Parameters
  • dataset_uri (str) – A local path, a remote path beginning with s3:// or another backend, or a HuggingFace dataset uri. Alternate backends must be supported by composer.utils.maybe_create_object_store_from_uri().

  • destination_path (str) – A local path where the data will be stored

  • hf_loading_vars (Dict) – If parsing from HuggingFace, keyword args that will be passed into load_dataset

  • hf_parsing_map (Dict) – Dictionary in the form of {icl_key: [hf_col1, hf_col2]} that will map one or more hf columns, in order, to ICL dataset columns

Returns

dataset – A loaded HF dataset
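
For a HuggingFace-hosted dataset, the loading variables and parsing map might be used as in the sketch below; the dataset name and column names are hypothetical, and common_kwargs stands in for the remaining constructor arguments shown earlier:

    common_kwargs = dict(tokenizer=tokenizer, max_seq_len=1024,
                         pad_tok_id=tokenizer.eos_token_id, num_fewshot=0,
                         fewshot_random_seed=1234, prompt_string='',
                         example_delimiter='\n', continuation_delimiter='\nA: ',
                         destination_path='/tmp/icl_hf.jsonl')

    dataset = InContextLearningDataset(
        dataset_uri='hf://org/my_dataset',  # hypothetical HF dataset
        # Passed through to load_dataset:
        hf_loading_vars={'split': 'test', 'name': 'subset_a'},
        # Concatenate 'passage' and 'question' (space-separated) into the ICL
        # 'context' key, and map 'label' to 'answer':
        hf_parsing_map={'context': ['passage', 'question'], 'answer': ['label']},
        **common_kwargs,
    )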

split_batch(batch, microbatch_size)[source]#

Handles splitting a batch into microbatches, including certain specialty columns that must be split in different formats.

Parameters
  • batch (Dict) – Batch of data

  • microbatch_size (int | float) – Size of microbatches

Returns

List – List of chunked batches
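
For example, assuming list- and tensor-valued entries are chunked while static entries are carried into every microbatch, splitting would look roughly like this:

    batch = dataset.collate_fn([dataset[i] for i in range(8)])  # batch of 8 examples
    microbatches = dataset.split_batch(batch, microbatch_size=4)
    # -> list of 2 batch dicts: list/tensor values are chunked into groups
    #    of 4, while static values (e.g. 'mode', 'generation_kwargs') are
    #    repeated unchanged in each microbatch.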

tokenize_example(prompt_and_fewshot, ctxt, example)[source]#

Runs text through the tokenizer and handles special cases.

Parameters
  • prompt_and_fewshot (str) – The collection of the prompt and fewshot examples that belongs before the example's context

  • ctxt (str) – The specific example's derived context

  • example (Dict) – The example as a dictionary. Used for additional processing in inherited classes.

Returns

Dict – Dictionary with the tokenized data

update_generation_kwargs(generation_kwargs)[source]#

Updates self.base_batch with the passed in generation_kwargs. This must be run after self.base_batch is set (for example, if self.base_batch is set after __init__() is run, likely because base_batch needs a class variable like self.pad_tok_id or self.max_answer_length).

Parameters

generation_kwargs (Dict) – Keyword arguments that will be written into base_batch['generation_kwargs']
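
Following the note above, a typical pattern is (key names illustrative):

    # After __init__() has run, class variables like self.pad_tok_id exist,
    # so base_batch can be (re)defined with them:
    dataset.base_batch = {
        'input_ids': [],
        'generation_kwargs': {'pad_token_id': dataset.pad_tok_id},
    }
    # Re-apply user-specified generation kwargs on top of the new base_batch:
    dataset.update_generation_kwargs({'max_new_tokens': 64})  # illustrative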