InContextLearningDataset#
- class composer.datasets.InContextLearningDataset(dataset_uri, tokenizer, max_seq_len, pad_tok_id, num_fewshot, fewshot_random_seed, prompt_string, example_delimiter, continuation_delimiter, destination_path, prelimiter='', context_key='context', answer_key='answer', strip_dataset=True, padding_side='right', tokenize_labels=True, static_keys=None, list_keys=None, tensor_keys=None, padding_size=None, base_batch=None, batch_mapping=None, hf_loading_vars=None, hf_parsing_map=None, generation_kwargs=None)[source]#
A base dataset that constructs batches for in-context learning task evaluations. The dataset format is expected to be a local jsonl file, a cloud link to a jsonl file, or a Hugging Face dataset link. "context" refers to the input a model will receive before generating an output. For example, the question in question-answering tasks, the preceding text in a language modeling task, or the document and question regarding the document in a document understanding task. "example" refers to a loaded dictionary, generally containing a context, an answer, and any other information needed to run the task. "answer" refers to the desired output of the model.
When creating a new ICL dataset, you will likely need to reimplement the following methods (see the sketch after this list):
construct_context(): Takes a single example dictionary and formulates the context as a string for that eval question.
get_answer_from_example(): Takes a single example dictionary and formulates the correct, ground truth answer as a string.
tokenize_example(): Tokenizes the example and adds any extra content from the original dictionary that needs to be passed downstream.
read_dataset(): Loads the dataset and does basic parsing. If additional parsing must be done, this is a good place to do so (See InContextLearningQATaskDataset.read_dataset())
Additionally, base_batch and batch_mapping must be defined.
base_batch (Dict): The base dictionary that the dataset will use to construct a batch. This should contain static values, like generation_kwargs or mode, and empty lists for values that will need to be accumulated from each example. NOTE: Sometimes you will need to set base_batch directly after the init call, e.g. in order to use class variables like self.pad_tok_id or self.max_answer_length. If you manually set generation_kwargs this way, you'll need to call self.update_generation_kwargs() after setting self.base_batch.
batch_mapping (Dict): A mapping with keys that are keys in the batch and values that are columns in the loaded dataset. collate_fn will use this mapping to create batches from self.dataset.
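Below is a minimal sketch of such a subclass, demonstrating construct_context() and get_answer_from_example(). The class name and the context format are illustrative assumptions, not part of the library; the overridden methods follow the descriptions above.

```python
from composer.datasets import InContextLearningDataset

class MyQATaskDataset(InContextLearningDataset):
    """Hypothetical subclass for a simple question-answering eval."""

    def construct_context(self, example, preceding_text='', add_answer=False):
        # Format the context from the loaded example dictionary.
        ctxt = f'{self.prelimiter}{example[self.context_key]}{self.continuation_delimiter}'
        if len(preceding_text) > 0:
            # Separate this example from the preceding fewshot examples.
            ctxt = f'{self.example_delimiter}{ctxt}'
        if add_answer:
            # For fewshot examples, append the ground-truth answer.
            ctxt = f'{ctxt}{self.get_answer_from_example(example, in_context=True)}'
        return ctxt

    def get_answer_from_example(self, example, in_context=False):
        # Return the ground-truth answer as a plain string.
        return example[self.answer_key]
```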
- Parameters
dataset_uri (str) – A local path, a remote path beginning with s3:// or another backend, or a HuggingFace dataset uri prepended with hf://. Alternate backends must be supported by composer.utils.maybe_create_object_store_from_uri(). A local dataset must consist of rows of JSON data points with task-dependent fields. The default keys expected are "context" and "answer".
tokenizer (PreTrainedTokenizerBase) – The tokenizer used to map between strings and token ids.
max_seq_len (int) – The maximum sequence length supported by the model.
pad_tok_id (int) – The special token used for padding batches.
num_fewshot (int) – The number of complete fewshot examples to prepend before each test example. These are not identical across examples.
fewshot_random_seed (int) – Random seed to use for fewshot sampling.
prompt_string (str) – Prompt string to put once before all fewshot examples/test examples (e.g. "Translate English to French.").
example_delimiter (str) – Separator inserted before (context, answer) pairs (e.g. "\n") for fewshot sampling and prompting.
continuation_delimiter (str) – Separator inserted between context and answer in each example (e.g. "\nA: ").
destination_path (str) – Temporary path to store downloaded datasets.
prelimiter (str) – Text to be prepended before each context, including fewshot examples (e.g. "Question: ").
context_key (str) – The key in the loaded dataset that contains the context.
answer_key (str) – The key in the loaded dataset that contains the answer.
strip_dataset (bool) – Whether to strip whitespace from the data. Trailing whitespace can cause degenerate outputs, so unless whitespace should be preserved (for example, in code), this should be set to True.
padding_side (str) – Side of the context and answer on which to apply padding. Can be either "right" or "left".
padding_size (int) – The final size of the tensor after padding. Defaults to max_seq_len.
base_batch (Dict) – The base dictionary upon which a batch is created. See above for more details.
batch_mapping (Dict) – A mapping of batch keys to dataset columns, used to create batches. See above for more details.
hf_loading_vars (Dict) – A dictionary containing keyword arguments to be passed into load_dataset if the dataset is being pulled from HF.
hf_parsing_map (Dict) – A dictionary containing a mapping from HF columns to ICL dataset keys. The dictionary should be formatted {icl_key: [hf_key1, hf_key2]}. Column contents will be concatenated with " " separating them. If not included, the columns already present in the HF dataset will be loaded.
tokenize_labels (bool) – Whether or not the labels should be tokenized. Generally determined by which metric a dataset uses.
generation_kwargs (Dict) – A dictionary containing keyword arguments to be passed along to the model's generate function.
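As a hedged usage sketch, a dataset for a local jsonl file might be constructed as follows. The file paths, delimiter strings, and the base_batch/batch_mapping keys are placeholders for illustration only; real subclasses define task-specific batch structures.

```python
from transformers import AutoTokenizer
from composer.datasets import InContextLearningDataset

tokenizer = AutoTokenizer.from_pretrained('gpt2')

dataset = InContextLearningDataset(
    dataset_uri='./data/my_task.jsonl',   # rows of {"context": ..., "answer": ...}
    tokenizer=tokenizer,
    max_seq_len=1024,
    pad_tok_id=tokenizer.eos_token_id,
    num_fewshot=5,
    fewshot_random_seed=1234,
    prompt_string='Answer the following questions.\n',
    example_delimiter='\n',
    continuation_delimiter='\nA: ',
    destination_path='/tmp/my_task.jsonl',
    prelimiter='Q: ',
    # Placeholder batch structure, for illustration only.
    base_batch={'input_ids': [], 'labels': []},
    batch_mapping={'input_ids': 'context', 'labels': 'answer'},
)
```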
- collate_fn(data)[source]#
The function that the dataloader uses to accumulate data into batches.
- Parameters
data (List) – List of tokenized datapoints (dicts returned by self.tokenize_example)
- Returns
Dict – Dictionary for a single batch
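The accumulation implied by batch_mapping can be sketched as follows. This is a simplified illustration of the behavior described above, not the library's exact implementation.

```python
import copy

def collate_fn_sketch(self, data):
    # Start from a fresh copy of the static base batch ...
    batch = copy.deepcopy(self.base_batch)
    for example in data:
        # ... and accumulate each mapped dataset column under its batch key.
        for batch_key, dataset_key in self.batch_mapping.items():
            batch[batch_key].append(example[dataset_key])
    return batch
```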
- construct_context(example, preceding_text='', add_answer=False)[source]#
Takes an example and constructs a context, i.e. the input the model reads for this example. Optionally adds the correct answer (for fewshot examples) and handles example delimiters.
- Parameters
example (Dict) – The example from which to construct the context
preceding_text (str) – Any preceding text; used to determine whether the example delimiter should be prepended
add_answer (bool) – Whether to append the correct answer to the context (e.g. for fewshot examples)
- Returns
str – The constructed context. The default output context is formatted as follows: f"{self.prelimiter}{example[self.context_key]}{self.continuation_delimiter}"
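For example, with hypothetical settings the default format produces:

```python
# Hypothetical settings, for illustration:
prelimiter = 'Q: '
continuation_delimiter = '\nA: '
example = {'context': 'What is the capital of France?'}

context = f"{prelimiter}{example['context']}{continuation_delimiter}"
# context == 'Q: What is the capital of France?\nA: '
```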
- get_answer_from_example(example, in_context=False)[source]#
Returns the answer from the example.
- Parameters
example (Dict) – The example from which to retrieve the answer
- Returns
str – The answer in the example
- read_dataset(dataset_uri, destination_path, hf_loading_vars=None, hf_parsing_map=None)[source]#
Reads a dataset and handles parsing it from HuggingFace.
- Parameters
dataset_uri (str) – A local path, a remote path beginning with s3:// or another backend, or a HuggingFace dataset uri. Alternate backends must be supported by composer.utils.maybe_create_object_store_from_uri().
destination_path (str) – A local path where the data will be stored
hf_loading_vars (Dict) – If parsing from HuggingFace, keyword args that will be passed into load_dataset
hf_parsing_map (Dict) – Dictionary in the form of {icl_key: [hf_col1, hf_col2]} that will map one or more hf columns, in order, to ICL dataset columns
- Returns
dataset – A loaded HF dataset
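As an illustration, a HuggingFace-backed load might look like the following (read_dataset is typically called internally during __init__; it is shown on self here for clarity). The dataset name, subset, and column names are placeholders; the "passage" and "question" columns would be concatenated, separated by a space, into the ICL "context" key.

```python
# Placeholder HF dataset and columns, for illustration only.
dataset = self.read_dataset(
    dataset_uri='hf://some-org/some-dataset',
    destination_path='/tmp/icl_data.jsonl',
    hf_loading_vars={'name': 'some_subset', 'split': 'validation'},
    hf_parsing_map={'context': ['passage', 'question'], 'answer': ['answer']},
)
```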
- split_batch(batch, microbatch_size)[source]#
Splits a batch into microbatches, handling certain specialty columns that must be split in different formats.
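A simplified sketch of the splitting logic, assuming list-valued batch entries are chunked per microbatch while static entries are copied into every microbatch; this is an illustration under those assumptions, not the library's exact handling.

```python
import math

def split_batch_sketch(batch, microbatch_size):
    # Assumes 'input_ids' holds one entry per example in the batch.
    num_examples = len(batch['input_ids'])
    num_chunks = math.ceil(num_examples / microbatch_size)
    microbatches = []
    for i in range(num_chunks):
        start, end = i * microbatch_size, (i + 1) * microbatch_size
        chunk = {}
        for key, value in batch.items():
            if isinstance(value, list):
                chunk[key] = value[start:end]  # per-example values are chunked
            else:
                chunk[key] = value             # static values are shared as-is
        microbatches.append(chunk)
    return microbatches
```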
- tokenize_example(prompt_and_fewshot, ctxt, example)[source]#
Runs text through the tokenizer and handles special cases.
- Parameters
prompt_and_fewshot (str) – The collection of the prompt and fewshot examples that belongs before the example's context
ctxt (str) – The specific example's derived context
example (Dict) – The example as a dictionary
- Returns
Dict – Dictionary with the tokenized data
- update_generation_kwargs(generation_kwargs)[source]#
Updates self.base_batch with the passed in generation_kwargs. This must be run after self.base_batch is set (for example, if self.base_batch is set after __init__() is run, likely because base_batch needs a class variable like self.pad_tok_id or self.max_answer_length).
- Parameters
generation_kwargs (Dict) – Keyword arguments that will be written into base_batch['generation_kwargs']
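A hedged example of the pattern described above, where base_batch depends on instance attributes and is therefore assigned after construction; the batch keys shown are placeholders.

```python
# Illustrative: base_batch needs instance attributes such as
# dataset.pad_tok_id, so it is set after __init__ ...
dataset.base_batch = {
    'input_ids': [],
    'labels': [],
    'generation_kwargs': {'pad_token_id': dataset.pad_tok_id},
}
# ... and update_generation_kwargs() is then re-run so the kwargs are
# merged into base_batch['generation_kwargs'].
dataset.update_generation_kwargs({'max_new_tokens': 32})
```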