InContextLearningSchemaTaskDataset#

class composer.datasets.InContextLearningSchemaTaskDataset(choices_key='context_options', *args, **kwargs)[source]#

A dataset that constructs batches for in-context learning schema evaluation. A schema task involves sentences with a fill-in-the-blank where the user needs to choose the correct word to fill in from a set of N options. We use the partial evaluation technique from https://arxiv.org/abs/1806.02847 to determine the model's choice of fill-in word.

The default input format is a jsonl file with the following fields:

  • context_options: List of strings corresponding to possible preceding context options for the continuation

  • gold: Index of the correct context from 'context_options'

  • continuation: The finishing continuation
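A hypothetical Winograd-style entry in this format (the sentence and values are illustrative only, not drawn from any real dataset):

{"context_options": ["The trophy didn't fit in the suitcase because the trophy", "The trophy didn't fit in the suitcase because the suitcase"], "gold": 0, "continuation": " was too big."}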

Each batch then consists of batch_size // N distinct tasks and has the following structure:

  • input_ids: Input tensor of shape batch x seqlen containing the tokens

  • continuation_indices: List (length batch) of tensors indicating which indices in each sequence correspond to the question answer (aka continuation)

  • mode: Indicates to the model that this is an ICL task and may rely on a custom code path to properly update metrics

  • labels: Identical to the input, used by the model to calculate loss/metrics

  • gold_indices: List of length batch_size // N indicating, for each question, which of the answers is correct (an integer in [0, N-1])

  • choice_groupings: Indicates which indices of the batch correspond to which questions
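As a rough sketch of how such a batch might be laid out for the two-option example above, with N = 2 and four questions per batch (all shapes, values, and the mode string are illustrative assumptions, not taken from the library):

import torch

# Hypothetical batch layout: each question contributes N = 2 rows to input_ids,
# one per context option, all sharing the same continuation tokens.
batch = {
    'input_ids': torch.randint(0, 50257, (8, 1024)),        # batch x seqlen token IDs (placeholder values)
    'continuation_indices': [torch.arange(1000, 1024)] * 8,  # positions of the continuation in each row
    'mode': 'icl_task',                                      # routes the model to the ICL metric path
    'gold_indices': [0, 1, 0, 1],                            # correct option for each of the 4 questions
    'choice_groupings': [(0, 2), (2, 4), (4, 6), (6, 8)],    # rows belonging to each question
}
batch['labels'] = batch['input_ids']  # labels mirror the input for loss/metric computation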

construct_context(example, preceding_text='', add_answer=False)[source]#

Takes an example and constructs the correct context for the example's continuation.

Parameters
  • example (Dict) – The example from which to construct the context

  • preceding_text (str) – Any preceding text, used to determine whether self.example_delimiter needs to be prepended

  • add_answer (bool) – This will always be true when calling this function for SchemaTaskDataset

Returns

str – The single correct context for a given continuation
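A minimal sketch of the idea behind this method, using hypothetical delimiter arguments (the real implementation reads the delimiters and prompt settings from the dataset configuration):

def construct_context_sketch(example, preceding_text='', add_answer=False,
                             example_delimiter='\n', continuation_delimiter=' '):
    # Pick the single correct context out of the example's options.
    context = example['context_options'][example['gold']]
    # If fewshot text precedes this example, separate it with the example delimiter.
    if preceding_text:
        context = f'{example_delimiter}{context}'
    # For schema tasks the continuation (the "answer") is appended as well.
    if add_answer:
        context = f"{context}{continuation_delimiter}{example['continuation']}"
    return context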

tokenize_example(prompt_and_fewshot, context_options, example)[source]#

Runs text through the tokenizer and handles special cases.

Parameters
  • prompt_and_fewshot (str) – The prompt and fewshot examples that come before the example's context

  • context_options (List[str]) – The list of possible contexts derived for the example

  • example (Dict) – The example as a dictionary.

Returns

Dict – Dictionary with the tokenized data
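For illustration only, a rough sketch of how the pieces might be tokenized with a Hugging Face tokenizer; the key names and structure below are assumptions, not the dataset's actual return format:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

def tokenize_example_sketch(prompt_and_fewshot, context_options, example):
    # Tokenize the prompt plus each candidate context separately from the
    # continuation, so a collator can later locate the continuation tokens.
    return {
        'context_options': [
            tokenizer(prompt_and_fewshot + ctx, add_special_tokens=False)['input_ids']
            for ctx in context_options
        ],
        'continuation': tokenizer(example['continuation'], add_special_tokens=False)['input_ids'],
        'gold': example['gold'],
    }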