InContextLearningSchemaTaskDataset#

class composer.datasets.InContextLearningSchemaTaskDataset(choices_key='context_options', *args, **kwargs)[source]#

A dataset that constructs batches for in-context learning schema evaluation. A schema task involves sentences with a fill-in-the-blank where the user needs to choose the correct word to fill in from a set of N options. We use the partial evaluation technique from https://arxiv.org/abs/1806.02847 to determine the model's choice of fill-in word.

The default input format is a jsonl file with the following fields:

  • context_options: List of strings corresponding to possible preceding context options for the continuation

  • gold: Index of the correct context from 'context_options'

  • continuation: The finishing continuation
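A hypothetical Winograd-style entry in this format (the sentence and values are illustrative only, not drawn from any real dataset):

{"context_options": ["The trophy didn't fit in the suitcase because the trophy", "The trophy didn't fit in the suitcase because the suitcase"], "gold": 0, "continuation": " was too big."}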

Each batch then consists of batch_size // N distinct tasks and has the following structure:

  • input_ids: Input tensor of shape batch x seqlen containing the tokens

  • continuation_indices: List (length batch) of tensors indicating which indices in each sequence correspond to the question answer (aka continuation)

  • mode: Indicates to the model that this is an ICL task and may rely on a custom code path to properly update metrics

  • labels: Identical to the input, used by the model to calculate loss/metrics

  • gold_indices: List of length batch_size // N indicating, for each question, which of the answers is correct (an integer in [0, N-1])

  • choice_groupings: Indicates which indices of the batch correspond to which questions
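As a rough sketch of how such a batch might be laid out for the two-option example above, with N = 2 and four questions per batch (all shapes, values, and the mode string are illustrative assumptions, not taken from the library):

import torch

# Hypothetical batch layout: each question contributes N = 2 rows to input_ids,
# one per context option, all sharing the same continuation tokens.
batch = {
    'input_ids': torch.randint(0, 50257, (8, 1024)),        # batch x seqlen token IDs (placeholder values)
    'continuation_indices': [torch.arange(1000, 1024)] * 8,  # positions of the continuation in each row
    'mode': 'icl_task',                                      # routes the model to the ICL metric path
    'gold_indices': [0, 1, 0, 1],                            # correct option for each of the 4 questions
    'choice_groupings': [(0, 2), (2, 4), (4, 6), (6, 8)],    # rows belonging to each question
}
batch['labels'] = batch['input_ids']  # labels mirror the input for loss/metric computation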

construct_context(example, preceding_text='', add_answer=False)[source]#

Takes an example and constructs the correct context for the example's continuation.

Parameters
  • example (Dict) – The example from which to construct the context

  • preceding_text (str) – Any preceding text, used to determine whether self.example_delimiter needs to be prepended

  • add_answer (bool) – This will always be true when calling this function for SchemaTaskDataset

Returns

str – The single correct context for a given continuation
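A minimal sketch of the idea behind this method, using hypothetical delimiter arguments (the real implementation reads the delimiters and prompt settings from the dataset configuration):

def construct_context_sketch(example, preceding_text='', add_answer=False,
                             example_delimiter='\n', continuation_delimiter=' '):
    # Pick the single correct context out of the example's options.
    context = example['context_options'][example['gold']]
    # If fewshot text precedes this example, separate it with the example delimiter.
    if preceding_text:
        context = f'{example_delimiter}{context}'
    # For schema tasks the continuation (the "answer") is appended as well.
    if add_answer:
        context = f"{context}{continuation_delimiter}{example['continuation']}"
    return context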

tokenize_example(prompt_and_fewshot, context_options, example)[source]#

Runs text through the tokenizer and handles special cases.

Parameters
  • prompt_and_fewshot (str) – The prompt and fewshot examples that come before the example's context

  • context_options (List[str]) – The list of possible contexts derived for the example

  • example (Dict) – The example as a dictionary.

Returns

Dict – Dictionary with the tokenized data
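For illustration only, a rough sketch of how the pieces might be tokenized with a Hugging Face tokenizer; the key names and structure below are assumptions, not the dataset's actual return format:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

def tokenize_example_sketch(prompt_and_fewshot, context_options, example):
    # Tokenize the prompt plus each candidate context separately from the
    # continuation, so a collator can later locate the continuation tokens.
    return {
        'context_options': [
            tokenizer(prompt_and_fewshot + ctx, add_special_tokens=False)['input_ids']
            for ctx in context_options
        ],
        'continuation': tokenizer(example['continuation'], add_special_tokens=False)['input_ids'],
        'gold': example['gold'],
    }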