InContextLearningMultipleChoiceTaskDataset#

class composer.datasets.InContextLearningMultipleChoiceTaskDataset(choices_key='choices', static_keys=None, list_of_tensors_keys=None, list_of_tuples_keys=None, list_of_primitives=None, *args, **kwargs)[source]#

A dataset that constructs batches for in-context learning multiple choice evaluation.

If each question has N answer choices, we construct N distinct inputs per question. To ensure consistency across multiple GPUs, we set the batch size to min(N, batch_size) so that all N inputs per question can be stored in the same batch.

The default input format is a jsonl file with the following fields:

- query: The preceding text, question, or document relevant to the choices
- gold: Index of the correct choice under ‘choices’
- choices: A list of strings, each being one of the potential choices
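For example, a single line of the input jsonl file might look like the following (a hypothetical datapoint, shown only to illustrate the three fields):

    {"query": "The capital of France is", "choices": ["London", "Paris", "Rome", "Berlin"], "gold": 1}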

Each batch then consists of |batch_size // N| distinct questions and has the following structure:

- input_ids: Input tensor |batch x seqlen x # tokens|
- continuation_indices: List of |batch| tensors indicating which indices in the sequence correspond to the question answer (aka continuation)
- mode: Indicates to the model that this is an ICL task and may rely on a custom code path to properly update metrics
- labels: Identical to the input, used by the model to calculate loss/metrics
- gold_indices: List of length |batch_size // N| indicating, for each question, which of the answers is correct (via an integer in [0, N-1])
- choice_groupings: Indicates which indices of the batch correspond to which questions
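As a hedged illustration of this structure, suppose N = 4 choices and batch_size = 8, so a batch holds 2 logical questions spread across 8 real entries. The values below are hypothetical and only sketch the shapes described above:

    import torch

    # Hypothetical batch: 2 questions x 4 choices = 8 batch entries.
    batch = {
        'input_ids': torch.zeros(8, 1024, dtype=torch.long),  # |batch x seqlen|
        'continuation_indices': [torch.arange(900, 910) for _ in range(8)],
        'mode': 'icl_task',  # assumed mode string; signals the ICL code path
        'labels': torch.zeros(8, 1024, dtype=torch.long),  # identical to input_ids
        'gold_indices': [1, 3],  # correct choice for each of the 2 questions
        'choice_groupings': [(0, 4), (4, 8)],  # batch rows belonging to each question
    }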

Additional Args:

choices_key (str): The key under which the choices are stored in the saved dataset. Defaults to ‘choices’.

collate_fn(data)[source]#

The function that the dataloader uses to accumulate data into batches. We run each distinct query + answer choice through the model separately and determine which answer has the lowest per-token perplexity.

If each question has N possible choices, all N must be grouped together as distinct elements of the batch, since the batch may consist of multiple questions. choice_groupings indicates which contiguous sequences of elements in the batch correspond to which question, and gold_indices indicates which of the [0, N-1] choices is the correct one for each question.

Parameters

data (List) – List of tokenized datapoints (dicts returned by self._tokenize_example)

Returns

Dict – Dictionary for a single batch
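A minimal sketch of how the lowest per-token-perplexity choice could be picked from such a batch. The select_choices helper below is hypothetical (it is not the library's metric code) and assumes the model's logits are already available; padding handling is omitted:

    import torch
    import torch.nn.functional as F

    def select_choices(logits, batch):
        """For each question, pick the choice whose continuation has the
        lowest mean per-token loss (i.e. the lowest per-token perplexity)."""
        # Per-token cross-entropy against the labels, shifted for a causal LM.
        losses = F.cross_entropy(
            logits[:, :-1].transpose(1, 2),  # |batch x vocab x (seqlen - 1)|
            batch['labels'][:, 1:],          # |batch x (seqlen - 1)|
            reduction='none',
        )
        preds = []
        for start, end in batch['choice_groupings']:
            # Average loss over only the continuation tokens of each choice.
            per_choice = torch.stack([
                losses[i, batch['continuation_indices'][i] - 1].mean()
                for i in range(start, end)
            ])
            preds.append(per_choice.argmin().item())  # index in [0, N-1]
        return preds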

get_answer_from_example(example, in_context=False)[source]#

Returns the correct answer from the example’s choices.

Parameters

example (Dict) – The example from which to retrieve the answer

Returns

str – The full string of the correct answer based on the ‘gold’ key
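Conceptually, this is just indexing the choices with the gold index. A minimal sketch, assuming the default ‘gold’ and ‘choices’ keys:

    example = {
        'query': 'The capital of France is',
        'choices': ['London', 'Paris', 'Rome', 'Berlin'],
        'gold': 1,
    }

    # The correct answer is the choice string at the 'gold' index.
    answer = example['choices'][example['gold']]  # 'Paris'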

split_batch(batch, microbatch_size)[source]#

Split batch while ensuring all continuations are in the same microbatch.

In ICL Multiple Choice, we duplicate each data point for each possible continuation. When splitting a batch, we distinguish between logical examples, each of which refers to one question, and real examples, each of which refers to one possible continuation. Because example count and microbatch_size are tracked in logical examples, we split logical attributes by microbatch_size and real attributes by microbatch_size * num_choices.

Parameters

- batch (Dict) – Batch of data
- microbatch_size (int) – Size of microbatches

Returns

list – List of chunked batches
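For concreteness, a hedged sketch of the split arithmetic, assuming N = 4 choices and microbatch_size = 2: logical attributes (e.g. gold_indices) are chunked 2 at a time, while real attributes (e.g. rows of input_ids) are chunked 2 * 4 = 8 at a time, so every continuation of a question stays in one microbatch:

    def chunk(seq, size):
        """Split a sequence into consecutive chunks of the given size."""
        return [seq[i:i + size] for i in range(0, len(seq), size)]

    num_choices = 4
    microbatch_size = 2  # counted in logical examples (questions)

    gold_indices = [1, 3, 0, 2]   # 4 logical examples
    input_rows = list(range(16))  # 4 questions x 4 choices = 16 real rows

    logical_chunks = chunk(gold_indices, microbatch_size)           # [[1, 3], [0, 2]]
    real_chunks = chunk(input_rows, microbatch_size * num_choices)  # two chunks of 8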

tokenize_example(prompt_and_fewshot, ctxt, example)[source]#

Runs text through the tokenizer and handles special cases.

Parameters

- prompt_and_fewshot (str) – The collection of the prompt and fewshot examples that belongs before the example’s context
- ctxt (str) – The specific example’s derived context
- example (Dict) – The example as a dictionary.

Returns

Dict – Dictionary with the tokenized data
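As a hedged illustration of what tokenizing one multiple choice example involves (the key names below are hypothetical, not necessarily the library's internal keys): the shared prompt + context is tokenized once, and each answer choice is tokenized as a candidate continuation:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('gpt2')

    prompt_and_fewshot = 'Answer the question.\n'
    ctxt = 'The capital of France is'
    choices = ['London', 'Paris', 'Rome', 'Berlin']

    context_ids = tokenizer(prompt_and_fewshot + ctxt)['input_ids']
    # Tokenize each answer choice as a continuation of the shared context.
    tokenized = {
        'context': context_ids,
        'answers': [tokenizer(' ' + c)['input_ids'] for c in choices],
        'gold': 1,
    }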