InContextLearningMultipleChoiceTaskDataset#
- class composer.datasets.InContextLearningMultipleChoiceTaskDataset(choices_key='choices', static_keys=None, list_of_tensors_keys=None, list_of_tuples_keys=None, list_of_primitives=None, *args, **kwargs)[source]#
A dataset that constructs batches for in-context learning multiple choice evaluation.
If each question has N answer choices, we construct N distinct inputs per question. In order to ensure consistency across multiple GPUs, we set the batch size to be min(N, batch_size) so that all N inputs per question can be stored in the same batch.
The default input format is a jsonl file with the following fields:
- query: The preceding text, question, or document relevant to the choices
- gold: Index of the correct choice under 'choices'
- choices: A list of strings, each being one of the potential choices
Each batch then consists of `batch_size // N` distinct questions and has the following structure:
- input_ids: Input tensor `batch x seqlen x # tokens`
- continuation_indices: List of `batch` tensors indicating which indices in the sequence correspond to the question answer (aka continuation)
- mode: Indicates to the model that this is an ICL task and may rely on a custom code path to properly update metrics
- labels: Identical to the input, used by the model to calculate loss/metrics
- gold_indices: List of length `batch_size // N` indicating, for each question, which of the answers is correct (via an integer in [0, N-1])
- choice_groupings: Indicates which indices of the batch correspond to which questions
- Additional Args:
choices_key (str): The key under which the choices are stored in the saved dataset. Defaults to 'choices'.
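To make the batch layout above concrete, here is a minimal sketch (not the library's implementation; `expand_question` is a hypothetical helper) of how a single question with N choices is expanded into N batch rows, one per candidate answer:

```python
# Hypothetical sketch: one multiple-choice question becomes N rows, one per
# answer choice. Field names mirror the jsonl format described above.

def expand_question(query, choices, gold):
    """Return one model input per answer choice, plus bookkeeping fields."""
    inputs = [f"{query} {choice}" for choice in choices]
    return {
        "inputs": inputs,       # N rows for this single question
        "gold_index": gold,     # which of the N rows holds the correct answer
    }

example = {
    "query": "The capital of France is",
    "gold": 1,
    "choices": ["Berlin", "Paris", "Madrid"],
}
batch_rows = expand_question(example["query"], example["choices"], example["gold"])
# batch_rows["inputs"] holds 3 strings; row 1 ("... Paris") is the gold row.
```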
- collate_fn(data)[source]#
The function that the dataloader uses to accumulate data into batches. We run each distinct query + answer choice through the model separately and determine which answer has the lowest per-token perplexity.

If each question has N possible choices, all N must be grouped together as distinct elements of the batch. Since the batch may consist of multiple questions, choice_groupings indicates which contiguous sequences of elements in the batch correspond to which question, and gold_indices indicates which of the [0, N-1] choices is the correct one for each question.
- Parameters
data (List) – List of tokenized datapoints (dicts returned by self._tokenize_example)
- Returns
Dict – Dictionary for a single batch
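The choice_groupings bookkeeping described above can be illustrated with a short sketch (an assumption for illustration, not the library's actual collate code): each question contributes a contiguous (start, end) range of rows in the flat batch.

```python
# Illustrative sketch: record which contiguous rows of the flat batch belong
# to each question. `collate_choice_groupings` is a hypothetical helper.

def collate_choice_groupings(tokenized_examples):
    """tokenized_examples: list of dicts, each with a 'choices' list."""
    groupings, cursor = [], 0
    for ex in tokenized_examples:
        n = len(ex["choices"])
        groupings.append((cursor, cursor + n))  # rows [cursor, cursor + n)
        cursor += n
    return groupings

# Two questions with 3 and 2 choices occupy rows 0-2 and 3-4 of the batch.
groups = collate_choice_groupings(
    [{"choices": ["a", "b", "c"]}, {"choices": ["x", "y"]}]
)
```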
- get_answer_from_example(example, in_context=False)[source]#
Returns the correct answer from the example's choices.
- Parameters
example (Dict) – The example from which to retrieve the answer
- Returns
str – The full string of the correct answer based on the 'gold' key
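The lookup this method performs can be sketched in one line; the standalone function below is a hypothetical stand-in for the method, assuming the default 'gold' and 'choices' keys:

```python
# Minimal sketch of the behavior described above: select the correct choice
# string via the 'gold' index. Not the actual method implementation.

def get_answer_from_example(example):
    return example["choices"][example["gold"]]

answer = get_answer_from_example(
    {"query": "2 + 2 =", "gold": 0, "choices": ["4", "5"]}
)
```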
- split_batch(batch, microbatch_size)[source]#
Split batch while ensuring all continuations are in the same microbatch.
In ICL Multiple Choice, we duplicate each data point for each possible continuation. When splitting a batch, we have logical examples, which each refer to one question, and real examples, which each refer to one possible continuation. Since example count and microbatch_size are tracked in logical examples, we split logical attributes by microbatch_size and real attributes by microbatch_size * num_choices.
- Parameters
batch (Dict) – Batch of data
microbatch_size (int) – Size of microbatches
- Returns
list – List of chunked batches
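The logical-versus-real split described above can be sketched as follows (a simplified assumption-laden illustration, not the library's code; only two batch keys are shown): logical attributes have one entry per question, while real attributes have num_choices entries per question and must be chunked in multiples of num_choices so all continuations of a question stay in the same microbatch.

```python
# Hedged sketch of splitting a batch into microbatches while keeping all
# continuations of each question together. Helper names are hypothetical.

def chunked(seq, size):
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def split_batch(batch, microbatch_size, num_choices):
    # Logical attributes: one entry per question.
    logical = chunked(batch["gold_indices"], microbatch_size)
    # Real attributes: num_choices entries per question.
    real = chunked(batch["input_ids"], microbatch_size * num_choices)
    return [
        {"gold_indices": g, "input_ids": r} for g, r in zip(logical, real)
    ]

# 2 questions x 2 choices each, with 1 logical example per microbatch:
micro = split_batch(
    {"gold_indices": [0, 1], "input_ids": ["q0c0", "q0c1", "q1c0", "q1c1"]},
    microbatch_size=1,
    num_choices=2,
)
```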
- tokenize_example(prompt_and_fewshot, ctxt, example)[source]#
Runs text through the tokenizer and handles special cases.
- Parameters
prompt_and_fewshot (str) – The collection of the prompt and fewshot examples that belongs before the example's context
ctxt (str) – The specific example's derived context
example (Dict) – The example as a dictionary
- Returns
Dict – Dictionary with the tokenized data
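As a rough illustration of how the prompt/fewshot text and the example's context are joined before tokenization, here is a sketch with a stand-in whitespace "tokenizer" (the real class uses a proper tokenizer; this function and its keys are assumptions for illustration only):

```python
# Hypothetical sketch: concatenate the fewshot prompt with the example's
# context, then "tokenize" by whitespace as a stand-in for a real tokenizer.

def tokenize_example(prompt_and_fewshot, ctxt):
    text = prompt_and_fewshot + ctxt
    return {"input_ids": text.split(), "text": text}

tok = tokenize_example(
    "Q: What color is the sky? A: blue\n",  # prompt + one fewshot example
    "Q: 1+1= A:",                            # this example's context
)
```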