InContextLearningCodeEvalAccuracy#

class composer.metrics.InContextLearningCodeEvalAccuracy(dist_sync_on_step=False)[source]#

Computes accuracy for In-context learning (ICL) code evaluation tasks.

ICL code eval tasks consist of some number of example code eval tasks (referred to as the 'context'), followed by a test task where the model must complete the code; we term this completion a 'continuation'.

For each task, the model generates a given number of continuations (the metric is termed pass@K for K continuations), and each continuation is run against a set of test cases. The model is considered correct if at least one of the proposed continuations passes all the test cases.
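For illustration, the per-task pass/fail rule can be sketched as below. This is a minimal sketch, not the metric's actual implementation; run_test_case is a hypothetical stand-in for the sandboxed code execution the metric performs:

    def task_is_correct(continuations, test_cases, run_test_case):
        """A task counts as correct if at least one generated continuation
        passes every one of its test cases."""
        return any(
            all(run_test_case(code, case) for case in test_cases)
            for code in continuations
        )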

Runs on AWS Lambda by default.

Adds metric state variables:

correct (float): The number of instances where the predictions passed all the test cases.

total (float): The total number of instances that were predicted.

Parameters

dist_sync_on_step (bool, optional) – Synchronize metric state across processes at each forward() before returning the value at the step. Default: False.

estimator(n, c, k)[source]#

Computes the pass@k metric.

Given the number of generated samples, n, the number of correct samples, c, and the k of interest, this function calculates pass@k as 1 - comb(n - c, k) / comb(n, k), per the definition of pass@k in the HumanEval paper (https://arxiv.org/abs/2107.03374) and its associated implementation: https://github.com/openai/human-eval.
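The stated formula translates directly into the following minimal sketch (the names mirror the documented signature, and the guard for n - c < k follows the HumanEval reference implementation; this is an illustration rather than the library's exact code):

    from math import comb

    def estimator(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: 1 - C(n - c, k) / C(n, k)."""
        if n - c < k:
            # Fewer than k incorrect samples: any draw of k samples
            # must contain at least one correct sample.
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

For example, with n = 5 generated samples of which c = 2 are correct, pass@1 = 1 - C(3, 1) / C(5, 1) = 1 - 3/5 = 0.4.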

get_client()[source]#

Returns a client for the appropriate remote platform.

update(batch, outputs, labels)[source]#

Updates the pass@k accuracy of code generation.

Given a batch of prompts, test cases, and code generations, evaluates the code generations against the test cases and accumulates the batch's pass@k accuracy into the running metric state (see the usage sketch after the parameter list).

Parameters
  • batch (Dict[str, Any]) – A batch of data produced by the InContextLearningCodeEvalDataset, with the prompts, test cases, and entry points. This will be a dictionary that must have the following keys: {'prompts': List[str], 'test_inputs': List[List[str]], 'test_outputs': List[List[str]], 'entry_points': List[str], 'languages': List[str], 'generation_kwargs': Dict[str, Any]}

  • outputs (List[str]) – A list of code generations in the format of HF generate with beam search, which means the list is ordered as [prompt 1 gen 1, prompt 1 gen 2, prompt 2 gen 1, prompt 2 gen 2, …].

  • labels (List[str]) – A list of the correct code generations, for compatibility with existing HF generate functionalities. This is not used.
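For orientation, a hypothetical end-to-end call might look like the sketch below. The batch is hand-built here with the documented keys; in practice it is produced by InContextLearningCodeEvalDataset, which may attach additional keys beyond those listed above:

    from composer.metrics import InContextLearningCodeEvalAccuracy

    metric = InContextLearningCodeEvalAccuracy()

    # One prompt with two test cases; keys as documented above.
    batch = {
        'prompts': ['def add(a, b):\n'],
        'test_inputs': [['(1, 2)', '(0, 0)']],
        'test_outputs': [['3', '0']],
        'entry_points': ['add'],
        'languages': ['python'],
        'generation_kwargs': {'num_return_sequences': 2},
    }

    # Two generations for the single prompt, in HF generate order:
    # [prompt 1 gen 1, prompt 1 gen 2].
    outputs = [
        'def add(a, b):\n    return a + b',
        'def add(a, b):\n    return a - b',
    ]

    metric.update(batch, outputs, labels=[])  # labels is unused
    print(metric.compute())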