class composer.callbacks.HealthChecker(threshold=10, sample_freq=5, window_size=120, wait=120, slack_webhook_url=None, test_mode=False)

Checks for GPU health.

This callback checks GPU health by tracking GPU utilization and sending an alert when it looks abnormal.

For example, if the average utilizations during the observation window are [30, 30, 45], then the range (45 - 30 = 15) exceeds a threshold of 10%.
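The range check in the example above can be sketched in a few lines of Python. This is only an illustration, not the callback's actual implementation: the helper names `utilization_range` and `is_abnormal` are hypothetical, and the strict "greater than" comparison is an assumption.

```python
def utilization_range(averages):
    """Spread between the highest and lowest average utilization, in percent."""
    return max(averages) - min(averages)


def is_abnormal(averages, threshold=10.0):
    """Assumed rule: alert when the spread is strictly greater than the threshold."""
    return utilization_range(averages) > threshold


window_averages = [30, 30, 45]
print(utilization_range(window_averages))  # 15
print(is_abnormal(window_averages))        # True
```

With a threshold of 10, a window of [30, 30, 35] has a range of 5 and would not trigger an alert under this rule.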

  • threshold (float, optional) – Threshold of GPU utilization range to trigger an alert. Default: 10.

  • sample_freq (int, optional) – Sample frequency in seconds. Default: 5.

  • window_size (int, optional) – Window size in seconds. HealthChecker will check for abnormalities at this frequency. Default: 120.

  • wait (int, optional) – Seconds to wait before starting to sample. Default: 120.

  • slack_webhook_url (str, optional) – Slack URL to send alerts. Can also be set with the SLACK_WEBHOOK_URL environment variable. Default: None.

  • test_mode (bool, optional) – If True, will send a test alert at the first check. Default: False.
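To make the timing and webhook parameters concrete, here is a minimal sketch of how they might interact. It rests on two assumptions that the text above does not confirm: that one observation window holds `window_size // sample_freq` samples, and that an explicit `slack_webhook_url` argument takes precedence over the environment variable. The helper names `samples_per_window` and `resolve_webhook_url` are hypothetical and not part of Composer.

```python
import os


def samples_per_window(window_size=120, sample_freq=5):
    """Assumed relationship: number of samples collected in one observation window."""
    return window_size // sample_freq


def resolve_webhook_url(slack_webhook_url=None):
    """Assumed precedence: explicit argument first, then SLACK_WEBHOOK_URL."""
    return slack_webhook_url or os.environ.get("SLACK_WEBHOOK_URL")


print(samples_per_window())  # 24 with the documented defaults
```

In practice the callback itself would be constructed with these arguments and passed to the trainer's `callbacks` list.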