- class composer.callbacks.HealthChecker(threshold=10, sample_freq=5, window_size=120, wait=120, slack_webhook_url=None, test_mode=False)
Checks for GPU health.
This callback checks GPU health by tracking GPU utilization and sending an alert when it looks abnormal.
For example, if the average utilizations during the observation window are [30, 30, 45], then the range (45 - 30 = 15) exceeds a threshold of 10%.
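The range check in the example above can be sketched in a few lines. This is a minimal illustration, not the callback's actual implementation: the function name `range_exceeds_threshold` is hypothetical, and the strict `>` comparison is an assumption.

```python
def range_exceeds_threshold(avg_utilizations, threshold=10.0):
    """Return True if the spread of average utilizations exceeds the threshold.

    avg_utilizations: average GPU utilization values (in percent) observed
    during one window. The strict '>' comparison is an assumption.
    """
    spread = max(avg_utilizations) - min(avg_utilizations)
    return spread > threshold


# The example from the text: range is 45 - 30 = 15, which exceeds 10.
alert = range_exceeds_threshold([30, 30, 45], threshold=10)
# A tighter spread (35 - 30 = 5) stays under the threshold.
no_alert = range_exceeds_threshold([30, 30, 35], threshold=10)
```

With the default threshold of 10, the first call flags the window and the second does not.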
threshold (float, optional) – Threshold on the GPU utilization range (in percentage points) that triggers an alert. Default: 10.
sample_freq (int, optional) – Sample frequency in seconds. Default: 5.
window_size (int, optional) – Window size in seconds. HealthChecker will check for abnormalities at this interval. Default: 120.
wait (int, optional) – Seconds to wait before starting to sample. Default: 120.
slack_webhook_url (str, optional) – Slack URL to send alerts. Can also be set with the SLACK_WEBHOOK_URL environment variable. Default: None.
test_mode (bool, optional) – If True, will send a test alert at the first check. Default: False.
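The Slack webhook URL can come from either the `slack_webhook_url` argument or the `SLACK_WEBHOOK_URL` environment variable. The sketch below shows one plausible way to resolve it. The helper `resolve_slack_webhook_url` is hypothetical, the URL is a placeholder, and giving the explicit argument precedence over the environment variable is an assumption rather than documented behavior.

```python
import os


def resolve_slack_webhook_url(slack_webhook_url=None):
    """Pick the webhook URL from the argument, else from the environment.

    Assumption: an explicit argument takes precedence over the
    SLACK_WEBHOOK_URL environment variable. Returns None if neither is set.
    """
    if slack_webhook_url:
        return slack_webhook_url
    return os.environ.get("SLACK_WEBHOOK_URL")


# Placeholder URL for illustration only.
os.environ["SLACK_WEBHOOK_URL"] = "https://hooks.slack.com/services/EXAMPLE"
from_env = resolve_slack_webhook_url()
from_arg = resolve_slack_webhook_url("https://hooks.slack.com/services/OTHER")
```

If neither source provides a URL, the helper returns None, matching the documented default.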