HealthChecker
- class composer.callbacks.HealthChecker(threshold=10, sample_freq=5, window_size=120, wait=120, slack_webhook_url=None, test_mode=False)
- Checks for GPU health.
- This callback tracks GPU utilization and sends an alert when it looks abnormal.
- For example, if the average utilizations during the observation window are [30, 30, 45], then the range (45 - 30 = 15) exceeds a threshold of 10%.
- Parameters
- threshold (float, optional) – Threshold on the range of GPU utilization (in percent) that triggers an alert. Default: 10.
- sample_freq (int, optional) – Sample frequency in seconds. Default: 5.
- window_size (int, optional) – Window size in seconds. HealthChecker checks for abnormalities once per window. Default: 120.
- wait (int, optional) – Seconds to wait before sampling starts. Default: 120.
- slack_webhook_url (str, optional) – Slack webhook URL to send alerts to. Can also be set with the SLACK_WEBHOOK_URL environment variable. Default: None.
- test_mode (bool, optional) – If True, sends a test alert at the first check. Default: False.
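The range check described in the example above can be sketched in a few lines of plain Python. This is an illustration only, not Composer's implementation: the helper name `utilization_range_exceeds` is hypothetical, and treating the list as one windowed average per GPU, along with the strict `>` comparison, are assumptions.

```python
def utilization_range_exceeds(avg_utilizations, threshold=10.0):
    """Return True when the spread (max - min) of the averages exceeds threshold.

    avg_utilizations: windowed average utilization values, in percent
                      (assumed here to be one value per GPU).
    threshold:        allowed spread in percentage points (assumed strict '>').
    """
    if not avg_utilizations:
        # Nothing sampled yet, so there is nothing to flag.
        return False
    spread = max(avg_utilizations) - min(avg_utilizations)
    return spread > threshold


if __name__ == "__main__":
    # The example from the description: 45 - 30 = 15, which is above 10.
    print(utilization_range_exceeds([30, 30, 45], threshold=10))
```

With the default settings, sampling every 5 seconds over a 120-second window would yield 120 / 5 = 24 samples per check, if samples are taken at exactly that rate.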