Monitoring Batch Execution Failures
This guide walks through setting up a monitoring solution that checks for failed batch executions on JFrogML and sends notifications to Slack from an AWS Lambda function. Adjust the scripts as needed to fit your infrastructure and notification requirements.
Requirements:
To set up this batch monitoring automation, you'll need the following:
- Cron job infrastructure: We will use AWS Lambda, triggered on a schedule, to run the script periodically.
- Notification platform: We will use Slack to receive notifications about any batch job failures.
1. Check the Batch Executions that finished in the last N minutes/hours
We need a Python script that checks for batch executions that finished within a specified time window and identifies any failures. This lets us monitor the status of our jobs and react promptly to any issues.
# check_and_retrieve_failed_executions.py
import os
from qwak import QwakClient
from qwak.qwak_client.batch_jobs.execution import ExecutionStatus
from datetime import datetime, timedelta

def check_and_retrieve_failed_executions(minutes_to_search_back):
    if 'QWAK_API_KEY' not in os.environ:
        raise EnvironmentError("Please set the 'QWAK_API_KEY' environment variable before running the script.")
    # Initialize the Qwak client
    client = QwakClient()
    failed_executions = []
    # Set the time window to check for executions
    # (assumes execution.end_time is a naive datetime comparable with datetime.now())
    time_threshold = datetime.now() - timedelta(minutes=minutes_to_search_back)
    # List all projects
    projects = client.list_projects()
    # Iterate through projects and their models to check executions
    for project in projects:
        models = client.list_models(project.project_id)
        for model in models:
            executions = client.list_executions(model=model.model_id)
            for execution in executions:
                # Skip executions without an end time (assumed to be still running)
                if not execution.end_time:
                    continue
                # Check if the execution failed within the last N minutes
                if execution.end_time >= time_threshold and execution.execution_status == ExecutionStatus.BATCH_JOB_FAILED_STATUS:
                    failed_executions.append({
                        'execution_id': execution.execution_id,
                        'model_id': model.model_id,
                        'failure_message': execution.failure_message
                    })
    # Create a message with failed executions
    if failed_executions:
        message = "Failed Executions in the Last {} Minutes:\n".format(minutes_to_search_back)
        for failure in failed_executions:
            message += "Execution ID: {}\nModel ID: {}\nFailure Message: {}\n\n".format(
                failure['execution_id'], failure['model_id'], failure['failure_message']
            )
        return message
    # No failures in the time window
    return None
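The time-window rule used above can be exercised without the Qwak SDK. The following is a minimal sketch under stated assumptions: `FakeExecution` and `filter_recent_failures` are hypothetical names introduced for illustration (they are not part of the SDK), the status is modelled as a plain string, and an execution with no `end_time` is assumed to be still running.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FakeExecution:
    """Hypothetical stand-in for an SDK execution object (illustration only)."""
    execution_id: str
    status: str
    end_time: Optional[datetime]
    failure_message: str = ""


def filter_recent_failures(executions: List[FakeExecution],
                           now: datetime,
                           minutes_to_search_back: int) -> List[FakeExecution]:
    """Return executions that failed at or after now - minutes_to_search_back."""
    time_threshold = now - timedelta(minutes=minutes_to_search_back)
    return [
        e for e in executions
        if e.status == "FAILED"
        and e.end_time is not None       # skip executions that have not finished
        and e.end_time >= time_threshold
    ]


now = datetime(2024, 1, 1, 12, 0)
executions = [
    FakeExecution("a", "FAILED", now - timedelta(minutes=10), "OOM"),
    FakeExecution("b", "FAILED", now - timedelta(minutes=90), "timeout"),
    FakeExecution("c", "SUCCEEDED", now - timedelta(minutes=5)),
    FakeExecution("d", "RUNNING", None),
]
recent = filter_recent_failures(executions, now, minutes_to_search_back=60)
print([e.execution_id for e in recent])  # prints ['a']
```

Passing `now` in as a parameter, rather than calling `datetime.now()` inside the function, keeps the rule deterministic and easy to test.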
2. Sending a notification to Slack with all the failed executions for the last N minutes
# notify_on_slack.py
import requests
import json

def send_message_to_slack(message, webhook_url):
    payload = {"text": message}
    headers = {'Content-Type': 'application/json'}
    # Use a timeout so a slow Slack response cannot hold the Lambda until it times out
    response = requests.post(webhook_url, data=json.dumps(payload), headers=headers, timeout=10)
    if response.status_code != 200:
        raise ValueError(f'Request to Slack returned an error {response.status_code}, the response is: {response.text}')
    else:
        print('Message posted successfully.')
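If you would rather not bundle `requests` into the deployment package, the same webhook call can be made with the Python standard library. This is a sketch of that alternative, not the method used in the script above; `build_slack_request` and `send_message_to_slack_stdlib` are hypothetical names introduced here.

```python
import json
import urllib.request


def build_slack_request(message: str, webhook_url: str) -> urllib.request.Request:
    """Build a POST request carrying the same JSON payload as the script above."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_message_to_slack_stdlib(message: str, webhook_url: str,
                                 timeout: float = 10.0) -> None:
    """Post the message; urlopen raises urllib.error.HTTPError on 4xx/5xx responses."""
    request = build_slack_request(message, webhook_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status != 200:
            raise ValueError(f"Request to Slack returned status {response.status}")
```

Separating request construction from sending makes it possible to check the payload in a unit test without calling Slack.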
3. Calling everything in the Lambda Handler
# lambda_handler.py
import os
from check_and_retrieve_failed_executions import check_and_retrieve_failed_executions
from notify_on_slack import send_message_to_slack

def monitor_executions(event, context):
    # Retrieve environment variables
    qwak_api_key = os.environ.get('QWAK_API_KEY')
    webhook_url = os.environ.get('SLACK_WEBHOOK_URL')  # e.g. 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    minutes_to_search_back = int(os.environ.get('MINUTES_TO_SEARCH_BACK', 60))
    # Check for missing environment variables
    if not qwak_api_key or not webhook_url:
        raise EnvironmentError("Please set the 'QWAK_API_KEY' and 'SLACK_WEBHOOK_URL' environment variables.")
    failed_executions_message = check_and_retrieve_failed_executions(minutes_to_search_back)
    # The check returns None when nothing failed, so notify only when there is a message
    if failed_executions_message:
        send_message_to_slack(failed_executions_message, webhook_url)
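The environment handling in the handler can also be factored into a small function that is testable on its own. The sketch below is one possible shape, assuming the same three variable names; `MonitorConfig` and `load_config` are hypothetical helpers, and the positive-integer check is an added assumption rather than something the handler above enforces.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MonitorConfig:
    """Hypothetical container for the settings the handler needs."""
    webhook_url: str
    minutes_to_search_back: int


def load_config(env: Mapping[str, str]) -> MonitorConfig:
    """Validate the required variables and parse the look-back window."""
    required = ("QWAK_API_KEY", "SLACK_WEBHOOK_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise EnvironmentError("Missing environment variables: " + ", ".join(missing))
    minutes = int(env.get("MINUTES_TO_SEARCH_BACK", "60"))
    if minutes <= 0:
        raise ValueError("MINUTES_TO_SEARCH_BACK must be a positive integer")
    return MonitorConfig(env["SLACK_WEBHOOK_URL"], minutes)


# Example with an in-memory mapping; inside Lambda you would pass os.environ.
config = load_config({
    "QWAK_API_KEY": "placeholder-key",
    "SLACK_WEBHOOK_URL": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX",
})
print(config.minutes_to_search_back)  # prints 60
```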
Putting It All Together
Steps to deploy the automation on AWS Lambda:
- Create a new Lambda function in the AWS Lambda console.
- Set the runtime to Python 3.x.
- Add environment variables in the Lambda function configuration for:
  - QWAK_API_KEY: Your Qwak API key.
  - SLACK_WEBHOOK_URL: Your Slack webhook URL.
  - MINUTES_TO_SEARCH_BACK: The number of minutes to check back for failed executions. Defaults to 60 if not set.
- Upload the three Python scripts (check_and_retrieve_failed_executions.py, notify_on_slack.py, lambda_handler.py) as a Lambda deployment package (zip file). The package must also include the scripts' third-party dependencies (the Qwak SDK and requests), because the Lambda Python runtime does not provide them.
- Configure the function handler in the Lambda console to lambda_handler.monitor_executions.
- Set up a CloudWatch Events (now Amazon EventBridge) rule to trigger the Lambda function periodically according to your preferred schedule (e.g., every hour).
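The scheduling step above can also be scripted instead of done in the console. The sketch below assumes boto3 clients for EventBridge (`events`) and Lambda are passed in; `schedule_lambda`, the rule name, and the statement ID are hypothetical choices, and the function name and ARN are placeholders you would replace with your own.

```python
def schedule_lambda(events_client, lambda_client, function_name, function_arn,
                    rule_name="batch-failure-monitor", schedule="rate(1 hour)"):
    """Create a scheduled rule, allow it to invoke the function, and attach the target."""
    # Create (or update) the scheduled rule
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )
    # Grant EventBridge permission to invoke the Lambda function
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    # Point the rule at the function
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
    return rule["RuleArn"]


# Usage (requires boto3 and AWS credentials; names below are placeholders):
# import boto3
# schedule_lambda(
#     boto3.client("events"),
#     boto3.client("lambda"),
#     function_name="<your-function-name>",
#     function_arn="<your-function-arn>",
# )
```

Keeping the schedule interval equal to MINUTES_TO_SEARCH_BACK is a reasonable default: a shorter window can miss failures between runs, and a longer one reports the same failure more than once. Note that `add_permission` raises an error if the statement ID already exists, so a rerun needs a different ID or a prior cleanup.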
By following these steps, you will be able to monitor JFrogML batch executions for failures and receive alerts via Slack, complementing the monitoring already available on the platform.