Monitoring Batch Execution Failures

This guide walks through setting up a monitoring solution that checks for failed batch executions on JFrogML and sends notifications to Slack using an AWS Lambda function. Adjust the scripts as needed to fit your infrastructure and notification requirements.

Requirements:

To set up this batch monitoring automation, you'll need the following:

Cron job infrastructure: We will use AWS Lambda to run the script, triggered periodically on a schedule.
Notification platform: We will use Slack to receive notifications about any batch job failures.

1. Check the Batch Executions that finished in the last N minutes/hours

We need a Python script that checks for batch executions that finished within a specified time window and identifies any failures. This lets us monitor the status of our jobs and react promptly to any issues.

# check_and_retrieve_failed_executions.py
import os
from qwak import QwakClient
from qwak.qwak_client.batch_jobs.execution import ExecutionStatus
from datetime import datetime, timedelta

def check_and_retrieve_failed_executions(minutes_to_search_back):
    if 'QWAK_API_KEY' not in os.environ:
        raise EnvironmentError("Please set the 'QWAK_API_KEY' environment variable before running the script.")
    
    # Initialize the Qwak client
    client = QwakClient()
    failed_executions = []
    
    # Set the time window to check for executions
    time_threshold = datetime.now() - timedelta(minutes=minutes_to_search_back)
    
    # List all projects
    projects = client.list_projects()
    
    # Iterate through projects and their models to check executions
    for project in projects:
        models = client.list_models(project.project_id)
        for model in models:
            executions = client.list_executions(model=model.model_id)
            for execution in executions:
                # Check if the execution finished (end_time is set) and failed within the last N minutes
                if (execution.end_time
                        and execution.end_time >= time_threshold
                        and execution.execution_status == ExecutionStatus.BATCH_JOB_FAILED_STATUS):
                    failed_executions.append({
                        'execution_id': execution.execution_id,
                        'model_id': model.model_id,
                        'failure_message': execution.failure_message
                    })
    
    # Create a message with failed executions
    if failed_executions:
        message = "Failed Executions in the Last {} Minutes:\n".format(minutes_to_search_back)
        for failure in failed_executions:
            message += "Execution ID: {}\nModel ID: {}\nFailure Message: {}\n\n".format(
                failure['execution_id'], failure['model_id'], failure['failure_message']
            )
        return message
    else:
        return None
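
The filtering rule used above can be tried out without the SDK. The following is a minimal sketch that applies the same time-window check to plain data. The `ExecutionRecord` class, the `recent_failures` helper, and the string statuses are hypothetical stand-ins for illustration, not part of the Qwak SDK; it also assumes that an unfinished execution has no end time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ExecutionRecord:
    """Hypothetical stand-in for an SDK execution object (illustration only)."""
    execution_id: str
    status: str
    end_time: Optional[datetime]  # assumed to be None while still running


def recent_failures(executions: List[ExecutionRecord], now: datetime,
                    minutes_to_search_back: int) -> List[ExecutionRecord]:
    """Return executions that failed at or after now - minutes_to_search_back."""
    time_threshold = now - timedelta(minutes=minutes_to_search_back)
    return [
        e for e in executions
        if e.end_time is not None
        and e.end_time >= time_threshold
        and e.status == "FAILED"
    ]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    ExecutionRecord("a", "FAILED", now - timedelta(minutes=10)),    # recent failure
    ExecutionRecord("b", "FAILED", now - timedelta(minutes=90)),    # outside the window
    ExecutionRecord("c", "SUCCEEDED", now - timedelta(minutes=5)),  # not a failure
    ExecutionRecord("d", "RUNNING", None),                          # not finished yet
]
print([e.execution_id for e in recent_failures(sample, now, 60)])  # ['a']
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the window logic easy to test with fixed timestamps.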

2. Send a notification to Slack with all the failed executions from the last N minutes

# notify_on_slack.py
import requests
import json

def send_message_to_slack(message, webhook_url):
    payload = {"text": message}
    
    headers = {'Content-Type': 'application/json'}
    
    # A timeout keeps the Lambda from hanging if Slack is unreachable
    response = requests.post(webhook_url, data=json.dumps(payload), headers=headers, timeout=10)
    
    if response.status_code != 200:
        raise ValueError(f'Request to Slack returned an error {response.status_code}, the response is: {response.text}')
    else:
        print('Message posted successfully.')
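
If you would rather not bundle `requests` into the deployment package, the same POST can be sketched with the Python standard library. This is an alternative under stated assumptions, not the method used in this guide: the function names `build_slack_request` and `send_message_to_slack_stdlib` are hypothetical, and the webhook URL shown in any test is a placeholder.

```python
import json
import urllib.request


def build_slack_request(message: str, webhook_url: str) -> urllib.request.Request:
    """Build a POST request carrying the same JSON payload as the script above."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_message_to_slack_stdlib(message: str, webhook_url: str,
                                 timeout: float = 10.0) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(message, webhook_url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    # so a returned status indicates the request was accepted.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending makes it possible to inspect the payload and headers without making a network call.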

3. Calling everything in the Lambda Handler

# lambda_handler.py
import os
from check_and_retrieve_failed_executions import check_and_retrieve_failed_executions
from notify_on_slack import send_message_to_slack

def monitor_executions(event, context):
    # Retrieve environment variables
    qwak_api_key = os.environ.get('QWAK_API_KEY')
    # Example webhook URL: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    webhook_url = os.environ.get('SLACK_WEBHOOK_URL')
    minutes_to_search_back = int(os.environ.get('MINUTES_TO_SEARCH_BACK', 60))

    # Check for missing environment variables
    if not qwak_api_key or not webhook_url:
        raise EnvironmentError("Please set the 'QWAK_API_KEY' and 'SLACK_WEBHOOK_URL' environment variables.")
    
    failed_executions_message = check_and_retrieve_failed_executions(minutes_to_search_back)
    
    # The check returns None when no executions failed, so only notify on a non-empty message
    if failed_executions_message:
        send_message_to_slack(failed_executions_message, webhook_url)
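
Because `check_and_retrieve_failed_executions` returns `None` when nothing failed, the handler has to treat `None` as "nothing to send". One way to exercise that control flow locally, without Qwak or Slack credentials, is to pass the check and notify steps in as callables. The sketch below assumes a hypothetical `run_monitor` helper and a fake notifier; neither is part of the scripts above.

```python
from typing import Callable, List, Optional


def run_monitor(check: Callable[[int], Optional[str]],
                notify: Callable[[str, str], None],
                webhook_url: str,
                minutes_to_search_back: int) -> bool:
    """Run one monitoring pass; return True if a notification was sent."""
    message = check(minutes_to_search_back)
    if not message:  # None (or empty) means nothing failed in the window
        return False
    notify(message, webhook_url)
    return True


sent: List[str] = []


def fake_notify(message: str, webhook_url: str) -> None:
    """Record the message instead of calling Slack (test double)."""
    sent.append(message)


# No failures: nothing is sent
print(run_monitor(lambda minutes: None, fake_notify, "https://example.invalid/hook", 60))
# One failure: the message is forwarded to the notifier
print(run_monitor(lambda minutes: "Execution ID: 123 failed", fake_notify,
                  "https://example.invalid/hook", 60))
```

In a real handler, `check` and `notify` would be the two functions defined in the earlier scripts.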

Putting It All Together

Steps to deploy the automation on AWS Lambda:

Create a new Lambda function in the AWS Lambda console.
Set the runtime to Python 3.x.
Add environment variables in the Lambda function configuration for:
- QWAK_API_KEY: Your Qwak API key.
- SLACK_WEBHOOK_URL: Your Slack webhook URL.
- MINUTES_TO_SEARCH_BACK: The number of minutes to check back for failed executions. Defaults to 60 if not set.
Upload the three Python scripts (check_and_retrieve_failed_executions.py, notify_on_slack.py, lambda_handler.py), together with their dependencies (the Qwak SDK and requests), as a Lambda deployment package (zip file).
Configure the function handler in the Lambda console to lambda_handler.monitor_executions.
Set up an Amazon EventBridge (formerly CloudWatch Events) rule to trigger the Lambda function periodically according to your preferred schedule (e.g., every hour).
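
When choosing the schedule, it generally helps to keep the trigger interval equal to MINUTES_TO_SEARCH_BACK: a shorter interval reports the same failure more than once, and a longer one leaves gaps. As a minimal sketch, the hypothetical helper below (not part of any AWS SDK) derives a `rate(...)` schedule expression from the look-back window. Note that scheduled triggers are not exact to the second, so a small overlap may still be worth considering.

```python
def rate_expression(minutes: int) -> str:
    """Build a rate() schedule expression matching a look-back window in minutes."""
    if minutes < 1:
        raise ValueError("minutes must be a positive integer")
    # Prefer the largest unit that divides the window evenly
    if minutes % 1440 == 0:
        value, unit = minutes // 1440, "day"
    elif minutes % 60 == 0:
        value, unit = minutes // 60, "hour"
    else:
        value, unit = minutes, "minute"
    # Rate expressions use the singular unit for a value of 1, plural otherwise
    if value != 1:
        unit += "s"
    return f"rate({value} {unit})"


print(rate_expression(60))  # rate(1 hour)
print(rate_expression(15))  # rate(15 minutes)
```

The resulting string can be used as the schedule expression when creating the rule in the console.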

By following these steps, you can monitor JFrogML batch executions for failures and receive alerts in Slack, extending the functionality the platform already provides.