Real-Time Deployment


Qwak can help you set up a real-time Web service. Qwak wraps your code and model in a lightweight REST API service so that the model can be queried easily over the Web.

All you need to do is provide a Qwak-compatible model, and Qwak does the rest.

Qwak sets up all the network requirements and deploys the service to Kubernetes, allowing you to leverage auto-scaling and ensuring that all incoming traffic is accommodated. Qwak adds an authentication token to secure access. Furthermore, it adds a suite of monitoring tools, simplifying the process of managing your Web service and its performance.

Once you have a successful build, you can opt for a real-time deployment. Real-time deployment uses an HTTPS server to expose the endpoint.

Deployment Configuration

| Parameter | Description | Default Value |
|---|---|---|
| Model ID [Required] | The Model ID, as displayed on the model header. | |
| Build ID [Required] | The Qwak-assigned build ID. | |
| Variation name | The name of the variation to deploy the build on. | default |
| Initial number of pods | The number of k8s pods to be used by the deployment. Each pod contains an HTTPS server. A load balancer splits the traffic between them. | |
| CPU fraction | The CPU fraction allocated to each pod. The CPU resource is measured in CPU units. One CPU, in Qwak, is equivalent to 1 GCP Core, 1 Azure vCore, or 1 Hyperthread on a bare-metal Intel processor with Hyperthreading. | |
| Memory | The RAM (in MB) to allocate to each pod. | 512 |
| Timeout | The number of milliseconds required for an API server request to time out. | 1000 |
| Concurrent workers | The number of Gunicorn worker processes for handling requests. A positive integer, generally in the 2-4 x $(NUM_CORES) range. You may want to vary this a bit to find the optimal value for your particular application's workload. | |
| Daemon mode | Whether or not to daemonize the Gunicorn process. Detaches the server from the controlling terminal and enters the background. | |
| IAM role ARN | The user-provided AWS custom IAM role. | None |
| Max batch size | The maximum allowed batch size. | 1 |
| Timeout | The prediction request timeout. | 5000 milliseconds |
| GPU Type | The GPU type to use in the model deployment. Supported options are NVIDIA K80, NVIDIA Tesla V100, NVIDIA T4, and NVIDIA A10. | None |
| GPU Amount | The number of GPUs available for the model deployment. Varies based on the selected GPU type. | Based on GPU Type |
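
The 2-4 x $(NUM_CORES) guideline for concurrent workers can be turned into a quick calculation. The following is a minimal sketch, not part of the Qwak SDK; the function name `gunicorn_worker_range` is hypothetical, and it assumes a whole number of cores.

```python
import os
from typing import Tuple


def gunicorn_worker_range(num_cores: int) -> Tuple[int, int]:
    """Return the (low, high) worker counts for the 2-4 x cores guideline.

    Hypothetical helper for illustration; not part of the Qwak SDK.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be a positive integer")
    return 2 * num_cores, 4 * num_cores


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to a single core.
    cores = os.cpu_count() or 1
    low, high = gunicorn_worker_range(cores)
    print(f"{cores} cores: try between {low} and {high} workers")
```

Treat the range as a starting point and measure under your own workload before settling on a value.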

Real-Time Deployment from the App

To deploy a real-time model from the UI:

  1. In the left navigation bar in the Qwak UI, select Projects.
  2. Select a project and then select a model.
  3. Select the Builds tab. Find a build to deploy and click the deployment toggle. The Deploy dialog box appears.
  4. Select Realtime and then select Next.
  5. Configure your real-time deployment by selecting the initial number of pods, CPU fraction, and memory.

The Advanced settings page of the real-time deployment configuration includes additional settings, such as injecting environment variables and setting invocation timeouts.

Real-Time Deployment from the CLI

To deploy a model in real-time mode from the CLI, populate the following command template:

qwak models deploy realtime \
    --model-id <model-id> \
    --build-id <build-id> \
    --pods <pods-count> \
    --cpus <cpus-fraction> \
    --memory <memory-size> \
    --timeout <timeout> \
    --server-workers <workers> \
    --variation-name <variation-name> \
    --daemon-mode <bool>

For example, for the model built in the Getting Started with Qwak section, the deployment command is:

qwak models deploy realtime \
    --model-id churn_model \
    --build-id 7121b796-5027-11ec-b97c-367dda8b746f \
    --pods 4 \
    --cpus 3 \
    --memory 1024 \
    --timeout 3000 \
    --server-workers 4 \
    --variation-name default \
    --daemon-mode false



By default, the deployment command runs asynchronously: it triggers the deployment but does not wait for it to complete. To run the command synchronously, use the --sync flag.
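
If you launch deployments from a script, the command template can be assembled programmatically. The sketch below is illustrative only: `build_deploy_command` is a hypothetical helper, and it uses only the flags shown in this section.

```python
import shlex
from typing import List


def build_deploy_command(
    model_id: str,
    build_id: str,
    pods: int,
    cpus: float,
    memory: int,
    timeout: int,
    server_workers: int,
    variation_name: str = "default",
    daemon_mode: bool = False,
    sync: bool = False,
) -> List[str]:
    """Build the argument list for `qwak models deploy realtime`.

    Hypothetical helper; only flags documented in this section are used.
    """
    args = [
        "qwak", "models", "deploy", "realtime",
        "--model-id", model_id,
        "--build-id", build_id,
        "--pods", str(pods),
        "--cpus", str(cpus),
        "--memory", str(memory),
        "--timeout", str(timeout),
        "--server-workers", str(server_workers),
        "--variation-name", variation_name,
        "--daemon-mode", str(daemon_mode).lower(),
    ]
    if sync:
        # Wait for the deployment to complete instead of returning immediately.
        args.append("--sync")
    return args


if __name__ == "__main__":
    cmd = build_deploy_command(
        "churn_model", "7121b796-5027-11ec-b97c-367dda8b746f",
        pods=4, cpus=3, memory=1024, timeout=3000, server_workers=4, sync=True,
    )
    # Print the command rather than running it; pass `cmd` to subprocess.run to execute.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems when it is later passed to `subprocess.run`.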

Deploying with GPU Resources

You can use GPU resources in your deployment by supplying the GPU Type and GPU Amount in the deployment request. For example:

qwak models deploy realtime \
    --model-id churn_model \
    --build-id 7121b796-5027-11ec-b97c-367dda8b746f \
    --pods 4 \
    --gpu-type NVIDIA_K80 \
    --gpu-amount 1 \
    --memory 1024 \
    --timeout 3000 \
    --server-workers 4 \
    --variation-name default \
    --daemon-mode false


When GPU resources are passed in the deployment request, the CPU and memory settings are ignored, because those values are determined automatically by the GPU type and GPU amount you select.
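
The precedence rule above can be mirrored on the client side to make it explicit which settings will take effect. This is a sketch only: `effective_resources` is a hypothetical helper, and the check that GPU type and amount are supplied together is an assumption rather than documented behavior.

```python
from typing import Dict, Optional, Union


def effective_resources(
    cpus: Optional[float] = None,
    memory: Optional[int] = None,
    gpu_type: Optional[str] = None,
    gpu_amount: Optional[int] = None,
) -> Dict[str, Union[str, int, float, None]]:
    """Return the resource settings that would take effect for a request.

    Hypothetical helper mirroring the rule that GPU settings override
    CPU and memory settings.
    """
    # Assumption: GPU type and GPU amount are always supplied as a pair.
    if (gpu_type is None) != (gpu_amount is None):
        raise ValueError("gpu_type and gpu_amount must be supplied together")
    if gpu_type is not None:
        # CPU and memory are ignored when GPU resources are requested.
        return {"gpu_type": gpu_type, "gpu_amount": gpu_amount}
    return {"cpus": cpus, "memory": memory}


if __name__ == "__main__":
    print(effective_resources(cpus=3, memory=1024, gpu_type="NVIDIA_K80", gpu_amount=1))
```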

Using Custom AWS IAM Role

In some cases, a model needs to access external services at runtime.
If a deployed build needs to access external AWS resources, you can pass a custom AWS IAM role to the Qwak deployment process.

The IAM role should be created with the following trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::<account-id>:role/qwak-eks-base*"
        }
      }
    }
  ]
}

The IAM role ARN can be passed directly to a deployment using the --iam-role-arn flag. For example:

qwak models deploy realtime \
    --model-id churn_model \
    --build-id 7121b796-5027-11ec-b97c-367dda8b746f \
    --pods 4 \
    --cpus 3 \
    --memory 1024 \
    --timeout 3000 \
    --server-workers 4 \
    --variation-name default \
    --daemon-mode false \
    --iam-role-arn arn:aws:iam::<account-id>:role/<role-name>
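
The trust policy for the custom IAM role can also be generated from an account ID, which avoids hand-editing the JSON. The following is a minimal sketch: `build_trust_policy` is a hypothetical helper, and the account ID in the usage example is a placeholder.

```python
import json


def build_trust_policy(account_id: str) -> dict:
    """Return the trust policy document from this section for one AWS account.

    Hypothetical helper; the policy structure follows the document above.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "ArnLike": {
                        "aws:PrincipalArn": (
                            f"arn:aws:iam::{account_id}:role/qwak-eks-base*"
                        )
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # "123456789012" is a placeholder account ID.
    print(json.dumps(build_trust_policy("123456789012"), indent=2))
```

The printed JSON can be saved to a file and supplied when creating the role.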

Local Deployment

To run the deployment locally using a local Docker engine, use the --local flag. For example:

qwak models deploy realtime \
    --model-id churn_model \
    --build-id 7121b796-5027-11ec-b97c-367dda8b746f \
    --local



You can deploy a build locally only if it was also generated locally, that is, built using the --no-remote flag in the build command.

Performing Inference

Once you have successfully completed a real-time deployment, you can use the Qwak Inference SDK to perform invocations.

In this example, we'll invoke a model via the Python Runtime SDK which comes bundled as part of the Qwak SDK.

Invocation parameters are model-specific. For the following model (assuming it was built and deployed successfully as a real-time endpoint, with the model ID iris_classifier):

from qwak import api, QwakModelInterface
from sklearn import svm, datasets
import pandas as pd

class IrisClassifier(QwakModelInterface):

    def __init__(self):
        self._gamma = 'scale'
        self._model = None

    def build(self):
        # Load the training data
        iris = datasets.load_iris()
        X, y = iris.data, iris.target

        # Train the model
        clf = svm.SVC(gamma=self._gamma)
        self._model = clf.fit(X, y)

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame(data=self._model.predict(df), columns=['species'])

A prediction call from the Qwak Python SDK is:

from qwak.inference.clients import RealTimeClient

model_id = "iris_classifier"
feature_vector = [
    {
        "sepal_width": 3,
        "sepal_length": 3.5,
        "petal_width": 4,
        "petal_length": 5
    }
]

client = RealTimeClient(model_id=model_id)
response = client.predict(feature_vector)
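
Because invocation parameters are model-specific, it can help to check a feature vector before sending it. The sketch below is not part of the Qwak SDK: `validate_feature_vector` is a hypothetical helper, and the expected feature names are taken from the iris example above.

```python
from typing import Dict, List

# Feature names from the iris example; adjust for your own model.
EXPECTED_FEATURES = {"sepal_width", "sepal_length", "petal_width", "petal_length"}


def validate_feature_vector(rows: List[Dict[str, float]]) -> None:
    """Raise ValueError if any row has missing or unexpected feature names.

    Hypothetical helper for illustration; not part of the Qwak SDK.
    """
    for index, row in enumerate(rows):
        names = set(row)
        missing = EXPECTED_FEATURES - names
        unexpected = names - EXPECTED_FEATURES
        if missing or unexpected:
            raise ValueError(
                f"row {index}: missing={sorted(missing)}, "
                f"unexpected={sorted(unexpected)}"
            )


if __name__ == "__main__":
    validate_feature_vector([
        {"sepal_width": 3, "sepal_length": 3.5, "petal_width": 4, "petal_length": 5}
    ])
    print("feature vector is valid")
```

Calling the check just before `client.predict` surfaces naming mistakes locally instead of as a failed request.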

Endpoint Monitoring

Qwak's endpoints are backed by Kubernetes, and Qwak automatically installs advanced monitoring tools for production-grade readiness. Qwak comes bundled with Grafana and Prometheus for dashboarding capabilities, and ElasticSearch for log collection, amongst other tools.

Health metrics for the endpoint appear in the model Overview tab.

In addition, you can follow and search the applicable logs produced by your model in the Logs tab.

Auto Scaling

To attach a new auto scaling configuration to a running model:

  1. Create a config file:
api_version: v1
spec:
  model_id: <model-id>
  variation_name: <variation-name>
  auto_scaling:
    min_replica_count: 1
    max_replica_count: 10
    polling_interval: 30
    cool_down_period: 300
    triggers:
      prometheus_trigger:
        - query_spec:
            metric_type: <cpu/gpu/memory/latency/error_rate/throughput>
            aggregation_type: <min/max/avg/sum>
            time_period: 30
          threshold: 60
  2. Run the following command:
qwak models autoscaling attach -f config.yaml


| Parameter | Description | Default Value |
|---|---|---|
| min_replica_count (integer) | The minimum number of replicas to scale the resource down to. | |
| max_replica_count (integer) | The maximum number of replicas of the target resource. | |
| polling_interval (integer) | The interval at which each trigger is checked. | 30 seconds |
| cool_down_period (integer) | The period to wait after the last trigger reported active before scaling the resource back to 0. | 300 seconds |
| metric_type (prometheus_trigger) | The type of the metric. | cpu/gpu/memory/latency/error_rate/throughput |
| aggregation_type | The type of the aggregation. | min/max/avg/sum |
| time_period (integer) | The period over which to run the query, in minutes. | |
| threshold (integer) | The value at which scaling starts. The unit depends on the metric type: cpu, usage in percent; gpu, usage in percent; memory, value in bytes; latency, value in ms; error_rate, usage in percent; throughput, usage in RPM. | |
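
Before attaching a configuration, you may want to sanity-check its values against the parameter table. The following is a sketch under stated assumptions: `validate_autoscaling` is a hypothetical helper, the allowed metric and aggregation names come from the table above, and the requirement that intervals and thresholds be positive is an assumption.

```python
from typing import List

METRIC_TYPES = {"cpu", "gpu", "memory", "latency", "error_rate", "throughput"}
AGGREGATION_TYPES = {"min", "max", "avg", "sum"}


def validate_autoscaling(
    min_replica_count: int,
    max_replica_count: int,
    metric_type: str,
    aggregation_type: str,
    time_period: int,
    threshold: int,
    polling_interval: int = 30,
    cool_down_period: int = 300,
) -> List[str]:
    """Return a list of problems found in the settings (empty if none).

    Hypothetical helper for illustration; not part of the Qwak SDK.
    """
    problems = []
    if min_replica_count > max_replica_count:
        problems.append("min_replica_count exceeds max_replica_count")
    if metric_type not in METRIC_TYPES:
        problems.append(f"unknown metric_type: {metric_type!r}")
    if aggregation_type not in AGGREGATION_TYPES:
        problems.append(f"unknown aggregation_type: {aggregation_type!r}")
    # Assumption: intervals, periods, and thresholds must be positive.
    for name, value in (
        ("time_period", time_period),
        ("threshold", threshold),
        ("polling_interval", polling_interval),
        ("cool_down_period", cool_down_period),
    ):
        if value <= 0:
            problems.append(f"{name} must be positive")
    return problems


if __name__ == "__main__":
    print(validate_autoscaling(1, 10, "cpu", "avg", time_period=30, threshold=60))
```

Returning a list of problems, rather than raising on the first one, lets you report every issue in a configuration at once.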

What’s Next

Next, let's look at the different options for performing real-time predictions.