Real-Time Deployment
Preface
Qwak can help you set up a real-time Web service. Qwak takes your code and model and wraps them in a lightweight REST API service, enabling them to be easily queried over the Web.
All you need to do is provide a compatible Qwak-based model, and Qwak does the rest.
Qwak sets up all the network requirements and deploys the service to Kubernetes, allowing you to leverage auto-scaling and ensuring that all incoming traffic is accommodated. Qwak adds an authentication token to secure access, along with a suite of monitoring tools that simplify managing your Web service and its performance.
Once you have a successful build, you can opt for a real-time deployment. Real-time deployment uses an HTTPS server to expose the endpoint.
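Under the hood, the deployed endpoint is a plain HTTPS REST API, so it can be reached by any HTTP client. The snippet below is only an illustrative sketch: the endpoint URL, authentication header, and payload shape are placeholder assumptions, not the documented Qwak API. The supported invocation path is the Qwak Inference SDK shown in Performing Inference below.
import requests

# Placeholder values -- the actual endpoint URL format, auth header, and payload
# schema depend on your Qwak environment and model; see "Performing Inference"
# below for the supported SDK-based invocation path.
QWAK_ENDPOINT = "https://<your-qwak-host>/v1/<model-id>/predict"  # hypothetical URL
QWAK_TOKEN = "<authentication-token>"  # the token Qwak issues to secure access

response = requests.post(
    QWAK_ENDPOINT,
    headers={"Authorization": f"Bearer {QWAK_TOKEN}"},
    json=[{"sepal_width": 3, "sepal_length": 3.5, "petal_width": 4, "petal_length": 5}],
    timeout=5,
)
print(response.status_code, response.json())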
Deployment Configuration
Parameter | Description | Default Value |
---|---|---|
Model ID [Required] | The Model ID, as displayed on the model header. | |
Build ID [Required] | The Qwak-assigned build ID. | |
Variation name | The name of the variation to deploy the build on. | default |
Initial number of pods | The number of k8s pods to be used by the deployment. Each pod contains an HTTPS server. A load balancer splits the traffic between them. | 1 |
CPU fraction | A CPU fraction allocated to the pod. The CPU resource is measured in CPU units. One CPU, in Qwak, is equivalent to: 1 AWS vCPU, 1 GCP Core, 1 Azure vCore, or 1 Hyperthread on a bare-metal Intel processor with Hyperthreading. | 1 |
Memory | The RAM memory (in MB) to allocate to each pod. | 512 |
Timeout | The number of milliseconds after which an API server request times out. | 1000 |
Concurrent workers | The number of Gunicorn worker processes for handling requests. A positive integer, generally in the 2-4 x $(NUM_CORES) range. You may want to vary this to find the optimal value for your particular application's workload. | 2 |
Daemon mode | Whether to daemonize the Gunicorn process, detaching the server from the controlling terminal and running it in the background. | Enabled |
IAM role ARN | The user-provided AWS custom IAM role. | None |
Max batch size | The maximum allowed batch size. | 1 |
Timeout | The prediction request timeout. | 5000 milliseconds |
GPU Type | The GPU type to use in the model deployment. Supported options are NVIDIA K80, NVIDIA Tesla V100, NVIDIA T4, and NVIDIA A10. | None |
GPU Amount | The number of GPUs available to the model deployment. Varies based on the selected GPU type. | Based on GPU Type |
Real-Time Deployment from the App
To deploy a real-time model from the UI:
- In the left navigation bar in the Qwak UI, select Projects.
- Select a project and then select a model.
- Select the Builds tab. Find a build to deploy and click the deployment toggle. The Deploy dialog box appears.
- Select Realtime and then select Next.
- Configure your real-time deployment by selecting the initial number of pods, CPU fraction, and memory.
The Advanced settings page of the real-time deployment configuration includes additional options, such as injecting environment variables, setting invocation timeouts, and more.
Real-Time Deployment from the CLI
To deploy a model in real-time mode from the CLI, populate the following command template:
qwak models deploy realtime \
--model-id <model-id> \
--build-id <build-id> \
--pods <pods-count> \
--cpus <cpus-fraction> \
--memory <memory-size> \
--timeout <timeout> \
--server-workers <workers> \
--variation-name <variation-name> \
--daemon-mode <bool>
For example, for the model built in the Getting Started with Qwak section, the deployment command is:
qwak models deploy realtime \
--model-id churn_model \
--build-id 7121b796-5027-11ec-b97c-367dda8b746f \
--pods 4 \
--cpus 3 \
--memory 1024 \
--timeout 3000 \
--server-workers 4 \
--variation-name default \
--daemon-mode false
Note
By default, the deployment command is executed asynchronously: it triggers the deployment but does not wait for it to complete. To execute the command synchronously, use the --sync flag.
Deploying with GPU Resources
You can choose to use GPU resources in your deployment by supplying the GPU type and GPU amount in the deployment request. For example:
qwak models deploy realtime \
--model-id churn_model \
--build-id 7121b796-5027-11ec-b97c-367dda8b746f \
--pods 4 \
--gpu-type NVIDIA_K80 \
--gpu-amount 1 \
--memory 1024 \
--timeout 3000 \
--server-workers 4 \
--variation-name default \
--daemon-mode false
When passing GPU resources in the deployment request, the CPU and memory parameters are ignored, because they are defined automatically by the GPU type and GPU amount you select.
Using a Custom AWS IAM Role
In some cases, a model needs to access external services at runtime.
If a build needs to access external AWS resources, a custom AWS IAM role can be passed to the Qwak deployment process.
The IAM role should be created with the following trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::<account-id>:role/qwak-eks-base*"
        }
      }
    }
  ]
}
The IAM role ARN can be passed directly to a deployment using the --iam-role-arn flag. For example:
qwak models deploy realtime \
--model-id churn_model \
--build-id 7121b796-5027-11ec-b97c-367dda8b746f \
--pods 4 \
--cpus 3 \
--memory 1024 \
--timeout 3000 \
--server-workers 4 \
--variation-name default \
--daemon-mode false \
--iam-role-arn arn:aws:iam::<account-id>:role/<role-name>
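Once the deployment is running with the custom role, model code can use the role's permissions through the standard AWS SDK, with no credentials embedded in the build. The snippet below is a minimal sketch, assuming the role grants read access to a hypothetical S3 bucket; the bucket and key names are placeholders.
import boto3

# Minimal sketch: assumes the custom IAM role attached to the deployment grants
# s3:GetObject on this (hypothetical) bucket. boto3 resolves the pod's role
# credentials automatically, so no access keys are needed in the model code.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-feature-store-bucket", Key="lookup/features.csv")
lookup_table = obj["Body"].read()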
Local Deployment
To run the deployment locally using a local Docker engine, use the --local flag. For example:
qwak models deploy realtime \
--model-id churn_model \
--build-id 7121b796-5027-11ec-b97c-367dda8b746f \
--local
Note
You can only deploy a build locally if it was generated locally, that is, built using the --no-remote flag in the build command.
Performing Inference
Once you have successfully completed a real-time deployment, you can use the Qwak Inference SDK to perform invocations.
In this example, we'll invoke a model via the Python Runtime SDK, which comes bundled as part of the Qwak SDK.
Invocation parameters are model specific. For the following model (assuming it was built and deployed successfully as a real-time endpoint, with the model ID iris_classifier):
from qwak import api, QwakModelInterface
from sklearn import svm, datasets
import pandas as pd


class IrisClassifier(QwakModelInterface):

    def __init__(self):
        self._gamma = 'scale'
        self._model = None

    def build(self):
        # Load the training data
        iris = datasets.load_iris()
        X, y = iris.data, iris.target

        # Train the model
        clf = svm.SVC(gamma=self._gamma)
        self._model = clf.fit(X, y)

    @api()
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame(data=self._model.predict(df), columns=['species'])
A prediction call from the Qwak Python SDK is:
from qwak.inference.clients import RealTimeClient

model_id = "iris_classifier"
feature_vector = [
    {
        "sepal_width": 3,
        "sepal_length": 3.5,
        "petal_width": 4,
        "petal_length": 5
    }
]

client = RealTimeClient(model_id=model_id)
response = client.predict(feature_vector)
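The returned response carries the predictions produced by the model's predict method, i.e. the predicted species for each submitted feature vector. As a rough sketch (the exact serialization is an assumption here, so adjust to your model's output schema):
# Assumption: the SDK returns the predictions as a JSON-like structure mirroring
# the 'species' column of the DataFrame returned by predict().
print(response)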
Endpoint Monitoring
Qwak's endpoints are backed by Kubernetes, and Qwak automatically installs advanced monitoring tools for production-grade readiness. Qwak comes bundled with Grafana and Prometheus for dashboarding capabilities, and ElasticSearch for log collection, amongst other tools.
Health metrics for the deployment appear in the model Overview tab.
In addition, you can follow and search the logs produced by your model in the Logs tab.
Auto Scaling
To attach a new auto-scaling policy to a running model:
- Create a config file:
api_version: v1
spec:
  model_id: <model-id>
  variation_name: <variation-name>
  auto_scaling:
    min_replica_count: 1
    max_replica_count: 10
    polling_interval: 30
    cool_down_period: 300
    triggers:
      prometheus_trigger:
        - query_spec:
            metric_type: <cpu/gpu/memory/latency/error_rate/throughput>
            aggregation_type: <min/max/avg/sum>
            time_period: 30
          threshold: 60
- Run the following command:
qwak models autoscaling attach -f config.yaml
Configuration
Parameter | Description | Default Value |
---|---|---|
min_replica_count (integer) | The minimum number of replicas to scale the resource down to. | |
max_replica_count (integer) | The maximum number of replicas of the target resource. | |
polling_interval (integer) | The interval, in seconds, at which each trigger is checked. | 30 seconds |
cool_down_period (integer) | The period to wait after the last trigger reported active before scaling the resource back to 0. | 300 seconds |
metric_type (prometheus_trigger) | The type of the metric. One of cpu, gpu, memory, latency, error_rate, or throughput. | |
aggregation_type (prometheus_trigger) | The type of the aggregation. One of min, max, avg, or sum. | |
time_period (integer) (prometheus_trigger) | The period, in minutes, over which the query is run. | |
threshold (integer) (prometheus_trigger) | The value at which to start scaling: cpu - usage in percent; gpu - usage in percent; memory - value in bytes; latency - value in ms; error_rate - percent; throughput - RPM. | |
Next, let's look at the different options for performing real-time predictions.