Streaming deployments let you easily connect Kafka streams with your models to perform real-time inference.
Streaming deployments are useful for processing large amounts of distributed data while avoiding complex triggering and scheduling architectures as fresh data arrives. A streaming deployment consumes messages from a Kafka topic and produces predictions into a Kafka topic of your choice.
Consider using streaming deployments in the following cases:
- Event-driven applications: Respond in real-time to specific events or triggers.
- Real-time decision-making: Process data as it arrives in time-sensitive systems.
- Scalable processing: Process large data volumes from multiple sources and ensure efficient data handling.
The following examples show how to use streaming inference on Qwak.
Deploying a streaming model from the Qwak UI takes just a few clicks:
- Build a new model
- Click Deploy and choose Streaming
- Select the instance type and configure the Kafka server and topics
- Click deploy and your model will be live!
To use this example, replace the --model-id value with your own model ID on Qwak.
```shell
# Deploy a FLAN-T5 streaming model on 2 A10.XL on-demand instances
qwak models deploy stream \
  --model-id "flan-t5" \
  --build-id "0511ffa9-986e-4fd9-ae29-515269af5bac" \
  --instance "gpu.a10.xl" \
  --replicas 2 \
  # Use only on-demand instances for executors
  --purchase-option "on-demand" \
  --consumer-bootstrap-server <bootstrap-server> \
  --consumer-topic <consumer-topic-name> \
  --producer-bootstrap-server <bootstrap-server> \
  --producer-topic <producer-topic-name>
```
```shell
# Deploy a churn streaming model on 2 small instances
qwak models deploy stream \
  --model-id "stream-churn-model" \
  --build-id "d4d56c1a-f326-40a0-88f3-f8e59c691a3e" \
  --instance "small" \
  --replicas 2 \
  --consumer-bootstrap-server "10.0.0.8" \
  --consumer-topic "model-input-topic" \
  --producer-bootstrap-server "10.0.0.9" \
  --producer-topic "model-output-topic"
```
Alternatively, configure streaming inference via YAML, as in the following example deployment configuration:
```yaml
build_id: "d4d56c1a-f326-40a0-88f3-f8e59c691a3e"
model_id: "stream-churn-model"
resources:
  instance_size: "small"
  pods: 2
stream:
  consumer_bootstrap_server: "10.0.0.8"
  consumer_topic: "model-input-topic"
  producer_bootstrap_server: "10.0.0.9"
  producer_topic: "model-output-topic"
```
Then deploy it with the following command:
qwak models deploy stream --from-file stream_deploy_config.yaml
Additionally, you can mix and match command-line arguments with a configuration file. In this case, command-line arguments take precedence and override the values specified in the YAML file.
qwak models deploy stream --replicas 5 --from-file stream_deploy_config.yaml
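The precedence rule can be illustrated with a short Python sketch. This is only a simplification of the merge semantics; the key names are placeholders, not Qwak internals:

```python
# Illustration of CLI-over-YAML precedence (not Qwak's actual implementation).
# Values loaded from stream_deploy_config.yaml:
yaml_config = {
    "replicas": 2,
    "instance_size": "small",
    "consumer_topic": "model-input-topic",
}

# Values passed on the command line, e.g. --replicas 5:
cli_args = {"replicas": 5}

# Later entries win, so CLI arguments override YAML values.
effective = {**yaml_config, **cli_args}
print(effective["replicas"])  # 5
```

Here the deployment would run with 5 replicas, while the instance size and consumer topic still come from the YAML file.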
Qwak models run on Kubernetes, where we automatically install advanced production-grade monitoring tools.
Under the Model Overview tab, you can view the following metrics for your deployed build:
- Average message throughput
- Total consumed messages
- Total produced messages
- Consumer lag
- Errors consuming messages
- Errors producing messages
- Memory, CPU and GPU utilization
Qwak's streaming deployments allow you to configure multiple consumer and producer specifications for more granular control.
A list of Kafka bootstrap servers to consume messages from for model inference.
The Kafka topic to consume messages for model inference.
The name of the consumer group the model inference is attached to.
Used to identify a group of consumers that belong to the same consumer group. Each message in a partition is delivered to only one consumer within the same group, enabling load balancing and parallel processing.
Defines what happens when a consumer first joins a consumer group or when the consumer is unable to find a valid offset (e.g., because the offset doesn't exist or has been deleted). It determines whether the consumer should start reading messages from the earliest available offset ("earliest") or from the latest offset ("latest") in the topic.
The maximum amount of time a consumer can be inactive before it is considered dead or no longer part of the consumer group. If a consumer does not send a heartbeat to Kafka within this timeout period, it may be removed from the group, and its partitions will be reassigned to other consumers.
The maximum number of records the consumer will attempt to fetch in a single poll request to Kafka, which can affect throughput and latency.
The maximum amount of time the consumer is willing to wait in a single poll request for new records to be available in the topic. If no new records are available within this time limit, the poll request will return empty.
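The consumer parameters described above correspond closely to standard Kafka consumer settings. A minimal sketch of the equivalent client-side configuration, using standard Kafka property names with illustrative placeholder values (the exact Qwak flag names may differ):

```python
# Standard Kafka consumer properties mirroring the parameters above.
# All values are illustrative placeholders, not recommendations.
consumer_config = {
    "bootstrap.servers": "10.0.0.8:9092",        # bootstrap server list
    "group.id": "stream-churn-model-consumers",  # consumer group name
    "auto.offset.reset": "earliest",             # or "latest"
    "session.timeout.ms": 30_000,                # inactivity before the consumer
                                                 # is removed from the group
    "max.poll.records": 500,                     # records fetched per poll
}

# A Kafka client (e.g. confluent-kafka or kafka-python) would accept an
# equivalent configuration when subscribing to the model's input topic;
# the poll timeout is typically passed to the poll() call itself.
```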
A list of Kafka bootstrap servers to produce model inference outputs to.
The Kafka topic to produce messages to after model inference.
Determines how the messages sent by the producer are compressed before being stored in Kafka. Compressing messages can significantly reduce the amount of data transmitted over the network and the storage requirements in Kafka. Kafka supports different compression types, and the producer can choose one of the following options:
Messages are not compressed at all. This option is suitable when message size is not a concern, and the data is already compressed or highly optimized.
Messages are compressed using the gzip algorithm. Gzip provides a good balance between compression ratio and processing overhead.
Messages are compressed using the Snappy compression algorithm. Snappy is faster than gzip but may not achieve the same compression ratio.
Messages are compressed using the LZ4 compression algorithm. LZ4 is faster than both gzip and Snappy but may have a slightly lower compression ratio.
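To get a feel for the trade-off, here is a quick standard-library sketch comparing payload sizes with and without gzip compression. Results vary with the data; this only demonstrates that repetitive payloads, typical of structured inference messages, shrink substantially:

```python
import gzip

# A repetitive JSON-like payload, typical of model inference output messages.
payload = b'{"prediction": 0.97, "model": "stream-churn-model"}' * 100

compressed = gzip.compress(payload)

print(f"raw: {len(payload)} bytes, gzip: {len(compressed)} bytes")
# Compression helps most on repetitive data; highly random or
# already-compressed payloads may not shrink at all.
```

Snappy and LZ4 trade some of gzip's compression ratio for lower CPU overhead, which matters at high message throughput.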