Getting Started
Connect data sources and create your first feature set in minutes!
In this tutorial, we'll walk you through training a machine learning model to predict credit risk using credit amount and personal data features.
We'll retrieve the necessary training data from the Qwak Feature Store using the code examples below.
Creating a feature set
Before diving in, we recommend going through the feature store overview to gain a better understanding of data sources, entities, and feature sets.
Configuring Qwak SDK
The first step is configuring your personal API key.
Log into Qwak and copy your API key from the settings page.
Open the terminal and type in the following command:
qwak configure
Paste your personal API key when prompted:
Please enter your API key:
Alternatively, configure the API key using a single command.
Replace <YOUR_API_KEY> with your API key, and make sure to wrap the token in double quotation marks ("):
qwak configure --api-key "<YOUR_API_KEY>"
After successfully configuring your API key, the below message will appear:
User successfully configured
If your local system doesn't recognize the qwak command, add it to your system's PATH.
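One way to check whether the qwak command is reachable is to look it up on the PATH from Python. The snippet below is a minimal sketch using only the standard library; the helper name find_cli is hypothetical and is not part of the Qwak SDK.

```python
import shutil
from typing import Optional


def find_cli(command: str = "qwak") -> Optional[str]:
    """Return the full path of `command` if it is on the PATH, else None."""
    return shutil.which(command)


location = find_cli()
if location is None:
    # The executable's install directory still needs to be added to PATH.
    print("qwak was not found on PATH")
else:
    print(f"qwak found at {location}")
```

If the lookup returns nothing, the directory that contains the qwak executable (often the bin or Scripts folder of your Python environment) is the one to add to PATH.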
Defining a data source
First, we will define a data source using a CSV file stored in a public S3 bucket.
Let's create an empty Python file, for example feature_set.py, and put the data source definition inside it:
from qwak.feature_store.data_sources import CsvSource, AnonymousS3Configuration
csv_source = CsvSource(
name='credit_risk_data',
description='A dataset of personal credit details',
date_created_column='date_created',
path='s3://qwak-public/example_data/data_credit_risk.csv',
filesystem_configuration=AnonymousS3Configuration(),
quote_character='"',
escape_character='"'
)
Access public S3 buckets using AnonymousS3Configuration. Buckets such as qwak-public or nyc-tlc do not require credentials.
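The quote_character and escape_character settings in the definition above are both set to a double quote. Assuming they follow the common CSV convention in which a quote inside a quoted field is escaped by doubling it (an assumption, not something this tutorial confirms), the standard csv module can illustrate how such a row is read. The sample row is hypothetical.

```python
import csv
import io

# Hypothetical two-line CSV: the second field contains a comma and quotes.
raw = 'user_id,purpose\nu1,"furniture, ""new"" equipment"\n'

# quotechar wraps fields; doublequote=True treats "" inside a field as one ".
reader = csv.reader(io.StringIO(raw), quotechar='"', doublequote=True)
rows = list(reader)

for row in rows:
    print(row)
```

The quoted field is read as a single value even though it contains a comma, which is the behaviour these two settings usually control.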
Before exploring the feature set data sample, install a version of pandas that suits your project best. If you have no version requirements, simply install the latest version.
pip install pandas
Exploring the data source
Explore the ingested data by running the get_sample method:
# Get and print a sample from your live data source
pandas_df = csv_source.get_sample()
print(pandas_df)
The output should look like the following:
age sex job housing saving_account checking_account credit_amount duration purpose risk user_id date_created
0 67 male 2 own None little 1169 6 radio/TV good baf1aed9-b16a-46f1-803b-e2b08c8b47de 1609459200000
1 22 female 2 own little moderate 5951 48 radio/TV bad 574a2cb7-f3ae-48e7-bd32-b44015bf9dd4 1609459200000
2 49 male 1 own little None 2096 12 education good 1b044db3-3bd1-4b71-a4e9-336210d6503f 1609459200000
3 45 male 2 free little little 7882 42 furniture/equipment good ac8ec869-1a05-4df9-9805-7866ca42b31c 1609459200000
4 53 male 2 free little little 4870 24 car bad aa974eeb-ed0e-450b-90d0-4fe4592081c1 1609459200000
5 35 male 1 free None None 9055 36 education good 7b3d019c-82a7-42d9-beb8-2c57a246ff16 1609459200000
6 53 male 2 own quite rich None 2835 24 furniture/equipment good 6bc1fd70-897e-49f4-ae25-960d490cb74e 1609459200000
7 35 male 3 rent little moderate 6948 36 car good 193158eb-5552-4ce5-92a4-2a966895bec5 1609459200000
8 61 male 1 own rich None 3059 12 radio/TV good 759b5b46-dbe9-40ef-a315-107ddddc64b5 1609459200000
9 28 male 3 own little moderate 5234 30 car bad e703c351-41a8-43ea-9615-8605da7ee718 1609459200000
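The date_created values in the sample look like Unix timestamps in milliseconds. Under that assumption, the following sketch converts one of them to a readable UTC datetime; the helper name epoch_ms_to_utc is illustrative only.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(epoch_ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# Value taken from the date_created column of the sample above.
print(epoch_ms_to_utc(1609459200000).isoformat())  # 2021-01-01T00:00:00+00:00
```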
Defining an entity
Every feature set is associated with an Entity, which represents the business entity the feature set describes. In this example, the entity is a user.
Feature vectors in the Feature Store are identified by the entity's key.
In the file we created earlier, we add the Entity configuration:
from qwak.feature_store.entities.entity import Entity
entity = Entity(
name='user',
description='A User ID'
)
Defining a feature set
Finally, we are able to define the feature set. Feature set configuration consists of several elements:
1. Feature set type
Qwak cloud currently supports only Batch feature sets, which are defined by using the @batch decorator.
2. Connecting data sources and entities to feature sets
Data sources and entities connect with feature sets by their name.
3. Backfill policy
The backfill policy determines how, and starting from which date and time, the new feature set is populated with historical values.
4. Scheduling policy
The scheduling policy defines the feature set data freshness, i.e. how often new data is fetched and processed.
5. Data transformation
Qwak cloud supports Spark SQL transformations to transform the ingested data into features.
In our Python file, we add the following @batch decorators to create a new feature set:
from datetime import datetime
from qwak.feature_store.feature_sets import batch
from qwak.feature_store.feature_sets.transformations import SparkSqlTransformation
@batch.feature_set(
name="user-credit-risk-features",
entity="user",
data_sources=["credit_risk_data"],
)
@batch.metadata(
owner="John Doe",
display_name="User Credit Risk Features",
description="Features describing user credit risk",
)
@batch.scheduling(cron_expression="0 0 * * *")
@batch.backfill(start_date=datetime(2015, 1, 1))
def user_features():
    return SparkSqlTransformation(
        """
        SELECT user_id as user,
               age,
               sex,
               job,
               housing,
               saving_account,
               checking_account,
               credit_amount,
               duration,
               purpose,
               date_created
        FROM credit_risk_data
        """
    )
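The scheduling decorator above takes a cron expression. Assuming the standard five-field cron syntax, "0 0 * * *" means minute 0 of hour 0 on every day, i.e. once a day at midnight (the time zone used by the scheduler is not stated here). The sketch below only labels the fields; describe_cron is a hypothetical helper, not a Qwak function.

```python
from typing import Dict

# Field order of a standard five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> Dict[str, str]:
    """Map each field of a five-field cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


print(describe_cron("0 0 * * *"))
```

Changing the second field, for example to "0 6 * * *", would move the daily run to 06:00 under the same assumption.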
Registering a feature set
Use the qwak CLI to register the feature set.
Run this command in the same directory where your feature_set.py is located:
qwak features register
- An optional -p parameter allows you to define the path of your working directory. By default, it is the current working directory.
- The CLI reads all Python files in this directory to find all feature set definitions. To speed up the process, we recommend separating feature configuration folders from the rest of your code.
During execution, you will be prompted to create the user entity, the credit_risk_data data source, and the user-credit-risk-features feature set.
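Because the CLI reads the Python files in the working directory, it can help to see which files that covers before registering. The sketch below lists the .py files directly inside a directory; whether the CLI also descends into subdirectories is not stated here, so this helper (python_files, a hypothetical name) looks at the top level only.

```python
from pathlib import Path
from typing import List


def python_files(directory: str = ".") -> List[Path]:
    """Return the .py files directly inside `directory`, sorted by name."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix == ".py"
    )


for path in python_files("."):
    print(path.name)
```

A short list here usually means the feature definitions are already isolated from the rest of the project code.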
Exploring feature set data
To test or explore features before registering them, use the get_sample method of the feature set:
# Get a live sample of your ingested data from the feature store
df = user_features.get_sample()
print(df)
The output should look like the following:
user age sex job housing saving_account checking_account credit_amount duration purpose date_created
0 baf1aed9-b16a-46f1-803b-e2b08c8b47de 67 male 2 own None little 1169 6 radio/TV 1609459200000
1 574a2cb7-f3ae-48e7-bd32-b44015bf9dd4 22 female 2 own little moderate 5951 48 radio/TV 1609459200000
2 1b044db3-3bd1-4b71-a4e9-336210d6503f 49 male 1 own little None 2096 12 education 1609459200000
3 ac8ec869-1a05-4df9-9805-7866ca42b31c 45 male 2 free little little 7882 42 furniture/equipment 1609459200000
4 aa974eeb-ed0e-450b-90d0-4fe4592081c1 53 male 2 free little little 4870 24 car 1609459200000
5 7b3d019c-82a7-42d9-beb8-2c57a246ff16 35 male 1 free None None 9055 36 education 1609459200000
6 6bc1fd70-897e-49f4-ae25-960d490cb74e 53 male 2 own quite rich None 2835 24 furniture/equipment 1609459200000
7 193158eb-5552-4ce5-92a4-2a966895bec5 35 male 3 rent little moderate 6948 36 car 1609459200000
8 759b5b46-dbe9-40ef-a315-107ddddc64b5 61 male 1 own rich None 3059 12 radio/TV 1609459200000
9 e703c351-41a8-43ea-9615-8605da7ee718 28 male 3 own little moderate 5234 30 car 1609459200000
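The user column in this output comes from the "user_id as user" alias in the transformation. To preview that kind of renaming without Spark, the sketch below runs a shortened version of the query against an in-memory SQLite table. SQLite is only a stand-in for Spark SQL here, the table holds a single row copied from the sample, and only a few of the columns are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit_risk_data "
    "(user_id TEXT, age INTEGER, credit_amount INTEGER, date_created INTEGER)"
)
# One row taken from the sample output above.
conn.execute(
    "INSERT INTO credit_risk_data VALUES (?, ?, ?, ?)",
    ("baf1aed9-b16a-46f1-803b-e2b08c8b47de", 67, 1169, 1609459200000),
)

# The alias is quoted so it is always treated as a plain column name.
cursor = conn.execute(
    'SELECT user_id AS "user", age, credit_amount, date_created '
    "FROM credit_risk_data"
)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(columns)
print(rows)
```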
Training models with the offline store
Let's use the Qwak feature store data during model training.
We need to import the OfflineClient and use the get_feature_values method.
Creating a new model
Let's create a new model directory and start training our model. Before we start, we must have a few dependencies installed.
If you are using Conda, please specify the dependencies in the conda.yml file. Otherwise, use any other tool you find suitable.
We will use the CatBoost library, so our dependencies look like this:
name: CreditRisk
channels:
- defaults
- conda-forge
dependencies:
- python=3.9
- pip
- pandas
- scikit-learn
- catboost
Defining a new model
Next we define the model parameters in the constructor:
import pandas as pd
import qwak
from catboost import CatBoostRegressor
from qwak.model.base import QwakModel


class CreditRiskModel(QwakModel):

    def __init__(self):
        self.model = CatBoostRegressor(
            iterations=1000,
            loss_function='RMSE',
            learning_rate=None
        )
In the build function, we are going to do several things:
- We retrieve the data from the feature store
- Log the training parameters
- Extract the relevant features from the dataset
- Deal with missing values. In this example, we will drop the rows with missing data. That isn't a recommended method of data preprocessing, but we want to focus on showing how to use Qwak.
- Split the dataset into training and validation parts
- Train the model
- Run cross-validation to evaluate the model
- Log the performance metrics
    def build(self):
        import numpy as np
        import pandas as pd
        from catboost import Pool, cv
        from qwak import log_metric, log_param
        from qwak.feature_store.offline.client import OfflineClient
        from sklearn.model_selection import train_test_split

        # Request every feature that is renamed and used below
        key_to_features = {'user': [
            'user-credit-risk-features.checking_account',
            'user-credit-risk-features.age',
            'user-credit-risk-features.job',
            'user-credit-risk-features.duration',
            'user-credit-risk-features.credit_amount',
            'user-credit-risk-features.housing',
            'user-credit-risk-features.purpose',
            'user-credit-risk-features.saving_account',
            'user-credit-risk-features.sex',
        ]}

        # Example population with a single user; pass all entity keys you need
        population = pd.DataFrame(
            columns=['user', 'timestamp'],
            data=[['06cc255a-aa07-4ec9-ac69-b896ccf05322', '2021-01-01 00:00:00']]
        )

        offline_client = OfflineClient()
        data = offline_client.get_feature_values(
            entity_key_to_features=key_to_features,
            population=population
        )

        log_param({"iterations": 1000, "loss_function": "RMSE"})

        # Strip the feature set prefix from the column names
        data = data.rename(
            columns={
                "user-credit-risk-features.checking_account": "checking_account",
                "user-credit-risk-features.age": "age",
                "user-credit-risk-features.job": "job",
                "user-credit-risk-features.duration": "duration",
                "user-credit-risk-features.credit_amount": "credit_amount",
                "user-credit-risk-features.housing": "housing",
                "user-credit-risk-features.purpose": "purpose",
                "user-credit-risk-features.saving_account": "saving_account",
                "user-credit-risk-features.sex": "sex",
            }
        )

        data = data[
            [
                "checking_account",
                "age",
                "job",
                "credit_amount",
                "housing",
                "purpose",
                "saving_account",
                "sex",
                "duration",
            ]
        ]

        # In production, we should fill the missing values, but we don't have
        # a second data source for the missing data, so let's drop them
        data = data.dropna()

        x = data[
            [
                "checking_account",
                "age",
                "job",
                "credit_amount",
                "housing",
                "purpose",
                "saving_account",
                "sex",
            ]
        ]
        y = data[["duration"]]

        x_train, x_test, y_train, y_test = train_test_split(
            x, y, train_size=0.85, random_state=42
        )

        # Treat every non-integer column as a categorical feature
        cate_features_index = np.where(x_train.dtypes != int)[0]

        self.model.fit(
            x_train, y_train, cat_features=cate_features_index, eval_set=(x_test, y_test)
        )

        cv_data = cv(
            Pool(x, y, cat_features=cate_features_index),
            self.model.get_params(),
            fold_count=5,
        )

        # RMSE is an error metric, so the best iteration has the lowest mean
        best_row = cv_data[
            cv_data["test-RMSE-mean"] == np.min(cv_data["test-RMSE-mean"])
        ]

        log_metric(
            {
                "val_rmse_mean": best_row["test-RMSE-mean"].iloc[0],
                "val_rmse_std": best_row["test-RMSE-std"].iloc[0],
            }
        )
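The metrics logged at the end of the build function are root mean squared errors. As a reminder of what that number measures, here is a minimal standard-library sketch of RMSE: the square root of the average squared difference between actual and predicted values. The function name rmse and the input values are illustrative; CatBoost computes the metric internally.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Perfect predictions give an error of zero.
print(rmse([6, 48, 12], [6, 48, 12]))
# Errors of 3 and 4 give sqrt((9 + 16) / 2).
print(rmse([0, 0], [3, 4]))
```

Because RMSE is expressed in the same unit as the target, a lower value means the predictions are closer to the actual values.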
If the client application sends the features in the correct order, we can have a quite simple predict function looking like this:
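    @qwak.api()
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        # Wrap the raw predictions so the return value matches the annotation
        return pd.DataFrame(self.model.predict(df), columns=["duration"])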
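If the client cannot guarantee the feature order, the caller (or a thin wrapper) can reorder each record before prediction. The sketch below assumes requests arrive as dictionaries keyed by feature name; TRAINING_COLUMNS mirrors the columns selected for x in the build function, and order_features is a hypothetical helper rather than part of the Qwak SDK.

```python
from typing import Any, Dict, List

# Column order used for x in the build function.
TRAINING_COLUMNS = [
    "checking_account",
    "age",
    "job",
    "credit_amount",
    "housing",
    "purpose",
    "saving_account",
    "sex",
]


def order_features(record: Dict[str, Any], columns: List[str] = TRAINING_COLUMNS) -> List[Any]:
    """Return the record's values in training order, failing on missing features."""
    missing = [column for column in columns if column not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [record[column] for column in columns]


# Values taken from row 1 of the sample data, deliberately out of order.
request = {
    "sex": "female",
    "age": 22,
    "purpose": "radio/TV",
    "job": 2,
    "housing": "own",
    "credit_amount": 5951,
    "saving_account": "little",
    "checking_account": "moderate",
}
print(order_features(request))
```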
Inference with the online feature store
We don't need to pass all features to the predict function if we store the data in a feature store. In our example, we could send only the user_id, and the model can retrieve all data from Qwak Data Lake.
We can do it in two ways: by explicitly writing the feature retrieval code or by using the features_extraction decorator. Below, we will show both methods.
Using the online client
In the predict function, we create an instance of the OnlineClient.
We then create the ModelSchema with the feature set and the required feature values.
We create a data frame containing the user identifiers and pass it to the feature store. In response, we get a pandas DataFrame containing the requested features.
import pandas as pd
from qwak.feature_store.online.client import OnlineClient
from qwak.model.schema import ModelSchema
from qwak.model.schema_entities import Entity, FeatureStoreInput
entity = Entity(name="user_id", type=str)
model_schema = ModelSchema(
entities=[entity],
inputs=[
FeatureStoreInput(name='user-credit-risk-features.checking_account', entity=entity),
FeatureStoreInput(name='user-credit-risk-features.age', entity=entity),
FeatureStoreInput(name='user-credit-risk-features.job', entity=entity),
FeatureStoreInput(name='user-credit-risk-features.duration', entity=entity),
FeatureStoreInput(name='user-credit-risk-features.credit_amount', entity=entity),
FeatureStoreInput(name='user-credit-risk-features.housing', entity=entity),
FeatureStoreInput(name='user-credit-risk-features.purpose', entity=entity),
FeatureStoreInput(name='user-credit-risk-features.saving_account', entity=entity),
FeatureStoreInput(name='user-credit-risk-features.sex', entity=entity),
])
online_client = OnlineClient()
df = pd.DataFrame(columns=['user_id'],
data =[ '06cc255a-aa07-4ec9-ac69-b896ccf05322'])
user_features = online_client.get_feature_values(model_schema, df)
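The feature names in the schema above follow a "<feature set name>.<feature name>" pattern. Rather than typing each one by hand, the qualified names can be generated from a list. This is a small standard-library sketch; qualified_names and short_name are hypothetical helpers.

```python
from typing import List

FEATURE_SET = "user-credit-risk-features"
FEATURES = [
    "checking_account",
    "age",
    "job",
    "duration",
    "credit_amount",
    "housing",
    "purpose",
    "saving_account",
    "sex",
]


def qualified_names(feature_set: str, features: List[str]) -> List[str]:
    """Prefix each feature with its feature set name."""
    return [f"{feature_set}.{feature}" for feature in features]


def short_name(qualified: str) -> str:
    """Drop the feature set prefix, keeping only the feature name."""
    return qualified.split(".", 1)[1]


names = qualified_names(FEATURE_SET, FEATURES)
print(names[0])
print(short_name(names[0]))
```

The same list could feed both the FeatureStoreInput entries and the column renaming step used during training, which keeps the two in sync.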
Deleting Items
Deleting data sources
To delete a data source, use the following command in the terminal:
qwak features delete --data-source <data source name>
A data source that is linked to a feature set cannot be deleted.
Deleting entities
To delete an entity, use the following command in the terminal:
qwak features delete --entity <entity name>
An entity that is linked to a feature set cannot be deleted.
Deleting feature sets
To delete a feature set, use the following command in the terminal:
qwak features delete --feature-set <feature set name>
Deleting a feature set will delete all related data in the offline and online feature stores.
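Since a data source or entity that is linked to a feature set cannot be deleted, a full cleanup has to remove the feature set first. The sketch below only builds the three commands from this section in that order and prints them; it does not run anything. The function name deletion_commands is hypothetical, and the names passed in are the ones created in this tutorial.

```python
from typing import List


def deletion_commands(feature_set: str, entity: str, data_source: str) -> List[List[str]]:
    """Build the delete commands in dependency order: feature set first."""
    return [
        ["qwak", "features", "delete", "--feature-set", feature_set],
        ["qwak", "features", "delete", "--entity", entity],
        ["qwak", "features", "delete", "--data-source", data_source],
    ]


commands = deletion_commands("user-credit-risk-features", "user", "credit_risk_data")
for command in commands:
    # Printed for review only; nothing is executed here.
    print(" ".join(command))
```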