Getting Started

In this tutorial, we'll walk you through training a machine learning model to predict credit risk using credit amount and personal data features.

We'll retrieve the necessary training data from the Qwak Feature Store using the code examples below.

Creating a feature set

Before diving in, we recommend going through the feature store overview to gain a better understanding of data sources, entities, and feature sets.

Configuring Qwak SDK

The first step is configuring your personal API key.

Log into Qwak and copy your API key from the settings page.


Open the terminal and type in the following command:

qwak configure

Paste your personal API key when prompted:

Please enter your API key:

Alternatively, configure the API key using a single command.

Replace <YOUR_API_KEY> with your API key and ensure that you wrap the token in double quotation (") marks:

qwak configure --api-key "<YOUR_API_KEY>"
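In scripts or CI jobs, the same single-command form can be called from Python. The sketch below is only an illustration: it assumes the key is kept in an environment variable named QWAK_API_KEY (a hypothetical name chosen here, not something the Qwak CLI reads on its own) and uses only the `--api-key` flag shown above. Passing the arguments as a list avoids any shell quoting issues.

```python
import os
import subprocess


def build_configure_command(api_key: str) -> list[str]:
    """Return the argument list for `qwak configure --api-key <key>`."""
    if not api_key:
        raise ValueError("API key must not be empty")
    # Arguments are passed as a list, so no shell quoting is needed.
    return ["qwak", "configure", "--api-key", api_key]


def configure_from_env(var_name: str = "QWAK_API_KEY") -> None:
    """Run the configure command with a key read from the environment.

    QWAK_API_KEY is a hypothetical variable name used for this sketch.
    """
    api_key = os.environ.get(var_name, "")
    subprocess.run(build_configure_command(api_key), check=True)
```

Calling `configure_from_env()` fails early with a ValueError when the variable is unset, before the CLI is invoked.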

After the API key is configured successfully, the following message appears:

User succesfully configured

🚧
In case your local system doesn't recognize the qwak command, add it to your system's PATH.

Defining a data source

First, we will define a data source using a CSV file stored in a public S3 bucket.

Let's create an empty Python file, for example feature_set.py, and put the data source definition inside it:

from qwak.feature_store.data_sources import CsvSource, AnonymousS3Configuration

csv_source = CsvSource(
    name='credit_risk_data',
    description='A dataset of personal credit details',
    date_created_column='date_created',
    path='s3://qwak-public/example_data/data_credit_risk.csv',
    filesystem_configuration=AnonymousS3Configuration(),
    quote_character='"',
    escape_character='"'
)

🚧
Access public S3 buckets using AnonymousS3Configuration.
Buckets such as qwak-public or nyc-tlc do not require credentials.
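The path argument is a standard S3 URI, made of a bucket name followed by an object key. As a small illustration using only the Python standard library (this is not part of the Qwak SDK), the URI from the data source above can be split like this:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, object key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://qwak-public/example_data/data_credit_risk.csv")
print(bucket)  # qwak-public
print(key)     # example_data/data_credit_risk.csv
```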

Before exploring the data source sample, install pandas. Pick the version that suits your project best; if you have no version requirements, simply install the latest version.

pip install pandas

Exploring the data source

Explore the ingested data by running the get_sample method:

# Get and print a sample from your live data source
pandas_df = csv_source.get_sample()
print(pandas_df)

The output should look like the following:

   age     sex  job housing saving_account checking_account  credit_amount  duration              purpose  risk                               user_id   date_created
0   67    male    2     own           None           little           1169         6             radio/TV  good  baf1aed9-b16a-46f1-803b-e2b08c8b47de  1609459200000
1   22  female    2     own         little         moderate           5951        48             radio/TV   bad  574a2cb7-f3ae-48e7-bd32-b44015bf9dd4  1609459200000
2   49    male    1     own         little             None           2096        12            education  good  1b044db3-3bd1-4b71-a4e9-336210d6503f  1609459200000
3   45    male    2    free         little           little           7882        42  furniture/equipment  good  ac8ec869-1a05-4df9-9805-7866ca42b31c  1609459200000
4   53    male    2    free         little           little           4870        24                  car   bad  aa974eeb-ed0e-450b-90d0-4fe4592081c1  1609459200000
5   35    male    1    free           None             None           9055        36            education  good  7b3d019c-82a7-42d9-beb8-2c57a246ff16  1609459200000
6   53    male    2     own     quite rich             None           2835        24  furniture/equipment  good  6bc1fd70-897e-49f4-ae25-960d490cb74e  1609459200000
7   35    male    3    rent         little         moderate           6948        36                  car  good  193158eb-5552-4ce5-92a4-2a966895bec5  1609459200000
8   61    male    1     own           rich             None           3059        12             radio/TV  good  759b5b46-dbe9-40ef-a315-107ddddc64b5  1609459200000
9   28    male    3     own         little         moderate           5234        30                  car   bad  e703c351-41a8-43ea-9615-8605da7ee718  1609459200000
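The date_created values in the sample look like epoch timestamps in milliseconds (an assumption based on their magnitude; the source does not state the unit). Under that assumption, a quick standard-library check shows which date they represent:

```python
from datetime import datetime, timezone


def epoch_millis_to_utc(millis: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


print(epoch_millis_to_utc(1609459200000).isoformat())
# 2021-01-01T00:00:00+00:00
```

Every row in the sample shares this value, so all records carry the same creation time of midnight on January 1, 2021 (UTC).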

Defining an entity

Every feature set is associated with an Entity, which represents the business entity the feature set describes. In this example, the entity is a user.
Feature vectors in the Feature Store are identified by that entity's key.

In the file we created earlier, we add the Entity configuration:

from qwak.feature_store.entities.entity import Entity

entity = Entity(
    name='user',
    description='A User ID'
)

Defining a feature set

Finally, we are able to define the feature set. Feature set configuration consists of several elements:

1. Feature set type

Qwak cloud currently supports only Batch feature sets, which are defined by using the @batch decorator.

2. Connecting data sources and entities to feature sets

Data sources and entities connect with feature sets by their name.

3. Backfill policy

The backfill policy determines how and from which date and time data will populate the new feature set with historical values.

4. Scheduling policy

The scheduling policy defines the feature set data freshness, i.e. how often new data is fetched and processed.

5. Data transformation

Qwak cloud supports Spark SQL transformations to transform the ingested data into features.

In our Python file, we must add the following @batch decorators to create a new feature set:

from datetime import datetime
from qwak.feature_store.feature_sets import batch
from qwak.feature_store.feature_sets.transformations import SparkSqlTransformation


@batch.feature_set(
    name="user-credit-risk-features",
    entity="user",
    data_sources=["credit_risk_data"],
)
@batch.metadata(
    owner="John Doe",
    display_name="User Credit Risk Features",
    description="Features describing user credit risk",
)
@batch.scheduling(cron_expression="0 0 * * *")
@batch.backfill(start_date=datetime(2015, 1, 1))
def user_features():
    return SparkSqlTransformation(
        """
        SELECT user_id as user,
               age,
               sex,
               job,
               housing,
               saving_account,
               checking_account,
               credit_amount,
               duration,
               purpose,
               date_created
        FROM credit_risk_data
        """
    )
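The scheduling decorator above takes a cron expression. Assuming the standard five-field cron syntax (the source does not say which dialect or time zone Qwak uses), the small helper below labels each field so the schedule is easier to read. The helper name describe_cron is made up for this sketch.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict[str, str]:
    """Map each field of a five-field cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


fields = describe_cron("0 0 * * *")
for name, value in fields.items():
    print(f"{name:>12}: {value}")
```

With minute and hour both set to 0 and the remaining fields set to `*`, the expression describes one run per day at midnight.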

Registering a feature set

Use the qwak CLI to register the feature set.

Run this command in the same directory where your feature_set.py is located:

qwak features register

An optional -p parameter allows you to define the path of your working directory. By default, it is the current working directory.
The CLI reads all Python files in this directory to find all feature set definitions. To speed up the process, it is recommended to separate feature configuration folders from the rest of your code.
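To get a rough idea of how many files the CLI would have to read, we can list the Python files in a directory before registering. This is only a preview written with pathlib, not the CLI's own discovery logic, and it assumes a non-recursive scan; the source does not say whether subdirectories are included.

```python
from pathlib import Path


def python_files_in(directory: str = ".") -> list[Path]:
    """List the .py files directly inside `directory`, sorted by name."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix == ".py"
    )


for path in python_files_in("."):
    print(path.name)
```

A long list here is a hint that the feature definitions may be worth moving into their own folder.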

While the command runs, you will be prompted to create the user entity, the credit_risk_data data source, and the user-credit-risk-features feature set.

Exploring feature set data

In order to test or explore features before registering them, use the get_sample method of the feature set:

# Get a live sample of your ingested data from the feature store
df = user_features.get_sample()
print(df)

The output should look like the following:

                                   user  age     sex  job housing saving_account checking_account  credit_amount  duration              purpose   date_created
0  baf1aed9-b16a-46f1-803b-e2b08c8b47de   67    male    2     own           None           little           1169         6             radio/TV  1609459200000
1  574a2cb7-f3ae-48e7-bd32-b44015bf9dd4   22  female    2     own         little         moderate           5951        48             radio/TV  1609459200000
2  1b044db3-3bd1-4b71-a4e9-336210d6503f   49    male    1     own         little             None           2096        12            education  1609459200000
3  ac8ec869-1a05-4df9-9805-7866ca42b31c   45    male    2    free         little           little           7882        42  furniture/equipment  1609459200000
4  aa974eeb-ed0e-450b-90d0-4fe4592081c1   53    male    2    free         little           little           4870        24                  car  1609459200000
5  7b3d019c-82a7-42d9-beb8-2c57a246ff16   35    male    1    free           None             None           9055        36            education  1609459200000
6  6bc1fd70-897e-49f4-ae25-960d490cb74e   53    male    2     own     quite rich             None           2835        24  furniture/equipment  1609459200000
7  193158eb-5552-4ce5-92a4-2a966895bec5   35    male    3    rent         little         moderate           6948        36                  car  1609459200000
8  759b5b46-dbe9-40ef-a315-107ddddc64b5   61    male    1     own           rich             None           3059        12             radio/TV  1609459200000
9  e703c351-41a8-43ea-9615-8605da7ee718   28    male    3     own         little         moderate           5234        30                  car  1609459200000
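Comparing the column headers of the two printed samples shows what the SQL transformation changed. The lists below are copied from the outputs above; the comparison itself is plain Python:

```python
source_columns = [
    "age", "sex", "job", "housing", "saving_account", "checking_account",
    "credit_amount", "duration", "purpose", "risk", "user_id", "date_created",
]
feature_columns = [
    "user", "age", "sex", "job", "housing", "saving_account",
    "checking_account", "credit_amount", "duration", "purpose", "date_created",
]

# Columns present in the data source sample but not in the feature set sample
only_in_source = [c for c in source_columns if c not in feature_columns]
# Columns present in the feature set sample but not in the data source sample
only_in_features = [c for c in feature_columns if c not in source_columns]

print(only_in_source)    # ['risk', 'user_id']
print(only_in_features)  # ['user']
```

The user_id column reappears as user because of the `user_id as user` alias in the query, while risk is absent because the SELECT list does not include it.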

Training models with the offline store

Let's use the Qwak feature store data during model training.

We need to import the OfflineClient and use the get_feature_values method.

Creating a new model

Let's create a new model directory and start training our model. Before we start, we must have a few dependencies installed.

If you are using Conda, please specify the dependencies in the conda.yml file. Otherwise, use any other tool you find suitable.

We will use the CatBoost library, so our dependencies look like this:

name: CreditRisk
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.9
  - pip
  - pandas
  - scikit-learn
  - catboost

Defining a new model

Next we define the model parameters in the constructor:

import pandas as pd
import qwak
from catboost import CatBoostRegressor
from qwak.model.base import QwakModel


class CreditRiskModel(QwakModel):

    def __init__(self):
        self.model = CatBoostRegressor(
            iterations=1000,
            loss_function='RMSE',
            learning_rate=None
        )

In the build function, we are going to do several things:

1. Retrieve the data from the feature store
2. Log the training parameters
3. Extract the relevant features from the dataset
4. Deal with missing values. In this example, we will drop the rows with missing data. That isn't a recommended method of data preprocessing, but we want to focus on showing how to use Qwak.
5. Split the dataset into training and validation parts
6. Train the model
7. Run cross-validation to evaluate the model
8. Log the performance metrics
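To see what dropping rows with missing data does in practice, here is a plain-Python sketch of the same idea, using three columns from the first four rows of the sample shown earlier. The build function relies on pandas for this step; the list comprehension below only mirrors the effect.

```python
rows = [
    {"age": 67, "saving_account": None, "checking_account": "little"},
    {"age": 22, "saving_account": "little", "checking_account": "moderate"},
    {"age": 49, "saving_account": "little", "checking_account": None},
    {"age": 45, "saving_account": "little", "checking_account": "little"},
]

# Keep only the rows in which every value is present
complete_rows = [row for row in rows if all(value is not None for value in row.values())]

print(len(rows), "->", len(complete_rows))  # 4 -> 2
```

Half of these rows disappear, which is why dropping is fine for a tutorial but costly on real data.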

def build(self):
    from qwak.feature_store.offline.client import OfflineClient
    from qwak import log_param, log_metric
    from sklearn.model_selection import train_test_split

    import numpy as np
    from catboost import Pool, cv

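    # Request every feature that is renamed and selected further down
    key_to_features = {
        "user": [
            "user-credit-risk-features.checking_account",
            "user-credit-risk-features.age",
            "user-credit-risk-features.job",
            "user-credit-risk-features.duration",
            "user-credit-risk-features.credit_amount",
            "user-credit-risk-features.housing",
            "user-credit-risk-features.purpose",
            "user-credit-risk-features.saving_account",
            "user-credit-risk-features.sex",
        ]
    }

    # Entity keys and the point in time for which to fetch feature values
    population = pd.DataFrame(
        columns=["user", "timestamp"],
        data=[["06cc255a-aa07-4ec9-ac69-b896ccf05322", "2021-01-01 00:00:00"]],
    )

    offline_client = OfflineClient()
    data = offline_client.get_feature_values(
        entity_key_to_features=key_to_features,
        population=population,
    )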

    log_param({"iterations": 1000, "loss_function": "RMSE"})

    data = data.rename(
        columns={
            "user-credit-risk-features.checking_account": "checking_account",
            "user-credit-risk-features.age": "age",
            "user-credit-risk-features.job": "job",
            "user-credit-risk-features.duration": "duration",
            "user-credit-risk-features.credit_amount": "credit_amount",
            "user-credit-risk-features.housing": "housing",
            "user-credit-risk-features.purpose": "purpose",
            "user-credit-risk-features.saving_account": "saving_account",
            "user-credit-risk-features.sex": "sex",
        }
    )

    data = data[
        [
            "checking_account",
            "age",
            "job",
            "credit_amount",
            "housing",
            "purpose",
            "saving_account",
            "sex",
            "duration",
        ]
    ]
    data = data.dropna()  # in production, we should fill the missing values
    # but we don't have a second data source for the missing data, so let's drop them

    x = data[
        [
            "checking_account",
            "age",
            "job",
            "credit_amount",
            "housing",
            "purpose",
            "saving_account",
            "sex",
        ]
    ]
    y = data[["duration"]]

    x_train, x_test, y_train, y_test = train_test_split(
        x, y, train_size=0.85, random_state=42
    )
    cate_features_index = np.where(x_train.dtypes != int)[0]
    self.model.fit(
        x_train, y_train, cat_features=cate_features_index, eval_set=(x_test, y_test)
    )

    cv_data = cv(
        Pool(x, y, cat_features=cate_features_index),
        self.model.get_params(),
        fold_count=5,
    )
    # RMSE is an error metric, so the best iteration has the lowest mean value
    best_row = cv_data[
        cv_data["test-RMSE-mean"] == np.min(cv_data["test-RMSE-mean"])
    ]
    log_metric(
        {
            "val_rmse_mean": best_row["test-RMSE-mean"].iloc[0],
            "val_rmse_std": best_row["test-RMSE-std"].iloc[0],
        }
    )
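When reading cross-validation results, keep in mind that RMSE measures error, so smaller values are better. The sketch below picks the best iteration from a list of per-iteration results. The numbers are made up purely for illustration and are not output from this tutorial's model; only the column names follow the code above.

```python
# Illustrative values only, not real results
cv_rows = [
    {"iterations": 0, "test-RMSE-mean": 12.4, "test-RMSE-std": 0.9},
    {"iterations": 1, "test-RMSE-mean": 10.1, "test-RMSE-std": 0.7},
    {"iterations": 2, "test-RMSE-mean": 10.6, "test-RMSE-std": 0.8},
]

# The best iteration is the one with the lowest mean validation error
best = min(cv_rows, key=lambda row: row["test-RMSE-mean"])

print(best["iterations"], best["test-RMSE-mean"], best["test-RMSE-std"])
```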

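If the client application sends the features in the correct order, the predict function can be quite simple:

@qwak.api()
def predict(self, df: pd.DataFrame) -> pd.DataFrame:
    # model.predict returns a NumPy array, so wrap it to match the declared return type.
    # The column name "duration" is chosen for this example because the model is trained on that target.
    return pd.DataFrame(self.model.predict(df), columns=["duration"])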

Inference with the online feature store

We don't need to pass all features to the predict function if we store the data in a feature store. In our example, we could send only the user_id, and the model can retrieve all data from Qwak Data Lake.

We can do it in two ways: explicitly writing the feature retrieval code or using the features_extraction decorator. The example below uses the explicit approach with the OnlineClient.

Using the online client

In the predict function, we create an instance of the OnlineClient.

We then create the ModelSchema with the feature set and the required feature values.

We create a data frame containing the user identifiers and pass it to the feature store. In response, we get a pandas DataFrame containing the requested features.

import pandas as pd
from qwak.feature_store.online.client import OnlineClient
from qwak.model.schema import ModelSchema
from qwak.model.schema_entities import Entity, FeatureStoreInput

entity = Entity(name="user_id", type=str)

model_schema = ModelSchema(
    entities=[entity],
    inputs=[
        FeatureStoreInput(name='user-credit-risk-features.checking_account', entity=entity),
        FeatureStoreInput(name='user-credit-risk-features.age', entity=entity),
        FeatureStoreInput(name='user-credit-risk-features.job', entity=entity),
        FeatureStoreInput(name='user-credit-risk-features.duration', entity=entity),
        FeatureStoreInput(name='user-credit-risk-features.credit_amount', entity=entity),
        FeatureStoreInput(name='user-credit-risk-features.housing', entity=entity),
        FeatureStoreInput(name='user-credit-risk-features.purpose', entity=entity),
        FeatureStoreInput(name='user-credit-risk-features.saving_account', entity=entity),
        FeatureStoreInput(name='user-credit-risk-features.sex', entity=entity),
    ],
)

online_client = OnlineClient()

df = pd.DataFrame(
    columns=['user_id'],
    data=['06cc255a-aa07-4ec9-ac69-b896ccf05322'],
)

user_features = online_client.get_feature_values(model_schema, df)
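The lookup pattern itself can be tried out without a Qwak connection. In the sketch below, a plain dictionary stands in for the online store, and the helper enrich attaches stored features to each incoming user ID. Both names are hypothetical and exist only for this illustration; the real OnlineClient handles this for us. The feature values are taken from the first row of the sample shown earlier.

```python
# A dictionary standing in for the online store (hypothetical, for illustration)
FAKE_ONLINE_STORE = {
    "baf1aed9-b16a-46f1-803b-e2b08c8b47de": {
        "age": 67,
        "job": 2,
        "checking_account": "little",
    },
}


def enrich(user_ids: list[str], store: dict[str, dict]) -> list[dict]:
    """Attach stored features to each user ID, one output row per ID."""
    rows = []
    for user_id in user_ids:
        if user_id not in store:
            raise KeyError(f"no features stored for user {user_id!r}")
        rows.append({"user_id": user_id, **store[user_id]})
    return rows


request = ["baf1aed9-b16a-46f1-803b-e2b08c8b47de"]
print(enrich(request, FAKE_ONLINE_STORE))
```

The client sends only identifiers, and everything else the model needs is filled in on the serving side.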

Deleting items

Deleting data sources

To delete a data source, use the following command in the terminal:

qwak features delete --data-source <data source name>

❗️
A data source that is linked to a feature set cannot be deleted.

Deleting entities

To delete an entity, use the following command in the terminal:

qwak features delete --entity <entity name>

❗️
An entity that is linked to a feature set cannot be deleted.

Deleting feature sets

To delete a feature set, use the following command in the terminal:

qwak features delete --feature-set <feature set name>

❗️
Deleting a feature set will delete all related data in the offline and online feature stores.
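Because a data source or an entity that is linked to a feature set cannot be deleted, cleaning up everything created in this tutorial means deleting the feature set first. The sketch below only builds the commands in that order, using the flags shown above; it does not run them. The helper name deletion_commands is made up for this example.

```python
def deletion_commands(feature_set: str, data_source: str, entity: str) -> list[list[str]]:
    """Return delete commands ordered so that linked objects are removed last."""
    return [
        # The feature set goes first, which unlinks the data source and entity
        ["qwak", "features", "delete", "--feature-set", feature_set],
        ["qwak", "features", "delete", "--data-source", data_source],
        ["qwak", "features", "delete", "--entity", entity],
    ]


for command in deletion_commands("user-credit-risk-features", "credit_risk_data", "user"):
    print(" ".join(command))
```

Review the printed commands before running them, since deleting the feature set also removes its stored data.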