Getting Started

This tutorial will show you how to start using the Qwak Feature Store in less than 10 minutes.

We will train a machine learning model that predicts the credit duration from the credit amount and the person's data. We will retrieve the training data from the feature store.

Before we start, look at the Feature Store Concepts documentation to learn about the difference between data sources, entities, and feature sets.

Adding Data to the Feature Store

Defining a Data Source

First, we will define a data source using a CSV file stored in a public S3 bucket.

Let's create an empty Python file and put the data source definition inside it:

from qwak.feature_store.sources.data_sources import CsvSource

csv_source = CsvSource(
    name='credit_risk_data',
    description='a csv source description',
    date_created_column='DATE_CREATED',
    path='s3://qwak-public/example_data/data_credit_risk.csv',
    quote_character='"',
    escape_character='"'
)

You can explore the data source by calling its get_sample method:

pandas_df = csv_source.get_sample()
print(pandas_df)

Defining an Entity

Every feature set is associated with an entity. An Entity defines the name of the business object described by the features, as well as the set of key fields used as unique identifiers of every data point. In our example, we have only one key: user_id.

In the file we created earlier, we add the Entity configuration:

from qwak.feature_store.entities import Entity, ValueType

entity = Entity(
    name='user',
    description='A User ID',
    key=['user_id'],
    value_type=ValueType.STRING
)

Defining a Feature Set

Finally, we define the feature set. Note that the feature set configuration consists of several elements:

  • Feature set type - The Qwak feature store supports batch and streaming features. We decide which type of feature set to create by choosing the corresponding feature set implementation. In this example, we will use the @batch decorator for a batch feature set.
  • Relations between data sources and the entity - Note that we specify data source and entity names! Do not try to assign the variables you created earlier to those fields.
  • Backfill policy - How far in the past we should look while performing the initial data load
  • Scheduling policy - How often we retrieve new data from this data source
  • Transformation - The SQL function that retrieves the data from the Qwak Data Lake. Even though we defined a CSV data source, we access it using SQL because the Qwak Feature Store loads the file content into the Qwak Data Lake, so we access a copy of the data!

In our Python file, we add a transformation function annotated with the @batch decorators:

from qwak.feature_store.v1 import batch, SparkSqlTransformation
from datetime import datetime

@batch.feature_set(
    name="user-credit-risk-features",
    entity="user",
    data_sources=["credit_risk_data"],
)
@batch.metadata(
    owner="John Doe",
    display_name="User Credit Risk Features",
    description="Features describing user credit risk",
)
@batch.scheduling(cron_expression="@daily")
@batch.backfill(start_date=datetime(2015, 1, 1))
def user_features():
    return SparkSqlTransformation("""
        SELECT user_id,
               age,
               sex,
               job,
               housing,
               saving_account,
               checking_account,
               credit_amount,
               duration,
               purpose
        FROM credit_risk_data
        """)

Exploring Feature Sets

In order to test or explore features before registering them, use the get_sample method of the feature set:

df = user_features.get_sample()
print(df)

Registering a Feature Set

We open a terminal and use the Qwak CLI to register the features.

Type: qwak features register -p . in the terminal.

The -p parameter points to the directory containing the file with our feature definitions; in our case, the current working directory.

The CLI tool reads all Python files in this directory to find all definitions. If you want to speed it up, you should separate feature configuration from the rest of your code.
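For example, a layout like the following keeps the feature definitions separate from the model code (the file and directory names are only illustrative):

credit_risk_project/
    features/
        credit_risk_features.py   # data source, entity, and feature set definitions
    main/
        model.py                  # model implementation
        conda.yml

With such a layout, we would register the features with qwak features register -p credit_risk_project/features.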

Using an Offline Feature Store during Model Training

We will use the data retrieved from the Qwak feature store during model training.

We need to import the OfflineFeatureStore:

import qwak
from qwak import log_param, log_metric
from qwak.model import QwakModelInterface
from qwak.feature_store.offline import OfflineFeatureStore

import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, Pool, cv
from sklearn.model_selection import train_test_split

And create an OfflineFeatureStore instance to call its get_sample_data function:

offline_feature_store = OfflineFeatureStore()
data = offline_feature_store.get_sample_data(feature_set_name='user_credit_risk_features', number_of_rows=999)

Let's create a new model directory and start training our model.

Before we start, we must specify the dependencies in the conda.yml file. We will use the CatBoost library, so our dependencies look like this:

name: CreditRisk
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.8
  - pip=20.0.3
  - pandas=1.1.5
  - scikit-learn=0.24.1
  - catboost=0.26.1

First, we will define the model parameters in the constructor:

class CreditRisk(QwakModelInterface):

    def __init__(self):
        self.model = CatBoostRegressor(
            iterations=1000,
            loss_function='RMSE',
            learning_rate=None
        )

In the build function, we are going to do several things:

  • We retrieve the data from the feature store
  • Log the training parameters
  • Extract the relevant features from the dataset
  • Deal with missing values. In this example, we will drop the rows with missing data. That isn't a recommended method of data preprocessing, but we want to focus on showing how to use Qwak.
  • Split the dataset into training and validation parts
  • Train the model
  • Run cross-validation to evaluate the model
  • Log the performance metrics
def build(self):
    offline_feature_store = OfflineFeatureStore()
    data = offline_feature_store.get_sample_data(feature_set_name='user_credit_risk_features', number_of_rows=999)

    log_param({'iterations': 1000, 'loss_function': 'RMSE'})

    data = data.rename(columns={
        'user_credit_risk_features.checking_account': 'checking_account',
        'user_credit_risk_features.age': 'age',
        'user_credit_risk_features.job': 'job',
        'user_credit_risk_features.duration': 'duration',
        'user_credit_risk_features.credit_amount': 'credit_amount',
        'user_credit_risk_features.housing': 'housing',
        'user_credit_risk_features.purpose': 'purpose',
        'user_credit_risk_features.saving_account': 'saving_account',
        'user_credit_risk_features.sex': 'sex'
    })

    data = data[['checking_account', 'age', 'job', 'credit_amount', 'housing', 'purpose', 'saving_account', 'sex', 'duration']]
    data = data.dropna() # in production, we should fill the missing values
    # but we don't have a second data source for the missing data, so let's drop them

    x = data[['checking_account', 'age', 'job', 'credit_amount', 'housing', 'purpose', 'saving_account', 'sex']]
    y = data[['duration']]

    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=.85, random_state=42)
    cate_features_index = np.where(x_train.dtypes != int)[0]
    self.model.fit(x_train, y_train, cat_features=cate_features_index, eval_set=(x_test, y_test))

    cv_data = cv(Pool(x, y, cat_features=cate_features_index), self.model.get_params(), fold_count=5)
    max_mean_row = cv_data[cv_data['test-RMSE-mean'] == np.max(cv_data['test-RMSE-mean'])]
    log_metric({'val_rmse_mean': max_mean_row['test-RMSE-mean'].iloc[0], 'val_rmse_std': max_mean_row['test-RMSE-std'].iloc[0]})

If the client application sends the features in the correct order, the predict function can be quite simple:

@qwak.api()
def predict(self, df: pd.DataFrame) -> pd.DataFrame:
    return self.model.predict(df)

Of course, we can also add data preprocessing to the predict function.
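For example, a minimal preprocessing sketch, assuming the client may send the training columns in an arbitrary order (the column list mirrors the features we used during training):

import pandas as pd

@qwak.api()
def predict(self, df: pd.DataFrame) -> pd.DataFrame:
    # Reorder the incoming columns to match the order used during training
    feature_columns = ['checking_account', 'age', 'job', 'credit_amount',
                       'housing', 'purpose', 'saving_account', 'sex']
    df = df[feature_columns]
    return pd.DataFrame(self.model.predict(df), columns=['duration'])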

Using an Online Feature Store during Inference

We don't need to pass all features to the predict function if we store the data in a feature store. In our example, we could send only the user_id, and the model can retrieve all other data from the Qwak Data Lake.

We can do it in two ways: explicitly writing the feature retrieval code, or enabling feature extraction in the @qwak.api decorator. Below, we will show both methods.

Using the Online Feature Store

In the predict function, we create an instance of the OnlineFeatureStore.

After that, we create a ModelSchema with the feature set and the required feature values.
We create a DataFrame containing the user identifiers and pass it to the feature store. In response, we get a Pandas DataFrame containing the requested features.

import pandas as pd
from qwak.feature_store.online import OnlineFeatureStore
from qwak.model.schema import ModelSchema, FeatureStoreInput


model_schema = ModelSchema(
    features=[
        FeatureStoreInput(name='user_credit_risk_features.checking_account'),
        FeatureStoreInput(name='user_credit_risk_features.age'),
        FeatureStoreInput(name='user_credit_risk_features.job'),
        FeatureStoreInput(name='user_credit_risk_features.duration'),
        FeatureStoreInput(name='user_credit_risk_features.credit_amount'),
        FeatureStoreInput(name='user_credit_risk_features.housing'),
        FeatureStoreInput(name='user_credit_risk_features.purpose'),
        FeatureStoreInput(name='user_credit_risk_features.saving_account'),
        FeatureStoreInput(name='user_credit_risk_features.sex'),
    ])

online_feature_store = OnlineFeatureStore()

df = pd.DataFrame(columns=['user_id'],
                  data=[['06cc255a-aa07-4ec9-ac69-b896ccf05322']])

user_features = online_feature_store.get_feature_values(model_schema, df)
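The returned DataFrame uses the prefixed column names (for example, user_credit_risk_features.age). Assuming this code runs inside the predict function of our model, a minimal sketch of passing the retrieved features to the model could look like this:

# Strip the 'user_credit_risk_features.' prefix added by the feature store
user_features.columns = [column.split('.')[-1] for column in user_features.columns]

feature_columns = ['checking_account', 'age', 'job', 'credit_amount',
                   'housing', 'purpose', 'saving_account', 'sex']
prediction = self.model.predict(user_features[feature_columns])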

Using the Feature Extraction Decorator

Alternatively, we can enable feature extraction in the @qwak.api decorator and get the features extracted automatically.

In this case, we have to implement the model's schema method and use it to define the features we want to extract:

def schema(self):
    from qwak.model.schema import ModelSchema, FeatureStoreInput, Prediction
    model_schema = ModelSchema(
        features=[
            FeatureStoreInput(name='user_credit_risk_features.checking_account'),
            FeatureStoreInput(name='user_credit_risk_features.age'),
            FeatureStoreInput(name='user_credit_risk_features.job'),
            FeatureStoreInput(name='user_credit_risk_features.duration'),
            FeatureStoreInput(name='user_credit_risk_features.credit_amount'),
            FeatureStoreInput(name='user_credit_risk_features.housing'),
            FeatureStoreInput(name='user_credit_risk_features.purpose'),
            FeatureStoreInput(name='user_credit_risk_features.saving_account'),
            FeatureStoreInput(name='user_credit_risk_features.sex'),
        ],
        predictions=[
            Prediction(name="duration", type=float)
        ])
    return model_schema

In the predict function, we add the decorator parameter and an additional function parameter. Note that the extracted_df variable contains raw features from the feature store. If those values need preprocessing before they are passed to the model, you still have to do it yourself!

@qwak.api(feature_extraction=True)
def predict(self, df, extracted_df):
    return self.model.predict(extracted_df) 
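For example, assuming the extracted columns arrive with the feature set prefix (as in the online store example above) and the model was trained on the unprefixed names, a minimal sketch of that preprocessing could look like this:

@qwak.api(feature_extraction=True)
def predict(self, df, extracted_df):
    # The extracted columns arrive prefixed with the feature set name,
    # e.g. 'user_credit_risk_features.age', so we strip the prefix first
    extracted_df.columns = [column.split('.')[-1] for column in extracted_df.columns]
    feature_columns = ['checking_account', 'age', 'job', 'credit_amount',
                       'housing', 'purpose', 'saving_account', 'sex']
    return self.model.predict(extracted_df[feature_columns])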

Deleting Feature Store Objects

A user can delete feature store objects such as feature sets, data sources, and entities.

Deleting a Data Source

In order to delete a data source, type: qwak features delete --data-source <data source name> in the terminal.

  • A data source that is linked to a feature set cannot be deleted.

Deleting an Entity

In order to delete an entity, type: qwak features delete --entity <entity name> in the terminal.

  • An entity that is linked to a feature set cannot be deleted.

Deleting a Feature Set

In order to delete a feature set, type: qwak features delete --feature-set <feature set name> in the terminal.

  • Deleting a feature set deletes all data related to that feature set in the offline feature store. In addition, it stops the offline feature store job.

Generating Feature Store Objects for Autocomplete

In order to generate feature store objects (feature sets, entities, and data sources) for assistance purposes such as autocomplete, or for exploration, use the following command:

Type: qwak features generate --path /my_working_env in the terminal.

Note that the generated files are imported as Python modules, so the directory name must be a valid module name (underscores, not hyphens).

The command generates a file for autocomplete for each feature store object.

To use the generated files, import them in your working file. For example, in order to use data sources:

from my_working_env.data_sources import *
csv_source.name # returns the defined name of the data source
csv_source.date_created_column # returns the defined date-created column of the data source

In order to use entities:

from my_working_env.entities import *
my_entity.entity_name # returns the entity name
my_entity.entity_key # returns the entity key
my_entity.value_type # returns the type of the entity key

In order to use autocomplete in a feature set:

from my_working_env.feature_sets import *
my_feature_set.<feature_name1> # returns the defined feature_name1
my_feature_set.<feature_name2> # returns the defined feature_name2
my_feature_set.<feature_name10> # returns the defined feature_name10
my_feature_set.feature_set_name # returns the feature set name

Note

  • Use --all-objects to generate feature sets, entities, and data sources in one command.
  • To pick up updated definitions, run the generate command again before starting to work.
  • You can use the --data-source, --entity, and --feature-set flags to generate specific object types.
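
For example, to generate all object types at once (the path is illustrative):

qwak features generate --path /my_working_env --all-objects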