Features in Training
This documentation provides examples and usage patterns for interacting with the Offline Feature Store using the OfflineClientV2
in Python (available from SDK version 0.5.61 and higher). It covers how to retrieve feature values for machine learning model training and analysis.
Prerequisites:
Before using these examples, ensure you have the following Python packages installed:
pip install pyathena pyarrow
APIs:
Get Feature Values
This API retrieves features from an offline feature store for one or more feature sets, given a population
DataFrame. The resulting DataFrame will include the population
DataFrame enriched with the requested feature values as of the point_in_time
specified.
Arguments:
features: List[FeatureSetFeatures]
- required
A list of feature sets to fetch.
population: pd.DataFrame
- required
A DataFrame containing:
- All keys of the requested feature sets.
- A point in time column.
- Optional enrichments, e.g., labels.
point_in_time_column_name: str
- required
The name of the point in time column in the population DataFrame.
Returns: pd.DataFrame
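Because the population DataFrame must carry every key column plus the point in time column, it can help to check it before calling the API. The following is a minimal sketch using only pandas; validate_population is a hypothetical helper and not part of the SDK, and the key column names are taken from the example on this page.

```python
from typing import List

import pandas as pd


def validate_population(
    population: pd.DataFrame,
    key_columns: List[str],
    point_in_time_column_name: str,
) -> None:
    """Hypothetical pre-flight check (not part of the SDK)."""
    required = [*key_columns, point_in_time_column_name]
    missing = [column for column in required if column not in population.columns]
    if missing:
        raise ValueError(f"population is missing required columns: {missing}")
    # Raises ValueError if any point-in-time value cannot be parsed as a timestamp.
    pd.to_datetime(population[point_in_time_column_name])


population_df = pd.DataFrame(
    columns=['impression_id', 'purchase_id', 'timestamp', 'label'],
    data=[['1', '100', '2021-01-02 17:00:00', 1], ['2', '200', '2021-01-01 12:00:00', 0]],
)

# Passes silently: both keys and the point in time column are present.
validate_population(population_df, ['impression_id', 'purchase_id'], 'timestamp')
```

The check leaves the DataFrame unchanged, so the same population_df can be passed to get_feature_values afterwards.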
Example call:
import pandas as pd
from qwak.feature_store.offline import OfflineClientV2
from qwak.feature_store.offline.feature_set_features import FeatureSetFeatures
offline_feature_store = OfflineClientV2()
user_impressions_features = FeatureSetFeatures(
    feature_set_name='impressions',
    feature_names=['number_of_impressions']
)
user_purchases_features = FeatureSetFeatures(
    feature_set_name='purchases',
    feature_names=['number_of_purchases', 'avg_purchase_amount']
)
features = [user_impressions_features, user_purchases_features]
population_df = pd.DataFrame(
    columns=['impression_id', 'purchase_id', 'timestamp', 'label'],
    data=[['1', '100', '2021-01-02 17:00:00', 1], ['2', '200', '2021-01-01 12:00:00', 0]]
)
train_df: pd.DataFrame = offline_feature_store.get_feature_values(
    features=features,
    population=population_df,
    point_in_time_column_name='timestamp'
)
print(train_df.head())
Example results:
# impression_id purchase_id timestamp label impressions.number_of_impressions purchases.number_of_purchases purchases.avg_purchase_amount
# 0 1 100 2021-01-02 17:00:00 1 312 76 4.796842
# 1 2 200 2021-01-01 12:00:00 0 86 5 1.548000
In this example, the label column enriches the dataset rather than acting as a criterion for data selection. This approach is particularly useful when you already have a complete list of keys along with their respective timestamps. The API joins data from multiple feature sets and returns no more than one record for each row in population_df. The JFrog ML feature store is time-series based: it organizes the data for each feature vector (key) within start_timestamp and end_timestamp bounds, so a single, most relevant result is retrieved for every unique key-timestamp combination.
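The point-in-time lookup described above can be illustrated with plain pandas. This is only a sketch of the idea, not the feature store's actual implementation: the history table and its values are made up, the bounds are assumed to be half-open (start_timestamp inclusive, end_timestamp exclusive), and population rows without a matching feature vector are simply dropped here.

```python
import pandas as pd

# Hypothetical stored history for the 'purchases' feature set.
# Each row is assumed valid within [start_timestamp, end_timestamp).
history = pd.DataFrame({
    'purchase_id': ['100', '100', '200'],
    'start_timestamp': pd.to_datetime(
        ['2021-01-01 00:00:00', '2021-01-02 00:00:00', '2021-01-01 00:00:00']),
    'end_timestamp': pd.to_datetime(
        ['2021-01-02 00:00:00', '2021-01-03 00:00:00', '2021-01-02 00:00:00']),
    'number_of_purchases': [70, 76, 5],
})

population = pd.DataFrame({
    'purchase_id': ['100', '200'],
    'timestamp': pd.to_datetime(['2021-01-02 17:00:00', '2021-01-01 12:00:00']),
    'label': [1, 0],
})

# Pair every population row with all stored versions of its key ...
candidates = population.merge(history, on='purchase_id', how='left')

# ... then keep only the version whose bounds contain the requested timestamp.
in_bounds = (
    (candidates['start_timestamp'] <= candidates['timestamp'])
    & (candidates['timestamp'] < candidates['end_timestamp'])
)
result = (
    candidates[in_bounds]
    .drop(columns=['start_timestamp', 'end_timestamp'])
    .reset_index(drop=True)
)
print(result)
```

Because the bounds of a key's versions do not overlap in this sketch, each key-timestamp pair matches at most one version, which mirrors the "no more than one record per population row" behavior.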
Get Feature Range Values
Retrieves features from an offline feature set for a given time range. The resulting DataFrame contains all data points of the feature set within that time range. If population is provided, the result is filtered by the key values it contains.
Arguments:
features: FeatureSetFeatures
- required
A list of features to fetch from a single feature set.
start_date: datetime
- required
The lower time bound.
end_date: datetime
- required
The upper time bound.
population: pd.DataFrame
- optional
A DataFrame containing the following columns:
- The key of the requested feature set (required).
- Enrichments, e.g., labels (optional).
Returns: pd.DataFrame
Example Call:
from datetime import datetime
import pandas as pd
from qwak.feature_store.offline import OfflineClientV2
from qwak.feature_store.offline.feature_set_features import FeatureSetFeatures
offline_feature_store = OfflineClientV2()
start_date = datetime(year=2021, month=1, day=1)
end_date = datetime(year=2021, month=1, day=3)
features = FeatureSetFeatures(
    feature_set_name='purchases',
    feature_names=['number_of_purchases', 'avg_purchase_amount']
)
train_df: pd.DataFrame = offline_feature_store.get_feature_range_values(
    features=features,
    start_date=start_date,
    end_date=end_date
)
print(train_df.head())
Example Results:
# purchase_id timestamp purchases.number_of_purchases purchases.avg_purchase_amount
# 0 1 2021-01-02 17:00:00 76 4.796842
# 1 1 2021-01-01 12:00:00 5 1.548000
# 2 2 2021-01-02 12:00:00 5 5.548000
# 3 2 2021-01-01 18:00:00 5 2.788000
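The range semantics can be sketched in plain pandas to show how the time bounds and the optional population interact. This is an illustration under assumptions, not the SDK's implementation: select_range is a hypothetical function, both bounds are treated as inclusive, the population is applied as an inner join on the key (which also attaches its enrichment columns), and the sample rows are made up.

```python
from datetime import datetime
from typing import Optional

import pandas as pd


def select_range(
    feature_rows: pd.DataFrame,
    key_column: str,
    start_date: datetime,
    end_date: datetime,
    population: Optional[pd.DataFrame] = None,
) -> pd.DataFrame:
    """Hypothetical illustration of range retrieval (not part of the SDK)."""
    # Keep every data point inside the time range (bounds assumed inclusive).
    in_range = feature_rows['timestamp'].between(start_date, end_date)
    result = feature_rows[in_range]
    if population is not None:
        # Keep only keys present in the population and attach its enrichments.
        result = result.merge(population, on=key_column, how='inner')
    return result.reset_index(drop=True)


feature_rows = pd.DataFrame({
    'purchase_id': ['1', '1', '2', '2', '3'],
    'timestamp': pd.to_datetime([
        '2021-01-02 17:00:00', '2021-01-01 12:00:00', '2021-01-02 12:00:00',
        '2021-01-01 18:00:00', '2021-01-05 09:00:00',
    ]),
    'number_of_purchases': [76, 5, 5, 5, 9],
})
start_date = datetime(year=2021, month=1, day=1)
end_date = datetime(year=2021, month=1, day=3)

# Without a population: all data points in the range, for every key.
all_rows = select_range(feature_rows, 'purchase_id', start_date, end_date)

# With a population: only the listed keys, enriched with the label column.
population = pd.DataFrame({'purchase_id': ['2'], 'label': [1]})
filtered_rows = select_range(feature_rows, 'purchase_id', start_date, end_date, population)
print(filtered_rows)
```

Unlike the point-in-time lookup, a key can appear several times in the result, once per data point that falls inside the range.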
Current limitations
The get_feature_range_values API call is currently not available for Streaming Aggregations feature sets, and it cannot fetch data from multiple feature sets in a single call (joined data).