Data Sources
Introduction
JFrog ML data sources configure connections to your data. You use data sources to create feature sets.
There are two main types of data sources:
- Batch: Data-at-rest sources, such as Athena, Snowflake, and Redshift.
- Streaming: Data-in-motion sources, such as Kafka and Kinesis.
To connect to a data source:
- Enable network connectivity between the data sources and the JFrog ML cluster if they are not publicly accessible.
- Grant JFrog ML access to your data lake components by creating read-only service accounts and/or IAM roles.
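For an S3-based data lake, the read-only access mentioned above could be expressed as an IAM policy along these lines. This is a minimal sketch, not a policy prescribed by JFrog ML: the helper build_read_only_s3_policy and the bucket name my-data-lake-bucket are hypothetical, and the permissions your setup needs may differ.

```python
import json


def build_read_only_s3_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading is granted on the objects inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# "my-data-lake-bucket" is a placeholder, not a real bucket
policy = build_read_only_s3_policy("my-data-lake-bucket")
print(json.dumps(policy, indent=2))
```

The policy contains no write or delete actions, so a role carrying it can only list and read objects in that bucket.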
Defining Data Sources
Data Sources can be defined and registered programmatically via the JFrog ML SDK and CLI, or created entirely via the JFrog ML Dashboard.
Programmatically
JFrog ML provides Python classes to define any Data Source type in the qwak.feature_store.data_sources package.
For example, you can define a CsvSource to read from an S3-based CSV file as follows:
from qwak.feature_store.data_sources import CsvSource
# The S3 anonymous config class is required for public S3 buckets
from qwak.feature_store.data_sources import AnonymousS3Configuration

# Create a CsvSource object to represent a CSV data source
# This example uses a CSV file from a public S3 bucket
csv_source = CsvSource(
    name='credit_risk_data',  # Name of the data source
    description='A dataset of personal credit details',  # Description of the data source
    date_created_column='date_created',  # Column name that represents the creation date
    path='s3://qwak-public/example_data/data_credit_risk.csv',  # S3 path to the CSV file
    filesystem_configuration=AnonymousS3Configuration(),  # Configuration for anonymous access to S3
    quote_character='"',  # Character used for quoting in the CSV file
    escape_character='"'  # Character used for escaping in the CSV file
)
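To see what the quote and escape characters mean in practice, here is a small standalone sketch that uses Python's standard csv module rather than the JFrog ML SDK. It assumes that setting both characters to a double quote corresponds to the common CSV convention in which a literal quote inside a quoted field is written as two quotes. The sample rows are made up for illustration.

```python
import csv
import io

# Made-up sample: the second field of the data row contains a comma and embedded quotes
sample = 'date_created,purpose\n2023-01-01,"car, ""used"""\n'

# quotechar='"' with doublequote=True treats "" inside a quoted field as one literal quote
reader = csv.reader(io.StringIO(sample), quotechar='"', doublequote=True)
rows = list(reader)
header, first_row = rows[0], rows[1]

print(header)        # ['date_created', 'purpose']
print(first_row[1])  # car, "used"
```

The comma inside the quoted field is kept as data instead of being read as a column separator.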
Data Sources defined with the Qwak SDK are not registered in the cloud platform until the qwak features register command is run for that object.
From the UI
- Select Data Sources from the sidebar.
- Click Create New Data Source.
- Select the required data source type from the list.
- Fill in the form (all required fields are marked with an asterisk).
- Test the connection to the data source to verify that it works.
- Click Save.
- The data source is created. 👍
[Image: creating a Batch / CSV file-based Data Source in the JFrog ML Dashboard]
Registering Data Sources
To register a Data Source class defined with the SDK, use the JFrog ML CLI features command as follows:
qwak features register -p data_source.py
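If registration needs to be triggered from a Python script, for example in a CI job, the command above can be wrapped with the standard subprocess module. This is a sketch only: the helper names build_register_command and register_data_sources are hypothetical, and it assumes the qwak CLI is installed and already configured on the machine.

```python
import subprocess
from typing import List


def build_register_command(path: str) -> List[str]:
    """Build the argument list for `qwak features register -p <path>`."""
    return ["qwak", "features", "register", "-p", path]


def register_data_sources(path: str) -> None:
    """Run the register command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_register_command(path), check=True)


# Show the command that would be executed for the file used in this section
print(" ".join(build_register_command("data_source.py")))
```

Calling register_data_sources("data_source.py") would then run the command. Passing the arguments as a list avoids shell quoting issues with file paths.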
Deleting Data Sources
To delete a data source, execute the following qwak command in the terminal:
qwak features delete --data-source <data-source-name>
Deleting Data Sources in use
Before you can delete a Data Source that is linked to one or more Feature Sets, you must either remove those Feature Sets or reassign them to a different Data Source.
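The rule above can be enforced in a script before the delete command is ever run. The sketch below is hypothetical: build_delete_command is not part of the JFrog ML SDK, and the list of linked Feature Sets must be supplied by the caller, since this example does not query JFrog ML.

```python
from typing import List, Sequence


def build_delete_command(
    data_source_name: str,
    linked_feature_sets: Sequence[str] = (),
) -> List[str]:
    """Build `qwak features delete --data-source <name>`, refusing if still in use.

    linked_feature_sets is provided by the caller; nothing is looked up remotely.
    """
    if linked_feature_sets:
        names = ", ".join(linked_feature_sets)
        raise ValueError(
            f"Data source '{data_source_name}' is still used by: {names}. "
            "Remove or reassign these Feature Sets first."
        )
    return ["qwak", "features", "delete", "--data-source", data_source_name]


# A data source with no linked Feature Sets yields the delete command
print(" ".join(build_delete_command("credit_risk_data")))
```

The resulting list can be passed to subprocess.run once the check succeeds.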