Generate Text Embeddings using UDF Functions
This tutorial shows how you can use the user-defined function feature of the Qwak Feature Store to tokenize text for NLP models.
First, you must request the installation of the required dependencies in your Qwak environment. To do so, send a message to our Tech Support team. Requests are approved or rejected on a case-by-case basis because additional libraries may interfere with the Qwak code.
In this tutorial, we use the BertTokenizer from the transformers library created by Hugging Face.
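To get a feel for what the tokenizer produces, here is a minimal standalone sketch you can run outside of Qwak (assuming the transformers library is installed locally; the sample sentence and the printed slices are for illustration only, and the same tokenizer arguments appear in the UDF later in this tutorial):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# Tokenize a list of strings
encoded = tokenizer(
    ["Feature stores make machine learning pipelines easier"],
    max_length=128,
    truncation=True,
    pad_to_max_length=True,
    return_attention_mask=True
)

print(encoded['input_ids'][0][:10])       # token ids, padded/truncated to max_length
print(encoded['attention_mask'][0][:10])  # 1 for real tokens, 0 for padding
print(encoded['token_type_ids'][0][:10])  # segment ids (all 0 for a single sentence)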
After your request is approved, you can define a data source and an entity. For example, we can retrieve the text values from a CSV file stored in S3. In such a case, create an empty Python file and add the following CsvSource and Entity definitions. The example assumes the CSV file contains an id column (the entity key), a text column with the values to tokenize, and a created_date column:
from qwak.feature_store.sources.data_sources import CsvSource
from qwak.feature_store.entities import Entity, ValueType

csv_source = CsvSource(
    name='test_text_data',
    description='Some test text data',
    date_created_column='created_date',
    path='s3://path'
)

entity = Entity(
    name='test_text_data_entity',
    keys=['id'],
    description='Test data',
    value_type=ValueType.INTEGER
)
Now, we can define the custom function to tokenize the text:
def tokenize_text(kdf_dict):
    from transformers import BertTokenizer
    from databricks.koalas import Series

    # Store the tokenizer in the /tmp directory because we can't write to other locations
    tokenizer = BertTokenizer.from_pretrained(
        "bert-base-uncased",
        do_lower_case=True,
        cache_dir='/tmp/transformers'
    )

    kdf = kdf_dict['test_text_data']  # extracting the data source
    text = kdf['text'].to_list()  # we need to pass a list of strings to the tokenizer

    result = tokenizer(
        text,
        max_length=128,
        truncation=True,
        pad_to_max_length=True,
        return_attention_mask=True
    )

    # We are going to merge columns from two data sources, so we need to enable this operation in Koalas
    from databricks.koalas.config import set_option
    set_option("compute.ops_on_diff_frames", True)

    kdf['input_ids'] = Series(result['input_ids'])
    kdf['attention_mask'] = Series(result['attention_mask'])
    kdf['token_type_ids'] = Series(result['token_type_ids'])

    return kdf
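If you want to sanity-check the UDF before registering anything, you can call it locally with a small Koalas DataFrame wrapped in the same dictionary shape that Qwak passes in. This is a rough sketch, assuming a local PySpark installation with the koalas and transformers packages available; the sample rows are made up:

import databricks.koalas as ks

sample_kdf = ks.DataFrame({
    'id': [1, 2],
    'text': ['hello feature store', 'tokenize this sentence too']
})

# The dictionary key must match the data source name used inside tokenize_text
result_kdf = tokenize_text({'test_text_data': sample_kdf})
print(result_kdf[['id', 'input_ids', 'attention_mask']].head())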
Finally, we define the feature set using the custom processing function:
from datetime import datetime

from qwak.feature_store.features.feature_sets import BatchFeatureSet, Metadata, Backfill
from qwak.feature_store.features.functions import UdfFunction
from qwak.feature_store.features.read_policies import ReadPolicy

batch_feature_set = BatchFeatureSet(
    name='test_text_data_with_tokenization',
    metadata=Metadata(
        display_name='text data with tokenization',
        description='Desc',
        owner='[email protected]'
    ),
    entity='test_text_data_entity',
    data_sources={'test_text_data': ReadPolicy.FullRead},
    backfill=Backfill(
        start_date=datetime(2022, 4, 1)
    ),
    scheduling_policy='*/10 * * * *',  # cron expression: run every 10 minutes
    function=UdfFunction(tokenize_text)
)
Remember to register the features by running the Qwak CLI command in the directory with the Python file containing the feature definitions:
qwak features register -p .