Generate Text Embeddings using UDF Functions

This tutorial shows how you can use the user-defined function feature of the Qwak Feature Store to tokenize text for NLP models.

First, request the installation of the required dependencies in your Qwak environment by sending a message to our Tech Support team. Requests are approved or rejected on a case-by-case basis because additional libraries may conflict with Qwak's own dependencies.

Here, we use the BertTokenizer from the transformers library created by Hugging Face.
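
If you want to see what the tokenizer produces before wiring it into the feature store, you can run it locally on a couple of strings. This is only a quick sketch: the sample sentences and the printed slices are placeholders, and the settings mirror the ones we use in the UDF below.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# Tokenize two example sentences with the same settings as the UDF below
result = tokenizer(
    ["The first example sentence.", "The second one."],
    max_length=128,
    truncation=True,
    pad_to_max_length=True,
    return_attention_mask=True
)

print(result['input_ids'][0][:10])       # token ids, padded/truncated to 128 elements
print(result['attention_mask'][0][:10])  # 1 for real tokens, 0 for padding
print(result['token_type_ids'][0][:10])  # segment ids (all 0 for single sentences)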

After your request has been approved, you can define a data source and an entity.

For example, we can read the text values from a CSV file stored in S3. In that case, create a new Python file and put the following CsvSource and Entity definitions in it:

from qwak.feature_store.sources.data_sources import CsvSource
from qwak.feature_store.entities import Entity, ValueType

csv_source = CsvSource(
    name='test_text_data',
    description='Some test text data',
    date_created_column='created_date',
    path='s3://path'
)

entity = Entity(
    name='test_text_data_entity',
    keys=['id'],
    description='Test data',
    value_type=ValueType.INTEGER
)

Now, we can define the custom function to tokenize the text:

def tokenize_text(kdf_dict):
    from transformers import BertTokenizer
    from databricks.koalas import Series

    # Store the tokenizer cache in the /tmp directory because we can't write to other locations
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True, cache_dir='/tmp/transformers')

    kdf = kdf_dict['test_text_data']  # extract the Koalas DataFrame of the data source
    text = kdf['text'].to_list()  # the tokenizer expects a list of strings

    result = tokenizer(text,
        max_length=128,
        truncation=True,
        pad_to_max_length=True,
        return_attention_mask=True
    )

    # We assign columns built from a different frame, so we need to enable this operation in Koalas
    from databricks.koalas.config import set_option
    set_option("compute.ops_on_diff_frames", True)

    kdf['input_ids'] = Series(result['input_ids'])
    kdf['attention_mask'] = Series(result['attention_mask'])
    kdf['token_type_ids'] = Series(result['token_type_ids'])
    return kdf
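
Before registering anything, you may want to sanity-check the UDF locally. The feature store calls it with a dictionary that maps each data source name to a Koalas DataFrame, so we can mimic that contract with a tiny hand-made frame. The sample rows below are purely illustrative, and running this requires a local Spark environment with Koalas installed.

import databricks.koalas as ks

# A tiny frame that mimics the 'test_text_data' CSV source
sample_kdf = ks.DataFrame({
    'id': [1, 2],
    'text': ['A short example sentence.', 'Another example.'],
    'created_date': ['2022-04-01', '2022-04-02']
})

# The UDF receives a dict keyed by data source name and returns a Koalas DataFrame
tokenized = tokenize_text({'test_text_data': sample_kdf})
print(tokenized[['id', 'input_ids', 'attention_mask']].head())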

Finally, we define the feature set using the custom processing function:

from datetime import datetime

from qwak.feature_store.features.feature_sets import BatchFeatureSet, Metadata, Backfill
from qwak.feature_store.features.functions import UdfFunction
from qwak.feature_store.features.read_policies import ReadPolicy

batch_feature_set = BatchFeatureSet(
    name='test_text_data_with_tokenization',
    metadata=Metadata(
        display_name='text data with tokenization',
        description='Desc',
        owner='[email protected]'
    ),
    entity='test_text_data_entity',
    data_sources={'test_text_data': ReadPolicy.FullRead},
    backfill=Backfill(
        start_date=datetime(2022, 4, 1)
    ),
    scheduling_policy='*/10 * * * *',
    function=UdfFunction(tokenize_text)
)

Remember to register the feature set by running the Qwak CLI command in the directory that contains the Python file with the feature definitions:

qwak features register -p .