Examples
Text Sentiment Analysis
This example uses a modified version of the IMDb dataset published by Andrew L. Maas.
The modified version combines all samples into one file and adds an ID. Therefore, the schema follows the structure of the feature view below.
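If you want to peek at the raw file before defining anything, a plain polars read works. This is only a sketch: the file path is pieced together from the source definition below, and the column names are assumed to match the feature view.
import polars as pl

# Peek at the first rows of the raw file (path assembled from the source below;
# column names are assumed to match the feature view)
raw = pl.read_csv("data/sentiment/sentiment.csv", n_rows=3)
print(raw.columns)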
from aligned import feature_view, String, Bool, FileSource
sentiment_dir = FileSource.directory("data/sentiment")
@feature_view(
description="""Sentiment analysis of text data.
[Learning Word Vectors for Sentiment Analysis](https://aclanthology.org/P11-1015)
(Maas et al., ACL 2011)""",
source=sentiment_dir.csv_at("sentiment.csv"),
)
class AnnotatedMovieReview:
review_id = String().as_entity()
text = String()
is_negative = Bool()
We can load a small sample of the data with the code below.
df = await AnnotatedMovieReview.query().all(limit=10).to_polars()
print(df)
This returns the following data frame:
shape: (10, 3)
┌─────────────┬─────────────────────────────────┬───────────────────────┐
│ is_negative ┆ text ┆ review_id │
│ --- ┆ --- ┆ --- │
│ bool ┆ str ┆ str │
╞═════════════╪═════════════════════════════════╪═══════════════════════╡
│ true ┆ Working with one of the best S… ┆ train/neg/1821_4.txt │
│ true ┆ Well...tremors I, the original… ┆ train/neg/10402_1.txt │
│ true ┆ Ouch! This one was a bit painf… ┆ train/neg/1062_4.txt │
│ true ┆ I've seen some crappy movies i… ┆ train/neg/9056_1.txt │
│ true ┆ "Carriers" follows the exploit… ┆ train/neg/5392_3.txt │
│ false ┆ For a movie that gets no respe… ┆ train/pos/4715_9.txt │
│ false ┆ Bizarre horror movie filled wi… ┆ train/pos/12390_8.txt │
│ false ┆ A solid, if unremarkable film.… ┆ train/pos/8329_7.txt │
│ false ┆ It's a strange feeling to sit … ┆ train/pos/9063_8.txt │
│ false ┆ You probably all already know … ┆ train/pos/3092_10.txt │
└─────────────┴─────────────────────────────────┴───────────────────────┘
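Since df is an ordinary polars DataFrame, we can also run quick checks directly on it, for example to see the class balance of the sample:
# Count how many positive and negative reviews are in the sample
print(df["is_negative"].value_counts())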
Embedded Data
To train a sentiment model, I want to leverage some embeddings as the input features to a logistic regression model.
For this, we can set up a model contract that computes the text embeddings.
from aligned import Embedding, String, Bool, FileSource, model_contract
from aligned.exposed_model.interface import openai_embedding
review = AnnotatedMovieReview()
@model_contract(
input_features=[review.text],
exposed_model=openai_embedding(
model="text-embedding-3-small",
)
)
class MovieReviewEmbedding:
review_id = String().as_entity()
text = String()
text_embedding = Embedding(embedding_size=1536)
embedding_version = String().as_model_version()
We can now check out the embeddings by running a predict_over.
embeddings = await store.model(MovieReviewEmbedding).predict_over({
"review_id": ["train/neg/1821_4.txt", "train/pos/12390_8.txt"]
}).to_polars()
print(embeddings)
This will show something like the following:
shape: (2, 4)
┌───────────────────────┬─────────────────────────────────┬─────────────────────────────────┬────────────────────────┐
│ review_id ┆ text ┆ text_embedding ┆ embedding_version │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list[f32] ┆ str │
╞═══════════════════════╪═════════════════════════════════╪═════════════════════════════════╪════════════════════════╡
│ train/neg/1821_4.txt ┆ Working with one of the best S… ┆ [-0.000196, 0.065838, … -0.022… ┆ text-embedding-3-small │
│ train/pos/12390_8.txt ┆ Bizarre horror movie filled wi… ┆ [-0.019459, 0.038971, … -0.055… ┆ text-embedding-3-small │
└───────────────────────┴─────────────────────────────────┴─────────────────────────────────┴────────────────────────┘
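As a quick, purely illustrative sanity check, we can compare the two embeddings with cosine similarity using numpy:
import numpy as np

# Stack the two embedding vectors and compute their cosine similarity
vectors = np.array(embeddings["text_embedding"].to_list())
a, b = vectors[0], vectors[1]
cosine_similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine_similarity)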
The Sentiment Model
Finally, we can get to the sentiment model, which will predict whether a review is negative or not.
review_embedding = MovieReviewEmbedding()
@model_contract(
input_features=[review_embedding.text_embedding],
)
class MovieReviewIsNegative:
review_id = String().as_entity()
model_version = String().as_model_version()
is_negative_pred = review.is_negative.as_classification_label()
Create a Training Dataset
With this, we can create a training dataset with the following.
from sklearn.linear_model import LogisticRegression
# We want to create a dataset out of all rows
entities = store.feature_view(AnnotatedMovieReview).all()
dataset = (store.model(MovieReviewIsNegative)
.with_labels()
.features_for(entities)
.train_test(train_size=0.7)
)
This will fetch the review_id feature from the AnnotatedMovieReview view. However, since the input does not contain the text_embedding, it will also load the text feature in order to compute the text_embedding for you.
Also note that the dataset contains both the train and the test split. Neither is computed yet; they stay lazy, so you can load the train and test sets in separate steps.
Train the model
Furthermore, since we are using embeddings and sklearn models do not accept nested input, Aligned adds a convenience method to explode each embedding dimension into its own dedicated column. This is done with unpack_embeddings().
Here, Aligned looks at the expected schema, finds the embedding column, and explodes it.
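Conceptually, the unpacking is similar to the polars sketch below. This is only an illustration of the idea, not Aligned's actual implementation, and the generated column names are an assumption.
import polars as pl

# Illustration only: turn a list column into one column per dimension
df = pl.DataFrame({"text_embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]})
unpacked = df.with_columns(
    pl.col("text_embedding").list.to_struct(fields=lambda i: f"text_embedding_{i}")
).unnest("text_embedding")
print(unpacked.columns)  # ["text_embedding_0", "text_embedding_1", "text_embedding_2"]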
model = LogisticRegression(...)
train = await (dataset.train
# Splits each embedding dimension into its own dedicated column
.unpack_embeddings()
.to_polars()
)
model.fit(train.input, train.labels)
Notice how the input and labels are already structured for you through the train.input and train.labels properties.
Therefore, you no longer need to split the input and ground truth yourself, as the model_contract contains this information.
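For example, a quick inspection shows that the split already separates the model input from the ground truth (the exact column names depend on your data):
# train.input holds the unpacked embedding columns,
# while train.labels holds the is_negative ground truth
print(train.input.columns)
print(train.labels)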
Test the model
Furthermore, using the test set is just as easy.
test = await (dataset.test
# Splits each embedding dimension into its own dedicated column
.unpack_embeddings()
.to_polars()
)
preds = model.predict(test.input)
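To evaluate the predictions we can use standard sklearn metrics. This sketch assumes test.labels mirrors train.labels for the test split.
from sklearn.metrics import accuracy_score, classification_report

# Compare the predictions against the ground-truth labels of the test split
print(accuracy_score(test.labels, preds))
print(classification_report(test.labels, preds))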
Expose the Model
Assuming we have registered the model as an MLFlow model named movie_review with the alias champion, we can update the contract and add the model's location.
from aligned import Int32
# The import path for mlflow_server is assumed here; adjust it if your aligned version differs
from aligned.exposed_model.mlflow import mlflow_server

@model_contract(
input_features=[review_embedding.text_embedding],
exposed_model=mlflow_server(
host="http://your-movie-review-endpoint.com",
model_name="movie_review",
model_alias="champion"
)
)
class MovieReviewIsNegative:
review_id = String().as_entity()
model_version = Int32().as_model_version()
is_negative_pred = review.is_negative.as_classification_label()
This enables us to use .predict_over(), which will do the following.
- It will load the relevant embedding, or compute it if needed.
- It will construct the HTTP request to the MLFlow server.
- It will set the correct name of the prediction.
- It will figure out which model is currently deployed, and add that to the model_version column, as it is marked with the .as_model_version() tag.
preds = await (store.model(MovieReviewIsNegative)
.predict_over({
"review_id": ["train/neg/1821_4.txt", "train/pos/12390_8.txt"]
}).to_polars()
)
Leading to the following result.
shape: (2, 5)
┌─────────────────────────────────┬───────────────────────┬───────────────┬──────────────────┐
│ text_embedding ┆ review_id ┆ model_version ┆ is_negative_pred │
│ --- ┆ --- ┆ --- ┆ --- │
│ list[f64] ┆ str ┆ int32 ┆ bool │
╞═════════════════════════════════╪═══════════════════════╪═══════════════╪══════════════════╡
│ [-0.293021, -0.035181, … 0.295… ┆ train/neg/1821_4.txt ┆ 4 ┆ true │
│ [-0.584421, 0.370267, … 0.5556… ┆ train/pos/12390_8.txt ┆ 4 ┆ false │
└─────────────────────────────────┴───────────────────────┴───────────────┴──────────────────┘
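To spot-check the predictions against the ground truth, we can join them with the annotated reviews. A small sketch, assuming the full view fits in memory:
# Load the annotated reviews and join them onto the predictions
actual = await store.feature_view(AnnotatedMovieReview).all().to_polars()
check = preds.join(actual.select("review_id", "is_negative"), on="review_id")
print(check.select("review_id", "is_negative", "is_negative_pred"))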