Examples
Text Sentiment Analysis
This example uses a modified version of the IMDb dataset published by Andrew L. Maas.
The modified version combines all samples into one file and adds an ID. Therefore, the schema follows the structure of the feature view below.
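If you want to peek at the raw file before defining anything, a plain polars read works. This is only a sketch: the file path is pieced together from the source definition below, and the column names are assumed to match the feature view.
import polars as pl

# Peek at the first rows of the raw file (path assembled from the source below;
# column names are assumed to match the feature view)
raw = pl.read_csv("data/sentiment/sentiment.csv", n_rows=3)
print(raw.columns)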
from aligned import feature_view, String, Bool, FileSource
sentiment_dir = FileSource.directory("data/sentiment")
@feature_view(
description="""Sentiment analysis of text data.
[Learning Word Vectors for Sentiment Analysis](https://aclanthology.org/P11-1015)
(Maas et al., ACL 2011)""",
source=sentiment_dir.csv_at("sentiment.csv"),
)
class AnnotatedMovieReview:
review_id = String().as_entity()
text = String()
is_negative = Bool()
We can load a small sample of the data with the code below.
df = await AnnotatedMovieReview.query().all(limit=10).to_polars()
print(df)
This returns the following data frame:
shape: (10, 3)
┌─────────────┬─────────────────────────────────┬───────────────────────┐
│ is_negative ┆ text ┆ review_id │
│ --- ┆ --- ┆ --- │
│ bool ┆ str ┆ str │
╞═════════════╪═════════════════════════════════╪═══════════════════════╡
│ true ┆ Working with one of the best S… ┆ train/neg/1821_4.txt │
│ true ┆ Well...tremors I, the original… ┆ train/neg/10402_1.txt │
│ true ┆ Ouch! This one was a bit painf… ┆ train/neg/1062_4.txt │
│ true ┆ I've seen some crappy movies i… ┆ train/neg/9056_1.txt │
│ true ┆ "Carriers" follows the exploit… ┆ train/neg/5392_3.txt │
│ false ┆ For a movie that gets no respe… ┆ train/pos/4715_9.txt │
│ false ┆ Bizarre horror movie filled wi… ┆ train/pos/12390_8.txt │
│ false ┆ A solid, if unremarkable film.… ┆ train/pos/8329_7.txt │
│ false ┆ It's a strange feeling to sit … ┆ train/pos/9063_8.txt │
│ false ┆ You probably all already know … ┆ train/pos/3092_10.txt │
└─────────────┴─────────────────────────────────┴───────────────────────┘
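Since df is an ordinary polars DataFrame, we can also run quick checks directly on it, for example to see the class balance of the sample:
# Count how many positive and negative reviews are in the sample
print(df["is_negative"].value_counts())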
Embedded Data
To train a sentiment model, I want to leverage some embeddings as the input features to a logistic regression model.
For this, we can set up a model contract that computes the text embeddings.
from aligned import Embedding, String, Bool, FileSource, model_contract
from aligned.exposed_model.interface import openai_embedding
review = AnnotatedMovieReview()
@model_contract(
input_features=[review.text],
exposed_model=openai_embedding(
model="text-embedding-3-small",
)
)
class MovieReviewEmbedding:
review_id = String().as_entity()
text = String()
text_embedding = Embedding(embedding_size=1536)
embedding_version = String().as_model_version()
We can now check out the embeddings by running a predict_over.
embeddings = await store.model(MovieReviewEmbedding).predict_over({
"review_id": ["train/neg/1821_4.txt", "train/pos/12390_8.txt"]
}).to_polars()
print(embeddings)
This will show something like the following:
shape: (2, 4)
┌───────────────────────┬─────────────────────────────────┬─────────────────────────────────┬────────────────────────┐
│ review_id ┆ text ┆ text_embedding ┆ embedding_version │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list[f32] ┆ str │
╞═══════════════════════╪═════════════════════════════════╪═════════════════════════════════╪════════════════════════╡
│ train/neg/1821_4.txt ┆ Working with one of the best S… ┆ [-0.000196, 0.065838, … -0.022… ┆ text-embedding-3-small │
│ train/pos/12390_8.txt ┆ Bizarre horror movie filled wi… ┆ [-0.019459, 0.038971, … -0.055… ┆ text-embedding-3-small │
└───────────────────────┴─────────────────────────────────┴─────────────────────────────────┴────────────────────────┘
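As a quick, purely illustrative sanity check, we can compare the two embeddings with cosine similarity using numpy:
import numpy as np

# Stack the two embedding vectors and compute their cosine similarity
vectors = np.array(embeddings["text_embedding"].to_list())
a, b = vectors[0], vectors[1]
cosine_similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine_similarity)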
The Sentiment Model
Finally, we can get to the sentiment model, which will predict whether a review is negative or not.
review_embedding = MovieReviewEmbedding()
@model_contract(
input_features=[review_embedding.text_embedding],
)
class MovieReviewIsNegative:
review_id = String().as_entity()
model_version = String().as_model_version()
is_negative_pred = review.is_negative.as_classification_label()
Create a Training Dataset
With this, we can create a training dataset with the following.
from sklearn.linear_model import LogisticRegression
# We want to create a dataset out of all rows
entities = store.feature_view(AnnotatedMovieReview).all()
dataset = (store.model(MovieReviewIsNegative)
.with_labels()
.features_for(entities)
.train_test(train_size=0.7)
)
This will fetch the review_id feature from the AnnotatedMovieReview view. However, since the input does not contain the text_embedding, it will also load the text feature in order to compute the text_embedding for you.
Also note that the dataset contains both the train and the test split. Neither is computed yet; they stay lazy, so you can load the train and test sets in separate steps.
Train the model
Furthermore, since we are using embeddings and sklearn models do not accept nested input, Aligned adds a convenience method to explode each embedding dimension into its own dedicated column. This is done with unpack_embeddings().
Here, Aligned looks at the expected schema, finds the embedding column, and explodes it.
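Conceptually, the unpacking is similar to the polars sketch below. This is only an illustration of the idea, not Aligned's actual implementation, and the generated column names are an assumption.
import polars as pl

# Illustration only: turn a list column into one column per dimension
df = pl.DataFrame({"text_embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]})
unpacked = df.with_columns(
    pl.col("text_embedding").list.to_struct(fields=lambda i: f"text_embedding_{i}")
).unnest("text_embedding")
print(unpacked.columns)  # ["text_embedding_0", "text_embedding_1", "text_embedding_2"]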
model = LogisticRegression(...)
train = await (dataset.train
# Splits each embedding dimension into its own dedicated column
.unpack_embeddings()
.to_polars()
)
model.fit(train.input, train.labels)
Notice how the input and labels are already structured for you through the train.input and train.labels properties.
Therefore, you no longer need to split the input and ground truth yourself, as the model_contract contains this information.
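For example, a quick inspection shows that the split already separates the model input from the ground truth (the exact column names depend on your data):
# train.input holds the unpacked embedding columns,
# while train.labels holds the is_negative ground truth
print(train.input.columns)
print(train.labels)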
Test the model
Furthermore, using the test set is just as easy.
test = await (dataset.test
# Splits each embedding dimension into its own dedicated column
.unpack_embeddings()
.to_polars()
)
preds = model.predict(test.input)
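To evaluate the predictions we can use standard sklearn metrics. This sketch assumes test.labels mirrors train.labels for the test split.
from sklearn.metrics import accuracy_score, classification_report

# Compare the predictions against the ground-truth labels of the test split
print(accuracy_score(test.labels, preds))
print(classification_report(test.labels, preds))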
Expose the Model
Assuming we have registered the model as an MLFlow model named movie_review with the alias champion, we can update the contract and add the model's location.
from aligned import Int32
# The import path for mlflow_server is assumed here; adjust it if your aligned version differs
from aligned.exposed_model.mlflow import mlflow_server

@model_contract(
input_features=[review_embedding.text_embedding],
exposed_model=mlflow_server(
host="http://your-movie-review-endpoint.com",
model_name="movie_review",
model_alias="champion"
)
)
class MovieReviewIsNegative:
review_id = String().as_entity()
model_version = Int32().as_model_version()
is_negative_pred = review.is_negative.as_classification_label()
This enables us to use .predict_over(), which will do the following.
- It will load the relevant embedding, or compute it if needed.
- It will construct the HTTP request to the MLFlow server.
- It will set the correct name of the prediction.
- It will figure out which model is currently deployed, and add that to the model_version column, as it is marked with the .as_model_version() tag.
preds = await (store.model(MovieReviewIsNegative)
.predict_over({
"review_id": ["train/neg/1821_4.txt", "train/pos/12390_8.txt"]
}).to_polars()
)
Leading to the following result.
shape: (2, 5)
┌─────────────────────────────────┬───────────────────────┬───────────────┬──────────────────┐
│ text_embedding ┆ review_id ┆ model_version ┆ is_negative_pred │
│ --- ┆ --- ┆ --- ┆ --- │
│ list[f64] ┆ str ┆ int32 ┆ bool │
╞═════════════════════════════════╪═══════════════════════╪═══════════════╪══════════════════╡
│ [-0.293021, -0.035181, … 0.295… ┆ train/neg/1821_4.txt ┆ 4 ┆ true │
│ [-0.584421, 0.370267, … 0.5556… ┆ train/pos/12390_8.txt ┆ 4 ┆ false │
└─────────────────────────────────┴───────────────────────┴───────────────┴──────────────────┘
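To spot-check the predictions against the ground truth, we can join them with the annotated reviews. A small sketch, assuming the full view fits in memory:
# Load the annotated reviews and join them onto the predictions
actual = await store.feature_view(AnnotatedMovieReview).all().to_polars()
check = preds.join(actual.select("review_id", "is_negative"), on="review_id")
print(check.select("review_id", "is_negative", "is_negative_pred"))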