Extract Features with an LLM

This guide demonstrates how to use a language model to extract structured features from raw, unstructured text.

Define the Input Schema

We begin by defining the input schema for our text data. In this example, the input consists of a single string field. However, you can easily extend this schema to include additional information, such as image URLs, metadata, or any other relevant fields.

@feature_view()
class TextDocument:
    content = String()

Define the Expected Output

Next, define the output structure the model should extract: in this case, the name and age of any person mentioned in the document.

@model_contract(...)
class Persons:
    name = String().is_optional()
    age = Int32().is_optional()

Define the Model

Now specify the model to use and the features to send to it. In this example, we use ollama_extraction, which automatically generates a prompt based on the input features and ensures that the output matches the expected schema.

@model_contract(
    input_features=[TextDocument],
    exposed_model=ollama_extraction(model="mistral:latest")
)
class Persons:
    name = String().is_optional()
    age = Int32().is_optional()
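
To make the idea of schema-driven extraction more concrete, the sketch below shows one way a prompt could be derived from a schema and how a JSON reply could be checked against it. It is not Aligned's implementation: `PersonSchema`, `FIELD_TYPES`, `build_prompt`, and `parse_reply` are hypothetical names, and the reply string is hand-written rather than produced by a model.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PersonSchema:
    # Hypothetical stand-in for the Persons contract: both fields are optional.
    name: Optional[str] = None
    age: Optional[int] = None


# Maps each field name to the Python type its value is coerced to.
FIELD_TYPES = {"name": str, "age": int}


def build_prompt(text: str) -> str:
    """Describe the expected JSON keys so the model knows what to return."""
    keys = ", ".join(
        f'"{f.name}" ({FIELD_TYPES[f.name].__name__} or null)'
        for f in fields(PersonSchema)
    )
    return f"Extract the following keys as JSON: {keys}.\n\nText: {text}"


def parse_reply(reply: str) -> PersonSchema:
    """Keep only known keys and coerce each non-null value to the schema type."""
    raw = json.loads(reply)
    values = {}
    for f in fields(PersonSchema):
        value = raw.get(f.name)
        values[f.name] = None if value is None else FIELD_TYPES[f.name](value)
    return PersonSchema(**values)


prompt = build_prompt("Donald Duck is almost 100 years old at this point")
# Hand-written reply: the age arrives as a string and there is an extra key.
person = parse_reply('{"name": "Donald Duck", "age": "99", "species": "duck"}')
print(prompt)
print(person)
```

The unknown `species` key is dropped and the string age is coerced to an integer, which is the kind of guarantee a schema-checked extraction step is meant to provide.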

Use the Model

You can now run the extraction using the following code:

store = await ContractStore.from_dir(".")

extracts = await store.model(Persons).predict_over({
    "content": [
        "Donald Duck is almost 100 years old at this point",
        "Rick and Morty is only 13 years old"
    ]
}).to_polars()
print(extracts)

The output will look something like this:

shape: (2, 4)
┌───────────────────────────────────┬────────────────────┬──────────────────────────────────┬──────┐
│ content                           ┆ name               ┆ prompt_output                    ┆ age  │
│ ---                               ┆ ---                ┆ ---                              ┆ ---  │
│ str                               ┆ str                ┆ str                              ┆ i64  │
╞═══════════════════════════════════╪════════════════════╪══════════════════════════════════╪══════╡
│ Donald Duck is almost 100 year…   ┆ Donald Duck        ┆ {"name": "Donald Duck", "age"…   ┆ 99   │
│ Rick and Morty is only 13 year…   ┆ Rick and Morty     ┆ {"age": 13, "name": "Rick and…   ┆ 13   │
└───────────────────────────────────┴────────────────────┴──────────────────────────────────┴──────┘
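
The prompt_output column appears to hold the model's raw JSON reply next to the parsed name and age columns. When a parsed value looks wrong, decoding the raw reply can help tell a model error from a parsing error. A minimal sketch, assuming the reply is a flat JSON object; the `raw_reply` value is a hand-written stand-in, since the table truncates the real one.

```python
import json

# Hand-written stand-in for one cell of the prompt_output column.
raw_reply = '{"name": "Donald Duck", "age": 99}'
# Parsed values as they would appear in the name and age columns.
parsed_row = {"name": "Donald Duck", "age": 99}

decoded = json.loads(raw_reply)

# Collect every column whose parsed value differs from the raw reply.
mismatches = {
    key: (decoded.get(key), parsed_value)
    for key, parsed_value in parsed_row.items()
    if decoded.get(key) != parsed_value
}
print(mismatches or "parsed columns match the raw reply")
```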

Extracting Multiple Entities

In some cases, you may want the model to extract multiple persons from a single text, for example Rick and Morty as separate entities rather than a single entry.

Schema Definitions

Aligned supports schema definitions using pydantic.BaseModel, dataclass, and feature_view classes. These can be nested using the Struct data type for complex extractions.

To extract each person separately, modify the output schema to return a list of Person objects:

@feature_view()
class Person:
    name = String().is_optional()
    age = Int32().is_optional()


@model_contract(
    input_features=[TextDocument],
    exposed_model=ollama_extraction(model="mistral:latest")
)
class Persons:
    persons = List(Struct(Person))

With this updated schema, the output might look like this:

shape: (2, 3)
┌─────────────────────────────────┬─────────────────────────────┬─────────────────────────────────┐
│ content                         ┆ persons                     ┆ prompt_output                   │
│ ---                             ┆ ---                         ┆ ---                             │
│ str                             ┆ list[struct[2]]             ┆ str                             │
╞═════════════════════════════════╪═════════════════════════════╪═════════════════════════════════╡
│ Donald Duck is almost 100 year… ┆ [{"Donald Duck",99}]        ┆ {"persons": [{"name": "Donald … │
│ Rick and Morty is only 13 year… ┆ [{"Rick",13}, {"Morty",13}] ┆ {"persons": [{"name": "Rick", … │
└─────────────────────────────────┴─────────────────────────────┴─────────────────────────────────┘
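
A list-of-structs column is often easier to analyse once it is flattened to one row per person. The sketch below does this in plain Python on rows shaped like the table above. The `rows` literal is a hand-copied stand-in for what converting the DataFrame to dictionaries might give, not live model output, and `flat` is a hypothetical name.

```python
# Rows shaped like the table above (hand-copied stand-in, not live output).
rows = [
    {
        "content": "Donald Duck is almost 100 years old at this point",
        "persons": [{"name": "Donald Duck", "age": 99}],
    },
    {
        "content": "Rick and Morty is only 13 years old",
        "persons": [{"name": "Rick", "age": 13}, {"name": "Morty", "age": 13}],
    },
]

# One output row per extracted person, keeping the source text alongside.
flat = [
    {"content": row["content"], **person}
    for row in rows
    for person in (row["persons"] or [])  # tolerate a null or empty list
]

for entry in flat:
    print(entry["name"], entry["age"])
```

If you prefer to stay in Polars, calling explode and then unnest on the persons column should give the equivalent DataFrame.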

Full Code Example

Below is the full code example.

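# Import paths below are assumptions; check them against your installed Aligned version.
from aligned import ContractStore, Int32, List, String, Struct, feature_view, model_contract
from aligned.exposed_model.ollama import ollama_extraction

@feature_view()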
class TextDocument:
    content = String()

@feature_view()
class Person:
    name = String().is_optional()
    age = Int32().is_optional()

@model_contract(
    input_features=[TextDocument],
    exposed_model=ollama_extraction(model="mistral:latest")
)
class Persons:
    persons = List(Struct(Person))


async def use_model():
    store = await ContractStore.from_dir(".")

    extracts = await store.model(Persons).predict_over({
        "content": [
            "Donald Duck is almost 100 years old at this point",
            "Rick and Morty is only 13 years old"
        ]
    }).to_polars()
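    print(extracts)


if __name__ == "__main__":
    import asyncio

    # Run the coroutine when the file is executed as a script.
    asyncio.run(use_model())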