Core concepts
Feature View
To load features into models, aligned uses the concept of a feature view. A feature view is comparable to a data model in the BI domain. It can therefore represent the gold / mart layer, the silver / intermediate layer, or, if you want to get crazy, the bronze / staging layer.
Schema Definition
Defining the schema is almost as easy as setting up a dataclass.
Let's use the following schema as an example.
Column name | Data type |
---|---|
zipcode | Int |
location_type | String |
population | Int |
event_timestamp | Datetime |
created_timestamp | Datetime |
To define the features, we can use the following code.
from aligned import feature_view, String, Int64, EventTimestamp, Timestamp, FileSource

@feature_view(...)
class Zipcode:
    # The entity column identifies each row
    zipcode = Int64().as_entity()

    # Timestamp columns
    event_timestamp = EventTimestamp()
    created_timestamp = Timestamp()

    # Regular feature columns
    location_type = String()
    population = Int64()
This defines all the columns above with their data types, plus some extra semantic meaning, such as which column is the entity and which is the event timestamp for historic data.
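For comparison, here is a minimal sketch of the same columns as a plain standard-library dataclass. The class name `ZipcodeRow` is hypothetical and not part of aligned. It shows what a dataclass alone captures (names and types) and what it leaves out (the entity and event timestamp semantics).

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass
class ZipcodeRow:
    # Hypothetical name; the fields mirror the columns in the schema table.
    zipcode: int
    location_type: str
    population: int
    event_timestamp: datetime
    created_timestamp: datetime


# A dataclass records column names and types, but nothing here marks
# `zipcode` as the entity or `event_timestamp` as the event time.
column_names = [f.name for f in fields(ZipcodeRow)]
print(column_names)
```

The feature view definition carries the same information, and the field types such as `EventTimestamp()` add the semantic layer on top.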
@feature_view
But what is this @feature_view?
It contains all the metadata related to our schema. This can include our main source, a materialized source, owners, descriptions, expected freshness, and more.
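To make the idea of a metadata-carrying decorator concrete, here is a toy sketch in plain Python. It is not aligned's implementation: `toy_feature_view` and `ViewMetadata` are hypothetical names, and the source is represented as a plain string rather than a real source object.

```python
from dataclasses import dataclass, field


@dataclass
class ViewMetadata:
    # Hypothetical container, not aligned's internal type.
    name: str
    source: str
    description: str = ""
    tags: list = field(default_factory=list)


def toy_feature_view(**kwargs):
    """Return a class decorator that stores the given metadata on the class."""
    def wrap(cls):
        cls.metadata = ViewMetadata(**kwargs)
        return cls
    return wrap


@toy_feature_view(name="zipcode", source="data/zipcode_table.parquet")
class Zipcode:
    ...


print(Zipcode.metadata.name)
```

The point of the pattern is that the class body describes the columns, while the decorator arguments describe everything around them.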
Source
The main source of our features.
@feature_view(
    name="zipcode",
    source=FileSource.parquet_at("data/zipcode_table.parquet")
)
class Zipcode:
    ...
Materialized Source
The materialized source of our features.
This can be useful for caching downstream transformations, or for moving data to a more performant data store.
@feature_view(
    name="zipcode",
    source=FileSource.csv_at("data/zipcode_table.csv"),
    materialized_source=FileSource.parquet_at("data/zipcode_table.parquet")
)
class Zipcode:
    ...
Freshness
Most use cases will likely not be streaming, so data is often loaded on a schedule. As a result, aligned lets you define how long features may go without an update and still be acceptable, as well as the point at which the delay becomes unacceptable.
from datetime import timedelta

@feature_view(
    name="zipcode",
    source=FileSource.parquet_at("data/zipcode_table.parquet"),
    acceptable_freshness=timedelta(hours=1),
    unacceptable_freshness=timedelta(hours=3)
)
class Zipcode:
    event_timestamp = EventTimestamp()
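One way to read the two thresholds is as three bands: fresh, stale but tolerated, and unacceptable. The sketch below illustrates that reading using only the standard library. The function `freshness_status`, the band names, and the boundary handling are assumptions for illustration and do not describe how aligned evaluates freshness internally.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(last_updated, now, acceptable, unacceptable):
    """Classify the age of the newest data against two thresholds."""
    age = now - last_updated
    if age <= acceptable:
        return "fresh"
    if age < unacceptable:
        # Past the acceptable window, but not yet unacceptable.
        return "stale"
    return "unacceptable"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
acceptable = timedelta(hours=1)
unacceptable = timedelta(hours=3)

statuses = [
    freshness_status(now - timedelta(minutes=30), now, acceptable, unacceptable),
    freshness_status(now - timedelta(hours=2), now, acceptable, unacceptable),
    freshness_status(now - timedelta(hours=4), now, acceptable, unacceptable),
]
print(statuses)  # ['fresh', 'stale', 'unacceptable']
```

With the values from the example above, data that is two hours old falls between the thresholds: late, but not yet a failure.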
Metadata
Furthermore, you can also add a description, tags, and a list of contacts.
@feature_view(
    name="zipcode",
    source=FileSource.parquet_at("data/zipcode_table.parquet"),
    description="The zipcode features in Norway",
    contacts=["MatsMoll"],
    tags=["eta-team"]
)
class Zipcode:
    ...
Load data
Finally, we can load data with the following code.
df = await Zipcode.query().all().to_pandas()
Or, if we have a loaded feature store, we can query the view through the store.
df = await store.feature_view("zipcode").all().to_pandas()
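Both snippets use `await`, so they need to run inside a coroutine (or an environment with top-level await, such as a notebook). A minimal sketch of driving such a call from a regular script with `asyncio.run` follows. `load_zipcodes` is a hypothetical stand-in for the `Zipcode.query().all().to_pandas()` call, and the row it returns is made-up sample data.

```python
import asyncio


async def load_zipcodes():
    # Hypothetical stand-in for `await Zipcode.query().all().to_pandas()`.
    # The returned row is made-up sample data, not real zipcode data.
    await asyncio.sleep(0)
    return [{"zipcode": 1, "location_type": "urban", "population": 1000}]


def main():
    # asyncio.run starts an event loop, runs the coroutine, and returns its result.
    return asyncio.run(load_zipcodes())


rows = main()
print(len(rows))
```

In a real script, the body of `load_zipcodes` would be replaced by the aligned query itself.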