# Credit Scoring

See how you can use Aligned to load, process, train, and serve a credit scoring model. For this project we will combine three different sources of features, or `FeatureView`s.

View the full code at GitHub.
## Defining our features

### Zipcode

We also have some location features based on a zipcode. The features are stored in a Parquet file with the following schema.
Column name | Data type |
---|---|
zipcode | Int |
city | String |
state | String |
location_type | String |
tax_returns_filed | Int |
population | Int |
total_wages | Int |
event_timestamp | Datetime |
created_timestamp | Datetime |
```python
from datetime import timedelta

from aligned import feature_view, String, Int64, EventTimestamp, Timestamp, FileSource

zipcode_source = FileSource.parquet_at("data/zipcode_table.parquet")

@feature_view(
    name="zipcode_features",
    description="Zipcode features for a given location",
    batch_source=zipcode_source,
)
class Zipcode:
    zipcode = Int64().as_entity()

    event_timestamp = EventTimestamp(ttl=timedelta(days=3650))
    created_timestamp = Timestamp()

    city = String()
    state = String()
    location_type = String()

    tax_returns_filed = Int64()
    population = Int64()
    total_wages = Int64()

    # A derived feature, computed from location_type
    is_primary_location = location_type == "PRIMARY"
```
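Note the last line: `is_primary_location` is a derived feature, so Aligned computes it for you when the view is loaded. Conceptually it is just a row-wise boolean comparison, as in this plain-Python sketch (the rows here are hypothetical; this is not Aligned's actual implementation):

```python
# Hypothetical rows as they might appear in the zipcode Parquet file
rows = [
    {"zipcode": 94105, "location_type": "PRIMARY"},
    {"zipcode": 10001, "location_type": "PO BOX"},
]

# The derived feature is a row-wise comparison against "PRIMARY"
for row in rows:
    row["is_primary_location"] = row["location_type"] == "PRIMARY"

print([row["is_primary_location"] for row in rows])  # [True, False]
```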
Credit History
Finally we have some features regarding the persons credit history. They are also stored in a new Parquet
file, with the following schema.
Column name | Data type |
---|---|
dob_ssn | String |
credit_card_due | Int |
mortgage_due | Int |
student_loan_due | Int |
vehicle_loan_due | Int |
hard_pulls | Int |
missed_payments_2y | Int |
missed_payments_1y | Int |
missed_payments_6m | Int |
bankruptcies | Int |
event_timestamp | Datetime |
created_timestamp | Datetime |
```python
from datetime import timedelta

from aligned import feature_view, String, EventTimestamp, Int64, FileSource

credit_history_source = FileSource.parquet_at("data/credit_history.parquet")

@feature_view(
    name="credit_history",
    description="The credit history for a given person",
    batch_source=credit_history_source,
)
class CreditHistory:
    dob_ssn = String().as_entity().description(
        "Date of birth and last four digits of social security number"
    )

    event_timestamp = EventTimestamp(ttl=timedelta(days=90))

    credit_card_due = Int64()
    mortgage_due = Int64()
    student_loan_due = Int64()
    vehicle_loan_due = Int64()

    hard_pulls = Int64()

    missed_payments_2y = Int64()
    missed_payments_1y = Int64()
    missed_payments_6m = Int64()

    bankruptcies = Int64()
```
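The `ttl=timedelta(days=90)` on the event timestamp bounds how stale a credit-history row may be: rows older than 90 days are not considered valid when features are fetched. The idea can be sketched in plain Python (dates are made up; this is the concept, not Aligned's internals):

```python
from datetime import datetime, timedelta, timezone

ttl = timedelta(days=90)
now = datetime(2024, 6, 1, tzinfo=timezone.utc)

# Hypothetical credit-history rows with their event timestamps
rows = [
    {"dob_ssn": "19530219_5179", "event_timestamp": now - timedelta(days=10)},
    {"dob_ssn": "19641023_3988", "event_timestamp": now - timedelta(days=200)},
]

# Only rows whose event_timestamp falls inside the TTL window are usable
fresh = [row for row in rows if now - row["event_timestamp"] <= ttl]
print(len(fresh))  # 1
```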
### Loan

First, let's look at the value we want to predict: whether a person was granted a loan or not. This will be a boolean value stored in a `loan_status` column in a Parquet file. The same file also contains some more features.
Column name | Data type |
---|---|
loan_status | Bool |
loan_id | Int |
dob_ssn | String |
zipcode | Int |
person_age | Int |
person_income | Int |
person_home_ownership | String |
person_emp_length | Float |
loan_intent | String |
loan_amnt | Int |
loan_int_rate | Float |
event_timestamp | Datetime |
Now let's describe this data using Aligned.
```python
from aligned import feature_view, Int64, String, FileSource, EventTimestamp, Bool, Float

# Rename the raw column "loan_amnt" to the friendlier "loan_amount"
loan_source = FileSource.parquet_at("data/loan_table.parquet", mapping_keys={
    "loan_amnt": "loan_amount"
})

ownership_values = ["RENT", "OWN", "MORTGAGE", "OTHER"]
loan_intent_values = [
    "PERSONAL", "EDUCATION", "MEDICAL", "VENTURE", "HOMEIMPROVEMENT", "DEBTCONSOLIDATION"
]

@feature_view(
    name="loan",
    description="The granted loans",
    batch_source=loan_source,
)
class Loan:
    loan_id = String().as_entity()

    event_timestamp = EventTimestamp()

    loan_status = Bool().description("If the loan was granted or not")

    person_age = Int64()
    person_income = Int64()

    person_home_ownership = String().accepted_values(ownership_values)
    person_home_ownership_ordinal = person_home_ownership.ordinal_categories(ownership_values)

    person_emp_length = Float().description(
        "The number of months the person has been employed in the current job"
    )

    loan_intent = String().accepted_values(loan_intent_values)
    loan_intent_ordinal = loan_intent.ordinal_categories(loan_intent_values)

    loan_amount = Int64()
    loan_int_rate = Float().description("The interest rate of the loan")
```
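The `ordinal_categories` transformation maps each category string to its index in the supplied list, giving the model a numeric encoding of the categorical feature. A plain-Python sketch of the mapping (illustrative only, not Aligned's implementation):

```python
ownership_values = ["RENT", "OWN", "MORTGAGE", "OTHER"]

def ordinal(categories: list[str], value: str) -> int:
    """Map a category string to its position in the ordered list."""
    return categories.index(value)

print(ordinal(ownership_values, "RENT"))      # 0
print(ordinal(ownership_values, "MORTGAGE"))  # 2
```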
## Defining the Model

Finally, now that we have defined where our features are stored and how they should be processed, we can define which features our model will use.

First we import the feature views we want to use. We then declare that `loan_status` will be the label for our `credit_scoring` model, and that this is a classification task.
```python
from aligned import model_contract

from examples.credit_scoring.credit_history import CreditHistory
from examples.credit_scoring.loan import Loan
from examples.credit_scoring.zipcode import Zipcode

credit = CreditHistory()
zipcode = Zipcode()
loan = Loan()

@model_contract(
    name="credit_scoring",
    description="A model that does credit scoring",
    features=[
        credit.credit_card_due,
        credit.mortgage_due,
        credit.student_loan_due,
        credit.vehicle_loan_due,
        credit.hard_pulls,
        credit.missed_payments_1y,
        credit.missed_payments_2y,
        credit.missed_payments_6m,
        credit.bankruptcies,

        zipcode.city,
        zipcode.state,
        zipcode.is_primary_location,
        zipcode.tax_returns_filed,
        zipcode.total_wages,

        loan.person_age,
        loan.person_income,
        loan.person_emp_length,
        loan.person_home_ownership_ordinal,
        loan.loan_amount,
        loan.loan_int_rate,
        loan.loan_intent_ordinal,
    ]
)
class CreditScoring:
    was_granted_loan = loan.loan_status.as_classification_target()
```
## Training a model

To train a model, we can easily load a training data set with the following few lines.
```python
from aligned import FileSource

store = await FileSource.json_at("features.json").feature_store()
entities = FileSource.parquet_at("training_entities.parquet")

training_data = await store.model("credit_scoring")\
    .with_targets()\
    .features_for(entities)\
    .to_pandas()
```
Now that we have some data, training a model is easy. There is no need to define which features to use yourself; that is handled for you.

```python
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(training_data.input, training_data.target)
```
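Before serving, it is worth checking the model on held-out data. Here is a quick sketch using scikit-learn's `train_test_split` and `accuracy_score`; the tiny data set below is purely illustrative, standing in for `training_data.input` and `training_data.target`:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-in for training_data.input / training_data.target
X = [[0, 1], [1, 0], [0, 0], [1, 1], [0, 2], [2, 0], [2, 2], [1, 2]]
y = [0, 1, 0, 1, 0, 1, 1, 1]

# Hold out a quarter of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
```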
Serving a model
Furthermore, we can do something similar for serving our model. However, rather then using our bach source can we change to an online store.
This can be done with the following code.
```python
from aligned import RedisConfig

online_store = store.with_source(
    RedisConfig(env_key="REDIS_URL")
)

entities = {
    "zipcode": [...],
    "dob_ssn": [...],
    "loan_id": [...]
}

online_job = online_store.model("credit_scoring")\
    .features_for(entities)

# Ensure the columns are ordered the same way as during training
feature_columns = online_job.request_result.feature_columns

features = await online_job.to_pandas()

y = classifier.predict(
    features[feature_columns]
)
```
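Assuming the returned rows preserve the order of the entity lists (worth verifying for your store), pairing predictions back up with their loan ids is a simple zip. A small stdlib-only sketch with made-up ids and outputs:

```python
# Hypothetical entity ids and model outputs, in matching order
loan_ids = ["L-1001", "L-1002", "L-1003"]
predictions = [1, 0, 1]

results = dict(zip(loan_ids, predictions))
print(results["L-1002"])  # 0
```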