# Credit Scoring

See how you can use Aligned to load, process, train, and serve a credit scoring model. For this project we will combine three different sources of features, or `FeatureView`s.

View the full code at GitHub.
## Defining our features

### Zipcode

We also have some location features based on a zipcode. The features are stored in a Parquet file with the following schema.
Column name | Data type |
---|---|
zipcode | Int |
city | String |
state | String |
location_type | String |
tax_returns_filed | Int |
population | Int |
total_wages | Int |
event_timestamp | Datetime |
created_timestamp | Datetime |
```python
from datetime import timedelta

from aligned import feature_view, String, Int64, EventTimestamp, Timestamp, FileSource

zipcode_source = FileSource.parquet_at("data/zipcode_table.parquet")

@feature_view(
    name="zipcode_features",
    description="Zipcode features for a given location",
    batch_source=zipcode_source,
)
class Zipcode:
    zipcode = Int64().as_entity()

    event_timestamp = EventTimestamp(ttl=timedelta(days=3650))
    created_timestamp = Timestamp()

    city = String()
    state = String()
    location_type = String()

    tax_returns_filed = Int64()
    population = Int64()
    total_wages = Int64()

    # A derived feature, computed from location_type
    is_primary_location = location_type == "PRIMARY"
```
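Note the last line: `is_primary_location` is a derived feature, so Aligned computes it for you when the view is loaded. Conceptually it is just a row-wise boolean comparison, as in this plain-Python sketch (the rows here are hypothetical; this is not Aligned's actual implementation):

```python
# Hypothetical rows as they might appear in the zipcode Parquet file
rows = [
    {"zipcode": 94105, "location_type": "PRIMARY"},
    {"zipcode": 10001, "location_type": "PO BOX"},
]

# The derived feature is a row-wise comparison against "PRIMARY"
for row in rows:
    row["is_primary_location"] = row["location_type"] == "PRIMARY"

print([row["is_primary_location"] for row in rows])  # [True, False]
```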
Credit History
Finally we have some features regarding the persons credit history. They are also stored in a new Parquet
file, with the following schema.
Column name | Data type |
---|---|
dob_ssn | String |
credit_card_due | Int |
mortgage_due | Int |
student_loan_due | Int |
vehicle_loan_due | Int |
hard_pulls | Int |
missed_payments_2y | Int |
missed_payments_1y | Int |
missed_payments_6m | Int |
bankruptcies | Int |
event_timestamp | Datetime |
created_timestamp | Datetime |
```python
from datetime import timedelta

from aligned import feature_view, String, EventTimestamp, Int64, FileSource

credit_history_source = FileSource.parquet_at("data/credit_history.parquet")

@feature_view(
    name="credit_history",
    description="The credit history for a given person",
    batch_source=credit_history_source,
)
class CreditHistory:
    dob_ssn = String().as_entity().description(
        "Date of birth and last four digits of social security number"
    )

    event_timestamp = EventTimestamp(ttl=timedelta(days=90))

    credit_card_due = Int64()
    mortgage_due = Int64()
    student_loan_due = Int64()
    vehicle_loan_due = Int64()

    hard_pulls = Int64()

    missed_payments_2y = Int64()
    missed_payments_1y = Int64()
    missed_payments_6m = Int64()

    bankruptcies = Int64()
```
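The `ttl=timedelta(days=90)` on the event timestamp bounds how stale a credit-history row may be: rows older than 90 days are not considered valid when features are fetched. The idea can be sketched in plain Python (dates are made up; this is the concept, not Aligned's internals):

```python
from datetime import datetime, timedelta, timezone

ttl = timedelta(days=90)
now = datetime(2024, 6, 1, tzinfo=timezone.utc)

# Hypothetical credit-history rows with their event timestamps
rows = [
    {"dob_ssn": "19530219_5179", "event_timestamp": now - timedelta(days=10)},
    {"dob_ssn": "19641023_3988", "event_timestamp": now - timedelta(days=200)},
]

# Only rows whose event_timestamp falls inside the TTL window are usable
fresh = [row for row in rows if now - row["event_timestamp"] <= ttl]
print(len(fresh))  # 1
```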
### Loan

First, let's look at the value we want to predict: whether a person was granted a loan or not. This will be a boolean value stored in a `loan_status` column in a Parquet file. The same file also contains some more features.
Column name | Data type |
---|---|
loan_status | Bool |
loan_id | Int |
dob_ssn | String |
zipcode | Int |
person_age | Int |
person_income | Int |
person_home_ownership | String |
person_emp_length | Float |
loan_intent | String |
loan_amnt | Int |
loan_int_rate | Float |
event_timestamp | Datetime |
Now let's describe this data using Aligned.
```python
from aligned import feature_view, Int64, String, FileSource, EventTimestamp, Bool, Float

# Rename the raw column "loan_amnt" to the friendlier "loan_amount"
loan_source = FileSource.parquet_at("data/loan_table.parquet", mapping_keys={
    "loan_amnt": "loan_amount"
})

ownership_values = ["RENT", "OWN", "MORTGAGE", "OTHER"]
loan_intent_values = [
    "PERSONAL", "EDUCATION", "MEDICAL", "VENTURE", "HOMEIMPROVEMENT", "DEBTCONSOLIDATION"
]

@feature_view(
    name="loan",
    description="The granted loans",
    batch_source=loan_source,
)
class Loan:
    loan_id = String().as_entity()

    event_timestamp = EventTimestamp()

    loan_status = Bool().description("If the loan was granted or not")

    person_age = Int64()
    person_income = Int64()

    person_home_ownership = String().accepted_values(ownership_values)
    person_home_ownership_ordinal = person_home_ownership.ordinal_categories(ownership_values)

    person_emp_length = Float().description(
        "The number of months the person has been employed in the current job"
    )

    loan_intent = String().accepted_values(loan_intent_values)
    loan_intent_ordinal = loan_intent.ordinal_categories(loan_intent_values)

    loan_amount = Int64()
    loan_int_rate = Float().description("The interest rate of the loan")
```
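The `ordinal_categories` transformation maps each category string to its index in the supplied list, giving the model a numeric encoding of the categorical feature. A plain-Python sketch of the mapping (illustrative only, not Aligned's implementation):

```python
ownership_values = ["RENT", "OWN", "MORTGAGE", "OTHER"]

def ordinal(categories: list[str], value: str) -> int:
    """Map a category string to its position in the ordered list."""
    return categories.index(value)

print(ordinal(ownership_values, "RENT"))      # 0
print(ordinal(ownership_values, "MORTGAGE"))  # 2
```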
## Defining the Model

Finally, now that we have defined where our features are stored and how they should be processed, we can define which features our model will use.

First we import the feature views we want to use. We then declare that `loan_status` will be the label for our `credit_scoring` model, and that this is a classification task.
```python
from aligned import model_contract

from examples.credit_scoring.credit_history import CreditHistory
from examples.credit_scoring.loan import Loan
from examples.credit_scoring.zipcode import Zipcode

credit = CreditHistory()
zipcode = Zipcode()
loan = Loan()

@model_contract(
    name="credit_scoring",
    description="A model that does credit scoring",
    features=[
        credit.credit_card_due,
        credit.mortgage_due,
        credit.student_loan_due,
        credit.vehicle_loan_due,
        credit.hard_pulls,
        credit.missed_payments_1y,
        credit.missed_payments_2y,
        credit.missed_payments_6m,
        credit.bankruptcies,

        zipcode.city,
        zipcode.state,
        zipcode.is_primary_location,
        zipcode.tax_returns_filed,
        zipcode.total_wages,

        loan.person_age,
        loan.person_income,
        loan.person_emp_length,
        loan.person_home_ownership_ordinal,
        loan.loan_amount,
        loan.loan_int_rate,
        loan.loan_intent_ordinal,
    ]
)
class CreditScoring:
    was_granted_loan = loan.loan_status.as_classification_target()
```
## Training a model

To train a model, we can easily load a training data set with the following few lines.
```python
from aligned import FileSource

store = await FileSource.json_at("features.json").feature_store()
entities = FileSource.parquet_at("training_entities.parquet")

training_data = await store.model("credit_scoring")\
    .with_targets()\
    .features_for(entities)\
    .to_pandas()
```
Now that we have some data, training a model is easy. There is no need to define which features to use yourself; that is handled for you.

```python
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(training_data.input, training_data.target)
```
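Before serving, it is worth checking the model on held-out data. Here is a quick sketch using scikit-learn's `train_test_split` and `accuracy_score`; the tiny data set below is purely illustrative, standing in for `training_data.input` and `training_data.target`:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-in for training_data.input / training_data.target
X = [[0, 1], [1, 0], [0, 0], [1, 1], [0, 2], [2, 0], [2, 2], [1, 2]]
y = [0, 1, 0, 1, 0, 1, 1, 1]

# Hold out a quarter of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
```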
Serving a model
Furthermore, we can do something similar for serving our model. However, rather then using our bach source can we change to an online store.
This can be done with the following code.
```python
from aligned import RedisConfig

online_store = store.with_source(
    RedisConfig(env_key="REDIS_URL")
)

entities = {
    "zipcode": [...],
    "dob_ssn": [...],
    "loan_id": [...]
}

online_job = online_store.model("credit_scoring")\
    .features_for(entities)

# Ensure the columns are ordered the same way as during training
feature_columns = online_job.request_result.feature_columns

features = await online_job.to_pandas()

y = classifier.predict(
    features[feature_columns]
)
```
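Assuming the returned rows preserve the order of the entity lists (worth verifying for your store), pairing predictions back up with their loan ids is a simple zip. A small stdlib-only sketch with made-up ids and outputs:

```python
# Hypothetical entity ids and model outputs, in matching order
loan_ids = ["L-1001", "L-1002", "L-1003"]
predictions = [1, 0, 1]

results = dict(zip(loan_ids, predictions))
print(results["L-1002"])  # 0
```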