California housing tutorial (1/2)
This is part 1 of our tutorial. It explains how to prepare the data and launch AntakIA. These steps are common to most uses of AntakIA. If you already feel familiar with them, you can jump directly to the second part.
The California housing dataset
We'll use the California housing dataset, which is very well known in the data science ecosystem.
This dataset describes 20,640 block groups (i.e. groups of houses) in California using 8 variables:
| Variable | Description |
|---|---|
| MedInc | median income in block group |
| HouseAge | median house age in block group |
| AveRooms | average number of rooms per household |
| AveBedrms | average number of bedrooms per household |
| Population | block group population |
| AveOccup | average number of household members |
| Latitude | block group latitude |
| Longitude | block group longitude |
For each block group, the dataset also gives the median house value. This data comes from actual real estate transactions.
In our notebook, this dataset is stored in a pandas DataFrame named `X`.
If you type `X.head()`, you'll get:
The "medium house values" are stored in a Pandas Series named y
.
Typing `y.head()` will give you something like:
The use case
We can imagine several use cases where AntakIA could be useful:

* Let's say you're a real estate agent in California. A data scientist in your team has trained a wonderful ML model that can predict the market value of any house in the state, as long as you provide sufficient data. You're amazed and want to understand how this model works in order to gain insights into your market: what drives the price? Is there any segmentation? So you decide to use AntakIA.
* Or, you don't have such a model, but you still want an accurate understanding of your market. Then you ask a data scientist to train a model, and you use AntakIA on it.
In both cases, it's quite the same story: you have a dataset `X`, you run a supervised training on (`X`, `y`) to get a fitted model M. AntakIA will help you understand how and why M can predict house values.
Preparing the data
Launch a Jupyter server and open the notebook `california_housing.ipynb` (see the Getting started page).
Let's analyze the first cells:
```python
import pandas as pd

# Load the dataset and drop the index column saved in the CSV
df = pd.read_csv('../data/california_housing.csv').drop(['Unnamed: 0'], axis=1)
```
We start by creating a DataFrame from a local CSV file. You could also have imported this dataset from the scikit-learn package. As you'll see, AntakIA needs to compute other values (e.g. SHAP values for the data and the model). To make this tutorial quicker and more pleasant, our CSV file includes these pre-computed SHAP values.
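As a side note, here is a minimal sketch of how the raw dataset could be fetched directly from scikit-learn instead (note that this version would not include the pre-computed SHAP values shipped with our CSV file):

```python
from sklearn.datasets import fetch_california_housing

# Fetch the raw dataset as pandas objects (no pre-computed SHAP values here)
housing = fetch_california_housing(as_frame=True)
X_raw = housing.data    # the 8 variables described above
y_raw = housing.target  # the median house values
```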
```python
# Remove outliers:
df = df.loc[df['Population'] < 10000]
df = df.loc[df['AveOccup'] < 6]
df = df.loc[df['AveBedrms'] < 1.5]
df = df.loc[df['HouseAge'] < 50]

# Keep only the San Francisco area:
df = df.loc[(df['Latitude'] < 38.07) & (df['Latitude'] > 37.2)]
df = df.loc[(df['Longitude'] > -122.5) & (df['Longitude'] < -121.75)]
```
Likewise, the previous lines are not compulsory. But it turns out that restricting the dataset to the San Francisco area is a quicker way to build a good intuition of how AntakIA works.
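If you're curious, a quick optional check shows the effect of this filtering (a sketch, assuming the cells above have just been run):

```python
# Optional: see how many block groups remain after filtering,
# and check that the coordinates now span the San Francisco area
print(df.shape)
print(df[['Latitude', 'Longitude']].describe())
```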
```python
X = df.iloc[:, 0:8]              # the dataset
y = df.iloc[:, 9]                # the target variable
shap_values = df.iloc[:, 10:18]  # the SHAP values
```
Here we have extracted from our big CSV dataset: the `X` values, the `y` Series and the `shap_values` (we'll explain these values later).
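If you find positional indexing fragile, here is an equivalent sketch using column names instead, assuming the first eight columns of the CSV carry the variable names listed in the table above (check this against your own file):

```python
# Equivalent extraction by column name, assuming the CSV columns
# are named after the variables listed in the table above
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
            'Population', 'AveOccup', 'Latitude', 'Longitude']
X = df[features]  # same result as df.iloc[:, 0:8] if the column order matches
```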
```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(random_state=9)
model.fit(X, y)
```
We decided to use a GradientBoosting model and trained (or fitted) it on our data.
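By the way, now that we have a fitted model, here is a hedged sketch of how the pre-computed SHAP values shipped in our CSV file could be produced with the `shap` package (an extra dependency; AntakIA can also compute them for you):

```python
import shap

# TreeExplainer is suited to tree ensembles such as GradientBoostingRegressor
explainer = shap.TreeExplainer(model)
# One SHAP value per sample and per variable, same shape as X
computed_shap_values = explainer.shap_values(X)
```

Now that our data is prepared and our model trained, we can launch AntakIA: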
```python
from antakia.antakia import AntakIA

atk = AntakIA([X], y, model)
atk.start_gui()
```
It's that simple!
However, in this tutorial we'll use another method to launch AntakIA:
```python
from antakia.antakia import AntakIA

variables_df = pd.DataFrame(
    {'col_index': [0, 1, 2, 3, 4, 5, 6, 7],
     'descr': ['Median income', 'House age', 'Average nb rooms', 'Average nb bedrooms',
               'Population', 'Average occupancy', 'Latitude', 'Longitude'],
     'type': ['float64', 'int', 'float64', 'float64', 'int', 'float64', 'float64', 'float64'],
     'unit': ['k$', 'years', 'rooms', 'rooms', 'people', 'ratio', 'degrees', 'degrees'],
     'critical': [True, False, False, False, False, False, False, False],
     'lat': [False, False, False, False, False, False, True, False],
     'lon': [False, False, False, False, False, False, False, True]},
    index=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
)

atk = AntakIA(X, y, model, variables_df, shap_values)
atk.start_gui()
```
Two differences with this method:

- we've passed AntakIA a description of the dataset variables:
    - a description
    - the type of the variable
    - which unit is used
    - whether it is critical
    - whether it carries geographical data (latitude / longitude)
- we've also passed pre-computed SHAP values.
Now we're ready to discover AntakIA. You can go to the second part of our tutorial.