Data Exploration: Preparing a dataset
Dataset preparation by example: DataCamp
Exploring Data and understanding the features
- This session will be an interactive session
- A main "exploration scenario" is prepared, but you can propose your own ideas!
- Dataset is about fires in California, available here: California fire incidents
Dataset discovery
What is the composition of this dataset?
- Take some time to understand the dataset
- What question(s) could this dataset answer?
- What is the dataset quality?
- Analyse the features vs the potential targets
Imposed question: How to predict if a fire will be a major fire or not?
- This target is defined by the feature "MajorIncident"
- (Re)discovery of this dataset knowing that target
- Visualize: scatterplot, distributions, geomap (you may need to install the appropriate extension)
- Make a first filter, for instance to remove duplicates
Hint: to remove duplicates (using pandas)
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)       # convert the Orange table to a pandas DataFrame
df = df.drop_duplicates()          # drop exact duplicate rows
out_data = table_from_frame(df)    # convert back so Orange can use the result
About the target variable balance
What is the balance between the two labels?
- The ratio is around 3 to 1
- This is imbalanced, but not critically so
- Imbalanced datasets can hurt your algorithms, as the predicted probabilities are skewed towards the majority class
- In this case, always predicting False already gives a deceptively high accuracy (around 75% for a 3-to-1 ratio), which can mislead some algorithms and your evaluation
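Hint: to check the class balance (using pandas, in the same way as the duplicate-removal hint; "MajorIncident" is the target column of this dataset)
from Orange.data.pandas_compat import table_to_frame
df = table_to_frame(in_data)
# Share of each class of the target variable
print(df["MajorIncident"].value_counts(normalize=True))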
Imbalanced classes strategies
What to do in case of imbalanced data?
- Resampling: oversampling, undersampling
- Class weighting (weight attribution)
- Find more data :-)
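For the weighting strategy, many learners accept class weights directly. A minimal scikit-learn sketch, shown only to illustrate the idea (not the Orange workflow used in this session):
from sklearn.ensemble import RandomForestClassifier
# "balanced" weights each class inversely to its frequency in the training data
model = RandomForestClassifier(class_weight="balanced", random_state=42)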
Imbalanced classes strategies (2)
What about undersampling?
- A technique where you randomly drop rows from the majority (most frequent) class
- Example
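Hint: random undersampling with pandas (a minimal sketch; "MajorIncident" is the target column of this dataset, and its values may be encoded as the strings "True"/"False" after conversion)
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Split the minority (major fires) and majority (other fires) classes
is_major = df["MajorIncident"].astype(str) == "True"
minority = df[is_major]
majority = df[~is_major]
# Randomly keep only as many majority rows as there are minority rows
majority_sampled = majority.sample(n=len(minority), random_state=42)
out_data = table_from_frame(pd.concat([minority, majority_sampled]))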
Run your first evaluation using a random forest algorithm
- Are you happy with what you've done?
- What is the performance of this model?
- What features should finally be selected?
- What extra data could you add to this dataset to improve the score?
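Hint: a quick scripted cross-validation with scikit-learn (a sketch only; in Orange you would normally use the Random Forest and Test & Score widgets). It assumes df is your cleaned DataFrame, that the remaining feature columns are numeric, and that the target is "MajorIncident"
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = df.drop(columns=["MajorIncident"])   # features (assumed numeric after preparation)
y = df["MajorIncident"].astype(str)      # target labels
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validated accuracy; also look at AUC or F1 on imbalanced data
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())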
Dealing with missing values
3 main strategies exist if you cannot recover missing values
- Remove them
- Impute them, with different techniques (average, median, or regressors)
- Mark the row with a flag indicating there is a missing value
- For a numerical variable, we can add a column encoding the fact that the value is missing, and impute the original (missing) value
- For a categorical variable, we can create a category "missing" and replace the missing value with this new category
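Hint: flag-and-impute with pandas (a minimal sketch; the column names "AcresBurned" and "Counties" are examples assumed from the dataset)
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Numerical variable: flag the missing values, then impute them with the median
df["AcresBurned_missing"] = df["AcresBurned"].isna()
df["AcresBurned"] = df["AcresBurned"].fillna(df["AcresBurned"].median())
# Categorical variable: make "missing" an explicit category
df["Counties"] = df["Counties"].astype(object).fillna("missing")
out_data = table_from_frame(df)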
Example with Orange
Dealing with Date and time
The year should be removed from event dates to prevent overfitting: a new event will never carry a past year, so the year brings no predictive value for new events
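Hint: keep only the date components that generalize to future events (a pandas sketch; the column name "Started" is an assumption for the incident start date)
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
started = pd.to_datetime(df["Started"], errors="coerce")
# Month and day of week can generalize to new events; the year itself cannot
df["StartMonth"] = started.dt.month
df["StartDayOfWeek"] = started.dt.dayofweek
df = df.drop(columns=["Started"])
out_data = table_from_frame(df)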
Example with Orange