Data Exploration: Preparing a dataset
Dataset preparation by example: DataCamp
Exploring Data and understanding the features
- This session will be an interactive session
- A main "exploration scenario" is prepared, but you can propose your own ideas!
- Dataset is about fires in California, available here: California fire incidents
Dataset discovery
What is the composition of this dataset?
- Take some time to understand the dataset
- What question(s) could this dataset answer?
- What is the dataset quality?
- Analyse the features vs the potential targets
Imposed question: How to predict if a fire will be a major fire or not?
- This target is defined by the feature "MajorIncident"
- (Re)discovery of this dataset knowing that target
- Visualize: scatterplot, distributions, geomap (you may need to install the appropriate extension)
- Make a first filter, for instance to remove duplicates
Hint: to remove duplicates (using pandas)
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)       # convert the Orange table to a pandas DataFrame
df = df.drop_duplicates()          # drop exact duplicate rows
out_data = table_from_frame(df)    # convert back so Orange can use the result
About the target variable balance
What is the balance between the two labels?
- The ratio is around 3 to 1
- This is imbalanced, but not critically so
- Imbalanced datasets can hurt your algorithms, as the predicted probabilities are skewed towards the majority class
- In this case, always predicting False already gives a deceptively high accuracy (around 75% for a 3-to-1 ratio), which can mislead some algorithms and your evaluation
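Hint: to check the class balance (using pandas, in the same way as the duplicate-removal hint; "MajorIncident" is the target column of this dataset)
from Orange.data.pandas_compat import table_to_frame
df = table_to_frame(in_data)
# Share of each class of the target variable
print(df["MajorIncident"].value_counts(normalize=True))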
Imbalanced classes strategies
What to do in case of imbalanced data?
- Resampling: oversampling, undersampling
- Class weighting (weight attribution)
- Find more data :-)
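For the weighting strategy, many learners accept class weights directly. A minimal scikit-learn sketch, shown only to illustrate the idea (not the Orange workflow used in this session):
from sklearn.ensemble import RandomForestClassifier
# "balanced" weights each class inversely to its frequency in the training data
model = RandomForestClassifier(class_weight="balanced", random_state=42)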
Imbalanced classes strategies (2)
What about undersampling?
- A technique where you randomly drop rows from the majority (most frequent) class
- Example
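Hint: random undersampling with pandas (a minimal sketch; "MajorIncident" is the target column of this dataset, and its values may be encoded as the strings "True"/"False" after conversion)
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Split the minority (major fires) and majority (other fires) classes
is_major = df["MajorIncident"].astype(str) == "True"
minority = df[is_major]
majority = df[~is_major]
# Randomly keep only as many majority rows as there are minority rows
majority_sampled = majority.sample(n=len(minority), random_state=42)
out_data = table_from_frame(pd.concat([minority, majority_sampled]))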
Run your first evaluation using a random forest algorithm
- Are you happy with what you've done?
- What is the performance of this model?
- What features should finally be selected?
- What extra data could you add to this dataset to improve the score?
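Hint: a quick scripted cross-validation with scikit-learn (a sketch only; in Orange you would normally use the Random Forest and Test & Score widgets). It assumes df is your cleaned DataFrame, that the remaining feature columns are numeric, and that the target is "MajorIncident"
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = df.drop(columns=["MajorIncident"])   # features (assumed numeric after preparation)
y = df["MajorIncident"].astype(str)      # target labels
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validated accuracy; also look at AUC or F1 on imbalanced data
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())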
Dealing with missing values
3 main strategies exist if you cannot recover missing values
- Remove them
- Impute them, with different techniques (average, median, or regressors)
- Mark the row with a flag indicating there is a missing value
- For a numerical variable, we can add a column encoding the fact that the value is missing, and impute the original (missing) value
- For a categorical variable, we can create a category "missing" and replace the missing value with this new category
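Hint: flag-and-impute with pandas (a minimal sketch; the column names "AcresBurned" and "Counties" are examples assumed from the dataset)
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Numerical variable: flag the missing values, then impute them with the median
df["AcresBurned_missing"] = df["AcresBurned"].isna()
df["AcresBurned"] = df["AcresBurned"].fillna(df["AcresBurned"].median())
# Categorical variable: make "missing" an explicit category
df["Counties"] = df["Counties"].astype(object).fillna("missing")
out_data = table_from_frame(df)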
Example with Orange
Dealing with Date and time
The year should be removed from event dates to prevent overfitting: a new event will never carry a past year, so the year brings no predictive value for new events
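Hint: keep only the date components that generalize to future events (a pandas sketch; the column name "Started" is an assumption for the incident start date)
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
started = pd.to_datetime(df["Started"], errors="coerce")
# Month and day of week can generalize to new events; the year itself cannot
df["StartMonth"] = started.dt.month
df["StartDayOfWeek"] = started.dt.dayofweek
df = df.drop(columns=["Started"])
out_data = table_from_frame(df)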
Example with Orange