AI & Data

Dataset Preparation by Example

Lecture 2

Hands-on data exploration and feature engineering using real-world datasets and Orange data mining

A. Dataset preparation by example: DataCamp

Exploring Data and understanding the features

Dataset discovery

What is the composition of this dataset?

Imposed question: How to predict if a fire will be a major fire or not?

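Hint: to remove duplicates (using pandas in an Orange Python Script widget)

from Orange.data.pandas_compat import table_from_frame, table_to_frame

# in_data is the Orange Table received by the Python Script widget
df = table_to_frame(in_data)
df = df.drop_duplicates()

# out_data is the Table the widget sends to its output
out_data = table_from_frame(df)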

About the target variable balance

What is the balance between the two labels?
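One way to answer this question outside Orange is to count the labels with pandas. This is a minimal sketch on toy data; the column name `major_incident` and the values are assumptions made for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data; "major_incident" is an assumed name for the target column.
df = pd.DataFrame({
    "acres_burned": [10, 2500, 35, 12, 8000, 40, 22, 15],
    "major_incident": [False, True, False, False, True, False, False, False],
})

# Absolute number of rows per label.
counts = df["major_incident"].value_counts()

# Share of each label (fractions that sum to 1).
shares = df["major_incident"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

If one label holds only a small share of the rows, plain accuracy becomes a misleading metric, which is why the balance is worth checking before any modelling.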

Imbalanced classes strategies

What to do in case of imbalanced data?

Imbalanced classes strategies (2)

What about undersampling?
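Random undersampling keeps every row of the minority class and draws an equally sized random subset of the majority class. The sketch below shows one possible pandas implementation on toy data; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; "major_incident" is an assumed name for the target column.
df = pd.DataFrame({
    "acres_burned": [10, 2500, 35, 12, 8000, 40, 22, 15],
    "major_incident": [0, 1, 0, 0, 1, 0, 0, 0],
})
target = "major_incident"

# Size of the smallest class: every class is reduced to this many rows.
n_minority = df[target].value_counts().min()

# Draw n_minority random rows from each class, then shuffle the result.
balanced = (
    df.groupby(target)
      .sample(n=n_minority, random_state=0)
      .sample(frac=1, random_state=0)
      .reset_index(drop=True)
)

print(balanced[target].value_counts())
```

The price of undersampling is discarded data: here 4 of the 6 majority rows are thrown away, which can hurt when the dataset is already small.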

Run your first evaluation using a random forest algorithm
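For readers who want to reproduce such a first evaluation in code, here is a minimal scikit-learn sketch. It uses synthetic, imbalanced data from `make_classification` as a stand-in for the fire dataset, and the choice of 5 folds and the F1 score is an assumption, not something the slides prescribe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: about 80% negative and 20% positive labels.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Stratified folds keep the label proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# F1 on the positive class is more informative than accuracy on imbalanced data.
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print("F1 per fold:", scores.round(3))
print("Mean F1:", scores.mean().round(3))
```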

Dealing with missing values

Three main strategies exist if you cannot recover the missing values
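The strategies are not spelled out in this text. The sketch below assumes the three commonly used options (dropping rows, dropping columns, imputing values) and shows each with pandas on toy data; the column names and the 50% threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in every column.
df = pd.DataFrame({
    "acres_burned": [10.0, np.nan, 35.0, 12.0],
    "crews_involved": [np.nan, np.nan, np.nan, 4.0],
    "county": ["Napa", "Butte", None, "Napa"],
})

# Option 1: drop the rows where an important column is missing.
rows_dropped = df.dropna(subset=["acres_burned"])

# Option 2: drop the columns that are mostly empty
# (keep a column only if at least half of its values are present).
min_present = int(0.5 * len(df))
cols_dropped = df.dropna(axis=1, thresh=min_present)

# Option 3: impute, e.g. median for numeric and most frequent value for categorical.
imputed = df.copy()
imputed["acres_burned"] = imputed["acres_burned"].fillna(imputed["acres_burned"].median())
imputed["county"] = imputed["county"].fillna(imputed["county"].mode()[0])

print(rows_dropped.shape, list(cols_dropped.columns))
print(imputed)
```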

Example with Orange

Dealing with Date and time

Remove the year from the event date to prevent overfitting: new events will never carry a past year, so the year gives the model nothing it can reuse on future data
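A possible pandas version of this idea keeps only the recurring parts of the timestamp (month, day of week, hour) and discards the year along with the raw date. The column name `started` and the sample timestamps are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; "started" is an assumed name for the event timestamp.
df = pd.DataFrame({
    "started": ["2018-07-23 14:05", "2019-10-10 03:30", "2020-08-16 21:45"],
})

started = pd.to_datetime(df["started"])

# Keep the components that repeat in the future, so the model can generalise.
df["month"] = started.dt.month
df["day_of_week"] = started.dt.dayofweek  # Monday = 0, Sunday = 6
df["hour"] = started.dt.hour

# Drop the raw timestamp; the year is deliberately not extracted.
df = df.drop(columns=["started"])

print(df)
```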

Example with Orange

Adding external data

Append the following dataset: daily weather in California from 1998 to 2020
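Appending weather data amounts to joining each event with the weather observed on the same day. The pandas sketch below assumes the join keys are the date and the county; all column names and values are hypothetical, and the real weather dataset may be keyed differently (for example by weather station).

```python
import pandas as pd

# Hypothetical fire events; the last one falls outside the weather coverage.
fires = pd.DataFrame({
    "started": ["2018-07-23 14:05", "2019-10-10 03:30", "2021-01-05 09:00"],
    "county": ["Shasta", "Sonoma", "Napa"],
})

# Hypothetical daily weather, one row per (date, county).
weather = pd.DataFrame({
    "date": ["2018-07-23", "2019-10-10"],
    "county": ["Shasta", "Sonoma"],
    "max_temp_c": [41.0, 27.5],
    "wind_kmh": [35.0, 60.0],
})

# Align both sides on a date without time of day.
fires["date"] = pd.to_datetime(fires["started"]).dt.normalize()
weather["date"] = pd.to_datetime(weather["date"])

# Left join keeps every fire; validate guards against duplicated weather rows.
merged = fires.merge(
    weather, on=["date", "county"], how="left", validate="many_to_one"
)

print(merged)
```

Events without matching weather end up with missing values in the new columns, so this step feeds back into the missing-value handling.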

Implementation with Orange
