Data Exploration & Preparation

Session 1

Introduction to data science workflows and preparation techniques for ML projects

The Data Preparation Challenge

Why data preparation is critical for ML success

Experimental Process

Data science requires iterative experimentation with different transformations and approaches

Performance Impact

The quality of data preparation has a massive impact on model performance

Format Knowledge

Understanding data formats (JSON, XML, CSV, APIs) and their constraints is essential

Organization

Without structure, data pipelines become messy and hard to maintain

Reality check: Data preparation often takes 60-80% of a data scientist's time in real-world projects.

The Data Science Workflow

From raw data to production models

1. Collect: Raw data
2. Clean: Handle missing values
3. Engineer: Features
4. Train: Build model
5. Evaluate: Test & tune
6. Deploy: Production

Data Preparation (Steps 1-3)

  • Acquire data from various sources
  • Handle missing values, outliers, duplicates
  • Transform features for ML algorithms
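The cleaning steps above can be sketched in pandas. The toy DataFrame and its column names are made up for illustration:

```python
# Hypothetical cleaning sketch: duplicates, missing values, and outliers
# on a small made-up DataFrame (column names are illustrative only).
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 34, None, 45, 45, 200],   # None = missing, 200 = outlier
    "salary": [50000, 65000, 58000, 78000, 78000, 72000],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing with median
df = df[df["age"] <= 120]                         # drop implausible outliers

print(df.shape)
```

Each of these choices (median vs mean, drop vs cap) is exactly the kind of option you will iterate over during experimentation.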

Model Development (Steps 4-6)

  • Train models with prepared data
  • Iterate: evaluate → improve features → retrain
  • Deploy validated model to production
Non-linear process: The workflow is iterative—you'll loop back to feature engineering based on evaluation results.
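The train/evaluate part of the loop (steps 4-6) can be sketched with scikit-learn on made-up data; the features and labels below are invented for illustration:

```python
# Sketch of the iterate loop: train, evaluate, and if the score is poor,
# go back to feature engineering. Data is a tiny made-up sample.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 50000], [34, 65000], [45, 78000], [29, 52000],
     [41, 70000], [23, 48000], [38, 69000], [50, 80000]]
y = [0, 1, 1, 0, 1, 0, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)  # a low score sends you back to steps 2-3
```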

The Combinatorial Explosion

Multiple transformation options per feature = exponential complexity

Example: Single Feature Options

Scaling — StandardScaler, MinMax, Robust
Missing Values — Mean, Median, Drop, Interpolate
Outliers — Remove, Cap, Transform, Keep

Reality

  • A single feature has dozens of paths
  • Multiply by all features in your dataset
  • Execution is iterative—try, evaluate, backtrack
Solution: Use systematic workflows (Orange, sklearn pipelines) to manage complexity.
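One way sklearn pipelines tame this explosion is by encoding one fixed choice per step as a named stage, so whole transformation chains can be swapped and compared. A minimal sketch:

```python
# One concrete path through the option space, expressed as a Pipeline:
# "median" for missing values, StandardScaler for scaling. Swapping a
# stage gives another path without rewriting the workflow.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # one missing-value choice
    ("scale", StandardScaler()),                   # one scaling choice
])

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_out = pipe.fit_transform(X)
print(X_out.shape)
```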

ML-Ready Data Types

What algorithms actually need

Numerical Data

  • Integer: Counts, IDs, discrete values
  • Float: Continuous measurements, prices, coordinates

Examples: Age (int), Temperature (float), Salary (float)

Categorical Data

  • Nominal: No order (color, country)
  • Ordinal: Ordered categories (rating, size)
  • Boolean: True/False, Yes/No

Examples: Gender (nominal), Education Level (ordinal), Is_Customer (boolean)

Key insight: ML algorithms only accept these types. Everything else (text, images, dates, URLs) must be transformed into numerical or categorical features.
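These type distinctions map directly onto pandas dtypes. A small sketch with invented columns, including a date that must be turned into a numeric feature:

```python
import pandas as pd

df = pd.DataFrame({
    "age":         [25, 34, 45],                # numerical (int)
    "temperature": [36.6, 37.1, 36.9],          # numerical (float)
    "size":        ["S", "L", "M"],             # ordinal categorical
    "signup":      ["2021-01-05", "2022-03-10", "2020-07-22"],  # not ML-ready
})

# Ordinal category with an explicit order
df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L"], ordered=True)

# A raw date string must be transformed into a numerical feature
df["signup_year"] = pd.to_datetime(df["signup"]).dt.year

print(df.dtypes)
```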

The DataFrame: ML's Data Format

Algorithms want "rectangle" data

DataFrame: A 2D table where rows are observations and columns are features, with one column as the target variable.

Rows = Observations

Each row represents a single data point or instance (e.g., one customer, one transaction)

Columns = Features

Each column represents a variable or attribute (e.g., age, price, category)

Target Column

One special column contains the label or value you want to predict

Feature_1 | Feature_2 | Feature_3 | Target
25        | Male      | 50000     | Purchased
34        | Female    | 65000     | Not
45        | Male      | 78000     | Purchased
...       | ...       | ...       | ...
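The example table above, rebuilt as a pandas DataFrame with the usual feature/target split:

```python
import pandas as pd

# Rows = observations, columns = features, one target column.
df = pd.DataFrame({
    "age":    [25, 34, 45],
    "gender": ["Male", "Female", "Male"],
    "salary": [50000, 65000, 78000],
    "target": ["Purchased", "Not", "Purchased"],
})

X = df.drop(columns="target")  # feature matrix
y = df["target"]               # target vector
print(X.shape, y.shape)
```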

Why This Course?

Real-world data comes in many forms

The Reality

Data preparation requires software engineering skills, not just statistics

Data scientists and AI engineers have different focuses, but both need to handle diverse data sources

Data Sources We'll Cover

  • JSON & XML files
  • CSV & ARFF formats
  • SQL Databases
  • REST APIs
  • Web scraping

1. Extract: Get data from various sources (APIs, databases, files)
2. Transform: Convert to ML-ready format (numerical/categorical)
3. Load: Create a clean DataFrame for model training
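The Extract, Transform, Load steps can be sketched end to end. An in-memory CSV string stands in for a real source (file, database, or API), and the columns are invented:

```python
import io
import pandas as pd

# Extract: read from a source (here, an in-memory CSV stand-in)
raw = io.StringIO("age,city,purchased\n25,Paris,yes\n34,,no\n45,Lyon,yes\n")
df = pd.read_csv(raw)

# Transform: fill the missing category, map the target to 0/1
df["city"] = df["city"].fillna("unknown")
df["purchased"] = df["purchased"].map({"yes": 1, "no": 0})

# Load: a clean DataFrame ready for model training
print(df)
```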

Tools & Methodology

Choosing the right approach for your workflow

Jupyter Notebooks

Pros: Interactive, visual feedback, great for exploration

Cons: Cell execution order issues, hard to version control properly

Python Scripts

Pros: Reproducible, easy to version, can be automated

Cons: Less interactive, slower feedback loop

Orange (Visual)

Pros: No coding required, drag-and-drop workflow, great for learning

Cons: Limited customization, not production-ready

Best practice: Start with Orange or Jupyter for exploration, then transition to Python scripts for production pipelines.

Orange: Visual Data Mining

Our tool for hands-on learning

What is Orange?

  • Open-source visual programming tool for data analysis
  • Drag-and-drop widgets for loading, transforming, and modeling
  • Perfect for understanding ML workflows without code
  • Outputs Python code you can learn from

Installation

pip install orange3

Download

orangedatamining.com

Available for Windows, macOS, and Linux

Pro tip: Orange is excellent for rapid prototyping. Build your workflow visually, then export to Python for production use.

Practical Exercise: Titanic Dataset (1/2)

Loading and exploring the data in Orange

1. File: Load CSV
2. Data Table: View raw data
3. Feature Statistics: Check types
4. Select Columns: Set target
5. Impute: Fix missing values

Steps 1-3: Load & Inspect

  • File: Load titanic_train.csv
  • Data Table: Browse rows and columns
  • Feature Statistics: See missing values in Age, Cabin

Steps 4-5: Prepare

  • Select Columns: Set "Survived" as Target
  • Impute: Replace missing Age with average
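What the Impute widget does for Age can be sketched in pandas on a toy excerpt (the real exercise uses titanic_train.csv):

```python
# Mean imputation of Age, as in step 5, on a made-up 4-row excerpt.
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age":      [22.0, None, 26.0, 35.0],
})

df["Age"] = df["Age"].fillna(df["Age"].mean())  # replace missing with average
print(df["Age"].tolist())
```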

Practical Exercise: Titanic Dataset (2/2)

Visualization and pattern discovery

6. Distributions: By class
7. Distributions: By gender
8. Box Plot: Age vs survival
9. Scatter Plot: Correlations
10. Tree Viewer: Build model

Steps 6-8: Visualize

  • Distributions: Compare survival by Pclass, Sex
  • Box Plot: Age distribution for survivors vs non-survivors

Steps 9-10: Analyze

  • Scatter Plot: Age vs Fare, colored by Survived
  • Tree Viewer: Connect Tree widget to see decision rules
Expected finding: Women and 1st class passengers had much higher survival rates. Children also survived more often.
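The comparison the visualizations show can also be computed directly with a groupby. The five rows below are invented, not real Titanic data:

```python
# Survival rate by gender, the pattern behind the "expected finding",
# computed on a tiny made-up excerpt.
import pandas as pd

df = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male"],
    "Pclass":   [1, 3, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 0],
})

rate_by_sex = df.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```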
