Data Exploration & Preparation

Session 1

Introduction to data science workflows and preparation techniques for ML projects

The Data Preparation Challenge

Why data preparation is critical for ML success

Experimental Process

Data science requires iterative experimentation with different transformations and approaches

Performance Impact

The quality of data preparation has a massive impact on model performance

Format Knowledge

Understanding data formats (JSON, XML, CSV, APIs) and their constraints is essential

Organization

Without structure, data pipelines become messy and hard to maintain

Reality check: Data preparation often takes 60-80% of a data scientist's time in real-world projects.

The Data Science Workflow

From raw data to production models

1. Collect: Raw data
2. Clean: Handle missing values
3. Engineer: Features
4. Train: Build model
5. Evaluate: Test & tune
6. Deploy: Production

Data Preparation (Steps 1-3)

  • Acquire data from various sources
  • Handle missing values, outliers, duplicates
  • Transform features for ML algorithms
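The cleaning steps above can be sketched in pandas. The toy DataFrame and its column names are made up for illustration:

```python
# Hypothetical cleaning sketch: duplicates, missing values, and outliers
# on a small made-up DataFrame (column names are illustrative only).
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 34, None, 45, 45, 200],   # None = missing, 200 = outlier
    "salary": [50000, 65000, 58000, 78000, 78000, 72000],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing with median
df = df[df["age"] <= 120]                         # drop implausible outliers

print(df.shape)
```

Each of these choices (median vs mean, drop vs cap) is exactly the kind of option you will iterate over during experimentation.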

Model Development (Steps 4-6)

  • Train models with prepared data
  • Iterate: evaluate → improve features → retrain
  • Deploy validated model to production
Non-linear process: The workflow is iterative—you'll loop back to feature engineering based on evaluation results.
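The train/evaluate part of the loop (steps 4-6) can be sketched with scikit-learn on made-up data; the features and labels below are invented for illustration:

```python
# Sketch of the iterate loop: train, evaluate, and if the score is poor,
# go back to feature engineering. Data is a tiny made-up sample.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 50000], [34, 65000], [45, 78000], [29, 52000],
     [41, 70000], [23, 48000], [38, 69000], [50, 80000]]
y = [0, 1, 1, 0, 1, 0, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)  # a low score sends you back to steps 2-3
```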

The Combinatorial Explosion

Multiple transformation options per feature = exponential complexity

Example: Single Feature Options

Scaling — StandardScaler, MinMax, Robust
Missing Values — Mean, Median, Drop, Interpolate
Outliers — Remove, Cap, Transform, Keep

Reality

  • A single feature has dozens of paths
  • Multiply by all features in your dataset
  • Execution is iterative—try, evaluate, backtrack
Solution: Use systematic workflows (Orange, sklearn pipelines) to manage complexity.
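One way sklearn pipelines tame this explosion is by encoding one fixed choice per step as a named stage, so whole transformation chains can be swapped and compared. A minimal sketch:

```python
# One concrete path through the option space, expressed as a Pipeline:
# "median" for missing values, StandardScaler for scaling. Swapping a
# stage gives another path without rewriting the workflow.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # one missing-value choice
    ("scale", StandardScaler()),                   # one scaling choice
])

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_out = pipe.fit_transform(X)
print(X_out.shape)
```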

ML-Ready Data Types

What algorithms actually need

Numerical Data

  • Integer: Counts, IDs, discrete values
  • Float: Continuous measurements, prices, coordinates

Examples: Age (int), Temperature (float), Salary (float)

Categorical Data

  • Nominal: No order (color, country)
  • Ordinal: Ordered categories (rating, size)
  • Boolean: True/False, Yes/No

Examples: Gender (nominal), Education Level (ordinal), Is_Customer (boolean)

Key insight: ML algorithms only accept these types. Everything else (text, images, dates, URLs) must be transformed into numerical or categorical features.
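These type distinctions map directly onto pandas dtypes. A small sketch with invented columns, including a date that must be turned into a numeric feature:

```python
import pandas as pd

df = pd.DataFrame({
    "age":         [25, 34, 45],                # numerical (int)
    "temperature": [36.6, 37.1, 36.9],          # numerical (float)
    "size":        ["S", "L", "M"],             # ordinal categorical
    "signup":      ["2021-01-05", "2022-03-10", "2020-07-22"],  # not ML-ready
})

# Ordinal category with an explicit order
df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L"], ordered=True)

# A raw date string must be transformed into a numerical feature
df["signup_year"] = pd.to_datetime(df["signup"]).dt.year

print(df.dtypes)
```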

The DataFrame: ML's Data Format

Algorithms want "rectangle" data

DataFrame: A 2D table where rows are observations and columns are features, with one column as the target variable.

Rows = Observations

Each row represents a single data point or instance (e.g., one customer, one transaction)

Columns = Features

Each column represents a variable or attribute (e.g., age, price, category)

Target Column

One special column contains the label or value you want to predict

Feature_1 | Feature_2 | Feature_3 | Target
25        | Male      | 50000     | Purchased
34        | Female    | 65000     | Not
45        | Male      | 78000     | Purchased
...       | ...       | ...       | ...
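The example table above, rebuilt as a pandas DataFrame with the usual feature/target split:

```python
import pandas as pd

# Rows = observations, columns = features, one target column.
df = pd.DataFrame({
    "age":    [25, 34, 45],
    "gender": ["Male", "Female", "Male"],
    "salary": [50000, 65000, 78000],
    "target": ["Purchased", "Not", "Purchased"],
})

X = df.drop(columns="target")  # feature matrix
y = df["target"]               # target vector
print(X.shape, y.shape)
```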

Why This Course?

Real-world data comes in many forms

The Reality

Data preparation requires software engineering skills, not just statistics

Data scientists and AI engineers have different focuses, but both need to handle diverse data sources

Data Sources We'll Cover

  • JSON & XML files
  • CSV & ARFF formats
  • SQL Databases
  • REST APIs
  • Web scraping

1. Extract: Get data from various sources (APIs, databases, files)
2. Transform: Convert to ML-ready format (numerical/categorical)
3. Load: Create a clean DataFrame for model training
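The Extract, Transform, Load steps can be sketched end to end. An in-memory CSV string stands in for a real source (file, database, or API), and the columns are invented:

```python
import io
import pandas as pd

# Extract: read from a source (here, an in-memory CSV stand-in)
raw = io.StringIO("age,city,purchased\n25,Paris,yes\n34,,no\n45,Lyon,yes\n")
df = pd.read_csv(raw)

# Transform: fill the missing category, map the target to 0/1
df["city"] = df["city"].fillna("unknown")
df["purchased"] = df["purchased"].map({"yes": 1, "no": 0})

# Load: a clean DataFrame ready for model training
print(df)
```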

Tools & Methodology

Choosing the right approach for your workflow

Jupyter Notebooks

Pros: Interactive, visual feedback, great for exploration

Cons: Cell execution order issues, hard to version control properly

Python Scripts

Pros: Reproducible, easy to version, can be automated

Cons: Less interactive, slower feedback loop

Orange (Visual)

Pros: No coding required, drag-and-drop workflow, great for learning

Cons: Limited customization, not production-ready

Best practice: Start with Orange or Jupyter for exploration, then transition to Python scripts for production pipelines.

Orange: Visual Data Mining

Our tool for hands-on learning

What is Orange?

  • Open-source visual programming tool for data analysis
  • Drag-and-drop widgets for loading, transforming, and modeling
  • Perfect for understanding ML workflows without code
  • Outputs Python code you can learn from

Installation

pip install orange3

Download

orangedatamining.com

Available for Windows, macOS, and Linux

Pro tip: Orange is excellent for rapid prototyping. Build your workflow visually, then export to Python for production use.

Practical Exercise: Titanic Dataset (1/2)

Loading and exploring the data in Orange

1. File: Load CSV
2. Data Table: View raw data
3. Feature Statistics: Check types
4. Select Columns: Set target
5. Impute: Fix missing values

Steps 1-3: Load & Inspect

  • File: Load titanic_train.csv
  • Data Table: Browse rows and columns
  • Feature Statistics: See missing values in Age, Cabin

Steps 4-5: Prepare

  • Select Columns: Set "Survived" as Target
  • Impute: Replace missing Age with average
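What the Impute widget does for Age can be sketched in pandas on a toy excerpt (the real exercise uses titanic_train.csv):

```python
# Mean imputation of Age, as in step 5, on a made-up 4-row excerpt.
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age":      [22.0, None, 26.0, 35.0],
})

df["Age"] = df["Age"].fillna(df["Age"].mean())  # replace missing with average
print(df["Age"].tolist())
```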

Practical Exercise: Titanic Dataset (2/2)

Visualization and pattern discovery

6. Distributions: By class
7. Distributions: By gender
8. Box Plot: Age vs survival
9. Scatter Plot: Correlations
10. Tree Viewer: Build model

Steps 6-8: Visualize

  • Distributions: Compare survival by Pclass, Sex
  • Box Plot: Age distribution for survivors vs non-survivors

Steps 9-10: Analyze

  • Scatter Plot: Age vs Fare, colored by Survived
  • Tree Viewer: Connect Tree widget to see decision rules
Expected finding: Women and 1st class passengers had much higher survival rates. Children also survived more often.
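The comparison the visualizations show can also be computed directly with a groupby. The five rows below are invented, not real Titanic data:

```python
# Survival rate by gender, the pattern behind the "expected finding",
# computed on a tiny made-up excerpt.
import pandas as pd

df = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male"],
    "Pclass":   [1, 3, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 0],
})

rate_by_sex = df.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```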
