& Preparation
Session 1
Introduction to data science workflows and preparation techniques for ML projects
2026 WayUp
Why data preparation is critical for ML success
Data science requires iterative experimentation with different transformations and approaches
The quality of data preparation has a major impact on model performance
Understanding data formats (JSON, XML, CSV), data sources such as APIs, and their constraints is essential
Without structure, data pipelines become messy and hard to maintain
From raw data to production models
Multiple transformation options per feature = exponential complexity
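As a rough illustration of why the number of candidate pipelines grows so quickly, the sketch below counts every combination of one transformation per feature. The feature names and transformation options are hypothetical and chosen only for the arithmetic.

```python
from itertools import product

# Hypothetical transformation options per feature (illustrative only).
options = {
    "age": ["raw", "standardize", "bin"],
    "salary": ["raw", "log", "standardize"],
    "education": ["ordinal", "one_hot"],
}

# A candidate pipeline picks exactly one option for each feature.
pipelines = list(product(*options.values()))
print(len(pipelines))  # 3 * 3 * 2 = 18


def pipeline_count(k: int, n: int) -> int:
    """Number of pipelines with k options for each of n features."""
    return k ** n


print(pipeline_count(3, 10))  # 59049
```

Three features already give 18 combinations; ten features with three options each give 59,049, which is why trying everything by hand stops being practical.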
What algorithms actually need
Numerical examples: Age (int), Temperature (float), Salary (float)
Categorical examples: Gender (nominal), Education Level (ordinal), Is_Customer (boolean)
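One possible way to represent these feature types in pandas is sketched below. The values are made up, and the ordering of the education levels is an assumption for illustration.

```python
import pandas as pd

# Made-up sample values for the feature types listed above.
df = pd.DataFrame({
    "Age": [25, 34, 45],                      # numerical (int)
    "Temperature": [36.6, 37.1, 36.9],        # numerical (float)
    "Gender": ["Male", "Female", "Male"],     # categorical (nominal)
    "Education_Level": ["Bachelor", "High School", "Master"],  # ordinal
    "Is_Customer": [True, False, True],       # boolean
})

# Nominal: categories have no inherent order.
df["Gender"] = df["Gender"].astype("category")

# Ordinal: categories have an order (assumed here, lowest to highest).
levels = ["High School", "Bachelor", "Master"]
df["Education_Level"] = pd.Categorical(
    df["Education_Level"], categories=levels, ordered=True
)

print(df.dtypes)

# Ordinal features can map to integer codes that respect the order.
education_codes = df["Education_Level"].cat.codes

# Nominal features are usually one-hot encoded instead.
gender_dummies = pd.get_dummies(df["Gender"], prefix="Gender")
```

Integer codes suit ordinal features because the order carries meaning; one-hot columns avoid implying an order that a nominal feature does not have.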
Algorithms want "rectangle" data
DataFrame: A 2D table where rows are observations and columns are features, with one column as the target variable.
Each row represents a single data point or instance (e.g., one customer, one transaction)
Each column represents a variable or attribute (e.g., age, price, category)
One special column contains the label or value you want to predict
| Feature_1 | Feature_2 | Feature_3 | Target |
|---|---|---|---|
| 25 | Male | 50000 | Purchased |
| 34 | Female | 65000 | Not |
| 45 | Male | 78000 | Purchased |
| ... | ... | ... | ... |
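The table above can be sketched as a pandas DataFrame and split into features and target. The column names and values are copied from the table; the variable names `X` and `y` are just a common convention.

```python
import pandas as pd

# The three example rows from the table above.
df = pd.DataFrame({
    "Feature_1": [25, 34, 45],
    "Feature_2": ["Male", "Female", "Male"],
    "Feature_3": [50000, 65000, 78000],
    "Target": ["Purchased", "Not", "Purchased"],
})

# Features: every column except the target.
X = df.drop(columns="Target")

# Target: the one column we want to predict.
y = df["Target"]

print(X.shape, y.shape)  # (3, 3) (3,)
```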
Real-world data comes in many forms
Data preparation requires software engineering skills, not just statistics
Data scientists and AI engineers have different focuses, but both need to handle diverse data sources
Get data from various sources (APIs, databases, files)
Convert to ML-ready format (numerical/categorical)
Create clean DataFrame for model training
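The three steps above could look roughly like this in pandas. To keep the sketch self-contained, an inline JSON string stands in for an API response and an inline CSV string stands in for a file; all field names and values are hypothetical.

```python
import io
import json

import pandas as pd

# Step 1: get data from various sources.
# Stand-ins for an API response (JSON) and a file on disk (CSV).
api_payload = (
    '[{"id": 1, "age": 25, "gender": "Male"},'
    ' {"id": 2, "age": 34, "gender": "Female"}]'
)
csv_text = "id,salary,purchased\n1,50000,yes\n2,65000,no\n"

customers = pd.DataFrame(json.loads(api_payload))
orders = pd.read_csv(io.StringIO(csv_text))

# Step 2: convert to an ML-ready format (numerical/categorical).
raw = customers.merge(orders, on="id")
raw["purchased"] = raw["purchased"].map({"yes": 1, "no": 0})
raw["gender"] = raw["gender"].astype("category")

# Step 3: create a clean DataFrame for model training.
# The id column identifies rows but is not a feature, so it is dropped.
train = raw.drop(columns="id").dropna().reset_index(drop=True)
print(train)
```

In a real project the first step would call the API or read the file directly; the conversion and cleaning steps stay the same.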
Choosing the right approach for your workflow
Notebooks
Pros: Interactive, visual feedback, great for exploration
Cons: Cell execution order issues, hard to version control properly
Scripts
Pros: Reproducible, easy to version, can be automated
Cons: Less interactive, slower feedback loop
Visual tools
Pros: No coding required, drag-and-drop workflow, great for learning
Cons: Limited customization, not production-ready
Our tool for hands-on learning
pip install orange3
Loading and exploring the data in Orange
titanic_train.csv
Visualization and pattern discovery