AI / Machine Learning • April 20, 2025 • Aditya Rawas

Data in AI and Machine Learning: What It Is and How to Use It

You may have heard the phrase “data is the new oil” in the context of AI. But what is data in machine learning, how do you collect it, how do you prepare it, and what common mistakes kill AI projects before they start?

This guide covers everything you need to know about data in AI with real-world examples.

What is Data in AI?

In machine learning, data is the information we collect to teach a machine to make predictions or decisions. It is often represented as a table, where each row is an example (also called a sample or observation) and each column is a feature or a label.

The model’s job is to learn the mapping from the input features (A) to the output label (B).

Example 1: Predicting House Prices

Size (sq ft)	Bedrooms	Location	Price ($)
1,200	2	Suburb	300,000
1,800	3	City	500,000
2,500	4	Suburb	620,000

Input A (Features): Size, bedrooms, location
Output B (Label): House price

The goal is to train a model that learns the A→B mapping so it can predict prices for houses it hasn’t seen before.
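
As a rough illustration of learning an A→B mapping, the sketch below fits a straight line from size to price using only the three rows in the table above. It ignores bedrooms and location, and three examples are far too few for a real model. The helper name `fit_line` is made up for this example.

```python
# Toy data from the table above: size in sq ft -> price in dollars.
sizes = [1200, 1800, 2500]
prices = [300_000, 500_000, 620_000]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)

# Predict the price of a house the model has not seen (2,000 sq ft).
predicted_price = slope * 2000 + intercept
print(f"Estimated price for 2,000 sq ft: ${predicted_price:,.0f}")
```

In practice you would use a library model and all available features, but the idea is the same: learn parameters from known examples, then apply them to new inputs.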

Example 2: Recognizing Cats in Images

Image	Is Cat?
img1.jpg	Yes
img2.jpg	No
img3.jpg	Yes

Input A: Image pixels
Output B: Cat / Not Cat (binary classification)

This is unstructured data — the model must learn to extract features (edges, shapes, textures) directly from raw pixel values.
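
To make "raw pixel values" concrete, here is a minimal sketch that treats a tiny grayscale image as a grid of numbers and computes one hand-written edge feature. The 4x4 image and the `horizontal_gradient` helper are invented for illustration. A real vision model learns filters like this from data instead of having them coded by hand.

```python
# A made-up 4x4 grayscale image: 0 = black, 255 = white.
# The left half is dark and the right half is bright, so there is
# one vertical edge down the middle.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]


def horizontal_gradient(img):
    """Absolute difference between each pixel and its right-hand neighbour."""
    return [
        [abs(row[c + 1] - row[c]) for c in range(len(row) - 1)]
        for row in img
    ]


edges = horizontal_gradient(image)
for row in edges:
    print(row)  # large values mark where brightness changes sharply
```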

Structured vs Unstructured Data

Type	Examples	Common Use Cases
Structured	Tables, CSVs, databases	Price prediction, fraud detection, sales forecasting
Unstructured	Images, audio, text, video	Object detection, speech recognition, text generation
Semi-structured	JSON, XML, logs	Web scraping, API data, event streams

Most generative AI (LLMs, image generators) works with unstructured data. Traditional ML for tabular predictions works with structured data.
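
Semi-structured data usually has to be flattened into rows and columns before a tabular model can use it. The sketch below shows one way to do that with the standard library. The event record and the `flatten` helper are made up for this example (pandas offers `json_normalize` for the same job).

```python
import json

# A made-up event record, as it might arrive from an API or log stream.
raw = '{"user": {"id": 123, "country": "IN"}, "event": "purchase", "price": 99}'


def flatten(record, prefix=""):
    """Turn nested dictionaries into a single flat dict of column -> value."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


row = flatten(json.loads(raw))
print(row)
```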

How Do You Get Training Data?

1. Manual Labeling

Assign labels to data by hand — either internally or through services like Amazon Mechanical Turk or Scale AI.

Image	Label
img101.jpg	Cat
img102.jpg	Not Cat

Pros: High quality, precise control. Cons: Expensive and slow at scale. Labeling 10,000 images can take weeks.

2. Behavioral Observation

Collect data passively from user or system behavior — no manual labeling needed.

E-commerce example:

User ID	Time	Price	Purchased?
123	10:01 AM	$99	Yes
456	10:05 AM	$149	No

Factory monitoring example:

Machine ID	Temperature	Pressure	Failed?
M001	80°C	30 PSI	No
M002	120°C	45 PSI	Yes

Behavioral data enables predictive maintenance: catching failures before they happen, which can be worth millions in manufacturing.
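
The e-commerce table above can be derived from raw logs without anyone labeling rows by hand. Here is a minimal sketch, assuming a hypothetical event log of (user_id, event_type, price) tuples, in which the label comes from what users actually did.

```python
# Hypothetical raw event log: each entry is (user_id, event_type, price).
events = [
    (123, "view", 99),
    (123, "purchase", 99),
    (456, "view", 149),
]

# Users who bought something become the positive examples.
buyers = {user_id for user_id, event_type, _ in events if event_type == "purchase"}

# One training row per product view, labeled by whether that user purchased.
rows = [
    {"user_id": user_id, "price": price, "purchased": user_id in buyers}
    for user_id, event_type, price in events
    if event_type == "view"
]

for row in rows:
    print(row)
```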

3. Downloaded or Partner-Sourced Data

Public datasets:

Kaggle — competition datasets across many domains
UCI ML Repository — classic academic datasets
Hugging Face Datasets — NLP and multimodal datasets
ImageNet — large-scale image classification

Always check licensing and usage restrictions before using third-party datasets in production systems.

Feature Engineering

Raw data rarely comes in the format a model needs. Feature engineering is the process of transforming raw data into features that better represent the underlying patterns.

Example: Predicting Loan Default

Raw data might have a date_of_birth column. A raw date isn’t useful — but the derived age is:

# Approximate age in whole years; dividing by 365 ignores leap days
df['age'] = (pd.Timestamp.now() - pd.to_datetime(df['date_of_birth'])).dt.days // 365

Other common feature engineering techniques:

Technique	Example
Normalization	Scale income to [0, 1] so large values don’t dominate
One-hot encoding	Convert `city: ["London", "NY"]` → binary columns
Bucketing	Convert age into groups: `young/middle/senior`
Interaction features	Create `income_per_family_member = income / family_size`
Log transform	Apply `log(price)` to reduce skew in right-tailed distributions

Good feature engineering often matters more than algorithm choice for structured data.
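
The techniques in the table can be sketched in plain Python to show the mechanics. All applicant data, column names, and bucket cut-offs below are invented for illustration, and the log transform is applied to income here rather than price. In a real project you would more likely reach for pandas or scikit-learn.

```python
import math

# Made-up loan applicants; all column names and values are illustrative.
applicants = [
    {"income": 30_000, "family_size": 3, "city": "London", "age": 24},
    {"income": 90_000, "family_size": 2, "city": "NY", "age": 41},
    {"income": 60_000, "family_size": 4, "city": "London", "age": 67},
]

incomes = [a["income"] for a in applicants]
lo, hi = min(incomes), max(incomes)
cities = sorted({a["city"] for a in applicants})


def age_bucket(age):
    """Bucketing: the cut-offs (35 and 60) are arbitrary choices for this sketch."""
    if age < 35:
        return "young"
    if age < 60:
        return "middle"
    return "senior"


features = []
for a in applicants:
    row = {
        # Normalization (min-max): rescale income to the range [0, 1].
        "income_scaled": (a["income"] - lo) / (hi - lo),
        # Log transform: compress large values in a right-skewed column.
        "log_income": math.log(a["income"]),
        # Interaction feature: combine two raw columns into one signal.
        "income_per_family_member": a["income"] / a["family_size"],
        "age_group": age_bucket(a["age"]),
    }
    # One-hot encoding: one binary column per city.
    for city in cities:
        row[f"city_{city}"] = int(a["city"] == city)
    features.append(row)

for row in features:
    print(row)
```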

Train / Validation / Test Split

Evaluating a model on the same data it was trained on tells you very little, because a model that merely memorized the answers would score perfectly. Instead, split your data into three sets:

Set	Purpose	Typical Size
Training	Model learns patterns from this	70–80%
Validation	Tune hyperparameters, detect overfitting	10–15%
Test	Final evaluation — used once at the end	10–15%

from sklearn.model_selection import train_test_split

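# Hold out 30% of the data; the remaining 70% is the training set
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
# Split the held-out 30% in half: 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)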

Never use the test set during development. Tuning against test performance is a form of data leakage: it inflates your reported accuracy, so the number stops being a reliable estimate of how the model will perform on new data.
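
The same caution applies to preprocessing: statistics such as a scaler's minimum and maximum should be computed from the training set only, then reused on validation and test data. Below is a minimal sketch with made-up numbers. `fit_min_max` and `apply_min_max` are hypothetical helper names.

```python
train_incomes = [30_000, 60_000, 90_000]
test_incomes = [45_000, 120_000]


def fit_min_max(values):
    """Learn scaling parameters from the training data only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values using parameters learned elsewhere."""
    return [(v - lo) / (hi - lo) for v in values]


lo, hi = fit_min_max(train_incomes)

train_scaled = apply_min_max(train_incomes, lo, hi)
# The test set is transformed with the training min/max, never its own.
test_scaled = apply_min_max(test_incomes, lo, hi)

print(train_scaled)
print(test_scaled)  # values outside [0, 1] are possible and expected
```

If the scaler had been fitted on all the data, information about the test set would have influenced the training features.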

Data Augmentation

When you don’t have enough labeled data, augmentation artificially increases your dataset by creating modified versions of existing examples.

For Images

from torchvision import transforms

augment = transforms.Compose([
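    transforms.RandomHorizontalFlip(),                     # flip left-right with probability 0.5
    transforms.RandomRotation(15),                         # rotate by a random angle in [-15, 15] degrees
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # randomly vary brightness and contrast
    transforms.RandomCrop(224),                            # random 224x224 crop; input must be at least that large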
])

A single photo of a cat becomes dozens of training examples — flipped, rotated, cropped, adjusted.
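
To show what two of these transforms do to the underlying numbers, here is a dependency-free sketch on a tiny made-up image. The helpers `horizontal_flip` and `adjust_brightness` are written for this example and are not torchvision functions.

```python
# A made-up 2x3 grayscale "image" stored as rows of pixel values.
image = [
    [10, 20, 30],
    [40, 50, 60],
]


def horizontal_flip(img):
    """Mirror the image left-to-right by reversing every row."""
    return [row[::-1] for row in img]


def adjust_brightness(img, factor):
    """Scale every pixel, clamping to the valid 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


flipped = horizontal_flip(image)
brighter = adjust_brightness(image, 1.2)

print(flipped)
print(brighter)
```

The label stays the same: a flipped or slightly brighter cat is still a cat.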

For Text

Synonym replacement: “The car is fast” → “The vehicle is rapid”
Back-translation: Translate to French, then back to English to get paraphrases
Random insertion/deletion: Insert or drop words to create noisy versions
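
Two of these text techniques can be sketched with the standard library. The synonym table is a tiny hand-written stand-in (a real system would use a thesaurus or a language model), and the helper names are made up for this example.

```python
import random

# Tiny hand-written synonym table, for illustration only.
SYNONYMS = {"car": "vehicle", "fast": "rapid"}


def replace_synonyms(sentence):
    """Swap each word for its synonym when one is listed."""
    return " ".join(SYNONYMS.get(word, word) for word in sentence.split())


def random_deletion(sentence, drop_prob, rng):
    """Drop each word with probability drop_prob, keeping at least one word."""
    words = sentence.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    return " ".join(kept or [rng.choice(words)])


rng = random.Random(0)  # fixed seed so the example is repeatable
original = "The car is fast"

paraphrase = replace_synonyms(original)
noisy = random_deletion(original, drop_prob=0.25, rng=rng)

print(paraphrase)
print(noisy)
```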

For Tabular Data

SMOTE: Synthesize new minority-class examples by interpolating between existing ones (useful for imbalanced classification)
Gaussian noise: Add small random perturbations to numeric features
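
The core idea behind both tabular techniques fits in a few lines. This is a simplified sketch: full SMOTE chooses among each point's k nearest minority-class neighbours, whereas here the two made-up rows are paired by hand. The imbalanced-learn library provides a complete implementation.

```python
import random

rng = random.Random(42)  # fixed seed for repeatability

# Two made-up minority-class rows: (temperature in °C, pressure in PSI).
a = (120.0, 45.0)
b = (130.0, 50.0)


def interpolate(p, q, rng):
    """SMOTE-style step: pick a random point on the segment between p and q."""
    t = rng.random()  # 0.0 <= t < 1.0
    return tuple(pi + t * (qi - pi) for pi, qi in zip(p, q))


def add_gaussian_noise(p, sigma, rng):
    """Jitter every numeric feature with zero-mean Gaussian noise."""
    return tuple(pi + rng.gauss(0.0, sigma) for pi in p)


synthetic = interpolate(a, b, rng)
noisy = add_gaussian_noise(a, sigma=0.5, rng=rng)

print(synthetic)
print(noisy)
```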

Common Data Mistakes

“Collect for 3 years, then build the AI”

Don’t wait. Start small, build a prototype with whatever data you have, and iterate. Involve AI engineers early so they can guide what data is actually useful — data collected without clear objectives often turns out to be the wrong data.

“We have lots of data, so it must be valuable”

Volume doesn’t equal value. Ten million mislabeled examples can be worse than 10,000 accurately labeled ones, because the model learns the labeling errors along with the signal. Data quality usually matters more than data quantity.

Not accounting for distribution shift

Your model is trained on data from one time period or context, but deployed in another. A fraud detection model trained on 2022 transactions may fail on 2025 patterns. Monitor model performance over time and retrain regularly.
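
One simple way to watch for shift is to compare a feature's distribution in production against the training data. The sketch below uses made-up transaction amounts and a hypothetical helper, `mean_shift_in_std`. The alert threshold is an arbitrary choice, not a standard value.

```python
from statistics import mean, stdev

# Made-up transaction amounts: training period vs. recent production traffic.
train_amounts = [20, 25, 30, 22, 28, 24, 26, 27]
live_amounts = [60, 75, 58, 80, 66, 72, 69, 77]


def mean_shift_in_std(reference, current):
    """Distance between the two means, in units of the reference std dev."""
    return abs(mean(current) - mean(reference)) / stdev(reference)


shift = mean_shift_in_std(train_amounts, live_amounts)

# The threshold of 3 is an arbitrary choice for this sketch.
if shift > 3:
    print(f"Possible distribution shift: mean moved {shift:.1f} std devs")
```

A check like this only catches changes in the mean of one feature; production monitoring usually tracks several statistics alongside model accuracy.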

Leaking the label into features

If a feature in your training data is derived from the label, or only becomes known after the outcome you are predicting, it will not be available at prediction time in production. A model trained on it will look unrealistically accurate during development and be useless once deployed.
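
A cheap sanity check is to flag any feature that determines the label perfectly, then ask whether it would really exist at prediction time. The rows, column names, and the `perfectly_predicts_label` helper below are all invented for this sketch. A flagged feature is a prompt for review, not proof of leakage.

```python
# Made-up fraud rows. "chargeback_filed" is only recorded after a fraud
# investigation, so it would not exist when the model makes a prediction.
rows = [
    {"amount": 20, "chargeback_filed": 0, "is_fraud": 0},
    {"amount": 950, "chargeback_filed": 1, "is_fraud": 1},
    {"amount": 20, "chargeback_filed": 0, "is_fraud": 0},
    {"amount": 20, "chargeback_filed": 1, "is_fraud": 1},
]


def perfectly_predicts_label(rows, feature, label):
    """True if every value of the feature maps to exactly one label value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[feature], row[label]) != row[label]:
            return False
    return True


suspicious = [
    name
    for name in ("amount", "chargeback_filed")
    if perfectly_predicts_label(rows, name, "is_fraud")
]
print(suspicious)
```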

Data Can Be Messy (and That’s Okay)

Real-world data almost always has problems:

Problem	Example	Fix
Missing values	Empty cells in a spreadsheet	Impute with mean/median, or create an `is_missing` flag
Incorrect values	A house listed at $1	Remove outliers or apply business-rule validation
Inconsistent units	Mix of sq ft and sq m	Normalize to a single unit
Duplicate rows	Same transaction recorded twice	Deduplicate by key columns
Class imbalance	99% non-fraud, 1% fraud	Oversample minority class or use class weights
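
Three of the fixes from the table are sketched below in plain Python: deduplication, unit normalization, and median imputation with an `is_missing` flag. The listings and field names are made up for illustration.

```python
from statistics import median

SQ_M_TO_SQ_FT = 10.7639  # approximate conversion factor

# Made-up listings that exhibit problems from the table above.
listings = [
    {"id": 1, "size": 1200, "unit": "sqft", "bedrooms": 2},
    {"id": 2, "size": 200, "unit": "sqm", "bedrooms": None},
    {"id": 2, "size": 200, "unit": "sqm", "bedrooms": None},  # duplicate row
    {"id": 3, "size": 2500, "unit": "sqft", "bedrooms": 4},
]

# 1. Deduplicate by key column, keeping the first occurrence.
seen_ids = set()
unique = []
for row in listings:
    if row["id"] not in seen_ids:
        seen_ids.add(row["id"])
        unique.append(dict(row))  # copy so the raw data stays untouched

# 2. Normalize every size to square feet.
for row in unique:
    if row["unit"] == "sqm":
        row["size"] = round(row["size"] * SQ_M_TO_SQ_FT)
        row["unit"] = "sqft"

# 3. Impute missing bedroom counts with the median and record a flag.
known = [row["bedrooms"] for row in unique if row["bedrooms"] is not None]
fill_value = median(known)
for row in unique:
    row["bedrooms_is_missing"] = row["bedrooms"] is None
    if row["bedrooms"] is None:
        row["bedrooms"] = fill_value

for row in unique:
    print(row)
```

The flag column lets the model learn whether missingness itself carries information, instead of hiding it behind the imputed value.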

AI teams commonly report spending most of their time on data work (cleaning, labeling, and validating), with 60–80% being an often-cited estimate. This is normal and expected.

Key Takeaways

Tabular data in ML is a set of examples (rows) with features and labels (columns); unstructured data such as images or text plays the same role. Define what’s A (input) and what’s B (output) before collecting anything.
Get data through manual labeling, behavioral observation, or public datasets — always verify licensing.
Feature engineering transforms raw data into meaningful signals. It often matters more than the algorithm.
Always split data into train / validation / test sets. Never evaluate on training data, never tune on test data.
Data augmentation stretches limited labeled data further and is especially useful in computer vision and NLP.
Data quality beats quantity. Clean, well-labeled, representative data outperforms large but messy datasets.
Start small, validate early, and iterate — don’t wait until you have “enough” data to begin.

To understand what happens with this data once it’s collected — how models are trained on it using transformers and attention — see How Large Language Models Work.

