AI / Machine Learning April 20, 2025 Aditya Rawas

Data in AI and Machine Learning: What It Is and How to Use It

You may have heard the phrase “data is the new oil” in the context of AI. But what is data in machine learning, how do you collect it, and what common mistakes kill AI projects before they start?

This guide covers what data is in AI, how to acquire it, and the misconceptions to avoid, with real-world examples.


What is Data in AI?

In machine learning, data is the information we collect to teach a machine to make predictions or decisions. Structured data is often represented as a table, like a spreadsheet, where each row is an example and each column is a feature or label.


Example 1: Predicting House Prices

Size (sq ft) | Bedrooms | Price ($)
1,200        | 2        | 300,000
1,800        | 3        | 450,000
2,500        | 4        | 650,000

Here the inputs (A) are the size and number of bedrooms, and the output (B) is the price. The goal is to train a model that learns the mapping from A to B. You can also flip the inputs and outputs depending on your business goal: for example, given a budget, predict how much space you can afford.
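As a rough illustration of what "learning the mapping" can look like, here is a minimal sketch that fits a straight line to the three rows in the table above using ordinary least squares. It is not the only way to model this data: it uses size as the single input (bedrooms are left out for brevity), and the helper names `fit_line`, `predict_price`, and `predict_size` are made up for this example. The same helper is then reused with inputs and outputs swapped to estimate size from a budget.

```python
# Rows from the house-price table above: (size in sq ft, price in $).
# Bedrooms are omitted so the sketch stays at one input feature.
houses = [(1200, 300_000), (1800, 450_000), (2500, 650_000)]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


sizes = [size for size, _ in houses]
prices = [price for _, price in houses]

# Direction 1: input = size, output = price.
price_slope, price_intercept = fit_line(sizes, prices)

# Direction 2 (flipped): input = budget, output = size you can afford.
size_slope, size_intercept = fit_line(prices, sizes)


def predict_price(size_sqft):
    return price_slope * size_sqft + price_intercept


def predict_size(budget):
    return size_slope * budget + size_intercept


print(f"Estimated price for 2,000 sq ft: ${predict_price(2000):,.0f}")
print(f"Estimated size for a $500,000 budget: {predict_size(500_000):,.0f} sq ft")
```

With only three examples the fitted numbers are not meaningful predictions; the point is that the same data supports either direction of the mapping.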


Example 2: Recognizing Cats in Images

Image    | Is Cat?
img1.jpg | Yes
img2.jpg | No
img3.jpg | Yes

This is a classification task — the model learns to distinguish between categories.


How Do You Get Data?

There are three main strategies for acquiring training data:

1. Manual Labeling

Assign labels to data by hand.

Image      | Label
img101.jpg | Cat
img102.jpg | Not Cat

Pros: High quality, fine-grained control. Cons: Time-consuming and labor-intensive at scale.

2. Behavioral Observation

Collect data passively from user or system behavior — no manual labeling needed.

E-commerce example:

User ID | Visit Time | Price Offered | Purchased?
123     | 10:01 AM   | $99           | Yes
456     | 10:05 AM   | $149          | No

Factory monitoring example:

Machine ID | Temperature (°C) | Pressure (PSI) | Failed?
M001       | 80               | 30             | No
M002       | 120              | 45             | Yes

This kind of data enables predictive maintenance systems that catch failures before they happen.
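To see why observed behavior needs no manual labeling, here is a small sketch that turns the factory rows above into training examples. The label comes straight from the recorded outcome (whether the machine failed). The field names and the `to_training_example` helper are assumptions made for illustration, not part of any particular logging system.

```python
# Rows from the factory monitoring table above, as they might arrive from a log.
# Field names are hypothetical.
sensor_log = [
    {"machine_id": "M001", "temperature_c": 80, "pressure_psi": 30, "failed": "No"},
    {"machine_id": "M002", "temperature_c": 120, "pressure_psi": 45, "failed": "Yes"},
]


def to_training_example(row):
    """Split one log row into (features, label)."""
    features = [row["temperature_c"], row["pressure_psi"]]
    # The observed outcome is the label: 1 = failed, 0 = did not fail.
    label = 1 if row["failed"] == "Yes" else 0
    return features, label


examples = [to_training_example(row) for row in sensor_log]
X = [features for features, _ in examples]
y = [label for _, label in examples]

print(X)  # [[80, 30], [120, 45]]
print(y)  # [0, 1]
```

The machine ID is left out of the features here because it identifies a row rather than describing the machine's condition.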

3. Downloaded or Partner-Sourced Data

Use public datasets (Kaggle, ImageNet, UCI ML Repository) or partner with organizations that hold relevant data (e.g., hospitals sharing medical imaging data).

Important: Always check licensing and permissions before using third-party datasets.


Common Data Misconceptions

“We’ll collect data for 3 years, then build the AI.”

Don’t wait. Start small and iterate. Involve AI engineers early so they can guide what data is actually useful — data collected without clear objectives often turns out to be the wrong data.

“We have lots of data — it must be valuable.”

Not all data is useful. Volume doesn’t equal value. The quality, relevance, and labeling accuracy of your data matter far more than its size.


Data Can Be Messy (and That’s Okay)

Real-world data almost always has problems:

Problem            | Example
Incorrect values   | A house listed at $1
Missing values     | Blank entries in the price column
Inconsistent units | Size recorded in both sq ft and sq m

AI teams need to clean and preprocess data before training. This is normal — expecting perfect data is unrealistic.
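A minimal cleaning pass for the three problems above might look like the sketch below. The sample rows, the `unit` field, and the `MIN_PLAUSIBLE_PRICE` threshold are all invented for illustration. Dropping bad rows is only one option; a real project might instead correct or fill in values.

```python
SQM_TO_SQFT = 10.7639  # 1 square metre is about 10.7639 square feet

# Assumed cutoff: anything cheaper is treated as a data-entry error.
MIN_PLAUSIBLE_PRICE = 10_000

# Hypothetical listings showing each problem from the table above.
raw_listings = [
    {"size": 1200, "unit": "sqft", "price": 300_000},  # clean row
    {"size": 167, "unit": "sqm", "price": 450_000},    # inconsistent unit
    {"size": 2500, "unit": "sqft", "price": None},     # missing value
    {"size": 1500, "unit": "sqft", "price": 1},        # incorrect value
]


def clean(listings):
    """Drop rows with missing or implausible prices; convert sizes to sq ft."""
    cleaned = []
    for row in listings:
        price = row["price"]
        if price is None or price < MIN_PLAUSIBLE_PRICE:
            continue  # skip rows we cannot trust
        size = row["size"]
        if row["unit"] == "sqm":
            size = size * SQM_TO_SQFT
        cleaned.append({"size_sqft": round(size), "price": price})
    return cleaned


cleaned_listings = clean(raw_listings)
for row in cleaned_listings:
    print(row)
```

Only the first two rows survive, and the second one has its size converted from square metres to square feet.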


Structured vs Unstructured Data

Type         | Examples                   | Common Use Cases
Structured   | Tables, spreadsheets, CSVs | House pricing, sales prediction, fraud detection
Unstructured | Images, audio, text        | Cat detection, speech recognition, language generation

Most generative AI today works with unstructured data (text, images). Supervised learning for tabular predictions typically uses structured data.


Key Takeaways