New Jobs Simplified, AI University

AI Data

Datasets & Data Sources

This lesson covers the basics of datasets and data sources in AI, including where to find public datasets, how to filter and share them, and how to work with massive datasets. We'll also explore how to create new datasets by repeating or batching existing ones.

Why It Matters

In the real world of AI, finding the right dataset is crucial for training accurate models. Without a suitable dataset, even the best AI model won't perform well. Learning how to find, filter, and work with datasets is essential for AI practitioners and researchers.

Key Points

A dataset is a collection of data used to train, test, and validate AI models. It can be sourced from various places, including public repositories like Hugging Face Datasets, Kaggle, and Zenodo.
When searching a repository such as the Hugging Face Hub, you can filter by task (for example, text classification) to narrow the results to the most relevant datasets.
The AG News dataset is a well-known non-commercial dataset used for benchmarking text-classification models and research.
To work with massive datasets, you can use vector databases such as Milvus, Weaviate, and Qdrant, which store data as embedding vectors and index them for fast similarity search.
TensorFlow Datasets (TFDS) is a library that downloads ready-to-use datasets and loads them as tf.data.Dataset objects.
You can also find datasets through sites like PapersWithCode and through meta portals, which are sites that list open data repositories.
When working with datasets, you can derive new ones from existing ones by applying transformations such as repeating (iterating over the data several times) or batching (grouping items into fixed-size chunks), which is useful for both testing and training.
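
To make the last point concrete, here is a minimal, library-free sketch of what repeating and batching do. The helper names repeat and batch are hypothetical and written only for this illustration; in TensorFlow the analogous operations are the Dataset.repeat() and Dataset.batch() methods of tf.data, which this sketch imitates but does not use.

```python
from itertools import islice


def repeat(items, count):
    """Yield every element of `items` in order, `count` times over.

    `items` must be re-iterable (a list, for example), not a one-shot iterator.
    """
    for _ in range(count):
        yield from items


def batch(items, batch_size, drop_remainder=False):
    """Group consecutive elements into lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        if drop_remainder and len(chunk) < batch_size:
            return  # skip the final, shorter batch
        yield chunk


# Illustrative data: ten integers, repeated three times, in batches of seven.
source = list(range(10))
batches = list(batch(repeat(source, 3), 7))

for b in batches:
    print(b)
```

Thirty items in batches of seven leave a final batch of only two items. Passing drop_remainder=True discards that short batch, which can matter when a model expects every batch to have the same size.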

Key Concepts

dataset

A collection of data used to train, test, and validate AI models.

vector database

A type of database that stores data as numeric vectors (embeddings) and indexes them so that the items most similar to a query vector can be found quickly, even in very large collections.

Milvus

An open-source vector database that lets you store and query large collections of vectors efficiently.

TFDS

TensorFlow Datasets, a library that downloads ready-to-use datasets and loads them as tf.data.Dataset objects.

batching

A process of grouping existing data into chunks, called batches, to create a new dataset.
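
To illustrate the vector database concept above, the following sketch performs the core operation, a nearest-neighbor lookup by cosine similarity, using a brute-force scan in plain Python. It is not the API of Milvus, Weaviate, or Qdrant: the function names, the three-dimensional vectors, and the document labels are all made up for this example, and real vector databases rely on specialized indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, index, k=2):
    """Return the keys of the k stored vectors most similar to `query`."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy "embeddings" (assumed values, not produced by any real model).
index = {
    "sports article": [0.9, 0.1, 0.0],
    "business article": [0.1, 0.9, 0.1],
    "tech article": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.1]
top = nearest(query, index, k=2)
print(top)
```

The brute-force scan compares the query against every stored vector, so its cost grows linearly with the collection size. That cost is what dedicated vector databases are built to avoid on massive datasets.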

Code Examples

Loading a dataset using TensorFlow Datasets

import tensorflow_datasets as tfds
# The TFDS catalog lists AG News under the name 'ag_news_subset'.
train_dataset, test_dataset = tfds.load('ag_news_subset', split=['train', 'test'])

From the books

“a dataset from your company). Some good places to find public datasets are Hugging Face Datasets, Kaggle, Zenodo, and Google Dataset Search. With hundreds of thousands of datasets out there, we need h…”
“databases such as Milvus, Weaviate, and Qdrant may prove useful when you have to work with massive datasets. Diving into vector databases is outside the scope of this book, but they are also a quickly…”
“and much more. You can visit https://homl.info/tfds to view the full list, along with a description of each dataset. You can also check out Know Your Data, which is a tool to explore and understand ma…”

Quick Quiz

1. Where can you find public datasets?

Hugging Face Datasets
Kaggle
Zenodo
All of the above

2. What is a vector database?

A type of database that stores embedding vectors and searches them by similarity
A type of database that stores data in tables
A type of database that stores data in files
None of the above

3. What is TFDS?

A library that provides a simple way to load and manipulate datasets
A library that provides a simple way to train AI models
A library that provides a simple way to test AI models
None of the above