AI Data
Datasets & Data Sources
This lesson covers the basics of datasets and data sources in AI, including where to find public datasets, how to filter and share them, and how to work with massive datasets. We'll also explore how to create new datasets by repeating or batching existing ones.
Why It Matters
In the real world of AI, finding the right dataset is crucial for training accurate models. Without a suitable dataset, even the best AI model won't perform well. Learning how to find, filter, and work with datasets is essential for AI practitioners and researchers.
Key Concepts
Dataset: A collection of data used to train, validate, and test AI models.
Vector database: A database specialized for storing data as vectors (embeddings) and searching large collections of them by similarity.
A vector database that allows you to store and query large amounts of data efficiently.
TFDS (TensorFlow Datasets): A library that provides a simple way to download, load, and manipulate existing datasets.
Batching: The process of grouping existing data into chunks, called batches, to create a new dataset.
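To make the vector database idea concrete, here is a minimal in-memory sketch of storing vectors and querying them by similarity. The class name TinyVectorStore, its add and query methods, and the three-dimensional example vectors are all hypothetical and chosen for illustration. Real vector databases add indexing, persistence, and approximate search on top of this basic idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self._items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        """Store a vector under the given id."""
        self._items.append((item_id, list(vector)))

    def query(self, vector, top_k=1):
        """Return the ids of the top_k stored vectors most similar to `vector`."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(vector, item[1]),
            reverse=True,
        )
        return [item_id for item_id, _ in ranked[:top_k]]


# Made-up vectors standing in for embeddings of three topics.
store = TinyVectorStore()
store.add("sports", [1.0, 0.0, 0.0])
store.add("business", [0.0, 1.0, 0.0])
store.add("science", [0.0, 0.0, 1.0])

# The query vector points mostly along the "sports" axis.
nearest = store.query([0.9, 0.1, 0.0], top_k=2)
print(nearest)  # ['sports', 'business']
```

This brute-force scan compares the query against every stored vector, which is fine for a handful of items but is the part a real vector database replaces with an index.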
Code Examples
Loading a dataset using TensorFlow Datasets
import tensorflow_datasets as tfds
# AG News is listed in the TFDS catalog under the name 'ag_news_subset'
train_dataset, test_dataset = tfds.load('ag_news_subset', split=['train', 'test'])
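Batching and repeating a dataset (plain-Python sketch)
The lesson mentions creating new datasets by repeating or batching existing ones. TensorFlow's tf.data.Dataset provides batch() and repeat() methods for this. The sketch below uses plain Python generators instead, so it runs without TensorFlow. The helper names batch and repeat and the example items are assumptions made for illustration, not part of any library.

```python
from itertools import islice


def batch(items, batch_size):
    """Yield consecutive lists of up to `batch_size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def repeat(items, count):
    """Yield every item in `items`, `count` times over."""
    items = list(items)  # materialize so the data can be replayed
    for _ in range(count):
        yield from items


examples = ["a", "b", "c", "d", "e"]

# Group five items into batches of two; the last batch is smaller.
batches = list(batch(examples, 2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]

# Repeat the data twice, then batch: batches can span the repeat boundary.
repeated_batches = list(batch(repeat(examples, 2), 4))
print(repeated_batches)
# [['a', 'b', 'c', 'd'], ['e', 'a', 'b', 'c'], ['d', 'e']]
```

Repeating before batching lets batches mix the end of one pass with the start of the next. Batching before repeating would keep each pass's batches separate.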
Quick Quiz
1. Where can you find public datasets?
2. What is a vector database?
3. What is TFDS?