New Jobs Simplified, AI University

RAG — Retrieval-Augmented Generation

Chunking & Embedding Strategies

This lesson covers chunking and embedding strategies for AI search systems: how to break large documents into manageable chunks, and how to represent each chunk as a numerical vector so that a system can quickly find the relevant passages in a large text archive.

Why It Matters

Chunking and embedding strategies are central to AI search because they determine what can be retrieved. If a long document is indexed as a single unit, much of its content can be left out of the index and become unsearchable; splitting it into chunks and embedding each one lets the system match a query against specific passages. This matters in real-world applications such as search engines and chatbots that answer questions over a document collection.

Key Points

Chunking breaks large documents into smaller pieces called chunks, so that each piece is small enough to embed and retrieve on its own.
A chunk size of 500 characters with a 100-character overlap between consecutive chunks is one example setting. The overlap makes it more likely that a sentence cut at one chunk boundary appears whole in the neighboring chunk; the right values depend on the documents and the embedding model.
Embedding involves representing text as numerical vectors, which can be compared and searched efficiently.
Embedding models build on ideas such as word embeddings, which capture the semantic meaning of individual words; for search, a text embedding model produces one vector for a whole chunk.
Rerankers, like monoBERT, take a search query and a candidate document together and output a relevance score, which is used to reorder the results of a first-stage retriever.
Retrieval-Augmented Generation (RAG) systems embed the search query, find the chunks whose embeddings are most similar to it, and pass those chunks to a language model as context for its answer.
Packing is a model-training technique rather than a retrieval step: short training documents are grouped together into a single context window, so that less of the context is wasted on padding.

Key Concepts

Chunking

Breaking down large documents into smaller, manageable pieces called chunks.
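
As a minimal sketch of the idea, the function below slides a fixed-size character window over the text, with each window overlapping the previous one. The name `chunk_text` and its parameters are illustrative assumptions; library splitters such as RecursiveCharacterTextSplitter also try to break on separators like paragraphs and sentences, which this sketch does not do.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Example: 1,200 characters of repeating letters.
sample = "".join(chr(97 + i % 26) for i in range(1200))
sample_chunks = chunk_text(sample, chunk_size=500, chunk_overlap=100)
```

With these settings the window advances 400 characters at a time, so the last 100 characters of one chunk are repeated at the start of the next.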

Embedding

Representing text as numerical vectors that can be compared and searched efficiently.
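
To make "compared and searched" concrete, here is a small sketch of similarity search over vectors. The `toy_embed` function is a hypothetical stand-in that just counts words from a fixed vocabulary; a real system would call a trained embedding model instead. The cosine similarity and top-k steps are the part that carries over.

```python
import math
from collections import Counter


def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


vocabulary = ["cat", "dog", "stock", "market"]
chunks = ["the cat chased the dog", "the stock market fell", "dog food for your dog"]
chunk_vecs = [toy_embed(chunk, vocabulary) for chunk in chunks]
query_vec = toy_embed("stock market news", vocabulary)
best = top_k(query_vec, chunk_vecs, k=1)  # index of the closest chunk
```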

Reranker

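A model that scores the relevance of each candidate document to a search query, typically by reading the query and the document together, so that an initial list of results can be reordered.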
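
A reranker is usually applied as a second stage: a fast first-stage retriever proposes candidates, and the reranker scores each (query, document) pair to reorder them. The sketch below shows only that control flow. `overlap_score` is a hypothetical toy scorer based on shared words; in practice the scoring function would be a trained model such as monoBERT.

```python
from typing import Callable


def overlap_score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_n: int = 3,
) -> list[tuple[float, str]]:
    """Score every candidate against the query and return the top_n, best first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Candidates as they might come back from a first-stage retriever.
candidates = [
    "chunk overlap keeps context",
    "embedding models map text to vectors",
    "how to choose chunk size and chunk overlap",
]
reranked = rerank("chunk size overlap", candidates, top_n=2)
```

Because the scorer sees the query and the document together, it can be swapped for a more expensive model without changing the retrieval stage; the cost is that it must run once per candidate, which is why it is applied to a short list only.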

Packing

Grouping short training documents into one context window during model training, so that less of the context is spent on padding.
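
One simple way to picture packing is a greedy loop that keeps appending tokenized documents to the current row until the next one would not fit. This is a simplified sketch under stated assumptions: the function name, the separator id `0`, and the padding id `-1` are illustrative, and real training pipelines differ in detail (for example, some split documents across row boundaries).

```python
def pack_documents(
    docs: list[list[int]],
    context_len: int,
    sep_id: int = 0,
    pad_id: int = -1,
) -> list[list[int]]:
    """Greedily pack tokenized documents into rows of exactly context_len tokens."""
    rows: list[list[int]] = []
    current: list[int] = []
    for doc in docs:
        piece = doc + [sep_id]  # mark the end of each document
        if len(piece) > context_len:
            raise ValueError("document is longer than the context; split it first")
        if len(current) + len(piece) > context_len:
            # Current row is full enough: pad it out and start a new one.
            rows.append(current + [pad_id] * (context_len - len(current)))
            current = []
        current.extend(piece)
    if current:
        rows.append(current + [pad_id] * (context_len - len(current)))
    return rows


# Four short "documents" given as token ids, packed into rows of 8 tokens.
token_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_documents(token_docs, context_len=8)
```

Here the four documents fit into two rows with one padding token each, instead of four rows that would be mostly padding.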

Word Embeddings

Vector representations of individual words that capture their semantic meaning, so that words with similar meanings get similar vectors.
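
One simple way to go from per-word vectors to a single vector for a piece of text is to average them. The sketch below assumes a tiny hand-made table of 3-dimensional vectors purely for illustration; trained word embeddings are learned from data and have far more dimensions, and modern text embedding models use more sophisticated pooling.

```python
# Made-up 3-dimensional vectors, for illustration only.
WORD_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "stock": [0.0, 0.0, 1.0],
}


def average_embedding(text: str, word_vectors: dict[str, list[float]], dim: int = 3) -> list[float]:
    """Average the vectors of the known words in text; zeros if none are known."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


pets_vec = average_embedding("cat dog", WORD_VECTORS)
unknown_vec = average_embedding("hello world", WORD_VECTORS)
```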

Code Examples

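Splitting a list of documents into chunks using a specified chunk size and overlap.

# Assumes the langchain-text-splitters package; `docs` is a list of LangChain Document objects loaded elsewhere.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)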

From the books

“the best approach for a real system because a lot of information would be left out of the index and would be unsearchable. • Embedding the document in chunks, embedding those chunks, and then aggre‐ •…”
“chunk size we chose). from langchain_text_splitters import RecursiveCharacterTextSplitter text_splitter = RecursiveCharacterTextSplitter( chunk_size=500, chunk_overlap=100 ) chunks = text_splitter.spl…”
“context. It would be inefficient to allocate the entire, say, 4K context to a short 10-word sentence. So during model training, documents are pack…”

Quick Quiz

1. What is the main purpose of chunking in AI search systems?

A) To split long documents into smaller pieces that can be embedded and retrieved individually
B) To increase the size of documents
C) To score the relevance of documents to a search query
D) To decrease the efficiency of search systems

2. What is embedding in the context of AI search systems?

A) Representing text as numerical vectors
B) Breaking down documents into smaller chunks
C) Scoring the relevance of documents to a search query
D) Efficiently organizing training documents

3. What is packing in the context of language model training?

A) Breaking down documents into smaller chunks
B) Representing text as numerical vectors
C) Efficiently organizing short training documents into the context
D) Scoring the relevance of documents to a search query