In this tutorial, we’ll build an AI chatbot with GPT-4o and Pinecone to query our data in a basic chat interface. We’ll build ChatResearch, a RAG chatbot that answers questions about the latest in AI research papers.
To use the code, check out https://colab.research.google.com/drive/1gWZNUyWg0xrCmA1PMByNutvY542Uur2N?usp=sharing
What You’ll Learn
In this beginner-friendly tutorial, we’ll:
- Build a RAG application, ChatResearch, to help us read research papers faster.
- Learn the fundamentals behind embeddings and vector databases.
- Use LLM evaluations to test the quality of your AI apps.
Time to complete: 15-30 minutes
Prerequisites
- An interest in building AI products
- No prior experience with AI/ML needed.
- Beginner to intermediate Python experience
ChatResearch Project
In this tutorial, you’ll build ChatResearch. As a developer, you want to stay up to date with the latest AI research. But let’s be honest, reading research papers is boring and time-consuming! Even hiring someone to summarize these papers could cost thousands of dollars. Instead, we’ll build an AI assistant to automate this process. With AI, we can process thousands of research papers in minutes, for just cents per paper.

Project Outline
In this tutorial, we’ll build an AI application that lets us chat with research papers. To do so, we’ll set up a vector database that stores our data in a format that captures the semantic meaning of the text.
The whole pipeline is just 7 steps!

First, we’ll set up our vector database:
- Transform a PDF into embeddings
- Upload the embeddings to a vector database (Pinecone)
Then, we’ll query the vector database to get data-driven answers about the paper:
- Convert user query to an embedding
- Query the vector database with the embedding
- Run a similarity search to determine the top-k relevant context
- Ask LLM (GPT-4o) for a response based on the data
- Return the response
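Before we dive in, here’s a bird’s-eye sketch of how those steps map onto the code we’ll write. The function names below match the ones we define in Steps 4 to 8; nothing here runs yet, it’s just a preview:

# Ingestion (Steps 4-6): turn the PDF into embeddings and store them
#   text   = download_and_read_pdf(pdf_url)
#   chunks = split_text(text, chunk_size=1000)
#   for i, chunk in enumerate(chunks):
#       index.upsert(vectors=[(f"chunk_{i}", create_embeddings(chunk), metadata)])
#
# Querying (Steps 7-8): retrieve relevant chunks and ask the LLM
#   context = similarity_search(user_query)
#   answer  = get_completion(user_query, context)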
Step 1: Install Dependencies
First things first, we need to install a few libraries for our RAG chatbot.
These libraries will help us with:
- OpenAI — Text Generation
- Pinecone — Vector Database
- PyPDF2 — Read PDF files
- Requests — Make API calls to OpenAI and Pinecone
# -q to install quietly (without so many outputs)
!pip install -q "pinecone-client[grpc]" openai PyPDF2 requests
Step 2: Add Your API Keys
Step 2a: OpenAI API Key
You’ll need to sign up for OpenAI Developer Platform.
This screenshot shows how to get your API key.

OPENAI_API_KEY = "your_openai_api_key"
Step 2b: Pinecone API Key
You’ll also need to sign up for a Pinecone account and get your API key.
This screenshot shows where to get your API key:
PINECONE_INDEX_NAME = "chat-research"
PINECONE_API_KEY = "your_pinecone_api_key"
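Hard-coding keys is fine for a quick Colab experiment, but if you’d rather not paste secrets into the notebook, you can load them from environment variables instead. This is an optional sketch; it assumes you’ve exported OPENAI_API_KEY and PINECONE_API_KEY in your environment (or via Colab’s Secrets panel):

import os

# Optional: read API keys from environment variables instead of hard-coding them
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your_openai_api_key")
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "your_pinecone_api_key")
PINECONE_INDEX_NAME = "chat-research"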
Step 3: Create a Pinecone Index
Next, we create a Pinecone index, where we’ll store our embeddings. Here’s how:
Note: This will fail if your API key is invalid.


from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)

# Create the index if it doesn't exist yet
if PINECONE_INDEX_NAME not in [idx.name for idx in pc.list_indexes().indexes]:
    pc.create_index(
        name=PINECONE_INDEX_NAME,
        dimension=1536,  # OpenAI embedding size
        metric="cosine",
        spec=ServerlessSpec(cloud='aws', region='us-east-1')
    )
else:
    print(f"Index '{PINECONE_INDEX_NAME}' already exists.")

# Connect to the index
index = pc.Index(PINECONE_INDEX_NAME)
# View indexes
pc.list_indexes()
Outputs:
{'indexes': [{'deletion_protection': 'disabled',
'dimension': 1536,
'host': 'chat-research-8d9af63.svc.aped-4627-b74a.pinecone.io',
'metric': 'cosine',
'name': 'chat-research',
'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
'status': {'ready': True, 'state': 'Ready'}},
...
Index Configuration Parameters
- dimension=1536: OpenAI embeddings are vectors of 1536 dimensions. Other embedding models may produce smaller (e.g., 128) or larger vectors; more dimensions often mean higher accuracy for similarity searches.
- metric="cosine": We’re using cosine similarity to measure the distance between vectors. You could also use metrics like “euclidean” or “dotproduct,” but “cosine” typically works well for textual data; the metric slightly changes how the similarity search works.
- region="us-east-1": Pinecone’s region. Setting a region closer to you (like us-west or eu-central) can reduce latency and give faster responses.
With the index created, we now have a place to store our embeddings and query them later.
View your index on the Pinecone Console
Now, if you refresh the Pinecone console, you can open the new chat-research index.
You should see zero records. 
What is Pinecone?
Now, we’ve set up our Pinecone Index, which is a type of vector database. This allows us to store embeddings (vectors representing text) and perform similarity searches on them.
Pinecone’s cosine similarity is how we measure the distance between vectors to determine which are most similar.
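To make “cosine similarity” concrete, here’s a minimal sketch of how the score between two vectors is computed, using NumPy (already available in Colab). Pinecone computes this for us at query time, so this is purely for intuition:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means same direction (very similar), 0 means unrelated."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors (real OpenAI embeddings have 1536 dimensions)
print(cosine_similarity([1.0, 0.5, 0.0], [0.9, 0.6, 0.1]))  # close to 1 -> very similar
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0 -> unrelated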
Step 4: Load Text (Download the PDF)
Let’s download a public PDF file, extract its text, and get it ready for the next step.
The research paper we’ll start with is one of the most important AI papers of all time: the “Attention Is All You Need” paper by Google Brain.
View the PDF we’re uploading to the database: https://hippo-ai-public-assets.s3.amazonaws.com/Attention+is+all+you+need.pdf
import requests
from PyPDF2 import PdfReader
from io import BytesIO

def download_and_read_pdf(url):
    response = requests.get(url)
    if response.status_code == 200:
        pdf_file = BytesIO(response.content)
        reader = PdfReader(pdf_file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
        return text
    else:
        print(f"Failed to download PDF, status code: {response.status_code}")
        return None
pdf_url = "https://hippo-ai-public-assets.s3.amazonaws.com/Attention+is+all+you+need.pdf"
pdf_content = download_and_read_pdf(pdf_url)
pdf_content
Outputs:
.... We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and conv..
We successfully downloaded and extracted text from the PDF. However, the raw text is too large to embed in a single call: OpenAI’s embedding models accept at most 8,191 input tokens. We’ll need to split it into smaller chunks.
print(f"Length of text to be converted to embeddings: {len(pdf_content)} characters")
print(f"OpenAI token limit: 8191 or {8191*4} characters")
Outputs:
Length of text to be converted to embeddings: 39472 characters
OpenAI token limit: 8191 or 32764 characters
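The “4 characters per token” rule is only a rough estimate. If you want an exact count, the tiktoken library can tokenize the text with the same encoding OpenAI’s current embedding models use. It isn’t installed in Step 1, so treat this as an optional extra:

# !pip install -q tiktoken   # optional extra dependency
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by OpenAI's current models
num_tokens = len(encoding.encode(pdf_content))
print(f"Exact token count: {num_tokens}")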
Step 5: Split and Convert Text to Embeddings
We need to split large documents into smaller chunks because OpenAI has a token limit. Let’s do this by “chunking” the text into manageable sizes.
def split_text(text, chunk_size, chunk_overlap=0):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - chunk_overlap
    return chunks
text_chunks = split_text(pdf_content, 1000)
text_chunks
Here, we split the text into chunks of 1000 characters to avoid exceeding OpenAI’s token limit. The chunk size can be adjusted based on your model’s token limit, but 1000 is generally a safe size.
- Why use overlap? Overlapping chunks (using chunk_overlap) can improve search results. If two chunks have similar context, the overlap helps capture better semantic meaning when performing similarity searches. This is optional but can improve the quality of results when querying data.
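To see what overlap actually does, here’s a tiny illustration on a toy string. It isn’t part of the main pipeline; it just reuses the split_text function we defined above:

# 10-character "document", chunks of 4 characters with an overlap of 2
demo_chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']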
To check how the chunks look:
text_chunks[:2] # Preview the first two chunks
print(f"Length previously: {len(pdf_content)}")
print(f"Approximate of Each chunk: {len(text_chunks[0])} for {len(text_chunks)}")
print(f"Length of last chunk (less than 1000 chars): {len(text_chunks[-1])}")
Outputs:
Length previously: 39472
Approximate length of each chunk: 1000 characters, for 40 chunks
Length of last chunk (less than 1000 chars): 472
Step 6: Convert Text Chunks to Embeddings
Next, we’ll convert each text chunk into an embedding, which is a vector representation of the text.
Note: If this fails, check that your OpenAI and Pinecone API keys were set correctly in Step 2.
from openai import OpenAI

def create_embeddings(text):
    client = OpenAI(api_key=OPENAI_API_KEY)
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return response.data[0].embedding

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(PINECONE_INDEX_NAME)

# Embed each chunk and upsert it into Pinecone with its text as metadata
for i, chunk in enumerate(text_chunks):
    embedding = create_embeddings(chunk)
    metadata = {"chunk_id": i, "text": chunk}
    index.upsert(vectors=[(f"chunk_{i}", embedding, metadata)])
We’ve now converted each chunk into an embedding using OpenAI’s text-embedding-3-small model. These embeddings are stored in Pinecone, where we can later query them for similarity searches.
To see the first embedding:
sample_embedding = create_embeddings(text_chunks[0])
print(f"Embedding Preview: {sample_embedding[:10]}") # Preview the first 10 values of the first embedding
print(f"Embedding Dimensions: {len(sample_embedding)}") # This should match the 1536 dimensions from above
Embedding Preview: [0.020347388461232185, 0.010789184831082821, -0.03198136389255524, 0.018175069242715836, 0.025681639090180397, -0.03654323145747185, -0.007192789576947689, 0.048925451934337616, -0.06917629390954971, -0.024836847558617592]
Embedding Dimensions: 1536
Each embedding is a vector (list of numbers) representing the meaning of the text chunk. Pinecone stores these vectors, which will allow us to query the data efficiently in the next steps.
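If you want to confirm the upserts landed, Pinecone can report index statistics. This is an optional sanity check; note the count can take a few seconds to reflect newly upserted records:

# Check how many vectors the index now holds
stats = index.describe_index_stats()
print(stats)  # total_vector_count should roughly match len(text_chunks)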
Step 7: Similarity Search
Let’s define a similarity search function that will allow us to query the vector database (Pinecone) and retrieve the top 3 most relevant results based on a user query. This is crucial for RAG systems because it retrieves context that is most similar to the input query.
- top_k=3: This retrieves the top 3 most similar matches from the database. You can adjust this value depending on how many relevant chunks you want to return.
- include_values: Set to False. We don’t need to see the embedding values, since we can’t interpret them anyway.
- include_metadata: Set to True so that we can get the chunk_id and text from the metadata.
def similarity_search(query: str) -> str:
    """
    Based on a query, return the 3 most relevant embeddings in the vector database.
    """
    try:
        pc = Pinecone(api_key=PINECONE_API_KEY)
        index = pc.Index(PINECONE_INDEX_NAME)

        # Create embedding for the query
        search_embedding = create_embeddings(query)

        # Query the Pinecone index
        response = index.query(
            vector=search_embedding,
            top_k=3,  # Number of top results to return
            include_values=False,  # We don't need the embedding values
            include_metadata=True,  # Return metadata with the result
        )

        # Extract and format the relevant information
        results = []
        for match in response['matches']:
            results.append({
                'chunk_id': match['metadata']['chunk_id'],
                'text': match['metadata']['text'],
                'score': match['score']
            })
        return str(results)
    except Exception as e:
        print(f"Error in similarity search: {e}")
        return None
- Similarity Search is a process where the system compares a query (turned into an embedding) to all the stored embeddings in the database to find the closest matches. We use the cosine similarity metric (as discussed earlier) to measure how similar the query is to each stored embedding.
The diagram below is a visual representation of how embeddings work in a high-dimensional space.

Explanation
- Each point represents a concept or word, and their position relative to each other is determined by their meaning. This vector database has information about animals, fruits, and companies.
- In this example, words like “Chicken,” “Dog,” “Wolf,” and “Cat” are placed closer together due to their semantic similarity (animals)
- Related objects like “Banana” and “Apple” (fruits) are placed further away from the animals.
- Semantics: The embedding for Apple (iPhones) is near apple (the fruit). However, Apple (iPhones) is also close to Google, since they’re both technology companies (see the quick check below).
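We can reproduce this intuition with the create_embeddings helper from Step 6 and the cosine_similarity sketch from earlier. Each call costs a fraction of a cent, and the exact scores will vary, but semantically close pairs should score noticeably higher:

# Compare a few words: "dog" and "cat" should be more similar than "dog" and "banana"
dog, cat, banana = (create_embeddings(w) for w in ["dog", "cat", "banana"])
print(f"dog vs cat:    {cosine_similarity(dog, cat):.3f}")
print(f"dog vs banana: {cosine_similarity(dog, banana):.3f}")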
Let’s test out the similarity search to see what kind of chunks are returned.
result = similarity_search(query="What is a Transformer model?")
print(result) # Preview chunk content
[{'chunk_id': 7.0, 'text': ' auto-regressive\n[10], consuming the previously generated symbols as additional input when generating the next.\n2Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and righ ...
Step 8: Build the Chatbot
In this step, we’ll create the chatbot that combines embeddings from Pinecone with OpenAI’s GPT-4o model. The chatbot will take user queries, search the vector database for relevant chunks using similarity search, and use those results to provide data-driven responses.

def get_completion(user_query, context=None):
    """
    Based on the user query and additional context, generate a response using GPT-4o.
    """
    client = OpenAI(api_key=OPENAI_API_KEY)
    system_prompt = f"""
    You are ChatResearch, a helpful RAG assistant on research papers.
    If the context does NOT answer the question, mention what the text references, and give other similar topics the user could ask about.
    Only use the context provided to answer the question.
    Given the user's query give a helpful answer.

    Additional Context:
    {context}
    """
    print(f"System Prompt: {system_prompt}")

    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"User Query: {user_query}"}
        ]
    )
    return completion.choices[0].message.content
This function accepts a user_query and optionally uses additional context (e.g., retrieved via similarity search). It sends the query to the GPT-4o model and returns the response.
Now, we’ll define another function that will:
- Call the vector database to get the closest embeddings based on meaning (using cosine similarity search)
- Pass the user query plus the retrieved chunk metadata to the LLM (GPT-4o) to create a data-driven answer
def ask_rag_chatbot(query):
    """
    The simplest RAG application.
    Query + Context -> Answer
    """
    print(f"\nPerforming similarity search for query: '{query}'")
    search_results = similarity_search(query)
    print(f"RAG Context: {search_results}")

    bot_response = get_completion(query, search_results)
    print(f"RAG Chatbot response: {bot_response}")
    return bot_response
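With both pieces in place, asking a question about the paper is a one-liner. Any question works; this one is just an example:

answer = ask_rag_chatbot("What is multi-head attention and why does the paper use it?")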
Step 9: Test the Chatbot
LLM evaluation is a crucial part of every RAG system to ensure your chatbot provides reliable and accurate information without hallucination or fabricating answers.
Since AI-generated responses can be non-deterministic (meaning the same prompt can produce slightly different responses), it is essential to carefully evaluate them across key aspects.
- Accuracy
- Non-hallucinatory
- Summarization
1. Accuracy Test
Purpose: This test ensures that the chatbot retrieves the correct information from the embeddings stored in Pinecone.
- What is being tested? The chatbot’s ability to return precise answers from the research paper, aligned with the context stored in the vector database. Even with some variation in phrasing, the answer should match the factual content found in the source text.
Test Query:
query_1 = "What is the name of the architecture introduced in the paper?"
response_1 = ask_rag_chatbot(query_1)
# Evaluate Response
expected_answer_1 = "Transformer"
if expected_answer_1.lower() in response_1.lower():
    print("✅ Accuracy Test Passed: Correct architecture name returned.")
else:
    print(f"❌ Accuracy Test Failed: Expected '{expected_answer_1}', but got '{response_1}'")
RAG Chatbot response: The architecture introduced in the paper is called the Transformer. It is a new network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions.
✅ Accuracy Test Passed: Correct architecture name returned.
✅ The Accuracy Test passed, since the word “Transformer” appeared in the response. Let’s move on to the next test.
2. Non-Hallucinatory Test (Negative Test Case)
Purpose: This test ensures the chatbot stays within the boundaries of the provided data and does not hallucinate or generate unsupported information from its pre-trained knowledge.
Test Query:
query_2 = "How did the authors determine the environmental impact (carbon emissions) of training the Transformer model on 8 GPUs?"
response_2 = ask_rag_chatbot(query_2)
# Evaluate Response
print(response_2)
RAG Chatbot response: The provided text does not specifically address how the authors determined the environmental impact or carbon emissions of training the Transformer model on 8 GPUs. However, it does mention that the training cost in terms of floating point operations (FLOPs) is considered, which involves estimating the sustained single-precision floating-point capacity of each GPU. To evaluate energy usage and environmental impact, similar research might involve the power consumption of the GPUs and converting that to carbon emissions using local energy grid carbon intensity data. You may want to explore topics like "Estimating Carbon Footprint in Machine Learning" or "Calculating Energy Consumption for Training AI Models" for more information on similar studies.
✅ Non-Hallucinatory Test Passed: As we can see above, the chatbot states that “the provided text does not specifically address how the authors determined the environmental impact.” It didn’t try to hallucinate, which is the desired behaviour. Also, the similarity search did not provide any context related to environmental impact (because there was none!).
3. Summarization Test
Purpose: This test checks if the chatbot can create concise summaries from the paper, e.g., generating tweets or short descriptions.
- What is being tested? This tests the chatbot’s ability to synthesize information into a coherent and compact form, rather than simply extracting or copying content verbatim.
Test Query:
query_3 = "Summarize the Transformer model in a tweet. The response should be under 280 characters"
response_3 = ask_rag_chatbot(query_3)

# Evaluate Response
if len(response_3) <= 280:
    print("✅ Summarization Test Passed: Tweet-sized response generated.")
else:
    print(f"❌ Summarization Test Failed: Response too long ({len(response_3)} characters).")
print(f"Tweet: {response_3}")
...
Outputs:
❌ Summarization Test Failed: Response too long (334 characters).
Tweet: The Transformer model uses stacked self-attention and point-wise, fully connected layers in an encoder-decoder structure. It processes input sequences without relying on RNNs or convolution, leveraging self-attention to generate efficient and scalable representations for tasks like translation and parsing. #NLP #AI #TransformerModel
❌ The Summarization Test usually fails
Summarization tests highlight the non-deterministic nature of generative models. Each response may vary, but the chatbot must still capture the main points in a tweet-sized response (within 280 characters). The test ensures the output is concise and on-topic.
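Because of this variability, a practical approach is to run the same test several times and track a pass rate instead of a single pass/fail. Here’s a small sketch (three runs keeps the API cost low):

# Run the summarization test a few times and report how often it passes
passes = 0
runs = 3
for _ in range(runs):
    response = ask_rag_chatbot(query_3)
    if len(response) <= 280:
        passes += 1
print(f"Summarization pass rate: {passes}/{runs}")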
Chatbot Performance Summary and Improvements
Based on the three tests—accuracy, non-hallucination, and summarization—we can assess the performance of the ChatResearch chatbot. Each test targeted different key areas to ensure the chatbot provides correct, relevant, and well-synthesized information.
- Accuracy Test: ✅ Passed
The chatbot successfully retrieved the correct architecture name (“Transformer”) from the embedded research paper. This demonstrates that the RAG system effectively utilizes the stored vector embeddings to retrieve precise answers.
- Non-Hallucination Test (Negative Case): ✅ Passed
When asked about the environmental impact, a topic absent from the paper, the chatbot correctly identified that the information was not present. This confirms the chatbot adhered to its system prompt by not generating false information and only using the context provided through embeddings.
- Summarization Test: ❌ Failed
The chatbot produced a response longer than 280 characters when tasked with generating a tweet-sized summary of the Transformer model. Although the content was accurate and relevant, the length indicates room for improvement in concise text generation for specific output formats like tweets.
Opportunities for Improvement
While the chatbot performed well in most areas, we identified specific strategies to enhance the quality of RAG applications:
- Prompt Engineering: Modifying the system prompt to emphasize brevity could improve performance in summarization tasks. For example: “Generate concise, tweet-sized summaries under 280 characters.” See the sketch after this list.
- Model Variation: Using different models optimized for the task, such as o1-preview or task-specific fine-tuned models, can improve accuracy and response format for outputs like tweets.
- Fine-Tuning Techniques: Fine-tuning a model with domain-specific data (e.g., academic research) ensures better alignment with the expected style and structure of answers, reducing variability in responses.
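As a quick example of the first idea, here’s a sketch of a brevity-focused variant of get_completion. The extra prompt lines are just one possible wording, not a guaranteed fix:

def get_concise_completion(user_query, context=None):
    """Same as get_completion, but with an explicit brevity instruction."""
    client = OpenAI(api_key=OPENAI_API_KEY)
    system_prompt = f"""
    You are ChatResearch, a helpful RAG assistant on research papers.
    Only use the context provided to answer the question.
    Keep every answer under 280 characters. Do not add hashtags or filler.

    Additional Context:
    {context}
    """
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"User Query: {user_query}"}
        ]
    )
    return completion.choices[0].message.content

# Re-run the summarization check with the stricter prompt
concise_answer = get_concise_completion(query_3, similarity_search(query_3))
print(len(concise_answer), concise_answer)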
Conclusion
Congratulations! You’ve built ChatResearch—a working RAG chatbot that reads research papers and answers questions.
You’re well on your way in your AI journey.
Optional: What’s Next?
If you enjoyed this tutorial, consider trying out my Function Calling Tutorial or joining my Generative AI Masterclass to build full-stack AI applications.
If you need help with a project, reach out at shawnesquivel24@gmail.com for a free consultation call—no strings attached.
About Your Instructor
Hi, I’m Shawn, a Generative AI solutions developer.
I’m also a highly rated instructor on Udemy, having taught AI courses to over 3,000 students.

