
[Week 3] RAG Components and Hands-On Practice

시큐리티지호 2026. 4. 5.

Overview

This session covers what RAG (Retrieval-Augmented Generation) is and which components it is made of. The goal of the hands-on part is to implement a LangChain-based RAG pipeline directly.


Required Research Items

1. What Is RAG?

RAG stands for Retrieval-Augmented Generation: an architecture that retrieves external knowledge and uses it when the LLM generates a response.

Overall flow of a RAG pipeline:

  • Indexing (document -> chunks -> vectors -> store) + Retrieval (question -> search -> generation)

[Figure: RAG pipeline structure]
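The indexing and retrieval flow can be sketched in plain Python. This is only a toy, not the LangChain pipeline used in the practice: the bag-of-words `embed` function, the paragraph-based chunking, and the three sample sentences are all assumptions invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(document: str) -> list:
    """Indexing: document -> chunks -> vectors -> store (one chunk per paragraph)."""
    chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list, question: str, k: int = 1) -> list:
    """Retrieval: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, contexts: list) -> str:
    """Generation input: only the retrieved chunks are placed in the prompt."""
    joined = "\n".join(contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Made-up sample text, not taken from any real document.
document = (
    "Outpatient visits require a small copayment.\n\n"
    "Hospital meals are partly covered by the program.\n\n"
    "Applications are filed at the local community center."
)
index = build_index(document)
top_chunks = retrieve(index, "Are hospital meals covered?", k=1)
prompt = build_prompt("Are hospital meals covered?", top_chunks)
```

In a real pipeline the `prompt` string would be sent to an LLM; here it just shows that only the retrieved chunk reaches the model.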

 

Why use RAG

  1. Access to up-to-date and external data
    • A plain LLM is limited to its training data, whereas RAG can draw on recent information and internal data (PDFs, databases, wikis, etc.)
  2. Fewer hallucinations and higher accuracy
    • Answers are grounded in actual documents, which reduces the problem of generating information that does not exist
  3. Evidence-based responses
    • The source of an answer can be cited explicitly, which builds trust

Difference from the System Prompt approach

  • The System Prompt approach puts the entire document into the prompt, so the model also processes irrelevant context. Token cost goes up, and information overload can lower response accuracy.
  • RAG retrieves only the data needed for each question, and the LLM generates its answer from that.
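A rough back-of-the-envelope comparison of the two approaches, assuming a hypothetical document of 100 chunks with 50 words each and a retriever that returns 3 chunks. Word counts stand in for tokens here, so the numbers only indicate the order of magnitude, not real billing.

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def system_prompt_input(chunks: list, question: str) -> int:
    """System Prompt approach: the whole document is sent with every question."""
    return sum(approx_tokens(c) for c in chunks) + approx_tokens(question)


def rag_input(chunks: list, question: str, k: int) -> int:
    """RAG approach: only k retrieved chunks are sent (the first k stand in for them)."""
    return sum(approx_tokens(c) for c in chunks[:k]) + approx_tokens(question)


# Hypothetical sizes: 100 chunks of 50 words each.
chunks = [" ".join(["word"] * 50) for _ in range(100)]
question = "what is the copayment for an outpatient visit"

full_cost = system_prompt_input(chunks, question)
rag_cost = rag_input(chunks, question, k=3)
```

Under these assumptions the RAG input is a few percent of the full-document input, and the gap widens as the document grows while `k` stays fixed.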

 

References:

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
https://arxiv.org/abs/2005.11401

Retrieval-Augmented Generation for Large Language Models: A Survey
https://arxiv.org/abs/2312.10997

 

2. Terms Used in RAG

Term | Description
Chunking | How a document is split. chunk_size = chunk length, chunk_overlap = the region shared by adjacent chunks
Embedding | Converting text into a vector, e.g. "puppy" gets a vector close to that of "dog"
Vector Store | A store that keeps vectors and supports similarity search (FAISS, Chroma, etc.)
Retriever | The component that searches for chunks similar to the question. Top-K = the K most similar documents
Generation | The step in which the LLM generates an answer from the retrieved chunks
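To make chunk_size and chunk_overlap concrete, here is a minimal character-based sliding-window chunker. It is a simplified sketch: LangChain's RecursiveCharacterTextSplitter also tries to break on separators such as newlines, which this version does not do, and the function name `chunk_text` is made up for the example.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
# chunks == ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.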

 

References:

OpenAI Embeddings Guide
https://developers.openai.com/api/docs/guides/embeddings

FAISS Getting Started
https://github.com/facebookresearch/faiss/wiki/Getting-started

 

3. LangChain-Based RAG Pipeline Structure

Step | LangChain component
Document loading | Document Loader (PyPDFLoader, etc.)
Chunking | Text Splitter (RecursiveCharacterTextSplitter, etc.)
Embedding | Embeddings (OpenAIEmbeddings, etc.)
Vector storage | Vector Store (FAISS, Chroma, etc.)
Retrieval + generation | Retriever + Chain

  • Document loading
    • Role: convert external data into a form LangChain can process
    • Typical components:
      • PyPDFLoader -> PDF
      • TextLoader -> txt
      • CSVLoader -> CSV
      • WebBaseLoader -> web pages
  • Chunking
    • Role: split long documents into small units the LLM can process
    • Typical components:
      • RecursiveCharacterTextSplitter
    • Why it is needed:
      • LLM token limits
      • Better retrieval accuracy
  • Embedding
    • Role: convert text into a meaning-based vector
      • Enables search by semantic similarity rather than exact word matches
    • Typical components:
      • OpenAIEmbeddings
      • HuggingFaceEmbeddings
  • Vector storage (Vector Store)
    • Role: store the embedded data and make it searchable
    • Typical components:
      • FAISS (local, fast)
      • Chroma (simple, local DB)
      • Pinecone (cloud)
  • Retrieval + Generation
    • Retrieval: question -> search for related documents
    • Chain: connects the LLM and generates an answer from the retrieved documents
      • Typical chains
        • RetrievalQA
        • ConversationalRetrievalChain
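What the vector store and retriever do together can be illustrated with a tiny in-memory store. The class `ToyVectorStore` and the hand-written two-dimensional vectors are assumptions for the example; FAISS or Chroma would hold real embedding vectors and use optimized indexes instead of a full sort.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs and answers top-k similarity queries by brute force."""

    def __init__(self):
        self._items = []

    def add(self, text: str, vector: list) -> None:
        self._items.append((text, vector))

    def top_k(self, query_vector: list, k: int) -> list:
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("dog", [1.0, 0.0])     # hand-made vectors, not model output
store.add("puppy", [0.9, 0.1])   # deliberately close to "dog"
store.add("car", [0.0, 1.0])     # deliberately far from both

nearest = store.top_k([1.0, 0.0], k=2)
# nearest == ['dog', 'puppy']
```

The Top-K setting is simply the `k` argument: a larger value passes more chunks to the LLM, trading extra tokens for a better chance that the right evidence is included.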

 

References:

Build a RAG agent with LangChain
https://docs.langchain.com/oss/python/langchain/rag

Starter Tutorial (Using OpenAI)
https://developers.llamaindex.ai/python/framework/getting_started/starter_example/

 


 

Hands-On Practice

1. Data Used

  • "2024 알기 쉬운 의료급여제도" (2024 Easy Guide to the Medical Aid System) PDF

2. Hypotheses

  • Compared with the System Prompt approach, RAG does not feed in the whole document. It selects only the information relevant to the question, so token usage and cost are expected to drop substantially.
  • Because answers are generated from retrieved evidence documents, the basis for each answer should be clearer and trustworthiness should improve.
  • Accuracy, however, may vary with the following factors:
    • Chunking strategy
    • Embedding model quality
    • The retriever's Top-K setting
    • The LLM's comprehension and generation ability
  • The Medical Aid PDF contains many tables, so keeping table-based content intact as meaningful units will likely be the hardest part.
  • *Evidence retrieval is more likely to fail as Golden Dataset questions move toward the hard level.
    • A Golden Dataset is the objective quality benchmark for a RAG pipeline. Without one, the following situations arise:
      1. The chunking strategy is changed, but there is no way to tell whether results got better or worse
      2. The embedding model is replaced, but there is no way to check whether only the cost rose with no improvement
      3. When a model swap, prompt change, or change in context retrieval method causes a problem, there is no way to detect it
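One way a Golden Dataset turns "better or worse?" into a number is a hit rate: the share of questions whose expected evidence appears in the top-k retrieved chunks. The sketch below assumes each golden item has hypothetical `question` and `evidence` fields and uses two stand-in retrievers in place of real pipeline configurations.

```python
def hit_rate_at_k(golden: list, retrieve, k: int) -> float:
    """Fraction of golden questions whose evidence string appears in the top-k chunks."""
    hits = 0
    for item in golden:
        chunks = retrieve(item["question"], k)
        if any(item["evidence"] in chunk for chunk in chunks):
            hits += 1
    return hits / len(golden)


# Placeholder golden items; the field names are assumptions for this sketch.
golden = [
    {"question": "question A", "evidence": "evidence A"},
    {"question": "question B", "evidence": "evidence B"},
]


def retriever_before(question: str, k: int) -> list:
    """Stand-in for the pipeline before a change: always returns the same chunks."""
    return ["chunk containing evidence A", "unrelated chunk"][:k]


def retriever_after(question: str, k: int) -> list:
    """Stand-in for the pipeline after a change: returns a matching chunk per question."""
    lookup = {
        "question A": ["chunk containing evidence A"],
        "question B": ["chunk containing evidence B"],
    }
    return lookup[question][:k]


score_before = hit_rate_at_k(golden, retriever_before, k=2)
score_after = hit_rate_at_k(golden, retriever_after, k=2)
```

Running the same golden set before and after a chunking or embedding change gives two comparable scores, which is exactly what is missing in the three situations listed above.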

 

3. Procedure

Step 1: Build the Golden Dataset

  • easy: 2 (single condition), medium: 2 (combination of 2-3 conditions), hard: 1 (multiple conditions + calculation or exception)
  • See golden_dataset.jsonl on GitHub
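For orientation, a JSONL file holds one JSON object per line. The snippet below loads such a file and counts questions per difficulty. The field names (`id`, `difficulty`, `question`, `expected_evidence`) and the placeholder values are assumptions; the actual schema of golden_dataset.jsonl in the repository may differ.

```python
import io
import json
from collections import Counter

# Hypothetical JSONL content with placeholder text, mirroring the 2/2/1 split above.
SAMPLE_JSONL = """\
{"id": "q1", "difficulty": "easy", "question": "<question 1>", "expected_evidence": "<evidence 1>"}
{"id": "q2", "difficulty": "easy", "question": "<question 2>", "expected_evidence": "<evidence 2>"}
{"id": "q3", "difficulty": "medium", "question": "<question 3>", "expected_evidence": "<evidence 3>"}
{"id": "q4", "difficulty": "medium", "question": "<question 4>", "expected_evidence": "<evidence 4>"}
{"id": "q5", "difficulty": "hard", "question": "<question 5>", "expected_evidence": "<evidence 5>"}
"""


def load_golden(fp) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in fp if line.strip()]


golden = load_golden(io.StringIO(SAMPLE_JSONL))
by_difficulty = Counter(item["difficulty"] for item in golden)
```

With a real file, `io.StringIO(SAMPLE_JSONL)` would be replaced by `open("golden_dataset.jsonl", encoding="utf-8")`.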

 

Step 2: Build the RAG indexing pipeline (two approaches)

  • PDF
    1. Load the PDF with PyPDFLoader
    2. Chunking
      • chunk_size=300
      • chunk_overlap=50
    3. Embedding
      • Use OpenAIEmbeddings
    4. Store the vectors in FAISS

  • Markdown (md)
    1. Load the md file with TextLoader
    2. Chunking
      • chunk_size=300
      • chunk_overlap=50
      • Split on "---", which separates the subheading sections
    3. Embedding
      • Use OpenAIEmbeddings
    4. Store the vectors in FAISS
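The chunking step of the Markdown approach can be approximated without LangChain as follows: split on "---" lines first, then apply a 300/50 character window only to sections that are still too long. This is a sketch of the idea, not the code from the repository; the function names and the sample text are made up.

```python
def split_on_separator(markdown: str, separator: str = "---") -> list:
    """Split a Markdown string into sections at lines that consist only of the separator."""
    sections, current = [], []
    for line in markdown.splitlines():
        if line.strip() == separator:
            sections.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def window(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Fixed-size character windows that overlap by chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return pieces


def chunk_markdown(markdown: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list:
    """Keep short sections whole; window only the sections longer than chunk_size."""
    chunks = []
    for section in split_on_separator(markdown):
        if len(section) <= chunk_size:
            chunks.append(section)
        else:
            chunks.extend(window(section, chunk_size, chunk_overlap))
    return chunks


# Made-up input: one short section and one 705-character section.
sample = "## A\nshort text\n---\n## B\n" + "x" * 700
chunks = chunk_markdown(sample)
```

Note that a long section containing a table is still cut by the character window, so a section boundary alone does not guarantee that tables stay intact.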

 

 

Step 3: Check retrieval quality

  • First test (pdf) summary
Question | Difficulty | Retrieval result
q1 | easy | Success: Top 1 returned the exact evidence
q2 | easy | Retrieval failed
q3 | medium | Top 1 is recognizably relevant, but the evidence is slightly insufficient
q4 | medium | Retrieval failed
q5 | hard | The evidence is ambiguous overall; should be counted as a failure
  • See retrieval_results_pdf.txt on GitHub for detailed results

  • Second test (md) summary
    • Parsed to md, then split into chunks on the --- separator
Question | Difficulty | Retrieval result
q1 | easy | Success: Top 1 returned the exact evidence
q2 | easy | Retrieval failed
q3 | medium | Success: Top 1 returned the exact evidence
q4 | medium | Retrieval failed
q5 | hard | Correct only for the meal-cost part; the other half is wrong
  • See retrieval_results_md.txt on GitHub for detailed results

Tool referenced for the md conversion:

PaddlePaddle/PaddleOCR-VL-1.5 · Hugging Face
https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5?ref=blog.aibox.today

 

4. Results

  • As expected, correct evidence was retrieved for some of the easy and medium questions (q1 in both tests, q3 in the md test), while q2 and q4 failed in both tests.
  • Looking at the failures, two causes stand out.
    1. Loader issue
      • With the PDF approach, the default loader does not recognize tables properly.
      • Fixes
        • Load a version of the document converted to Markdown
        • Switch the loader -> pdfplumber (strong at table extraction)
    2. Chunk issue
      • As in the first test, loading the PDF directly can break tables, which limits what chunking can achieve.
      • To address this, the second test chunked the md by subheading sections that include the tables, but this time the chunks were too large and accuracy did not improve much.

5. Insights

  1. Medical Aid material is table-heavy, so it should be chunked per table. Cutting a table into fixed-size pieces like ordinary text breaks its rows and columns.
  2. Converting the Medical Aid data to Markdown is far more effective. Alternatively, switch the loader.
  3. The chunk size has to be varied several times to find the point with the best performance.
    • Too large: unrelated information gets mixed in
    • Too small: context is cut off
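The first insight, chunking per table, could look like the following for Markdown input: consecutive lines that start with "|" are kept together as one chunk, and the surrounding prose becomes separate chunks. This is a minimal sketch under the assumption that tables use pipe syntax; the function name and the sample table are invented for illustration.

```python
def split_keeping_tables(markdown: str) -> list:
    """Split Markdown into blocks so that each pipe table stays in a single block."""
    blocks = []
    current = []
    in_table = False
    for line in markdown.splitlines():
        is_table_line = line.lstrip().startswith("|")
        # Close the current block whenever we cross a table boundary.
        if current and is_table_line != in_table:
            blocks.append("\n".join(current).strip())
            current = []
        in_table = is_table_line
        current.append(line)
    if current:
        blocks.append("\n".join(current).strip())
    return [b for b in blocks if b]


# Made-up sample with placeholder cells.
sample = (
    "Intro text.\n"
    "| Item | Value |\n"
    "| --- | --- |\n"
    "| A | 1 |\n"
    "Closing note."
)
blocks = split_keeping_tables(sample)
# blocks[1] holds the whole table: header, separator row, and data row.
```

Prose blocks can then be windowed with whatever chunk size performs best on the Golden Dataset, while table blocks are embedded whole so that a row never loses its header.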

Test GitHub repository

https://github.com/jasonpark112/aiagent-repo/tree/jasonpark112/week-3/jasonpark112

 
