Implementing RAG on an enterprise knowledge base: practical guide

Practical tutorial for implementing RAG on an Italian enterprise knowledge base in 2026: chunking, embeddings, vector store, retrieval, native prompt. Ready-to-use TypeScript code.

May 15, 2026 Adrian Ciocaianu 11 min

RAG (Retrieval-Augmented Generation) is the pattern that lets an LLM answer from specific enterprise documentation without fine-tuning the model. In 2026 it is the default strategy for building AI assistants that know your context. Let’s see how to implement it concretely with a modern stack, ready-to-use TypeScript code, and the details that separate a RAG that works from one that hallucinates.

What you will get at the end

A Node.js service that:

Indexes enterprise documents (PDF, Word, Markdown, web pages) in a vector database
On a user question, retrieves the most relevant passages
Builds a prompt from the retrieved passages and passes it to an LLM
Returns an answer in the users' native language (Italian here), grounded in the documents, with source citations

Target stack:

Node.js + TypeScript
PostgreSQL + pgvector as vector store (alternative to Pinecone/Weaviate, cheaper and self-hosted)
OpenAI Embeddings API (text-embedding-3-small) for embeddings
Claude 3.5 Sonnet or GPT-4o for final generation

Prerequisites

Node.js 22+ and pnpm/npm
PostgreSQL 15+ with pgvector extension installed
An OpenAI API key (for embeddings) and an Anthropic or OpenAI API key (for generation)
Enterprise documents to index: PDF, Markdown, text files

Step 1: environment setup

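mkdir rag-knowledge-base && cd rag-knowledge-base
pnpm init
# ES modules are required for the top-level await used in src/indexing.ts and src/retrieval.ts
npm pkg set type=module
pnpm add openai @anthropic-ai/sdk pg dotenv zod pdf-parse markdown-it
pnpm add -D typescript tsx @types/pg @types/node
# "bundler" resolution accepts the extensionless relative imports used below (the files are run with tsx)
npx tsc --init --target es2022 --module esnext --moduleResolution bundler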

PostgreSQL setup with pgvector:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  source VARCHAR(500) NOT NULL,
  title VARCHAR(500),
  content TEXT NOT NULL,
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE document_chunks (
  id SERIAL PRIMARY KEY,
  document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index INTEGER NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1536),
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

CREATE INDEX ON document_chunks (document_id);

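vector(1536) because text-embedding-3-small produces 1536-dimensional embeddings. The ivfflat index speeds up search on large volumes (above 100k chunks), at the cost of approximate results. lists = 100 fits a table of roughly 100k chunks and should grow with the row count. Create the index after loading the initial data, since ivfflat derives its clusters from the rows present at build time.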
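
For sizing lists as the table grows, the pgvector README suggests starting from rows / 1000 up to 1M rows and sqrt(rows) above that. A minimal sketch of that rule of thumb follows; the function name suggestIvfflatLists is hypothetical, and the result is a starting value to tune against your own recall measurements, not a guarantee.

```typescript
// Hypothetical helper: starting value for the ivfflat `lists` parameter,
// following the rule of thumb in the pgvector README.
export function suggestIvfflatLists(rowCount: number): number {
  if (!Number.isFinite(rowCount) || rowCount < 0) {
    throw new RangeError('rowCount must be a non-negative finite number')
  }
  const lists =
    rowCount <= 1_000_000
      ? Math.round(rowCount / 1000) // up to 1M rows: rows / 1000
      : Math.round(Math.sqrt(rowCount)) // above 1M rows: sqrt(rows)
  return Math.max(1, lists) // an ivfflat index needs at least one list
}
```

With 100,000 chunks this gives 100, the value used in the CREATE INDEX statement above.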

Step 2: document loading and chunking

Chunking is the step that most affects RAG quality. Chunks that are too small lack context; chunks that are too large pull noise into retrieval.

Empirical parameters for an Italian knowledge base:

Chunk size: 800-1,200 characters (~200-300 Italian tokens)
Overlap between chunks: 150-200 characters (~30-50 tokens)
Strategy: semantic chunking where possible (respects paragraphs and sections), fixed-size as fallback

// src/chunking.ts

export interface Chunk {
  content: string
  index: number
  metadata?: Record<string, unknown>
}

const CHUNK_SIZE = 1000
const CHUNK_OVERLAP = 180

/**
 * Semantic chunking: tries to respect paragraphs.
 * Falls back to fixed-size chunking if paragraphs are too long.
 */
export function chunkText(text: string): Chunk[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)

  const chunks: Chunk[] = []
  let currentChunk = ''
  let chunkIndex = 0

  for (const para of paragraphs) {
    // If the paragraph alone exceeds CHUNK_SIZE, we split it fixed-size
    if (para.length > CHUNK_SIZE) {
      // First close the current chunk if not empty
      if (currentChunk.length > 0) {
        chunks.push({ content: currentChunk.trim(), index: chunkIndex++ })
        currentChunk = ''
      }
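      // Then split the long paragraph into overlapping fixed-size windows
      for (let i = 0; i < para.length; i += CHUNK_SIZE - CHUNK_OVERLAP) {
        chunks.push({
          content: para.slice(i, i + CHUNK_SIZE).trim(),
          index: chunkIndex++,
        })
        // Stop once the window reaches the end of the paragraph; otherwise
        // the next iteration could emit a tail made only of overlap
        if (i + CHUNK_SIZE >= para.length) break
      }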
      continue
    }

    // If adding the paragraph would exceed CHUNK_SIZE, close the current chunk
    if (currentChunk.length + para.length > CHUNK_SIZE) {
      chunks.push({ content: currentChunk.trim(), index: chunkIndex++ })
      // Overlap: carry the last CHUNK_OVERLAP characters into the next chunk
      currentChunk = currentChunk.slice(-CHUNK_OVERLAP)
    }

    currentChunk += (currentChunk.length > 0 ? '\n\n' : '') + para
  }

  if (currentChunk.length > 0) {
    chunks.push({ content: currentChunk.trim(), index: chunkIndex++ })
  }

  return chunks
}

For PDF and Markdown:

// src/loaders.ts
import { readFile } from 'node:fs/promises'
import pdfParse from 'pdf-parse'

export async function loadPdf(filePath: string): Promise<string> {
  const buffer = await readFile(filePath)
  const parsed = await pdfParse(buffer)
  return parsed.text
}

export async function loadMarkdown(filePath: string): Promise<string> {
  return await readFile(filePath, 'utf-8')
}

export async function loadText(filePath: string): Promise<string> {
  return await readFile(filePath, 'utf-8')
}

Step 3: embeddings generation

// src/embeddings.ts
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

const EMBEDDING_MODEL = 'text-embedding-3-small'
const BATCH_SIZE = 100

export async function embedTexts(texts: string[]): Promise<number[][]> {
  const allEmbeddings: number[][] = []

  // Batch to avoid API timeouts
  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE)
    const response = await openai.embeddings.create({
      model: EMBEDDING_MODEL,
      input: batch,
    })
    allEmbeddings.push(...response.data.map((d) => d.embedding))
  }

  return allEmbeddings
}

export async function embedText(text: string): Promise<number[]> {
  const [embedding] = await embedTexts([text])
  return embedding
}

text-embedding-3-small is one of the most cost-effective embedding models in 2026: dimension 1536, cost ~0.02 USD per 1M tokens. For medium knowledge bases (10k-100k chunks of ~250 tokens each, i.e. 2.5M-25M tokens) the initial indexing cost is roughly 0.05-0.50 USD in total.
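
The indexing cost follows directly from the chunk count, the average tokens per chunk, and the price per token. A minimal sketch of that arithmetic, assuming ~250 tokens per chunk (the middle of the chunk size recommended in Step 2) and the price quoted above; estimateEmbeddingCostUsd is a hypothetical helper, and actual billing depends on the real token counts.

```typescript
// Assumed price for text-embedding-3-small, as quoted in the text.
const USD_PER_MILLION_TOKENS = 0.02

// Hypothetical helper: rough one-off cost of embedding a whole corpus.
export function estimateEmbeddingCostUsd(
  chunkCount: number,
  avgTokensPerChunk: number,
  usdPerMillionTokens: number = USD_PER_MILLION_TOKENS,
): number {
  const totalTokens = chunkCount * avgTokensPerChunk
  return (totalTokens / 1_000_000) * usdPerMillionTokens
}

// 10k chunks at ~250 tokens each  -> about 0.05 USD
// 100k chunks at ~250 tokens each -> about 0.50 USD
```

Re-indexing costs the same amount each time, so frequent full re-indexing of a large corpus is where this number starts to matter.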

Step 4: document indexing

// src/indexing.ts
import { Client } from 'pg'
import { chunkText } from './chunking'
import { embedTexts } from './embeddings'

const dbClient = new Client({ connectionString: process.env.DATABASE_URL })
await dbClient.connect()

export async function indexDocument(
  source: string,
  title: string,
  content: string,
  metadata?: Record<string, unknown>,
): Promise<number> {
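  // 1. Chunk and embed first, so a failed API call leaves no orphan rows
  const chunks = chunkText(content)
  const embeddings = await embedTexts(chunks.map((c) => c.content))

  // 2. Write the document and its chunks in a single transaction
  await dbClient.query('BEGIN')
  try {
    const docResult = await dbClient.query<{ id: number }>(
      'INSERT INTO documents (source, title, content, metadata) VALUES ($1, $2, $3, $4) RETURNING id',
      [source, title, content, metadata ?? {}],
    )
    const documentId = docResult.rows[0].id

    // 3. Insert the chunks one row at a time; pgvector accepts the
    //    embedding as a '[x,y,...]' string, which JSON.stringify produces
    for (let i = 0; i < chunks.length; i++) {
      await dbClient.query(
        `INSERT INTO document_chunks (document_id, chunk_index, content, embedding, metadata)
         VALUES ($1, $2, $3, $4, $5)`,
        [
          documentId,
          chunks[i].index,
          chunks[i].content,
          JSON.stringify(embeddings[i]),
          chunks[i].metadata ?? {},
        ],
      )
    }

    await dbClient.query('COMMIT')
    return documentId
  } catch (err) {
    // Undo the partial write, then surface the original error
    await dbClient.query('ROLLBACK')
    throw err
  }
}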

Step 5: retrieval

// src/retrieval.ts
import { Client } from 'pg'
import { embedText } from './embeddings'

const dbClient = new Client({ connectionString: process.env.DATABASE_URL })
await dbClient.connect()

export interface RetrievedChunk {
  content: string
  source: string
  title: string | null
  similarity: number
  document_id: number
}

const TOP_K = 5
const SIMILARITY_THRESHOLD = 0.65

export async function retrieve(
  query: string,
  topK: number = TOP_K,
  threshold: number = SIMILARITY_THRESHOLD,
): Promise<RetrievedChunk[]> {
  const queryEmbedding = await embedText(query)

  const result = await dbClient.query<RetrievedChunk>(
    `SELECT
       c.content,
       d.source,
       d.title,
       1 - (c.embedding <=> $1::vector) AS similarity,
       c.document_id
     FROM document_chunks c
     JOIN documents d ON d.id = c.document_id
     WHERE 1 - (c.embedding <=> $1::vector) > $2
     ORDER BY c.embedding <=> $1::vector
     LIMIT $3`,
    [JSON.stringify(queryEmbedding), threshold, topK],
  )

  return result.rows
}

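The pgvector <=> operator computes cosine distance, and 1 - cosine_distance = cosine_similarity. The 0.65 threshold drops weakly relevant chunks that would only add noise to the prompt. Treat it as a starting point and calibrate it on real queries, since typical similarity scores vary from one embedding model to another.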
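
To make the relationship between distance and similarity concrete, here is a small in-memory sketch of the same computation. It is useful for spot-checking scores or unit-testing thresholds without a database. The function names are illustrative; pgvector does this work inside PostgreSQL.

```typescript
// Cosine similarity between two vectors of equal length.
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('vectors must be non-empty and of equal length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  a.forEach((x, i) => {
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  })
  if (normA === 0 || normB === 0) {
    throw new Error('cosine similarity is undefined for a zero vector')
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Cosine distance, the quantity the <=> operator returns: 1 - similarity.
export function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b)
}
```

Identical directions give similarity 1 (distance 0), orthogonal vectors give similarity 0 (distance 1), and opposite directions give similarity -1 (distance 2).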

Step 6: generation with native prompt

// src/generation.ts
import Anthropic from '@anthropic-ai/sdk'
import { retrieve, type RetrievedChunk } from './retrieval'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

const MODEL = 'claude-3-5-sonnet-20241022'

function buildPrompt(question: string, chunks: RetrievedChunk[]): string {
  const sources = chunks
    .map(
      (c, i) =>
        `[Source ${i + 1}: ${c.title ?? c.source}]\n${c.content.trim()}`,
    )
    .join('\n\n---\n\n')

  return `Question: ${question}\n\nAvailable context:\n\n${sources}`
}

export interface RagResponse {
  answer: string
  sources: Array<{ source: string; title: string | null; document_id: number }>
}

export async function answer(question: string): Promise<RagResponse> {
  const chunks = await retrieve(question)

  if (chunks.length === 0) {
    return {
      answer:
        'I did not find information in the available documentation to answer this question. Rephrase it more specifically or check whether the topic is covered in our knowledge base.',
      sources: [],
    }
  }

  const userPrompt = buildPrompt(question, chunks)

  const response = await anthropic.messages.create({
    model: MODEL,
    max_tokens: 800,
    temperature: 0.2,
    system: `You are an assistant who answers questions based exclusively on the provided context.

Binding rules:
- Answer in formal, concise, professional Italian (max 250 words unless the question requires detail).
- Use only the information in the context. If the context is not enough for a complete answer, state it explicitly.
- When you cite information, indicate the source in square brackets (e.g. [Source 2]).
- Do not invent. Do not add information that is not in the context.
- If the question is ambiguous, ask for clarification instead of guessing.
- Do not assume knowledge external to the context (do not take for granted that the reader knows things that are not written).`,
    messages: [{ role: 'user', content: userPrompt }],
  })

  const answerText =
    response.content[0].type === 'text' ? response.content[0].text : ''

  // Deduplicate sources (different chunks of the same document)
  const uniqueSources = Array.from(
    new Map(
      chunks.map((c) => [
        c.document_id,
        { source: c.source, title: c.title, document_id: c.document_id },
      ]),
    ).values(),
  )

  return {
    answer: answerText,
    sources: uniqueSources,
  }
}

Three details worth the ink:

temperature: 0.2 for consistent, low-variance business output. RAG with high temperature = hallucination on demand.
A binding system prompt with explicit rules, written in the output language. For an Italian knowledge base, that means writing the rules above in Italian. Never settle for “in Italian style” or “professionally”: spell out the rules themselves.
Source citation explicitly required in the prompt and deduplicated in the final response: the user sees which documents were used.
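
Since the prompt asks for [Source N] markers, the answer can also be checked mechanically before it reaches the user. A minimal sketch, assuming the marker format produced by buildPrompt above; checkCitations is a hypothetical helper, and it only verifies that the cited numbers exist, not that the cited passage supports the claim.

```typescript
export interface CitationCheck {
  cited: number[] // distinct source numbers that exist in the context
  invalid: number[] // cited numbers with no matching source
}

// Hypothetical guardrail: parse [Source N] markers from the model's answer
// and separate those that match a retrieved chunk from those that do not.
export function checkCitations(
  answer: string,
  sourceCount: number,
): CitationCheck {
  const cited = new Set<number>()
  const invalid = new Set<number>()
  for (const match of answer.matchAll(/\[Source\s+(\d+)\]/g)) {
    const n = Number(match[1])
    if (n >= 1 && n <= sourceCount) cited.add(n)
    else invalid.add(n)
  }
  const ascending = (x: number, y: number) => x - y
  return {
    cited: [...cited].sort(ascending),
    invalid: [...invalid].sort(ascending),
  }
}
```

An answer with an empty cited list or a non-empty invalid list is a reasonable candidate for a retry or for a warning shown to the user.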

Step 7: end-to-end usage example

// src/example.ts
import { indexDocument } from './indexing'
import { loadPdf, loadMarkdown } from './loaders'
import { answer } from './generation'

async function main() {
  // Indexing
  const policyPdf = await loadPdf('./docs/policy-aziendale.pdf')
  await indexDocument(
    'policy-aziendale.pdf',
    'Policy Aziendale 2026',
    policyPdf,
  )

  const procMd = await loadMarkdown('./docs/procedure-onboarding.md')
  await indexDocument('procedure-onboarding.md', 'Procedure Onboarding', procMd)

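  // Q&A. The question stays in Italian, the language of the indexed documents:
  // "How many vacation days is an employee with 5 years of seniority entitled to?"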
  const result = await answer(
    'Quanti giorni di ferie spettano a un dipendente con 5 anni di anzianità?',
  )

  console.log('RISPOSTA:\n', result.answer)
  console.log('\nFONTI:')
  result.sources.forEach((s) => console.log(`  - ${s.title} (${s.source})`))
}

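// Exit explicitly: the pg clients opened in indexing.ts and retrieval.ts
// keep the event loop alive, so the script would otherwise never terminate
main()
  .then(() => process.exit(0))
  .catch((err) => {
    console.error(err)
    process.exit(1)
  })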

Common pitfalls

1. Chunks too small or too large. 200-character chunks = too narrow context, the LLM has fragments without meaning. 3,000-character chunks = too much noise in the prompt, less focused answer. The Italian sweet spot is 800-1,200 characters with 150-200 overlap.

2. Similarity threshold too low. Below 0.6 cosine similarity, retrieved chunks are often noise. Above 0.85, you risk finding nothing for generic queries. 0.65-0.75 is the typical useful range.

3. No citation in answers. Without forcing the LLM to cite sources, the user cannot verify. Verifiability is the line between a trustworthy RAG and a RAG that “seems” trustworthy.

4. Mixing languages. Italian documents queried in English retrieve noticeably worse, even with multilingual embedding models. Translating the query into the language of the documents before embedding it significantly improves quality.

5. Ignoring metadata. Filtering by tag/date/department before the vector search improves quality and speed. Example: “HR questions” -> search only in chunks with metadata->>'department' = 'HR'.

6. Wrong vector index for the scale. Below 100k chunks, even a sequential brute-force scan works acceptably. ivfflat is fine up to around 5M chunks. Above that, hnsw is usually the better choice: it gives a better speed/recall trade-off, at the cost of slower index builds and more memory.

7. Non-automatic re-indexing. Enterprise documents change. Without a pipeline that re-indexes when documents change, after 6 months RAG answers with outdated information.
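
The metadata filter from pitfall 5 can be sketched as a variant of the retrieval query. This version filters on the documents table, because indexDocument stores the caller's metadata there; the department key and the buildDepartmentQuery name are assumptions for illustration.

```typescript
export interface SqlQuery {
  text: string
  values: unknown[]
}

// Hypothetical helper: top-K retrieval restricted to one department.
// Assumes documents were indexed with a `department` key in their metadata.
export function buildDepartmentQuery(
  queryEmbedding: number[],
  department: string,
  topK: number,
): SqlQuery {
  const text = `SELECT
       c.content,
       d.source,
       d.title,
       1 - (c.embedding <=> $1::vector) AS similarity,
       c.document_id
     FROM document_chunks c
     JOIN documents d ON d.id = c.document_id
     WHERE d.metadata->>'department' = $2
     ORDER BY c.embedding <=> $1::vector
     LIMIT $3`
  // The embedding is passed as a '[x,y,...]' string, as in retrieve()
  return { text, values: [JSON.stringify(queryEmbedding), department, topK] }
}
```

The result can be passed to the pg client as dbClient.query(q.text, q.values). All three values are bound as parameters, so the department name never gets concatenated into the SQL string.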

Variations of the approach

Hybrid search (vector + keyword). Combine cosine similarity (semantic) with keyword search such as BM25 or full-text search (exact lexicon) for scenarios where specific keywords matter (product codes, proper names, employee IDs). PostgreSQL covers both in one database: pgvector for vectors and built-in full-text search (tsvector, ts_rank) for keywords, although ts_rank is not BM25. Combined reranking can improve quality by 10-20% in real business scenarios.
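
The text does not prescribe how to merge the two result lists. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used here. It needs only the rank of each chunk in each list, so vector and keyword scores never have to be put on the same scale. The chunk IDs are hypothetical strings, and each list is assumed to contain no duplicates.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// with rank starting at 1. k = 60 is a commonly used constant.
export function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60,
): string[] {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const contribution = 1 / (k + position + 1)
      scores.set(id, (scores.get(id) ?? 0) + contribution)
    })
  }
  // Highest fused score first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
}
```

Chunks that appear in both lists rise to the top, while a chunk found by only one method still keeps a place in the fused ranking.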

Re-ranking with a small LLM. Retrieve 15-20 chunks with vector search, then pass to a small LLM (gpt-4o-mini, Claude Haiku) to pick the 3-5 best before final generation. Improves quality by 10-15% at the cost of an additional LLM call.

Self-hosted embeddings. For very sensitive data, self-hosted open-source embedding models (Nomic Embed Text v1.5, Mistral Embed, BAAI/bge-m3) eliminate data transfer to OpenAI. Expect somewhat lower quality (roughly 10-15%), similar latency, and GPU costs of 50-200 euros/month, in exchange for maximum compliance.

Conversational RAG. Maintain conversation context (previous questions + answers) to handle follow-ups (“tell me more”, “expand on point X”). Requires conversation state management and query re-formulation at the retrieval stage.
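
Query re-formulation usually means asking a small LLM to turn the follow-up into a standalone question, and then embedding that question instead of the raw message. A minimal sketch of the prompt-building half, with no API call; the Turn type, the function name, and the instruction wording are all assumptions, and for an Italian assistant the instructions would be written in Italian.

```typescript
export interface Turn {
  role: 'user' | 'assistant'
  content: string
}

// Hypothetical helper: builds the instruction for a small LLM that rewrites
// a follow-up message into a standalone question before retrieval.
export function buildRewritePrompt(
  history: Turn[],
  followUp: string,
  maxTurns: number = 6,
): string {
  // Keep only the most recent turns to bound the prompt size
  const recent = maxTurns > 0 ? history.slice(-maxTurns) : []
  const transcript = recent
    .map((t) => `${t.role === 'user' ? 'User' : 'Assistant'}: ${t.content}`)
    .join('\n')
  return [
    'Rewrite the final user message as a standalone question.',
    'Resolve pronouns and references using the conversation.',
    'Return only the rewritten question.',
    '',
    'Conversation:',
    transcript,
    '',
    `Final user message: ${followUp}`,
  ].join('\n')
}
```

The rewritten question returned by the model is then what gets passed to retrieve(), while the original follow-up is still shown to the user.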

Limitations of this base approach

1. Does not update facts in real time. RAG searches indexed documents. If documents are 6 months old, answers will reflect the situation 6 months ago. For data that changes in real time (e.g. inventory, prices), an AI agent with tools is more suitable than pure RAG.

2. Bad at “global” queries. Questions like “what are our 5 best-selling products” or “summarize all HR documentation” do not work with standard RAG: they would require reading all chunks, not just the most similar ones. For these, graph RAG or hybrid approaches are needed.

3. Residual hallucination. Even with a rigorous prompt and citation, LLMs can “interpret” too creatively. An external guardrail (e.g. automatic fact-checking on specific numbers) is desirable in critical scenarios (healthcare, tax, legal).

4. Steady-state cost. For knowledge bases with 50,000+ queries/month, LLM costs (~0.01-0.05 USD/query with Claude 3.5 Sonnet) add up: 500-2,500 USD/month. For large scales, it is worth evaluating smaller or self-hosted models.

FAQ

How much does it cost to implement RAG on an average enterprise knowledge base?

For an Italian SME with 1,000-10,000 enterprise documents:

Initial setup: 15-40k euros (development + initial indexing + consultation UI).
Recurring costs: 300-1,500 euros/month (LLM, embedding refresh, PostgreSQL hosting).

Typical ROI if it replaces manual staff search: 6-15 months.

Can pgvector be used instead of Pinecone/Weaviate/Qdrant?

Yes, and for most cases it is the best choice. pgvector is an open-source extension for PostgreSQL, which you probably already run, and it scales to millions of chunks without issues. Pinecone/Weaviate/Qdrant offer superior performance at enterprise scale (tens of millions of vectors), but at SME scale they are over-engineered.

Which embedding model to choose in 2026?

OpenAI text-embedding-3-small: excellent default, low price (~0.02 USD/1M tokens), dimension 1536.
OpenAI text-embedding-3-large: about 6.5x more expensive (~0.13 USD/1M tokens), slightly better, dimension 3072. Worth it only for high-precision scenarios.
Cohere embed-multilingual-v3: good for knowledge bases in multiple languages.
Self-hosted (Nomic, Mistral, BGE-m3): for compliance / privacy. Slightly inferior quality.

For medium Italian-only knowledge bases: OpenAI text-embedding-3-small is the right choice.

How to handle documents that change often?

Three approaches:

Scheduled re-indexing: nightly, weekly, monthly. Simple, but can leave freshness gaps.
Event-driven re-indexing: webhook from the document management system (Confluence, Notion, SharePoint) that triggers re-indexing of the modified document. More complex but more up-to-date.
Versioning: keep multiple versions of the same document, filter by date at retrieval time.

The choice depends on how much the documentary base changes and how critical freshness is.
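
Whichever approach triggers the re-indexing, it helps to skip documents whose content has not actually changed, so that a nightly job or a noisy webhook does not re-embed the whole corpus. A minimal sketch using a content fingerprint; where the previous hash is stored (for example under a content_hash key in documents.metadata) is an assumption, as are the function names.

```typescript
import { createHash } from 'node:crypto'

// SHA-256 fingerprint of a document's text.
export function contentHash(content: string): string {
  return createHash('sha256').update(content, 'utf8').digest('hex')
}

// Hypothetical check: re-index only when the stored fingerprint is missing
// or differs from the fingerprint of the freshly loaded content.
export function needsReindex(
  storedHash: string | undefined,
  content: string,
): boolean {
  return storedHash !== contentHash(content)
}
```

When needsReindex returns true, delete the old documents row (the ON DELETE CASCADE removes its chunks) and call indexDocument again with the new content.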

Can RAG be used to support code / technical documentation?

Yes, but with care. Code and technical documentation have specific structures (hierarchies, cross-references, snippets with syntax). Generalist embeddings tend to mix technical prose and code suboptimally. Solutions: code-specific embedding models (e.g. CodeT5, BAAI/bge-code-v1) and chunking that respects function/class boundaries.

How is the quality of a RAG system measured?

Standard metrics:

Precision@K: of the top K retrieved chunks, how many are actually relevant?
Recall@K: of the total relevant chunks, how many are retrieved in the top K?
Faithfulness: is the answer actually supported by the cited sources?
Answer relevance: is the answer pertinent to the question?

Evaluation tools such as RAGAS and TruLens compute these metrics automatically. A reasonable baseline for a well-built RAG: Precision@5 > 0.75, Faithfulness > 0.85.
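
The two retrieval metrics are simple enough to compute by hand on a small labeled set of questions before adopting a full evaluation framework. A minimal sketch, assuming each chunk has a string ID, the retrieved list has no duplicates, and the relevant set comes from manual labeling.

```typescript
// Precision@K: share of the top K retrieved chunk IDs that are relevant.
export function precisionAtK(
  retrieved: string[],
  relevant: ReadonlySet<string>,
  k: number,
): number {
  assertPositiveInteger(k)
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length
  return hits / k
}

// Recall@K: share of all relevant chunk IDs that appear in the top K.
// Returns 0 when there are no relevant chunks, where recall is undefined.
export function recallAtK(
  retrieved: string[],
  relevant: ReadonlySet<string>,
  k: number,
): number {
  assertPositiveInteger(k)
  if (relevant.size === 0) return 0
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length
  return hits / relevant.size
}

function assertPositiveInteger(k: number): void {
  if (!Number.isInteger(k) || k <= 0) {
    throw new RangeError('k must be a positive integer')
  }
}
```

For example, if 2 of the top 5 retrieved chunks are relevant and 3 chunks are relevant in total, Precision@5 is 0.4 and Recall@5 is about 0.67. Averaging both over the whole question set gives the figures to compare against the baseline above.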

Conclusion

Implementing RAG on an Italian enterprise knowledge base in 2026 is feasible with a modern stack, reusable code, and a contained budget. The difference between a RAG that works and one that produces hallucination lies in the details: well-tuned chunking, sensible similarity thresholds, native prompt with binding rules, mandatory source citation. Companies that invest in it get a productivity asset that returns compound savings over time.

If you are evaluating implementing RAG on your enterprise knowledge base and want support on scope and implementation choices, let’s talk. We can build a POC on a portion of your documentation in 3-5 weeks.

To go deeper: the pillar page AI agents, the dedicated page on RAG for enterprise knowledge base, and the related articles how to integrate GPT in TeamSystem for another complementary technical pattern, and AI agents vs chatbots to understand where RAG fits in the 2026 AI landscape.

Tags: agenti-ai, rag, embedding, knowledge-base, tutorial, typescript, pgvector