Building a RAG-as-a-Service Platform with Model Context Protocol

October 24, 2025

Imagine being able to sell access to your proprietary datasets directly to AI assistants like Claude or ChatGPT. Not as static file downloads, but as live, queryable knowledge bases that AI can search in real-time during conversations. This is the vision behind FLTR (Backsplash): a RAG-as-a-Service platform that makes proprietary datasets accessible to AI through the Model Context Protocol (MCP).

The Vision: "Context as a Service"

Large Language Models (LLMs) are powerful, but they're limited by their training data cutoff dates and lack of domain-specific knowledge. Retrieval Augmented Generation (RAG) solves this by allowing AI to query external knowledge bases, but implementing RAG at scale comes with challenges:

  • Complex infrastructure (vector databases, embedding pipelines, document processing)
  • High costs (GPU compute, storage, API calls)
  • Integration difficulties (connecting to AI chat interfaces)
  • Multi-tenancy concerns (isolating customer data)

FLTR solves these problems by providing dataset-as-a-service with built-in RAG capabilities and MCP integration.

What is the Model Context Protocol (MCP)?

MCP is an open standard created by Anthropic for connecting AI assistants to external data sources. Think of it as a universal adapter between LLMs and your data:

┌──────────────┐
│  Claude/GPT  │  AI Assistant
└──────┬───────┘
       │ MCP Protocol
       ▼
┌──────────────┐
│  MCP Server  │  Your Data Source
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Database   │  PostgreSQL, Milvus, etc.
└──────────────┘

Benefits:

  • Standardized interface for AI-to-data connections
  • Real-time queries during AI conversations
  • Proper citations in AI responses
  • Access control and authentication built-in
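
Under the hood, MCP messages are JSON-RPC 2.0. A tool invocation like the search tool FLTR exposes later in this post looks roughly like this on the wire (field values are illustrative):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_dataset",
    "arguments": { "dataset_id": "abc123", "query": "Q3 revenue findings" }
  }
}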

System Architecture

FLTR uses a dual-architecture approach:

┌─────────────────────────────────────────────┐
│              FLTR Architecture              │
├─────────────────────────────────────────────┤
│                                             │
│   ┌──────────────┐      ┌──────────────┐    │
│   │   Next.js    │      │   FastAPI    │    │
│   │   Frontend   │◄────►│   Backend    │    │
│   │  (Drizzle)   │      │ (SQLAlchemy) │    │
│   └──────┬───────┘      └──────┬───────┘    │
│          │                     │            │
│          ▼                     ▼            │
│   ┌──────────────┐      ┌──────────────┐    │
│   │ fltr_auth DB │      │   fltr DB    │    │
│   │ (Users/Teams)│      │  (Datasets)  │    │
│   └──────────────┘      └──────┬───────┘    │
│                                │            │
│                                ▼            │
│                         ┌──────────────┐    │
│                         │    Milvus    │    │
│                         │  (Vectors)   │    │
│                         └──────────────┘    │
│                                             │
│          ┌────────────────────┐             │
│          │     Modal.com      │             │
│          │  (Document Proc.)  │             │
│          └────────────────────┘             │
└─────────────────────────────────────────────┘

Key Components:

  1. Next.js Frontend: User dashboard, dataset management, credit purchasing
  2. FastAPI Backend: API server, MCP integration, vector search
  3. Two-Database Strategy:
    • fltr_auth: User accounts, teams, authentication
    • fltr: Datasets, documents, chunks, embeddings metadata
  4. Milvus: Vector database for semantic search
  5. Modal.com: Serverless GPU for document processing
  6. Cloudflare R2: Object storage for uploaded files

Database Design: The Two-Database Strategy

Why two databases?

User authentication and dataset operations have different requirements:

-- fltr_auth database (managed by Better Auth)
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  name TEXT,
  credits INTEGER DEFAULT 0
);

CREATE TABLE teams (
  id UUID PRIMARY KEY,
  name TEXT NOT NULL,
  owner_id UUID REFERENCES users(id)
);

-- fltr database (dataset operations)
CREATE TABLE datasets (
  id UUID PRIMARY KEY,
  name TEXT NOT NULL,
  owner_id UUID, -- References fltr_auth.users.id
  milvus_collection TEXT NOT NULL,
  chunk_size INTEGER DEFAULT 512,
  chunk_overlap INTEGER DEFAULT 50
);

CREATE TABLE documents (
  id UUID PRIMARY KEY,
  dataset_id UUID REFERENCES datasets(id),
  filename TEXT NOT NULL,
  r2_key TEXT NOT NULL, -- Cloudflare R2 path
  status TEXT DEFAULT 'pending',
  processed_at TIMESTAMP
);

CREATE TABLE chunks (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  content TEXT NOT NULL,
  milvus_id TEXT NOT NULL, -- Reference to Milvus vector
  chunk_index INTEGER,
  metadata JSONB
);

Cross-database references use UUIDs, allowing flexible deployment (same server or separate instances).
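
Because no SQL JOIN can span the two databases, the "join" happens in application code. A minimal sketch, assuming SQLAlchemy models named User (fltr_auth) and Dataset (fltr) and local connection URLs:

# cross_db.py (illustrative sketch; engine URLs and model names are assumptions)
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

auth_engine = create_engine("postgresql://localhost/fltr_auth")
data_engine = create_engine("postgresql://localhost/fltr")
AuthSession = sessionmaker(bind=auth_engine)
DataSession = sessionmaker(bind=data_engine)

def get_datasets_for_user(email: str):
    with AuthSession() as auth_db, DataSession() as db:
        user = auth_db.query(User).filter_by(email=email).first()
        if user is None:
            return []
        # Match on the shared UUID instead of a foreign key
        return db.query(Dataset).filter_by(owner_id=user.id).all()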

Document Processing Pipeline

The processing pipeline uses Modal.com for serverless GPU compute:

1. Upload to R2 (Direct from Browser)

// app/actions/upload.ts
'use server'

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'
import { nanoid } from 'nanoid'

export async function getUploadUrl(filename: string) {
  const r2 = new S3Client({
    region: 'auto',
    endpoint: process.env.R2_ENDPOINT!,
    credentials: {
      accessKeyId: process.env.R2_ACCESS_KEY_ID!,
      secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
    },
  })

  const key = `uploads/${nanoid()}/${filename}`

  const command = new PutObjectCommand({
    Bucket: process.env.R2_BUCKET!,
    Key: key,
  })

  const signedUrl = await getSignedUrl(r2, command, { expiresIn: 3600 })

  return { signedUrl, key }
}

Client-side upload (no API bottleneck):

// components/upload.tsx
const { signedUrl, key } = await getUploadUrl(file.name)

// Upload directly to R2
await fetch(signedUrl, {
  method: 'PUT',
  body: file,
})

// Trigger processing
await triggerProcessing(key)

2. Modal.com Processing

# modal_app.py
import modal

app = modal.App("fltr-processor")

@app.function(
    image=modal.Image.debian_slim().pip_install(
        "pypdf2", "sentence-transformers", "pymilvus"
    ),
    gpu="T4",  # GPU for embeddings
    timeout=600,
)
def process_document(r2_key: str, dataset_id: str):
    """
    Process a PDF: extract text, chunk, embed, store in Milvus.
    """
    # Import inside the function: these packages exist only in the Modal image
    from pymilvus import Collection
    from sentence_transformers import SentenceTransformer

    # Download from R2
    content = download_from_r2(r2_key)

    # Extract text from PDF
    text = extract_text_from_pdf(content)

    # Chunk with overlap
    chunks = chunk_text(text, size=512, overlap=50)

    # Generate embeddings (GPU-accelerated)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(chunks)

    # Store in Milvus (connection setup omitted for brevity)
    collection = Collection(f"dataset_{dataset_id}")
    collection.insert([
        chunks,
        embeddings,
        [{"chunk_index": i} for i in range(len(chunks))]
    ])

    # Mark the document as processed (helper looks the row up by its R2 key)
    update_document_status(r2_key, "completed")
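
The pipeline above leans on three helpers the post doesn't define. A minimal sketch of each, assuming boto3 against R2's S3-compatible API and pypdf2 for extraction (env var names are assumptions, and chunk size is measured in characters for simplicity):

# helpers.py (illustrative sketch)
import io
import os

import boto3
from PyPDF2 import PdfReader

def download_from_r2(r2_key: str) -> bytes:
    # R2 speaks the S3 API, so boto3 works with a custom endpoint
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["R2_ENDPOINT"],
        aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
    )
    obj = s3.get_object(Bucket=os.environ["R2_BUCKET"], Key=r2_key)
    return obj["Body"].read()

def extract_text_from_pdf(content: bytes) -> str:
    reader = PdfReader(io.BytesIO(content))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Sliding window: consecutive chunks share `overlap` characters
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]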

3. Webhook Trigger (R2 → FastAPI → Modal)

# fastapi_app/routes/webhooks.py
from fastapi import APIRouter, BackgroundTasks
import modal

router = APIRouter()

@router.post("/r2-upload-complete")
async def r2_webhook(
    data: dict,
    background_tasks: BackgroundTasks
):
    r2_key = data["key"]
    dataset_id = data["dataset_id"]

    # Trigger Modal processing asynchronously
    f = modal.Function.from_name("fltr-processor", "process_document")
    background_tasks.add_task(
        f.remote,
        r2_key=r2_key,
        dataset_id=dataset_id
    )

    return {"status": "processing"}

Vector Search with Milvus

Milvus provides high-performance vector similarity search:

Collection Strategy: Shared vs. Per-Dataset

Option 1: One collection per dataset

# Expensive! Creates overhead per dataset
collection = Collection(f"dataset_{dataset_id}")

Option 2: Shared collection with filtering

# Efficient! One collection, filter by dataset_id
collection = Collection("all_datasets")

# Search with filter (validate dataset_id before interpolating into expr)
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
    expr=f"dataset_id == '{dataset_id}'"  # Filter to specific dataset
)

Why shared collection?

  • Lower memory overhead
  • Faster startup (no need to load multiple collections)
  • Easier to scale
  • Still isolated via filtering
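
For reference, a shared collection along these lines could be created once at startup. A sketch, assuming the field names used elsewhere in this post (384 dimensions matches all-MiniLM-L6-v2):

# create_collection.py (illustrative sketch)
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType

fields = [
    FieldSchema("milvus_id", DataType.VARCHAR, is_primary=True, max_length=64),
    FieldSchema("dataset_id", DataType.VARCHAR, max_length=64),
    FieldSchema("content", DataType.VARCHAR, max_length=65535),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=384),
]
collection = Collection("all_datasets", CollectionSchema(fields))

# Vector index, then load into memory so searches can run
collection.create_index("embedding", {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 1024},
})
collection.load()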

Search Implementation

# fastapi_app/services/search.py
from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer

from fastapi_app.db import SessionLocal  # session factory for the fltr database (module path assumed)

connections.connect(host="milvus", port="19530")
collection = Collection("all_datasets")

model = SentenceTransformer('all-MiniLM-L6-v2')

def search_dataset(query: str, dataset_id: str, top_k: int = 5):
    db = SessionLocal()  # SQLAlchemy session used for the chunk lookups below
    # Generate query embedding
    query_embedding = model.encode([query])[0]

    # Search Milvus
    results = collection.search(
        data=[query_embedding],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        expr=f"dataset_id == '{dataset_id}'",
        output_fields=["chunk_id", "content", "metadata"]
    )

    # Fetch full chunk content from PostgreSQL
    chunks = []
    for hit in results[0]:
        chunk = db.query(Chunk).filter_by(
            milvus_id=hit.id
        ).first()

        chunks.append({
            "content": chunk.content,
            "score": hit.distance,
            "metadata": chunk.metadata,
            "document": chunk.document.filename
        })

    return chunks

MCP Integration: Connecting AI to Your Data

The MCP server exposes datasets as queryable tools for AI:

// mcp-server/index.ts
import { Server } from '@modelcontextprotocol/sdk/server/index.js'
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'
import {
  ListToolsRequestSchema,
  CallToolRequestSchema,
} from '@modelcontextprotocol/sdk/types.js'

const server = new Server(
  { name: 'fltr-mcp-server', version: '1.0.0' },
  { capabilities: { tools: {} } }
)

// Register search tool
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: 'search_dataset',
      description: 'Search a FLTR dataset using semantic search',
      inputSchema: {
        type: 'object',
        properties: {
          dataset_id: { type: 'string' },
          query: { type: 'string' },
          top_k: { type: 'number', default: 5 },
        },
        required: ['dataset_id', 'query'],
      },
    },
  ],
}))

// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'search_dataset') {
    const { dataset_id, query, top_k } = request.params.arguments

    // Call FastAPI backend
    const response = await fetch(
      `${process.env.FASTAPI_URL}/search`,
      {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ dataset_id, query, top_k }),
      }
    )

    const results = await response.json()

    return {
      content: [
        {
          type: 'text',
          text: JSON.stringify(results, null, 2),
        },
      ],
    }
  }

  throw new Error(`Unknown tool: ${request.params.name}`)
})

// Start server
const transport = new StdioServerTransport()
await server.connect(transport)

Usage in Claude Desktop (after compiling the TypeScript to JavaScript):

// ~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "fltr": {
      "command": "node",
      "args": ["/path/to/mcp-server/index.ts"],
      "env": {
        "FASTAPI_URL": "https://api.fltr.app"
      }
    }
  }
}

Now in Claude conversations:

User: What are the key findings in the Q3 2024 revenue report?

Claude: I'll search the FLTR dataset for that information.

[Calls MCP tool: search_dataset]
  dataset_id: "user_reports"
  query: "Q3 2024 revenue findings"

[Results with citations]
Based on the Q3 2024 Revenue Report (page 4):
- Total revenue increased 23% YoY to $4.2M
- SaaS ARR grew 31% to $3.1M
- Customer churn decreased to 2.1%

Source: q3-2024-revenue-report.pdf, chunks 12-15

Credit System Implementation

Users purchase credits to access datasets:

// lib/schema.ts (Drizzle - fltr_auth)
import { pgTable, uuid, text, integer, timestamp } from 'drizzle-orm/pg-core'
export const users = pgTable('users', {
  id: uuid('id').primaryKey(),
  email: text('email').notNull().unique(),
  credits: integer('credits').default(0),
})

export const creditTransactions = pgTable('credit_transactions', {
  id: uuid('id').primaryKey(),
  userId: uuid('user_id').references(() => users.id),
  amount: integer('amount').notNull(), // Positive = purchase, negative = usage
  type: text('type').notNull(), // 'purchase', 'search', 'upload'
  datasetId: uuid('dataset_id'), // References fltr.datasets
  createdAt: timestamp('created_at').defaultNow(),
})

Deduct credits on search:

# fastapi_app/routes/search.py
@router.post("/search")
async def search(
    dataset_id: str,
    query: str,
    user_id: str,
    db_auth: Session = Depends(get_auth_db)  # session for the fltr_auth database (dependency name assumed)
):
    # Check user has credits (cross-database query)
    user = db_auth.query(User).filter_by(id=user_id).first()

    if user is None or user.credits < 1:
        raise HTTPException(status_code=402, detail="Insufficient credits")

    # Perform search
    results = search_dataset(query, dataset_id)

    # Deduct credit and log the transaction in a single commit
    user.credits -= 1
    db_auth.add(CreditTransaction(
        user_id=user_id,
        amount=-1,
        type="search",
        dataset_id=dataset_id
    ))
    db_auth.commit()

    return results
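
One caveat with the check-then-deduct pattern above: two concurrent searches can both pass the credit check before either commits. A conditional UPDATE closes that race; a sketch, assuming the same session objects:

# fastapi_app/services/credits.py (illustrative sketch)
from sqlalchemy import text

def deduct_credit(db_auth, user_id: str) -> bool:
    # The WHERE clause makes check and deduct a single atomic statement
    result = db_auth.execute(
        text("UPDATE users SET credits = credits - 1 "
             "WHERE id = :user_id AND credits >= 1"),
        {"user_id": user_id},
    )
    db_auth.commit()
    return result.rowcount == 1  # False means insufficient credits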

Cost Optimization Strategies

1. DigitalOcean Droplet Downsizing

Before: $50/month droplet (overkill for FastAPI)
After: $24/month droplet (right-sized)

# Optimize gunicorn workers
gunicorn -w 2 -k uvicorn.workers.UvicornWorker main:app
# Only 2 workers needed for moderate traffic

2. Modal.com Serverless GPU

Why not always-on GPU?

  • GPU VMs cost $1-3/hour
  • Document processing is sporadic (not 24/7)

Modal.com pricing:

  • $0 when idle
  • $0.60/hour GPU T4 (only when processing)
  • Average cost: ~$15/month (about 25 GPU-hours of sporadic processing, vs $720+ for an always-on GPU at $1/hour)

3. Milvus Shared Collections

Before: One collection per dataset

  • 100 datasets = 100 collections
  • High memory overhead

After: One shared collection with filtering

  • 100 datasets = 1 collection
  • 80% memory reduction

Production Deployment Challenges

1. Cross-Database Foreign Keys

Challenge: Drizzle (Next.js) and SQLAlchemy (FastAPI) both need access to fltr_auth.users

Solution: UUID references without actual foreign key constraints

// Drizzle (Next.js) doesn't enforce the FK across databases
export const datasets = pgTable('datasets', {
  id: uuid('id').primaryKey(),
  ownerId: uuid('owner_id'), // Refers to fltr_auth.users.id (not enforced)
})

# SQLAlchemy (FastAPI) likewise uses a plain UUID column
class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(UUID, primary_key=True)
    owner_id = Column(UUID)  # Refers to fltr_auth.users.id

Trade-off: Lose referential integrity, but gain flexibility.

2. Dual ORM Strategy

Drizzle (Next.js):

  • Type-safe SQL generation
  • Great TypeScript integration
  • Fast for simple queries

SQLAlchemy (FastAPI):

  • Mature Python ORM
  • Better for complex queries
  • Event system for lifecycle hooks

Coordination: Use migrations carefully to keep schemas in sync.
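
One lightweight safeguard is a CI check that the live schema still matches the SQLAlchemy models. A sketch (the models module path is an assumption):

# tests/test_schema_sync.py (illustrative sketch)
from sqlalchemy import create_engine, inspect
from fastapi_app.models import Base  # assumed module exposing the SQLAlchemy models

def test_fltr_schema_in_sync():
    engine = create_engine("postgresql://localhost/fltr")
    live_tables = set(inspect(engine).get_table_names())
    declared = set(Base.metadata.tables)
    missing = declared - live_tables
    assert not missing, f"Tables missing from live schema: {missing}"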

Lessons Learned

  1. MCP is powerful but early - Protocol is evolving, expect breaking changes
  2. Two databases = more complexity - Worth it for separation of concerns
  3. Shared Milvus collection wins - Better performance, lower costs
  4. Modal.com = game changer - Serverless GPU eliminates massive fixed costs
  5. Presigned URLs >> proxying - Direct R2 uploads save bandwidth and latency

Future Roadmap

Short-term:

  • Multi-format support (DOCX, CSV, JSON)
  • Dataset versioning
  • Team collaboration features
  • API keys for programmatic access

Long-term:

  • Federated search across multiple datasets
  • Custom embedding models
  • Real-time dataset updates (webhooks on changes)
  • Marketplace for public datasets

Conclusion

FLTR demonstrates that RAG-as-a-Service is feasible and can be cost-effective with the right architecture:

  • Next.js + FastAPI = Best of both worlds (TypeScript + Python)
  • Modal.com = Serverless GPU unlocks affordability
  • MCP = Standardized AI-to-data integration
  • Milvus shared collections = Scalable vector search

Whether you're building a dataset marketplace, internal knowledge base, or AI-powered research tool, these patterns apply. The combination of modern tooling and serverless compute makes RAG accessible to indie developers, not just enterprises.

Learn more: fltr.app