Using Mistral-AI to shield Claude-AI

It started, as these things often do, with a vague sense of discomfort.

You want to leverage the power of cloud LLMs such as Claude. You’ve got a neat little retrieval pipeline going. Documents go in, magic comes out. But one day you look at your logs and realise:

“Wait… where did that customer name go?”
“Why is my prompt 14 KB long?”
“Who exactly is keeping pieces of PII in storage?”

We’d been playing with all the usual suspects: LangChain, LlamaIndex, Pinecone, OpenAI APIs. But every time we got something working, we’d hit a wall:

The system functioned, but we couldn’t really explain what it knew, what it remembered, or what it might leak.

So we asked ourselves the uncomfortable question:

Can we leverage a local LLM to shield a public LLM from personal data and still get useful responses?

Turns out, the answer is yes.
But not without some trial, error, and a couple of long evenings yelling at a bunch of Python libraries, LLM settings and heuristic algorithms.

This is the story of how we built a privacy-first document retrieval server: a local Retrieval-Augmented Generation (RAG) server built on FastAPI that safely opens up private documents to public LLMs, using MCP as the standardised interface and FAISS as the local vector store.

In fact, our anonymiser server acts as a trusted retrieval layer.

Here is how it works:

  1. You use an LLM client (e.g. Claude or an LLM wrapper) and enter a prompt that needs a private document
  2. The LLM client calls the anonymiser via the standard MCP interface and requests the relevant document chunks
  3. The anonymiser handles the request locally with its own LLM (e.g. Mistral) and queries a vector store it has pre-built and curated locally.
  4. The anonymiser reconstructs the curated, relevant document chunks and feeds these back to the LLM client as context
  5. The LLM client builds the final prompt with this context and submits it to the public LLM service, without leaking any private data
  6. The LLM responds with context-aware output
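To make the moving parts concrete, here is a toy sketch of that flow in Python. Every function is a deliberately naive stand-in (keyword matching instead of vector search, a single regex instead of the full filter stack), not the project's actual API:

```python
import re

def anonymise(chunk: str) -> str:
    # Stand-in for the full privacy filter: redact Belgian IBANs only.
    return re.sub(r"\bBE\d{2}(?:\s?\d{4}){3}\b", "[IBAN]", chunk)

def retrieve(query: str, store: list[str], k: int = 2) -> list[str]:
    # Stand-in for FAISS similarity search: rank chunks by keyword overlap.
    score = lambda c: sum(w in c.lower() for w in query.lower().split())
    return sorted(store, key=score, reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # The LLM client combines the curated context with the user's question.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

store = ["Termination requires 30 days notice.",
         "Account BE12 3456 7890 1234 pays monthly."]
context = [anonymise(c) for c in retrieve("termination notice", store)]
prompt = build_prompt("What is the notice period?", context)  # safe to hand to a public LLM
```

The real server replaces each stand-in with the components described below: regex-plus-LLM filtering, local Ollama embeddings, and a FAISS index.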

The document retrieval pipeline our DPO will love

We didn’t set out to make a framework. Or a platform. Or something with three nested YAMLs and a dashboard full of sadness.

We just wanted an architecture to process documents, store their meaning in vectors, and get helpful responses, without leaking anything private along the way.

Sounds simple enough, right?

You can supplement an LLM’s pre-existing training data with your own data using RAG. The problem is, most RAG pipelines are:

  • built for speed, not scrutiny,
  • reliant on cloud APIs, not consent,
  • and treating privacy as a plugin, not a foundation.

We wanted something else. Something we’d feel confident running inside a regulated or privacy-conscious environment. Something a privacy officer could walk through and not raise an eyebrow (or worse: a red flag).

So we made a wishlist:

  • Everything local
    No OpenAI keys, no Pinecone, no “oops, we just sent a GDPR-sensitive vector to California.”
  • Anonymisation built-in
    Before anything gets embedded or stored, it gets cleaned. Either with rules, with a model, or both.
  • Clean, auditable flow
    Every file, every chunk, every prompt, inspectable and explainable.

To present the curated documents to a public LLM, we use MCP, short for Model Context Protocol. MCP is an open standard, open-source framework introduced by Anthropic in November 2024 to standardise the way LLMs integrate and share data with external tools, systems, and data sources.

From Prompt Engineering to Retrieval Engineering

Like everyone else, we started by obsessing over prompts.
You know the drill:

“Be concise but comprehensive.”
“Act like a helpful lawyer from 2035.”
“Summarize this document, but make it privacy-aware, culturally sensitive, and somehow charming.”

And sure, clever prompting helps. But at some point, we realised we were treating the LLM like a magician, expecting it to infer nuance from a pile of random text.

Then it clicked:
It’s not just about what you ask the model, it’s about what you feed it.
That’s when we shifted focus from prompt engineering to retrieval engineering.

Because in a RAG setup, what really matters is what context you give the model:

  • If the input is polluted with personal data, the output is, too.
  • If the chunks are garbage, the answer is garbage.
  • And if your retrieval is loose, your answer becomes legally… risky.

So instead of prompt hacking, we started working on:

  • smarter chunking,
  • tighter filters,
  • cleaner embeddings,
  • and retrieval strategies that default to “better safe than sorry.”

The result? Our LLM responses didn’t just sound better, they behaved better. Less hallucination, more precision. Less exposure, more control.

Now, how do we actually control what the model sees, without letting it see too much?

Designing a local, minimal AI server that doesn’t phone home

When we started designing our project, we had one golden rule:

Nothing leaves the machine.

No API keys.
No telemetry.
No mysterious “usage reporting” in the background.

Just a local server, cleanly doing its job: load documents, anonymise them, embed chunks, store vectors, retrieve when needed, and feed that to a model that also runs locally.

That’s it. No drama. No dependencies with a Terms of Service longer than the Bible.

We chose FastAPI to run the server. Why? Because it’s fast, cleanly structured, has automatic OpenAPI docs, and doesn’t try to be too clever.

Here’s what our server does in plain steps:

  • File Upload
    You send a document via API (or a local form). It currently supports .pdf and .docx, because let’s face it, that’s 90% of the world’s knowledge right there.
  • Text Extraction & Chunking
    We extract the text, then slice it into overlapping chunks, because LLMs need context, and context doesn’t always start at a paragraph boundary.
  • Anonymisation
    Before anything gets embedded, it passes through a privacy filter: first dumb but fast (regexes), then smart but slower (a local LLM via Ollama). If a chunk fails filtering, it doesn’t get embedded. Period.
  • Embedding & Storage
    The cleaned chunks are embedded locally using Ollama models, then stored in a FAISS vector index.
  • Retrieval & Generation
    At query time, we embed your question, find the top matching chunks, re-check them for leakage (yes, again), and build a prompt for the LLM, which also runs locally using Ollama.

No outside calls. No cloud AI involved.

We didn’t want a framework. We didn’t want a black box.
We wanted a visible pipeline where every step could be explained to a privacy auditor, or to your slightly paranoid co-founder (hi, Marc).

We also wanted to keep it modular:

  • Swap in a different embedding model? Go ahead.
  • Want chunking based on semantic similarity instead of tokens? Sure.
  • Need to add extra filters for domain-specific redaction? Just drop in a module.

Chunking is where it all goes right (or horribly, horribly wrong)

Chunking sounds technical and minor. Like something you can slap together with a quick .split() and a shrug. But in Retrieval-Augmented Generation, chunking is the unsung hero… or the silent saboteur.

So why do we chunk at all? Because large language models can’t handle your entire document.

They operate within a context window, typically a few thousand tokens (word fragments, roughly). So if you’ve got a 40-page PDF, you can’t just toss it in whole. You need to break it into smaller, digestible pieces.

That’s chunking.

Okay, so why not just split on paragraphs?

That’s what we tried first.
“Let’s just split the document by newline or paragraph markers.” Easy, right?

Wrong.

We quickly ran into real-world weirdness:

  • Some paragraphs were a single word.
  • Others were massive blocks of legal text.
  • Names, dates, or sensitive info would get split across chunks, making them harder to detect, or worse, partially embedded.

Turns out, splitting by format doesn’t respect “meaning.”
It’s not the structure of the page that matters, it’s the logic of the content.

What we learned (the Hard Way)

Chunking randomly, or too cleanly, creates several problems:

  1. Lost context
    If a sentence spans two chunks, each chunk becomes confusing. The LLM sees incomplete thoughts and starts guessing.
  2. Redaction failures
    If a name is cut mid-chunk, the privacy filter might miss it. That means private data gets embedded and retrieved later. (Ask us how we know.)
  3. Overlapping is your friend
    We moved to fixed-size chunks with overlap, e.g. 300 tokens with 80-token overlap. That way, context is preserved across boundaries, and filters get a second chance to catch sensitive data.
  4. Semantic chunking is tempting, but tricky
    We explored splitting on sentence boundaries or using model-driven segmentation, but that brought more compute and edge cases than we wanted in v1. Still, it’s on our roadmap.

The Goldilocks Problem

Too short? You get fragments.
Too long? You lose precision, and privacy guarantees.

We eventually landed on:

  • Token-based chunking
  • Overlapping windows
  • Re-running privacy filters per chunk, not per document

It’s a compromise, but one that balances context preservation, privacy accuracy, and retrieval precision.
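In code, the scheme is a sliding window over tokens. A minimal sketch, with whitespace splitting standing in for a real tokeniser and the 300/80 numbers mentioned above as defaults:

```python
def chunk_tokens(text: str, size: int = 300, overlap: int = 80) -> list[str]:
    """Split text into fixed-size token windows that overlap by `overlap` tokens,
    so a sentence cut at one boundary reappears whole in the next chunk."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):  # last window already covers the tail
            break
    return chunks
```

Each chunk then goes through the privacy filters individually, so the overlapping region gets inspected twice.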

So yeah. Chunking… it’s not sexy. But if you get it wrong, your fancy LLM pipeline becomes a hallucinating privacy-risk machine.

If you get it right? You start getting answers you can trust.
And that’s kind of the whole point.

Privacy Filtering: catch the IBAN before it is vectorised

Here’s the thing about RAG: once something is embedded, it’s too late.

You can’t un-leak an IBAN.
You can’t redact a phone number that’s already turned into a vector and stored in your index.
You certainly can’t tell an LLM not to hallucinate private info if that info is already part of its context.

So we made a hard rule early on:

Every piece of text must pass through a privacy filter before it gets embedded.
No exceptions. No “we’ll fix it later.”

Layer 1: Heuristics and Handcrafted Rules

We started with what we knew: patterns.

Belgian IBAN? Regex.
11-digit national registry number? Regex.
Dates, postcodes, email addresses? Regex, regex, regex.

This was surprisingly effective, especially in domains we knew well.
We could catch 80% of sensitive content using plain old pattern matching.
Plus, regex pattern matching is:

  • Fast
  • Deterministic
  • Auditable (try showing a DPO your fine-tuned transformer…)
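As a taste of what this first layer looks like, here is a minimal pattern table plus a redaction pass. The patterns are illustrative, not exhaustive (the national registry pattern assumes the common YY.MM.DD-XXX.XX formatting):

```python
import re

# Illustrative first-layer patterns; a real deployment needs more variants per field.
PATTERNS = {
    "iban_be": re.compile(r"\bBE\d{2}(?:\s?\d{4}){3}\b"),
    "rrn":     re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b"),  # Belgian national registry number
    "email":   re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace every match with a tag and report which patterns fired (the audit trail)."""
    hits = []
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            hits.append(name)
            text = pattern.sub(f"[{name.upper()}]", text)
    return text, hits
```

Returning the list of fired patterns is what makes the layer auditable: every redaction can be traced to a named rule.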

But of course, regex has limits:

  • It doesn’t handle context (“Is ‘Jan Jansen’ a person or a company?”).
  • It misses weird formatting (“IBANs with extra spaces, tabs, or typos”).
  • It can’t generalise across domains.

That’s where the second layer comes in.

Layer 2: Using a Local LLM to Outsmart the Public One

Let’s get something out of the way:

We’re not anti-LLM. We’re anti-leaky-LLM.

In fact, we use an LLM as part of our privacy filter, just not the kind that phones home to California. We run a small, local model via Ollama that’s been instructed to act like a paranoid privacy officer:

“Read this text. Tell me if there’s anything in here that a public LLM should never see.”

It doesn’t embed. It doesn’t generate. It judges.

This was a turning point for us:

Instead of just fearing what an LLM might leak…
…we started using one to prevent leaks.

And not just any model, a fully local one.
No API calls. No logging. No risk.

Just you, your CPU (or better: your GPU or NPU), and a model that helps you pre-screen context before handing it off to the big LLM. It’s like hiring a bouncer who’s smarter than the DJ.
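In practice, the “paranoid privacy officer” is one call to Ollama’s local /api/generate endpoint with a tightly scoped instruction. The model name and the LEAK/CLEAN verdict convention here are our illustration, not a fixed contract:

```python
import json
import urllib.request

JUDGE_PROMPT = (
    "You are a strict privacy officer. Read the text below and answer with a "
    "single word: LEAK if it contains personal data (names, IBANs, national "
    "numbers, emails, addresses), or CLEAN if it does not.\n\nText:\n{chunk}"
)

def judge_chunk(chunk: str, model: str = "mistral",
                host: str = "http://localhost:11434") -> bool:
    """Ask a local Ollama model whether a chunk is safe to embed. True means safe."""
    payload = json.dumps({
        "model": model,
        "prompt": JUDGE_PROMPT.format(chunk=chunk),
        "stream": False,  # get one JSON response instead of a token stream
    }).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        verdict = json.loads(resp.read())["response"].strip().upper()
    return verdict.startswith("CLEAN")
```

The model never embeds or generates content for the pipeline; it only returns a verdict, which keeps its role narrow and auditable.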

False Negatives leak. But False Positives kill usefulness

When dealing with privacy, false negatives are dangerous. Leakage is bad, we get it.

But here’s the twist: too many false positives are just as destructive.

If your filters overreact, if they redact every name, every location, every product code, you end up with:

  • neutered chunks,
  • broken sentences,
  • and an LLM that’s flying blind.

This doesn’t just hurt accuracy, it breaks trust in the system. Users ask for real answers and get empty fluff because the context got wiped.

So we tuned our filters not just for strictness, but for proportionality.

Privacy-preserving AI isn’t just about blocking sensitive info.
It’s about doing so without destroying meaning.

Bottom line: If you care about privacy, filter early, filter often, but filter smartly.
Because a blank answer is just as useless as a risky one.

Bonus: Why we check again, before generation

This one surprised us.

Even after filtering before embedding, we found that some retrieved chunks still needed another look. Maybe the filters evolved. Maybe something slipped through.

So at generation time, we re-run the privacy checks on the context chunks before they go into the prompt. It’s cheap insurance.

Flipping the Script

This idea of using a local LLM to protect you from a bigger, potentially less-trustworthy LLM felt… kind of revolutionary.

Most people still see LLMs as the privacy problem.
But used wisely, they can become part of the solution.

Especially when you:

  • keep them local,
  • scope their task clearly,
  • and make them your first line of defense.

We think this is going to be a pattern for the next wave of AI infra:

LLMs calling other LLMs — not to ask for answers, but to shield them.

Embedding & Retrieval: vector magic, but keep it local

After filtering, chunking, and a few stern lectures about privacy, your document is finally ready for the next phase: embedding. This is the part where meaning gets turned into math, or more precisely, into vectors.

And while this step can feel abstract, it’s the beating heart of any RAG system. It’s how your question finds its way to the right passage, ideally without passing through California en route.

Local vectorisation: no cloud dependencies

We use Ollama again here, not for text generation this time, but for embedding.

Our go-to model was nomic-embed-text, running locally and fast. Why this model?

  • It performs well on semantic search.
  • It’s compact enough to run on a laptop.
  • And, most importantly, it doesn’t need a cloud key.

There are great public embedding APIs out there. But the moment you use them:

  • You give up control.
  • You leak vector representations of private content.
  • You risk future traceability problems if your data ever needs to be explained or deleted.

So we went full local:

If we can’t embed it ourselves, we don’t embed it.
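Embedding locally is a single call to Ollama’s embeddings endpoint. A minimal sketch (error handling omitted):

```python
import json
import urllib.request

def embed(text: str, model: str = "nomic-embed-text",
          host: str = "http://localhost:11434") -> list[float]:
    """Embed a cleaned chunk locally via Ollama's embeddings endpoint.
    Returns the vector to store in the FAISS index."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(f"{host}/api/embeddings", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]
```

Note that the text passed here is the post-filter, cleaned chunk; nothing reaches this function before it has survived the privacy layers.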

The Separation of Powers: vectorisation ≠ generation

This is where a lot of RAG architectures blur the lines.
In our set-up, we draw a strict boundary:

  • Embedding/vectorisation happens once, when we ingest the chunk.
  • Generation happens later, only with selected, approved chunks.

At no point does the cloud LLM get access to the full document, the full vector index, or any private metadata. We don’t send a vector to the model. We send curated chunks of clean text, retrieved by matching vectors, but never exposing them directly.

Think of it like this:

  • Embedding = encoding the knowledge
  • Retrieval = finding relevant pieces
  • Prompting = asking the model, but on your terms

That separation makes the whole thing safer, explainable, and audit-friendly.

FAISS: because it just works

For our vector store, we use Facebook AI Similarity Search (FAISS), Facebook’s open-source library for similarity search and clustering of dense vectors. Its algorithms search sets of vectors of any size.

  • It’s fast.
  • It’s proven.
  • It doesn’t need a database server, a subscription plan, or a PhD in config files.

We index each embedded chunk, store them in a local FAISS index, and query them at runtime based on similarity to the user’s prompt.
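Under the hood, FAISS’s simplest index (IndexFlatL2) performs exhaustive L2 search. To show the mechanics without the library, here is the same idea in a few lines of plain Python:

```python
def l2(a: list[float], b: list[float]) -> float:
    # Squared Euclidean distance between two embedding vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(index: list[list[float]], query: list[float], k: int = 3) -> list[int]:
    """Return positions of the k stored vectors closest to the query,
    which is exactly what FAISS's IndexFlatL2 computes (just much faster)."""
    ranked = sorted(range(len(index)), key=lambda i: l2(index[i], query))
    return ranked[:k]

index = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]   # toy 2-d "embeddings"
print(search(index, [1.0, 0.0], k=2))           # → [1, 2]
```

With FAISS itself, the equivalent is roughly faiss.IndexFlatL2(dim), index.add(vectors), index.search(queries, k); the returned positions map back to the stored chunks.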

It’s clean, fast, and… did we mention local?

Bonus: retrieval-time privacy checks

Even at retrieval time, we’re still paranoid.

After retrieving the top N matches, we re-run the privacy filter (yep, again) just in case:

  • filters improved since ingestion,
  • data was updated,
  • or a weird chunk slipped through.

You can never be too careful, especially when your LLM is about to put that chunk in a prompt and say it out loud.

So yeah: vectorisation and retrieval might sound like “just glue code,” but in a privacy-preserving system, they’re firewalls with brains.

Aiming for real-world usability

Let’s be honest: there are a lot of cool-looking LLM projects out there that break the moment you try to… do anything with them. We wanted our MCP-RAG server to be different: a clean, local server that gets out of your way and does what it says on the tin.

So we built our server to be API-first and developer-friendly, whether you’re hacking a prototype, building a Slack bot, or wiring it into an internal search tool.

📁 File Upload: Simple, Secure, and Local

At its core, the MCP server lets you send documents to a local endpoint. You can upload via a clean FastAPI form, Postman, or your own script. It accepts .pdf and .docx files, and plain text. The server responds with:

  • what it extracted,
  • what it filtered,
  • and what made it to the vector index.

No need for external storage. No “cloud bucket configuration.” Just call it in, and you’re rolling, for example:

curl -F "file=@contract.pdf" http://localhost:8000/upload

You’ll get a JSON back showing how many chunks were created, which filters flagged content, and a clean audit trail you can store or inspect.

🔍 Query Time: Bring Your Question, Get a Clean Answer

Once documents are embedded, you query our local server by hitting the /ask endpoint with your prompt:

{
  "query": "What are the termination clauses in this agreement?"
}

And equally in French:

{
  "query": "Quelles sont les clauses de résiliation prévues par cet accord ?"
}

(“What are the termination clauses provided for by this agreement?”)

The local server will:

  • Embed the question locally
  • Retrieve the most relevant chunks
  • Re-check them for sensitive info
  • Build a clean prompt
  • Run the prompt through your chosen local LLM (e.g. mistral, llama2 or gemma)

You get back a response that’s grounded in your data, but filtered through multiple privacy layers.

⚙️ CLI or API, Your Choice

Our server works both as:

  • An MCP server (via FastAPI, with OpenAPI docs for every endpoint)
  • A CLI (for power users who like to script everything)

You can:

  • batch-process a folder of files,
  • reload the index,
  • test filters,
  • or monitor logs, all from the command line.

The local server isn’t a platform. It’s not a replacement for LangChain or Haystack. It’s a tool you can wrap around your own needs:

  • Building a secure internal chatbot? Plug it into your retrieval layer.
  • Processing documents for sensitive analysis? Process them before storage.
  • Need to anonymise before sending anything to Claude? Use it as a local buffer.

The point is: you stay in control. A lot of AI tools promise privacy. The server we built just… doesn’t talk to anyone else. Ever. And sometimes that’s all the guarantee you need.

Our Lessons Learned

Building the MCP-RAG server wasn’t some heroic moonshot. It was a pile of small, slightly obsessive decisions, one after another, mostly driven by paranoia, a bit of curiosity, and the occasional “Wait… what did we just embed?!”

Along the way, we learned a few things that surprised us, and probably will surprise you too.

💡 1. Local LLMs are better than you think

We started out skeptical: “Sure, it runs locally, but can it really do anything?”
Turns out: yes. Especially for tasks like:

  • flagging sensitive content,
  • classifying chunk types,
  • filtering context based on tone or subject.

We don’t ask our local LLM to write poems or pass bar exams. We ask it to be useful and fast, and to locate privacy-sensitive pieces. And for that, models like mistral or gemma do a great job.

Local LLMs have gone from “cute toy” to serious infrastructure piece, especially in a privacy-preserving stack.

🤖 2. Regex is very useful but limited

Everyone wants fancy models. But sometimes, all you need is:

re.compile(r"\bBE\d{2}\s?\d{4}\s?\d{4}\s?\d{4}\b")

Regex is fast, transparent, debuggable and, most importantly, predictable.

The real magic is when you combine it with LLM-based filtering:

  • Regex handles the obvious.
  • LLM handles the sneaky.

Together? You’ve got a pretty strong perimeter.
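Composing the two layers is straightforward: regex runs first because it is cheap and deterministic, and only chunks that survive it reach the local LLM judge. A sketch, with a stub standing in for the Ollama call:

```python
import re

IBAN_BE = re.compile(r"\bBE\d{2}(?:\s?\d{4}){3}\b")

def is_safe(chunk: str, llm_judge) -> bool:
    """Layer 1: regex handles the obvious. Layer 2: the local LLM handles the sneaky.
    `llm_judge` is any callable returning True when a chunk is clean."""
    if IBAN_BE.search(chunk):        # obvious leak: reject without spending LLM cycles
        return False
    return llm_judge(chunk)          # ambiguous cases go to the local model

# Usage with a stub judge (a real one would call the local Ollama model):
paranoid = lambda chunk: "Jan Jansen" not in chunk
print(is_safe("Pay BE71 0961 2345 6769 today", paranoid))        # → False
print(is_safe("Contact Jan Jansen for details", paranoid))       # → False
print(is_safe("Termination requires 30 days notice", paranoid))  # → True
```

Ordering matters for cost as much as coverage: the deterministic layer filters the bulk of leaks before the slower model is ever invoked.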

🪜 3. Going local forces you to think in layers

Once you cut out the cloud, you lose some of the conveniences.
But you gain clarity.
No magic vector store.
No outsourced model logic.
Just you, your documents, and the need to build something that actually makes sense.

We found ourselves asking questions like:

  • “Should this even be embedded?”
  • “Can we explain why this chunk was retrieved?”
  • “How would we delete this from the index?”

Those questions don’t always come up when you’re using hosted platforms.

But they’re the exact questions your privacy officer, or regulator, will ask.

😅 4. Developer UX still matters (even in a paranoid system)

It’s easy to overcorrect and build something secure but painful.
We didn’t want that.

So we invested in:

  • clean logging
  • helpful errors
  • clear API docs (FastAPI FTW)
  • CLI for batch work
  • modular code you can actually read and modify

Being privacy-first shouldn’t mean being developer-last.

⚖️ 5. The real art: balancing False Negatives and False Positives

In privacy, everyone talks about avoiding false negatives, and rightly so. If something sensitive slips through the filter, it gets embedded, retrieved, and possibly exposed. That’s leakage. That’s bad.

But there’s another risk that gets less attention: false positives.

If your filters are too strict:

  • you start redacting everything that looks vaguely personal,
  • chunks get shredded into Swiss cheese,
  • and suddenly, your LLM is working with incomplete, contextless nonsense.

This leads to:

  • irrelevant answers,
  • vague summaries,
  • or the dreaded “I’m sorry, I cannot answer that”… even when the data is right there, just overly blanked out.

We had to find a sweet spot:

  • Strict enough to block actual risk.
  • Loose enough to keep the LLM useful.

This is why we run filtering per chunk, not per document, and why we give the LLM just enough rope to catch nuanced stuff, without censoring useful content.

A privacy-preserving RAG isn’t just about what you block, it’s about what you let through, and why.

Building an MCP-RAG server changed how we think about LLM pipelines. It’s not just about what the model can do. It’s about what it should be allowed to see — and how much of that process you can actually explain.

Where we’re headed next

We started out with a simple idea:

Can we build a RAG pipeline that doesn’t leak?

Now it’s turning into something bigger, not in size, but in scope. Because once you’ve got privacy-preserving building blocks, you start seeing new possibilities. Here’s what we’re exploring next.

Prompt Classification

Today, we treat all prompts equally. But not all questions carry the same risk. We want to add prompt classification as a first step, using a local LLM to decide:

  • Is this a legal question?
  • Is it requesting personal data?
  • Does it require stricter filters?
  • Should it even be allowed?

This adds a new control point: “Should we even answer this?” It’s like having a tiny AI policy engine at the gate.

Persona-Based Context

Different users, different views. We’re experimenting with a persona system, where the same document can produce different prompts depending on:

  • Who is asking
  • What their access level is
  • What context they’re allowed to see

Think “privacy-aware multi-tenancy for LLMs”, without needing to shard your data 17 different ways.

Slack, Notion & Ticket Integrations

The RAG pipeline doesn’t live in a vacuum. Neither do your users. We’re building lightweight integrations to:

  • pull context from Notion pages or tickets,
  • post summaries back to Slack,
  • and eventually let users self-serve knowledge safely from inside the tools they already use.

Of course, every one of these integrations runs through the same privacy filter stack.
No backdoors. No side-channels.

Metadata, Auditability, and Explainability

Another big one: embedding more metadata alongside chunks.

Imagine being able to say:

  • When was this chunk created?
  • What filters were applied?
  • What PII tags were detected?
  • Who ingested it?

We want to build tooling where you can trace every output back to its source, not just for transparency, but for governance, compliance, and peace of mind.

We didn’t set out to build a platform, but we might be building the infrastructure we wish existed: a retrieval system that’s fast, local, explainable, and never leaks on purpose.

Why all this matters: “Control isn’t optional anymore”

We didn’t build the MCP-RAG server to chase hype.
We built it because we were tired of trade-offs:

  • “You can have smart search, but not privacy.”
  • “You can use RAG, but only if your data goes to the cloud.”
  • “You can inspect the output, but not the process.”

We didn’t buy it. So we built something else. We don’t think privacy is just a legal checkbox. We think it’s an architectural principle. If you’re feeding documents into an AI system, whether it’s to answer questions, generate summaries, or power a chatbot, you deserve to:

  • Know what’s stored.
  • Know what’s seen.
  • Know what’s said.
  • And most importantly: know what’s never included.

What we think we’ve done is sketch a pattern that others can build on:

  • Embed locally.
  • Filter early.
  • Use LLMs to protect against LLMs.
  • Think in layers, not shortcuts.

It’s not complex. It’s just deliberate.
And that, we think, is the future of AI systems you can actually trust.

So if you’re building something similar…
Steal from us. Fork it. Remix it.
Make it better.

Because the more developers think like this (privacy-aware, modular, explainable), the less we’ll have to rely on blind trust.

And the more we’ll be able to say:

Yes, this is private.
And yes, the AI still works.

GitHub

To encourage this kind of trusted AI architecture, we open-sourced our experiment so the community can fork it, run it, and adapt it:

https://github.com/schuppnura/python-anonymization
