A BlueAlly Field Guide
A model does not read words. It reads numbers. We take one plain sentence and follow it all the way — from raw text to a running, governed system. No mystery. Just engineering.
Conquer Complexity
01 The premise
The model you are about to buy reads zero words. It reads numbers — and only numbers. The sentence on the page becomes a list of integers before the model sees a thing. Everything you call "AI" is the work of turning your language into numbers honestly, and turning the numbers back into something true.
That gap is where the money is made or lost. The model is rarely the problem. The system around it — how text is cut, mapped, stored, found, and governed — decides whether your AI ships or stalls. So this guide skips the hype and follows the wiring.
We take one sentence. A real one, the kind that lives in your contracts, your tickets, your claims files: "Contract 4471 renews on the first of March." Watch what the machine does to it. By the end you will know how it is cut, counted, mapped, stored, found, and acted upon — and where each piece of your own infrastructure belongs.
02 Tokenization · Byte-Pair Encoding
Before a model can think about a sentence, the sentence must become a sequence of integers. The cutting is called tokenization, and the most common knife is Byte-Pair Encoding — BPE. It is old, it is simple, and it works.
BPE is trained once, on a mountain of text, before the model ever sees your data. The recipe is short. Start with single characters. Count every adjacent pair. Merge the most frequent pair into one new piece, and add it to the vocabulary. Repeat, tens of thousands of times, until the vocabulary is full. [1]
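The recipe above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production tokenizer: it works on characters rather than raw bytes, it splits on whitespace first, and the name `train_bpe` is hypothetical.

```python
from collections import Counter


def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules (toy, character-level sketch)."""
    # Each word starts as a tuple of single characters, weighted by its count.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the whole corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair (ties go to the first one seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with that pair fused into one new piece.
        rewritten = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            rewritten[tuple(out)] += freq
        words = rewritten
    return merges


# Usage: words sharing a prefix push that prefix's pairs to the top early.
rules = train_bpe("renews renewal renewed renew", 3)
```

Each returned rule is one vocabulary entry. A real tokenizer runs the same loop over bytes and a far larger corpus.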
The result is a fixed vocabulary, today roughly 100,000 to 200,000 pieces. [2] Common words survive whole. Rare words break into familiar fragments. And because modern BPE works on raw bytes, nothing is ever truly out-of-vocabulary: in the worst case, a word falls all the way back to single bytes. [1]
Each piece in the vocabulary has a fixed ID. The sentence becomes a list of integers, and this list — nothing else — is what enters the model.
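To see why nothing is out-of-vocabulary, here is a small sketch of the byte fallback. It is a simplification: it uses greedy longest-match lookup instead of replaying learned merges, and the vocabulary, the IDs, and the names `encode` and `decode` are made up for illustration.

```python
def encode(text: str, vocab: dict[bytes, int]) -> list[int]:
    """Greedy longest-match over a toy vocabulary, falling back to single bytes."""
    data = text.encode("utf-8")
    max_len = max(len(piece) for piece in vocab)
    ids, i = [], 0
    while i < len(data):
        # Try the longest candidate first, then shrink down to one byte.
        for size in range(min(max_len, len(data) - i), 0, -1):
            piece = data[i:i + size]
            if piece in vocab:
                ids.append(vocab[piece])
                i += size
                break
        else:
            raise KeyError("vocabulary must contain all 256 single bytes")
    return ids


def decode(ids: list[int], vocab: dict[bytes, int]) -> str:
    """Map IDs back to bytes and then to text."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return b"".join(inverse[token_id] for token_id in ids).decode("utf-8")


# Assumption: IDs 0-255 are the raw bytes; larger IDs are learned pieces.
TOY_VOCAB = {bytes([b]): b for b in range(256)}
TOY_VOCAB.update({b"Contract": 256, b" renews": 257, b" on": 258, b" the": 259})

ids = encode("Contract 4471 renews", TOY_VOCAB)
# "Contract" and " renews" are single pieces; " 4471" falls back to bytes.
```

Because every single byte has an ID, any string in any language encodes to some list of integers, and decodes back exactly.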
The model never meets a word it cannot cut. That is the whole trick.
03 Vectorization · Embeddings
An integer ID says nothing about meaning. Token 5541 is not "more" than token 1156. So each token is traded for a long list of numbers — a vector. This is the embedding, and it is where meaning lives.
Things that mean similar things land near each other. That is the entire idea. "Renews," "extends," and "continues" sit close together in this space. "Terminates" and "cancels" sit close together too — somewhere else. The distance between points is the distance between meanings. Nobody hand-places a single point; the map is learned from how words are actually used.
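"Near" has a concrete meaning. A common measure is cosine similarity: the angle between two vectors. The sketch below uses three-number vectors invented by hand purely for illustration; real embeddings are learned and far longer.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not output from any real model.
TOY_EMBEDDINGS = {
    "renews": [0.9, 0.1, 0.2],
    "extends": [0.8, 0.2, 0.3],
    "terminates": [-0.7, 0.6, 0.1],
}

close = cosine(TOY_EMBEDDINGS["renews"], TOY_EMBEDDINGS["extends"])
far = cosine(TOY_EMBEDDINGS["renews"], TOY_EMBEDDINGS["terminates"])
# `close` comes out higher than `far`: similar meanings, smaller angle.
```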
There are two kinds worth knowing. Token embeddings live inside the model, one vector per vocabulary piece. Document embeddings are produced by a dedicated embedding model that reads a whole passage and emits one vector for the entire thing. These are what you store in a vector database. When we say "embed your documents," we mean this. Today's embedding models typically emit vectors of several hundred to a few thousand numbers. OpenAI's text-embedding-3-large defaults to 3,072, and many models let you shorten them. [4]
More numbers means finer meaning, and more storage to pay for, on every chunk, forever. Some modern embedding models let you trade one for the other. Trained with a method called Matryoshka representation learning, they pack the most important meaning into the first numbers of each vector, so you can keep a shorter slice and lose little. [4] Pick the length that fits your budget, not the longest one on offer.
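The trade is easy to put in numbers. The sketch below shortens a vector by keeping its first numbers and rescaling to unit length, then estimates raw storage. It assumes 4-byte floats and ignores index overhead; the names `shorten` and `storage_bytes` are hypothetical. Slicing only preserves quality for models trained to support it.

```python
import math


def shorten(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` numbers and rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalise an all-zero slice")
    return [x / norm for x in head]


def storage_bytes(chunks: int, dims: int, bytes_per_number: int = 4) -> int:
    """Raw vector storage only; real indexes add their own overhead."""
    return chunks * dims * bytes_per_number


full = storage_bytes(1_000_000, 3072)   # one million chunks at full length
short = storage_bytes(1_000_000, 1024)  # the same chunks at one third the length
```

At one million chunks, the full-length vectors take about 12.3 GB raw and the shortened ones about 4.1 GB.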
A database stores what you said. An embedding stores what you meant.
04 Indexing · Vector vs. Structured
A structured database is a filing cabinet. You find things by their labels — exact, fast, and unforgiving. A vector database is a map of meaning. You find things by their neighbours. An enterprise needs both, and the failures we see most often come from asking one to do the other's job.
| Dimension | Structured (SQL) | Vector |
|---|---|---|
| How you ask | Exact predicates — WHERE id = 4471 | "Things like this" — a query vector |
| What "match" means | Equality. True or false. No middle. | Similarity. A graded score (cosine similarity runs from -1 to 1). All middle. |
| Index under the hood | B-tree, hash: built for exact lookup | HNSW, IVF: navigable neighbourhoods [6] |
| A wrong answer looks like | An empty result. A loud failure. | A plausible-but-off neighbour. A quiet one. |
| Built for | Transactions, joins, totals, audit | Search, recommendation, RAG retrieval |
Use SQL for what is true. Use vectors for what is relevant. Never confuse the two.
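The two failure modes in the table are easy to see side by side. This sketch uses Python's built-in `sqlite3` for the exact lookup and a brute-force scan for the similarity search (HNSW and IVF are faster approximations of that scan). The table contents, the stored date string, and the two-number vectors are invented for illustration.

```python
import math
import sqlite3

# Structured side: an exact predicate either matches or returns nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (id INTEGER PRIMARY KEY, renews_on TEXT)")
conn.execute("INSERT INTO contracts VALUES (4471, '03-01')")  # made-up row

exact = conn.execute(
    "SELECT renews_on FROM contracts WHERE id = ?", (4471,)
).fetchone()
missing = conn.execute(
    "SELECT renews_on FROM contracts WHERE id = ?", (9999,)
).fetchone()  # None: a loud, obvious miss


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Vector side: hypothetical chunk vectors, not from a real model.
CHUNKS = {
    "renewal clause": [0.9, 0.1],
    "termination clause": [-0.8, 0.3],
}


def nearest(query, k=1):
    """Brute-force nearest neighbours by cosine similarity."""
    scored = [(cosine(query, vec), name) for name, vec in CHUNKS.items()]
    return sorted(scored, reverse=True)[:k]


# An unrelated query still gets a neighbour back: the quiet failure.
off_topic = nearest([0.0, 1.0])
```

The SQL miss returns `None`. The vector search always returns its best guess, so the caller needs a score threshold or a second check.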
05 Retrieval-Augmented Generation
A model knows what it was trained on, and nothing after, and nothing of yours. RAG fixes that without retraining anything. Before the model answers, the system finds the right pages (from your documents, your policies, your contracts) and lays them in front of it. Then the model answers with the book open. [5]
Update a document, re-embed the chunk, and the next answer reflects it. Minutes, not months.
Every claim points back to a chunk, and every chunk to a page. When the auditor asks "how do you know?" — you show them.
Nothing is baked into model weights. Permissions can be enforced at retrieval time, before chunks reach the prompt.
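The three properties above show up directly in code. This is a minimal sketch under stated assumptions: the `Chunk` fields, the group names, the source labels, and the two-number vectors are all hypothetical, and a dot product stands in for a real vector index. No model is called; the sketch stops at the prompt.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str                 # where the chunk came from, for citation
    allowed_groups: frozenset   # who may see it
    vector: list                # toy embedding


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, chunks, user_groups, k=2):
    """Filter by permission first, then rank what is left."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    ranked = sorted(visible, key=lambda c: dot(query_vec, c.vector), reverse=True)
    return ranked[:k]


def build_prompt(question, retrieved):
    """Number each chunk so the answer can cite it."""
    context = "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{context}\n\nQuestion: {question}"
    )


CHUNKS = [
    Chunk("Contract 4471 renews on the first of March.", "contracts/4471",
          frozenset({"legal", "sales"}), [0.9, 0.1]),
    Chunk("Internal HR note.", "hr/notes", frozenset({"hr"}), [0.8, 0.2]),
]

hits = retrieve([1.0, 0.0], CHUNKS, user_groups=frozenset({"sales"}))
prompt = build_prompt("When does contract 4471 renew?", hits)
```

The HR chunk never reaches the prompt for a sales user, because the filter runs before ranking. Updating a document means replacing its `Chunk`; nothing in the model changes.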
06 Training vs. Retrieval
People say "train the model on our data" when they almost always mean "let the model retrieve our data." The difference is not pedantry. It is millions of dollars and months of calendar, spent or saved.
Pre-training is the long game: the model reads trillions of tokens and guesses the next one, billions of times, nudging billions of weights. Months of compute, tens of millions of dollars. What comes out knows language, facts, and reasoning — frozen at a point in time.
Fine-tuning nudges an existing model on your own examples. It is good for tone, format, and narrow skills. It is poor for facts: knowledge stuffed into weights cannot be updated, cited, or access-controlled, and it can leak. RAG changes nothing in the model. It changes only what the model can see this turn.
Training changes what the model is. Retrieval changes what the model can see.
| Question | Pre-train | Fine-tune | RAG |
|---|---|---|---|
| What changes | All weights | Some weights | Nothing — context only |
| Best for | Building the model | Tone, format, skills | Facts, documents, freshness |
| Time to update | Months | Days to weeks | Minutes |
| Cost profile | Tens of millions | Thousands per run | Pennies per query |
| Can it cite? | No | No | Yes — by design |
| Honours permissions? | No | No | Yes — filter at retrieval |
The honest rule we give every client: fine-tune for behaviour, retrieve for knowledge, and pre-train never — unless you are a frontier lab, and you are not paying frontier-lab bills to find out.
07 Model Context Protocol
An API is a door built for developers. A person reads the docs, writes the glue code, ships it, and maintains it forever. MCP is a door built for models. The model walks up, asks the door what it does, and uses it at runtime, with no custom glue. Same buildings. Different visitor. [11]
Tools: actions the model may take, such as query_contracts, create_ticket, send_summary. Each declares its name, what it does, and the exact shape of its inputs.
Resources: things the model may read, such as files, records, schemas. Context the server is willing to expose, addressed cleanly and governed by the server.
Prompts: reusable instruction templates the server provides. These are the playbooks a team has already proven, offered to the model as starting points.
Be precise here, because vendors will not be: MCP does not replace your APIs. It stands in front of them as a universal adapter that lets any model discover and use them safely, with your permissions and your audit trail intact. [11]
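What "asks the door what it does" looks like in data: a tool is a name, a description, and a JSON Schema for its inputs. The sketch below shows that shape as plain Python dictionaries. The field names `name`, `description`, and `inputSchema` follow the tool definition in the MCP specification as we understand it; verify against the current spec. The validator is a hand-rolled toy, not a JSON Schema implementation and not part of any MCP SDK, and the `month` parameter is invented.

```python
QUERY_CONTRACTS = {
    "name": "query_contracts",
    "description": "Return contracts renewing in a given month (1-12).",
    "inputSchema": {
        "type": "object",
        "properties": {"month": {"type": "integer"}},
        "required": ["month"],
    },
}

# Toy mapping from JSON Schema type names to Python types.
JSON_TYPES = {"integer": int, "string": str}


def list_tools():
    """What a server would hand back when a client asks what it offers."""
    return [QUERY_CONTRACTS]


def validate_arguments(tool, arguments):
    """Return a list of problems; an empty list means the call is well-formed."""
    schema = tool["inputSchema"]
    errors = []
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing: {name}")
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected: {name}")
        elif not isinstance(value, JSON_TYPES[spec["type"]]):
            errors.append(f"wrong type: {name}")
    return errors


ok = validate_arguments(QUERY_CONTRACTS, {"month": 3})
bad = validate_arguments(QUERY_CONTRACTS, {"mnth": "March"})
```

The point of the declared shape is that a malformed call is rejected by the server before it touches the API behind it.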
08 Agentic RAG · The Full System
Plain RAG retrieves once and answers. An agent thinks in a loop: plan, act, observe, and decide whether it is done. [10] Give that loop a vector database for meaning, MCP tools for facts and actions, and guardrails around everything, and you have agentic RAG. This is the architecture behind every serious enterprise deployment we run.
Plan: "I need March renewals, then risk signals." Act: MCP → SQL for the exact list. Observe: six contracts. Act again: vector search across emails for churn language. Observe: two run hot. Done: a cited answer, with dollar figures from the system of record — never from the model's memory.
The 06:00 sweep fires. The same loop runs against every account renewing in 90 days, scores risk, files its evidence, and posts a digest. A human wakes to a short list and a recommendation — and the agent waits for approval before touching the CRM. Autonomy for reading. Permission for writing.
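The loop and the approval gate can be sketched together. Everything here is a stand-in: `scripted_policy` plays the role of the model's planning, the tool functions return canned data, and the tool names beyond `query_contracts` are hypothetical. The structural point is that read tools run freely while write tools are queued for a human.

```python
READ_ONLY = {"query_contracts", "search_emails"}


def run_agent(decide_next, tools, max_steps=8):
    """Plan, act, observe; stop when the planner says done or steps run out."""
    observations, pending_writes = [], []
    for _ in range(max_steps):
        action = decide_next(observations)          # plan
        if action is None:                          # done
            break
        name, args = action
        if name in READ_ONLY:
            result = tools[name](**args)            # act: reads run immediately
        else:
            pending_writes.append(action)           # act: writes wait for approval
            result = "queued for human approval"
        observations.append((name, result))         # observe
    return observations, pending_writes


def scripted_policy(observations):
    """Stand-in for the model: a fixed script indexed by progress so far."""
    script = [
        ("query_contracts", {"month": 3}),
        ("search_emails", {"topic": "churn"}),
        ("update_crm", {"contract_id": 4471, "risk": "high"}),
    ]
    return script[len(observations)] if len(observations) < len(script) else None


# Canned tool results, invented for the example.
TOOLS = {
    "query_contracts": lambda month: [4471, 4472, 4473, 4474, 4475, 4476],
    "search_emails": lambda topic: [4471, 4474],
}

observations, pending_writes = run_agent(scripted_policy, TOOLS)
```

The CRM update never executes inside the loop. It lands in `pending_writes`, where an approval step outside this sketch would pick it up.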
09 Observability · Traces · Evals
A traditional program fails loudly. An AI system fails politely — fluent, confident, and wrong. So observability is not a nice-to-have bolted on at the end. It is the difference between a demo and a system you bet the quarter on. Four things to watch: traces, logs, evals, and dashboards.
Latency tells you the system is fast. Only evaluation tells you it is right. Mature teams run a scored test suite on every change — the way software teams run unit tests. They grade groundedness (is every claim backed by a retrieved chunk?), retrieval quality (did the right chunks come back?), answer quality (correct, complete, in the asked-for format), and safety (PII handled, permissions honoured).
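Two of those grades reduce to simple ratios once answers carry citations. This is a minimal sketch, assuming each answer has already been split into claims that each cite one chunk ID; the case data, thresholds, and function names are hypothetical. Answer quality and safety need richer graders and are left out.

```python
def groundedness(claims, retrieved_ids):
    """Share of claims whose cited chunk was actually retrieved."""
    if not claims:
        return 1.0
    backed = sum(1 for _, cited in claims if cited in retrieved_ids)
    return backed / len(claims)


def retrieval_recall(expected_ids, retrieved_ids):
    """Share of the chunks we wanted that actually came back."""
    expected = set(expected_ids)
    if not expected:
        return 1.0
    return len(expected & set(retrieved_ids)) / len(expected)


def run_suite(cases, min_groundedness=1.0, min_recall=1.0):
    """Return the cases that fall below either threshold."""
    failures = []
    for case in cases:
        g = groundedness(case["claims"], case["retrieved"])
        r = retrieval_recall(case["expected"], case["retrieved"])
        if g < min_groundedness or r < min_recall:
            failures.append((case["name"], g, r))
    return failures


CASES = [
    {"name": "grounded",
     "claims": [("renews on the first of March", "c1")],
     "retrieved": ["c1", "c2"], "expected": ["c1"]},
    {"name": "ungrounded",
     "claims": [("renews on the first of March", "c1"), ("risk is high", "c9")],
     "retrieved": ["c1"], "expected": ["c1", "c3"]},
]

failures = run_suite(CASES)
```

Run on every change, a non-empty `failures` list blocks the release the same way a failing unit test would.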
A demo is judged by its best answer. A system is judged by its worst.
10 The Fine Print
The diagrams above are honest, but production systems live or die on details that rarely make the slide. Here are five that decide whether yours works.
Cut documents wrong and retrieval finds fragments without their context, such as a renewal clause severed from its contract number. Chunk by structure, overlap the edges, carry metadata. Most "RAG doesn't work" complaints are chunking complaints. [7]
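Overlap and metadata are the easy parts to show. The sketch below cuts fixed-size word windows that share their edges and stamps each chunk with its document ID. It is a simplification: it does not split by structure (headings, clauses), which would normally come first, and the sizes and field names are assumptions.

```python
def chunk_words(text, doc_id, size=50, overlap=10):
    """Fixed-size word windows with overlapping edges and carried metadata."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append({
            "doc_id": doc_id,        # metadata travels with every chunk
            "start_word": start,     # lets a citation point back into the source
            "text": " ".join(window),
        })
        if start + size >= len(words):
            break  # the last window already reached the end
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
pieces = chunk_words(sample, doc_id="contract-4471", size=5, overlap=2)
```

With twelve words, a window of five, and an overlap of two, the sample yields four chunks, and each one repeats the last two words of the chunk before it.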
Vectors miss exact strings: part numbers, names, "4471." Keyword search (BM25) misses paraphrase. Run both, fuse the results, then let a reranker put the truly best chunks on top. This often moves quality more than a model swap. [7]
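One common way to fuse two ranked lists is reciprocal rank fusion (RRF). We name it here as an example technique, not as the only option. Each document earns 1 / (k + rank) from every list it appears in. The constant k = 60 is a conventional default, and the chunk IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs; higher combined score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["c4471", "c12", "c9"]   # e.g. from BM25: exact-string matches
vector_hits = ["c7", "c4471", "c12"]    # e.g. from embeddings: paraphrase matches

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Chunks that both searches agree on rise to the top. A reranker would then rescore the first few fused results against the query.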
Models classify, summarize, and route. They do not do arithmetic, and they do not enforce policy. Totals come from SQL. Approvals come from code. The model decides which calculation to run — never the answer to it.
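The division of labour can be enforced in code. In this sketch the model's only output is the name of a calculation from an allow-list; SQL produces the number. The table, the values in cents, and the calculation name are invented, and `model_choice` is a stand-in for a real routing decision.

```python
import sqlite3

# Allow-list: the model may pick a name, never write the query or the total.
CALCULATIONS = {
    "renewal_total_cents": (
        "SELECT COALESCE(SUM(value_cents), 0) FROM contracts WHERE renew_month = ?"
    ),
}


def run_calculation(conn, name, params):
    """Run a named, pre-approved query; reject anything off the list."""
    if name not in CALCULATIONS:
        raise ValueError(f"unknown calculation: {name}")
    return conn.execute(CALCULATIONS[name], params).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (id INTEGER PRIMARY KEY, renew_month INTEGER, value_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [(4471, 3, 1_250_000), (4472, 3, 800_000), (4473, 7, 300_000)],  # made-up rows
)

model_choice = "renewal_total_cents"  # stand-in for the model's routing decision
total = run_calculation(conn, model_choice, (3,))
```

Storing money as integer cents keeps the sum exact. The figure in the final answer comes from `total`, not from generated text.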
Retrieval must filter by who is asking, before the chunks reach the prompt. An agent with every employee's access is a breach with good manners. Scope the tools. Scope the indexes. Log the denials too.
Reading at machine speed is leverage. Writing at machine speed is risk. The pattern that survives audits: agents draft, humans approve, systems record. Loosen it only where the evidence says you can.
11 The sentence, delivered
It was cut into tokens. Mapped into meaning. Indexed for finding. Retrieved with evidence. Reasoned over, acted upon, measured, and trusted. That is the whole stack — and now you have walked it end to end.
The hard part was never the concepts. It is the judgment — what to retrieve, what to govern, what to automate, and what to leave alone. That judgment is what BlueAlly brings to the table.
12 Sources
Every factual claim above is drawn from a primary source — model cards, papers, and the official documentation of the labs that build these systems. Specific cosine scores and timings are illustrative; the mechanisms are not.