
A BlueAlly Field Guide

How the machine reads.

A model does not read words. It reads numbers. We take one plain sentence and follow it all the way — from raw text to a running, governed system. No mystery. Just engineering.

Conquer Complexity

One sentence: "Contract 4471 renews…" Seven stages: Tokens, Vectors, Index, Retrieval, Reasoning, Action, Trusted. The whole modern stack.


01  The premise

A model does not read words. It never has.

The model you are about to buy reads zero words. It reads numbers — and only numbers. The sentence on the page becomes a list of integers before the model sees a thing. Everything you call "AI" is the work of turning your language into numbers honestly, and turning the numbers back into something true.

That gap is where the money is made or lost. The model is rarely the problem. The system around it — how text is cut, mapped, stored, found, and governed — decides whether your AI ships or stalls. So this guide skips the hype and follows the wiring.

We take one sentence. A real one, the kind that lives in your contracts, your tickets, your claims files: "Contract 4471 renews on the first of March." Watch what the machine does to it. By the end you will know how it is cut, counted, mapped, stored, found, and acted upon — and where each piece of your own infrastructure belongs.

0: Words the model reads. It works only in numbers: tokens, then vectors. The translation is the whole game.
~1M: Tokens a frontier model now holds in context at once. Claude Opus 4.8, GPT-5.5, Gemini 3 all near a million. [8]
7: Stages between raw text and a governed answer. Every AI system you buy is some arrangement of these seven.
In plain English

Five words to carry the whole guide

Model: A program that predicts text. Give it words; it returns the words that should come next. That is the whole trick, and it is enough to do remarkable things.
Token: A chunk of text, a few characters long. Models read and write in tokens, and you pay by the token.
Embedding (vector): A long list of numbers that stands for a piece of text. Similar meanings get similar lists. Meaning, made measurable.
Retrieval (RAG): Finding the right pages in your own documents and handing them to the model before it answers. An open-book exam.
Agent: A model that acts in a loop. It plans, calls a tool, reads the result, and decides what to do next, until the job is done. [10]
The assembly line: one sentence, seven stages. Follow the same sentence as it is cut, mapped, stored, found, reasoned over, and delivered.
01 · TEXT: "Contract 4471 renews…" (words)
02 · TOKENS: ·44 71 ren ews ·on (cut into pieces)
03 · VECTORS: [0.021, -0.184, …] (meaning as numbers)
04 · INDEX: stored for search
05 · RETRIEVAL: top matches found on demand (the evidence)
06 · REASONING: a model decides, book open
07 · DELIVERED, CITED & GOVERNED: "Renews 2026-03-01, $3.4M." + source
In: one ambiguous sentence a person could read. Out: an answer a system can prove, cite, and act on.
Fig. 0 — Seven stages, one sentence. The same sentence, transformed step by step: words to tokens to vectors, stored, found, reasoned over, and delivered with its source attached. Every enterprise AI system you will ever buy or build is some arrangement of these stages. The rest of this guide walks them in order.

02  Tokenization · Byte-Pair Encoding

First, the knife. The sentence is cut.

Before a model can think about a sentence, the sentence must become a sequence of integers. The cutting is called tokenization, and the most common knife is Byte-Pair Encoding — BPE. It is old, it is simple, and it works.

BPE is trained once, on a mountain of text, before the model ever sees your data. The recipe is short. Start with single characters. Count every adjacent pair. Merge the most frequent pair into one new piece, and add it to the vocabulary. Repeat, tens of thousands of times, until the vocabulary is full. [1]

The result is a fixed vocabulary, today roughly 100,000 to 200,000 pieces. [2] Common words survive whole. Rare words break into familiar fragments. And because modern BPE works on raw bytes, nothing is ever truly out-of-vocabulary: in the worst case, a word falls all the way back to single bytes. [1]

Training the vocabulary: watch "renews" form
Step 0 · characters: r e n e w s
Merge 1 · "r"+"e" is frequent everywhere → new piece "re": re n e w s
Merge 2 · "re"+"n" → "ren", then "ew"+"s" → "ews": ren ews
Inference · your sentence, cut with the learned merges: Contract ·44 71 ·ren ews ·on ·the ·first ·of ·March .
The number splits. "4471" becomes two tokens, ·44 and 71. A "·" marks a leading space: " renews" and "renews" are different pieces to the tokenizer. Remember the split. It matters in section 4.
Fig. 1 — BPE merges, simplified for clarity. Common text stays whole; numbers fragment. The blue pieces show "4471" broken into two tokens — the first hint of why models struggle with exact strings and arithmetic.
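
The training recipe fits in a few lines. What follows is a minimal sketch, not a production tokenizer: it merges characters rather than raw bytes, skips the pre-splitting that real tokenizers do, and learns from a made-up six-word corpus. The names train_bpe, apply_merge, and encode are ours, chosen for illustration.

```python
from collections import Counter

# Hypothetical toy corpus; real vocabularies are learned from vast text.
CORPUS = ["renews", "renew", "rent", "read", "red", "news"]


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged piece."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # start with single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq          # count every adjacent pair
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair wins
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges


def encode(word, merges):
    """Cut a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = train_bpe(CORPUS, 3)
print(merges)
print(encode("renews", merges))
```

On this corpus the first merge is "r"+"e", because that pair appears in five of the six words. A word the corpus never contained still encodes: it simply stays in smaller pieces.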

From pieces to integers

Each piece in the vocabulary has a fixed ID. The sentence becomes a list of integers, and this list — nothing else — is what enters the model.

Token IDs · what the model actually receives: [ 17341, 6253, 5541, 1156, 2503, 389, 262, 717, 286, 2805, 13 ]
~1.3: Tokens per English word; about four characters each. Code, other languages, and numbers run higher. [3]
Per token: Every API bill is denominated in tokens, with input and output priced separately and quoted per million. Waste tokens and the bill grows. [3]
Frozen: The vocabulary is fixed before launch. Your part numbers and acronyms get cut by merges learned from the public web. Good retrieval design accounts for this.
Same 100 words. Very different bills. Approximate tokens per 100 words. You pay per token, so the content type sets the price.
Plain English: ~130 (cheapest)
Spanish / French: ~165
Source code: ~210
JSON & numbers: ~265 (most expensive)
Fig. 1b: Tokens are the unit of cost. English packs the most meaning per token. Other languages, code, and especially raw numbers fragment into more pieces, so the same idea costs more to send. [3] Numbers are illustrative; the pattern is real.
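
The arithmetic behind the bill is short enough to check by hand. This sketch uses the illustrative tokens-per-100-words figures from the chart; the price of $3.00 per million tokens is a placeholder we made up, not any vendor's rate.

```python
# Illustrative ratios from the chart above (tokens per 100 words).
TOKENS_PER_100_WORDS = {
    "plain_english": 130,
    "spanish_french": 165,
    "source_code": 210,
    "json_numbers": 265,
}


def estimate_cost(words, content_type, usd_per_million_tokens):
    """Return (estimated tokens, estimated cost in USD) for a word count."""
    tokens = words * TOKENS_PER_100_WORDS[content_type] / 100
    cost = tokens * usd_per_million_tokens / 1_000_000
    return tokens, cost


HYPOTHETICAL_PRICE = 3.00  # USD per million input tokens; an assumption

for kind in TOKENS_PER_100_WORDS:
    tokens, cost = estimate_cost(1_000_000, kind, HYPOTHETICAL_PRICE)
    print(f"{kind}: {tokens:,.0f} tokens, ${cost:.2f}")
```

At these ratios, a million words of JSON costs roughly twice what a million words of plain English does, at any price per token.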
The model never meets a word it cannot cut. That is the whole trick.

03  Vectorization · Embeddings

Then, the map. Meaning becomes geometry.

An integer ID says nothing about meaning. Token 5541 is not "more" than token 1156. So each token is traded for a long list of numbers — a vector. This is the embedding, and it is where meaning lives.

Things that mean similar things land near each other. That is the entire idea. "Renews," "extends," and "continues" sit close together in this space. "Terminates" and "cancels" sit close together too — somewhere else. The distance between points is the distance between meanings. Nobody hand-places a single point; the map is learned from how words are actually used.

There are two kinds worth knowing. Token embeddings live inside the model, one vector per vocabulary piece. Document embeddings are produced by a dedicated embedding model that reads a whole passage and emits one vector for the entire thing. These are what you store in a vector database. When we say "embed your documents," we mean this. Today's embedding models emit vectors of roughly one to three thousand numbers: OpenAI's text-embedding-3-large defaults to 3,072, and many models let you shorten them. [4]

One sentence → one vector (3,072 dims, truncated): embed("Contract 4471 renews on the first of March.") → [ 0.021, -0.184, 0.077, 0.305, -0.042, … ]
The map of meaning. Points on the map: renews, extends, continues; terminates, cancels; contract, agreement; March, spring. Far apart = opposite intent.
Cosine similarity: 1.0 = same direction, 0 = unrelated. renews · extends: 0.91. renews · cancels: 0.18. Scores shown are illustrative. The clusters, though, are real.
Fig. 2 — A 3,072-dimension space, flattened to two so human eyes can see it. Closeness is measured with cosine similarity. The flattening is a courtesy; the geometry is what the machine actually searches.
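
Cosine similarity itself is one line of arithmetic: the dot product of two vectors, divided by the product of their lengths. The sketch below uses three-number vectors we invented for illustration; real embeddings come from a trained model and run to thousands of numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not output from any real embedding model.
renews = [0.9, 0.1, 0.2]
extends = [0.8, 0.2, 0.3]
cancels = [-0.7, 0.6, 0.1]

print(cosine_similarity(renews, extends))  # close to 1: similar direction
print(cosine_similarity(renews, cancels))  # much lower: different direction
```

Because the score depends only on direction, a long document and a short query can still match closely.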

How long should the list be?

More numbers means finer meaning, and more storage to pay for, on every chunk, forever. Modern embedding models let you trade one for the other. Trained with a method called Matryoshka Representation Learning, they pack the most important meaning into the first numbers of each vector, so you can keep a shorter slice and lose little. [4] Pick the length that fits your budget, not the longest one on offer.

Shorter vectors, almost the same quality. OpenAI text-embedding-3-large, truncated. Storage falls 12×; search quality barely moves.
Dimensions | Relative storage | Quality | Note
3,072 | 12 KB | 100% | full length
1,024 | 4 KB | ~99% | the sweet spot
256 | 1 KB | ~97% | tiny & fast
1,536 | 6 KB | beaten by 256 | old model (ada-002)
Fig. 2b: Length is a dial, not a fixed cost. Cutting a 3,072-number vector to 256 shrinks storage 12× while quality slips only a few points, and the 256-number slice still beats a whole older model. [4] Quality figures illustrative; the tradeoff is documented.
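
Shortening a vector is mechanical: keep the first numbers, then rescale so the slice has length 1 again. This is a sketch under two assumptions: the embedding model was trained so that its leading numbers carry the most meaning, and each number is stored as a 4-byte float. The function names are ours.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` numbers and rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    return [x / norm for x in head]


def storage_bytes(dims, bytes_per_number=4):
    """Storage for one vector, assuming 4-byte floats."""
    return dims * bytes_per_number


full = [0.021, -0.184, 0.077, 0.305, -0.042, 0.110]  # stand-in for 3,072 numbers
short = truncate_embedding(full, 3)

print(short)
print(storage_bytes(3072), storage_bytes(256))
print(storage_bytes(3072) // storage_bytes(256))
```

The rescaling matters. Cosine similarity ignores length, but many vector indexes assume unit-length input and use a plain dot product.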
A database stores what you said. An embedding stores what you meant.

04  Indexing · Vector vs. Structured

Two libraries. Two ways of finding.

A structured database is a filing cabinet. You find things by their labels — exact, fast, and unforgiving. A vector database is a map of meaning. You find things by their neighbours. An enterprise needs both, and the failures we see most often come from asking one to do the other's job.

The filing cabinet: structured (SQL)
SELECT * FROM contracts WHERE id = 4471;
id | customer | renewal | value
4469 | Hargrove Mfg. | 2026-01-15 | $1.2M
4471 | Calder Health | 2026-03-01 | $3.4M
4472 | Pine Logistics | 2026-06-30 | $640K
Finds the exact row, every time, in microseconds. Index: B-tree (sorted labels, binary search). Fails when you ask "which deals feel at risk?" There is no WHERE clause for a feeling.
The map: vector (semantic)
query("upcoming renewals at risk") → top 3
Finds the closest meanings (renewal notes, churn signals) even when no word matches. Index: HNSW graph, approximate nearest neighbour in milliseconds across a billion vectors.
Fig. 3: Precision by label, or proximity by meaning. The green points are the three nearest neighbours. Nobody typed the word "risk" in those documents. The geometry found them anyway. [6]
Dimension | Structured (SQL) | Vector
How you ask | Exact predicates: WHERE id = 4471 | "Things like this": a query vector
What "match" means | Equality. True or false. No middle. | Similarity. A score from 0 to 1. All middle.
Index under the hood | B-tree, hash: sorted exactness | HNSW, IVF: navigable neighbourhoods [6]
A wrong answer looks like | An empty result. A loud failure. | A plausible-but-off neighbour. A quiet one.
Built for | Transactions, joins, totals, audit | Search, recommendation, RAG retrieval
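
The filing-cabinet half is easy to try. This sketch loads the three example rows into an in-memory SQLite database from Python's standard library. The column name value_usd and the helper find_contract are our own choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts ("
    "id INTEGER PRIMARY KEY, customer TEXT, renewal TEXT, value_usd INTEGER)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?)",
    [
        (4469, "Hargrove Mfg.", "2026-01-15", 1_200_000),
        (4471, "Calder Health", "2026-03-01", 3_400_000),
        (4472, "Pine Logistics", "2026-06-30", 640_000),
    ],
)


def find_contract(contract_id):
    """Exact lookup by label: one row, or None when nothing matches."""
    return conn.execute(
        "SELECT id, customer, renewal, value_usd FROM contracts WHERE id = ?",
        (contract_id,),
    ).fetchone()


print(find_contract(4471))
print(find_contract(9999))  # no such contract: an empty, loud result
```

The miss returns nothing at all. A vector search asked the same question would return its nearest neighbour, right or not.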
Use SQL for what is true. Use vectors for what is relevant. Never confuse the two.

05  Retrieval-Augmented Generation

RAG: the open-book exam.

A model knows what it was trained on, and nothing after, and nothing of yours. RAG fixes that without retraining anything. Before the model answers, the system finds the right pages (from your documents, your policies, your contracts) and lays them in front of it. Then the model answers with the book open. [5]

INGESTION (done once, refreshed on change)
1 · Documents: contracts · SOPs · tickets
2 · Chunk: split into short passages
3 · Embed: each chunk → one vector
4 · Vector database: indexed · HNSW · with metadata
QUERY (every time someone asks)
5 · The question: "When does Calder Health renew?"
6 · Embed it: same model as step 3
7 · Search the index: top-k nearest + rerank
8 · The evidence: 3–8 best chunks, cited
9 · The model, book open: prompt = instructions + question + retrieved evidence. It does not recall. It reads.
10 · "Calder Health renews March 1, 2026." With a citation to the exact contract page. Verifiable. Auditable.
Fig. 4 — The full RAG pipeline. Two paths: ingestion (top, runs when documents change) and query (middle, runs on every question). They meet at the model. The answer carries its sources with it.
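
Both paths can be sketched end to end in plain Python. Treat this as a toy under loud assumptions: embed() counts words instead of calling a real embedding model, the index is a list rather than a vector database, there is no reranker, and the documents and source labels are invented. The last step, sending the prompt to a model, is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in for an embedding model: word counts, not learned meaning."""
    return Counter(re.findall(r"\w+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Ingestion: hypothetical chunks, each carrying its source as metadata.
DOCUMENTS = [
    {"source": "contract-4469, p.1", "text": "Hargrove Mfg. renews on 2026-01-15."},
    {"source": "contract-4471, p.1", "text": "Calder Health renews on 2026-03-01."},
    {"source": "contract-4472, p.1", "text": "Pine Logistics renews on 2026-06-30."},
]
INDEX = [(embed(doc["text"]), doc) for doc in DOCUMENTS]


def retrieve(question, k=2):
    """Query path: embed the question, return the k closest chunks."""
    q = embed(question)
    ranked = sorted(INDEX, key=lambda item: similarity(q, item[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]


def build_prompt(question, evidence):
    """Prompt = instructions + retrieved evidence + question."""
    lines = ["Answer using only the evidence below. Cite the source."]
    lines += [f"[{doc['source']}] {doc['text']}" for doc in evidence]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


question = "When does Calder Health renew?"
print(build_prompt(question, retrieve(question)))
```

Note what the toy gets wrong. Word counts cannot tell that "renew" and "renews" are related; a learned embedding can. The pipeline shape is the same either way.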
Fresh without retraining.

Update a document, re-embed the chunk, and the next answer reflects it. Minutes, not months.

Answers you can audit.

Every claim points back to a chunk, and every chunk to a page. When the auditor asks "how do you know?" — you show them.

Your data stays yours.

Nothing is baked into model weights. Permissions can be enforced at retrieval time, before chunks reach the prompt.
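
Enforcing permissions at retrieval time can be as plain as a filter that runs before the prompt is built. A minimal sketch, assuming each chunk carries an allowed_groups field in its metadata; that field name, the group names, and the chunks are hypothetical.

```python
def filter_by_permission(chunks, user_groups):
    """Keep only chunks the asking user is entitled to read."""
    allowed = set(user_groups)
    return [c for c in chunks if allowed & set(c["allowed_groups"])]


# Hypothetical retrieved chunks with access metadata attached at ingestion.
RETRIEVED = [
    {"id": "c1", "text": "Calder Health renewal terms", "allowed_groups": ["sales", "legal"]},
    {"id": "c2", "text": "Calder Health legal hold memo", "allowed_groups": ["legal"]},
    {"id": "c3", "text": "Public pricing sheet", "allowed_groups": ["everyone", "sales"]},
]

visible = filter_by_permission(RETRIEVED, ["sales"])
print([c["id"] for c in visible])
```

The filter runs on every query, so a revoked permission takes effect on the next question, with no retraining and no re-indexing.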

06  Training vs. Retrieval

School is not the same as an open book.

People say "train the model on our data" when they almost always mean "let the model retrieve our data." The difference is not pedantry. It is millions of dollars and months of calendar, spent or saved.

Pre-training is the long game: the model reads trillions of tokens and guesses the next one, billions of times, nudging billions of weights. Months of compute, tens of millions of dollars. What comes out knows language, facts, and reasoning — frozen at a point in time.

Fine-tuning nudges an existing model on your own examples. It is good for tone, format, and narrow skills. It is poor for facts: knowledge stuffed into weights cannot be updated, cited, or access-controlled, and it can leak. RAG changes nothing in the model. It changes only what the model can see this turn.

Training changes what the model is. Retrieval changes what the model can see.
Question | Pre-train | Fine-tune | RAG
What changes | All weights | Some weights | Nothing (context only)
Best for | Building the model | Tone, format, skills | Facts, documents, freshness
Time to update | Months | Days to weeks | Minutes
Cost profile | Tens of millions | Thousands per run | Pennies per query
Can it cite? | No | No | Yes, by design
Honours permissions? | No | No | Yes, filter at retrieval

The honest rule we give every client: fine-tune for behaviour, retrieve for knowledge, and pre-train never — unless you are a frontier lab, and you are not paying frontier-lab bills to find out.

07  Model Context Protocol

MCP: one plug for every tool.

An API is a door built for developers. A person reads the docs, writes the glue code, ships it, and maintains it forever. MCP is a door built for models. The model walks up, asks the door what it does, and uses it at runtime, with no custom glue. Same buildings. Different visitor. [11]

Before: the N × M tangle. Assistant A, Assistant B, and Agent C are each wired to CRM, ERP, and Warehouse. 3 clients × 3 systems = 9 integrations. Scale to 10 × 50. Someone maintains 500.
After: N + M through one protocol. Assistant A, Assistant B, Agent C → MCP → CRM server, ERP server, Warehouse. 3 + 3 = 6 connections. Each side built once. Think USB-C: one port, every device.
Fig. 5 — The N×M tangle, collapsed. Before, every line is code somebody wrote and maintains. After, one open standard sits between every AI client and every system. Build a server once; every MCP-speaking model can use it.

What an MCP server offers

Tools

Actions the model may take — query_contracts, create_ticket, send_summary. Each declares its name, what it does, and the exact shape of its inputs.

Resources

Things the model may read — files, records, schemas. Context the server is willing to expose, addressed cleanly and governed by the server.

Prompts

Reusable instruction templates the server provides — the playbooks a team has already proven, offered to the model as starting points.

Be precise here, because vendors will not be: MCP does not replace your APIs. It stands in front of them as a universal adapter that lets any model discover and use them safely, with your permissions and your audit trail intact. [11]
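
A tool declaration is just structured data. The sketch below shows roughly what a server might advertise for query_contracts. The field names name, description, and inputSchema follow the tool shape in the public MCP specification as we understand it; the renewal_month parameter and the missing_arguments helper are hypothetical, and no MCP SDK is used here.

```python
# One tool, declared as plain data. The input shape is a JSON Schema object.
QUERY_CONTRACTS_TOOL = {
    "name": "query_contracts",
    "description": "Look up contracts that renew in a given month.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "renewal_month": {"type": "string", "description": "Month as YYYY-MM"},
        },
        "required": ["renewal_month"],
    },
}


def missing_arguments(tool, arguments):
    """Names of required inputs the caller did not supply."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


print(missing_arguments(QUERY_CONTRACTS_TOOL, {"renewal_month": "2026-03"}))
print(missing_arguments(QUERY_CONTRACTS_TOOL, {}))
```

Because the declaration is data, the model can read it at runtime, and the server can reject a malformed call before it touches the system behind it.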

08  Agentic RAG · The Full System

Now, the whole machine — assembled.

Plain RAG retrieves once and answers. An agent thinks in a loop: plan, act, observe, and decide whether it is done. [10] Give that loop a vector database for meaning, MCP tools for facts and actions, and guardrails around everything, and you have agentic RAG. This is the architecture behind every serious enterprise deployment we run.

TRIGGERS (human or event)
Human-driven: "Which contracts renew in March, and which are at risk?"
Event-driven: webhook: contract updated · cron: 06:00 renewal sweep
Guardrails (at the gate): identity · permissions · input checks · PII handling · scope of allowed actions
The agent loop: Plan → Act → Observe. Loop until done, or until budget, depth, or policy says stop.
KNOWLEDGE (what is relevant): Vector database. Semantic search over contracts, emails, notes, policies. Finds: "these 6 accounts read like churn"
FACTS & ACTIONS (via MCP): MCP servers → structured systems. SQL warehouse · CRM · ERP · ticketing · calendar · web search. Confirms: "4471 renews 2026-03-01, $3.4M"
The cited deliverable: "Six March renewals. Two at risk: Calder Health ($3.4M) shows three unanswered escalations." Pushed to dashboard · Slack · a ticket, awaiting human approval.
Observability (sees everything): every plan, tool call, token, and answer traced and logged. Chapter 9 opens the hood.
Fig. 6 — Agentic RAG, end to end. Triggers enter through guardrails. The loop plans, then reaches left for meaning and right for truth, as many times as the task demands. Results land where people already work. Everything is traced.
Walk it through — a human asks.

Plan: "I need March renewals, then risk signals." Act: MCP → SQL for the exact list. Observe: six contracts. Act again: vector search across emails for churn language. Observe: two run hot. Done: a cited answer, with dollar figures from the system of record — never from the model's memory.

Walk it through — nobody asks.

The 06:00 sweep fires. The same loop runs against every account renewing in 90 days, scores risk, files its evidence, and posts a digest. A human wakes to a short list and a recommendation — and the agent waits for approval before touching the CRM. Autonomy for reading. Permission for writing.
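
The loop itself is small. In this sketch a scripted function stands in for the model's planning step, and two stub functions stand in for MCP tools; all of the names and the data they return are hypothetical. What the sketch does show faithfully is the shape: plan, act, observe, and stop when done or when the step budget runs out.

```python
def run_agent(goal, plan_next, tools, max_steps=5):
    """Plan, act, observe; stop when the planner is done or the budget is spent."""
    observations = []
    for _ in range(max_steps):
        action = plan_next(goal, observations)        # Plan
        if action is None:
            return {"status": "done", "observations": observations}
        tool_name, arguments = action
        result = tools[tool_name](**arguments)        # Act
        observations.append((tool_name, result))      # Observe
    return {"status": "budget_exhausted", "observations": observations}


# Stub tools with made-up data, standing in for SQL and vector search.
def list_renewals(month):
    return [4471, 4480]


def search_risk(contract_ids):
    return [cid for cid in contract_ids if cid == 4471]


def scripted_planner(goal, observations):
    """Stand-in for the model: a fixed two-step plan."""
    if not observations:
        return ("list_renewals", {"month": "2026-03"})
    if len(observations) == 1:
        return ("search_risk", {"contract_ids": observations[0][1]})
    return None  # nothing left to do


TOOLS = {"list_renewals": list_renewals, "search_risk": search_risk}
result = run_agent("March renewals at risk", scripted_planner, TOOLS)
print(result)
```

Only read-only tools appear in the dictionary. Under the pattern described above, a write such as a CRM update would sit behind a human approval step rather than in the loop.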

09  Observability · Traces · Evals

If you cannot see it, you cannot trust it.

A traditional program fails loudly. An AI system fails politely — fluent, confident, and wrong. So observability is not a nice-to-have bolted on at the end. It is the difference between a demo and a system you bet the quarter on. Four things to watch: traces, logs, evals, and dashboards.

Trace a3f9-4471 · "March renewals at risk" · total 6.4s · $0.038
guardrail.input: 0.06s
agent.plan: 0.5s · 412 tokens
mcp.sql.query_contracts: 0.4s · 6 rows
vector.search ×6: 0.9s · 31 chunks · rerank → 9
agent.observe + plan: 0.5s
llm.generate (cited): 3.7s · 1,847 tokens
guardrail.output: 0.1s
deliver.dashboard+slack: 0.2s
Every span carries its inputs, outputs, latency, and token cost. When an answer is wrong, you do not guess where. You look.
Fig. 7a — A span waterfall. This is what "explainable" actually looks like in production: not a philosophy, a timeline. Traces, logs, and scored evals together tell you the system is fast, lawful, and right.
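
A trace is just a list of timed spans, so the first questions (how long, and where did the time go) are a sum and a max. The sketch uses the span durations from the waterfall above; the tuple layout is our own simplification, not any tracing library's format.

```python
# (span name, seconds), copied from the example trace a3f9-4471.
SPANS = [
    ("guardrail.input", 0.06),
    ("agent.plan", 0.5),
    ("mcp.sql.query_contracts", 0.4),
    ("vector.search", 0.9),
    ("agent.observe + plan", 0.5),
    ("llm.generate", 3.7),
    ("guardrail.output", 0.1),
    ("deliver.dashboard+slack", 0.2),
]

total_seconds = sum(seconds for _, seconds in SPANS)
slowest_name, slowest_seconds = max(SPANS, key=lambda span: span[1])
slowest_share = slowest_seconds / total_seconds

print(f"total: {total_seconds:.2f}s")
print(f"slowest: {slowest_name} at {slowest_share:.0%} of the trace")
```

The spans sum to 6.36 seconds, which the trace header rounds to 6.4. Generation accounts for more than half of it, so that is where tuning would start.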

Latency tells you the system is fast. Only evaluation tells you it is right. Mature teams run a scored test suite on every change — the way software teams run unit tests. They grade groundedness (is every claim backed by a retrieved chunk?), retrieval quality (did the right chunks come back?), answer quality (correct, complete, in the asked-for format), and safety (PII handled, permissions honoured).
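
One of those grades can be sketched in a few lines. Here groundedness is scored as the share of claims whose cited chunk was actually among the retrieved chunks. That is a deliberately crude stand-in: it checks that a citation exists and points somewhere real, not that the chunk's text supports the claim. The data and field names are hypothetical.

```python
def groundedness(claims, retrieved_ids):
    """Fraction of claims that cite a chunk present in the retrieved set."""
    if not claims:
        return 0.0
    retrieved = set(retrieved_ids)
    backed = sum(1 for claim in claims if claim.get("cites") in retrieved)
    return backed / len(claims)


# Hypothetical answer, broken into claims with their citations.
CLAIMS = [
    {"text": "Contract 4471 renews 2026-03-01.", "cites": "c1"},
    {"text": "The contract is worth $3.4M.", "cites": "c1"},
    {"text": "The customer plans to expand.", "cites": None},  # uncited
]

score = groundedness(CLAIMS, retrieved_ids=["c1", "c2"])
print(f"groundedness: {score:.2f}")
```

Run on every change, a score like this behaves like a unit test: a drop flags the change that caused it.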

A demo is judged by its best answer. A system is judged by its worst.
Twelve weeks in production, Week 1 to Week 12: Groundedness ↑ · Escalations ↓
Fig. 7b — The two lines every executive should ask for. Answers getting more grounded; humans needed less often. When both move the right way, trust is earned — and measured.

10  The Fine Print

What most explanations skip. We will not.

The diagrams above are honest, but production systems live or die on details that rarely make the slide. Here are six that decide whether yours works.

1. Chunking is a design decision.

Cut documents wrong and retrieval finds fragments without their context: a renewal clause severed from its contract number. Chunk by structure, overlap the edges, carry metadata. Most "RAG doesn't work" complaints are chunking complaints. [7]

2. Hybrid search beats either alone.

Vectors miss exact strings: part numbers, names, "4471." Keyword search (BM25) misses paraphrase. Run both, fuse the results, then let a reranker put the truly best chunks on top. This moves quality more than any model swap. [7]

3. The context window is a budget.

Frontier models now hold around a million tokens, and Meta's Llama 4 Scout advertises ten million. [9] But attention degrades in the middle of a stuffed prompt, so more chunks are not better. The best chunks, ordered well, are better. Spend tokens like dollars. [8][12]

4. Deterministic where it counts.

Models classify, summarize, and route. They do not do arithmetic, and they do not enforce policy. Totals come from SQL. Approvals come from code. The model decides which calculation to run — never the answer to it.

5. Permissions travel with the data.

Retrieval must filter by who is asking, before the chunks reach the prompt. An agent with every employee's access is a breach with good manners. Scope the tools. Scope the indexes. Log the denials too.

6. Humans stay in the loop on writes.

Reading at machine speed is leverage. Writing at machine speed is risk. The pattern that survives audits: agents draft, humans approve, systems record. Loosen it only where the evidence says you can.
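
The first of the six, chunking, is the easiest to get concrete about. This sketch cuts by word count with an overlap and copies metadata onto every chunk. It is a simplification: the advice above is to chunk by structure (clauses, sections), which depends on the document format, so word windows stand in for it here. The function name, sizes, and metadata fields are ours.

```python
def chunk_words(text, size, overlap, metadata):
    """Split text into word windows that share `overlap` words at each edge."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        # Every chunk carries its metadata, so context survives the cut.
        chunks.append({"text": " ".join(window), "start_word": start, **metadata})
        if start + size >= len(words):
            break
    return chunks


clause = (
    "Contract 4471 between Calder Health and the supplier renews on the "
    "first of March unless either party gives ninety days written notice"
)
for chunk in chunk_words(clause, size=8, overlap=2, metadata={"contract_id": 4471}):
    print(chunk)
```

Even when a window starts mid-clause, the contract_id travels with it, which is the failure the first item warns about.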

The biggest quality jump is free of the model. Retrieval accuracy on one benchmark. Fixing how you search beats swapping the model.
Keyword (BM25 only): 62%
Vector (dense only): 74%
Hybrid (both, fused): 84%
+ Reranker (best chunks on top): 91%
+29 points: same model, better retrieval
Fig. 8: Hybrid search, then a reranker. Keyword finds exact strings; vectors find meaning; fusing both and reranking puts the truly best chunks on top, lifting accuracy from 62% to 91% on this benchmark without touching the model. [7][13]
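
The fusing step needs a rule for combining two ranked lists. One common choice is reciprocal rank fusion, sketched below; the guide does not say which fusion method the benchmark used, so take this as one reasonable option, not the method behind the figures. The chunk ids are made up, and k=60 is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to an item's score."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the same query from two retrievers.
keyword_hits = ["c4471", "c12", "c7"]   # BM25: exact string "4471" ranks first
vector_hits = ["c9", "c4471", "c12"]    # dense: paraphrases rank well

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

A chunk that both retrievers rank highly rises to the top, even if neither ranked it first. A reranker would then reorder this fused shortlist.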

11  The sentence, delivered

One sentence went in. A system came out.

It was cut into tokens. Mapped into meaning. Indexed for finding. Retrieved with evidence. Reasoned over, acted upon, measured, and trusted. That is the whole stack — and now you have walked it end to end.

Where it ended up
✓ Calder Health · Contract 4471 · renews 2026-03-01 · $3.4M
✓ Risk: elevated, 3 unanswered escalations (cited)
✓ Ticket drafted · awaiting human approval · trace a3f9-4471

The hard part was never the concepts. It is the judgment — what to retrieve, what to govern, what to automate, and what to leave alone. That judgment is what BlueAlly brings to the table.

12  Sources

Where this comes from

Every factual claim above is drawn from a primary source — model cards, papers, and the official documentation of the labs that build these systems. Specific cosine scores and timings are illustrative; the mechanisms are not.

  1. Hugging Face, "Byte-Pair Encoding tokenization," LLM Course. huggingface.co/learn/llm-course/chapter6/5
  2. OpenAI, "tiktoken" — BPE tokenizer (cl100k_base / o200k_base). github.com/openai/tiktoken
  3. OpenAI Help Center, "What are tokens and how to count them." help.openai.com/en/articles/4936856
  4. OpenAI, "New embedding models and API updates" (text-embedding-3-large, 3,072 dims). openai.com/index/new-embedding-models-and-api-updates
  5. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," arXiv:2005.11401. arxiv.org/abs/2005.11401
  6. Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using HNSW graphs," arXiv:1603.09320. arxiv.org/abs/1603.09320
  7. Anthropic, "Introducing Contextual Retrieval." anthropic.com/news/contextual-retrieval
  8. Anthropic, "Claude — Models overview" (context windows; Claude Opus 4.8, Fable 5). platform.claude.com/docs/en/about-claude/models/overview
  9. Meta AI, "The Llama 4 herd" (Scout 10M-token context). ai.meta.com/blog/llama-4-multimodal-intelligence
  10. Anthropic, "Building Effective Agents." anthropic.com/research/building-effective-agents
  11. Anthropic, "Introducing the Model Context Protocol." anthropic.com/news/model-context-protocol
  12. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172. arxiv.org/abs/2307.03172
  13. Bronck, P., "Better RAG Accuracy with Hybrid BM25 + Dense Vector Search" (hybrid + reranking, 62% → 91%). medium.com/@pbronck/better-rag-accuracy