How the Machine Reads — A BlueAlly Field Guide

01 The premise

A model does not read words. It never has.

The model you are about to buy reads zero words. It reads numbers — and only numbers. The sentence on the page becomes a list of integers before the model sees a thing. Everything you call "AI" is the work of turning your language into numbers honestly, and turning the numbers back into something true.

That gap is where the money is made or lost. The model is rarely the problem. The system around it — how text is cut, mapped, stored, found, and governed — decides whether your AI ships or stalls. So this guide skips the hype and follows the wiring.

We take one sentence. A real one, the kind that lives in your contracts, your tickets, your claims files: "Contract 4471 renews on the first of March." Watch what the machine does to it. By the end you will know how it is cut, counted, mapped, stored, found, and acted upon — and where each piece of your own infrastructure belongs.

0

Words the model reads. It works only in numbers: tokens, then vectors. The translation is the whole game.

~1M

Tokens a frontier model now holds in context at once — Claude Opus 4.8, GPT-5.5, Gemini 3 all near a million.⁸

7

Stages between raw text and a governed answer. Every AI system you buy is some arrangement of these seven.

In plain English

Five words to carry the whole guide

Model: A program that predicts text. Give it words; it returns the words that should come next. That is the whole trick — and it is enough to do remarkable things.
Token: A chunk of text, a few characters long. Models read and write in tokens, and you pay by the token.
Embedding (vector): A long list of numbers that stands for a piece of text. Similar meanings get similar lists. Meaning, made measurable.
Retrieval (RAG): Finding the right pages in your own documents and handing them to the model before it answers. An open-book exam.
Agent: A model that acts in a loop — it plans, calls a tool, reads the result, and decides what to do next, until the job is done.¹⁰

Fig. 0 — Seven stages, one sentence. The same sentence, transformed step by step: words to tokens to vectors, stored, found, reasoned over, and delivered with its source attached. Every enterprise AI system you will ever buy or build is some arrangement of these stages. The rest of this guide walks them in order.

02 Tokenization · Byte-Pair Encoding

First, the knife. The sentence is cut.

Before a model can think about a sentence, the sentence must become a sequence of integers. The cutting is called tokenization, and the most common knife is Byte-Pair Encoding — BPE. It is old, it is simple, and it works.

BPE is trained once, on a mountain of text, before the model ever sees your data. The recipe is short. Start with single characters. Count every adjacent pair. Merge the most frequent pair into one new piece, and add it to the vocabulary. Repeat — tens of thousands of times — until the vocabulary is full.¹
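
The recipe above can be sketched in a few lines. This is a toy, character-level version for illustration only: the word list, the function names, and the tie-breaking rule are assumptions, and production tokenizers work on bytes over far larger corpora.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a toy word list."""
    corpus = Counter(tuple(word) for word in words)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq  # count every adjacent pair
        if not pairs:
            break  # every word is already a single piece
        # Most frequent pair wins; ties go to the pair that sorts last (an
        # arbitrary choice made here so that repeated runs give the same result).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Cut a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical training words, chosen to echo the guide's example sentence.
MERGES = train_bpe(["renew", "renews", "renewal", "new"], num_merges=4)
```

With these four merges, encode("renews", MERGES) yields two pieces, "renew" and "s", while encode("4471", MERGES) falls back to four single characters, because no merge was ever learned for digits.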

The result is a fixed vocabulary, today roughly 100,000 to 200,000 pieces.² Common words survive whole. Rare words break into familiar fragments. And because modern BPE works on raw bytes, nothing is ever truly out-of-vocabulary — in the worst case, a word falls all the way back to single bytes.¹

Fig. 1 — BPE merges, simplified for clarity. Common text stays whole; numbers fragment. The blue pieces show "4471" broken into two tokens — the first hint of why models struggle with exact strings and arithmetic.

From pieces to integers

Each piece in the vocabulary has a fixed ID. The sentence becomes a list of integers, and this list — nothing else — is what enters the model.

Token IDs · what the model actually receives

[ 17341, 6253, 5541, 1156, 2503, 389, 262, 717, 286, 2805, 13 ]

~1.3

Tokens per English word; about four characters each. Code, other languages, and numbers run higher.³

Per token

Every API bill is denominated in tokens — input and output priced separately, quoted per million. Waste tokens and the bill grows.³
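
The billing arithmetic is simple enough to sketch. The prices below are placeholders, not any vendor's rate card, and the four-characters-per-token rule is only the rough average quoted above; a real count comes from the model's own tokenizer.

```python
def rough_token_count(text, chars_per_token=4):
    """Rule-of-thumb estimate: about four characters per token (ceiling division)."""
    return -(-len(text) // chars_per_token)

def estimate_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost in dollars when input and output are priced separately per million tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

SENTENCE = "Contract 4471 renews on the first of March."
ESTIMATED_TOKENS = rough_token_count(SENTENCE)

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
COST_PER_CALL = estimate_cost(2_000, 500, in_price_per_m=3.0, out_price_per_m=15.0)
```

At those assumed prices, a call with 2,000 input tokens and 500 output tokens costs a little over a cent. Multiply by call volume to see why trimmed prompts matter.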

Frozen

The vocabulary is fixed before launch. Your part numbers and acronyms get cut by merges learned from the public web, not from your data. Good retrieval is designed with this in mind.

Fig. 1b — Tokens are the unit of cost. English packs the most meaning per token. Other languages, code, and especially raw numbers fragment into more pieces, so the same idea costs more to send.³ The numbers are illustrative; the pattern is real.

The model never meets a word it cannot cut. That is the whole trick.

03 Vectorization · Embeddings

Then, the map. Meaning becomes geometry.

An integer ID says nothing about meaning. Token 5541 is not "more" than token 1156. So each token is traded for a long list of numbers — a vector. This is the embedding, and it is where meaning lives.

Things that mean similar things land near each other. That is the entire idea. "Renews," "extends," and "continues" sit close together in this space. "Terminates" and "cancels" sit close together too — somewhere else. The distance between points is the distance between meanings. Nobody hand-places a single point; the map is learned from how words are actually used.

There are two kinds worth knowing. Token embeddings live inside the model, one vector per vocabulary piece. Document embeddings are produced by a dedicated embedding model that reads a whole passage and emits one vector for the entire thing. These are what you store in a vector database. When we say "embed your documents," we mean this. Today's embedding models emit vectors of roughly one to three thousand numbers — OpenAI's text-embedding-3-large defaults to 3,072, and many models let you shorten them.⁴

One sentence → one vector (3,072 dims, truncated)

embed("Contract 4471 renews on the first of March.") → [ 0.021, -0.184, 0.077, 0.305, -0.042, … ]

Fig. 2 — A 3,072-dimension space, flattened to two so human eyes can see it. Closeness is measured with cosine similarity. The flattening is a courtesy; the geometry is what the machine actually searches.
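
Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up three-number vectors in place of real embeddings, purely to show the calculation.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors; real embeddings have hundreds or thousands of numbers.
RENEWS = [0.9, 0.1, 0.0]
EXTENDS = [0.8, 0.2, 0.1]
TERMINATES = [-0.7, 0.3, 0.2]

NEAR = cosine_similarity(RENEWS, EXTENDS)     # close in meaning, high score
FAR = cosine_similarity(RENEWS, TERMINATES)   # opposite meaning, low score
```

The measure compares direction rather than length, which is why it is a common default for embeddings: two passages of different sizes that say the same thing should still point the same way.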

How long should the list be?

More numbers means finer meaning — and more storage to pay for, on every chunk, forever. Modern embedding models let you trade one for the other. Trained with a method called Matryoshka learning, they pack the most important meaning into the first numbers of each vector, so you can keep a shorter slice and lose little.⁴ Pick the length that fits your budget, not the longest one on offer.

Fig. 2b — Length is a dial, not a fixed cost. Cutting a 3,072-number vector to 256 shrinks storage 12× while quality slips only a few points — and the 256-number slice still beats a whole older model.⁴ Quality figures illustrative; the tradeoff is documented.
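
The storage side of the tradeoff is plain arithmetic, and the shortening step is a slice followed by a rescale. A minimal sketch, assuming 4-byte floats and a model that was actually trained to support truncation; slicing vectors from a model without that training will usually hurt quality far more.

```python
import math

def shorten(vector, dims):
    """Keep the first `dims` numbers, then rescale to unit length for cosine search."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)

def storage_bytes(num_vectors, dims, bytes_per_number=4):
    """Raw vector storage, ignoring index overhead (assumes 32-bit floats)."""
    return num_vectors * dims * bytes_per_number

FULL = storage_bytes(1_000_000, 3072)    # one million chunks at full length
SHORT = storage_bytes(1_000_000, 256)    # the same chunks at 256 numbers
SAVINGS_FACTOR = FULL / SHORT

# A stand-in vector; a real one would come from an embedding model.
EXAMPLE = [float(i % 7 - 3) for i in range(3072)]
EXAMPLE_SHORT = shorten(EXAMPLE, 256)
```

Under these assumptions a million chunks drop from roughly 12.3 GB to about 1 GB of raw vectors. Some embedding APIs apply the truncation for you when asked for a smaller size, so check before doing it by hand.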

A database stores what you said. An embedding stores what you meant.

04 Indexing · Vector vs. Structured

Two libraries. Two ways of finding.

A structured database is a filing cabinet. You find things by their labels — exact, fast, and unforgiving. A vector database is a map of meaning. You find things by their neighbours. An enterprise needs both, and the failures we see most often come from asking one to do the other's job.

Fig. 3 — Precision by label, or proximity by meaning. The green points are the three nearest neighbours. Nobody typed the word "risk" in those documents. The geometry found them anyway.⁶

Dimension	Structured (SQL)	Vector
How you ask	Exact predicates — WHERE id = 4471	"Things like this" — a query vector
What "match" means	Equality. True or false. No middle.	Similarity. A score from 0 to 1. All middle.
Index under the hood	B-tree, hash — sorted exactness	HNSW, IVF — navigable neighbourhoods⁶
A wrong answer looks like	An empty result. A loud failure.	A plausible-but-off neighbour. A quiet one.
Built for	Transactions, joins, totals, audit	Search, recommendation, RAG retrieval
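
The contrast in the table can be made concrete. The sketch below uses Python's built-in sqlite3 for the structured side and a brute-force scan for the vector side; a real vector database would use an approximate index such as HNSW. The table layout, document names, and two-number vectors are all invented for illustration.

```python
import math
import sqlite3

# Structured side: exact lookup by label.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contracts (id INTEGER PRIMARY KEY, customer TEXT, renews TEXT)")
db.execute("INSERT INTO contracts VALUES (4471, 'Calder Health', '2026-03-01')")

def find_contract(contract_id):
    """Return the matching row, or None when nothing matches: a loud failure."""
    return db.execute(
        "SELECT customer, renews FROM contracts WHERE id = ?", (contract_id,)
    ).fetchone()

# Vector side: nearest neighbours by cosine similarity.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

DOCS = {  # hypothetical document vectors
    "renewal notice": [0.9, 0.1],
    "escalation email": [0.2, 0.9],
    "invoice": [0.7, 0.6],
}

def nearest(query, docs, k=1):
    """Always returns k neighbours, however poor the match: a quiet failure."""
    ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
    return ranked[:k]
```

Asking for contract 4472 returns None, which is easy to detect. Asking the vector side an unrelated question still returns its closest document, which is why similarity results are usually paired with a score threshold or a reranker.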

Use SQL for what is true. Use vectors for what is relevant. Never confuse the two.

05 Retrieval-Augmented Generation

RAG: the open-book exam.

A model knows what it was trained on, and nothing after, and nothing of yours. RAG fixes that without retraining anything. Before the model answers, the system finds the right pages — from your documents, your policies, your contracts — and lays them in front of it. Then the model answers with the book open.⁵
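
The query path can be sketched without any model at all: find the best chunks, then assemble a prompt that carries their sources. Word overlap stands in for vector search here, and the chunk texts, source tags, and prompt wording are assumptions made for the example.

```python
import re

CHUNKS = [  # hypothetical chunks, each carrying the page it came from
    {"source": "policies/travel.pdf#p1", "text": "Travel expenses are reimbursed within 30 days."},
    {"source": "contracts/5120.pdf#p2", "text": "Contract 5120 renews on the first of June."},
    {"source": "contracts/4471.pdf#p3", "text": "Contract 4471 renews on the first of March."},
]

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=2):
    """Rank chunks by shared words (a stand-in for embedding similarity)."""
    ranked = sorted(chunks, key=lambda c: len(words(question) & words(c["text"])), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Lay the retrieved pages in front of the model, each tagged with its source."""
    sources = "\n".join(f"[{c['source']}] {c['text']}" for c in retrieved)
    return (
        "Answer using only the sources below and cite them by tag.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

QUESTION = "When does contract 4471 renew?"
TOP = retrieve(QUESTION, CHUNKS, k=2)
PROMPT = build_prompt(QUESTION, TOP)
```

The prompt string is what would be sent to the model. Because every chunk keeps its source tag, the answer can point back to a page.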

Fig. 4 — The full RAG pipeline. Two paths: ingestion (top, runs when documents change) and query (middle, runs on every question). They meet at the model. The answer carries its sources with it.

Fresh without retraining.

Update a document, re-embed the chunk, and the next answer reflects it. Minutes, not months.

Answers you can audit.

Every claim points back to a chunk, and every chunk to a page. When the auditor asks "how do you know?" — you show them.

Your data stays yours.

Nothing is baked into model weights. Permissions can be enforced at retrieval time, before chunks reach the prompt.

06 Training vs. Retrieval

School is not the same as an open book.

People say "train the model on our data" when they almost always mean "let the model retrieve our data." The difference is not pedantry. It is millions of dollars and months of calendar, spent or saved.

Pre-training is the long game: the model reads trillions of tokens and guesses the next one, billions of times, nudging billions of weights. Months of compute, tens of millions of dollars. What comes out knows language, facts, and reasoning — frozen at a point in time.

Fine-tuning nudges an existing model on your own examples. It is good for tone, format, and narrow skills. It is poor for facts: knowledge stuffed into weights cannot be updated, cited, or access-controlled, and it can leak. RAG changes nothing in the model. It changes only what the model can see this turn.

Training changes what the model is. Retrieval changes what the model can see.

Question	Pre-train	Fine-tune	RAG
What changes	All weights	Some weights	Nothing — context only
Best for	Building the model	Tone, format, skills	Facts, documents, freshness
Time to update	Months	Days to weeks	Minutes
Cost profile	Tens of millions	Thousands per run	Pennies per query
Can it cite?	No	No	Yes — by design
Honours permissions?	No	No	Yes — filter at retrieval

The honest rule we give every client: fine-tune for behaviour, retrieve for knowledge, and pre-train never — unless you are a frontier lab, and you are not paying frontier-lab bills to find out.

07 Model Context Protocol

MCP: one plug for every tool.

An API is a door built for developers. A person reads the docs, writes the glue code, ships it, and maintains it forever. MCP is a door built for models. The model walks up, asks the door what it does, and uses it — at runtime, with no custom glue. Same buildings. Different visitor.¹¹

Fig. 5 — The N×M tangle, collapsed. Before, every line is code somebody wrote and maintains. After, one open standard sits between every AI client and every system. Build a server once; every MCP-speaking model can use it.

What an MCP server offers

Tools

Actions the model may take — query_contracts, create_ticket, send_summary. Each declares its name, what it does, and the exact shape of its inputs.

Resources

Things the model may read — files, records, schemas. Context the server is willing to expose, addressed cleanly and governed by the server.

Prompts

Reusable instruction templates the server provides — the playbooks a team has already proven, offered to the model as starting points.

Be precise here, because vendors will not be: MCP does not replace your APIs. It stands in front of them — a universal adapter that lets any model discover and use them safely, with your permissions and your audit trail intact.¹¹
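
A tool declaration is small. The sketch below follows the general shape MCP uses when a server lists its tools (a name, a description, and a JSON Schema for the inputs), but the tool's fields and the hand-rolled checker are illustrative assumptions; consult the specification and an official SDK before building on it.

```python
QUERY_CONTRACTS_TOOL = {
    "name": "query_contracts",
    "description": "Return contracts renewing in a given month (YYYY-MM).",
    "inputSchema": {  # the exact shape of the inputs, as JSON Schema
        "type": "object",
        "properties": {
            "month": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["month"],
    },
}

_PYTHON_TYPES = {"string": str, "integer": int}

def check_arguments(tool, arguments):
    """Tiny stand-in for JSON Schema validation: required keys and basic types only."""
    schema = tool["inputSchema"]
    errors = []
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing: {name}")
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown: {name}")
        elif not isinstance(value, _PYTHON_TYPES[spec["type"]]):
            errors.append(f"wrong type: {name}")
    return errors
```

Because the declaration is data rather than documentation, a model can read it at runtime, and the server can reject a malformed call before it reaches the system behind it.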

08 Agentic RAG · The Full System

Now, the whole machine — assembled.

Plain RAG retrieves once and answers. An agent thinks in a loop: plan, act, observe, and decide whether it is done.¹⁰ Give that loop a vector database for meaning, MCP tools for facts and actions, and guardrails around everything — and you have agentic RAG. This is the architecture behind every serious enterprise deployment we run.
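
The loop itself is short. In the sketch below a scripted planner stands in for the model and two stub functions stand in for an MCP-backed SQL tool and a vector search; every name, contract number other than 4471, and return value is hypothetical.

```python
def run_agent(goal, plan_next, tools, max_steps=5):
    """Plan, act, observe, and repeat until the planner says it is done."""
    observations = []
    for _ in range(max_steps):
        step = plan_next(goal, observations)                  # plan
        if step["action"] == "finish":
            return {"answer": step["answer"], "trace": observations}
        result = tools[step["action"]](**step["args"])        # act
        observations.append({"action": step["action"], "result": result})  # observe
    raise RuntimeError("step budget exhausted before the planner finished")

def sql_renewals(month):
    """Stub for an exact query against the system of record."""
    return [4471, 4502, 4610, 4733, 4801, 4950]

def search_emails(query, contract_ids):
    """Stub for a vector search over correspondence."""
    return [4471, 4801]

TOOLS = {"sql_renewals": sql_renewals, "search_emails": search_emails}

def scripted_planner(goal, observations):
    """Stand-in for the model: a fixed plan instead of a learned one."""
    if len(observations) == 0:
        return {"action": "sql_renewals", "args": {"month": "2026-03"}}
    if len(observations) == 1:
        ids = observations[0]["result"]
        return {"action": "search_emails", "args": {"query": "churn", "contract_ids": ids}}
    return {
        "action": "finish",
        "answer": {"renewing": observations[0]["result"], "at_risk": observations[1]["result"]},
    }

RESULT = run_agent("Which March renewals are at risk?", scripted_planner, TOOLS)
```

The step budget is the simplest guardrail: a loop that can call tools should never be able to run forever. The returned trace is the raw material for the observability discussed later.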

Fig. 6 — Agentic RAG, end to end. Triggers enter through guardrails. The loop plans, then reaches left for meaning and right for truth, as many times as the task demands. Results land where people already work. Everything is traced.

Walk it through — a human asks.

Plan: "I need March renewals, then risk signals." Act: MCP → SQL for the exact list. Observe: six contracts. Act again: vector search across emails for churn language. Observe: two run hot. Done: a cited answer, with dollar figures from the system of record — never from the model's memory.

Walk it through — nobody asks.

The 06:00 sweep fires. The same loop runs against every account renewing in 90 days, scores risk, files its evidence, and posts a digest. A human wakes to a short list and a recommendation — and the agent waits for approval before touching the CRM. Autonomy for reading. Permission for writing.

09 Observability · Traces · Evals

If you cannot see it, you cannot trust it.

A traditional program fails loudly. An AI system fails politely — fluent, confident, and wrong. So observability is not a nice-to-have bolted on at the end. It is the difference between a demo and a system you bet the quarter on. Four things to watch: traces, logs, evals, and dashboards.

Fig. 7a — A span waterfall. This is what "explainable" actually looks like in production: not a philosophy, a timeline. Traces, logs, and scored evals together tell you the system is fast, lawful, and right.

Latency tells you the system is fast. Only evaluation tells you it is right. Mature teams run a scored test suite on every change — the way software teams run unit tests. They grade groundedness (is every claim backed by a retrieved chunk?), retrieval quality (did the right chunks come back?), answer quality (correct, complete, in the asked-for format), and safety (PII handled, permissions honoured).
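
Two of those grades reduce to simple ratios once each answer records which chunk backs each claim. A minimal sketch follows; the field names, thresholds, and test cases are assumptions, and real suites often add a model-based grader for answer quality.

```python
def groundedness(claims, retrieved_ids):
    """Share of claims whose cited chunk was actually retrieved."""
    if not claims:
        return 1.0
    backed = sum(1 for claim in claims if claim["source"] in retrieved_ids)
    return backed / len(claims)

def retrieval_recall(retrieved_ids, expected_ids):
    """Share of the chunks a correct answer needs that came back."""
    if not expected_ids:
        return 1.0
    return len(set(retrieved_ids) & set(expected_ids)) / len(set(expected_ids))

def failing_cases(cases, min_groundedness=1.0, min_recall=1.0):
    """Names of the test cases that fall below either threshold."""
    failures = []
    for case in cases:
        g = groundedness(case["claims"], case["retrieved"])
        r = retrieval_recall(case["retrieved"], case["expected"])
        if g < min_groundedness or r < min_recall:
            failures.append(case["name"])
    return failures

CASES = [  # hypothetical recorded runs
    {
        "name": "renewal date",
        "retrieved": ["c-4471-p3", "c-4471-p4"],
        "expected": ["c-4471-p3"],
        "claims": [{"text": "Renews on 1 March.", "source": "c-4471-p3"}],
    },
    {
        "name": "contract value",
        "retrieved": ["c-4471-p3"],
        "expected": ["c-4471-p9"],
        "claims": [{"text": "Worth $3.4M.", "source": "c-4471-p9"}],
    },
]
FAILURES = failing_cases(CASES)
```

Run on every change, a list like FAILURES plays the role of a failing unit test: the second case cites a chunk that was never retrieved, so it is flagged however fluent the answer reads.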

A demo is judged by its best answer. A system is judged by its worst.

Fig. 7b — The two lines every executive should ask for. Answers getting more grounded; humans needed less often. When both move the right way, trust is earned — and measured.

10 The Fine Print

What most explanations skip. We will not.

The diagrams above are honest, but production systems live or die on details that rarely make the slide. Here are six that decide whether yours works.

Chunking is a design decision.

Cut documents wrong and retrieval finds fragments without their context — a renewal clause severed from its contract number. Chunk by structure, overlap the edges, carry metadata. Most "RAG doesn't work" complaints are chunking complaints.⁷
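
Overlap and metadata are easy to show. The sketch below uses a sliding window over words, the simplest possible splitter; chunking by structure would first split on headings or clauses and then apply the same idea. The window sizes and metadata fields are illustrative.

```python
def chunk_words(text, size, overlap, metadata):
    """Split text into windows of `size` words that share `overlap` words at the edges."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        # Every chunk carries the document's metadata, so context survives the cut.
        chunks.append({"text": " ".join(window), "start_word": start, **metadata})
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

TEXT = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
CHUNKS = chunk_words(TEXT, size=4, overlap=1, metadata={"contract_id": 4471, "page": 3})
```

With a window of four words and an overlap of one, the ten-word text becomes three chunks, and each chunk begins with the last word of the one before it. Every chunk also carries the contract number, so a renewal clause is never separated from the contract it belongs to.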

Hybrid search beats either alone.

Vectors miss exact strings — part numbers, names, "4471." Keyword search (BM25) misses paraphrase. Run both, fuse the results, then let a reranker put the truly best chunks on top. This moves quality more than any model swap.⁷
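
The text does not name a fusion method. One common choice is reciprocal rank fusion (RRF), sketched below: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The document IDs are invented, k = 60 is a conventional default rather than a tuned value, and the reranking step is left out.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the order is repeatable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

KEYWORD_HITS = ["c4471", "c5120", "c9001"]   # hypothetical BM25 order
VECTOR_HITS = ["c7700", "c4471", "c5120"]    # hypothetical embedding order

FUSED = reciprocal_rank_fusion([KEYWORD_HITS, VECTOR_HITS])
```

RRF needs only ranks, never raw scores, which is convenient because keyword and cosine scores are not on comparable scales. Documents found by both searches rise above documents found by only one.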

The context window is a budget.

Frontier models now hold around a million tokens, and Meta's Llama 4 Scout advertises ten million.⁹ But attention degrades in the middle of a stuffed prompt, so more chunks are not better. The best chunks, ordered well, are better. Spend tokens like dollars.⁸ ¹²
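
Treating the window as a budget can be sketched as two steps: take the highest-scoring chunks that fit, then order them so the strongest sit at the start and end of the prompt rather than the middle. The greedy packing and the edge-first ordering are one plausible heuristic, not a documented standard, and the scores and token counts are made up.

```python
def pack(chunks, budget):
    """Greedily keep the highest-scoring chunks whose token counts fit the budget."""
    chosen, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] <= budget:
            chosen.append(chunk)
            used += chunk["tokens"]
    return chosen  # best first

def edges_first(chosen):
    """Alternate chunks between the front and the back, so the weakest land mid-prompt."""
    front, back = [], []
    for i, chunk in enumerate(chosen):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

CANDIDATES = [  # hypothetical retrieval results
    {"id": "a", "score": 0.9, "tokens": 400},
    {"id": "b", "score": 0.8, "tokens": 700},
    {"id": "c", "score": 0.7, "tokens": 300},
    {"id": "d", "score": 0.6, "tokens": 200},
]
PACKED = pack(CANDIDATES, budget=1000)
ORDERED = edges_first(PACKED)
```

Here chunk "b" is skipped because it would break the 1,000-token budget, and the remaining three are arranged with the best first and the second best last.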

Deterministic where it counts.

Models classify, summarize, and route. They do not do arithmetic, and they do not enforce policy. Totals come from SQL. Approvals come from code. The model decides which calculation to run — never the answer to it.
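
One way to enforce that split is an allow-list: the model may only name a calculation, and fixed SQL produces the number. The sketch uses Python's built-in sqlite3; the table, the calculation names, and every contract value except the $3.4M attached to contract 4471 are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contracts (id INTEGER PRIMARY KEY, renews TEXT, value_usd INTEGER)")
db.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [(4471, "2026-03-01", 3_400_000), (5120, "2026-03-15", 1_100_000), (6003, "2026-06-01", 900_000)],
)

# The only calculations the model is allowed to request, each with fixed SQL.
CALCULATIONS = {
    "march_renewal_total": (
        "SELECT COALESCE(SUM(value_usd), 0) FROM contracts "
        "WHERE renews BETWEEN '2026-03-01' AND '2026-03-31'"
    ),
    "march_renewal_count": (
        "SELECT COUNT(*) FROM contracts "
        "WHERE renews BETWEEN '2026-03-01' AND '2026-03-31'"
    ),
}

def run_calculation(name):
    """Run a named calculation; anything off the list is refused, not improvised."""
    if name not in CALCULATIONS:
        raise KeyError(f"unknown calculation: {name}")
    return db.execute(CALCULATIONS[name]).fetchone()[0]

# The model's output is just a name; the total comes from the database.
MODEL_CHOICE = "march_renewal_total"
TOTAL = run_calculation(MODEL_CHOICE)
```

The model never writes SQL and never states a figure, so a wrong routing decision produces a refused request or a clearly labelled different number, not a plausible invented one.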

Permissions travel with the data.

Retrieval must filter by who is asking, before the chunks reach the prompt. An agent with every employee's access is a breach with good manners. Scope the tools. Scope the indexes. Log the denials too.
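
Filtering before ranking can be sketched as a single pass over the candidate chunks. The group names, chunk fields, and log format below are assumptions; in production the filter usually runs inside the vector database as a metadata predicate, so restricted chunks are never even scored.

```python
def visible_chunks(chunks, user_groups, audit_log):
    """Keep only chunks the asker may read, and log every denial."""
    allowed = []
    for chunk in chunks:
        if chunk["allowed_groups"] & user_groups:   # any shared group grants access
            allowed.append(chunk)
        else:
            audit_log.append({"event": "denied", "chunk_id": chunk["id"]})
    return allowed

CHUNKS = [  # hypothetical chunks with access metadata attached at ingestion
    {"id": "c-4471-p3", "allowed_groups": {"sales", "legal"}, "text": "Renewal clause"},
    {"id": "hr-0093-p1", "allowed_groups": {"hr"}, "text": "Salary review"},
]

AUDIT_LOG = []
FOR_SALES = visible_chunks(CHUNKS, user_groups={"sales"}, audit_log=AUDIT_LOG)
```

Only the chunks in FOR_SALES go on to ranking and into the prompt. The denial of the HR chunk is recorded, which gives an auditor evidence that the filter ran.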

Humans stay in the loop on writes.

Reading at machine speed is leverage. Writing at machine speed is risk. The pattern that survives audits: agents draft, humans approve, systems record. Loosen it only where the evidence says you can.
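
The draft, approve, record pattern fits in a small class. This is a sketch under stated assumptions: the class name, the in-memory storage, and the list standing in for a CRM are all hypothetical, and a real system would persist drafts and authenticate approvers.

```python
class WriteGate:
    """Agents draft writes; nothing is applied until a human approves it."""

    def __init__(self):
        self.pending = {}
        self.audit = []

    def draft(self, draft_id, action, payload):
        """Called by the agent: store the proposed write without applying it."""
        self.pending[draft_id] = (action, payload)
        self.audit.append(("drafted", draft_id, None))

    def approve(self, draft_id, approver, apply):
        """Called by a human: apply the write and record who allowed it."""
        action, payload = self.pending.pop(draft_id)  # KeyError if never drafted
        result = apply(action, payload)
        self.audit.append(("approved", draft_id, approver))
        return result

    def reject(self, draft_id, approver):
        """Called by a human: discard the write and record who refused it."""
        self.pending.pop(draft_id)
        self.audit.append(("rejected", draft_id, approver))

crm = []  # stand-in for the system being written to
gate = WriteGate()
gate.draft("d1", "update_risk", {"contract": 4471, "risk": "elevated"})
CRM_BEFORE_APPROVAL = list(crm)  # still empty: drafting changes nothing
gate.approve("d1", "reviewer@example.com", lambda action, payload: crm.append((action, payload)))
```

The audit list records both the draft and the decision, so the question of who approved a write has an answer that does not depend on the model's account of events.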

Fig. 8 — Hybrid search, then a reranker. Keyword finds exact strings; vectors find meaning; fusing both and reranking puts the truly best chunks on top, lifting accuracy from 62% to 91% on this benchmark without touching the model.⁷ ¹³

11 The sentence, delivered

One sentence went in. A system came out.

It was cut into tokens. Mapped into meaning. Indexed for finding. Retrieved with evidence. Reasoned over, acted upon, measured, and trusted. That is the whole stack — and now you have walked it end to end.

Where it ended up

✓ Calder Health · Contract 4471 · renews 2026-03-01 · $3.4M
✓ Risk: elevated, 3 unanswered escalations (cited)
✓ Ticket drafted · awaiting human approval · trace a3f9-4471

The hard part was never the concepts. It is the judgment — what to retrieve, what to govern, what to automate, and what to leave alone. That judgment is what BlueAlly brings to the table.


12 Sources

Where this comes from

Every factual claim above is drawn from a primary source — model cards, papers, and the official documentation of the labs that build these systems. Specific cosine scores and timings are illustrative; the mechanisms are not.

1. Hugging Face, "Byte-Pair Encoding tokenization," LLM Course. huggingface.co/learn/llm-course/chapter6/5
2. OpenAI, "tiktoken" (BPE tokenizer; cl100k_base / o200k_base). github.com/openai/tiktoken
3. OpenAI Help Center, "What are tokens and how to count them." help.openai.com/en/articles/4936856
4. OpenAI, "New embedding models and API updates" (text-embedding-3-large, 3,072 dims). openai.com/index/new-embedding-models-and-api-updates
5. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," arXiv:2005.11401. arxiv.org/abs/2005.11401
6. Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using HNSW graphs," arXiv:1603.09320. arxiv.org/abs/1603.09320
7. Anthropic, "Introducing Contextual Retrieval." anthropic.com/news/contextual-retrieval
8. Anthropic, "Claude — Models overview" (context windows; Claude Opus 4.8, Fable 5). platform.claude.com/docs/en/about-claude/models/overview
9. Meta AI, "The Llama 4 herd" (Scout 10M-token context). ai.meta.com/blog/llama-4-multimodal-intelligence
10. Anthropic, "Building Effective Agents." anthropic.com/research/building-effective-agents
11. Anthropic, "Introducing the Model Context Protocol." anthropic.com/news/model-context-protocol
12. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172. arxiv.org/abs/2307.03172
13. Bronck, P., "Better RAG Accuracy with Hybrid BM25 + Dense Vector Search" (hybrid + reranking, 62% → 91%). medium.com/@pbronck/better-rag-accuracy
