
A BlueAlly Field Guide · Begin here

The convergence of AI.

It did not arrive on a single day. Three forces — data, compute, and algorithms — climbed for sixty years on separate tracks. Around 2017 they met. What came out was not a better tool. It was a new foundation, and everything since is built on it.


01  The thesis

One foundation. Three forces that made it.

For sixty years, artificial intelligence was a set of separate crafts. Vision had its own conferences. Language had its own. Each field kept its own data, its own math, its own heroes. Progress in one rarely moved the others.

Then three forces — data, compute, and algorithms — grew large at the same time, and began to feed one another. By the late 2010s they had merged into a single thing: a model you can train once on almost everything, then point at almost any task. We call it generative AI.

This guide answers two questions, in order. First the harder one — not when this happened, but why it had to. Then the proof: a dated line of milestones that shows the forces meeting on a real calendar. The why is the spine. The when is the evidence.

In plain English

Five terms to carry the whole guide

Compute
Raw calculation: the number of math operations a machine does per second. More compute means a bigger model, trained faster.
Corpus
The pile of text and images a model learns from. Once it was thousands of hand-labeled examples. Now it is much of the public internet.
Embedding
A long list of numbers that stands for a word, an image, or a passage, placed so that similar meanings land near each other [11].
Transformer
The model design, published in 2017, that learns by paying attention to every part of its input at once. It is the shared engine under today's systems [5].
Scaling law
A measured rule: add data, compute, and model size in step, and error falls on a smooth, predictable curve [1].
[Figure: three forces (data, compute, algorithms) feed one trained foundation, generative AI, which sits beneath the six classical disciplines of AI: computer vision, language, machine learning, knowledge, reasoning, and robotics planning.]
Fig. 1: From many crafts to one foundation. The six fields once named as the pieces of machine intelligence now run on a shared architecture. Convergence means the engine is the same, even when the job is not [2].

For a business, that shared engine reorders the strategy. You no longer buy a vision system, a language system, and a forecasting system. You build on one foundation, and your own data becomes the moat on top of it.

02  Force one · Data

The corpus became the model.

A model is mostly what it has read. Before 2010 it read little — a few thousand hand-labeled examples, curated by graduate students. The work was careful and the piles were small. Then the piles grew a thousandfold, and a millionfold, and the kind of thing a model could learn changed with them.

Two releases mark the shift. In 2009, ImageNet gathered 3.2 million labeled images across 5,247 categories, organized on the WordNet hierarchy [3]. Suddenly vision research had millions of examples, not thousands. A few years later, Common Crawl began archiving the open web; today it is a corpus of several petabytes and billions of pages, refreshed every month [4]. When GPT-3 was trained in 2020, it learned from 300 billion tokens, most of them filtered Common Crawl [10]. The training set was no longer a benchmark. It was the internet.

[Figure: dataset scale on a logarithmic axis, from 10³ past 10¹¹. Early ML benchmarks: ~10³ hand-labeled examples. ImageNet (2009): 3.2M images, 5,247 categories. BookCorpus (2015): ~985M words, 11,000 books. GPT-3 training (2020): 300B tokens, mostly Common Crawl. Common Crawl, the open web: 100T+.]
Fig. 2: A thousand to a trillion. The vertical axis is logarithmic, so equal distances mean equal multiples, and each step shown is several hundred to several thousand times the last. Hand-labeled benchmarks gave way to web-scale corpora. The fuel changed before the engine did [3][4].
Before, you taught a model with a textbook. After, you handed it the library.

The data was never free of labor, and it was never neutral. ImageNet was labeled by tens of thousands of crowd workers. Common Crawl must be filtered, de-duplicated, and re-tokenized by every lab that uses it. The corpus is engineering, not magic — and what goes into it shapes everything that comes out.

03  Force two · Compute

Silicon caught up to the ambition.

A large model is a mountain of multiplication. For decades the mountain was too tall — the math existed, but no machine could finish it in a human lifetime. Then the graphics chip, built to draw video-game worlds, turned out to be a near-perfect engine for the one operation models need most: multiplying big grids of numbers.

The turn was hardware made on purpose. In 2017, NVIDIA's V100 shipped the company's first tensor cores, circuitry built only to multiply matrices, and delivered a peak of 125 trillion floating-point operations per second on that math [7]. Three years later the A100 multiplied that again. The scale is hard to overstate. By one measure, the compute used in the largest training runs grew more than 300,000× between 2012 and 2018, a doubling roughly every 3.4 months, far faster than chips themselves improved [8]. Across the whole deep-learning era, training compute has risen about ten-billion-fold, doubling roughly every six months [9].
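The growth figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are ours, and the 24-month figure is the usual shorthand for Moore's law, not a number taken from the cited reports.

```python
import math


def doublings(growth_factor):
    """How many times a quantity must double to grow by growth_factor."""
    return math.log2(growth_factor)


def months_to_grow(growth_factor, doubling_months):
    """Months needed to reach growth_factor at a fixed doubling time."""
    return doublings(growth_factor) * doubling_months


def growth_over(months, doubling_months):
    """Growth factor reached after a number of months at a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# 300,000x at a 3.4-month doubling: about 18 doublings, about 62 months.
ai_months = months_to_grow(300_000, 3.4)

# Over that same stretch, a 24-month doubling (assumed Moore's-law pace)
# yields only about 6x.
moore_growth = growth_over(ai_months, 24)

# Ten-billion-fold at a 6-month doubling: about 199 months, or 16 to 17 years.
era_months = months_to_grow(1e10, 6)

print(f"{ai_months:.0f} months, {moore_growth:.1f}x, {era_months / 12:.1f} years")
```

Under these assumptions, the stretch that gave the largest training runs 300,000× more compute gave chips alone only about 6×. That difference is the gap the text describes.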

[Figure: training compute, 2012–2024, in FLOP on a logarithmic axis from 10¹⁸ to 10²⁶. Marked points: AlexNet on 2 GPUs, the tensor-core era, ~175B-parameter models, frontier training runs. A dashed line shows the Moore's-law pace of a roughly two-year doubling. A straight line on a log scale means exponential growth.]
Fig. 3: The steep line and the shallow one. Both lines climb, but the blue AI-compute line doubles about every six months while chip progress (dashed) doubles about every two years. The gap is the story: demand outran the silicon, so the field built silicon to match [8][9].
A caution worth keeping

Be skeptical of any single compute number

Headline figures — "teraflops," "cost per operation" — hide what really gates a training run: memory bandwidth, the speed of the links between chips, and how well the software keeps the silicon busy. A faster chip on paper can train slower in practice. When a vendor quotes one number, ask for the other three.
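One way to see why a headline number misleads is the roofline model, a standard back-of-envelope estimate that this guide does not itself use. It caps throughput at the lower of two ceilings: the chip's peak math rate, and memory bandwidth times the work done per byte moved. In the sketch below, only the 125-teraflop peak comes from the text. The 0.9 TB/s bandwidth, the second chip, and both workload intensities are assumptions chosen for illustration. Link speed and software utilization are not modeled.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_per_s, flops_per_byte):
    """Roofline estimate: the lower of the compute ceiling and the memory ceiling.

    TB/s multiplied by FLOPs per byte gives TFLOP/s, so the units line up.
    """
    memory_ceiling = bandwidth_tb_per_s * flops_per_byte
    return min(peak_tflops, memory_ceiling)


PEAK_A = 125.0   # tensor-core peak quoted in the text
PEAK_B = 250.0   # hypothetical chip with twice the headline number
BANDWIDTH = 0.9  # TB/s, assumed identical for both chips

LOW_INTENSITY = 10.0    # FLOPs per byte: a memory-hungry kernel (assumed)
HIGH_INTENSITY = 400.0  # FLOPs per byte: a large matrix multiply (assumed)

chip_a_low = attainable_tflops(PEAK_A, BANDWIDTH, LOW_INTENSITY)
chip_b_low = attainable_tflops(PEAK_B, BANDWIDTH, LOW_INTENSITY)
chip_a_high = attainable_tflops(PEAK_A, BANDWIDTH, HIGH_INTENSITY)
chip_b_high = attainable_tflops(PEAK_B, BANDWIDTH, HIGH_INTENSITY)

print(chip_a_low, chip_b_low)    # memory-bound: both chips hit the same ceiling
print(chip_a_high, chip_b_high)  # compute-bound: the headline number decides
```

On the memory-hungry workload, doubling the headline figure buys nothing, because both chips wait on the same memory. The paper advantage shows up only when the workload does enough math per byte.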

04  Force three · Algorithms

Meaning learned to live as math.

Data and compute are inert without a method that can use them. The third force was a run of ideas, each one teaching machines to turn messy human stuff — pixels, words, intent — into numbers a computer can work with, and back again.

The keystone idea is the embedding: represent a word, an image, or a passage as a long list of numbers, placed so that similar meanings land near each other [11]. Once meaning has coordinates, a machine can measure it, search it, and generate from it. Layer enough of these representations and you get a network that does not just classify what it sees; it can produce what it has never seen. That last step, from recognizing to generating, is the line between the old AI and the new.
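"Similar meanings land near each other" becomes concrete once you measure the angle between two lists of numbers. The sketch below uses cosine similarity, one common choice of measure. The three four-number vectors are invented for illustration; real embeddings are learned from data and run to hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings, made up by hand for this example.
embeddings = {
    "cat": [0.90, 0.80, 0.10, 0.00],
    "kitten": [0.85, 0.90, 0.15, 0.05],
    "invoice": [0.00, 0.10, 0.90, 0.80],
}


def nearest(word):
    """Return the other word whose vector points most nearly the same way."""
    others = (w for w in embeddings if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


print(nearest("cat"))  # kitten
```

Search and retrieval systems follow the same pattern at scale: embed the query, then return the stored items whose vectors sit closest to it.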

[Figure, two panels. Narrow ML, recognize: an image goes into one model built for one task, and one label comes out ("cat"). A separate model for vision, another for language, another for speech; none of them transfer. General model, generate: text, images, and code go into one shared meaning space, and new output comes out. One backbone reads everything into the same space, then writes, draws, or answers from it.]
Fig. 4: From sorting to making. Narrow models pick a label from a short list. A general model maps every kind of input into one space of meaning, then generates from it. Embeddings are what make the right-hand picture possible [11].
Old AI told you what something was. New AI can make you something new.

05  The interplay

A loop, not a coincidence.

Here is the part most accounts miss. The three forces did not rise side by side by luck. Each one pulled the others up. That feedback — not any single breakthrough — is what made the decade.

[Figure: a loop linking Data (web-scale corpora), Compute (tensor hardware), and Algorithms (transformers and embeddings) around Generative AI. Arrow labels: more data needs more compute; better methods use more data; hardware is built for the math the methods demand.]
Fig. 5: The reinforcing loop. Tensor cores were designed because the algorithms demanded matrix math. The open web was scraped because models could finally use it. Bigger models were trainable because the hardware arrived. Each arrow points both ways.
Two forces the common story leaves out.

Open-source tools — PyTorch, TensorFlow, Hugging Face — and the open preprint culture of arXiv collapsed the gatekeeping that slowed earlier AI cycles. The time from a published idea to a working product fell from years to weeks. Neither fits neatly under "data, compute, algorithms." Both were structural.

Why the loop matters to you.

A feedback loop does not pause politely. The same three forces are still climbing. Plan for capability that keeps compounding — treat the model layer as something you refresh on a cycle, not a thing you buy once and freeze.

06  The measurable target

Scaling laws turned a hunch into a target.

For years, "bigger is better" was a feeling. In 2020 it became a measurement. A team at OpenAI showed that a model's error falls as a smooth power law in three things together (the number of parameters, the size of the training set, and the compute spent) across more than seven orders of magnitude [1]. Add resources in the right ratio, and you can predict the gain before you spend the money.

Two years later, DeepMind sharpened the recipe. The Chinchilla study found that most large models of the day were over-sized and under-fed: trained on too little data for their parameter count. Its rule: to use a compute budget well, grow the model and the data in step. A 70-billion-parameter model trained on 1.4 trillion tokens beat a 280-billion-parameter model trained on far less data, for the same compute budget [12]. Convergence now had an equation. Spend here, gain there, on a curve you could draw in advance.
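The "same budget" claim can be sanity-checked with a widely used rule of thumb: training costs roughly 6 floating-point operations per parameter per token. The sketch below applies it. The 70-billion and 1.4-trillion figures come from the text. The 300-billion-token figure for the larger model is our assumption about the comparison run, since the text says only "far less."

```python
def training_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


# Smaller model, more data (figures from the text).
small_model = training_flops(70e9, 1.4e12)

# Larger model, less data (token count is an assumption, see above).
large_model = training_flops(280e9, 300e9)

budget_ratio = small_model / large_model
tokens_per_param = 1.4e12 / 70e9

print(f"{small_model:.2e} vs {large_model:.2e} FLOPs, ratio {budget_ratio:.2f}")
print(f"{tokens_per_param:.0f} tokens per parameter")
```

Under these assumptions the two runs land within about 20 percent of each other in compute, while the smaller model sees about 20 tokens for every parameter. The budget is the same; only the split between model and data changed.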

[Figure: scaling laws on log–log axes. Vertical axis: test loss (lower is better). Horizontal axis: compute spent on training, with more compute, more data, and a bigger model added in ratio. The plotted relationship is a straight line: each step right multiplies compute, and loss drops by a steady, predictable fraction.]
Fig. 6: The curve you can draw in advance. On log–log axes, the relationship between resources and error is a straight line, the signature of a power law. This is why frontier labs can budget a result before they train it [1][12].
In plain English

What a "power law" buys a business

A power law is a steady trade: every time you multiply the inputs, the error shrinks by a fixed fraction. The practical gift is predictability. A lab can forecast how much better a model will get before spending a dollar — and you can plan around the fact that next year's model will, on this curve, be reliably stronger than this year's. Scaling is not magic. It is a budget line.
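The "fixed fraction" can be shown in a few lines. The sketch below assumes loss follows scale × compute^(−exponent). It fits that line through two small pretend runs, then extrapolates to a run a million times larger than the first. The constants 30.0 and 0.05 are invented for illustration and are not measured values from the cited papers.

```python
import math


def loss_at(compute, scale, exponent):
    """Power law: loss = scale * compute ** (-exponent)."""
    return scale * compute ** (-exponent)


def fit_power_law(c1, l1, c2, l2):
    """Recover scale and exponent from two (compute, loss) points.

    On log-log axes a power law is a straight line, so two points fix it.
    """
    exponent = -(math.log(l2) - math.log(l1)) / (math.log(c2) - math.log(c1))
    scale = l1 * c1 ** exponent
    return scale, exponent


# Two small "experiments", generated here from made-up constants.
TRUE_SCALE, TRUE_EXPONENT = 30.0, 0.05
run_a = (1e18, loss_at(1e18, TRUE_SCALE, TRUE_EXPONENT))
run_b = (1e20, loss_at(1e20, TRUE_SCALE, TRUE_EXPONENT))

scale, exponent = fit_power_law(*run_a, *run_b)

# Forecast a run a million times larger than run_a before paying for it.
predicted = loss_at(1e24, scale, exponent)

# The steady trade: every 10x in compute multiplies loss by the same factor.
drop_per_10x = 10 ** (-exponent)

print(f"exponent {exponent:.3f}, predicted loss {predicted:.3f}")
print(f"each 10x of compute leaves {drop_per_10x:.1%} of the loss")
```

With this made-up exponent, each tenfold increase in compute trims about 11 percent off the loss. The cut is the same at every scale, and that constancy is what makes the budget line drawable in advance.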

07  The inflection

One architecture absorbed the field.

Three forces pushed. One idea let them all pull in the same direction. On June 12, 2017, eight researchers published "Attention Is All You Need" and introduced the transformer [5]. It threw out the step-by-step reading of older networks and let the model weigh every part of its input against every other part, all at once. That made training run in parallel, which is exactly what the new compute and the new data were waiting for.

The old networks read a sentence one word at a time, in order. That was slow, and it made them forgetful over long passages. The transformer reads the whole passage at once and learns, for every word, which other words matter most. Two gifts came with that change. It trained far faster on modern hardware, and it held context far better. Speed and memory — the two things that had held the field back — improved together.
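"Learns, for every word, which other words matter most" has a compact form: scaled dot-product attention. The sketch below is a minimal version in plain Python and a simplification, not the full design. It reuses the same toy vectors as queries, keys, and values, where a real transformer first passes them through learned projections, and it omits multiple heads and positional information.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence.

    Each row is computed independently of the others, which is why real
    implementations can do all rows at once as a single matrix multiply.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position, itself included.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors according to those weights.
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "word" vectors, invented for this example.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

No row depends on the row before it. That independence is the property the older word-by-word networks lacked, and it is what lets modern hardware work on the whole passage in parallel.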

The transformer's reach is the clearest sign of convergence. It started in translation. Then vision adopted it: a 2020 paper showed that "a pure transformer applied directly to sequences of image patches" could match the best image models [13]. Today the same core design writes text, reads images, generates code, and helps robots plan. Six fields that once needed six toolkits now share one. The model layer became the meeting point.
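"Sequences of image patches" means cutting a picture into tiles and lining the tiles up like words in a sentence. The sketch below does only that step. The 16-pixel patch size echoes the cited paper's title. The 224-pixel, single-channel image is an assumption for illustration, and the learned projection that turns each patch into an embedding is left out.

```python
def image_to_patches(image, patch):
    """Split a 2-D grid (a list of rows) into a row-major list of flat patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            sequence.append(flat)
    return sequence


# A stand-in 224 x 224 grayscale image: each pixel holds its own index.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
patches = image_to_patches(image, 16)

print(len(patches), len(patches[0]))  # 196 patches of 256 values each
```

From here the patches are treated the way words are: each becomes a vector, and attention weighs every patch against every other.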

[Figure, three stages. Before 2017, a model family per field: CNNs for vision, RNNs for language, SVMs for machine learning, ontologies for knowledge, logic for reasoning, controllers for robotics. Separate tools, little transfer. 2017, the inflection: the transformer, "attention" in parallel. After, one shared transformer backbone for every field: vision, language, machine learning, knowledge, reasoning, robotics planning. One engine, many jobs.]
Fig. 7: The inflection point. Before, each discipline grew its own model family. After 2017, one architecture spread across all of them. The transformer did not win an argument the fields were having; it ended the need for the argument [5][13].
1. It dropped recurrence.

No more reading word by word. The whole sequence is processed together, so training uses modern GPUs to the fullest.

2. It kept attention.

For every word, the model weighs every other word. Long-range meaning survives where older networks forgot it.

3. It scaled cleanly.

Make it bigger, feed it more, and it keeps improving. That property is what made the next decade possible.

4. It generalized.

Built for translation, it turned out to fit text, code, images, audio, and protein structure. One design, many fields.

Six fields had six engines. Now they share one.

08  The evidence · 1958–2026

The when that proves the why.

The forces are the argument. Here is the proof. Every part of modern AI (the neuron, the training trick, the data, the architecture) was invented by someone, on a date you can name. Read the line top to bottom. The pace is the point: half a century of quiet groundwork, then a sudden run after 2012, then a year-by-year sprint from 2017 to the present.

[Figure: a timeline, "From the first neuron to the frontier." Each dot is a milestone. Colour marks the era: groundwork, deep learning, transformer, frontier. The line never breaks.]
1958 · The perceptron. First trainable artificial neuron. The idea begins.
1986 · Backpropagation, popularized. A practical way to train networks with many layers.
1997 · LSTM. Networks that remember across long sequences.
2009 · ImageNet. A labeled dataset finally big enough to matter.
2012 · AlexNet. Deep learning wins decisively. The modern era opens.
2013 · word2vec. Meaning becomes geometry: the word embedding.
2014 · Seq2seq + attention. Map sequence to sequence; learn what to focus on.
2017 · The Transformer. "Attention Is All You Need." The architecture that scales.
2018 · BERT & GPT-1. Pretrain once on the web; adapt to many tasks.
2019 · GPT-2. Scale starts to surprise. Fluent text, unprompted.
2020 · Scaling laws & GPT-3. Bigger is predictably better. Few-shot learning emerges.
2022 · ChatGPT. AI meets everyone. A million users in five days.
2023 · GPT-4 & open models. Frontier multimodal arrives; Llama opens the weights.
2024 · MCP. One open plug connects any model to any tool.
2026 · The frontier, now. Opus 4.8, Fable 5, GPT-5.5, Gemini 3; context near 1M tokens.
54 years of slow groundwork; 14 years of the run we live in now.
Fig. 8: Sixty-eight years on one rail. Look at the spacing. The first half-century is sparse; everything since 2017 is dense. The transformer (green) is the hinge the whole modern stack turns on: the forces meeting on a calendar.
Legend: groundwork · deep-learning era · pretrain & scale · a hinge moment

Read the line in three movements.

The groundwork ran from 1958 to 2012: the perceptron, backpropagation, LSTM, then ImageNet. For half a century the ideas ran ahead of the machines. Then 2012 arrived, and the three forces lined up for the first time: a deep network (the algorithm), ImageNet (the data), and two gaming GPUs repurposed for math (the compute). AlexNet won the ImageNet contest by a margin nobody had seen, and the argument was over [6]. Five years later the transformer turned that single win into a general method. Everything after is the sprint.

1958

The perceptron

Frank Rosenblatt · Cornell

One trainable neuron learns to separate patterns. The first machine that adjusts itself toward an answer [14].

2017

The Transformer — the hinge

Vaswani et al. · Google · June 12

"Attention Is All You Need." Recurrence out, attention in. Nearly every model since is built on it [5].

2022

ChatGPT — AI meets everyone

OpenAI · November 30

A capable model wrapped in a text box. One million users in five days, one of the fastest launches consumer software had seen to that point [15].

2024

The Model Context Protocol

Anthropic · November

One open standard to connect any model to any tool or data source. The wiring that lets today's agents actually do things [16].

09  The payoff, June 2026

Where the convergence stands today.

Follow the three forces to 2026 and you arrive at the frontier model: one system, trained on much of human knowledge, that reads, writes, sees, and reasons across a window of about a million tokens at once. The leading models have converged not just in design but in context length, and the gap between major releases is now measured in weeks, not years.

As of June 2026, these are the shipping flagships. Anthropic's Claude Opus 4.8 (May 28) and Claude Fable 5 (June 9) lead its line; Fable 5 is built for long, autonomous work [17][18]. OpenAI's GPT-5.5 (April 23) ships a one-million-token context window through the API [19]. Google's Gemini 3 (November 18, 2025) offers the same million-token window [20]. On the open side, Meta's Llama 4 Scout (April 5, 2025) advertises a striking ten-million-token context [21]. Among the four closed flagships, the numbers, once wildly apart, now sit close together.

[Figure: context windows at the frontier, June 2026, measured in tokens a model can hold at once. Four leaders converged near 1M; one open model reaches far past it. Bars share a 0–1M scale; Llama 4 Scout (10M) is shown clipped, with its true figure called out.]
Claude Opus 4.8 · Anthropic · May 28, 2026 · 1M
Claude Fable 5 · Anthropic · Jun 9, 2026 · 1M
GPT-5.5 · OpenAI · Apr 23, 2026 · 1M
Gemini 3 · Google · Nov 18, 2025 · 1M
Llama 4 Scout · Meta · Apr 5, 2025 · open weights · 10M
Fig. 9: Four leaders near a million tokens; one open model past it. The frontier has converged on roughly 1M tokens of context, with Llama 4 Scout advertising 10M. The bars share a 0–1M scale; Scout's bar is clipped because its true length runs ten times off the chart [17][21].

The honest edge of the story

Convergence is real. Convergence is not complete — and a guide that pretends otherwise is selling something. Three places where the old methods still win:

Low-level robot control

A transformer can plan a robot's next move. It does not run the tight, millisecond control loop that keeps the arm steady. Classical control — the math of feedback and stability — still does that.

Correctness you must certify

When an answer has to be provably right — a safety case, a formal proof — symbolic logic engines still beat a model that reasons in probabilities. "Usually correct" is not the same as "certified correct."

Tight latency and power budgets

At the edge — a sensor, a phone, a camera — a small purpose-built model can beat a giant general one on speed and energy. Biggest is not always best when milliwatts matter.

10  Why this is where your journey starts

The model is a moving target. The system around it is the asset.

Step back from the dates and one lesson stands above the rest. The clock is speeding up. Fifty-four years separated the perceptron from AlexNet. Weeks now separate major model releases. The model you choose today will be matched or beaten within months.

So the convergence hands an enterprise a gift and a caution. The gift: one foundation now does what six fields once did, and your own data is the moat on top of it. The caution: plan for a frontier that moves faster than your procurement cycle — favour systems that swap models cleanly, and know the edges where the old methods still win. The hard part was never the model. It is the judgment about where it belongs — and that judgment is what BlueAlly brings.

You now know why the foundation exists and when it was built. Next, we open it up. We take one plain sentence and follow it all the way through a working system — tokens, vectors, retrieval, and agents — so you can see exactly how the machine turns your words into something it can act on.

11  Sources

Where this comes from

Every factual claim above is drawn from a primary source — the original papers, the labs' own datasheets and announcements, and the measured trend data of independent researchers. Release dates were confirmed as of June 2026. Where a figure is a rounded scale rather than an exact count, the text says so.

  1. Kaplan et al., "Scaling Laws for Neural Language Models," arXiv:2001.08361. arxiv.org/abs/2001.08361
  2. Russell & Norvig, Artificial Intelligence: A Modern Approach — the capabilities required for the Turing Test. aima.cs.berkeley.edu
  3. Deng, Dong, Socher, Li, Li & Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," CVPR 2009 (3.2M images, 5,247 synsets). image-net.org/static_files/papers/imagenet_cvpr09.pdf
  4. Common Crawl, "Get Started" — petabyte-scale, billions of pages, monthly crawls. commoncrawl.org/get-started
  5. Vaswani et al., "Attention Is All You Need," arXiv:1706.03762 (12 Jun 2017; NeurIPS 2017). arxiv.org/abs/1706.03762
  6. Krizhevsky, Sutskever & Hinton, "ImageNet Classification with Deep Convolutional Neural Networks" (AlexNet), NeurIPS 2012. papers.nips.cc/paper/4824
  7. NVIDIA, "NVIDIA Tesla V100 GPU Architecture" whitepaper (640 tensor cores, 125 TFLOPS deep-learning performance). images.nvidia.com/content/volta-architecture
  8. Amodei & Hernandez (OpenAI), "AI and Compute" (300,000× growth 2012–2018; 3.4-month doubling). openai.com/index/ai-and-compute
  9. Sevilla et al. / Epoch AI, "Compute Trends Across Three Eras of Machine Learning" (~6-month doubling; ~10-billion× since 2010). epoch.ai/blog/compute-trends
  10. Brown et al., "Language Models are Few-Shot Learners" (GPT-3: 175B params; 300B training tokens), arXiv:2005.14165. arxiv.org/abs/2005.14165
  11. Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality" (word embeddings), arXiv:1310.4546. arxiv.org/abs/1310.4546
  12. Hoffmann et al. (DeepMind), "Training Compute-Optimal Large Language Models" (Chinchilla: 70B on 1.4T tokens), arXiv:2203.15556. arxiv.org/abs/2203.15556
  13. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Vision Transformer), arXiv:2010.11929. arxiv.org/abs/2010.11929
  14. Rosenblatt, F., "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review 65(6), 1958. doi.org/10.1037/h0042519
  15. OpenAI, "Introducing ChatGPT," 30 Nov 2022 (1M users in 5 days, per G. Brockman). openai.com/index/chatgpt
  16. Anthropic, "Introducing the Model Context Protocol," 25 Nov 2024. anthropic.com/news/model-context-protocol
  17. Anthropic, "Claude — Models overview" (context windows; Claude Opus 4.8, Fable 5 — 1M tokens). platform.claude.com/docs/en/about-claude/models/overview
  18. Anthropic, "Introducing Claude Fable 5 and Claude Mythos 5," 9 Jun 2026. platform.claude.com/docs/.../introducing-claude-fable-5-and-claude-mythos-5
  19. OpenAI, "Introducing GPT-5.5" (1M-token API context), 23 Apr 2026. openai.com/index/introducing-gpt-5-5
  20. Google, "Gemini 3: Introducing the latest Gemini AI model," 18 Nov 2025. blog.google/products/gemini/gemini-3
  21. Meta AI, "The Llama 4 herd" (Scout 10M-token context), 5 Apr 2025. ai.meta.com/blog/llama-4-multimodal-intelligence