// term 01 · Foundational Architecture
LLM
Large Language Model
A neural network trained on internet-scale text using a transformer architecture. Through trillions of next-token predictions, the model develops a compressed probabilistic representation of language, reasoning, and world knowledge — encoded across billions of floating-point parameters.
// Scale
70B–405B
Parameter counts of large open-weight models; many proprietary frontier models do not disclose theirs. Together, the parameters encode statistical associations learned across the training corpus.
// Unit
~0.75 words
Per token in typical English text. The token is the atomic unit of all LLM input and output; cost, speed, and capability are all denominated in tokens.
// Paradigm shift
1 model
Replaces dozens of specialist AI systems. LLMs collapse point-solution AI into a single general-purpose foundation.
// full definition
What an LLM actually is
LLMs are trained on internet-scale datasets — often trillions of tokens spanning web pages, books, scientific papers, and codebases — using a self-supervised learning objective: predict the next token given all preceding context. Through trillions of such predictions, the model learns not just language patterns but causal relationships, factual associations, reasoning structures, and stylistic conventions across virtually every domain of recorded human knowledge.
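The next-token objective can be illustrated with a deliberately tiny stand-in. The sketch below is not how an LLM is built: it assumes a toy model that looks only at the single preceding word and learns by counting, whereas a real LLM conditions on all preceding context with a neural network. The function names and the nine-word corpus are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn raw follow-up counts into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

# Toy corpus (an assumption for illustration, not real training data).
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)  # "cat" gets 2/3 of the probability mass, "mat" gets 1/3
```

Even at this scale the output is a probability distribution over continuations, not a stored answer, which is the property the rest of this definition builds on.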
What distinguishes modern LLMs is the transformer architecture introduced in the 2017 paper "Attention Is All You Need." Multi-head self-attention allows the model to dynamically weight the relevance of any prior token when generating each output, enabling coherent, long-range reasoning that previous recurrent architectures couldn't achieve at scale. Unlike a recurrent network, which must process a sequence one step at a time, attention processes every position of a training sequence in parallel. This parallelism is what makes 100B+ parameter training computationally feasible.
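A minimal sketch of the attention step, assuming a single head and no learned projections: the same vectors are passed as queries, keys, and values, where a real transformer would first multiply the inputs by trained weight matrices and run many heads side by side. Function names are hypothetical.

```python
import math

def softmax(xs):
    """Normalize scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i attends only to positions 0..i (itself and prior tokens).
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Relevance score of every visible position for this query.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the relevance-weighted average of the visible values.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

Each iteration of the outer loop is independent of the others, which is why production implementations compute all positions at once as matrix operations on accelerators.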
Capabilities that emerge at sufficient scale can differ qualitatively from those of smaller models. At roughly 70B parameters and above, complex multi-step reasoning, code generation, and nuanced instruction-following can approach, and on some narrow tasks exceed, human expert performance. These emergent properties are not explicitly programmed; they arise from the optimization process itself, which has profound implications for capability forecasting and safety governance.
Critically, LLMs are not databases. They do not retrieve stored facts — they generate statistically plausible continuations based on learned probability distributions. This distinction is the root cause of hallucinations and the core reason RAG architectures exist: to ground model reasoning in verified, retrievable information rather than parametric memory alone.
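The grounding idea behind RAG can be sketched in a few lines. This is an illustration under loud assumptions: retrieval here is naive word overlap, where production systems typically use embedding search over a vector index, and `retrieve`, `build_grounded_prompt`, and the two sample documents are hypothetical.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many question words they share (toy scoring)."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_grounded_prompt(question, documents):
    """Prepend retrieved passages so the model answers from supplied text."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical knowledge base entries.
docs = [
    "The refund window is 30 days from delivery.",
    "Support hours are 9am to 5pm on weekdays.",
]
prompt = build_grounded_prompt("How long is the refund window?", docs)
print(prompt)
```

The model still generates a continuation, but the verified passage now sits inside its context, so the most plausible continuation is one that repeats the supplied fact.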
// how it works
From raw text to production model
Each stage builds on the last. This is the pipeline executives use to assess build-vs-buy decisions, fine-tuning ROI, and deployment trade-offs.
Data Collection
Curated trillion-token corpus assembled from web crawls, books, codebases, and scientific literature. Data quality and composition are among the largest determinants of base model capability and bias profile.
Tokenization
Raw text is split into sub-word units (~0.75 words each) and converted to numerical IDs. The tokenizer vocabulary size directly affects multilingual capability, cost efficiency, and code performance.
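A rough sketch of the split-then-map step, assuming a hand-written six-entry vocabulary and greedy longest-match splitting. Production tokenizers learn vocabularies of tens of thousands of entries from data (for example with byte-pair encoding), so treat the vocabulary and the `tokenize` helper as hypothetical.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of text into sub-word pieces, then IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, shrinking until one matches.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

# Hypothetical toy vocabulary mapping sub-word pieces to numerical IDs.
vocab = {"token": 0, "iz": 1, "ation": 2, "s": 3, " ": 4, "un": 5}
ids = tokenize("tokenization", vocab)
print(ids)  # [0, 1, 2] for the pieces "token", "iz", "ation"
```

One word became three tokens here. Text that the vocabulary covers poorly, such as under-represented languages or unusual code, fragments into more pieces, which is how vocabulary design feeds through to cost.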
Pre-training
The transformer learns to predict each next token across the entire corpus using self-supervised learning. Requires massive compute: public estimates put GPT-4 class training runs at $50M–$100M+. This builds the foundational capability layer.
Instruction Tuning
Model trained on curated instruction-response pairs to follow natural language commands reliably. Transforms a raw text predictor into a useful assistant capable of structured task completion.
RLHF Alignment
Human raters score outputs; a reward model learns their preferences and shapes the LLM's behavior via reinforcement learning. Aligns the model with helpfulness, safety, and brand voice requirements.
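How a reward model "learns their preferences" can be sketched with one common formulation, a pairwise (Bradley-Terry style) loss over a preferred and a rejected response. This is an assumption about the technique, since the stage description above does not name a specific loss, and the scalar rewards below are made-up numbers standing in for a reward model's outputs.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))

# Reward model agrees with the human rater: low loss.
agrees = preference_loss(2.0, -1.0)
# Reward model ranks the rejected response higher: high loss.
disagrees = preference_loss(-1.0, 2.0)
print(agrees, disagrees)
```

Minimizing this loss over many rated pairs pushes the reward model to score responses the way raters would; the reinforcement learning step then optimizes the LLM against that learned score.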
Inference
The trained model receives live prompts and generates token-by-token responses based on learned distributions. All enterprise AI interactions — from chatbots to agentic workflows — are inference operations.
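Token-by-token generation is a loop: predict one token, append it, and feed the longer sequence back in. A minimal sketch, assuming a lookup table as a stand-in for the trained model; `generate`, `toy_model`, and the `<eos>` stop marker are hypothetical names.

```python
def generate(next_token_fn, prompt_tokens, max_new_tokens, stop_token="<eos>"):
    """Append one token at a time, feeding each output back in as context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)
        if token == stop_token:
            break
        tokens.append(token)
    return tokens

# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_model(tokens):
    return TOY_TABLE.get(tokens[-1], "<eos>")

result = generate(toy_model, ["the"], max_new_tokens=10)
print(result)  # ['the', 'cat', 'sat']
```

Because every output token requires another pass through the model, response length drives both latency and cost.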
// anatomy
The components teams must understand
01
Transformer Architecture
Multi-head self-attention
The foundational neural network design. Self-attention enables the model to reason across long sequences in parallel, making 100B+ scale computationally feasible. Nearly all modern LLMs are transformer-based.
02
Parameters & Weights
Learned intelligence as math
Billions of floating-point values encoding all learned knowledge. A 405B model stores about 810 GB of compressed world understanding at 16-bit precision (2 bytes per parameter) entirely within these numbers. Parameter count drives both capability and compute cost.
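The storage figure is simple arithmetic: parameter count times bytes per parameter. A short sketch, assuming decimal gigabytes and counting the weights only (serving a model needs additional memory for activations and caches); `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

fp16 = weight_memory_gb(405e9, 2)    # 16-bit weights
int8 = weight_memory_gb(405e9, 1)    # 8-bit quantized weights
int4 = weight_memory_gb(405e9, 0.5)  # 4-bit quantized weights
print(fp16, int8, int4)  # 810.0 405.0 202.5
```

This is why quantization matters commercially: halving the bytes per parameter halves the hardware needed just to hold the model.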
03
Context Window
Operational memory limit
The working memory available per request: how much document, conversation, or retrieved data the model can reason over at once. Ranges from 8K to 1M+ tokens. Window size governs your RAG architecture requirements.
04
Tokenizer
Text → numerical IDs
Maps raw text to numerical token IDs before processing. Vocabulary size and sub-word algorithm affect multilingual performance and code quality. All API cost is measured in tokens, not words.
05
Temperature
Output randomness control
Controls sampling randomness at inference time. Low (0.1) = near-deterministic outputs ideal for factual tasks. High (1.0+) = creative and exploratory. Must be calibrated to use-case requirements in every production deployment.
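Mechanically, temperature divides the model's raw scores (logits) before they are converted to probabilities. A minimal sketch with three made-up logits; `softmax_with_temperature` is a hypothetical helper, not a vendor API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.1)
flat = softmax_with_temperature(logits, 1.5)
print(sharp)  # nearly all probability on the first token
print(flat)   # probability spread across all three
```

At 0.1 the top candidate takes more than 99.9% of the probability mass, so sampling almost always picks it. At 1.5 the alternatives become live options, which is where variety comes from.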
06
System Prompt
Hidden instruction layer
Hidden instructions prepended to every conversation, defining model persona, constraints, and guardrails. The primary customization mechanism without fine-tuning. Counts against the context window budget on every request.
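Because the system prompt is resent with every request, it shrinks the room left for conversation history and retrieved documents. A minimal sketch of that budgeting, assuming the ~0.75 words-per-token rule of thumb as a stand-in for a real tokenizer; `estimate_tokens` and `fit_to_window` are hypothetical helpers, not part of any vendor SDK.

```python
def estimate_tokens(text):
    """Rough estimate from the ~0.75 words-per-token rule of thumb."""
    return round(len(text.split()) / 0.75)

def fit_to_window(system_prompt, messages, window_tokens):
    """Keep the system prompt, then as many of the most recent messages as fit."""
    budget = window_tokens - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return list(reversed(kept))

system_prompt = "You are a helpful assistant."          # about 7 tokens
history = ["one two three", "four five six", "seven eight nine"]  # about 4 each
kept = fit_to_window(system_prompt, history, window_tokens=16)
print(kept)  # ['four five six', 'seven eight nine']
```

Dropping the oldest messages first is only one policy. Summarizing older turns or retrieving just the relevant ones are common alternatives when early context still matters.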
// strategic implications
What this changes for the business
01 · Strategy
LLMs are infrastructure, not features
LLMs are general-purpose platforms enabling dozens of enterprise applications. The strategic question is not which feature to build, but which foundation model to build on — and whether to fine-tune, use an API, or run open weights on-premises. This decision determines your cost curve, data privacy posture, and the competitive defensibility of everything built on top.
02 · Moat
Proprietary data is your defensible advantage
Fine-tuned models trained on internal knowledge often outperform general-purpose models on domain-specific tasks. Organizations that move early on systematic data curation and fine-tuning build AI capabilities that are genuinely difficult to replicate. Every competitor can access GPT-4 or Claude. The moat is what you've taught it about your business, customers, and domain.
03 · Economics
Token economics reshape cost structures
Deployment costs scale with usage, not licensing. A single agentic workflow making 50 LLM calls per user interaction can generate costs orders of magnitude beyond naive estimates. This fundamentally changes ROI modeling and infrastructure budgeting. Executives need token-denominated cost models before approving AI deployment at scale.
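A token-denominated cost model can start as a few lines of arithmetic. The prices and token counts below are hypothetical placeholders chosen for illustration, not any vendor's actual rates, and `interaction_cost` is an assumed helper name.

```python
def interaction_cost(calls, input_tokens_per_call, output_tokens_per_call,
                     input_price_per_million, output_price_per_million):
    """Cost in dollars of one user interaction made up of several LLM calls."""
    input_cost = calls * input_tokens_per_call * input_price_per_million / 1e6
    output_cost = calls * output_tokens_per_call * output_price_per_million / 1e6
    return input_cost + output_cost

# Hypothetical prices ($ per million tokens) and token counts, for illustration only.
single_call = interaction_cost(1, 2_000, 500, 3.0, 15.0)
agentic = interaction_cost(50, 2_000, 500, 3.0, 15.0)
monthly = agentic * 100_000  # assuming 100,000 interactions per month

print(f"single call: ${single_call:.4f}")
print(f"50-call workflow: ${agentic:.3f}")
print(f"monthly at 100k interactions: ${monthly:,.0f}")
```

Under these assumptions a chatbot-style estimate of about 1.4 cents per interaction becomes roughly 68 cents for the 50-call workflow. The gap can widen further in practice, since agentic workflows often resend a growing context on each call.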
// common misconceptions
What an LLM is not
Myth
“LLMs understand language and reason the way humans do.”
Reality
LLMs compute probability distributions over token sequences. Whether this constitutes “understanding” is an active scientific debate. Treat LLM outputs as probabilistic, not authoritative — and design systems with verification layers accordingly.
Myth
“Bigger models always perform better for our use case.”
Reality
Scale often matters less than task fit. A 7B model fine-tuned on your domain can match or outperform a general-purpose 70B model on specific enterprise tasks, at dramatically lower latency and cost. Right-sizing is a competitive advantage, not a compromise.
Myth
“LLMs store and retrieve facts like a database.”
Reality
LLMs generate statistically plausible continuations — they do not retrieve stored facts. This is the root cause of hallucinations and the reason RAG architectures exist. Any production system requiring factual accuracy needs grounding infrastructure, not just a larger model.
// from literacy to leverage
Know the term. Now build the strategy.
Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.