# What Is a Transformer in AI?
The transformer is the neural network architecture behind virtually every modern large language model, including GPT, Claude, Gemini, and Llama.
## The Key Innovation: Attention
Introduced in the 2017 paper “Attention Is All You Need,” the transformer replaced older sequential architectures (RNNs, LSTMs) with a mechanism called self-attention.
Self-attention lets the model look at all words in a sentence simultaneously and learn which words are related to each other, regardless of distance:
> “The cat sat on the mat because it was tired.”
A transformer understands that “it” refers to “the cat” — not “the mat” — by computing attention scores between all word pairs.
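Those attention scores come from comparing every token with every other token. Here is a minimal sketch of scaled dot-product self-attention in NumPy; it is illustrative only (a real transformer applies learned query, key, and value projection matrices, while this toy version reuses the raw embeddings for all three):

```python
import numpy as np

def self_attention(X):
    """Toy scaled dot-product self-attention.

    X: (seq_len, d) matrix of token embeddings.
    Returns one context-aware vector per token.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # pairwise similarity, all token pairs at once
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: rows sum to 1 (attention scores)
    return weights @ X                                # each output is a weighted mix of all tokens

# Toy "sentence" of 4 tokens, each a 3-dim embedding (random, for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
out = self_attention(X)
print(out.shape)  # (4, 3): one vector per token, now informed by the whole sequence
```

Note that nothing in this computation depends on how far apart two tokens are, which is exactly why attention can link “it” back to “the cat” across the intervening words.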
## Why Transformers Won
- Parallelization: Unlike RNNs, transformers process all tokens at once, making them much faster to train on GPUs.
- Scaling: Performance improves predictably as you add more parameters and data.
- Versatility: The same architecture works for text, code, images, audio, and video.
## Transformer Variants
| Type | Used For | Examples |
|---|---|---|
| Decoder-only | Text generation | GPT, Claude, Llama |
| Encoder-only | Text understanding | BERT, RoBERTa |
| Encoder-decoder | Translation, summarization | T5, BART |
Modern LLMs are almost exclusively decoder-only transformers — trained to predict the next token in a sequence.
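The mechanism that makes next-token training work is the causal mask: before the softmax, each position's attention scores to *future* positions are set to negative infinity, so a token can only attend to itself and earlier tokens. A minimal sketch (the helper name `masked_scores` is mine, not from any particular library):

```python
import numpy as np

def masked_scores(scores):
    """Apply a causal (lower-triangular) mask to an attention-score matrix.

    Position i may attend only to positions 0..i; scores for future
    positions become -inf, so they get zero weight after softmax. This
    is what lets a decoder-only model be trained to predict the next
    token without "seeing the future".
    """
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))   # True on and below the diagonal
    return np.where(mask, scores, -np.inf)

m = masked_scores(np.zeros((3, 3)))
print(m)  # -inf everywhere above the diagonal, 0.0 elsewhere
```

After softmax, each row of the masked matrix distributes attention only over past tokens, so the model's prediction at every position is a valid next-token guess.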
Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.
Learn more about Elvean