What Is AI Inference?
Inference is the process of using a trained AI model to generate outputs — answering questions, writing code, creating images. It’s the “running” phase, as opposed to the “training” phase.
Training vs. Inference
| Aspect | Training | Inference |
|---|---|---|
| What happens | Model learns from data | Model generates predictions |
| Compute | Massive GPU clusters | Often a single GPU or CPU per request |
| Duration | Days to months | Milliseconds to seconds |
| Cost | Millions of dollars for large models | Fractions of a cent per query |
Local vs. Cloud Inference
Cloud Inference
- Models run on the provider’s servers (OpenAI, Anthropic, Google)
- Pay per token
- Access to the largest models
- Requires internet connection
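Since cloud inference is billed per token, the cost of a request is simple arithmetic. The sketch below shows the calculation; the rates used are hypothetical placeholders, so substitute your provider's actual pricing.

```python
# Rough cost estimate for cloud inference billed per token.
# The per-million-token prices below are HYPOTHETICAL examples --
# check your provider's pricing page for real rates.

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars for one request, given prices per million tokens."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Example: 1,000 input tokens and 500 output tokens at assumed rates
# of $3 (input) and $15 (output) per million tokens.
cost = estimate_cost(1_000, 500, 3.0, 15.0)
print(f"${cost:.4f}")  # -> $0.0105, about one cent
```

Note that output tokens typically cost more than input tokens, so long generations dominate the bill.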
Local Inference
- Models run on your own hardware
- Free after initial setup
- Private — data never leaves your machine
- Limited by your hardware (RAM, GPU)
- Tools like Ollama make local inference easy on Mac
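To make the local path concrete, here is a minimal sketch of talking to a local Ollama server over its HTTP API. It assumes Ollama is running on its default port (11434) and that a model such as `llama3.2` has already been pulled; adjust the model name to whatever you have installed.

```python
# Minimal sketch of local inference via Ollama's /api/generate endpoint.
# Assumes an Ollama server is running at localhost:11434 and the model
# named below has been pulled (e.g. `ollama pull llama3.2`).
import json
import urllib.request

def build_request(model: str, prompt: str) -> dict:
    # Non-streaming request payload for /api/generate.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("llama3.2", "Why is the sky blue?")  # requires a running server
```

The prompt and response never leave your machine, which is exactly the privacy property the list above describes.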
Speed Factors
Inference speed depends on:
- Model size: Smaller models are faster
- Hardware: GPU > CPU; more VRAM = larger models
- Quantization: Reducing weight precision (e.g., 16-bit to 4-bit) makes models smaller and faster, usually with minimal quality loss
- Batch size: Processing multiple requests together improves throughput
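The model-size and quantization factors above come down to simple arithmetic: each weight takes a fixed number of bits, so halving the precision halves the memory. The back-of-the-envelope sketch below estimates weight memory only; real usage is higher because activations and the KV cache need room too.

```python
# Back-of-the-envelope memory estimate for holding a model's weights at
# different precisions -- one reason quantized models fit on laptops.
# Ignores activations and KV-cache overhead, so treat these as lower bounds.

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in (decimal) gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
# -> 14.0 GB, 7.0 GB, 3.5 GB
```

This is why a 7B model that needs a data-center GPU at full precision can run in 4-bit form on a Mac with 8 GB of unified memory to spare.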
Inference in Elvean
Elvean supports both local inference (via Ollama on Apple Silicon) and cloud inference (via API keys for OpenAI, Anthropic, Google, and more) — letting you choose the right tradeoff for each task.
Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.
Learn more about Elvean