What Is AI Inference?
Inference is the process of using a trained AI model to generate outputs — answering questions, writing code, creating images. It’s the “running” phase, as opposed to the “training” phase.
Training vs. Inference
| Aspect | Training | Inference |
|---|---|---|
| What happens | Model learns from data | Model generates predictions |
| Compute | Massive GPU clusters | Often a single GPU or CPU per request |
| Duration | Days to months | Milliseconds to seconds |
| Cost | Millions of dollars for large models | Fractions of a cent per query |
Local vs. Cloud Inference
Cloud Inference
- Models run on the provider’s servers (OpenAI, Anthropic, Google)
- Pay per token
- Access to the largest models
- Requires internet connection
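Since cloud inference is billed per token, the cost of a request is simple arithmetic. The sketch below shows the calculation; the rates used are hypothetical placeholders, so substitute your provider's actual pricing.

```python
# Rough cost estimate for cloud inference billed per token.
# The per-million-token prices below are HYPOTHETICAL examples --
# check your provider's pricing page for real rates.

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars for one request, given prices per million tokens."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Example: 1,000 input tokens and 500 output tokens at assumed rates
# of $3 (input) and $15 (output) per million tokens.
cost = estimate_cost(1_000, 500, 3.0, 15.0)
print(f"${cost:.4f}")  # -> $0.0105, about one cent
```

Note that output tokens typically cost more than input tokens, so long generations dominate the bill.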
Local Inference
- Models run on your own hardware
- Free after initial setup
- Private — data never leaves your machine
- Limited by your hardware (RAM, GPU)
- Tools like Ollama make local inference easy on Mac
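To make the local path concrete, here is a minimal sketch of talking to a local Ollama server over its HTTP API. It assumes Ollama is running on its default port (11434) and that a model such as `llama3.2` has already been pulled; adjust the model name to whatever you have installed.

```python
# Minimal sketch of local inference via Ollama's /api/generate endpoint.
# Assumes an Ollama server is running at localhost:11434 and the model
# named below has been pulled (e.g. `ollama pull llama3.2`).
import json
import urllib.request

def build_request(model: str, prompt: str) -> dict:
    # Non-streaming request payload for /api/generate.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("llama3.2", "Why is the sky blue?")  # requires a running server
```

The prompt and response never leave your machine, which is exactly the privacy property the list above describes.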
Speed Factors
Inference speed depends on:
- Model size: Smaller models are faster
- Hardware: GPU > CPU; more VRAM = larger models
- Quantization: Reducing weight precision (e.g., 16-bit to 4-bit) makes models smaller and faster, usually with minimal quality loss
- Batch size: Processing multiple requests together improves throughput
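The model-size and quantization factors above come down to simple arithmetic: each weight takes a fixed number of bits, so halving the precision halves the memory. The back-of-the-envelope sketch below estimates weight memory only; real usage is higher because activations and the KV cache need room too.

```python
# Back-of-the-envelope memory estimate for holding a model's weights at
# different precisions -- one reason quantized models fit on laptops.
# Ignores activations and KV-cache overhead, so treat these as lower bounds.

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in (decimal) gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
# -> 14.0 GB, 7.0 GB, 3.5 GB
```

This is why a 7B model that needs a data-center GPU at full precision can run in 4-bit form on a Mac with 8 GB of unified memory to spare.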
Inference in Elvean
Elvean supports both local inference (via Ollama on Apple Silicon) and cloud inference (via API keys for OpenAI, Anthropic, Google, and more) — letting you choose the right tradeoff for each task.
Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.
Learn more about Elvean