What Is AI Inference?

Inference is the process of using a trained AI model to generate outputs — answering questions, writing code, creating images. It’s the “running” phase, as opposed to the “training” phase.

Training vs. Inference

Aspect          Training                        Inference
What happens    Model learns from data          Model generates predictions
Compute         Massive GPU clusters            Single GPU or CPU
Duration        Days to months                  Milliseconds to seconds
Cost            Millions of dollars for         Fractions of a cent per query
                large models

Local vs. Cloud Inference

Cloud Inference

  • Models run on the provider’s servers (OpenAI, Anthropic, Google)
  • Pay per token
  • Access to the largest models
  • Requires internet connection
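
With per-token billing, the cost of a request scales with prompt and reply length. The sketch below shows the arithmetic; the model names and per-million-token prices are illustrative placeholders, not real provider rates.

```python
# Rough cost estimator for cloud inference billed per token.
# The models and prices below are hypothetical, for illustration only.
PRICE_PER_1M_TOKENS = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    p = PRICE_PER_1M_TOKENS[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 500-token prompt with a 200-token reply on the small model
# costs a fraction of a cent:
print(f"${estimate_cost('small-model', 500, 200):.6f}")  # → $0.000195
```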

Local Inference

  • Models run on your own hardware
  • Free after initial setup
  • Private — data never leaves your machine
  • Limited by your hardware (RAM, GPU)
  • Tools like Ollama make local inference easy on Mac
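
A local server like Ollama typically exposes inference over a plain HTTP API on your own machine. The minimal sketch below targets Ollama's default `localhost:11434/api/generate` endpoint and assumes a running Ollama instance with the model already pulled; treat the endpoint details as an assumption to check against Ollama's own documentation.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumed; verify against Ollama's docs).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for a single, non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the reply text.

    Requires a running Ollama instance with the model pulled; the prompt
    and the response never leave your machine.
    """
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (after `ollama pull llama3`):
# print(generate("llama3", "Explain inference in one sentence."))
```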

Speed Factors

Inference speed depends on:

  • Model size: Smaller models are faster
  • Hardware: GPUs are far faster than CPUs; more VRAM lets you run larger models
  • Quantization: Compressed models run faster with minimal quality loss
  • Batch size: Processing multiple requests together improves throughput
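
The interaction between model size and quantization comes down to simple arithmetic: weight memory is roughly parameter count times bits per weight. The rule-of-thumb sketch below ignores activation memory and runtime overhead, so real usage will be somewhat higher.

```python
def model_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory for a model: parameters × bits per weight.

    A rough rule of thumb only — ignores activations, KV cache, and
    runtime overhead, which add to the real footprint.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# A 7B-parameter model:
print(model_memory_gb(7, 16))  # fp16 weights  → 14.0 GB
print(model_memory_gb(7, 4))   # 4-bit quantized → 3.5 GB
```

This is why 4-bit quantization lets a model that needs a data-center GPU at full precision fit on a laptop with modest quality loss.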

Inference in Elvean

Elvean supports both local inference (via Ollama on Apple Silicon) and cloud inference (via API keys for OpenAI, Anthropic, Google, and more) — letting you choose the right tradeoff for each task.

Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.

Learn more about Elvean