In AI, 'training' is the expensive, one-time process of teaching a model; 'inference' is the repeated process of using it, which is cheap per call but adds up at scale. For LLMs, inference is billed in tokens (input plus output) and the price varies by model: a GPT-4-class model costs on the order of $10 per million output tokens, while smaller models can cost cents per million. Inference latency depends on both the model and the request: small models can respond in around 100 ms, while large models typically take 1-10 seconds. Most consumer AI costs are inference costs: for a chatbot serving 1M users a day, cumulative inference spend can far exceed the one-time training cost. Inference optimization (quantization, distillation, KV caching, batching) is among the most active areas of AI engineering in 2026.
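To make the token-based pricing concrete, here is a minimal sketch of the arithmetic. The `Pricing` class, the function names, the $2.50 input price, and the traffic figures are all illustrative assumptions; only the ~$10 per million output tokens figure comes from the text above, and none of this is a real provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of one inference call, billed on input plus output tokens."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def daily_cost(
    users: int,
    requests_per_user: int,
    input_tokens: int,
    output_tokens: int,
    pricing: Pricing,
) -> float:
    """Cost in USD of one day of traffic, assuming every request has the same size."""
    total_requests = users * requests_per_user
    return request_cost(
        total_requests * input_tokens,
        total_requests * output_tokens,
        pricing,
    )


# Assumed numbers: the input price and the traffic shape are made up for illustration.
large_model = Pricing(input_per_million=2.50, output_per_million=10.00)

per_call = request_cost(input_tokens=500, output_tokens=300, pricing=large_model)
per_day = daily_cost(
    users=1_000_000,
    requests_per_user=5,
    input_tokens=500,
    output_tokens=300,
    pricing=large_model,
)

print(f"per call: ${per_call:.5f}")
print(f"per day:  ${per_day:,.2f}")
```

Under these assumptions a single call costs well under a cent, yet a day of traffic from 1M users comes to roughly $21,000, which is why per-call cheapness and high total inference cost are both true at once.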
Glossary
What is Inference?
The act of running a trained AI model to generate output — i.e., 'using' the model, as opposed to training it.
Related terms
LLM (Large Language Model)
An AI system trained on massive text datasets to predict and generate human-like text — the technology behind ChatGPT, Claude, Gemini, and most modern AI chatbots.
Token
The basic unit that LLMs read and produce. One token is roughly 0.75 words in English. APIs charge per input token consumed and per output token produced.
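The 0.75-words-per-token rule of thumb can be turned into a quick estimator. This is a rough sketch only: `estimate_tokens` and `WORDS_PER_TOKEN` are hypothetical names, whitespace splitting is an assumed stand-in for word counting, and real tokenizers will give different counts, especially for code or non-English text.

```python
import math

# Rule of thumb from the definition above: one token is about 0.75 English words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its whitespace-separated word count."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


prompt = "Explain the difference between training and inference"
print(estimate_tokens(prompt))
```

An estimate like this is useful for budgeting before a request is sent; for billing-accurate numbers, the token counts reported by the API itself are the ones to trust.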