Inference Speed Calculator
Calculate AI model inference speed in tokens per second across different hardware configurations. Compare LLM throughput on GPU, TPU, and CPU with batch size optimization. Essential for AI deployment planning, cost estimation, and production serving infrastructure.
Inference Speed
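Speed = (Memory Bandwidth / Model Size) × Utilization
Model Size = Params × Bits / 8
Variables:
- Speed: generation speed (tokens/second)
- BW: memory bandwidth (GB/s)
- Size: model weight size (GB)
- Params: parameter count (billions)
- Bits: bits per weight (FP16=16, INT8=8)
- Util: utilization (0.6-0.9)
Each generated token requires reading all model weights from memory once, so speed is bounded by bandwidth divided by weight size. An 8B-parameter model at FP16 occupies 8 × 16 / 8 = 16 GB.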
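The estimate can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's actual implementation: the function name and parameters are hypothetical, and it assumes the divisor is the model's weight size in GB (parameters in billions × bits per weight / 8), which comes to 16 for an 8B model at FP16.

```python
def estimate_tokens_per_second(bandwidth_gb_s, params_billion, bits_per_weight, utilization):
    """Bandwidth-bound decode estimate for a single request (hypothetical helper)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Weight size in GB: billions of parameters times bytes per parameter.
    model_size_gb = params_billion * bits_per_weight / 8
    # Assumption: every generated token reads all weights from memory once.
    return bandwidth_gb_s / model_size_gb * utilization


# 8B parameters at FP16 on a GPU with 1008 GB/s of memory bandwidth.
speed = estimate_tokens_per_second(1008, 8, 16, 0.8)
print(f"{speed:.1f} tokens/s")
```

The sketch ignores KV-cache reads and compute limits, so it is best read as an upper-bound style estimate for single-request generation.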
How to Use
- 1. Select Model: Choose model and quantization (FP16, INT8, etc.).
- 2. Select GPU: Choose the GPU being used.
- 3. Calculate: Get tokens per second.
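The three steps above could be mirrored in code as a pair of lookups followed by one calculation. The tables and the `calculate` function below are hypothetical; only the Llama 3 8B model and the RTX 4090 bandwidth of 1008 GB/s come from the example on this page, and the divisor is assumed to be the weight size in GB.

```python
# Hypothetical lookup tables; extend them with the models and GPUs you need.
MODELS = {"Llama 3 8B": 8.0}            # parameters in billions
QUANT_BITS = {"FP16": 16, "INT8": 8}    # bits per weight
GPUS = {"RTX 4090": 1008.0}             # memory bandwidth in GB/s


def calculate(model, quant, gpu, utilization=0.8):
    # Step 1: select model and quantization.
    params_billion = MODELS[model]
    bits = QUANT_BITS[quant]
    # Step 2: select GPU.
    bandwidth_gb_s = GPUS[gpu]
    # Step 3: calculate tokens per second (bandwidth-bound estimate).
    model_size_gb = params_billion * bits / 8
    return bandwidth_gb_s / model_size_gb * utilization


result = calculate("Llama 3 8B", "FP16", "RTX 4090")
print(f"{result:.1f} tokens/s")
```

Keeping the hardware and model data in tables separates the inputs a user selects from the formula itself, so new entries do not require touching the calculation.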
Examples
Llama 3 8B on RTX 4090
Problem:
FP16, BW = 1008 GB/s, Util = 0.8. Speed?
Solution:
- 1. Speed = (1008 / 16) × 0.8
- 2. Speed = 50.4 ≈ 50 tokens/s
Result: ≈ 50 tokens/second
By this estimate, an RTX 4090 generates about 50 tokens per second with Llama 3 8B at FP16.
Frequently Asked Questions
Does batch size matter?
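Yes. Larger batches increase total throughput, because one pass over the weights serves every request in the batch, but each request can wait longer for its tokens, which reduces responsiveness.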
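The trade-off can be illustrated with a simple roofline-style sketch. Everything here is an assumption for illustration: the function is hypothetical, the step time is taken as the larger of the weight-streaming time and the compute time, compute is approximated as 2 FLOPs per parameter per token, and the 100 TFLOPS peak is a made-up placeholder rather than the specification of any GPU. KV-cache traffic and queueing are ignored.

```python
def batched_decode(batch_size, bandwidth_gb_s, model_size_gb, utilization,
                   params_billion, peak_tflops):
    """Return (per-request tokens/s, total tokens/s) for one decode step."""
    # Time to stream the weights once; shared by every request in the batch.
    memory_time = model_size_gb / (bandwidth_gb_s * utilization)
    # Compute time grows with batch size (about 2 FLOPs per parameter per token).
    compute_time = 2 * params_billion * 1e9 * batch_size / (peak_tflops * 1e12)
    step_time = max(memory_time, compute_time)
    return 1 / step_time, batch_size / step_time


for batch in (1, 64, 256):
    per_request, total = batched_decode(batch, 1008, 16, 0.8, 8, 100)
    print(f"batch={batch:>3}  per-request={per_request:7.1f}  total={total:8.1f} tokens/s")
```

Under these assumptions, total throughput keeps rising with batch size, while per-request speed stays flat until the step becomes compute-bound and then falls.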
FP16 vs INT8?
INT8 uses half the bits of FP16 (8 vs 16), so the estimate is about 2× faster, but accuracy drops slightly.
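The 2× figure follows directly from the bandwidth-bound estimate, as this short sketch shows. The helper name is hypothetical, the divisor is assumed to be the weight size in GB, and both runs assume the same utilization, which real INT8 kernels may not reach.

```python
def tokens_per_second(bandwidth_gb_s, params_billion, bits_per_weight, utilization):
    # Weight size in GB, then the bandwidth-bound estimate.
    model_size_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_size_gb * utilization


fp16 = tokens_per_second(1008, 8, 16, 0.8)
int8 = tokens_per_second(1008, 8, 8, 0.8)
speedup = int8 / fp16
print(f"FP16: {fp16:.1f}  INT8: {int8:.1f}  speedup: {speedup:.1f}x")
```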