Inference Speed Calculator

Calculate AI model inference speed in tokens per second across different hardware configurations. Compare LLM throughput on GPU, TPU, and CPU with batch size optimization. Essential for AI deployment planning, cost estimation, and production serving infrastructure.

Inference Speed

Speed = (Memory Bandwidth / Bits per Token) × Utilization

Variables:

  • Speed: inference speed (tokens/second)
  • BW: memory bandwidth (GB/s)
  • Bits: bits per token (FP16=16, INT8=8)
  • Util: utilization (0.6-0.9)
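
The formula can be written as a small function. This is a minimal sketch that applies the formula exactly as stated above. The function name `inference_speed`, its parameter names, and the input checks are illustrative assumptions, not part of the calculator itself.

```python
def inference_speed(bandwidth_gb_s: float, bits: int, utilization: float) -> float:
    """Estimated tokens/second using the page's formula: (BW / Bits) x Util."""
    # Input checks are an assumption of this sketch, not part of the formula.
    if bandwidth_gb_s <= 0 or bits <= 0:
        raise ValueError("bandwidth and bits must be positive")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return (bandwidth_gb_s / bits) * utilization


if __name__ == "__main__":
    # Inputs taken from the RTX 4090 example on this page.
    print(f"{inference_speed(1008, 16, 0.8):.1f} tokens/s")  # prints 50.4 tokens/s
```

Note that the formula has no term for model size, so the sketch takes only bandwidth, bit width, and utilization.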

How to Use

  1. Select Model
     Choose model and quantization (FP16, INT8, etc.).
  2. Select GPU
     Choose the GPU being used.
  3. Calculate
     Get tokens per second.
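
The three steps can be sketched as lookups followed by the formula. This is a rough illustration, not the calculator's actual implementation. The table names, the function name `estimate_tokens_per_second`, and the default utilization of 0.8 are assumptions. The tables contain only values that appear on this page. Because the formula has no other model term, the model choice enters here only through its quantization.

```python
# Lookup tables for the selection steps. Only values stated on this page are
# included; extend them with your own quantizations and hardware.
QUANT_BITS = {"FP16": 16, "INT8": 8}
GPU_BANDWIDTH_GB_S = {"RTX 4090": 1008}


def estimate_tokens_per_second(gpu: str, quantization: str,
                               utilization: float = 0.8) -> float:
    """Follow the three steps: pick quantization, pick GPU, apply the formula."""
    bits = QUANT_BITS[quantization]          # step 1: model quantization
    bandwidth = GPU_BANDWIDTH_GB_S[gpu]      # step 2: GPU memory bandwidth
    return (bandwidth / bits) * utilization  # step 3: calculate tokens/second


if __name__ == "__main__":
    print(f"{estimate_tokens_per_second('RTX 4090', 'FP16'):.1f} tokens/s")
```

An unknown GPU or quantization name raises `KeyError`, which keeps missing table entries visible rather than silently guessing a value.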

Examples

Llama 3 8B on RTX 4090

Problem:

FP16 (Bits = 16), BW = 1008 GB/s, Util = 0.8. What is the speed?

Solution:
  1. Speed = (1008 / 16) × 0.8
  2. Speed = 63 × 0.8 = 50.4 ≈ 50 tokens/s
Result: ≈ 50 tokens/second

With these inputs, the formula estimates that an RTX 4090 runs Llama 3 8B in FP16 at about 50 tokens per second.

Frequently Asked Questions

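Does batch size matter?
Yes. Larger batches increase total throughput across all requests, but each individual request waits longer, so responsiveness drops.
FP16 vs INT8?
By the formula above, halving the bits per token from 16 to 8 doubles the estimated speed, so INT8 is about 2× faster, at the cost of a slight drop in accuracy.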
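
To see where the 2× figure comes from, the sketch below evaluates the page's formula for both bit widths at the low and high ends of the utilization range (0.6 and 0.9). The function names and the 1008 GB/s bandwidth (the RTX 4090 value from the example) are illustrative choices. The ratio is a property of the formula, and measured speedups on real hardware may differ.

```python
def speed(bandwidth_gb_s: float, bits: int, utilization: float) -> float:
    """Page formula: (BW / Bits) x Util, in tokens/second."""
    return (bandwidth_gb_s / bits) * utilization


def speed_table(bandwidth_gb_s: float, utilizations=(0.6, 0.9)):
    """Return (utilization, FP16 speed, INT8 speed, INT8/FP16 ratio) rows."""
    rows = []
    for util in utilizations:
        fp16 = speed(bandwidth_gb_s, 16, util)
        int8 = speed(bandwidth_gb_s, 8, util)
        rows.append((util, fp16, int8, int8 / fp16))
    return rows


if __name__ == "__main__":
    for util, fp16, int8, ratio in speed_table(1008):
        print(f"util={util:.1f}  FP16={fp16:6.1f}  INT8={int8:6.1f}  ratio={ratio:.1f}x")
```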
