GPU vs LPU: The Bandwidth Problem
GPUs are built for massively parallel floating-point throughput, which suits training. During LLM inference, however, the bottleneck for low-batch, token-by-token decoding is usually memory bandwidth, not compute. Every decoding step reads billions of model weights from DRAM, and DRAM bandwidth tops out around 3 TB/s even on an H100. Most of each step is spent waiting for data, not computing.
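The bandwidth ceiling can be sketched with simple arithmetic. If producing one token means reading every weight once, single-stream decode speed cannot exceed bandwidth divided by model size in bytes. The function name, the 2 bytes per parameter (FP16/BF16), and the single-device simplification are assumptions for illustration. The sketch ignores memory capacity, batching, and sharding across GPUs.

```python
def max_decode_tokens_per_sec(params_billion: float,
                              bytes_per_param: float,
                              bandwidth_tb_per_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes_per_s / model_bytes


# 70B parameters at an assumed 2 bytes each, with roughly 3 TB/s of DRAM bandwidth
bound = max_decode_tokens_per_sec(70, 2, 3.0)
print(f"{bound:.1f} tokens/sec upper bound")  # prints "21.4 tokens/sec upper bound"
```

Batching raises aggregate throughput because one weight read serves many requests, but it does little for the latency a single user sees.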
Groq's Language Processing Unit (LPU) is built differently. It uses a deterministic, single-threaded compute fabric with large on-chip SRAM: no speculative execution, no cache hierarchies, and no DRAM reads on the critical path. For supported model sizes, all weights are held in on-chip SRAM (spread across many chips for larger models), which provides 10–80x higher effective bandwidth per watt than DRAM-based GPU inference.
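For a sense of scale, here is a rough sketch of how many chips it takes to hold a model's weights in SRAM. The 230 MB of SRAM per chip is an assumption (the figure commonly cited for Groq's first-generation chip), as are the 1 byte per parameter and the function name. It counts weights only and ignores activations and the KV cache.

```python
import math


def chips_needed(params_billion: float,
                 bytes_per_param: float,
                 sram_mb_per_chip: float) -> int:
    """Minimum chip count so that the weights alone fit in on-chip SRAM."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    sram_bytes = sram_mb_per_chip * 1e6
    return math.ceil(model_bytes / sram_bytes)


ASSUMED_SRAM_MB = 230  # assumption, not a figure from this article

print(chips_needed(8, 1, ASSUMED_SRAM_MB))   # prints 35
print(chips_needed(70, 1, ASSUMED_SRAM_MB))  # prints 305
```

Under these assumptions a 70B model spans hundreds of chips, so the design trades chip count for per-token latency.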
The result: GroqCloud delivers 800+ tokens/sec on Llama 3.1 70B, compared to 30–60 tokens/sec on a typical GPU-based API.
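To put those rates in user-facing terms, this small sketch converts tokens per second into the time needed to generate a 512-token response. It uses the throughput figures quoted above and ignores time-to-first-token and network overhead. The function name is hypothetical.

```python
def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_sec


for rate in (800, 60, 30):
    print(f"{rate} tok/s: {generation_seconds(512, rate):.2f} s")
# prints:
# 800 tok/s: 0.64 s
# 60 tok/s: 8.53 s
# 30 tok/s: 17.07 s
```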
Supported Models on GroqCloud (2026)
- Llama 3.1 8B Instruct
- Llama 3.1 70B Instruct
- Llama 3.3 70B Versatile
- Gemma 2 9B IT
- Mixtral 8x7B (Mistral AI)
- Whisper Large v3 (audio)
Getting Started
Install the SDK:
pip install groq
The groq-python SDK mirrors the OpenAI client API:
from groq import Groq
client = Groq(api_key="YOUR_GROQ_API_KEY")
response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[{"role": "user", "content": "Explain the LPU in plain English"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
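In practice you will probably not want the API key hardcoded, and you will often want a system prompt alongside the user message. Below is a minimal sketch of two helpers for that. `get_api_key` and `build_messages` are hypothetical names, not part of the SDK. `GROQ_API_KEY` is the environment variable the SDK reads by default when no `api_key` is passed.

```python
import os
from typing import Dict, List, Mapping, Optional


def get_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Read the key from the environment instead of hardcoding it."""
    key = env.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("Set GROQ_API_KEY before creating the client")
    return key


def build_messages(user_prompt: str,
                   system_prompt: Optional[str] = None) -> List[Dict[str, str]]:
    """Assemble an OpenAI-style message list, with an optional system prompt."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

With these in place, the client becomes `Groq(api_key=get_api_key())` and the request takes `messages=build_messages("Explain the LPU in plain English", "Be concise.")`.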
Streaming Example
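# Reuses the client created in Getting Started
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Write a sorting algorithm in Rust"}],
    stream=True,
)
for chunk in stream:
    # delta.content can be None (for example on the final chunk), so fall back to ""
    print(chunk.choices[0].delta.content or "", end="", flush=True)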
Because Groq processes tokens deterministically and in order, the time-to-first-token (TTFT) is typically under 200 ms even on the 70B model, comparable to a cached local model response.
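If you want to check TTFT for your own workload, something like the following sketch can help. `measure_stream` is a hypothetical helper, not part of the SDK. It consumes any iterable of text pieces and records when the first non-empty piece arrives. A fake generator stands in for a real stream so the example runs offline.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(pieces: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[str, float, float]:
    """Consume text pieces; return (full_text, ttft_seconds, total_seconds)."""
    start = clock()
    ttft = None
    parts = []
    for piece in pieces:
        if piece and ttft is None:
            ttft = clock() - start  # first non-empty piece has arrived
        parts.append(piece)
    total = clock() - start
    # If nothing non-empty arrived, report the total time as the TTFT
    return "".join(parts), (ttft if ttft is not None else total), total


def fake_stream():
    """Stand-in for a streamed response; yields text pieces."""
    for piece in ["", "Hello", ", ", "world"]:
        yield piece


text, ttft, total = measure_stream(fake_stream())
print(text)  # prints "Hello, world"
```

Against the real API, pass a generator that yields `chunk.choices[0].delta.content or ""` for each chunk. For the timer to include the request itself, issue the `create(...)` call inside that generator so it runs only after `measure_stream` has started the clock.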
Free Tier and Cost
GroqCloud has a generous free tier: 14,400 requests/day and 500k tokens/day per model as of early 2026. Paid tiers are priced at approximately $0.05–0.59 per million tokens depending on model size — significantly cheaper than equivalent throughput on AWS or Azure GPU instances.
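A quick way to see where a workload lands is to plug it into the numbers above. This is a rough sketch: the function names and the 10M tokens/day workload are hypothetical, the limits and prices are the ones quoted in this section, and it uses a single blended price rather than separate input and output rates.

```python
def estimate_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of a token volume at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


def within_free_tier(requests_per_day: int,
                     tokens_per_day: int,
                     max_requests: int = 14_400,
                     max_tokens: int = 500_000) -> bool:
    """Check a daily workload against the free-tier limits quoted above."""
    return requests_per_day <= max_requests and tokens_per_day <= max_tokens


monthly_tokens = 10_000_000 * 30  # hypothetical: 10M tokens/day for 30 days
low = estimate_cost_usd(monthly_tokens, 0.05)
high = estimate_cost_usd(monthly_tokens, 0.59)
print(f"${low:.2f} to ${high:.2f} per month")  # prints "$15.00 to $177.00 per month"
```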
When to Use Groq vs vLLM vs Ollama
| Use case | Best choice |
|---|---|
| Lowest latency API, no GPU needed | Groq |
| Self-hosted, high throughput, enterprise | vLLM |
| Local dev, privacy, offline | Ollama |
Groq is ideal for user-facing applications where latency is the primary UX metric — chat interfaces, voice assistants, and real-time coding tools.