What is a Language Processing Unit (LPU) and how does it differ from a GPU?
The LPU is Groq's custom-designed chip built specifically for AI inference, not training. Unlike GPUs that use shared memory hierarchies and caches, the LPU uses hundreds of megabytes of onboard SRAM with direct chip-to-chip connectivity and static scheduling. This eliminates batching delays and delivers deterministic, low-latency performance for sequential token generation.
How does Groq pricing work?
Groq uses pure pay-per-token pricing with no monthly subscription. You pay per million input and output tokens, with rates varying by model — for example, Llama 3.1 8B costs $0.05/$0.08 per million input/output tokens, while larger models like Llama 3.3 70B cost $0.59/$0.79. Prompt caching and the batch API each offer a 50% discount.
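The arithmetic is simple enough to sketch. A minimal cost calculator, using the Llama 3.1 8B rates quoted above (rates are illustrative — check Groq's pricing page for current numbers):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in dollars; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 2,000-token prompt with a 500-token completion on Llama 3.1 8B:
cost = request_cost(2_000, 500, input_rate=0.05, output_rate=0.08)
# (2000 * 0.05 + 500 * 0.08) / 1e6 = 0.00014 dollars

# A 50% prompt-caching or batch discount would simply halve the relevant term.
```

At these rates, roughly 7,000 such requests cost about a dollar, which is why per-token billing suits bursty or low-volume workloads.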
Can I use Groq as a drop-in replacement for OpenAI?
Yes. Groq's API is OpenAI-compatible, so you typically only need to change the base URL and API key in your existing code. Most OpenAI SDK features including function calling, streaming, and JSON mode are supported.
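With the official OpenAI SDK, the switch is just pointing `base_url` at Groq and supplying a Groq API key. The same compatibility is visible at the HTTP level; here is a standard-library sketch that builds a chat-completions request against the OpenAI-style path (the model ID shown is an assumption — pick one from the console):

```python
import json
import os
import urllib.request

# Only this base URL (and the API key) differ from an equivalent OpenAI call.
GROQ_BASE_URL = "https://api.groq.com/openai/v1"

def build_chat_request(api_key: str, model: str,
                       messages: list) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat-completions endpoint."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        url=f"{GROQ_BASE_URL}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    api_key=os.environ.get("GROQ_API_KEY", "gsk_placeholder"),
    model="llama-3.3-70b-versatile",  # assumed model ID; verify in the console
    messages=[{"role": "user", "content": "Hello"}],
)
# urllib.request.urlopen(req) would send it with a valid key.
```

In SDK terms this is the one-line change `OpenAI(base_url="https://api.groq.com/openai/v1", api_key=...)`; request and response bodies keep the OpenAI shape.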
What models are available on Groq?
Groq supports open-source LLMs including Llama 4 Scout and Maverick, Llama 3.3 70B, Qwen3 32B, GPT-OSS 20B and 120B, and Kimi K2. For audio, it offers Whisper v3 Large and Turbo for transcription, and Canopy Labs Orpheus for text-to-speech.
Does Groq offer on-premises deployment?
Yes, through GroqRack — a purpose-built inference appliance for enterprises that need to run models in their own data centers. GroqRack uses the same LPU technology as GroqCloud. Pricing and availability require contacting the Groq sales team.
What are the rate limits on Groq's free tier?
Groq offers free-tier access with rate limits that vary by model. Exact limits are displayed in the GroqCloud developer console upon signup. Paid usage removes or significantly raises these limits, and enterprise plans offer fully custom rate limits.
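On the free tier it is worth handling rate-limit rejections gracefully. A hedged sketch of retry-with-exponential-backoff, where `RateLimitError` and the flaky call are stand-ins for the SDK's 429 error and a real API request (production code may prefer honoring the server's retry-after header):

```python
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's HTTP 429 error type."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` on RateLimitError, doubling the delay each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo: a stand-in call that is rate-limited twice, then succeeds.
attempts = {"n": 0}
def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

result = with_backoff(flaky_request, base_delay=0.01)
```

This pattern keeps free-tier scripts working through transient limits without hammering the API.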