
The most expressive open-source voice AI model for realistic and conversational speech generation.
Fish Audio S2 offers a generous free tier with optional paid upgrades for advanced features.
Fish Audio S2 Pro is built on a Dual-Autoregressive (Dual-AR) architecture that combines a 4B-parameter Slow AR for semantic prediction with a 400M-parameter Fast AR for acoustic detail. It is trained on more than 10 million hours of audio across 80+ languages and uses reinforcement learning alignment to achieve fine-grained control over prosody and emotion.
On a single NVIDIA H200 GPU, S2 Pro achieves a Real-Time Factor (RTF) of 0.195 and a time-to-first-audio of approximately 100 ms. It also sustains a throughput of over 3,000 acoustic tokens per second while keeping RTF below 0.5. Its SGLang-based inference engine incorporates LLM-native serving optimizations such as continuous batching and a paged KV cache.
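To make the RTF figures concrete, here is a minimal sketch that assumes the usual definition of Real-Time Factor (processing time divided by the duration of the audio produced). The function names are illustrative and are not part of any Fish Audio API.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the common definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def synthesis_time_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time needed to generate `audio_seconds` of speech."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def speedup_over_real_time(rtf: float) -> float:
    """How many times faster than playback speed generation runs."""
    if rtf <= 0:
        raise ValueError("rtf must be > 0")
    return 1.0 / rtf


if __name__ == "__main__":
    rtf = 0.195  # figure quoted above for a single H200
    # One minute of speech at RTF 0.195 takes roughly 11.7 s to generate.
    print(f"{synthesis_time_seconds(60.0, rtf):.1f} s for 60 s of audio")
    print(f"{speedup_over_real_time(rtf):.2f}x faster than real time")
```

Any RTF below 1.0 means audio is generated faster than it plays back, which is why both 0.195 and the sub-0.5 figure are compatible with streaming use.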
Tier 1 languages (Japanese, English, and Chinese) offer the highest-quality speech generation. Tier 2 languages include Korean, Spanish, Portuguese, Arabic, Russian, French, and German. Many additional languages are also supported.
The S2 Pro model is licensed under the Fish Audio Research License, which permits research and non-commercial use free of charge. Commercial use requires a separate license directly from Fish Audio.
Each minute of speech generation costs approximately 600 to 625 credits. Credits reset monthly, and unused credits do not roll over to the next billing cycle.
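As a rough budgeting aid, the sketch below turns the quoted 600 to 625 credits-per-minute range into a cost estimate. The helper names and the 10,000-credit allowance in the example are hypothetical; actual plan sizes are not stated here.

```python
CREDITS_PER_MINUTE_LOW = 600   # lower bound quoted above
CREDITS_PER_MINUTE_HIGH = 625  # upper bound quoted above


def estimate_credits(minutes: float) -> tuple[float, float]:
    """Return the (low, high) credit cost for `minutes` of generated speech."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return (minutes * CREDITS_PER_MINUTE_LOW, minutes * CREDITS_PER_MINUTE_HIGH)


def minutes_affordable(monthly_credits: float) -> tuple[float, float]:
    """Return the (worst-case, best-case) minutes a monthly allowance covers."""
    if monthly_credits < 0:
        raise ValueError("monthly_credits must be non-negative")
    return (
        monthly_credits / CREDITS_PER_MINUTE_HIGH,
        monthly_credits / CREDITS_PER_MINUTE_LOW,
    )


if __name__ == "__main__":
    low, high = estimate_credits(10)
    print(f"10 minutes of speech: {low:.0f} to {high:.0f} credits")

    # Hypothetical allowance, used only for illustration.
    worst, best = minutes_affordable(10_000)
    print(f"10,000 credits cover about {worst:.1f} to {best:.1f} minutes")
```

Because unused credits do not roll over, the worst-case figure is the safer one to plan a month around.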
Source: fish.audio