How does Ollama leverage Apple's MLX framework to improve performance on Apple Silicon?
Ollama integrates with Apple's MLX framework to take advantage of Apple Silicon's unified memory architecture and the GPU Neural Accelerators introduced with the M5, M5 Pro, and M5 Max chips. This significantly improves both time to first token (TTFT) and generation speed (tokens per second) for LLMs running on Apple Silicon devices.
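The two metrics above can be derived from the stats Ollama reports at the end of a streaming response. A minimal sketch, assuming the field names from Ollama's REST API (eval_count, eval_duration, load_duration, prompt_eval_duration, with durations in nanoseconds); the stats values shown are hypothetical:

```python
# Derive TTFT and tokens/sec from the final stats object of an Ollama
# /api/generate streaming response. Durations are in nanoseconds.

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed: output tokens divided by generation time."""
    return eval_count / (eval_duration_ns / 1e9)

def time_to_first_token_s(load_duration_ns: int, prompt_eval_duration_ns: int) -> float:
    """Approximate TTFT: model load time plus prompt-processing time."""
    return (load_duration_ns + prompt_eval_duration_ns) / 1e9

# Hypothetical stats from a completed response:
stats = {"load_duration": 500_000_000,         # 0.5 s
         "prompt_eval_duration": 250_000_000,  # 0.25 s
         "eval_count": 120,
         "eval_duration": 2_000_000_000}       # 2 s

print(tokens_per_second(stats["eval_count"], stats["eval_duration"]))  # 60.0
print(time_to_first_token_s(stats["load_duration"],
                            stats["prompt_eval_duration"]))            # 0.75
```

Faster prompt processing (shorter prompt_eval_duration) is what MLX's accelerators improve for TTFT, while a higher eval rate shows up directly as tokens per second.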
What is NVFP4 support, and how does it benefit Ollama users?
NVFP4 is NVIDIA's 4-bit floating-point quantization format. Supporting it means Ollama can run NVFP4-quantized models, which maintain model accuracy while reducing memory bandwidth and storage requirements during inference. This lets users achieve results consistent with production environments and run models optimized with NVIDIA's Model Optimizer.
How do Ollama's improved caching mechanisms enhance efficiency for coding and agentic tasks?
Ollama's upgraded cache is reused across conversations, lowering memory utilization and increasing cache hits, especially when conversations share a system prompt. It also stores checkpoints at strategic points within the prompt, cutting reprocessing time for faster responses, and shared prefixes survive longer even when older conversation branches are evicted.
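The shared-system-prompt case can be sketched as follows. The payload shape follows Ollama's /api/chat endpoint; the model name and prompts are hypothetical:

```python
# Two independent conversations that share a system prompt produce an
# identical leading message, so the server can reuse the cached KV state
# for that prefix instead of reprocessing it.

SYSTEM_PROMPT = "You are a careful coding assistant."

def chat_payload(user_message: str) -> dict:
    """Build an Ollama /api/chat request body (model name is hypothetical)."""
    return {
        "model": "qwen3-coder",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }

a = chat_payload("Refactor this function.")
b = chat_payload("Write a unit test for it.")
assert a["messages"][0] == b["messages"][0]  # identical cacheable prefix
```

Every conversation built this way starts with the same tokens, which is exactly the pattern the improved cache rewards with higher hit rates.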
What are the key differences in cloud usage and concurrency between the Free, Pro, and Max plans?
The Free plan allows 1 concurrent cloud model and light usage. The Pro plan offers 3 concurrent cloud models and 50x more usage than Free, suitable for day-to-day work. The Max plan provides 10 concurrent cloud models and 5x more usage than Pro, designed for heavy, sustained tasks and continuous agent workflows. Local model usage is unlimited across all plans.
Can I use Ollama with custom fine-tuned models, and what are the plans for easier import?
While the current preview release focuses on specific models, Ollama is actively working to support future models and will introduce an easier way to import custom models fine-tuned on supported architectures. In the meantime, Ollama plans to expand the list of supported architectures.
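Until that easier import path lands, fine-tuned models exported to GGUF can already be imported with a Modelfile. A minimal sketch; the file name and system prompt are placeholders:

```
# Modelfile — import a fine-tuned model from a local GGUF file
FROM ./my-finetune.gguf
SYSTEM "You are a helpful assistant."
```

Build it with `ollama create my-finetune -f Modelfile`, then run it with `ollama run my-finetune`. This route works only for architectures Ollama already supports, which is why the expanded architecture list matters.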