Inferless

Deploy and scale machine learning models on serverless GPUs in minutes.

TL;DR - Inferless

  • Deploys machine learning models to serverless GPUs rapidly.
  • Automatically scales GPU resources from zero to hundreds based on demand.
  • Offers usage-based billing and fast cold starts for cost-effective inference.
Pricing: Free plan available
Best for: Growing teams

Pros & Cons

Pros

  • Eliminates infrastructure management for GPU clusters
  • Scales automatically with workload, paying only for usage
  • Achieves sub-second cold starts for large models
  • Provides significant cost savings compared to traditional GPU clusters
  • Offers enterprise-grade security with SOC-2 Type II certification

Cons

  • Specific pricing details for enterprise plans require direct contact
  • Currently in private beta for certain offerings, requiring waitlist access

Ratings Across the Web

4 (2 reviews)

Ratings aggregated from independent review platforms.

Key Features

  • One-click deployment from Hugging Face, Git, Docker, or CLI
  • Automatic scaling from zero to hundreds of GPUs
  • Customizable container runtimes
  • NFS-like writable volumes with simultaneous connections
  • Automated CI/CD for model re-imports
  • Detailed call and build logs for monitoring
  • Dynamic batching for increased throughput
  • Customizable private endpoints (scale down, timeout, concurrency, testing, webhooks)

Pricing Plans

Free Trial

Starter

$0.000555/sec

  • Designed for small teams and independent developers
  • Deploy models in minutes without worrying about the cost

Enterprise

Contact us

  • Built for fast-growing startups and larger organizations
  • Scale quickly at an affordable cost with desired latency results

Nvidia T4 Dedicated

$0.000185/sec

  • GPU RAM: 16GB
  • vCPUs: 3x
  • RAM: 20GB

Nvidia A10 Dedicated

$0.000341/sec

  • GPU RAM: 24GB
  • vCPUs: 7x
  • RAM: 30GB

Nvidia A100 Dedicated

$0.001491/sec

  • GPU RAM: 80GB
  • vCPUs: 20x
  • RAM: 200GB

Nvidia T4 Shared

$0.000092/sec

  • GPU RAM: 8GB
  • vCPUs: 1.5x
  • RAM: 10GB

Nvidia A10 Shared

$0.000170/sec

  • GPU RAM: 12GB
  • vCPUs: 3x
  • RAM: 15GB

Nvidia A100 Shared

$0.000745/sec

  • GPU RAM: 40GB
  • vCPUs: 10x
  • RAM: 100GB
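
To compare these per-second rates with the hourly GPU prices quoted by other providers, multiply by 3,600. The following is a minimal sketch using only the rates listed above; the dictionary and function names are illustrative and are not part of any Inferless tooling.

```python
# Per-second rates in USD, copied from the pricing list above.
RATES_PER_SECOND = {
    "T4 Dedicated": 0.000185,
    "A10 Dedicated": 0.000341,
    "A100 Dedicated": 0.001491,
    "T4 Shared": 0.000092,
    "A10 Shared": 0.000170,
    "A100 Shared": 0.000745,
}

SECONDS_PER_HOUR = 3600


def hourly_rate(instance: str) -> float:
    """Return the cost in USD of one hour of active compute (hypothetical helper)."""
    return RATES_PER_SECOND[instance] * SECONDS_PER_HOUR


if __name__ == "__main__":
    for name in RATES_PER_SECOND:
        print(f"{name}: ${hourly_rate(name):.3f}/hour")
```

On these figures a dedicated A100 works out to about $5.37 per hour of active compute, and each shared tier costs roughly half its dedicated counterpart, in line with its halved GPU RAM, vCPU, and RAM allocation.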

Volume Pricing - Storage

Free 50GB/month, then $0.3/GB/month

  • 50 GB free every month
  • Extra storage costs $0.3/GB/month
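
The storage rule above can be sketched as a small function. This assumes the charge is linear in the gigabytes stored beyond the free allowance; how partial months or fractional gigabytes are actually billed is not stated on this page, and the names below are illustrative.

```python
FREE_GB_PER_MONTH = 50      # free allowance from the listing above
PRICE_PER_EXTRA_GB = 0.30   # USD per GB per month beyond the allowance


def monthly_storage_cost(stored_gb: float) -> float:
    """Return the estimated monthly storage cost in USD (hypothetical helper).

    Assumption: only gigabytes above the free allowance are billed,
    at a flat per-GB rate with no rounding.
    """
    billable_gb = max(0.0, stored_gb - FREE_GB_PER_MONTH)
    return billable_gb * PRICE_PER_EXTRA_GB


if __name__ == "__main__":
    for gb in (10, 50, 120):
        print(f"{gb} GB -> ${monthly_storage_cost(gb):.2f}/month")
```

Under that assumption, 120 GB of stored model weights would cost about $21 per month, since only the 70 GB above the allowance is billed.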

Join Waitlist (Startup)

Contact us

  • Min 10,000 Inference Requests per month
  • Unlimited deployed webhook endpoints
  • GPU concurrency of 5
  • 15 days of log retention
  • Support via private Slack Connect within 48 working hours
  • Included credits: $30

Get Early Access (Enterprise)

Contact us

  • Min 100,000 Inference Requests per month
  • Unlimited deployed webhook endpoints
  • GPU concurrency of 50
  • 365 days of log retention
  • Support via private Slack Connect and a support engineer
  • Included credits: Custom

What is Inferless?

Editorial review
Inferless provides a serverless GPU inference platform designed for deploying machine learning models quickly and affordably. It allows users to take a model file and deploy it as an endpoint in minutes, supporting deployments from Hugging Face, Git, Docker, or CLI with automatic redeploy options. The platform is engineered to handle spiky and unpredictable workloads, automatically scaling from zero to hundreds of GPUs using an in-house load balancer, ensuring efficient resource utilization and minimal overhead.

This platform is ideal for machine learning engineers, data scientists, and developers who need to deploy compute-intensive deep learning models without managing underlying infrastructure. It offers features like custom runtimes, NFS-like writable volumes, automated CI/CD, and detailed monitoring. Inferless aims to optimize high-end computing resources, enabling companies to run custom models built on open-source frameworks efficiently and cost-effectively, with a focus on reducing cold starts and providing usage-based billing.

Key benefits include zero infrastructure management, on-demand scaling with payment only for actual usage, and lightning-fast cold starts. The platform supports various GPU types like Nvidia A100, A10, and T4, and is built with enterprise-level security, including SOC-2 Type II certification and regular vulnerability scans. It's particularly beneficial for applications in computer vision, NLP, recommendations, and scientific computing.

Inferless FAQ

How does Inferless manage GPU sharing and elasticity, given that Kubernetes typically doesn't allow for direct GPU sharing?

Inferless addresses the challenges of GPU elasticity and sharing by utilizing a proprietary algorithm and an in-house built load balancer. This system optimizes model loading and maintains SLAs by using a cluster of always-on machines, ensuring efficient utilization of GPU resources and balancing autoscaling with desired latency, even though GPUs are not as elastic as CPUs in standard Kubernetes deployments.

What is the practical difference in performance and cost between a 'Shared' and 'Dedicated' GPU instance on Inferless?

Shared instances on Inferless allocate GPU resources among multiple users, offering a cost-effective solution with variable performance suitable for smaller or infrequent tasks. Dedicated instances, conversely, provide exclusive access to an entire GPU, delivering consistent high performance at a higher cost, which is optimal for large-scale tasks or when data isolation is critical. The choice depends on workload demands, performance requirements, and budget.

Can I deploy a model that requires specific pre-processing and post-processing functions alongside the model file itself?

Yes, Inferless allows engineering teams to deploy not only the model file but also integrate pre-processing and post-processing functions. The platform automatically creates the necessary endpoints and provides monitoring data for these end-to-end model deployments, simplifying the entire inference pipeline.

How does Inferless achieve a 99% reduction in model cold start times, particularly for large models like GPT-J?

Inferless is optimized for fast model loading and advertises sub-second responses even for large models. A model like GPT-J might traditionally take 25 minutes to cold start; Inferless says it can reduce this to approximately 10 seconds, a reduction of about 99%. This is achieved through its serverless GPU architecture and proprietary algorithms that optimize model load and resource allocation, eliminating warm-up delays.
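
The 99% figure in the question can be checked against the two cold start times quoted in the answer. This is a quick arithmetic sketch, not a benchmark; the function name is illustrative.

```python
def reduction_percent(before_seconds: float, after_seconds: float) -> float:
    """Return the percentage reduction from before_seconds to after_seconds."""
    return (before_seconds - after_seconds) / before_seconds * 100


baseline_seconds = 25 * 60   # 25-minute traditional cold start quoted above
optimized_seconds = 10       # approximate Inferless cold start quoted above

print(f"{reduction_percent(baseline_seconds, optimized_seconds):.1f}%")  # prints 99.3%
```

Going from 1,500 seconds to 10 seconds is a reduction of about 99.3%, which matches the headline claim.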

What security measures are in place to ensure data and model isolation for customers using Inferless?

Inferless prioritizes customer data and privacy by isolating execution environments using Docker containerization, preventing interaction between individual customer environments. Log streams are securely separated with AWS CloudWatch Logs access controls, retained for 30 days, and then deleted. Model hosting storage is encrypted using AES-256, and models and data are never shared across customers.

If my model has varying inference request patterns, how does Inferless's billing model ensure I only pay for what I use, especially if there are periods of no activity?

Inferless operates on a pay-per-second, usage-based billing model. You are only charged for the compute resources used when your models are actively running in a healthy state. If you configure your minimum replicas to zero, no machines are spun up when there are no inference requests, meaning you incur no charges during periods of inactivity. This ensures cost efficiency by avoiding idle costs.
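
To see what scaling to zero means for a bill, the sketch below compares a workload that is active two hours a day with one that keeps a replica running around the clock. It uses the Nvidia A10 Dedicated rate from the pricing list and assumes billing covers exactly the seconds a replica is active; any billed time for cold starts or scale-down delays is not modeled. The function name and workload figures are illustrative.

```python
A10_DEDICATED_PER_SECOND = 0.000341  # USD, from the pricing list on this page


def monthly_compute_cost(rate_per_second: float,
                         active_seconds_per_day: float,
                         days: int = 30) -> float:
    """Estimate monthly compute cost in USD (hypothetical helper).

    Assumption: only seconds with an active replica are billed.
    """
    return rate_per_second * active_seconds_per_day * days


if __name__ == "__main__":
    two_hours = 2 * 60 * 60
    full_day = 24 * 60 * 60

    scale_to_zero = monthly_compute_cost(A10_DEDICATED_PER_SECOND, two_hours)
    always_on = monthly_compute_cost(A10_DEDICATED_PER_SECOND, full_day)

    print(f"Scale to zero, 2 h/day active: ${scale_to_zero:.2f}")
    print(f"One replica always on:         ${always_on:.2f}")
```

Under these assumptions the scale-to-zero configuration costs about $74 for the month, against about $884 for a replica that never scales down.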