How does Inferless manage GPU sharing and elasticity, given that Kubernetes typically doesn't allow for direct GPU sharing?
Inferless addresses the challenges of GPU elasticity and sharing with a proprietary scheduling algorithm and an in-house load balancer. Because GPUs are not as elastic as CPUs in standard Kubernetes deployments, the system maintains a cluster of always-on machines to optimize model loading and meet SLAs, balancing autoscaling against target latency while keeping GPU utilization high.
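The trade-off between elasticity and latency can be illustrated with a generic autoscaling heuristic. This is a minimal sketch, not Inferless's proprietary algorithm; the function name and parameters are illustrative assumptions.

```python
# Generic replica-scaling heuristic (NOT Inferless's proprietary
# algorithm): cover in-flight load, clamped to configured bounds.

def desired_replicas(inflight_requests: int,
                     per_replica_concurrency: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Scale to cover current load within [min_replicas, max_replicas].

    Keeping min_replicas > 0 (always-on machines) avoids cold starts
    at the cost of some idle GPU time.
    """
    needed = -(-inflight_requests // per_replica_concurrency)  # ceil division
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(25, 8, 1, 10))  # ceil(25/8) -> 4 replicas
```

The always-on floor (`min_replicas`) is what lets a scheduler honor latency SLAs despite GPUs being slow to provision.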
What is the practical difference in performance and cost between a 'Shared' and 'Dedicated' GPU instance on Inferless?
Shared instances on Inferless allocate GPU resources among multiple users, offering a cost-effective solution with variable performance suitable for smaller or infrequent tasks. Dedicated instances, conversely, provide exclusive access to an entire GPU, delivering consistent high performance at a higher cost, which is optimal for large-scale tasks or when data isolation is critical. The choice depends on workload demands, performance requirements, and budget.
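The cost side of this choice can be sketched with a simple per-second model. The rates below are illustrative assumptions, not Inferless's published pricing.

```python
# Hypothetical shared vs. dedicated cost comparison. The per-second
# rates are illustrative assumptions only.

SHARED_RATE = 0.0003     # $/s for a fractional GPU slice (assumed)
DEDICATED_RATE = 0.0012  # $/s for a whole GPU (assumed)

def monthly_cost(rate_per_s: float, active_hours: float) -> float:
    """Cost for the hours the workload actually runs each month."""
    return rate_per_s * active_hours * 3600

bursty = 2 * 30      # ~2 h/day of activity
sustained = 24 * 30  # always-on workload

print(f"bursty on shared:       ${monthly_cost(SHARED_RATE, bursty):.2f}")
print(f"bursty on dedicated:    ${monthly_cost(DEDICATED_RATE, bursty):.2f}")
print(f"sustained on dedicated: ${monthly_cost(DEDICATED_RATE, sustained):.2f}")
```

For bursty workloads the shared tier's lower rate dominates; for sustained, isolation-sensitive workloads the dedicated tier's consistent performance justifies the premium.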
Can I deploy a model that requires specific pre-processing and post-processing functions alongside the model file itself?
Yes, Inferless allows engineering teams to deploy not only the model file but also integrate pre-processing and post-processing functions. The platform automatically creates the necessary endpoints and provides monitoring data for these end-to-end model deployments, simplifying the entire inference pipeline.
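A deployment of this shape is typically structured as a handler class. The sketch below follows the `InferlessPythonModel` pattern with `initialize`/`infer`/`finalize` methods used in Inferless examples; the placeholder "model" and the helper method names are assumptions made so the sketch is self-contained.

```python
# Sketch of an end-to-end handler bundling pre- and post-processing
# with the model. The lambda "model" is a stand-in for real weights.

class InferlessPythonModel:
    def initialize(self):
        # In practice: load model weights onto the GPU here
        # (runs once per replica, not per request).
        self.model = lambda text: text.upper()  # placeholder model

    def _preprocess(self, inputs: dict) -> str:
        # Validate and normalize the raw request payload.
        return inputs["prompt"].strip()

    def _postprocess(self, raw_output: str) -> dict:
        # Shape the raw model output into the response schema.
        return {"generated_text": raw_output}

    def infer(self, inputs: dict) -> dict:
        # The endpoint runs the full pipeline per request.
        return self._postprocess(self.model(self._preprocess(inputs)))

    def finalize(self):
        # Release GPU memory / resources when the replica scales down.
        self.model = None
```

The platform wires `infer` to the generated endpoint, so monitoring covers the whole pipeline, not just the model call.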
How does Inferless achieve a 99% reduction in model cold start times, particularly for large models like GPT-J?
Inferless is optimized for rapid model loading, delivering sub-second responses once a model is warm. While a model like GPT-J might take around 25 minutes to cold start in a traditional setup, Inferless reduces this to approximately 10 seconds. This is achieved through its serverless GPU architecture and proprietary algorithms that optimize model loading and resource allocation, minimizing warm-up delays.
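The headline figure is consistent with the numbers cited: a drop from roughly 25 minutes to roughly 10 seconds is a reduction of about 99%.

```python
# Sanity check on the cold-start reduction cited for GPT-J.
traditional_s = 25 * 60  # ~25 minutes, traditional cold start
inferless_s = 10         # ~10 seconds on Inferless

reduction = 1 - inferless_s / traditional_s
print(f"{reduction:.1%}")  # -> 99.3%
```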
What security measures are in place to ensure data and model isolation for customers using Inferless?
Inferless prioritizes customer data and privacy by isolating execution environments using Docker containerization, preventing interaction between individual customer environments. Log streams are securely separated with AWS CloudWatch Logs access controls, retained for 30 days, and then deleted. Model hosting storage is encrypted using AES-256, and models and data are never shared across customers.
If my model has varying inference request patterns, how does Inferless's billing model ensure I only pay for what I use, especially if there are periods of no activity?
Inferless operates on a pay-per-second, usage-based billing model. You are only charged for the compute resources used when your models are actively running in a healthy state. If you configure your minimum replicas to zero, no machines are spun up when there are no inference requests, meaning you incur no charges during periods of inactivity. This ensures cost efficiency by avoiding idle costs.
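The billing model above can be sketched as a per-second meter. The rate here is a hypothetical placeholder; actual Inferless pricing varies by GPU type and tier.

```python
# Sketch of pay-per-second billing with scale-to-zero. The rate is
# an illustrative assumption, not Inferless's actual pricing.

PER_SECOND_RATE = 0.0005  # $/s per replica (assumed)

def billed_cost(active_s: float, idle_s: float, min_replicas: int = 0) -> float:
    """Active compute is always billed; idle time is billed only for
    the min_replicas kept warm. With min_replicas=0, idle is free."""
    return (active_s + idle_s * min_replicas) * PER_SECOND_RATE

# 500 s of inference spread across an hour of wall-clock time:
print(f"${billed_cost(active_s=500, idle_s=3100, min_replicas=0):.2f}")
print(f"${billed_cost(active_s=500, idle_s=3100, min_replicas=1):.2f}")
```

Setting `min_replicas=0` is what makes idle periods cost nothing; a non-zero floor trades money for the cold-start avoidance discussed earlier.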