Aqueduct

The AI SRE that integrates with your stack, investigates alerts, and accelerates incident resolution.

Tracked since 2026
0 reviews tracked

The Bottom Line

Entry price

Paid plans only

Biggest pro

Significantly reduces Mean Time To Resolution (MTTR)

Biggest con

Requires integration with existing observability and operational tools

TL;DR - Aqueduct

  • AI SRE for automated incident investigation and resolution.
  • Correlates data across observability, code, and tickets to provide clear next steps.
  • Continuously learns from incidents to improve MTTR and prevent recurrence.
Pricing: Paid only
Best for: Enterprises & pros

What is Aqueduct?

Editorial review
RunLLM is an AI Site Reliability Engineer (SRE) that automates and accelerates incident resolution by integrating with existing observability tools, code repositories, ticketing systems, and chat platforms. It acts as an always-on agent, correlating alerts, logs, metrics, traces, and tickets to provide evidence-backed investigations and clear next steps for mitigation and root cause analysis (RCA). Designed for SRE teams, on-call engineers, and anyone responsible for system uptime, RunLLM aims to reduce alert fatigue, improve uptime, and prevent repeat incidents. It continuously learns from every investigation and user correction, adapting to specific environments and capturing tribal knowledge to provide veteran-level guidance during live incidents. The platform is built for rapid deployment, offering day-one value by connecting to existing tools without complex setup. RunLLM leverages advanced AI and LLM research from UC Berkeley, employing sophisticated data engineering, model specialization, and decision intelligence to understand complex technical data. It provides transparent reasoning, starting in read-only mode for safety, and allows for human-in-the-loop approvals for any actions, ensuring trust and control.

Available on: macOS, Linux

Pros & Cons

Pros

  • Significantly reduces Mean Time To Resolution (MTTR)
  • Decreases alert fatigue and burnout for on-call teams
  • Prevents repeat incidents by identifying risks and learning from past events
  • Rapid deployment and day-one value with existing tool integrations
  • Provides transparent, evidence-backed reasoning, not a black box

Cons

  • Requires integration with existing observability and operational tools
  • Initial trust-building phase may be needed for read-write actions
  • Reaches full effectiveness only over time, as it learns from investigations and feedback

Key Features

  • Correlates alerts, logs, metrics, traces, and tickets
  • Guides through RCA, mitigation, and postmortems
  • Integrates with observability tools, code, ticketing, docs, and chat
  • Provides evidence-backed investigations and prioritized next steps
  • Continuously learns from investigations and user corrections
  • Analyzes incidents, logs, and customer tickets to surface risks
  • Starts in read-only mode with OAuth-based scoped permissions
  • Slack-first delivery with a full UI

Pricing

Paid

Aqueduct offers paid plans. Visit their website for current pricing details.

Aqueduct FAQ

How does RunLLM ensure the safety and trustworthiness of its automated actions, especially when starting in read-only mode?

RunLLM prioritizes safety by starting in a read-only mode, investigating incidents without making any changes. It uses OAuth-based access with scoped permissions that your existing tools already support. For any actions that could modify your system, such as opening PRs, RunLLM requires explicit human-in-the-loop approval. Every agent is isolated, no data is shared between agents, and all steps are subject to audit logging and policy enforcement.

What specific types of technical data does RunLLM ingest and how does it process this information to provide causal context?

RunLLM ingests massive volumes of technical data including logs, traces, metrics, tickets, and documentation. It processes this information through custom pipelines, structures it into a knowledge graph, and uses GraphRAG to map dependencies and historical events. This allows it to identify causal context, not just correlations, by understanding service relationships and historical incident patterns.

How does RunLLM's continuous learning mechanism improve its performance and adapt to a specific organization's incident patterns?

RunLLM continuously learns from every investigation and user-provided correction. It identifies which checks and queries are most effective for specific alert patterns and reuses proven investigation steps from similar past incidents. This process captures tribal knowledge, automatically updates runbooks, and refines its models, leading to a reduction in MTTR and more accurate, organization-specific incident responses over time.

Beyond incident resolution, how does RunLLM contribute to preventing future incidents and improving system reliability proactively?

RunLLM proactively prevents future incidents by continuously analyzing past incidents, logs, and customer tickets to surface risks early, before they impact customers. It also uses human feedback to refine future responses, clusters recurring issues to highlight documentation gaps, and ensures runbooks and knowledge bases evolve automatically, thereby reducing system drift and improving overall reliability.

Can RunLLM integrate with both proprietary and open-source observability and ticketing systems, and what is the typical setup time?

Yes, RunLLM is designed for universal integration, offering connectors for popular tools like Datadog, Grafana, PagerDuty, Jira, and Zendesk, as well as open APIs for custom and homegrown systems. The platform aims for rapid deployment, allowing teams to connect their tools and see results quickly, often getting live in days rather than weeks, without requiring installation on your infrastructure.
