
Nessie
Transactional catalog for data lakes with Git-like semantics for consistent data views.
Tracked since 2026
0 reviews tracked
The Bottom Line
Entry price
Free, no paid tier
Biggest pro
Provides strong data consistency and isolation
Biggest con
Requires understanding of Git concepts for full utilization
TL;DR - Nessie
- Applies Git-like version control to data lakes.
- Ensures always-consistent, isolated, and atomic data changes.
- Manages data files and metadata without copying actual data.
Pricing: Free forever
Best for: Individuals & startups
What is Nessie?
Nessie is a transactional catalog for data lakes that brings Git-like version control semantics to data. It allows users to manage data in data lakes with concepts like commits, branches, and tags, similar to how source code is managed. This enables an always-consistent view of data across all involved datasets and tables, ensuring that changes from batch jobs or experiments are isolated and applied atomically. Nessie eliminates the need for manual tracking of individual data files by referencing existing immutable data files and automatically managing their lifecycle, including garbage collection.
This tool is designed for data engineers and organizations working with large data lakes, especially those using tools like Apache Hive or Apache Spark. It simplifies data management by providing a robust versioning system that prevents incomplete changes from being visible to users and allows for safe experimentation and development without impacting production data. Nessie does not copy data but rather tracks metadata, making it efficient for managing vast amounts of data and numerous tables.
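The branch-and-commit idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Nessie's implementation or API: `Commit`, `ToyCatalog`, and the `s3://lake/...` paths are hypothetical names chosen for the example. It shows that creating a branch only creates a pointer to a snapshot of file references, so no data file is copied.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: table name -> set of data file paths."""
    tables: Dict[str, FrozenSet[str]]
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Hypothetical in-memory catalog; refs are just named pointers to commits."""

    def __init__(self):
        self.refs = {"main": Commit(tables={})}

    def create_branch(self, name, source="main"):
        # Only a pointer is created; the data files themselves are untouched.
        self.refs[name] = self.refs[source]

    def commit(self, branch, table, files):
        head = self.refs[branch]
        tables = dict(head.tables)          # copy the metadata, not the data
        tables[table] = frozenset(files)
        self.refs[branch] = Commit(tables=tables, parent=head)

    def files(self, ref, table):
        return self.refs[ref].tables.get(table, frozenset())


catalog = ToyCatalog()
catalog.commit("main", "sales", ["s3://lake/sales/f1.parquet"])

# An experiment adds a file on its own branch; "main" is unaffected.
catalog.create_branch("experiment")
catalog.commit(
    "experiment",
    "sales",
    ["s3://lake/sales/f1.parquet", "s3://lake/sales/f2.parquet"],
)

print(sorted(catalog.files("main", "sales")))
print(sorted(catalog.files("experiment", "sales")))
```

Both branches reference the same `f1.parquet` file, which is why this style of versioning stays cheap even for very large tables.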
Available on: Web
Pros & Cons
Pros
- Provides strong data consistency and isolation
- Simplifies data management in large data lakes
- Enables safe experimentation and development environments
- Eliminates manual tracking of data files
- Efficient as it does not copy actual data
Cons
- Requires understanding of Git concepts for full utilization
- Primarily focused on data lake management, not a general-purpose database
Key Features
- Git-like semantics for data (commits, branches, tags)
- Always-consistent view of data across all datasets
- Isolated changes for batch jobs and experiments
- Atomic application of data changes
- Automatic garbage collection of unused data files
- Support for multi-table transactions
- References existing immutable data files
- Integration with Apache Hive and Apache Spark
Pricing Plans
Open Source
Free
- Transactional Catalog for Data Lakes
- Git-inspired data version control
- Cross-table transactions and visibility
- Open data lake approach, supporting Hive, Spark, Dremio, Trino, etc.
- Works with Apache Iceberg tables
- Run as a Docker image or on Kubernetes
Nessie FAQ
How does Nessie ensure data consistency across multiple tables during a commit?
Nessie treats a commit as a multi-table transaction, meaning that a single commit can group data file changes from various tables. This ensures that all changes within that commit are applied atomically, providing an always-consistent view of the data across all affected tables simultaneously.
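A toy sketch can make the atomicity claim concrete. It assumes a branch head is a single pointer to an immutable snapshot mapping table names to data files; `commit_tables` and the file names are hypothetical and do not reflect Nessie's actual API. Changes to several tables are staged together and become visible through one pointer swap, so a failed commit leaves the old snapshot fully intact.

```python
from types import MappingProxyType


def commit_tables(heads, branch, changes):
    """Apply changes to several tables as one new snapshot of `branch`."""
    staged = dict(heads[branch])            # start from the current snapshot
    for table, files in changes.items():
        if not files:
            # Any invalid change aborts before the branch head is touched.
            raise ValueError(f"no data files given for {table}")
        staged[table] = tuple(files)
    # Single pointer swap: readers see either the old or the new snapshot.
    heads[branch] = MappingProxyType(staged)


heads = {
    "main": MappingProxyType(
        {"orders": ("o1.parquet",), "customers": ("c1.parquet",)}
    )
}

# Both tables change together in one commit.
commit_tables(heads, "main", {"orders": ("o2.parquet",), "customers": ("c2.parquet",)})

# A commit that fails halfway leaves no partial change behind.
try:
    commit_tables(heads, "main", {"orders": ("o3.parquet",), "customers": ()})
except ValueError as err:
    print("commit rejected:", err)

print(dict(heads["main"]))
```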
Can Nessie be used to manage schema changes in a data lake, such as adding or renaming columns?
Yes. Nessie handles schema changes in tables and views, including adding or removing columns, changing column types, and renaming columns. Because it tracks the metadata associated with data files, it can manage these structural changes while maintaining data consistency.
What happens to data files that are no longer referenced by any commit or branch in Nessie?
Nessie includes an automatic garbage collection mechanism. Because it tracks which data files each commit references, it knows which files are still in use and which are no longer referenced by any commit or branch. Those unreferenced files can then be removed safely and automatically, reclaiming storage.
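The idea of finding unreferenced files can be sketched as a simple mark-and-sweep pass. This is an illustration only, assuming each branch exposes its commits as snapshots mapping table names to file sets; `unreferenced_files` and the sample file names are hypothetical, and the real garbage collector may use additional rules not modeled here.

```python
def unreferenced_files(all_files, commits_by_ref):
    """Return files in storage that no commit on any ref still references."""
    live = set()
    for commits in commits_by_ref.values():     # every branch or tag
        for snapshot in commits:                # every commit reachable from it
            for files in snapshot.values():     # every table in that commit
                live.update(files)
    return set(all_files) - live


storage = {"a.parquet", "b.parquet", "c.parquet"}
commits_by_ref = {
    "main": [{"events": {"a.parquet"}}, {"events": {"b.parquet"}}],
    "experiment": [{"events": {"a.parquet"}}],
}

print(unreferenced_files(storage, commits_by_ref))
```

Here `c.parquet` is the only candidate for removal, because `a.parquet` and `b.parquet` are each still referenced by at least one commit.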
How does Nessie integrate with existing data processing frameworks like Apache Spark or Apache Hive?
Nessie is designed for easy integration with existing data processing tools. For frameworks like Apache Spark or Apache Hive, integrating Nessie typically involves a simple configuration change rather than requiring modifications to existing production code. This allows current jobs to leverage Nessie's versioning capabilities without significant refactoring.
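As a rough illustration of what such a configuration change might look like for Spark with Apache Iceberg tables, the sketch below builds a set of catalog properties. The property names follow the Iceberg Nessie catalog integration as commonly documented, but they should be verified against the versions in use. The helper `nessie_spark_conf`, the catalog name, the server URI, and the warehouse path are all assumptions made for this example.

```python
def nessie_spark_conf(
    catalog="nessie",
    uri="http://localhost:19120/api/v1",        # assumed Nessie server address
    ref="main",                                 # branch or tag to read and write
    warehouse="s3a://example-bucket/warehouse", # hypothetical storage location
):
    """Build Spark properties for an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
    }


conf = nessie_spark_conf(ref="experiment")

# Print the properties in the form accepted by spark-submit.
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

With properties like these, an existing job would point at a different branch by changing only the `ref` value, leaving the job's own code as it is.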
Since Nessie doesn't copy data, how does it handle updates or deletions of existing data within a data file?
Data files in a data lake are immutable, so an update or deletion never modifies a file in place. Instead, the original data file is read, the changes are applied, and a new data file containing the result is written. Nessie then references the new file, and the original file becomes eligible for garbage collection once no commit or branch references it.
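The read, change, and rewrite sequence described above is often called copy-on-write. The sketch below illustrates it with an in-memory stand-in for storage. `copy_on_write_delete`, the `.v2` naming scheme, and the sample rows are hypothetical, and real engines write Parquet or similar files, not Python tuples.

```python
def copy_on_write_delete(storage, table_files, target, should_delete):
    """Drop matching rows by writing a new file; the original is never modified."""
    original_rows = storage[target]                         # read the immutable file
    kept = tuple(r for r in original_rows if not should_delete(r))
    new_name = f"{target}.v2"                               # hypothetical naming scheme
    storage[new_name] = kept                                # write a brand-new file
    # The table's file list now points at the new file in place of the old one.
    return tuple(new_name if f == target else f for f in table_files)


storage = {"part-0.parquet": ({"id": 1}, {"id": 2}, {"id": 3})}
files_before = ("part-0.parquet",)

files_after = copy_on_write_delete(
    storage, files_before, "part-0.parquet", lambda row: row["id"] == 2
)

print(files_after)
print(storage["part-0.parquet.v2"])
print(len(storage["part-0.parquet"]))   # the original file still holds all its rows
```

Because the original file is left untouched, older commits that reference it still read exactly the data they saw before.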
Source: projectnessie.org