
Transactional catalog for data lakes with Git-like semantics for consistent data views.
Free
Nessie treats a commit as a multi-table transaction, meaning that a single commit can group data file changes from various tables. This ensures that all changes within that commit are applied atomically, providing an always-consistent view of the data across all affected tables simultaneously.
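The idea of a commit as a multi-table transaction can be illustrated with a small in-memory sketch. This is a toy model, not the Nessie API: the `Commit` and `Branch` classes and the file names are hypothetical, chosen only to show how a single pointer swap can publish changes to several tables at once.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Toy model (NOT the Nessie API): a commit is a snapshot that maps
# each table name to the data files that make up that table.
TableState = Dict[str, Tuple[str, ...]]


@dataclass(frozen=True)
class Commit:
    tables: TableState
    parent: Optional["Commit"] = None


class Branch:
    """Hypothetical branch: a movable pointer to the latest commit."""

    def __init__(self) -> None:
        self.head = Commit(tables={})

    def commit(self, changes: TableState) -> Commit:
        """Apply changes to several tables as one new commit."""
        # Build the complete new snapshot first; readers still see the old head.
        new_tables = dict(self.head.tables)
        new_tables.update(changes)
        new_commit = Commit(tables=new_tables, parent=self.head)
        # One pointer swap publishes every table change together.
        self.head = new_commit
        return new_commit


branch = Branch()
before = branch.head
branch.commit({
    "orders": ("orders-0001.parquet",),
    "customers": ("customers-0001.parquet",),
})
```

A reader holding `before` sees neither table, while a reader at `branch.head` sees both; in this sketch there is no state in which only one of the two tables has changed.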
Nessie handles schema changes in tables and views, including adding or removing columns, changing column types, and renaming columns. Because it tracks the metadata associated with data files, it can manage these structural changes while keeping the data consistent.
Nessie includes an automatic garbage collection mechanism. It knows which data files are actively being used and which are no longer referenced by any commit or branch. These unreferenced data files can then be safely and automatically removed, optimizing storage and preventing data clutter.
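The reference-tracking idea behind garbage collection can be sketched as a set difference. This is a simplified illustration under stated assumptions, not Nessie's actual implementation: the commit IDs, file names, and the `unreferenced_files` helper are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy data: each commit lists the data files it references,
# and each branch is a chain of commit IDs (newest first).
commit_files: Dict[str, Set[str]] = {
    "c1": {"a.parquet", "b.parquet"},
    "c2": {"b.parquet", "c.parquet"},
}
branches: Dict[str, List[str]] = {"main": ["c2", "c1"]}

# Everything physically present in storage; d.parquet is referenced by no commit.
all_files: Set[str] = {"a.parquet", "b.parquet", "c.parquet", "d.parquet"}


def unreferenced_files(
    all_files: Set[str],
    commit_files: Dict[str, Set[str]],
    branches: Dict[str, List[str]],
) -> Set[str]:
    """Return files that no commit on any branch references."""
    live: Set[str] = set()
    for chain in branches.values():
        for commit_id in chain:
            live |= commit_files[commit_id]
    return all_files - live


garbage = unreferenced_files(all_files, commit_files, branches)
```

Here `garbage` contains only `d.parquet`, the one file that is safe to delete. A real collector would also need retention rules, which this sketch leaves out.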
Nessie is designed for easy integration with existing data processing tools. For frameworks like Apache Spark or Apache Hive, integrating Nessie typically involves a simple configuration change rather than requiring modifications to existing production code. This allows current jobs to leverage Nessie's versioning capabilities without significant refactoring.
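As a rough illustration of what such a configuration change might look like for Spark, the sketch below builds a set of catalog properties. It assumes an Apache Iceberg table format and uses property names as commonly documented for the Iceberg Nessie catalog; the helper `nessie_spark_conf`, the server URI, the branch name, and the warehouse path are placeholders, and the exact keys should be checked against the versions in use.

```python
from typing import Dict


def nessie_spark_conf(
    uri: str, ref: str, warehouse: str, catalog: str = "nessie"
) -> Dict[str, str]:
    """Build Spark properties for a Nessie-backed catalog (hypothetical helper)."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Assumption: Iceberg's Spark catalog with the Nessie catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,              # Nessie server endpoint (placeholder)
        f"{prefix}.ref": ref,              # branch or tag to read from and write to
        f"{prefix}.warehouse": warehouse,  # where table data files are stored
    }


conf = nessie_spark_conf(
    uri="http://localhost:19120/api/v1",
    ref="main",
    warehouse="/tmp/warehouse",
)

# Print the properties in the form accepted by spark-submit.
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

The job code itself stays unchanged in this sketch; only the properties passed at submission time differ, which is the point the paragraph above makes.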
Data files in a data lake are immutable. When data within an existing file must be updated or deleted, the original file is read, the changes are applied, and a *new* data file containing the result is written. Nessie then references the new file, while the original file is no longer needed and eventually becomes subject to garbage collection.
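The read, modify, and write-new-file cycle can be sketched with ordinary files. This is a minimal illustration, not Nessie code: JSON stands in for a real columnar format, and the `copy_on_write` helper, the file names, and the `table_files` dictionary (a stand-in for catalog metadata) are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, object]


def copy_on_write(
    src: Path, dst: Path, transform: Callable[[List[Row]], List[Row]]
) -> None:
    """Read src, apply transform, and write the result to a NEW file dst."""
    rows = json.loads(src.read_text())
    dst.write_text(json.dumps(transform(rows)))  # src itself is never modified


with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "part-0001.json"
    new_file = Path(tmp) / "part-0002.json"
    old_file.write_text(json.dumps([{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]))

    # Toy stand-in for catalog metadata: which file the table currently uses.
    table_files = {"orders": old_file.name}

    # "Delete" the row with id 2 by rewriting the file without it.
    copy_on_write(old_file, new_file, lambda rows: [r for r in rows if r["id"] != 2])
    table_files["orders"] = new_file.name  # the table now points at the new file

    original_rows = json.loads(old_file.read_text())  # still holds both rows
    current_rows = json.loads(new_file.read_text())   # holds only the row with id 1
```

After the rewrite, the old file still exists with its original contents but nothing points to it, which is what makes it a candidate for garbage collection.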
Source: projectnessie.org