Open-source data catalog for discovering and trusting data.
Automates metadata collection and facilitates context sharing.
Improves productivity for data professionals by breaking data silos.
Pricing: Free forever
Best for: Individuals & startups
Pros & Cons
Pros
Enhances data discoverability and trust
Increases productivity for data professionals
Reduces data silos and improves collaboration
Provides automated and curated metadata
Open source under the Apache License 2.0
Cons
Requires self-hosting and setup
Integration effort may be needed for specific data sources
Community support may be the primary resource
Key Features
PageRank-inspired search algorithm for data discovery
Automated metadata collection (descriptions, statistics, usage)
Curated metadata editing (table/column descriptions)
Data preview functionality (if permitted)
Link ETL jobs and code to data assets
Visibility into co-worker data usage and bookmarks
Display of common queries and dashboards built on tables
Integration with existing data infrastructure
Amundsen is an open-source data catalog and metadata engine designed to help organizations discover, understand, and trust their data. It serves as a central hub for data assets, allowing data professionals to quickly find relevant data for their analysis, models, and pipelines.
The platform is built to benefit Data Analysts, Data Scientists, Data Engineers, and Software Engineers. It helps analysts and data scientists be more productive by breaking down data silos, providing immediate context, and showing how others are using the data. For engineers, it reduces interruptions by automatically sharing context, ensures the use of correct data in pipelines, and speeds up debugging by centralizing all table-related information.
Amundsen achieves this by offering a PageRank-inspired search algorithm for data discovery, automated and curated metadata (descriptions, usage statistics, last updated times, data previews), and features for sharing context among co-workers. Users can update table and column descriptions, see frequently used or bookmarked data by peers, and view common queries or dashboards built on specific tables.
How does Amundsen prioritize search results for data discovery?
Amundsen uses a PageRank-inspired search algorithm. It ranks results by how well a table or dashboard's name, description, and tags match the search, and by how often that table or dashboard is queried or viewed.
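The idea of combining text relevance with usage activity can be illustrated with a minimal sketch. This is not Amundsen's implementation. The `Asset` class, the field weights, and the logarithmic usage boost are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical catalog entry (a table or dashboard)."""
    name: str
    description: str = ""
    tags: list[str] = field(default_factory=list)
    usage_count: int = 0  # assumed: number of recorded queries or views


def score(asset: Asset, query: str) -> float:
    """Text match on name, description, and tags, boosted by usage."""
    q = query.lower()
    text = 0.0
    if q in asset.name.lower():
        text += 3.0  # illustrative weight, not taken from Amundsen
    if q in asset.description.lower():
        text += 1.0
    if any(q in tag.lower() for tag in asset.tags):
        text += 2.0
    if text == 0.0:
        return 0.0  # no textual match: usage alone does not surface an asset
    # log1p dampens the boost so heavily used assets do not dominate entirely
    return text * (1.0 + math.log1p(asset.usage_count))


def search(assets: list[Asset], query: str) -> list[str]:
    """Return names of matching assets, highest score first."""
    hits = [(score(a, query), a) for a in assets]
    hits = [(s, a) for s, a in hits if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [a.name for _, a in hits]


catalog = [
    Asset("orders_raw", "raw orders", [], usage_count=2),
    Asset("orders", "cleaned orders", ["sales"], usage_count=500),
    Asset("users", "user table", [], usage_count=1000),
]
results = search(catalog, "orders")  # ["orders", "orders_raw"]
```

In this sketch both order tables match the text equally, so the frequently used one ranks first, while the heavily used `users` table is excluded because it does not match the query.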
What types of metadata does Amundsen automatically collect and display?
Amundsen automatically collects and displays metadata including descriptions of tables and columns, information on frequent users, the last update time for tables, statistics, and a data preview if permissions allow. It also links to the ETL job and code that generated the data for easier debugging.
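As a rough sketch of how the metadata listed above might be grouped per table, consider the record below. The `TableMetadata` class, its field names, and the `render` helper are hypothetical and do not reflect Amundsen's actual data model. The sketch only illustrates that the preview is shown when permissions allow.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class TableMetadata:
    """Hypothetical per-table record; field names are assumptions."""
    name: str
    description: str = ""
    column_descriptions: dict[str, str] = field(default_factory=dict)
    frequent_users: list[str] = field(default_factory=list)
    last_updated: Optional[datetime] = None
    statistics: dict[str, float] = field(default_factory=dict)
    etl_job: Optional[str] = None      # link to the job that produced the table
    source_code: Optional[str] = None  # link to the code behind that job
    preview_rows: list[dict] = field(default_factory=list)


def render(meta: TableMetadata, can_preview: bool) -> dict:
    """Return the fields to display, omitting the preview if not permitted."""
    view = asdict(meta)
    if not can_preview:
        view.pop("preview_rows")
    return view


orders = TableMetadata(
    name="orders",
    description="One row per customer order",
    frequent_users=["alice", "bob"],
    preview_rows=[{"order_id": 1}],
)
restricted_view = render(orders, can_preview=False)
full_view = render(orders, can_preview=True)
```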
How does Amundsen facilitate collaboration and knowledge sharing among data professionals?
Amundsen allows users to update tables and columns with descriptions, reducing ambiguity about data usage. It also enables users to see what data their colleagues frequently use, own, or have bookmarked, and to view common queries for a table through associated dashboards.
What are the deployment options for setting up Amundsen?
Amundsen is designed for quick setup. It can be deployed with Docker, on EC2, or on Kubernetes.
What is the licensing model for Amundsen?
Amundsen is an open-source project released under the Apache License, Version 2.0, which permits broad use and modification.