Secure External Access to Unity Catalog Assets via Open APIs

We’re excited to announce the Public Preview of credential vending for Unity Catalog’s open APIs, allowing external clients to securely access Unity Catalog external and managed tables via open source Unity REST APIs, and UniForm-enabled tables through the Iceberg REST catalog APIs. This feature facilitates seamless interoperability across a wide range of engines and tools such as Apache Spark™, DuckDB, Daft, PuppyGraph, StarRocks, Spice AI, Microsoft Fabric, Salesforce Data Cloud, and Iceberg REST catalog engines like Trino and Dremio.

As the industry’s only unified and open governance solution for data and AI assets, Unity Catalog continues to evolve with a focus on interoperability across the modern data and AI stack. This open approach empowers organizations to adopt best-in-class solutions for their data and AI use cases while avoiding vendor lock-in. Credential vending for open APIs is a key part of our comprehensive open source roadmap, following the announcement at the 2024 Data and AI Summit that Unity Catalog was being open-sourced. Credential vending is also available in the open source Unity Catalog 0.2 release.

Unified governance across any engine with credential vending

Governance challenges without credential vending

Without credential vending, query execution in cloud environments depends on static, broad access policies for both metadata and data retrieval, which makes it difficult to scale. Query engines, like Apache Spark™, are given broad access to the metadata catalog and rely on cloud storage access policies to fetch data. For example, when a user runs a query, the engine needs metadata from the catalog and the actual data from cloud storage such as AWS S3, Azure ADLS, or GCS. Administrators typically grant the engine full access to the metadata catalog (such as Hive metastore) and create Instance Profiles/Managed Service Identities to define which cloud storage locations the engine can access based on the user’s permissions. These instance profiles map user-level access to specific data storage policies.

Query execution without credential vending in a Lakehouse

While this model works for small environments with few users and datasets, it breaks down when scaling to large organizations with thousands of users, different tools/compute engines, and hundreds of thousands of data objects. Administrators need to ensure that catalog and storage permissions are in sync, which can be challenging as the number of users and data assets grows. This static approach becomes increasingly complex, error-prone, and difficult to sustain, leading to inefficiencies, security risks, and governance challenges at scale.

Scalable governance with credential vending

Credential vending allows a catalog to grant temporary access to storage for an engine performing data processing. This is done through time-limited, downscoped storage credentials generated on demand. These credentials are restricted to the specific storage needed for a higher-level object, like a table. The catalog manages both metadata and governance, meaning it has permanent access to all data, while the engine only gets just-in-time access. For example, if an engine needs to access a specific table stored at a path on AWS S3, the catalog generates a credential limited to that path and provides it to the engine, allowing access. Credential vending leverages the downscoping mechanisms offered by cloud providers, such as AWS session tokens or Azure user delegation SAS credentials.
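To make the idea concrete, here is a minimal toy model of credential vending in plain Python. It is only a sketch: `ToyCatalog`, `TempCredential`, the 15-minute lifetime, and the example bucket path are all hypothetical names chosen for illustration, and a real catalog would delegate downscoping to the cloud provider rather than check path prefixes itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import secrets


@dataclass(frozen=True)
class TempCredential:
    """A short-lived credential scoped to one storage prefix (toy model)."""
    token: str
    path_prefix: str
    expires_at: datetime

    def allows(self, path: str, now: datetime) -> bool:
        # Valid only before expiry and only for paths under the scoped prefix.
        return now < self.expires_at and path.startswith(self.path_prefix)


class ToyCatalog:
    """Hypothetical catalog holding table locations and read grants."""

    def __init__(self, table_paths, grants):
        self._table_paths = table_paths  # table name -> storage prefix
        self._grants = grants            # user -> set of readable tables

    def vend(self, user, table, now, ttl=timedelta(minutes=15)):
        # The catalog checks its own policy first, then issues a credential
        # limited to that table's storage path and a short lifetime.
        if table not in self._grants.get(user, set()):
            raise PermissionError(f"{user} may not read {table}")
        return TempCredential(
            token=secrets.token_hex(16),
            path_prefix=self._table_paths[table],
            expires_at=now + ttl,
        )


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
catalog = ToyCatalog(
    table_paths={"sales.orders": "s3://example-bucket/sales/orders/"},
    grants={"alice": {"sales.orders"}},
)
cred = catalog.vend("alice", "sales.orders", now)
```

In this model the engine never holds a standing grant on the bucket: the credential works for files under the table's prefix until it expires, and a user without a catalog grant gets no credential at all.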

Key benefits:

  • Centralized access control: Allows for centralized management of data access permissions through the catalog, rather than having to configure access controls separately for each underlying data source.
  • Temporary, scoped access: Provides temporary, scoped-down credentials to access data, enhancing security by limiting the lifetime and permissions of access tokens.
  • Simplified permissions management: Admins don’t need to update individual storage bucket policies or IAM roles – permissions can be managed centrally through the catalog.
  • Foundation for advanced governance features: This provides the foundational building blocks for implementing higher-level access policies. These could include basic access controls or more advanced policies like RBAC (Role-Based Access Control) or ABAC (Attribute-Based Access Control) that are dynamic in nature.

Implement policies once in Unity Catalog, and enforce them everywhere

How credential vending enables secure access for external clients

Unity Catalog provides open source REST APIs, allowing external clients to securely access objects such as tables. Admins can define access policies for these objects in Unity Catalog, with Unity Catalog retaining permanent storage access. When an external engine, like Apache Spark™, requests access to a table through the REST APIs using Unity Catalog credentials such as a personal access token (PAT) or an OAuth token, Unity Catalog issues temporary credentials and URLs to control storage access based on the user’s specific IAM roles or managed identities, enabling data retrieval and query execution. This simplifies administration, enhances interoperability across engines and tools, and lays the foundation for advanced governance features like RBAC and ABAC to scale access management.
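As a rough illustration of that request flow, the sketch below builds (but does not send) an HTTP request asking for temporary table credentials. The endpoint path and the `table_id`/`operation` body fields follow the Unity Catalog REST API as we understand it and should be checked against the API reference for your deployment; the workspace URL, token, and table ID are placeholders, and `build_temp_credentials_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_temp_credentials_request(workspace_url, token, table_id, operation="READ"):
    """Build a POST request for temporary table credentials (nothing is sent)."""
    # Assumed endpoint path; verify against the Unity Catalog API reference.
    url = workspace_url.rstrip("/") + "/api/2.1/unity-catalog/temporary-table-credentials"
    body = json.dumps({"table_id": table_id, "operation": operation}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # The caller authenticates with a PAT or OAuth token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; replace with a real workspace URL, token, and table ID.
req = build_temp_credentials_request(
    "https://example.cloud.databricks.com/", "<token>", "<table-id>"
)
# Sending it would look like: urllib.request.urlopen(req)
```

A successful response would be expected to carry short-lived cloud storage credentials along with an expiry, which the engine then uses to read the table's files directly from storage.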

Query execution with credential vending using an external compute engine

This capability also extends to Iceberg tables managed in Unity Catalog through the Iceberg REST Catalog interface, which uses the same temporary credential vending process to read Iceberg tables. By enhancing accessibility for a wide range of external engines integrated through Unity REST APIs (such as Apache Spark™, DuckDB, Daft, PuppyGraph, StarRocks, Spice AI, Microsoft Fabric, Salesforce Data Cloud, and Iceberg REST catalog engines like Trino and Dremio), organizations can leverage the tools of their choice while maintaining consistent discovery and governance experiences across platforms. We also plan to extend credential vending support to other Unity Catalog assets, including volumes (unstructured data, arbitrary files). Stay tuned!

See it in action with Apache Spark™ and Unity Catalog

Unity Catalog Open APIs allow external clients, like Apache Spark™, to interact with the catalog under unified governance. Operations such as creating, reading, and writing Delta tables are authorized through temporary vended credentials. You no longer need to verify and manage IAM permissions for your workloads and keep them in sync across different systems.

A Spark session can be set up to connect to Unity Catalog on Databricks and access tables stored in AWS S3.
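One way such a session might be configured is sketched below as a plain Python dictionary of Spark settings. Treat it as an assumption-laden starting point rather than a definitive setup: the class names come from the open source Unity Catalog Spark connector and Delta Lake as we understand them, the `/api/2.1/unity-catalog` URI suffix should be verified against the documentation, and `unity_catalog_spark_conf` is a hypothetical helper. The connector, Delta, and `hadoop-aws` packages also need to be on the Spark classpath, at versions not specified here.

```python
def unity_catalog_spark_conf(catalog_name, workspace_url, token):
    """Return Spark settings for reaching Unity Catalog (a sketch, not verified)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Delta Lake SQL extensions for reading and writing Delta tables.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        # Catalog implementation from the Unity Catalog Spark connector (assumed).
        "spark.sql.catalog.spark_catalog": "io.unitycatalog.spark.UCSingleCatalog",
        prefix: "io.unitycatalog.spark.UCSingleCatalog",
        # Unity Catalog REST endpoint and the caller's PAT or OAuth token.
        f"{prefix}.uri": workspace_url.rstrip("/") + "/api/2.1/unity-catalog",
        f"{prefix}.token": token,
        "spark.sql.defaultCatalog": catalog_name,
        # Route s3:// paths through the S3A filesystem implementation.
        "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


# Placeholder values; replace with a real catalog name, workspace URL, and token.
conf = unity_catalog_spark_conf("main", "https://example.cloud.databricks.com/", "<token>")

# With pyspark installed, the settings could be applied like this:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("uc-example")
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Note that no AWS keys appear in the configuration: the session only knows the catalog endpoint and a token, and storage credentials are obtained per table at query time.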

Read access to tables is governed by catalog, schema, and table privileges. Users require the USE CATALOG, USE SCHEMA, EXTERNAL USE SCHEMA, and SELECT privileges to read a table.

To create a table, users require CREATE EXTERNAL TABLE on the external storage location, as well as the catalog privileges USE CATALOG, USE SCHEMA, and EXTERNAL USE SCHEMA.
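The privilege requirements above can be captured as two sets, which makes it easy to see what a given user is still missing. This is only an illustrative helper: `missing_privileges` and the example grant list are hypothetical, and the check runs locally rather than querying Unity Catalog.

```python
# Privileges named in the text for external reads and table creation.
READ_PRIVILEGES = frozenset(
    {"USE CATALOG", "USE SCHEMA", "EXTERNAL USE SCHEMA", "SELECT"}
)
CREATE_PRIVILEGES = frozenset(
    {"CREATE EXTERNAL TABLE", "USE CATALOG", "USE SCHEMA", "EXTERNAL USE SCHEMA"}
)


def missing_privileges(granted, required):
    """Return the required privileges that are absent from `granted`, sorted."""
    return sorted(set(required) - set(granted))


# Hypothetical user who can query inside the platform but lacks external access.
granted = {"USE CATALOG", "USE SCHEMA", "SELECT"}
missing_for_read = missing_privileges(granted, READ_PRIVILEGES)
missing_for_create = missing_privileges(granted, CREATE_PRIVILEGES)
```

Here the user holds the usual read privileges but not EXTERNAL USE SCHEMA, so an external engine would not be vended credentials on their behalf.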

Similarly, you can query your UniForm Iceberg tables from Unity Catalog through the Iceberg REST API. This allows you to access these tables from any client that supports Iceberg REST without introducing new dependencies!
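For an Iceberg REST client, the connection typically comes down to a handful of catalog properties. The sketch below assembles them in the shape PyIceberg accepts; the `/api/2.1/unity-catalog/iceberg` endpoint suffix and the use of the Unity Catalog catalog name as the `warehouse` are assumptions to confirm in the documentation, and `iceberg_rest_properties` is a hypothetical helper.

```python
def iceberg_rest_properties(workspace_url, token, uc_catalog_name):
    """Return Iceberg REST catalog properties for a client (a sketch, not verified)."""
    return {
        "type": "rest",
        # Assumed Iceberg REST endpoint exposed by Unity Catalog.
        "uri": workspace_url.rstrip("/") + "/api/2.1/unity-catalog/iceberg",
        # PAT or OAuth token used to authenticate against Unity Catalog.
        "token": token,
        # Assumed: the Unity Catalog catalog name acts as the warehouse.
        "warehouse": uc_catalog_name,
    }


# Placeholder values; replace with a real workspace URL, token, and catalog name.
props = iceberg_rest_properties("https://example.cloud.databricks.com", "<token>", "main")

# With pyiceberg installed, the catalog could be loaded like this:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("unity", **props)
#   table = catalog.load_table("<schema>.<table>")
```

As with the Spark setup, the client is configured only with the catalog endpoint and a token, and relies on the catalog to vend storage credentials when a table is loaded.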

Next steps

This is just the start of our ongoing roadmap to deliver open access and unified governance for any data or AI asset, in any format, across any workload, and compatible with any compute engine or tool. Credential vending is a powerful building block for governance. Look out for further updates that add secure external access to volumes (unstructured data, arbitrary files).

  • To learn more about credential vending in Unity Catalog and its requirements, refer to the documentation for AWS, Azure, and GCP.
  • To get started with Unity Catalog, explore the setup guides available for AWS, Azure, and GCP.
  • You can also read about the open source 0.2 release of Unity Catalog for more details.
