Unlock the Power of Active Metadata with a Data Catalog
Learn how a Data Catalog can enable and activate your metadata platform.
A good metadata platform is incomplete without a strong Data Catalog capability for searching data assets and performing data discovery.
To put it in perspective, a Data Catalog is a reflection of your enterprise IT landscape and the main entry point for enterprise data discovery. With that as background, let us deep-dive into how the Data Catalog can be a catalyst for activating the metadata platform.
In the previous article, I described the enablers of the Active Metadata Platform. For better understanding and continuity, I suggest reading that article first; it is linked in the references below.
To put things in context, metadata management is a platform, whereas the Data Catalog is one of its many capabilities, and perhaps the most significant one. The diagram of the Active Metadata Platform below, with the highlighted layer, signifies this.
“The biggest challenge to succeeding with data mesh is creating an internal data marketplace with data producers and data consumers,” says Sandra Cannon, the first chief data officer at the University of Rochester in Rochester, N.Y.
An ‘internal data marketplace’ [1] is a common platform where data providers share information with trust, and data consumers consume it. This is illustrated by the data access layer in the reference architecture above. The goal here is to understand how the Data Catalog can be a catalyst for activating metadata, how it addresses the key challenges of building internal data marketplaces, and how it helps data mesh implementations succeed.
Data Catalog and its Function in the Enterprise Data Landscape
A Data Catalog is a searchable view of a metadata model. A metadata model is a structured and organized inventory of the enterprise landscape that stores metadata - data about data. These models are defined in the metadata management tool for a specific enterprise domain (for example, the software, application, platform, infrastructure, or data domain). Note that the metadata model does not store any data values, only the metadata. This avoids data risk exposure while giving end users visibility into the metadata.
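To make this concrete, here is a minimal sketch of what one entry in a metadata model might look like. The structure and field names are illustrative assumptions, not the schema of any specific metadata management tool; the point is that only metadata is stored, never the data values themselves.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataAsset:
    """One entry in a metadata model: data about data, never data values."""
    name: str            # e.g. "customer_orders"
    domain: str          # enterprise domain, e.g. "data", "application"
    asset_type: str      # e.g. "table", "report", "source system"
    entities: list = field(default_factory=list)        # data entities/attributes
    glossary_terms: list = field(default_factory=list)  # business glossary links
    tags: list = field(default_factory=list)            # social metadata

# The model records that 'customer_orders' has an 'order_total' attribute,
# but never stores the order totals themselves -- avoiding data risk exposure.
orders = MetadataAsset(
    name="customer_orders",
    domain="sales",
    asset_type="table",
    entities=["order_id", "customer_id", "order_total"],
    glossary_terms=["Order", "Customer"],
    tags=["gold", "finance-approved"],
)
```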
Implementing a Data Catalog search on top of a robust metadata platform ensures that end users get a centralized view of data assets, so they can understand and analyze data from the right, trusted source. This defines and drives value in making important business decisions.
Challenges with existing Data Platforms and Data Architectures
Although the data point from the IDC Blog [2] is old, it still describes a common scenario: many Data Analysts and Data Scientists spend their valuable time searching for and preparing datasets instead of uncovering the datasets' hidden potential. The following are the main reasons contributing to this problem:
Centralized data teams
Over the years, organizations have built data platforms with first-generation EDW + BI architecture patterns to solve business needs. As time evolved and the need to manage higher volumes of data grew, newer implementations and use cases, like a Data Lake with the medallion architecture [3] for advanced analytics, became common.
The fundamental issue is not the technology or these architecture patterns, but the misguided assumption that the teams managing these platforms are solely responsible for managing the data. Organizations using the first-generation patterns still face such challenges, which become a blocker for data maturity.
This is not a sustainable model: it shifts the ownership of data to a central team, taking control of the data away from the data stakeholders and domain experts. The centralized team is then forced to make uninformed decisions about the data.
Dependency on IT for data discovery
With a centralized team managing enterprise data within the EDW or Data Lake, there are dependencies on various IT teams to understand the data assets and the capabilities those assets support.
Such dependencies delay data discovery, leaving data stakeholders to make decisions without full visibility into, or control of, the data.
This model also lacks traceability of data assets to their sources: lineage information is either unavailable or lost over time. This leads to trust issues with the data, which in turn lead to data risks.
In the same way, any metadata and business glossary information that exists in the source systems is lost, or left stale, in the EDW or data lake.
Key Considerations to overcome these challenges
The minimalist approach is to consider the following must-haves to set a foundation for a Data Catalog implementation:
Domain-Driven Design
Efficient Search for Data Discovery
Before we explore these points, let me set some background. The first-generation EDW + BI pattern and the Data Lake and Lakehouse implementations are here to stay; they solve the technology challenges of accessing and managing data. As discussed, these architecture patterns are not the main blockers for Data Catalog implementation. The blocker is ownership: data domains should be owned by the team that understands, knows, and can effectively manage them. It is important to push such responsibilities to the team with data fluency and domain expertise - a team that can decide how to share the data, along with the knowledge, processes, and capabilities around it, across the enterprise so that consumers get the full benefit. A great way of doing this is by taking a Domain-Driven Design approach.
The book “Domain-Driven Design” (DDD) by Eric Evans [4] describes this as a software design approach, focusing on making the software model match the domain according to domain experts. The book “The Enterprise Data Catalog” by Ole Olesen-Bagneux [5] relates the Domain-Driven Design concept to defining a data domain; its design patterns and examples are well described, and it is recommended as further reading for Data Catalog implementation. In this section, I will only lightly explore the key aspects of Domain-Driven Design, to emphasize why this approach overcomes the challenges of the centrally driven data-team setup. To be honest, if I went deeper into explaining DDD, I would just be repeating what the authors have already described well.
A Domain: In a traditional context, a domain is linked to an enterprise technology application. Here, however, a domain is a defined group of teams, structured to share domain knowledge, understand common operating processes and capabilities, and communicate with the same data semantics - in short, data-fluent domain experts. Occasionally, the terms domain and Line of Business (LOB) are used interchangeably.
Domain-Driven Design Approach:
Here are the high-level points for the DDD construct, in the context of this article.
The approach of DDD starts with defining domain(s) in the enterprise landscape. The selection of domain(s) for analysis and design can be an iterative process. This process can grow organically as new domains are discovered or identified.
The definition of the domain should include a main entry point, or root node, of the domain - for example, the Data Catalog name.
Then follows the identification of processes (what the domain does) and capabilities (how it is done) within the boundary of the domain (shared domain knowledge, goals, and operations).
The hierarchical representation of the domain design can extend into other well-defined adjacent domains. The design can also go deeper in scope as additional domain knowledge, goals, and operations are considered.
The next step is to identify the data sources that support the domain. There can be:
Generic data sources - like technology components: Databases, Servers, Cloud Data storage, Network File Systems, etc.
Specific data sources - a specific instance or location under a generic data source, like a DB instance under a database server, or an object or file location under cloud or network storage.
The final step is to link the domain to its metadata assets. A metadata asset, a.k.a. data asset, is a definition of the asset - for example, its nature: application, software, source system, or reporting system. It also includes the metadata model (data entities and attributes), business glossaries (domain-specific terms), taxonomies, ontologies, global search terms or thesauri, and social metadata like tags and ratings. My previous article, linked in the references, goes into detail on metadata assets.
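As a rough sketch of these steps, the structure below models a domain with its root node, processes, capabilities, and generic/specific data sources, linked to metadata assets. All names and fields here are hypothetical, chosen only to illustrate the hierarchy; a real implementation would live in the metadata management tool.

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    generic: str   # technology component, e.g. "PostgreSQL server"
    specific: str  # instance or location, e.g. "orders-db.prod:5432/sales"

@dataclass
class Domain:
    root_node: str                                    # main entry point, e.g. the catalog name
    processes: list = field(default_factory=list)     # what the domain does
    capabilities: list = field(default_factory=list)  # how it is done
    data_sources: list = field(default_factory=list)  # generic + specific sources
    assets: list = field(default_factory=list)        # linked metadata assets
    subdomains: list = field(default_factory=list)    # adjacent or nested domains

sales = Domain(
    root_node="Sales Catalog",
    processes=["order fulfilment", "invoicing"],
    capabilities=["daily order reporting"],
    data_sources=[DataSource("PostgreSQL server", "orders-db.prod:5432/sales")],
    assets=["customer_orders"],             # links into the metadata model
    subdomains=[Domain(root_node="Returns")],
)
```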
Designing the domain along these lines ensures that the right group of domain experts owns the responsibility for sharing domain knowledge with the broader teams. It also enables effective collaboration and data discovery, and helps with the successful adoption of the Data Catalog.
Efficient Search for Data Discovery
With a strong domain design that brings all the necessary moving pieces to build a robust Data Catalog, the natural next step is to apply that to data discovery.
Data discovery starts with knowing the data assets that exist in the organization. The ability to search for data assets, based on the domain design, helps find any data assets efficiently from the Data Catalog search UI. Such domain-driven design makes searching consistent for any user and easy to adopt.
To make the search more efficient, data assets can be filtered using the business glossaries and asset types defined in the metadata model. Data Catalog search, implemented correctly, can be a significant capability for finding data and driving adoption of the metadata platform. Note that the search finds data assets, not the exact data values. However, once there is a way to find data assets and know their lineage, it becomes easier to inspect the data values they hold in the appropriate ‘systems of record’. Hence, data discovery is significant for identifying the right data assets.
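As an illustration, such filtering could look like the sketch below: a plain in-memory filter over catalog entries by text, glossary term, and asset type. A real Data Catalog would delegate this to its search backend; the dictionary structure here is an assumption made for clarity.

```python
def search_assets(catalog, text=None, glossary_term=None, asset_type=None):
    """Return catalog entries matching a text query and optional filters.

    `catalog` is assumed to be a list of dicts with 'name', 'asset_type',
    and 'glossary_terms' keys -- a stand-in for a real metadata store.
    """
    results = []
    for asset in catalog:
        if text and text.lower() not in asset["name"].lower():
            continue  # text must appear in the asset name
        if glossary_term and glossary_term not in asset["glossary_terms"]:
            continue  # filter by business glossary term
        if asset_type and asset["asset_type"] != asset_type:
            continue  # filter by asset type from the metadata model
        results.append(asset)
    return results

catalog = [
    {"name": "customer_orders", "asset_type": "table", "glossary_terms": ["Order"]},
    {"name": "orders_dashboard", "asset_type": "report", "glossary_terms": ["Order"]},
]
# Find only tables related to the 'Order' business term:
print(search_assets(catalog, text="order", glossary_term="Order", asset_type="table"))
```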
This benefits key stakeholders such as Data Analysts and Data Scientists, who get more efficient data discovery instead of working in silos with only the data assets they already know. Similarly, the Data Governance team benefits by classifying data assets, critical data elements, and business terms as sensitive or confidential, and managing them with access rules and permissions defined centrally. The results of a data discovery query can then hide or show these data elements appropriately. Implementing such a robust data discovery process, supported by search capabilities in the Data Catalog UI, is a considerable undertaking for the team responsible for supporting and promoting the use of the Data Catalog in the organization.
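The hide-or-show behavior described above might look like this sketch, where search results are post-filtered by a centrally defined classification-to-role rule. The classification labels and role names are invented for the example.

```python
# Illustrative access rules: which roles may see which classifications.
ACCESS_RULES = {
    "public": {"analyst", "scientist", "steward"},
    "confidential": {"steward"},
}

def visible_results(results, role):
    """Hide assets whose classification the given role is not cleared to see."""
    return [
        r for r in results
        if role in ACCESS_RULES.get(r.get("classification", "public"), set())
    ]

results = [
    {"name": "customer_orders", "classification": "public"},
    {"name": "customer_pii", "classification": "confidential"},
]
print(visible_results(results, "analyst"))  # only 'customer_orders' is shown
```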
Such a team initially consists of Data Architects, Data Engineers, and Data Stewards, who are responsible for the Data Catalog implementation and for continuous value creation to drive adoption. Another key role is the Data Quality Engineer (metadata quality, in this case).
A core responsibility is to keep metadata quality within the defined quality metrics so that the data discovery process stays efficient. The key stakeholders on the consumption side are the Data Analysts, Data Scientists, and the Data Governance team.
The basic search capabilities to implement for successful adoption of the Data Catalog UI would be (a small sketch of the fuzzy-search and autocomplete pieces follows this list):
Basic search: text, keyword, contains, like, or synonym searches.
Advanced search: fuzzy search, plus logical grouping with AND/OR boolean operators.
Autocomplete: live suggestions as the user types.
Domain-specific search: based on business glossaries and taxonomies.
Global search: based on thesauri.
Social metadata search: based on tags, ratings, etc.
Ontology-based search: using knowledge graphs, starting from the searched node and its closest adjacent connections.
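To make two of these capabilities concrete, here is a minimal sketch of fuzzy search and autocomplete over asset names, using only the Python standard library. Real Data Catalog UIs rely on dedicated search engines; this only shows the shape of the behavior.

```python
import difflib

ASSET_NAMES = [
    "customer_orders", "customer_returns", "orders_dashboard", "supplier_invoices",
]

def fuzzy_search(query, names, cutoff=0.6):
    """Tolerate typos: 'custmer_orders' still finds 'customer_orders'."""
    return difflib.get_close_matches(query, names, n=5, cutoff=cutoff)

def autocomplete(prefix, names):
    """Live suggestions as the user types: simple prefix match on names."""
    return [n for n in names if n.startswith(prefix.lower())]

print(fuzzy_search("custmer_orders", ASSET_NAMES))  # ['customer_orders']
print(autocomplete("cust", ASSET_NAMES))            # both customer_* assets
```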
A great foundation for a good Data Catalog implementation is to apply Domain-Driven Design, which enables the efficient search capabilities needed for powerful data discovery. In this way, the challenges with existing data platforms and data architectures can be overcome.
Value Proposition of a Data Catalog for an Enterprise
The objective of implementing a Data Catalog is to understand the business needs and use data to drive value, helping the organization achieve its business outcomes. The following are the key value drivers a Data Catalog helps uncover:
Data Catalog is a key driver for Data Strategy in a Data-Driven Organization
A mature data-driven organization invests in a data strategy to support goals and objectives like:
A well-established Data Governance process
A modern Data Architecture implementing core capabilities like Data Engineering, Data Catalog, Data Quality, Data Management, and Data Sharing
Advanced data analytics with self-service capabilities
Continuous innovation and value creation
Adoption driven by a data team with the right data stakeholders
Capabilities like the Data Catalog are central to the successful implementation of such a Data Strategy.
Data Catalog acts as a catalyst for Data Mesh Architecture
In a centralized responsibility model, ownership of data is pushed to the team managing the data centrally in the data warehouse or data lake. Such models often prove unable to scale with business needs.
With decentralized and distributed architectures, responsibilities and roles are well defined and segregated. The data provider manages the source of information and makes it available, with the required lineage, metadata, and business glossaries, for the data consumers.
Breaking up the centralized management and governance of data, and pushing ownership down to the data teams, is critical to a successful Data Mesh implementation.
A well-designed and fully adopted Enterprise Data Catalog helps break the silos created by the centralized responsibility model. It also elevates the harvesting of metadata about data assets from across the enterprise landscape into a central location that can eventually become an internal data marketplace.
Building-block for an External Data Marketplace
An external Data Marketplace [6] is a data monetization platform that connects Data Providers with interested Data Consumers to transact on publicly available Data Products.
As more organizations uncover the hidden potential of their data assets, they become open to selling or sharing relevant data assets as Data Products on a Data Marketplace.
This is only possible when organizations trust their own data assets. Then they can confidently promote their Data Products on marketplaces, where buyers can safely purchase and augment these Data Products in their own ecosystems to drive their business outcomes.
For data monetization, the fundamental capability of knowing what data assets exist in the organization must be in place, and the data handling and governance processes must be well established, monitored, and reported on when required.
The data discovery process built on the Data Catalog should be functional, with self-service capabilities. End users, data governance, and data analytics teams can use it to build data fluency and to discover, analyze, collaborate, and create new data assets when required.
With this strong data platform foundation, organizations can build better Data Products and launch them through campaigns in the data marketplace.
Conclusion:
The advantages of a Data Catalog in the enterprise landscape are significant. Here is a concluding list of the capabilities a Data Catalog can unleash.
Data Discovery: The Data Catalog provides transparency and visibility for all data assets that exist in the enterprise, along with the ability to discover and search them. This is key to the successful adoption of the metadata platform and to activating metadata usage through powerful data discovery. The value of stale metadata - static mappings and documentation - is diminishing; active metadata, with features that allow intelligent searches and interactive data discovery, is in demand.
Eliminate Data Silos: The Data Catalog eliminates data silos by making the ownership and usage of data assets visible across the enterprise. It enables Data Scientists and Data Analysts to efficiently search for data assets and their lineage, removing the limitation of working only with the data assets they already know. Data silos disappear when groups collaborate with full visibility of enterprise data assets, without boundaries or isolation.
Data Governance: Even though data values are not exposed in the metadata model, critical data assets, and the elements within them, can be sensitive or confidential. With the right rules, access permissions, and perspectives, the Data Catalog can be made available in a meaningful context to key data stakeholders like Data Stewards, Data Analysts, Data Admins, and Data Engineers.
Visual Data Lineage: Because the Data Catalog follows a domain-driven design, it is easier to visualize the relationships with other data domains and understand their details. Data assets can be associated through a vertical or a horizontal structure:
Vertical lineage: the association between domains and subdomains in a hierarchical form, providing insight into an organization, team, group, or data domain. Traditionally known as a Line of Business (LOB) view.
Horizontal lineage: data lineage in the traditional sense, showing how data moves within the enterprise landscape from source to target (a minimal sketch follows this list).
Searching with Business Glossaries: For each metadata table or attribute, domain users can define additional metadata to expand the knowledge and documentation of the data asset. This information can be shared with other domains and end users, further enabling social aspects like rating, tagging, and commenting on data assets.
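As a rough illustration of horizontal lineage, the sketch below represents source-to-target movements as an adjacency map and walks upstream from an asset to its origins. The asset names and edges are invented for the example; real lineage would be harvested into the catalog, not hand-written.

```python
# Horizontal lineage: which asset feeds which (source -> list of targets).
LINEAGE = {
    "crm.orders": ["lake.raw_orders"],
    "lake.raw_orders": ["warehouse.fct_orders"],
    "warehouse.fct_orders": ["bi.orders_dashboard"],
}

def upstream(asset, lineage):
    """Walk the lineage graph backwards to find every source of an asset."""
    sources = set()
    for src, targets in lineage.items():
        if asset in targets:
            sources.add(src)
            sources |= upstream(src, lineage)  # keep walking toward the origin
    return sources

# Trace the dashboard back to its system of record:
print(upstream("bi.orders_dashboard", LINEAGE))
# -> {'warehouse.fct_orders', 'lake.raw_orders', 'crm.orders'}
```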
Finally, to conclude, the Data Catalog is a great collaborative tool. It is good to achieve well-managed governance, but it is great to be able to collaborate well. The Data Catalog is essential for driving metadata adoption in the enterprise, which depends on the end users and data stakeholders. Enabling it, however, is the responsibility of the domain experts and data teams, who must spread data fluency for their domain across the enterprise.
These domain experts must think of their domains as ‘Data Products’ and promote them within the enterprise for faster analytics and operational use cases. A federated governance approach helps each business unit share the responsibility of storing, processing, and sharing its metadata. Such collaboration then enables advanced data architectures like ‘Data Mesh’ and advanced practices like ‘DataOps’. Enterprise data users (governance, analytics, and end users) can then spend most of their time uncovering the potential of the enterprise data, freed from worrying about which data assets to use and whether those assets can be trusted. This addresses the problem of teams spending 80% of their time figuring out which data assets to use instead of doing meaningful analysis.
This article, combined with the previous one [7], provides a complete understanding of a metadata platform and deeper insight into why the Data Catalog is a critical implementation for activating metadata capabilities. For further discussion, or to share your own metadata experience, please leave a comment below.
Links for Further Reading and References
1. The quote and details on the internal data marketplace: HBR whitepaper “Beyond Technology: Creating Business Value with Data Mesh”
2. IDC Blog data point on the time spent searching for and preparing data (cited above)
3. Medallion Architecture patterns from Databricks
4. “Domain-Driven Design: Tackling Complexity in the Heart of Software” by Eric Evans
5. “The Enterprise Data Catalog” by Ole Olesen-Bagneux
6. Additional reading on Data Marketplaces and their use cases
7. “Active Metadata Platform” article from Stratagem360