
Data Catalogs: An Intro To Another Important Topic

by Dhanhyaa

One of the dimensions of data quality is availability. The data users in an organisation need to be able to search and access the data they need to work. An organisation-wide data catalog can help the users find data. Here’s how a data catalog does it.

Table of Contents
  • What is a data catalog?
    • Data assets
    • Metadata
    • The data catalog’s role
  • So, what capabilities should a data catalog have?
    • So far, we found these different data catalog products…
  • Conclusion about data catalogs

What is a data catalog?

The definition of a data catalog is:

An organised, detailed inventory of data assets across all data sources of an organisation, where ordered, indexed and easily accessible metadata can help data professionals quickly find the most appropriate data for any analytical or business purpose.

Three parts of this definition need explaining:

  • what data assets are
  • what metadata is
  • the role of a data catalog in making data easily searchable and accessible

Data assets

A data asset is defined as “a system, application output file, document, database, or web page that companies use to generate revenues.”

The data assets in a data catalog include:

Data assets in a data catalog
Structured data, unstructured data, reports & query results, data visuals, machine learning models and connections between databases. Image designed using Canva.

Metadata

Metadata is the description of a data asset that makes it easier to find and analyse.

If you remember browsing and borrowing books at a library, the familiar example of a library catalog card might help.

The card has information about a book such as the title, author, topic, publication date, edition, summary and which section of the library it is in.

This information acts as the metadata of the book and makes it easier for a librarian or reader to find the book.

The same applies to the metadata of a data catalog.

Metadata of this person
Another easy example of metadata is how an individual is described. The metadata of this person can be name, occupation, nationality, the social cause she is passionate about, favourite festival snack and so on. Image designed using Canva.
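
Going back to the library card, here is a minimal sketch in Python of what a metadata record for a data asset could look like. The class `AssetMetadata`, its field names and the sample values are all assumptions made for illustration; they do not come from any particular data catalog product.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class AssetMetadata:
    """Descriptive record for one data asset, analogous to a library catalog card."""
    title: str      # like the book title
    owner: str      # like the author
    topic: str      # like the subject of the book
    created: str    # like the publication date (ISO 8601 string)
    version: str    # like the edition
    summary: str    # like the short summary on the card
    location: str   # like the library section: where the asset lives
    tags: List[str] = field(default_factory=list)


# Hypothetical example record; every value here is made up for illustration.
card = AssetMetadata(
    title="Monthly sales report",
    owner="Finance team",
    topic="Sales",
    created="2021-03-01",
    version="2",
    summary="Aggregated sales per region and month.",
    location="warehouse.reports.monthly_sales",
    tags=["sales", "report"],
)

# A record like this is what a catalog would index and search.
print(asdict(card)["location"])
```

Just as the card tells a reader which shelf to walk to, the `location` field tells a data user where the asset can be found.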

The data catalog’s role

A data catalog has several capabilities that allow users to:

  • Search the catalog and access data, which is really helpful for self-service analytics.
  • Automate the evaluation and recommendation of potentially relevant data.
  • Make sure the use of data complies with regulations.

The data catalog makes this possible by addressing a common problem: more time and effort is wasted on finding relevant data than is spent on actually using it.

If data scientists spend most of their workday combing through data lakes that have turned into data swamps, it means that there are problems with the data sets.

The problems that data scientists have to deal with include dark data taking up space in the database and the lack of a common vocabulary, which reflects a lack of standards.

As a result, they face difficulties in accessing data, tracing its sources and assessing its quality.

It may seem easier to give up and not deal with these data challenges, but the consequences of leaving them unchecked may outweigh the effort of fixing them.

How a data catalog cuts time wasted on finding data to spend more time on analysing data.
“It is common to shift from 80% of time spent finding data and only 20% on analysis to 20% finding and preparing data with 80% for analysis.” – The benefit as written by Dave Wells in his article on Alation. Image designed using Canva.

By addressing the challenges mentioned, the organisation can benefit from:

  • Improved trust and confidence in data of high quality, leading to more effective decision-making.
  • Reduced data risk due to compliance with regulations.
  • Increased efficiency from having a unified view of all data and reducing dependence on the IT department.

So, what capabilities should a data catalog have?

The capabilities that a data catalog should have in order to address the challenges and provide the benefits stated above are:

User-friendly search experience

All data users should be able to search the data catalog themselves, quickly find results based on the metadata they searched for and receive relevant recommendations, much like on Netflix.
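
As a rough sketch of how search and recommendations over metadata could work, the Python example below matches a keyword against descriptions and tags, and recommends other assets that share tags. The catalog contents and the function names `search` and `recommend` are hypothetical; real products use far richer ranking than this.

```python
from typing import Dict, List

# Hypothetical catalog: asset name -> metadata (description and tags).
CATALOG: Dict[str, Dict] = {
    "sales_2020": {"description": "Monthly sales per region", "tags": {"sales", "finance"}},
    "customer_master": {"description": "Customer master data", "tags": {"customer", "sales"}},
    "web_logs": {"description": "Raw web server logs", "tags": {"web", "logs"}},
}


def search(keyword: str) -> List[str]:
    """Return asset names whose description or tags contain the keyword."""
    kw = keyword.lower()
    return sorted(
        name
        for name, meta in CATALOG.items()
        if kw in meta["description"].lower() or kw in meta["tags"]
    )


def recommend(name: str) -> List[str]:
    """Recommend other assets sharing at least one tag, most shared tags first."""
    own_tags = CATALOG[name]["tags"]
    scored = [
        (len(own_tags & meta["tags"]), other)
        for other, meta in CATALOG.items()
        if other != name
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in ranked if score > 0]


print(search("sales"))
print(recommend("sales_2020"))
```

Tag overlap is only one simple way to produce recommendations; it is used here because it needs nothing beyond the metadata itself.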

Automation

Automation removes the need to connect data sources manually, saving time and effort for more important tasks.
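
One way to picture this automation is a small script that reads table and column information straight from a data source instead of having someone type it in. The sketch below uses Python's built-in `sqlite3` module with an in-memory database standing in for a real source; the function name `harvest_metadata` and the sample tables are assumptions for illustration.

```python
import sqlite3
from typing import Dict, List


def harvest_metadata(conn: sqlite3.Connection) -> Dict[str, List[Dict[str, str]]]:
    """Collect table names plus column names and types from a SQLite database."""
    catalog: Dict[str, List[Dict[str, str]]] = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [{"name": col[1], "type": col[2]} for col in columns]
    return catalog


# Hypothetical in-memory source standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

harvested = harvest_metadata(conn)
print(harvested)
```

Running a script like this on a schedule would keep the catalog in step with the source without manual effort, which is the point of the automation capability.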

Simplified compliance

Compliance can be difficult to keep up with, so a data catalog should simplify compliance by profiling data assets, figuring out their relevance to specific regulations and automatically categorising them for future reference.
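
A very simplified sketch of automatic categorisation is shown below: column names are matched against patterns, and each match assigns the asset to a compliance category. The category names and patterns in `COMPLIANCE_RULES` are hypothetical and not tied to any specific regulation; a real catalog would typically profile the data values as well, not only the column names.

```python
import re
from typing import Dict, List

# Hypothetical rules: compliance category -> pattern on column names.
COMPLIANCE_RULES = {
    "personal_data": re.compile(r"(email|phone|birth|address|name)", re.IGNORECASE),
    "financial_data": re.compile(r"(iban|card|salary|account)", re.IGNORECASE),
}


def classify_asset(columns: List[str]) -> Dict[str, List[str]]:
    """Map each compliance category to the columns that triggered it."""
    result: Dict[str, List[str]] = {}
    for category, pattern in COMPLIANCE_RULES.items():
        hits = [col for col in columns if pattern.search(col)]
        if hits:
            result[category] = hits
    return result


labels = classify_asset(["customer_id", "email", "birth_date", "iban", "order_total"])
print(labels)
```

The stored labels can then be looked up later, so nobody has to re-inspect the asset each time a compliance question comes up.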

Connecting various data sources to a single source of truth

All data assets in the organisation, as well as their metadata, should be connected to the master data using various tools for business intelligence, data integration, SQL queries, enterprise apps, data modelling and so on.

Support for data quality

For a data catalog to do its job in making data searchable and accessible, the organisation should implement the best practices of data quality management (DQM) including root cause analysis, setting data quality rules and using data quality tools.
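
To illustrate what setting data quality rules can mean in practice, here is a minimal sketch that checks each row of a data set against named rules and reports the share of rows that pass. The rule names, the sample rows and the function `check_quality` are assumptions for illustration, not the API of any data quality tool.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]

# Hypothetical rules: rule name -> check that every row should pass.
QUALITY_RULES: Dict[str, Callable[[Row], bool]] = {
    "email_present": lambda row: bool(row.get("email")),
    "total_non_negative": lambda row: row.get("total") is not None and row["total"] >= 0,
}


def check_quality(rows: List[Row]) -> Dict[str, float]:
    """Return, for each rule, the share of rows that pass it (0.0 to 1.0)."""
    if not rows:
        return {name: 1.0 for name in QUALITY_RULES}
    return {
        name: sum(1 for row in rows if rule(row)) / len(rows)
        for name, rule in QUALITY_RULES.items()
    }


# Made-up sample rows, one of them with a missing email and one with a negative total.
sample_rows: List[Row] = [
    {"email": "a@example.com", "total": 10.0},
    {"email": None, "total": 5.0},
    {"email": "b@example.com", "total": -1.0},
    {"email": "c@example.com", "total": 0.0},
]

scores = check_quality(sample_rows)
print(scores)
```

Scores like these could be stored next to the asset's metadata so that users can judge quality before they decide to use the data.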

Capabilities of a data catalog and how it reaps benefits.
The capabilities of search experience, automation, compliance, connectors and DQM ultimately produce the benefits of improved trust, increased efficiency and reduced data risk. Image designed using Canva.

So far, we found these different data catalog products…

Examples of data catalog tools built by big companies are:

  • Google Cloud’s Data Catalog
  • LinkedIn’s DataHub
  • Facebook’s Nemo
  • Shopify’s Artifact
  • Lyft’s Amundsen
  • WeWork’s Marquez
  • IBM’s Watson Knowledge Catalog
  • Microsoft Azure’s Data Catalog

To show what a data catalog looks like, let’s use the US Geological Survey (USGS) Science Data Catalog (SDC) as an example. Look at the screenshots below.

USGS SDC Browse feature
Screenshot of the USGS SDC’s Browse feature where the user can find data sets by categories like Mission Area, Science Topic or Data Source.

On the USGS SDC page, users can either search using the keywords they have in mind or browse through the categories set by USGS.

These categories are Mission Area, Science Topic and Data Source, each of which contains sub-categories.

If a user chooses to browse and clicks on the subcategory Land Use Change under the category Science Topic, the user will receive the list of data set results as shown below.

USGS SDC Browse, science topic, land use change
Screenshot of the results of Browse > Science Topic > Land Use Change on USGS SDC.

In this list, the user can see what the data sets are and view the metadata of the data sets by clicking on the “View Metadata” button on the right side of the screen.

When the user clicks on the “View Metadata” button for the data set “USGS National Land Cover Dataset (NLCD) Downloadable Data Collection”, the user will see the metadata of that data set as shown below.

USGS SDC > Browse > Category > Subcategory > Data set > Metadata
Screenshot of the metadata of a data set found on USGS SDC.

In this way, the user can identify the data set’s citation information, description, spatial information, metadata reference, and so on to effectively decide whether this data set meets the user’s data needs.
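
The browse flow described above (category, then subcategory, then data set, then metadata) can be sketched as a nested lookup. This is only an illustration of the idea and not how the USGS SDC is implemented; the category names, the data set title and the metadata field names are taken from the walkthrough, while the metadata values are placeholders.

```python
from typing import Dict, List

NLCD_TITLE = "USGS National Land Cover Dataset (NLCD) Downloadable Data Collection"

# Browse tree: category -> subcategory -> data set titles.
# Only the branch used in the walkthrough is filled in.
BROWSE_TREE: Dict[str, Dict[str, List[str]]] = {
    "Mission Area": {},
    "Science Topic": {"Land Use Change": [NLCD_TITLE]},
    "Data Source": {},
}

# Metadata store keyed by data set title; the values are placeholders.
METADATA: Dict[str, Dict[str, str]] = {
    NLCD_TITLE: {
        "citation_information": "<placeholder>",
        "description": "<placeholder>",
        "spatial_information": "<placeholder>",
        "metadata_reference": "<placeholder>",
    }
}


def browse(category: str, subcategory: str) -> List[str]:
    """Return the data set titles listed under a category and subcategory."""
    return BROWSE_TREE.get(category, {}).get(subcategory, [])


def view_metadata(title: str) -> Dict[str, str]:
    """Return the metadata record for one data set."""
    return METADATA[title]


results = browse("Science Topic", "Land Use Change")
print(results)
print(sorted(view_metadata(results[0])))
```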

Conclusion about data catalogs

A data catalog is defined by the various data assets it holds, the metadata describing these data assets and its capabilities in making relevant data more available for any analytical or business use.

Companies use metadata to find and identify data assets with the aim of generating revenue from data analysis.

Data catalogs can improve a company’s returns on data by reducing the time and effort spent on finding and preparing data relative to analysing it.

At a recent online conference, I watched data scientists discuss how data preparation is an integral part of their typical workday.

Using data catalogs might help minimise data prep time so that data scientists can focus more of their time and energy on analysis instead.

Check iunera.com to learn more about what we do!

Categories:

Big Data Lessons

Tags:

data assets, data catalog, data catalogs, data quality, data quality management, dimensions of data quality, metadata

© 2019

© 2025 iunera GmbH & Co KG