What is metadata and how does it work?

What Is Metadata?

Metadata is data that describes other data in a structured, consistent way, so large amounts of data can be collected, stored, and analyzed over time. Data warehouses require metadata for easy retrieval and management when storing big data. A data warehouse uses structured data that is standardized, clean, and consistent across data sources. Metadata ensures uniformity in the collection and storage of this data so that business owners and data analysts can easily access and derive insights from the data.

Effective management of metadata is an essential part of a reliable and flexible big data “ecosystem”, as it helps companies more efficiently manage their data assets and make them available to data scientists and other analysts.

Metadata Classification and Examples:

1. Technical Metadata

1). Physical Metadata: Metadata describing physical resources, such as servers, operating systems, and computer room locations.

2). Data Source Metadata: Metadata describing the data source, usually including four types of information:

Data source address (e.g. IP, port, etc.);
Physical topology (e.g. active/standby, roles, etc.);
Permissions (e.g. username, password, etc.);
Database name, version, domain name, etc.

3). Storage Metadata: Metadata describing stored objects, which is also what “metadata” usually means in the narrow sense. It includes several main types:

Management attributes (e.g. creator, application system, business unit, business owner, etc.);
Lifecycle (e.g. creation time, DDL time, version information, etc.);
Storage properties (e.g. location, physical size, etc.);
Data characteristics (e.g. data skew, average length, etc.);
Usage characteristics (e.g. DML, refresh rate, etc.);
Data structure, described at four levels:
Table/partition (e.g. name, type, remarks, etc.);
Columns (e.g. name, type, length, precision, etc.);
Indexes (e.g. name, type, fields, etc.);
Constraints (e.g. type, fields, etc.).
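
As a rough illustration of the data structure items above, the sketch below reads column and index metadata back from a database catalog. It uses Python's built-in sqlite3 module only because it needs no setup. The "orders" table, its columns, the index name, and the describe_table helper are all hypothetical names made up for this example, and a real warehouse would expose its catalog differently.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_name VARCHAR(50) NOT NULL,"
    " amount DECIMAL(10, 2)"
    ")"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_name)")


def describe_table(connection, table):
    """Collect column and index metadata for one table from SQLite's catalog."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = [
        {
            "name": row[1],
            "type": row[2],
            "not_null": bool(row[3]),
            "primary_key": bool(row[5]),
        }
        for row in connection.execute(f"PRAGMA table_info({table})")
    ]
    indexes = []
    # PRAGMA index_list rows: (seq, name, unique, origin, partial)
    for row in connection.execute(f"PRAGMA index_list({table})").fetchall():
        index_name = row[1]
        # PRAGMA index_info rows: (seqno, cid, name)
        fields = [
            r[2] for r in connection.execute(f"PRAGMA index_info({index_name})")
        ]
        indexes.append(
            {"name": index_name, "unique": bool(row[2]), "fields": fields}
        )
    return {"table": table, "columns": columns, "indexes": indexes}


storage_metadata = describe_table(conn, "orders")
for column in storage_metadata["columns"]:
    print(column)
for index in storage_metadata["indexes"]:
    print(index)
```

The point of the sketch is that none of this information is business data: it is all data about how the business data is laid out.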

4). Computational Metadata: Metadata describing the process of data computation. Computations generally fall into two types: data extraction (ETL) and data processing (JOB). Each type can be further broken down into control metadata (e.g. configuration properties, scheduling policies, etc.) and process metadata (e.g. dependencies, execution status, execution logs, etc.).
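
One possible way to model the split between control metadata and process metadata is sketched below. The class names, field names, job names, and status values are assumptions chosen for this example and do not come from any particular scheduler.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ControlMetadata:
    """How a job is configured and scheduled (hypothetical fields)."""
    config: Dict[str, str]
    schedule: str  # for example a cron expression


@dataclass
class ProcessMetadata:
    """What happened, or must happen, when the job runs (hypothetical fields)."""
    depends_on: List[str] = field(default_factory=list)
    status: str = "PENDING"  # assumed values: PENDING, SUCCESS, FAILED
    log_lines: List[str] = field(default_factory=list)


@dataclass
class JobMetadata:
    name: str
    kind: str  # "ETL" or "JOB", following the two types named in the text
    control: ControlMetadata
    process: ProcessMetadata


def runnable_jobs(jobs):
    """Return names of pending jobs whose dependencies have all succeeded."""
    succeeded = {j.name for j in jobs if j.process.status == "SUCCESS"}
    return [
        j.name
        for j in jobs
        if j.process.status == "PENDING"
        and set(j.process.depends_on) <= succeeded
    ]


jobs = [
    JobMetadata(
        "extract_orders", "ETL",
        ControlMetadata({"source": "orders_db"}, "0 1 * * *"),
        ProcessMetadata(status="SUCCESS", log_lines=["extraction finished"]),
    ),
    JobMetadata(
        "build_report", "JOB",
        ControlMetadata({"target": "sales_report"}, "0 2 * * *"),
        ProcessMetadata(depends_on=["extract_orders"]),
    ),
    JobMetadata(
        "build_forecast", "JOB",
        ControlMetadata({"target": "forecast"}, "0 3 * * *"),
        ProcessMetadata(depends_on=["build_report"]),
    ),
]

print(runnable_jobs(jobs))
```

Here the dependency and status fields (process metadata) decide what can run next, while the schedule and configuration (control metadata) decide when and how it runs.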

5). Quality Metadata: Metadata that describes the quality of data. Data quality is typically expressed through a defined set of quality metrics.
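
As a minimal sketch of what such metrics could look like, the example below computes two common ones, completeness (share of non-null values) and uniqueness (share of distinct values among the non-null ones), for a single column. The metric names, the quality_metrics helper, and the sample rows are assumptions for illustration. Real quality frameworks define many more metrics.

```python
def quality_metrics(rows, column):
    """Compute simple quality metrics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    total = len(values)
    non_null = [v for v in values if v is not None]
    return {
        "column": column,
        "row_count": total,
        # Share of rows where the column has a value.
        "completeness": len(non_null) / total if total else 0.0,
        # Share of distinct values among the non-null values.
        "uniqueness": len(set(non_null)) / len(non_null) if non_null else 0.0,
    }


# Hypothetical sample data.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "b@example.com"},
    {"id": 4, "email": "a@example.com"},
]

print(quality_metrics(rows, "email"))
```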

6). Operational Metadata: Metadata that describes how data is generated and used in day-to-day operation.

Data generation (e.g. generation time, job information, etc.);
Table access (e.g. queries, joins, aggregations, etc.);
Table joins (e.g. joined tables, join fields, join types, join counts);
Field access (e.g. queries, joins, aggregations, filtering, etc.).
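
To make the table access item concrete, here is a small sketch that aggregates an access log into usage counts. The log format (a list of table and operation pairs), the table names, and the summarize_access helper are assumptions for this example. In practice such records would usually come from query logs or an audit trail.

```python
from collections import Counter

# Hypothetical access log: one (table, operation) pair per access.
access_log = [
    ("orders", "query"),
    ("orders", "aggregation"),
    ("customers", "query"),
    ("orders", "query"),
    ("customers", "join"),
]


def summarize_access(log):
    """Count accesses per table and per (table, operation) pair."""
    per_table = Counter(table for table, _ in log)
    per_operation = Counter(log)
    return per_table, per_operation


per_table, per_operation = summarize_access(access_log)
print(per_table.most_common())
print(per_operation[("orders", "query")])
```

Counts like these can show which tables are heavily used and which are candidates for archiving.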

7). Operation and Maintenance Metadata: Metadata describing the operation and maintenance level of the system, usually including tasks, alerts, and failures.

8). Cost Metadata: Metadata describing the cost of data storage and computation.

Computational cost (e.g. CPU, memory, etc.);
Storage cost (e.g. space, compression ratio, etc.).
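
As a small worked example of storage cost metadata, the sketch below compresses some sample data with Python's standard zlib module and records the raw size, stored size, and compression ratio. The sample payload and the dictionary keys are made up for illustration.

```python
import zlib

# Hypothetical, highly repetitive sample data (21 bytes per line, 1000 lines).
raw = ("2022-06-24,orders,42\n" * 1000).encode("utf-8")
compressed = zlib.compress(raw)

storage_cost = {
    "raw_bytes": len(raw),
    "stored_bytes": len(compressed),
    # Ratio above 1 means the stored form is smaller than the raw form.
    "compression_ratio": len(raw) / len(compressed),
}

print(storage_cost)
```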

9). Standard Metadata: Metadata describing the standardized content of the data.

Code management (e.g. transformation rules, external interfaces, etc.);
Mapping management and data display (e.g. styles, rules, semantics, units, etc.).

10). Security Metadata: Metadata describing the content of data security.

Security level and data sensitivity (e.g. whether the data is sensitive, which desensitization algorithm applies, etc.).
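
The sketch below shows one way security metadata could drive desensitization: each column carries a sensitivity flag and the name of a desensitization algorithm, and a helper applies the matching rule. The column names, the two algorithm names ("hash" and "mask"), and the desensitize helper are assumptions for this example, not a recommended security scheme.

```python
import hashlib

# Hypothetical security metadata, keyed by column name.
SECURITY_METADATA = {
    "email": {"sensitive": True, "desensitization": "hash"},
    "phone": {"sensitive": True, "desensitization": "mask"},
    "country": {"sensitive": False, "desensitization": None},
}


def desensitize(column, value):
    """Apply the desensitization rule recorded for a column, if any."""
    rule = SECURITY_METADATA.get(
        column, {"sensitive": False, "desensitization": None}
    )
    if not rule["sensitive"]:
        return value
    if rule["desensitization"] == "hash":
        # One-way SHA-256 digest of the value.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if rule["desensitization"] == "mask":
        # Keep the last four characters; fully mask values of four or fewer.
        visible = value[-4:] if len(value) > 4 else ""
        return "*" * (len(value) - len(visible)) + visible
    raise ValueError(f"unknown desensitization rule for column {column!r}")


print(desensitize("phone", "5551234567"))
print(desensitize("country", "DE"))
```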

11). Shared Metadata: Metadata describing how data is shared, such as interface methods, formats, and content.

2. Business Metadata

1). Model Metadata: A data model is a description of the business, and the business can be better understood through the model. Common modeling approaches include normal-form (normalized) models, dimensional models, and multidimensional models. A dimensional model, for example, involves elements such as business lines, sectors, and processes; data domains and subject domains; dimensions and attributes; indicators, facts, and metrics; and data marts and applications.

2). Application Metadata: Metadata describing data applications.

3). Analysis Metadata: Business metadata described from the perspective of data analysis, for example data domain, subject domain, product line, section, business process, business rules, etc.

3. Management Metadata

Management metadata describes how data is managed within an enterprise, covering people, processes, responsibilities, jobs, organizations, and departments.

Metadata Features:

Metadata is structured data about data, which is not necessarily in digital form and can come from different sources.
Metadata is data associated with objects that spares potential users from needing full prior knowledge of the existence and characteristics of those objects.
Metadata is an encoded description of an Information Package.
Metadata contains a set of data elements used to describe the content and location of information objects, facilitating the discovery and retrieval of information objects in a network environment.
Metadata not only describes information objects, but also describes the usage environment, management, processing, storage, and usage of resources.
Metadata accumulates naturally over the life cycle of an information object or system.
In the conventional definition of metadata, “data” means symbols that represent the nature of things: the numerical values on which statistics, calculations, scientific research, and technical design are based, or information that has been digitized, expressed as formulas, coded, or rendered graphically.

Advantages of Metadata

In the .NET context, metadata is key to a simpler programming model that no longer requires Interface Definition Language (IDL) files, header files, or any external method of component reference. Metadata allows .NET languages to describe themselves automatically in a language-neutral way that is invisible to developers and users. In addition, metadata can be extended by using attributes. Metadata has the following major advantages:

1). Self-description: Common language runtime modules and assemblies are self-describing. A module’s metadata contains all the information needed to interact with another module. Metadata automatically provides the functionality of IDL in COM, allowing one file to be used for both definition and implementation. Runtime modules and assemblies do not even need to be registered with the operating system. As a result, the descriptions used by the runtime always reflect the actual code in the compiled file, which improves the reliability of the application.

2). Language interoperability and easier component-based design: Metadata provides all the necessary information about compiled code to let you inherit a class from a PE file written in a different language. You can create an instance of any class written in any managed language (any language that targets the common language runtime) without worrying about explicit marshaling or custom interop code.

Why does an organization record and manage its metadata?

The information architecture of most organizations is similar to that of a crowded, disorganized bookstore. The data is everywhere. Most organizations’ data is not organized or catalogued, making it difficult to find the data you need.

This is the core problem – lack of data findability, and therefore lack of data availability. And the problem is only getting worse. In 10 years, the amount of organizational data can go from gigabytes to terabytes to petabytes. In the era of “data is the new oil”, successful organizations must be able to find and use all data to gain a competitive advantage. The description and search capabilities of metadata management are critical to successfully finding and using this data.

Metadata management is also important because definitions can change depending on context. Consider how different groups think about and define the word “customer.” If you talk to someone in IT, sales, or compliance, each may have a different view of what a customer represents and how the data is stored.

For IT, data about customers may center on running analytics reports and dashboards for the company, as well as the technical aspects of storing this data. If you ask IT where “customer” data lives, they might reply, “It is in our enterprise data warehouse for reporting, which dates back to 2015. We also have customer data from new acquisitions in the data lake, and that data needs to be converted before we can report on it.” For them, “customer” data is largely analytical and historical.

Your sales team may be more focused on operations, such as how they use customer data in sales today. To them, customer data may mean only active customers or account-level customer data (such as the company name), not every customer the company has ever had. Sales teams may think of customer data as company names rather than data about individual people. Compliance departments, by contrast, may consider customer data at the level of individual persons, since their primary use of the data is to comply with regulations such as the GDPR.

As you can see, the challenge is not just in the definition, but in the inconsistency of definitions between these different teams and processes. And the volume of data keeps growing. For analysis, you need to be able to find your data. In operations, you need to understand all the different applications and where they get their data. In terms of compliance, you need to ensure that your organization follows the rules. For the IT department, the main concern is generating analysis and preserving history.

With metadata management, you can give each part of your organization the metadata it needs to understand and manage its systems and data, together with a unified view of data across the whole organization. That shared view is what lets the organization function properly and get things right over time.

Conclusion

Thank you for reading our article. We hope it has given you a better understanding of what metadata is. If you want to learn more about metadata, we suggest visiting Gudu SQLFlow for more information.

As one of the best data lineage tools available on the market today, Gudu SQLFlow can not only analyze SQL script files, obtain data lineage, and perform visual display, but also allow users to provide data lineage in CSV format and perform visual display. (Published by Ryan on Jun 24, 2022)