A focus on data management
Why has data management become more important than knowledge management?
This question frames a discussion of how we use data to achieve desired business and mission outcomes. Knowledge management, which serves business users who create reports, manage dashboards, and build presentations on operational metrics, aims for incremental efficiency improvements. Data management, on the other hand, targets a broader user base, referred to as data consumers, and focuses on creating differentiating capabilities and data products such as artificial intelligence (AI), advanced predictive analytics, and machine learning (ML) models.
These products dramatically improve efficiency through human-machine teaming and by fusing sensor data for rapid intelligence processing. Data management has quickly surpassed knowledge management in its ability to increase efficiency, enhance existing capabilities, and deliver game-changing new capabilities that outmaneuver competitors and drive innovation into operations.
Discoverable and searchable data repositories for rapid business value
To stay competitive, project and departmental data teams must find data quickly to answer questions and deliver valuable data products such as predictive analytics. Rapid search and discovery of data requires a unified data management framework and a data management design pattern such as a data hub or data mesh. Effective use of a data lake or lakehouse architecture holding mixed-use or multi-modal data requires pairing storage with tools that help make sense of the stored data.
NT Concepts leverages an AI-infused technology known as a semantic knowledge graph for enhanced searchability, along with a fully indexed metadata dictionary for rapid query results. The ongoing challenge for many organizations is that legacy on-premises data stores tend to follow Conway’s Law, named for Melvin Conway, who observed:
Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.
Melvin E. Conway
Many organizations have hierarchical, departmental organization charts and end up storing data in a similar model or schema. The result is data duplication, inconsistencies, sharing barriers, and difficulty accessing, or even knowing about, data held in other departments.
For the past ten years, a data fabric design pattern may have been considered a low-risk bet for organizing on-premises data before migrating it to a cloud provider. However, data fabric may prove too restrictive and slow the creation of innovative data products, because the central data and IT team can become overburdened with technical change requests and may lack the domain knowledge needed to develop the most effective data products.
Additionally, identifying and classifying data stores makes it easier to determine what data is most valuable and what data needs to be decommissioned or destroyed. To enhance the value derived from data, organizations should use a data management platform to index the important context of the data (also known as metadata) for instant visibility, shareability, and searchability across the organization.
Data access – Evolving toward mesh design goals
What do organizations need for data access, and how can better access to data lead to mature datasets for data product developers, AI algorithms, and relevant answers for knowledge workers?
Basic requirements for data access include the ability to ingest, store, search, secure, and easily share your organization’s data stores with data consumers. Data consumers are anyone who needs data sourced from enterprise-wide data stores for use in their daily work, for compliance requests, or for building data products for specific use cases.
To empower data consumers to do their jobs efficiently, you need a metadata catalog that organizes what data your organization holds across the enterprise. Modern data management design patterns include a knowledge management capability with a data catalog, such as the semantic knowledge graph shown in Figure 1.
Figure 1. The semantic knowledge graph is complemented by a metadata catalog, which acts as a central repository for capturing and documenting essential information about the data. The catalog provides a structured framework for understanding the characteristics, relationships, and usage of various data assets within an organization.
Semantic knowledge graphs capture the metadata relationships between datasets without manually encoding them into a rigid schema. This automatic encoding of metadata relationships uses a type of AI known as an inference engine, an explainable form of AI. Data stewards and data architects collaborate to design data models based on your organization’s specific business rules, ensuring that the inference engine focuses on delivering mission and business value. The semantic knowledge graph offers flexibility, enabling context-rich, human-understandable data discovery, and eliminates the need for data engineers or database administrators to redesign database schemas for individual use cases or application designs. Instead, the graph allows data lakes holding varied data types, formats, and columns to accommodate an evolving set of requirements over time.
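To make this concrete, the minimal sketch below uses the open-source rdflib library purely as an illustration; the catalog namespace, dataset names, and metadata properties are hypothetical rather than part of any specific platform. It shows how dataset metadata can be captured as graph triples and then discovered with a SPARQL query, with no fixed relational schema required:

```python
from rdflib import Graph, Literal, Namespace, RDF

# Hypothetical namespace for an organization's metadata catalog
CAT = Namespace("http://example.org/catalog#")

g = Graph()
g.bind("cat", CAT)

# Describe two hypothetical datasets with catalog metadata (as triples)
g.add((CAT.flightTracks, RDF.type, CAT.Dataset))
g.add((CAT.flightTracks, CAT.domain, Literal("geospatial")))
g.add((CAT.flightTracks, CAT.steward, Literal("ISR Analytics Team")))

g.add((CAT.maintenanceLogs, RDF.type, CAT.Dataset))
g.add((CAT.maintenanceLogs, CAT.domain, Literal("logistics")))
g.add((CAT.maintenanceLogs, CAT.steward, Literal("Sustainment Team")))

# A data consumer discovers datasets by their metadata, not by knowing
# table names or database schemas in advance
results = g.query(
    """
    SELECT ?dataset ?steward WHERE {
        ?dataset a cat:Dataset ;
                 cat:domain "geospatial" ;
                 cat:steward ?steward .
    }
    """,
    initNs={"cat": CAT},
)

for dataset, steward in results:
    print(f"{dataset} is stewarded by {steward}")
```

Because the relationships live in the graph rather than in table definitions, adding a new metadata property later (for example, a classification level) does not require a schema migration.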
There are several good options for cloud-agnostic data management, such as MarkLogic Data Hub or Databricks Data Lakehouse, which may be available on-premises in your organization’s data center or as a service deployed from Amazon Web Services, Microsoft Azure, or Google Cloud Platform.
It’s important to select a data management platform that is highly compliant with federal and DoD standards, enabling faster security hardening and Authority to Operate (ATO) approval for data products.
Achieve data sovereignty and data residency with data fabric and data hub
Data sovereignty is becoming a bigger topic as privacy regulations such as the General Data Protection Regulation (GDPR) and U.S. federal and state regulations require data to be stored and processed within a specific geographic boundary. Data residency is where the data and metadata are physically stored and processed. Many countries, including the United States, have data sovereignty laws requiring that data about their citizens remain accessible and managed regardless of where the data is physically stored.
By ingesting data into a data fabric or data hub, centralized data teams gain full control over where data is physically stored. These data management controls enable compliance with international privacy laws and help avoid expensive or embarrassing violations, such as the repeated $200+ million fines against Facebook parent company Meta for allowing the scraping of European citizens’ data. Additional benefits of managing where data is physically stored at rest include lower latency for faster access and easier repatriation of data back on-premises if cloud expenses become economically unsustainable.
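Independent of any particular data platform, the underlying residency control is simple to express. The minimal sketch below uses AWS's boto3 SDK purely as an illustration; the bucket name and region are hypothetical, and a data hub or data fabric would typically apply the same placement decision through its own configuration:

```python
import boto3

# Hypothetical bucket and region chosen to meet an EU residency requirement
REGION = "eu-west-1"
BUCKET = "example-citizen-records-eu"

s3 = boto3.client("s3", region_name=REGION)

# The bucket, and every object written to it, is stored in the declared
# region, so residency is enforced by explicit policy rather than assumption
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)
```

Centralizing this decision in the data platform means the region choice is made once, audited once, and inherited by every pipeline that writes to the store.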
Cloud storage and data management services can become expensive when compared holistically with on-premises storage equivalents, especially once the costs of moving data between regions, cloud service endpoints, and data services are considered. Using a cloud-agnostic or infrastructure-agnostic data management platform such as MarkLogic, Databricks, or Snowflake allows flexibility in storage location and the rapid movement of data to wherever it is best located for business or mission needs.
Figure 2. Our DoD-ready, portable NT StudioDX platform integrates seamlessly with commercial cloud data hub solutions configured in data mesh design patterns to help data product teams build complex analytics, create unique insights, and identify capability differentiators.
Which is better for your agency: data fabric, data hub, or data mesh?
In Data Management Cloud Design Patterns for Cloud Ecosystems, we describe the differences between data fabric, data hub (platform), and data mesh design patterns from a benefit/risk perspective.
We recommend a hybrid data hub (platform)/mesh approach for organizations requiring more agility and faster innovation cycles, because it allows maximum control of data products at the departmental level. For organizations with low risk tolerance and highly centralized control policies, data fabric is the best choice because it enforces enterprise policy and requires service tickets for changes to data pipelines.
Let’s consider the features and capabilities of the data hub (platform) in more depth, because it shares characteristics of both data fabric and data mesh. Below is a list of those shared capabilities and how commercial platforms incorporate them:
| Data Platform Capability (1–5) | MarkLogic Data Hub | Databricks Data Lakehouse | Snowflake |
| --- | --- | --- | --- |
| GOVERNANCE | | | |
| Easy to integrate identity and access management services | | | |
| Role-Based Access Control (RBAC) | | | |
| Attribute-Based Access Control (ABAC) | | | |
| Policy-based Queries | | | |
| Data Masking | | | |
| INGESTION | | | |
| APIs | | | |
| Optic API (geospatial-native) | | | |
| Programmable hooks | | | |
| Index-on-ingest | | | |
| Smart curation | | | |
| STORAGE | | | |
| Enterprise on-premises or cloud-native storage | | | |
| Encryption-at-rest (FIPS-compliant) | | | |
| SEARCH & DISCOVERY | | | |
| Semantic knowledge graph (SKG) | | | |
| Automatic metadata and schema curation | | | |
| Query languages: SQL, SPARQL, GraphQL, GeoSPARQL | | | |
Different data hubs (platforms) perform with varying degrees of capability depending on the data type. We believe the strongest performer across data types may be the MarkLogic Data Hub Platform, which performs exceptionally well on many dataset types, including geospatial data, which is uniquely important to many of the federal agencies we serve.
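To illustrate why native geospatial query support matters, here is a hedged sketch of a GeoSPARQL-style spatial query submitted from Python with the SPARQLWrapper library. The endpoint URL, prefixes, and feature properties are hypothetical, and exact GeoSPARQL function support varies by platform:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoint exposed by the data platform
ENDPOINT = "https://data-hub.example.org/sparql"

# Find facilities whose footprint falls inside a search polygon (WKT)
QUERY = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX ex:   <http://example.org/facilities#>

SELECT ?facility ?wkt WHERE {
  ?facility a ex:Facility ;
            geo:hasGeometry ?geom .
  ?geom geo:asWKT ?wkt .
  FILTER(geof:sfWithin(?wkt,
    "POLYGON((-77.2 38.7, -76.8 38.7, -76.8 39.1, -77.2 39.1, -77.2 38.7))"^^geo:wktLiteral))
}
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["facility"]["value"], row["wkt"]["value"])
```

Running the same spatial question against a platform without geospatial-native indexing typically means exporting the data and filtering it in separate GIS tooling, which is exactly the friction a strong data hub avoids.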
Figure 3. Commercial data hub platform features and capabilities.
NT Concepts specializes in developing and actioning data management plans for our DoD and Intelligence Community customers using both cloud-agnostic and cloud-native service offerings that fit an organization’s risk appetite and goals.
All trademarks, logos and brand names are the property of their respective owners. All company, product, and service names used in this article are for identification purposes only. Use of these names, trademarks, and brands does not imply endorsement.
Cloud Migration & Adoption Technical Lead NICK CHADWICK is obsessed with creating data-driven enterprises. With an impressive certification stack (CompTIA A+, Network+, Security+, Cloud+, Cisco, Nutanix, Microsoft, GCP, AWS, and CISSP), Nick is our resident expert on cloud computing, data management, and cybersecurity.