Building a Cognitive Data Management Practice
Data management is the core task of IT, yet it has been sadly undervalued and largely ignored for the past thirty years. However, the unprecedented growth of data – projected in the 10 to 60 zettabyte range by 2020 – has made larger firms and public cloud providers, concerned about how they will store all of that data cost-effectively, suddenly very interested in developing a data management strategy that combines the latest technologies into a serviceable cognitive data management capability. This paper looks at the drivers of, the requirements for, and the essential functionality of cognitive data management.
In a 2016 survey of IT planners conducted by a leading industry analyst, the top three technology initiatives cited by respondents fell within the domain of data management. Priority was assigned to:
- Improving data security and privacy, especially in the wake of hacking incidents, phishing incidents, and ransomware attacks in the 2015-2016 timeframe,
- Consolidating backup and disaster recovery planning, to ensure that critical and irreplaceable data assets were properly protected against natural and man-made interruption events, and
- Building better programs to ensure proper data governance in the face of changing legal, regulatory and business mandates.
Accomplishing any of these objectives will require a much improved data management capability, since each requires a more granular understanding of data and the selective application of managed data services to data of different classes.

In larger firms, and among the so-called “industrial farmers of the public cloud,” an even larger concern is how to cope with an explosion of data that analysts peg at between 10 and 60 zettabytes of new data (mainly files and objects, or so-called “unstructured” data) expected to hit cloud and big storage platforms within the next three years. Simply put, the annual output of the entire disk drive industry and the entire flash memory industry will deliver storage capacities of roughly 780 exabytes and 500 exabytes respectively – about 12% of the capacity needed at 10 zettabytes, and about 2% of the capacity demand generated by 60 zettabytes.

This data growth is the natural outgrowth of three trends:
- The increasing conversion of analog information (film, paper records, analog audio, books, etc.) into digital formats, and the explosion of data created by the growth of digital video in extremely high resolution formats (4K and beyond),
- The Internet of Things (IoT) and the collection of billions of data sets from sensors to fuel Big Data Analytics and data mining, and
- The rise of mobile commerce (m-commerce), which tends to generate 100 to 1,000 transactions in Systems of Record for every order placed by a consumer via a smartphone, tablet, laptop or other mobile computing platform.
Without the improved capacity utilization efficiency enabled by effective data management, these firms will need to discard large quantities of the data being amassed.

In certain industry verticals, including media and entertainment (M&E), the need to store and manage massive amounts of data in a manner that facilitates re-reference and reuse is not a new problem. This includes the ability to share data efficiently and securely for purposes including collaborative editing, packaging and repurposing for different delivery platforms, and ultimately preserving digital assets securely once projects are complete. The increasing production of creative visual and aural projects in digital formats has kept M&E on the cutting edge of technologies for data classification (metadata standards for different media, digital rights management, etc.) and policy-based management of creative workflows and their output. However, the preservation of data assets once production work is complete is still a challenge to be surmounted. This is especially the case in the M&E world, where data is rarely deleted and therefore must be hosted reliably and cost-effectively, literally, forever.

Coping with data growth and retention requirements in all industries will require a more complete data management strategy.
These trends help to explain why data management is suddenly front of mind for so many IT planners. Unfortunately, very few IT professionals have a clear notion of what data management is, or of how to apply a data-centric approach that overcomes the inherent complexities of storage-centric infrastructure. This paper is intended to help set the stage for building a cognitive data management strategy.
It is worth clarifying at the outset what is meant by “cognitive.” Truth be told, data management strategies in the past have involved the manual assignment of the right storage resource and the right protection, preservation, and privacy services to each file or object. Because of the effort required, it is not surprising that this task was demoted or eschewed completely by administrative personnel consumed with operations tasks. Moreover, with the movement of computing beyond the walls of the data center into both distributed on-premises platforms and public clouds, the data management discipline formerly practiced in mainframe operations has fallen into disarray. Mainframe data management tools, for the most part, did not transition into the distributed world, and the skills training provided to many contemporary IT practitioners has omitted data management best practices altogether.
Cognitive data management seeks to leverage the most recent advances in cognitive computing to automate the manual activities involved in data management. The intent is to reduce the administrative burden imposed by data management, to reduce the possibilities of error, and to streamline the stewardship of irreplaceable data assets that is the organizing principle of business information technology. Together with a storage resources management engine, storage services management engine, and a data management policy framework, the cognitive management platform is the centerpiece of contemporary data management strategy.
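To make that division of responsibilities concrete, the sketch below shows one way the three supporting engines and the cognitive platform might be separated as interfaces. The names and methods are purely hypothetical and do not reflect any particular product's API.

```python
# Hypothetical interfaces only -- a sketch of how responsibilities might be split,
# not the API of any actual product.
from abc import ABC, abstractmethod


class PolicyFramework(ABC):
    """Holds business- and data-facing policies and the data classification scheme."""
    @abstractmethod
    def policy_for(self, data_class: str) -> dict: ...


class StorageResourceManager(ABC):
    """Reports capacity, performance, and health of storage targets and paths."""
    @abstractmethod
    def resource_status(self) -> dict: ...


class StorageServiceManager(ABC):
    """Reports availability and load of protection, preservation, and privacy services."""
    @abstractmethod
    def service_status(self) -> dict: ...


class CognitiveManager:
    """Combines policy, resource, and service state to decide data placement."""
    def __init__(self, policies: PolicyFramework,
                 resources: StorageResourceManager,
                 services: StorageServiceManager):
        self.policies = policies
        self.resources = resources
        self.services = services
```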
Data Management defined
Data management, as a technical term, is poorly defined. As a marketing term, it is used to describe whatever product or service a vendor seeks to sell.
- Data Management as HSM: To some it means processes for migrating older, less frequently accessed and seldom modified data from high performance/high cost/limited capacity storage to lower performance/lower cost/greater capacity storage over time. This is also known as hierarchical storage management or HSM. Commonly used in M&E and HPC, this older approach becomes more difficult with large numbers of files and when working across different file systems and storage types.
- Data Management as ILM: Some regard data management as the efficient allocation of appropriate storage resources and services to data throughout its useful life. This interpretation derives from Information Lifecycle Management methodology or ILM.
- Data Management as Archive: Still others see data management as a process for placing data into archival storage based on curation practices, business policies, legal requirements or regulatory mandates governing data retention.
- Data Management as Global Namespace: Others regard data management simply as the placement of data's metadata and locational references (where the data is physically located on infrastructure) into a common object index or global namespace. The intention is to facilitate individual and collaborative data access whenever the data is needed.
- Data Management as Data Protection: Using data-centric policies to define how each file should be protected throughout its life cycle.
- Data Management as Storage Resource Management: Managing storage resources regardless of vendor or file system to simplify storage optimization and migrations.
These “definitions” each describe part of the functionality of data management, but they do not provide a definition. Data management isn’t a product or a process, but a practice, based on a deliberative strategy, designed to provision both the right storage resources and the right storage services to the right data at the right time based on business- and data-facing policies. Data management continues throughout the useful life of data assets and works dynamically to migrate data as workflow requirements and the status of data, resources, and services change over time.
Essentially, data management seeks to automate and make more efficient the four data-handling activities common to all IT operations: placement at creation, migration, archiving, and deletion.
Data management helps define what is initially done with data when it is created by end users or applications, and what storage is provisioned to store that data given its expected access and update frequency. As previously noted, data must then be migrated to different storage platforms over time as access and update frequency change (typically, file and object re-reference rates fall sharply within a comparatively short time after creation) in order to free up expensive high performance/low capacity storage for new data. Such data migrations optimize infrastructure expense and may ultimately lead to the placement of data in some sort of storage archive for data with longer term retention requirements, the third activity. And finally, when data reaches end of life, it needs to be deleted – though the requirements of M&E, oil and gas, and other verticals typically include the “eternal” preservation of data for future reuse.
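As a simple illustration of these four activities, the sketch below maps a file's age and re-reference behavior to a lifecycle action. The thresholds are invented for illustration; in practice they would come from business- and data-facing policies.

```python
from datetime import datetime, timedelta

# Illustrative thresholds only -- real values would come from policy.
MIGRATE_AFTER = timedelta(days=30)       # re-reference rates typically drop off quickly
ARCHIVE_AFTER = timedelta(days=365)      # long-term retention tier
DELETE_AFTER = timedelta(days=365 * 7)   # unless the vertical requires "eternal" retention


def lifecycle_action(created: datetime, last_accessed: datetime, now: datetime) -> str:
    """Return which of the four common activities applies to a file right now."""
    if now - created > DELETE_AFTER:
        return "delete"     # end of life (where policy allows deletion at all)
    if now - created > ARCHIVE_AFTER:
        return "archive"    # move to long-term retention storage
    if now - last_accessed > MIGRATE_AFTER:
        return "migrate"    # demote from the high-performance tier
    return "keep"           # leave on the tier provisioned at creation


# Example: a file created two years ago and untouched for six months should be archived.
action = lifecycle_action(created=datetime(2015, 1, 1),
                          last_accessed=datetime(2016, 7, 1),
                          now=datetime(2017, 1, 1))
```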
This tableau of common data storage activities provides a general description of the domain of data management. While it may seem to be straightforward and readily understood, in fact each set of activities is fraught with complexities.
To manage data creation, for example, planners must consider carefully the answers to questions including:
- What is the most cost-efficient way to platform data?
- What is the best way to meet data accessibility and mobility needs?
- How can we minimize latency to optimize application performance?
- How do we bend the capacity demand curve?
- How do we best protect irreplaceable data assets?
These are important issues that need to be resolved if organizations are to begin the process of data management with policies that optimize data accessibility, data use, and data protection, preservation and privacy.
Data migration, the second set of activities associated with data management, is also fraught with questions that need to be addressed by IT planners. These questions may include:
- How much does a tiered storage environment cost?
- Can I apply tiering policies between legacy storage with different storage types or file systems?
- How do I design for optimal data distribution with lowest latency?
- Who will design policies?
- What kinds of media should be included?
- How best to track files and objects as they move around infrastructure?
The answers to these questions suggest additional requirements for data management that may not be apparent from a simple review of data storage activity. At a minimum, providing a dynamic data migration capability requires policies that specify what data will be moved, which in turn requires an understanding of the data itself and some means to obtain up-to-date information on its re-reference rates and update frequency.
Additionally, we need to know the current status of storage resources (media, systems, interconnects, etc.) and storage services so that real time decisions can be made about the best target location to which data will be migrated so that capacity utilization efficiency and cost of storage is optimized. And, planners need to decide when the data will be moved: under what conditions or in the presence of what “trigger” events.
These are the four components of true Information Lifecycle Management, as defined by IBM in the early 1990s. ILM was seen to require a form of data classification (so you would know what to move), some form of storage classification (so you would know where to move it to), pre-defined policies (so you defined the trigger for moving the data) and a data mover (functionality for actually moving the data).
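The four components can be sketched as cooperating pieces of code: a data classifier (what to move), a storage classifier (where to move it), a policy (the trigger), and a data mover. Everything below is schematic and hypothetical; the file types, mount points, and thresholds are invented for illustration.

```python
import shutil
from pathlib import Path

# A schematic sketch of the four ILM components; names, paths, and rules are hypothetical.

def classify_data(path: Path) -> str:
    """Data classification: decide what kind of data this is (what to move)."""
    return "media" if path.suffix in {".mov", ".mxf", ".wav"} else "general"


def classify_storage(data_class: str, days_idle: int) -> str:
    """Storage classification: decide where data of this class belongs (where to move it)."""
    if data_class == "media" and days_idle > 180:
        return "/mnt/tier3_tape_cache"     # finished media assets move to retention sooner
    if days_idle > 365:
        return "/mnt/tier3_tape_cache"     # archive/retention tier
    if days_idle > 30:
        return "/mnt/tier2_capacity_disk"  # capacity tier
    return "/mnt/tier1_flash"              # performance tier


def policy_triggered(days_idle: int, threshold: int = 30) -> bool:
    """Policy: the pre-defined trigger for moving the data."""
    return days_idle >= threshold


def move_data(path: Path, target_dir: str) -> Path:
    """Data mover: the functionality that actually relocates the file."""
    return Path(shutil.move(str(path), target_dir))
```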
Conceptually, ILM is also similar to the methods employed by many storage vendors to define storage tiering architectures or hierarchical storage management (HSM) policies. HSM envisions several tiers of storage with different capacity, performance, and cost characteristics. Data is staged on each tier for a period of time, ultimately finding its way into an archive or retention tier. Research by Fred Moore at Horison Information Strategies has demonstrated that the cost to store a petabyte of data using only Tier 0/1 high performance/low capacity flash storage is about $1.5 million. Using Tiers 1 and 2 (flash and disk) reduces the cost to $1 million, while including Tier 3 storage (optical and tape) in the mix can drop the cost to $587,000.
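The savings come from the blended cost per petabyte across tiers. The per-TB prices and tier mix below are purely illustrative, not the figures behind the Horison research cited above; the point is simply to show how a blended cost is computed.

```python
# Illustrative blended-cost arithmetic with invented prices and tier mix.
cost_per_tb = {"flash": 1500, "disk": 400, "tape": 40}    # hypothetical $/TB
tier_mix    = {"flash": 0.10, "disk": 0.30, "tape": 0.60}  # fraction of 1 PB on each tier

blended_per_pb = sum(1000 * tier_mix[t] * cost_per_tb[t] for t in cost_per_tb)
print(f"Blended cost per PB: ${blended_per_pb:,.0f}")
# Compare with flash-only: 1000 TB * $1500/TB = $1.5M, the same order of magnitude
# quoted above for an all-flash petabyte.
```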
Before HSM or ILM or any other form of data migration process can be mounted, you also need to maintain a global index, inventory, or namespace for all objects and files that can be updated dynamically in terms of location and alternative access paths as the files or objects themselves are moved around between different storage platforms. So, part of handling data migration will also include the definition of a global namespace and a process for updating it with data status and location, storage status, storage service status (the services that will be applied to data per policy to ensure data privacy, protection and preservation), and the availability and congestion of all routes to the storage target (so the best path to data, the path with the lowest latency, can be selected). Deceptively simple on paper, managing data migrations can be a challenge in practice.
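One way to picture a namespace entry is as a record carrying exactly the fields listed above: data status and location, storage status, service status, and path information. The field names below are illustrative, not a defined schema.

```python
from dataclasses import dataclass, field


@dataclass
class NamespaceEntry:
    """Illustrative global-namespace record for a single file or object."""
    object_id: str            # stable identifier that survives migrations
    current_location: str     # where the data physically resides on infrastructure
    last_accessed: str        # feeds re-reference and update-frequency status
    data_class: str           # classification used to select the governing policy
    storage_status: str       # health/utilization of the hosting storage platform
    service_status: dict = field(default_factory=dict)  # protection/preservation/privacy services applied
    access_paths: list = field(default_factory=list)    # alternative routes, ranked by latency/congestion
```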
Tier 3 storage, consisting of high capacity disk and/or optical disc but mostly magnetic tape, is sometimes thought of as an archive tier. Archive is another set of processes under the data management umbrella intended to conform data retention and preservation to business requirements and legal/regulatory mandates. As with the other activities, archiving presents its own set of planning questions that need to be addressed as part of a data management strategy-building initiative, including:
- How expensive is archiving? Is it worth it?
- What is the best design for an archive?
- How do we migrate data into the archive, whether on premises or in a cloud?
- How do we adjust to changes in the technology platform, in data types, or in policies?
- How do we certify that our archive is compliant with legal, regulatory and business requirements?
- Can we integrate an “evergreen storage strategy” that enables auto-migrations between storage products regardless of vendor, file system or storage type? Such a data-centric approach to management can blur the lines between individual storage products.
Archive is especially interesting because the access and update characteristics of archived data make it a good candidate, in many cases, for cloud-based storage. Cloud service providers have begun offering archive to disk or tape as a core offering, since the profitability of such a service is potentially quite high (low cost for storage media and low touch/low administration overhead). However, going to the cloud for archive also widens the set of questions that must be addressed by planners.

Finally, data must eventually be deleted. While the thought of data deletion may be anathema to auditors, the fact is that most data has value for only a limited period of time. Beyond that time, it is occupying space on expensive storage media and, in some cases, retaining a “memory” that for legal reasons might be better forgotten. In industries where data is preserved forever, deletion of primary assets may not be required of a managed data solution. However, in virtually all environments, datasets tend to be replicated many times during their useful life; it has been estimated that tens of copies may be created of any file or object, each occupying additional expensive capacity. So a file deletion policy may need to consider, at a minimum, how to locate all copies of a file so they can be deleted once the original is preserved. Again, incorporating data deletion into your data management strategy requires answering a number of questions that include:
- How do I ensure that my data deletion policy complies with regulations and laws?
- How do I ensure that all copies of data are deleted at end of life?
- What is the best approach to identifying what data to delete and when to delete it?
- Is there a way to reclaim media after deletion?
- Can data deletion be automated safely or is human operator review always required?
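As noted above, one practical prerequisite for deletion is finding every copy of a file, not just the original. A common approach, sketched below with hypothetical mount points, is to group files by content hash so that all replicas can be identified and acted on together.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash file contents so copies can be matched regardless of name or location."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_copies(roots: list) -> dict:
    """Group every file under the given roots by content hash."""
    copies = defaultdict(list)
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                copies[content_hash(path)].append(path)
    return copies


# Usage with hypothetical mount points: any hash with more than one path is a duplicate set
# that a deletion policy would need to handle as a unit.
duplicates = {h: paths
              for h, paths in find_copies(["/mnt/tier1", "/mnt/tier2"]).items()
              if len(paths) > 1}
```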
The moving parts of Data Management
As the preceding overview suggests, data management has many moving parts. Achieving the goal of professional and effective data management requires more than attention to the data itself; it also requires careful attention to the management of storage resources and storage services. Data management policies should specify what data is being acted upon, where that data is best stored to satisfy access, update and perhaps regulatory requirements (storage resources), and how it should be protected while it is in that storage location (which storage services should be used). At the same time, gating factors in designing management policies include the governance, risk and compliance policies of the business, the special demands of workloads, the inventory of resources and services in the infrastructure, and, of course, the budget for doing anything at all.
Wrangling millions or billions of files and objects, and provisioning them individually with storage resources and services, would be a Herculean effort. In most firms, IT planners have leveraged vendor hardware designs and deployment architectures to establish tiered platforms and logical migration paths for data. Unfortunately, data management-by-platform does not scale very well.
Management by data type is another “simplified data management” approach: All files associated with a specific application or user role are managed via a common policy. Unfortunately, many productivity applications do not generate files whose name or extension readily identifies the creator of the file or the business process that the file serves. Asking end users to add more description to their files prior to saving is the technical equivalent of pulling teeth. So, management by data type has limited efficacy.
These days, the serious data management strategy leverages metadata (data about the file or object that is stored with the file/object itself) as a means to classify data so that it can be managed via policy. Techniques range from using minimal metadata detail to classify a file so that a policy can be applied to aggregating multiple types of metadata to add increasing intelligence about the data and thus achieve much higher granularity in data classification.
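In practice this means mapping whatever metadata is available for a file or object to a policy class; the more metadata fields that are aggregated, the finer the classification. The rules and field names below are hypothetical and would in reality come from the organization's governance and workflow requirements.

```python
# Hypothetical metadata-to-policy-class mapping.
def classify(metadata: dict) -> str:
    """Map file/object metadata to a data class used by the policy framework."""
    if metadata.get("contains_pii"):
        return "restricted"            # privacy services mandatory
    if metadata.get("project_status") == "completed" and metadata.get("department") == "post-production":
        return "preserve-forever"      # e.g. M&E assets that are never deleted
    if metadata.get("record_type") == "transaction":
        return "regulated-retention"   # subject to legal/regulatory retention mandates
    return "general"


# Minimal metadata yields a coarse class; aggregating more fields (owner, project,
# rights, retention mandate) yields the higher-granularity classes described above.
example_class = classify({"contains_pii": False, "record_type": "transaction"})
```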
However data is classified, the classification system needs to become part of a data management policy framework. The framework is part index/namespace, providing a location for storing file/object information, metadata, status and location. The framework also needs to take status information from storage resource management and storage service management monitoring engines so that storage can be classified and service broker burdening can be understood. Only in that way can policies be executed by either human or cognitive machine managers.
The policy management framework, storage resource management engine, and storage services management engine may be purchased as separate components from different vendors, creating an integration challenge and possible lock-ins in terms of supported hardware or service software. Fortunately, work is proceeding on unified data management technology that incorporates the functionality of all three component sets.
The jewel in the crown, of course, is the cognitive management platform. Cognitive data management uses cognitive computing to augment and automate many of the repetitious and time-consuming data management activities previously performed by a human actor.
Cognitive data management reflects the dynamism of data management and the need to respond to constant changes in the status of data itself (file/object utilization metrics and retention requirements), the status of storage services (the burdening and proximity of storage service brokers), and the status of storage targets and pathways (resources). The greater the volume of data and the more complex and heterogeneous the storage infrastructure, the more challenging the tasks of data management. Automation is required for success and cognitive computing provides a technology with promise for automating the data management task.
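The sketch below shows, with placeholder names, what that automated loop might look like: poll the three kinds of status, re-evaluate policy for any data whose status has changed, and act. It is a schematic outline of the automation described above, not a real product's interface.

```python
import time

# Placeholder control loop -- all object methods are assumed, not a real product API.
def cdm_loop(namespace, resources, services, policies, interval_seconds: int = 300):
    """Continuously reconcile data placement and services with policy as statuses change."""
    while True:
        resource_state = resources.resource_status()  # status of storage targets and pathways
        service_state = services.service_status()     # burdening/proximity of service brokers
        for entry in namespace.changed_entries():     # files/objects whose utilization or retention status changed
            action = policies.evaluate(entry, resource_state, service_state)
            if action is not None:
                action.execute()                      # e.g. migrate, apply a service, archive, delete
        time.sleep(interval_seconds)
```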
Into the future
Cognitive Data Management is about instrumenting the infrastructure and the data for policy-based management. Metaphorically, CDM can be viewed as a nervous system for IT. Management APIs, providers and MIBs act as afferent neurons that monitor the condition of infrastructure components and services. Some APIs and providers also offer the means to act upon the monitored devices or services, calling them into action or retiring their function with respect to certain data; these are similar to efferent neurons. These can feed impulses directly back to the neural network or cognitive computing engine (the brain), or they can provide functions in a more local way via clusters of neurons (nerve fibres, nerve bundles or ganglia).
In operation, whether centralized or federated, this cognitive management platform continuously monitors and updates its namespace and inventories and actuates infrastructure based on stored policies. Below is a very high level topology for a CDM infrastructure.
Defining a workable CDM for your organization will require a cooperative effort by data creators (line of business managers, project managers, etc.), IT operatives, governance, risk and compliance management, technology vendors, and others. It will also require senior management buy-in to the vision of CDM that can be obtained by providing a full business value case for the initiative.
- Planners need to identify the contribution that a data managed environment will deliver to cost-containment (bending the storage acquisition cost curve and managing data across storage assets in the most cost-efficient manner).
- They will need to help management see the risk reduction value of CDM, in terms of downtime reduction (data is accorded the correct protection services), regulatory and legal compliance (data is preserved and secured in an intelligent manner), and protection against early technology investment obsolescence (the CDM solution should be hardware agnostic to insulate it from technology changes in storage or services).
- And finally, planners need to emphasize the improved productivity value of CDM. This may be the easiest narrative to present, since only with managed data can airy goals such as IT agility, as well as practical goals like easier data collaboration and sharing, be achieved.
Cognitive Data Management is a strategy whose time has come. It is a strategic initiative that should be considered and prioritized today.