SQL Troubles: February 2017

28 February 2017

Data Management: Data Lifecycle (Definitions)

[Data Lifecycle Management (DLM):" "The process by which data is moved to different mass storage devices based on its age." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

[master data lifecycle management:] "Supports the definition, creation, access, and management of master data. Master data must be managed and leveraged effectively throughout its entire lifecycle." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

[Data lifecycle management (DLM):] "Managing data as blocks without underlying knowledge of the content of the blocks, based on limited metadata (e.g., creation date, last accessed)." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"The data life cycle is the set of processes a dataset goes through from its origin through its use(s) to its retirement. Data that moves through multiple systems and multiple uses has a complex life cycle. Danette McGilvray’s POSMAD formulation identifies the phases of the life cycle as: planning for, obtaining, storing and sharing, maintaining, applying, and disposing of data." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The recognition that as data ages, that data takes on different characteristics" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The development of a record in the company’s IT systems from its creation until its deletion. This process may also be designated as 'CRUD', an acronym for the Create, Read/Retrieve, Update and Delete database operations." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The series of stages that data moves though from initiation, to creation, to destruction. Example: the data life cycle of customer data has four distinct phases and lasts approximately eight years." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"covers the period of time from data origination to the time when data are no longer considered useful or otherwise disposed of. The data lifecycle includes three phases, the origination phase during which data are first collected, the active phase during which data are accumulating and changing, and the inactive phase during which data are no longer expected to accumulate or change, but during which data are maintained for possible use." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"The complete set of development stages from creation to disposal, each with its own characteristics and management responsibilities, through which organizational data assets pass." (Kevin J Sweeney, "Re-Imagining Data Governance", 2018)

"An illustrative phrase describing the many manifestations of data from its raw, unanalyzed state, such as survey data, to intellectual property, such as blueprints." (Sue Milton, "Data Privacy vs. Data Security", 2021)

"Refers to all the stages in the existence of digital information from creation to destruction. A lifecycle view is used to enable active management of the data objects and resource over time, thus maintaining accessibility and usability." (CODATA)

Data Warehousing: Data Load Optimization – A Success Story

Data Warehousing Series

Introduction

This topic has been waiting in the queue for almost two years already - since I finished optimizing an already existing relational data warehouse within a SQL Server 2012 Enterprise Edition environment. Through various simple techniques I managed then to reduce the running time for the load process by more than 65%, from 9 to 3 hours. It’s a considerable performance gain, considering that I didn’t have to refactor any business logic implemented in queries.

The ETL (Extract, Transform, Load) solution was making use of SSIS (SQL Server Integration Services) packages to load data sequentially from several sources into staging tables, and from stating further into base tables. Each package was responsible for deleting the data from the staging tables via TRUNCATE, extracting the data 1:1 from the source into the staging tables, then loading the data 1:1 from the staging table to base tables. It’s the simplest and a relatively effective ETL design I also used with small alterations for data warehouse solutions. For months the data load worked smoothly, until data growth and eventually other problems increased the loading time from 5 to 9 hours.

Using TABLOCK Hint

Using SSIS to bulk load data into SQL Server provides an optimum of performance and flexibility. Within a Data Flow, when “Table Lock” property on the destination is checked, it implies that the insert records are minimally logged, speeding up the load by a factor of two. The TABLOCK hint can be used also for other insert operations performed outside of SSIS packages. At least in this case the movement of data from staging into base tables was performed in plain T-SQL, outside of SSIS packages. Also further data processing had benefitted from this change. Only this optimization step alone provided 30-40% performance gain.

Drop/Recreating the Indexes on Big Tables

As the base tables were having several indexes each, it proved beneficial to drop the indexes for the big tables (e.g. with more than 1000000 records) before loading the data into the base tables, and recreate the indexes afterwards. This was done within SSIS, and provided an additional 20-30% performance gain from the previous step.

Consolidating the Indexes

Adding missing indexes, removing or consolidating (overlapping) indexes are typical index maintenance tasks, apparently occasionally ignored. It doesn’t always bring much performance as compared with the previous methods, though dropping and consolidating some indexes proved to be beneficial as fewer data were maintained. Data processing logic benefited from the creation of new indexes as well.

Running Packages in Parallel

As the packages were run sequentially (one package at a time), the data load was hardly taking advantage of the processing power available on the server. Even if queries could use parallelism, the benefit was minimal. Enabling packages run in parallel added additional performance gain, however this minimized the availability of processing resources for other tasks. When the data load is performed overnight, this causes minimal overhead, however it should be avoided when the data are loading to business hours.

Using Nonclustered Indexes

In my analysis I found out that many tables, especially the ones storing prepared data, were lacking a clustered index, even if further indexes were built on them. I remember that years back there was a (false) myth that fact and/or dimension tables don’t need clustered indexes in SQL Server. Of course clustered indexes have downsides (e.g. fragmentation, excessive key-lookups) though their benefits exceed by far the downsides. Besides missing clustered index, there were cases in which the tables would have benefited from having a narrow clustered index, instead of a multicolumn wide clustered index. Upon case also such cases were addressed.

Removing the Staging Tables

Given the fact that the source and target systems are in the same virtual environment, and the data are loaded 1:1 between the various layers, without further transformations and conversions, one could load the data directly into the base tables. After some tests I came to the conclusion that the load from source tables into the staging table, and the load from staging table into base table (with TABLOCK hint) were taking almost the same amount of time. This means that the base tables will be for the same amount of the time unavailable, if the data were loaded from the sources directly into the base tables. Therefore one could in theory remove the staging tables from the architecture. Frankly, one should think twice when doing such a change, as there can be further implications in time. Even if today the data are imported 1:1, in the future this could change.

Reducing the Data Volume

Reducing the data volume was identified as a possible further technique to reduce the amount of time needed for data loading. A data warehouse is built based on a set of requirements and presumptions that change over time. It can happen for example that even if the reports need only 1-2 years’ worth of data, the data load considers a much bigger timeframe. Some systems can have up to 5-10 years’ worth of data. Loading all data without a specific requirement leads to waste of resources and bigger load times. Limiting the transactional data to a given timeframe can make a considerable difference. Additionally, there are historical data that have the potential to be archived.

There are also tables for which a weekly or monthly refresh would suffice. Some tables or even data sources can become obsolete, however they continue to be loaded in the data warehouse. Such cases occur seldom, though they occur. Also some unused or redundant column could have been removed from the packages.

Further Thoughts

There are further techniques to optimize the data load within a data warehouse like partitioning large tables, using columnstore indexes or optimizing the storage, however my target was to provide maximum sufficient performance gain with minimum of effort and design changes. Therefore I stopped when I considered that the amount of effort is considerable higher than the performance gain.

Further Reading:
[1] TechNet (2009) The Data Loading Performance Guide, by Thomas Kejser, Peter Carlin & Stuart Ozer (link)
[2] MSDN (2010) Best Practices for Data Warehousing with SQL Server 2008 R2, by Mark Whitehorn, Keith Burns & Eric N Hanson (link)
[3] MSDN (2012) Whitepaper: Fast Track Data Warehouse Reference Guide for SQL Server 2012, by Eric Kraemer, Mike Bassett, Eric Lemoine & Dave Withers (link)
[4] MSDN (2008) Best Practices for Data Warehousing with SQL Server 2008, by Mark Whitehorn & Keith Burns (link)
[5] TechNet (2005) Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server, by Gandhi Swaminathan (link)
[6] SQL Server Customer Advisory Team (2013) Top 10 Best Practices for Building a Large Scale Relational Data Warehouse (link)

23 February 2017

Data Management: Data Integration (Definitions)

"The process of coherently using data from across platforms. applications or business units. Data integration ensures that data from different sources is merged allowing silos of data to be combined." (Tony Fisher, "The Data Asset", 2009)

"The planned and controlled:
a) merge using some form of reference,
b) transformation using a set of business rules, and
c) flow of data from a source to a target, for operational and/or analytical use. Data needs to be accessed and extracted, moved, validated and cleansed, standardized, transformed, and loaded. (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The collection of data from various sources with the same significance into one uniform record. This data may be physically integrated, for example, into a data warehouse or virtually, meaning that the data will remain in the source systems, however will be accessed using a uniform view." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Data integration comprises the activities, techniques, and tools required to consolidate and harmonize data from different (multiple) sources into a unified view. The processes of extract, transform, and load (ETL) are part of this discipline." (Piethein Strengholt, "Data Management at Scale", 2020)

"Pulling together and reconciling dispersed data for analytic purposes that organizations have maintained in multiple, heterogeneous systems. Data needs to be accessed and extracted, moved and loaded, validated and cleaned, and standardized and transformed." (Information Management)

"The combination of technical and business processes used to combine data from disparate sources into meaningful insights." (Solutions Review)

"The process of retrieving and combining data from different sources into a unified set for users, organizations, and applications." (MuleSoft)

"Data integration is the practice of consolidating data from disparate sources into a single dataset with the ultimate goal of providing users with consistent access and delivery of data across the spectrum of subjects and structure types, and to meet the information needs of all applications and business processes." (OmiSci) [source]

"Data integration is the process of combining data from multiple source systems to create unified sets of information for both operational and analytical uses." (Techtarget)

"Data integration is the process of bringing data from disparate sources together to provide users with a unified view. The premise of data integration is to make data more freely available and easier to consume and process by systems and users." (Tibco) [source]

"Data integration is the process of retrieving and combining data from different sources into a unified set of data. Data integration can be used to combine data for users, organizations, and applications." (kloudless)

"Data integration is the process of taking data from multiple disparate sources and collating it in a single location, such as a data warehouse. Once integrated, data can then be used for detailed analytics or to power other enterprise applications." (Xplenty) [source]

"Data integration is the process used to combine data from disparate sources into a unified view that can provide valuable and actionable information." (snowflake) [source]

"Data integration refers to the technical and business processes used to combine data from multiple sources to provide a unified, single view of the data." (OmiSci) [source]

"The discipline of data integration comprises the practices, architectural techniques and tools for achieving the consistent access and delivery of data across the spectrum of data subject areas and data structure types in the enterprise to meet the data consumption requirements of all applications and business processes." (Gartner)

Data Management: Data Cleaning/Cleansing (Definitions)

"A processing step where missing or inaccurate data is replaced with valid values." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"The process of validating data prior to a data analysis or Data Mining. This includes both ensuring that the values of the data are valid for a particular attribute or variable (e.g., heights are all positive and in a reasonable range) and that the values for given records or set of records are consistent." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"The process of correcting errors or omissions in data. This is often part of the extraction, transformation, and loading (ETL) process of extracting data from a source system, usually before attempting to load it into a target system. This is also known as data scrubbing." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The removal of inconsistencies, errors, and gaps in source data prior to its incorporation into data warehouses or data marts to facilitate data integration and improve data quality." (Steve Williams & Nancy Williams, "The Profit Impact of Business Intelligence", 2007)

"Software used to identify potential data quality problems. For example, if a customer is listed multiple times in a customer database using variations of the spelling of his or her name, the data cleansing software ensures that each data element is consistent so there is no confusion. Such software is used to make corrections to help standardize the data." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"The process of reviewing and improving data to make sure it is correct, up to date, and not duplicated." (Tony Fisher, "The Data Asset", 2009)

"The process of correcting data errors to bring the level of data quality to an acceptable level for the information user needs." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The act of detecting and removing and/or correcting data in a database. Also called data scrubbing." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures", 2012)

"Synonymous with data fixing or data correcting, data cleaning is the process by which errors, inexplicable anomalies, and missing values are somehow handled. There are three options for data cleaning: correcting the error, deleting the error, or leaving it unchanged." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The process of detecting, removing, or correcting incorrect data." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"The process of finding and fixing errors and inaccuracies in data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The process of removing corrupt, redundant, and inaccurate data in the data governance process." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"The process of eliminating inaccuracies, irregularities, and discrepancies from data." (Jim Davis & Aiman Zeid, "Business Transformation", 2014)

"The process of reviewing and revising data in order to delete duplicates, correct errors, and provide consistency." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"the processes of identifying and resolving potential data errors." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"A sub-process in data preprocessing, where we remove punctuation, stop words, etc. from the text." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"Processing a dataset to make it easier to consume. This may involve fixing inconsistencies and errors, removing non-machine-readable elements such as formatting, using standard labels for row and column headings, ensuring that numbers, dates, and other quantities are represented appropriately, conversion to a suitable file format, reconciliation of labels with another dataset being used (see data integration)." (Open Data Handbook)

"The process of detecting and correcting faulty records, leading to highly accurate BI-informed decisions, as enormous databases and rapid acquisition of data can lead to inaccurate or faulty data that impacts the resulting BI and analysis. Correcting typographical errors, de-duplicating records, and standardizing syntax are all examples of data cleansing." (Insight Software)

"Transforming data in its native state to a pre-defined standardized format using vendor software." (Solutions Review)

"Data cleansing is the effort to improve the overall quality of data by removing or correcting inaccurate, incomplete, or irrelevant data from a data system. […] Data cleansing techniques are usually performed on data that is at rest rather than data that is being moved. It attempts to find and remove or correct data that detracts from the quality, and thus the usability, of data. The goal of data cleansing is to achieve consistent, complete, accurate, and uniform data." (Informatica) [source]

"Data cleansing is the process of modifying data to improve accuracy and quality." (Xplenty) [source]

"Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted." (Sisense) [source]

"Data Cleansing (or Data Scrubbing) is the action of identifying and then removing or amending any data within a database that is: incorrect, incomplete, duplicated." (experian) [source]

"Data cleansing, or data scrubbing, is the process of detecting and correcting or removing inaccurate data or records from a database. It may also involve correcting or removing improperly formatted or duplicate data or records. Such data removed in this process is often referred to as 'dirty data'. Data cleansing is an essential task for preserving data quality." (Teradata) [source]

"Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated." (Techtarget) [source]

"the process of reviewing and revising data in order to delete duplicates, correct errors and provide consistency." (Analytics Insight)

21 February 2017

Data Quality Dimensions: Validity (Definitions)

"A characteristic of the data collected that indicates they are sound and accurate." (Teri Lund & Susan Barksdale, "10 Steps to Successful Strategic Planning", 2006)

"Implies that the test measures what it is supposed to." (Robert McCrie, "Security Operations Management" 2nd Ed., 2006)

"The determination that values in the field are or are not within a set of allowed or valid values. Measured as part of the Data Integrity Fundamentals data quality dimension." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"A data quality dimension that reflects the confirmation of data items to their corresponding value domains, and the extent to which non-confirmation of certain items affects fitness to use. For example, a data item is invalid if it is defined to be integer but contains a non-integer value, linked to a finite set of possible values but contains a value not included in this set, or contains a NULL value where a NULL is not allowed." (G Shankaranarayanan & Adir Even, "Measuring Data Quality in Context", 2009)

"An aspect of data quality consisting in its steadiness despite the natural process of data obsolescence increasing in time." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the degree of conformance of data to its domain values and business rules." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Validity is a dimension of data quality, defined as the degree to which data conforms to stated rules. As used in the DQAF, validity is differentiated from both accuracy and correctness. Validity is the degree to which data conform to a set of business rules, sometimes expressed as a standard or represented within a defined data domain." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Validity is defined as the extent to which data corresponds to reference tables, lists of values from golden sources documented in metadata, value ranges, etc." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"the state of consistency between a measurement and the concept that a researcher intended to measure." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

[semantic validity:] "The compliance of attribute data to rules regarding consistency and truthfulness of association." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

[syntactic validity:] "The compliance of attribute data to format and grammar rules." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"Validity is a data quality dimension that refers to information that doesn’t conform to a specific format or doesn’t follow business rules." (Precisely) [source]

Data Management: Master Data Management (Definitions)

"The set of disciplines and methods to ensure the currency, meaning, and quality of a company’s reference data that is shared across various systems and organizations." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"The framework of processes and technologies aimed at creating and maintaining an authoritative, reliable, sustainable, accurate, and secure data environment that represents a “single version of truth,” an accepted system of record used both intra- and interenterprise across a diverse set of application systems, lines of business, and user communities. (Alex Berson & Lawrence Dubov, "Master Data Management and Customer Data Integration for a Global Enterprise", 2007)

"Centralized facilities designed to hold master copies of shared entities, such as customers or products. MDM systems are meant to support transaction systems and usually have some means to reconcile different sources for the same attribute." (Ralph Kimball, "The Data Warehouse Lifecycle Toolkit", 2008)

"Processes that control management of master data values to enable consistent, shared, contextual use across systems, of the most accurate, timely, and relevant version of truth about essential business entities." (DAMA International, "The DAMA Guide to the Data Management Body of Knowledge" 1st Ed., 2009)

"The guiding principles and technology for maintaining data in a manner that can be shared across various systems and departments throughout an organization." (Tony Fisher, "The Data Asset", 2009)

"The processes and tools to help an organization consistently define and manage core reference or descriptive data across the organization. This may involve providing a centralized view of the data to ensure that its use for all business processes is consistent and accurate." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"Master Data Management comprises a set of processes and tools that consistently define and manage the nontransactional data entities of an organization such as Customers and Products." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A discipline that resolves master data to maintain the golden record, the holistic and panoramic view of master entities and relationships, and the benchmark for master data that can be used across the enterprise, and sometimes between enterprises to facilitate data exchanges." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"A system and services for the single, authoritative source of truth of master data for an enterprise." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"MDM is a set of disciplines, processes, and technologies for ensuring the accuracy, completeness, timeliness, and consistency of multiple domains of enterprise data across applications, systems, and databases, and across multiple business processes, functional areas, organizations, geographies, and channels." (Dan Power, "Moving Master Data Management into the Cloud", 2010)

"The processes and tools to help an organization consistently define and manage core reference or descriptive data " (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"The set of codes and structures that identify and organize data, such as customer numbers, employee IDs, and general ledger account numbers." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

"An information quality activity in which the data elements that are used by multiple systems in an organization are identified, managed, and controlled at the enterprise level." (John R Talburt, "Entity Resolution and Information Quality", 2011)

"Data that is key to the operation of a business, such as data about customers, suppliers, partners, products, and materials." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"In business, master data management comprises the processes, governance, policies, standards, and tools that consistently define and manage the critical data of an organization to provide a single point of reference." (Keith Holdaway, "Harness Oil and Gas Big Data with Analytics", 2014)

"The data that describes the important details of a business subject area such as customer, product, or material across the organization. Master data allows different applications and lines of business to use the same definitions and data regarding the subject area. Master data gives an accurate, 360-degree view of the business subject." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"A new trend in IT, Master Data Management is composed of processes and tools that ultimately help an organization define and manage its master data." (Andrew Pham et al, "From Business Strategy to Information Technology Roadmap", 2016)

'It is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets. Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts." (Richard T Herschel, "Business Intelligence", 2019)

"The most critical data is called master data and the companioned discipline of master data management, which is about making the master data within the organization accessible, secure, transparent, and trustworthy." (Piethein Strengholt, "Data Management at Scale", 2020)

"An umbrella term that incorporates processes, policies, standards, tools and governance that define and manage all of an organization’s critical data in order to formulate one point of reference." (Solutions Review)

"The technology, tools, and processes required to create and maintain consistent and accurate lists of master data of an organization." (Microsoft)

"Master data management solutions provide the capabilities to create the unique and qualified reference of shared enterprise data, such as customer, product, supplier, employee, site, asset, and organizational data." (Forrester)

"Master data management (MDM) is the effort made by an organization to create one single master reference source for all critical business data, leading to fewer errors and less redundancy in business processes." (Informatica) [source]

"Master data management (MDM) is the process of defining, managing, and making use of an organization’s master data so it is visible and accessible via a single reference point." (MuleSoft) [source]

"Master data management (MDM) is the process of making sure an organization is always working with, and making decisions based on, one version of current, ‘true’ data - often referred to as a 'golden record'." (Talend) [source]

20 February 2017

Data Quality Dimensions: Timeliness (Definitions)

"Coming early or at the right, appropriate or adapted to the times or the occasion." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

[timeliness & availability] "A data quality dimension that measures the degree to which data are current and available for use as specified, and in the time frame in which they are expected." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"the ability of a task to repeatedly meet its timeliness requirements." (Bruce P Douglass, "Real-Time Agility: The Harmony/ESW Method for Real-Time and Embedded Systems Development", 2009)

"A pragmatic quality characteristic that is a measure of the relative availability of data to support a given process within the timetable required to perform the process." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"1.The degree to which available data meets the currency requirements of information consumers. 2.The length of time between data availability and the event or phenomenon they describe." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Timeliness is a dimension of data quality related to the availability and currency of data. As used in the DQAF, timeliness is associated with data delivery, availability, and processing. Timeliness is the degree to which data conforms to a schedule for being updated and made available. For data to be timely, it must be delivered according to schedule." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The degree to which the model contains elements that reflect the current version of the world Transitive Relation When a relation R is transitive then if R links entity A to entity B, and entity B to entity C, then it also links A to C." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"Length of time between data availability and the event or phenomenon they describe." (SDMX)

Data Management: Data Security (Definitions)

"The protection of data from disclosure, alteration, destruction, or loss that either is accidental or is intentional but unauthorized. (Network Working Group, "RFC 4949: Internet Security Glossary", 2007)

"An area of information security focused on the protection of data from either accidental or unauthorized intentional viewing, modification, destruction, duplication, or disclosure during input, processing, storage, transmission, or output operations. Data security deals with data that exists in two modes: data-in-transit and data-at-rest." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"1.The safety of data from unauthorized and inappropriate access or change. 2.The measures taken to prevent unauthorized access, use, modification, or destruction of data." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[Data Security Managemen:] "The process of ensuring that data is safe from unauthorized and inappropriate access or change. Includes focus on data privacy, confidentiality, access, functional capabilities and use." (DAMA International, "The DAMA Dictionary of Data Management" 1st Et., 2010)

"Protection against illegal or wrongful intrusion. In the IT world, intrusion concerns mostly deal with gaining access to user and company data." (Peter Sasvari & Zoltán Nagymate, "The Empirical Analysis of Cloud Computing Services among the Hungarian Enterprises", 2015)

"Linked to data privacy rights, the term refers to the IT mechanisms to protect data through defined processes, filters, fire walls, encryption-in-transit, etc." (Beatriz Arnillas, "Tech-Savvy Is the New Street Smart: Balancing Protection and Awareness", 2019)

"The processes and technologies that ensure that sensitive and confidential data about an organization are kept secure according to the organization’s policies." (Lili Aunimo et al, "Big Data Governance in Agile and Data-Driven Software Development: A Market Entry Case in the Educational Game Industry", 2019)

"The process of protecting the availability, integrity, and privacy of information from undesired actions." (Zerin M Khan, "How Do Mobile Applications for Cancer Communicate About Their Privacy Practices?: An Analysis of Privacy Policies", 2021)

"Data security can be described as the set of policies, processes, procedures, and tools that IT organizations implement to prevent unauthorized access to their networks, servers, data storage and any other on-premise or cloud-based IT infrastructure." (Sumo Logic) [source]

"Data security comprises the processes and associated tools that protect sensitive information assets, either in transit or at rest. Data security methods include:
• Encryption (applying a keyed cryptographic algorithm so that data is not easily read and/or altered by unauthorized parties)
• Masking (substituting all or part of a high-value data item with a low-value representative token)
• Erasure (ensuring that data that is no longer active or used is reliably deleted from a repository)
• Resilience (creating backup copies of data so that organizations can recover data should it be erased or corrupted accidentally or stolen during a data breach)." (Gartner)

[Data security and privacy technology] "Technologies that directly touch the data itself and that help organizations: 1) understand where their data is located and identify what data is sensitive; 2) control data movement as well as introduce data-centric controls that protect the data no matter where it is; and 3) enable least privilege access and use. This still encompasses a wide range of technologies." (Forrester)

"Is the protection of data from unauthorized (accidental or intentional) modification, destruction, or disclosure." (MISS-DND)

"The capability of the software product to protect programs and data from unauthorized access, whether this is done voluntarily or involuntarily." (ISO 9126)

"The degree to which a collection of data is protected from exposure to accidental or malicious alteration or destruction." (IEEE 610.5-1990)

"Those controls that seek to maintain confidentiality, integrity and availability of information." (ISACA)

18 February 2017

Data Management: Data Migration (Definitions)

"The process of extracting data from operational systems to a data warehouse with minimal effect on the source systems, and the transformation of the source data into a format consistent with the design and requirements of the data warehouse." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"The movement of data from one storage system to another." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"Form- and function-preserving movement of data between locations or formats, which can take one of three forms: (1) from one physical system or location to another; (2) from one physical format to another; or (3) from one logical format to another." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"The process of migrating data to or from an application such as moving data from a sales force automation tool to a data warehouse." (Tony Fisher, "The Data Asset", 2009)

"The process of transferring data from one database to another." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Moving a specific set of data from one system to the other. It contains a source system where the data resides prior to the migration, criteria determining the scope of data to be migrated, a transformation that the data will go through, and a destination system where the data will be inserted after." (MuleSoft)

"The process of moving data between two or more storage systems, data formats, warehouses or servers." (Solutions Review)

"A Data Migration is the process of transferring data between computer storage systems or file formats. They greatly range in scale and can often be significant projects including numerous preparation and reconciliation activities." (experian) [source]

"Data migration is the process of moving data between different types of storage or file formats. Cloud data migration refers to transferring or replicating data from on-premise systems to cloud-based storage." (Qlik) [source]

"Data migration is the process of moving a specific set of data from one system to the other. One system is the source system where the data resides prior to the migration, and one system is the destination system where the data will be inserted after." (kloudless)

"Data migration is the process of transferring data between data storage systems, data formats or computer systems." (Techtarget) [source]

"Data migration is the process of transferring data from repository to another." (Information Management)

15 February 2017

Data Management: Data Architecture (Definitions)

"Data Architecture is the design of data for use in defining the target state and the subsequent planning needed to hit the target state. Data architecture includes topics such as database design, information integration, metadata management, business semantics, data modeling, metadata workflow management, and archiving." (Martin Oberhofer et al, "Enterprise Master Data Management", 2008)

"Describes how data is organized and structured to support the development, maintenance, and use of the data by application systems. This includes guidelines and recommendations for historical retention of the data, and how the data is to be used and accessed." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"the organized arrangement of components to optimize the function, performance, feasibility, cost, and/or aesthetics of an overall structure." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The logical-data architecture describes the specific data elements held by the team in a platform-agnostic and business-friendly manner. It plots out the specific tables, fields, and relationships within the team’s data assets and is usually fully normalized to minimize redundancy and represents the highest level of design efficiency possible." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"The physical-data architecture is the lowest level of detail in data architecture. It describes how the logical architecture is actually implemented within the data mart and describes elements by their technical (rather than business) names." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Defines a company-wide, uniform model of corporate data (the corporate data model). It also describes the architecture for the distribution and retention of data. This describes which data will be stored in which systems, which systems are single sources of truth for which data objects or attributes and the flow of data between the systems." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"One of the layers of the enterprise architecture (EA) that focuses on the IT data architecture side, both for transactional and business intelligence IT data architecture." (David K Pham, "From Business Strategy to Information Technology Roadmap", 2016)

"The discipline, methods, and outputs related to understanding data, its meaning and relationships." (Gregory Lampshire et al, "The Data and Analytics Playbook", 2016)

"Models, policies, and guidelines that structure how data are collected, stored, used, managed, and integrated within an organization." (Jonathan Ferrar et al, "The Power of People", 2017)

"Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data." (Pradeep Menon, "Data Lakehouse in Action", 2022)

"Data architecture is the process of designing and building complex data platforms. This involves taking a comprehensive view, which includes not only moving and storing data but also all aspects of the data platform. Building a well-designed data ecosystem can be transformative to a business." (Brian Lipp, "Modern Data Architectures with Python", 2023)

"A data architecture defines a high-level architectural approach and concept to follow, outlines a set of technologies to use, and states the flow of data that will be used to build your data solution to capture big data. [...] Data architecture refers to the overall design and organization of data within an information system." (James Serra, "Deciphering Data Architectures", 2024)

"Data architecture encompasses the rules, policies, models, and standards that govern data collection and how that data is then stored, managed, processed, and used within an organization’s databases and data systems." (snowflake) [source]

"Data architecture is the process by which an organization aligns its data environment with its operational goals." (Xplenty) [source]

14 February 2017

Data Management: Data Asset (Definitions)

[information asset:] "Data in any form or media placed into meaningful context for users, collected in relation to business or research activity." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The term data asset refers generically to organizational data, or to any transactional system, data storage system, or application that contains data required for business functions. Any of these systems might be involved in the production and use of data and therefore may affect its quality. Treating data as an asset enables organizations to manage and increase its value. The quality of the data in a system has direct effects on the successful execution of business functions and therefore the success of the enterprise as a whole." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

[information asset:] "digitally stored content owned by an individual or organization." ( Manish Agrawal, "Information Security and IT Risk Management", 2014)

"Any entity that is comprised of data. For example, a database is a data asset that is comprised of data records. A data asset may be a system or application output file, database, document, or web page. A data asset also includes a service that may be provided to access data from an application. For example, a service that returns individual records from a database would be a data asset. Similarly, a web site that returns data in response to specific queries would be a data asset." (CNSSI 4009-2015)

"Data assets refer to a system, application output file, document, database, or web page that companies use to generate revenues." (CFI) [source]

12 February 2017

Data Management: Data Quality (Definitions)

"Data are of high quality if they are fit for their intended use in operations, decision-making, and planning." (Joseph M Juran, 1964)

"[…] data has quality if it satisfies the requirements of its intended use. It lacks quality to the extent that it does not satisfy the requirement. In other words, data quality depends as much on the intended use as it does on the data itself. To satisfy the intended use, the data must be accurate, timely, relevant, complete, understood, and trusted." (Jack E Olson, "Data Quality: The Accuracy Dimension", 2003)

"A set of measurable characteristics of data that define how well data represents the real-world construct to which it refers." (Alex Berson & Lawrence Dubov, "Master Data Management and Customer Data Integration for a Global Enterprise", 2007)

"The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use." (Keith Gordon, "Principles of Data Management", 2007)

"Deals with data validation and cleansing services (to ensure relevance, validity, accuracy, and consistency of the master data), reconciliation services (aimed at helping cleanse the master data of duplicates as part of consistency), and cross-reference services (to help with matching master data across multiple systems)." (Martin Oberhofer et al,"Enterprise Master Data Management", 2008)

"A set of data properties (features, parameters, etc.) describing their ability to satisfy user’s expectations or requirements concerning data using for information acquiring in a given area of interest, learning, decision making, etc." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"Assessment of the cleanliness, accuracy, and reliability of data." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A set of measurable characteristics of data that define how well the data represents the real-world construct to which it refers." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"This term refers to whether an organization’s data is reliable, consistent, up to date, free of duplication, and can be used efficiently across the organization." (Tony Fisher, "The Data Asset", 2009)

"The degree of data accuracy, accessibility, relevance, time-liness, and completeness." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The degree of fitness for use of data in particular application. Also the degree to which data conforms to data specifications as measured in data quality dimensions. Sometimes used interchangeably with information quality." (John R Talburt, "Entity Resolution and Information Quality", 2011)

"The degree to which data is accurate, complete, timely, consistent with all requirements and business rules, and relevant for a given use." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Contextual data quality considers the extent to which data are applicable (pertinent) to the task of the data user, not to the context of representation itself. Contextually appropriate data must be relevant to the consumer, in terms of timeliness and completeness. Dimensions include: value-added, relevancy, timeliness, completeness, and appropriate amount of data (from the Wang & Strong framework.)" (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Intrinsic data quality denotes that data have quality in their own right; it is understood largely as the extent to which data values are in conformance with the actual or true values. Intrinsically good data is accurate, correct, and objective, and comes from a reputable source. Dimensions include: accuracy objectivity, believability, and reputation (from the Wang & Strong framework)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Representational data quality indicates that the system must present data in such a way that it is easy to understand (represented concisely and consistently) so that the consumer is able to interpret the data; understood as the extent to which data is presented in an intelligible and clear manner. Dimensions include: interpretability, ease of understanding, representational consistency, and concise representation (rom the Wang & Strong framework)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The level of quality of data represents the degree to which data meets the expectations of data consumers, based on their intended use of the data." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2013)

"The relative value of data, which is based on the accuracy of the knowledge that can be generated using that data. High-quality data is consistent, accurate, and unambiguous, and it can be processed efficiently." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"The properties of data embodied by the “Five C’s”: clean, consistent, conformed, current, and comprehensive." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The degree to which data in an IT system is complete, up-to-date, consistent, and (syntactically and semantically) correct." (Tilo Linz et al, "Software Testing Foundations, 4th Ed", 2014)

"A measure for the suitability of data for certain requirements in the business processes, where it is used. Data quality is a multi-dimensional, context-dependent concept that cannot be described and measured by a single characteristic, but rather various data quality dimensions. The desired level of data quality is thereby oriented on the requirements in the business processes and functions, which use this data [...]" (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"[...] characteristics of data such as consistency, accuracy, reliability, completeness, timeliness, reasonableness, and validity. Data-quality software ensures that data elements are represented in a consistent way across different data stores or systems, making the data more trustworthy across the enterprise." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Refers to the accuracy, completeness, timeliness, integrity, and acceptance of data as determined by its users." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A measure of the useableness of data. An ideal dataset is accurate, complete, timely in publication, consistent in its naming of items and its handling of e.g. missing data, and directly machine-readable (see data cleaning), conforms to standards of nomenclature in the field, and is published with sufficient metadata that users can easily understand, for example, who it is published by and the meaning of the variables in the dataset." (Open Data Handbook)

"Refers to the level of 'quality' in data. If a particular data store is seen as holding highly relevant data for a project, that data is seen as quality to the users." (Solutions Review)

"the processes and techniques involved in ensuring the reliability and application efficiency of data. Data is of high quality if it reliably reflects underlying processes and fits the intended uses in operations, decision making and planning." (KDnuggets)

"The narrow definition of data quality is that it's about data that is missing or incorrect. A broader definition is that data quality is achieved when a business uses data that is comprehensive, consistent, relevant and timely." (Information Management)

"Data Quality refers to the accuracy of datasets, and the ability to analyse and create actionable insights for other users." (experian) [source]

"Data quality refers to the current condition of data and whether it is suitable for a specific business purpose." (Xplenty) [source]

11 February 2017

Data Management: Data Collection (Definitions)

"The gathering of information through focus groups, interviews, surveys, and research as required to develop a strategic plan." (Teri Lund & Susan Barksdale, "10 Steps to Successful Strategic Planning", 2006)

"The process of gathering raw or primary specific data from a single source or from multiple sources." (Adrian Stoica et al, "Field Evaluation of Collaborative Mobile Applications", 2008)

"A combination of human activities and computer processes that get data from sources into files. It gets the file data using empirical methods such as questionnaire, interview, observation, or experiment." (Jens Mende, "Data Flow Diagram Use to Plan Empirical Research Projects", 2009)

"A systematic process of gathering and measuring information about the phenomena of interest." (Kaisa Malinen et al, "Mobile Diary Methods in Studying Daily Family Life", 2015)

"The process of capturing events in a computer system. The result of a data collection operation is a log record. The term logging is often used as a synonym for data collection." (Ulf Larson et al, "Guidance for Selecting Data Collection Mechanisms for Intrusion Detection", 2015)

"This refers to the various approaches used to collect information." (Ken Sylvester, "Negotiating in the Leadership Zone", 2015)

"Set of techniques that allow gathering and measuring information on certain variables of interest." (Sara Eloy et al, "Digital Technologies in Architecture and Engineering: Exploring an Engaged Interaction within Curricula", 2016)

"with respect to research, data collection is the recording of data for the purposes of a study. Data collection for a study may or may not be the original recording of the data." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"The process of retrieving data from different sources and storing them in a unique location for further use." (Deborah Agostino et al, "Social Media Data Into Performance Measurement Systems: Methodologies, Opportunities, and Risks", 2018)

"It is the process of gathering data from a variety of relevant sources in an established systematic fashion for analysis purposes." (Yassine Maleh et al, 'Strategic IT Governance and Performance Frameworks in Large Organizations", 2019)

"A process of storing and managing data." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"The process and techniques for collecting the information for a research project." (Tiffany J Cresswell-Yeager & Raymond J Bandlow, "Transformation of the Dissertation: From an End-of-Program Destination to a Program-Embedded Process", 2020)

"The method of collecting and evaluating data on selected variables, which helps in analyzing and answering relevant questions is known as data collection." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Datasets are created by collecting data in different ways: from manual or automatic measurements (e.g. weather data), surveys (census data), records of decisions (budget data) or ongoing transactions (spending data), aggregation of many records (crime data), mathematical modelling (population projections), etc." (Open Data Handbook)

Data Management: Data Mapping (Definitions)

"The process of identifying correspondence between source data elements and target data elements when migrating data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The process of noting the relationship of a data element to something or somebody." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"(1) The process of associating one data element, field, or idea with another data element, field, or idea. (2) In source-to-target mapping, the process of determining (and the resulting documentation of) where the data in a source data store will be moved to another (target) data store." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The assignment of source data entities and attributes to target data entities and attributes, and the resolution of disparate data." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data mapping is the process of creating data element mappings between two distinct data models. This activity is considered to be part of data integration." (Piethein Strengholt, "Data Management at Scale", 2020)

"The process defining a link between two disparate data models. It is often the first step towards data integration." (MuleSoft)

"The process of assigning a source data element to a target data element." (Information Management)

"Data mapping is the process of creating data element mappings between two different data models and is used as a first step for a wide array of data integration tasks, including data transformation between a data source and a destination." (Solutions Review)

"Data mapping is the process of defining a link between two disparate data models in the aim of future data integration." (kloudless)

"Data mapping is the process of mapping source data fields to destination related target fields." (Adobe)

09 February 2017

Data Management: Data Discovery (Definitions)

"Data discovery is the process where an organization obtains an understanding of which data matters the most and identifies challenges with that data. The outcome of data discovery is that the scope of a data quality initiative should be clear and data quality rules can be defined." (Robert Hawker, "Practical Data Quality", 2023)

"The process of analyzing the type, quality, accessibility, and location of data in all available data repositories. It's critical for determining the current state of a data environment, especially when a recent and accurate data dictionary doesn't exist." (Forrester)

"Data Discovery describes a range of techniques designed to collect and consolidate information before an alysing it to find relationships and outliers between entities (or data items) that may exist. This process may be done on data from the same database or across multiple, disparate databases. (experian) [source]

"Data discovery involves the collection and evaluation of data from various sources and is often used to understand trends and patterns in the data." (Tibco) [source]

Data discovery is not a tool. It is a business user oriented process for detecting patterns and outliers by visually navigating data or applying guided advanced analytics. Discovery is an iterative process that does not require extensive upfront model creation. (BI Survey) [source]

"Data discovery is the process of using a range of technologies that allow users to quickly clean, combine, and analyze complex data sets and get the information they need to make smarter decisions and impactful discoveries." (Qlik) [source]

06 February 2017

Data Management: Data Validation (Definitions)

"Evaluating and checking the accuracy, consistency, timeliness, and security of information, for example by evaluating the believability or reputation of its source." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"The process of ensuring accurate data based on data acceptance and exception handling rules." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"The process of ensuring that the values of data conform to specified formats and/or values." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"(1) To confirm the validity of data. (2) A feature of data cleansing tools." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The act of determining that data is sound. In security, generally used in the context of validating input." (Mark S Merkow & Lakshmikanth Raghavan, "Secure and Resilient Software Development", 2010)

"Determining and confirming that something satisfies or conforms to defined rules, business rules, integrity constraints, defined standards, etc. The system cannot perform any validating unless it first has a definition of the way things should be validity The degree to which data conforms to domain values and defined business rules." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"This involves demonstrating that the conclusions that come from data analyses fulfill their intended purpose and are consistent." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The act of testing a model with data that was not used in the model-fitting process." (Meta S Brown, "Data Mining For Dummies", 2014)

[data integrity validation:] "Data integrity validation allows you to verify the integrity of the data that was secured by data protection operations." (CommVault, "Documentation 11.20", 2018)

05 February 2017

Data Management: Data Stewardship (Definitions)

"Data stewardship is the function that is largely responsible for managing data as an enterprise asset. The data steward is responsible for ensuring that the data provided by the Corporate Information Factory is based on an enterprise view. An individual, a committee, or both may perform data stewardship." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"Necessary for resolving conflicts in peer-to-peer replication, data stewardship involves assigning an owner to data that resides on multiple servers and creating rules for how the data should be updated." (Sara Morganand & Tobias Thernstrom , "MCITP Self-Paced Training Kit : Designing and Optimizing Data Access by Using Microsoft SQL Server 2005 - Exam 70-442", 2007)

"An approach to data governance that formalizes accountability for managing information resources on behalf of others and in the best interests of the organization." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Deals with the ownership and accountability of data, and how people manage the data to the benefit of the organization. Data stewardship functions at two levels - Business Data Stewards deal with the higher-level metadata and governance concerns, while Operational Data Stewards focus primarily on the instances of master data in the enterprise." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"This is known as a role assigned to a person that is responsible for defining and executing data governance policies." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"1.The formal, specifically assigned, and entrusted accountability for business (non-technical) responsibilities ensuring effective control and use of data and information resources. 2.The formal accountability for business responsibilities ensuring effective control and use of data assets. (DAMA International, "The DAMA Dictionary of Data Management" 1st Ed., 2011)

"Responsibility and accountability for the actions taken upon a defined set of data, including the definition of the consumers of the data. A data steward is not necessarily the data owner." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures" 2nd Ed., 2012)

04 February 2017

SQL Server Administration: Killing Sessions - Killing ‘em Softly and other Snake Stories

Introduction

    There are many posts on the web advising succinctly how to resolve a blocking situation by terminating a session via kill command, though few of them warn about its use and several important aspects that need to be considered. The command is powerful and, using an old adagio, “with power comes great responsibility”, responsibility not felt when reading between the lines. The easiness with people treat the topic can be seen in questions like “is it possibly to automate terminating sessions?” or in explicit recommendations of terminating the sessions when dealing with blockings.

   A session is created when a client connects to a RDBMS (Relational Database Management System) like SQL Server, being nothing but an internal logical representation of the connection. It is used further on to perform work against the database(s) via (batches of) SQL statements. Along its lifetime, a session is uniquely identified by an SPID (Server Process ID) and addresses one SQL statement at a time. Therefore, when a problem with a session occurs, it can be traced back to a query, where the actual troubleshooting needs to be performed.

   Even if each session has a defined scope and memory space, and cannot interact with other sessions, sessions can block each other when attempting to use the same data resources. Thus, a blocking occurs when one session holds a lock on a specific resource and a second session attempts to acquire a conflicting lock type on the same resource. In other words, the first session blocks the second session from acquiring a resource. It’s like a drive-in to a fast-food in which autos must line up into a queue to place an order. The second auto can’t place an order until the first don’t have the order – is blocked from placing an order. The third auto must wait for the second, and so on. Similarly, sessions wait in line for a resource, fact that leads to a blocking chain, with a head (the head/lead blocking) and a tail (the sessions that follow). It’s a FIFO (first in, first out) queue and using a little imagination one can compare it metaphorically with a snake. Even if imperfect, the metaphor is appropriate for highlighting some important aspects that can be summed up as follows:

Snakes have their roles in the ecosystem
Not all snakes are dangerous
Grab the snake by its head
Killing ‘em Softly
Search for a snake’s nest
Snakes can kill you in sleep
Snake taming

Warning: snakes as well blockings need to be handled by a specialist, so don’t do it by yourself unless you know what are you doing!

Snakes have their roles in the ecosystem

Snakes as middle-order predators have an important role in natural ecosystems, as they feed on prey species, whose numbers would increase exponentially if not kept under control. Fortunately, natural ecosystems have such mechanism that tend to auto-regulate themselves. Artificially built ecosystems need as well such auto-regulation mechanisms. As a series of dynamical mechanisms and components that work together toward a purpose, SQL Server is an (artificial) ecosystem that tends to auto-regulate itself. When its environment is adequately sized to handle the volume of information or data it must process then the system will behave smoothly. As soon it starts processing more data than it can handle, it starts misbehaving to the degree that one of its resources gets exhausted.

Just because a blocking occurs doesn’t mean that is a bad thing and needs to be terminated. Temporary blockings occur all the time, as unavoidable characteristic of any RDBMS with lock-based concurrency like SQL Server. They are however easier to observe in systems with heavy workload and concurrent access. The more users in the system touch the same data, the higher the chances for a block to occur. A good design database and application architecture typically minimize blockings’ occurrence and duration, making them almost unobservable. At the opposite extreme poor database design combined with poor application design can make from blockings a DBA’s nightmare. Persistent blockings can be a sign of poor database or application design or a sign that one of the environment’s limits was reached. It’s a sign that something must be done. Restarting the SQL server, terminating sessions or adding more resources have only a temporary effect. The opportunity lies typically in addressing poor database and application design issues, though this can be costlier with time.

Not all snakes are dangerous

A snake’s size is the easiest characteristic on identifying whether a snake is dangerous or not. Big snakes inspire fear for any mortal. Similarly, “big” blockings (blockings consuming an important percentage of the available resources) are dangerous and they have the potential of bringing the whole server down, eating its memory resources slowly until its life comes to a stop. It can be a slow as well a fast death.

Independently of their size, poisonous snakes are a danger for any living creature. By studying snakes’ characteristics like pupils’ shape and skin color patterns the folk devised simple general rules (with local applicability) for identifying whether snakes are poisonous or not. Thus, snakes with diamond-shaped pupils or having color patterns in which red touches yellow are likely/believed to be poisonous. By observing the behavior of blockings and learning about SQL Server’s internals one can with time understand the impact of each blocking on server’s performance.

Grab the snake by its head

    Restraining a snake’s head assures that the snake is not able to bite, though it can be dangerous, as the snake might believe is dealing with a predator that is trying to hurt it, and reach accordingly. On the other side troubleshooting blockings must start with the head, the blocking session, as it’s the one which created the blocking problem in the first place.

    In SQL Server sp_who and its alternative sp_who2 provide a list of all sessions, with their status, SPID and a reference with the SPID of the session blocking it. It displays thus all the blocking pairs. When one deals with a few blockings one can easily see whether the sessions form a blocking chain. Especially in environments under heavy load one can deal with a handful of blockings that make it difficult to identify all the formed blocking chains. Identifying blocking chains is necessary because by identifying and terminating directly the head blocking will often make the whole blocking chain disappear. The other sessions in the chain will perform thus their work undisturbed.

    Going and terminating each blocking session in pairs as displayed in sp_who is not recommended as one terminates more sessions than needed, fact that could have unexpected repercussions. As a rule, one should restore system’s health by making minimal damage.

    In many cases terminating the head session will make the blocking chain disperse, however there are cases in which the head session is replaced by other session (e.g. when the sessions involve the same or similar queries). One will need to repeat the needed steps until all blocking chain dissolve.

Killing ‘em Softly

Killing a snake, no matter how blamable the act, it is sometimes necessary. Therefore, it should be used as ultimate approach, when there is no other alternative and when needed to save one’s or others’ life. Similarly killing a session should be done only in extremis, when necessary. For example, when server’s performance has deprecated considerably affecting other users, or when the session is hanging indefinitely.

Kill command is powerful, having the power of a hammer. The problem is that when you have a hammer, every session looks like a nail. Despite all the aplomb one has when using a tool like a hammer, one needs to be careful in dealing with blockings. A blocking not addressed correspondingly can kick back, and in special cases the bite can be deadly, for system as well for one’s job. Killing the beast is the easiest approach. Kill one beast and another one will take its territory. It’s one of the laws of nature applicable also to database environments. The difference is that if one doesn’t addresses the primary cause that lead to a blocking, the same type of snake more likely will appear repeatedly.

Unfortunately, the kill command is no bulletproof for terminating a session, it may only severe the snake. As the documentation warns, there can be cases in which the method won’t have any effect on the blocking, the blocking continuing to room around. So, might be a good idea to check whether the session disappeared and keep an eye on it until it totally disappeared. Especially when dealing with a blocking chain it can happen that the head session is replaced by another session, which probably was waiting for the same resources as the previous head session. It may happen that one deals with two or more blocking chains independent from each other. Such cases appear seldom but are possible.

Killing the head session with a blocking without gathering some data provides less opportunities for learning, for understanding what’s happening in your system, of identifying what caused the blocking to occur. Therefore, before jumping to kill a session, collect the data you need for further troubleshooting.

Search for a snake’s nest

With the warning that unless one deals with small snakes, might not be advisable in searching for a snake’s nest, the idea behind this heuristic is that with a snake’s occurrence more likely there is also a nest not far away, where several other snakes might hide. Similarly, a query that causes permanent blockings might be the indication for code that generates a range of misbehaving queries. It can be same code or different pieces of code. One can attempt to improve the performance of a query that leads to blockings by adding more resources on the server or by optimizing SQL Server’s internals, though one can’t compensate for poor programming. When possible, one needs to tackle the problem at the source, otherwise performance improvements are only temporary.

Snakes can kill you in sleep

When wondering into the wild as well when having snakes as pets one must take all measures to assure that nobody’s health is endangered. Same principle should apply to databases as well, and the first line of defense resides in actively monitoring the blockings and addressing them timely as they occur. Being too confident that nothing happens and no taking the necessary precautions can prove to be a bad strategy when a problem occurs. In some situations, the damage might be acceptable in comparison with the effort and costs needed to build the monitoring infrastructure, though for critical systems it can come with important costs.

Snakes’ Taming

Having snakes as pets doesn’t seem like a good idea, and there are so many reasons why one shouldn’t do it (see PETA’s reasons)! On the other side, there are also people with uncommon hobbies, that not only limit themselves at having a snake pet, but try to tame them, to have them behave like pets. There are people who breed snakes to harness their venom for various purposes, occupation that requires handling snakes closely. There are also people who brought their relation with snakes at level of art, since ancient Egypt snake charming being a tradition in countries from Southeast Asia, Middle East, and North Africa. Even if not all snakes are tameable, snake’s taming and charming is possible. In the process the tamer must deprogram or control snakes’ behavior, following a specific methodology in a safe environment.

No matter how much one tries to avoid persistent blockings, one can learn from troubleshooting blockings, about their sources, behavior as well about own limitations. One complex blocking can be a good example with which one can test his knowledge about SQL Server internals as well about applications’ architecture. Each blocking provides a scenario in which one can learn something.

When fighting with a blocking, it’s wise to do it within a safe environment, typically a test or development environment. Fighting with it in a production environment can cause unnecessary stress and damage. So, if you don’t have a safe environment in which to carry the fight, then build one and try to keep the same essential characteristics as in production environment!

There will be also situations in which one must fight with a blocking in the production environment. Then, be careful in not damaging the data as well the environment, and take all the needed precautions!

Conclusion
The comparison between snakes and blockings might not be perfect, though hopefully it will imprint in reader’s mind the dangers of handling blockings inappropriately and increase the awareness in what concerns related topics.

Data Management: Data Matching (Definitions)

[deterministic matching:] "Deterministic matching algorithms compare and match records according to hard-coded business rules according to their precision. For instance, a rule can be set up that stipulates that every “Bill” be matched with a 'William'." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

[probabilistic matching:] "Uses statistical algorithms to deduce the best match between two records. Probabilistic matching usually tracks statistical confidence that two records refer to the same customer." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"A feature of data cleansing tools or the process that matches, or links, associated records through a user-defined or common algorithm." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Data-matching involves bringing together data from disparate services or sources, comparing it, and eliminating duplicate data. There are two types of algorithms that are used in data matching: (1) deterministic algorithms, which strictly use match criteria and weighting to determine the results, and (2) probabilistic algorithms, which use statistical models to adjust the matching based on the frequency of values found in the data." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"A highly specialized set of technologies that allows users to derive a high-confidence value of the party identification that can be used to construct a total view of a party from multiple party records." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

[deterministic matching:] "A type of matching that relies on defined patterns and rules for assigning weights and scores for determining similarity." (DAMA International, "The DAMA Dictionary of Data Management" 1st Ed., 2010)

[probabilistic matching:]"A type of matching that relies on statistical analysis of a sample data set to project results on the full data set." (DAMA International, "The DAMA Dictionary of Data Management" 1st Ed., 2010)

[fuzzy matching:] "A technique of decomposing words into component parts and comparing the parts to find an acceptable level of correspondence." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Matching is a technique for statistical control of confounding. In the simplest form, individuals from the two study groups are paired on the basis of similar values of one or more covariates. Matching can be viewed as a special case of stratification in which each stratum consists of only two individuals." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"The process of comparing rows in data sets to determine which rows describe the same thing and are therefore either complimentary or redundant." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

01 February 2017

Data Management: Data Strategy (Definitions)

"A business plan for leveraging an enterprise’s data assets to maximum advantage." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[enterprise data strategy:] "A data strategy supporting the entire enterprise." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[data management strategy:] "Selected courses of actions setting the direction for data management within the enterprise, including vision, mission, goals, principles, policies, and projects." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A data strategy is a plan for maintaining and improving the quality, integrity, security, and access of the enterprise data. It typically includes business plans that describe how the data will be used to support the enterprise business strategy and goals." (Dewey E Ray, "Valuing Data: An Open Framework", 2018)

"A data strategy is not an algorithm, buzzword, IT project, technology or application, collection of data in storage, department or team, or project or tactic. A data strategy is a set of organization-wide objectives leading to highly efficient processes that turn data resources into outcomes that help the organization fulfill its mission." (Harvinder Atwal, "Practical DataOps: Delivering Agile Data Science at Scale", 2019)

"A data strategy is a plan designed to improve all the ways you acquire, store, manage, share, and use data." (Evan Levy, "TDWI Data Strategy Assessment Guide", 2021)

"A data strategy is a central, integrated concept that articulates how data will enable and inspire business strategy." (MIT CISR)

"A data strategy is a common reference of methods, services, architectures, usage patterns and procedures for acquiring, integrating, storing, securing, managing, monitoring, analyzing, consuming and operationalizing data." (DXC.Technology) [source]

"A data strategy is a highly dynamic process employed to support the acquisition, organization, analysis, and delivery of data in support of business objectives." (Gartner)

SQL Troubles

Pages