31 December 2010

Meta-Blogging: Past, Present and Future

Introduction

   Even though I started to blog 3-4 years ago, only this year (still 2010) did I begin to allocate more time to blogging, keeping two blogs on which I try to post something periodically: SQL Troubles and The Web of Knowledge, plus a homonymous Facebook support group (for the latter blog). As a parenthesis, the two blogs approach related topics from different perspectives, the first focusing on data-related topics, while the second approaches data from a knowledge and web perspective; because several posts qualify for both blogs, I was thinking of merging them, though given the different perspectives and the types of domains that deal with them, at least for the moment I'll keep them apart. Closing the parenthesis, I would love to allocate more time to blogging, though I have to balance it against my professional and personal life, and even if the three have many points in common, some delimitation is necessary. Because it's the end of the year, I thought this is the best time to draw the line and analyze the achievements of the past year and the expectations for the next year(s), for each of the two blogs. So here are my thoughts:

Past and Present

    It has already been more than 10 years since I started to work with various database systems, my work ranging from data modeling to database development, reporting, ERP systems, etc. I can't consider myself an expert, though I've accumulated experience in a whole range of areas, which I think entitles me to say that I have something to write about, even if the respective themes are not rocket science. In addition, there is the human endeavor of learning something new each day, and in IT that's quite an imperative, the evolution of various technologies requiring those who work in this domain to spend extra hours learning new things or consolidating and reusing knowledge in new ways. I considered at that time, and I still do, that blogging helps the learning process, allowing me to externalize old and new knowledge, clear my thoughts, keep some kind of testimony of what I know or at least a repository of information I could reuse when needed, and eventually receive some feedback. These are a few of the reasons for which this blog was born, and I hope the information presented here is useful to other people as well.

  During the past year I managed to post more than 100 entries on various topics, the themes revolving around strings, hierarchical queries, CLR functionality, Data Quality, SSIS, ERPs, reports, troubleshooting, best practices, joins, etc. Not all the posts rose to my expectations, though it's a start, and I hope that I will find a personal style and that the quality of the posts will increase. I can't say I received a lot of feedback, however based on the access statistics provided by Clustrmaps and Google the number of visitors this year was somewhere around 8500, close to my expectations. Talking about the number of visitors, it's nice to have some visualization too, so for this year's statistics I'll use Clustrmaps' visualization, which provides a more detailed geographical overview than Google's Stats, while for trending I show below Google's Stats (containing data from May until today):

The Web of Knowledge - Clustrmaps 2010 statistics

The Web of Knowledge - Google 2010 statistics

     What I find great about Google's Stats is that it also provides an overview of the most accessed posts and the traffic sources. There are also some statistics of the audience per browser and OS, though they are less important for my blogging requirements, at least for the moment.

The Web of Knowledge - pageviews by OS The Web of Knowledge - pageviews by browsers

  What I find interesting is that the most visited posts and searched keywords were targeting SSIS and Oracle vs. SQL Server-related topics. So, if I want more traffic in the future, then maybe I should diversify my topics in this direction.

Future

  I realize that I started many topics and will have to continue posting on them next year, while also targeting new topics like Relational Theory, Business Intelligence, Data Mining, Data Management, Statistics, SQL Server internals, data technologies, etc. Many of the posts will be an extension of my research on the above topics, and I was thinking of posting my learning notes as well, in the hope that I will receive more feedback. I realized that I need to be more active and provide more feedback on other blogs, using the respective comments as gateways to my blog and trying to build a network around it. I was also thinking of starting a Facebook "support group", posting the links I discover, quotes or impressions in a more condensed form, but again this will take more time, so I'm not sure it makes sense to do that. Maybe I should post them directly on the blog, however I wanted my posts to be a little more substantial than that. Anyway, I also know that I won't manage to post more than an average of one post per week, though given my current expectations that's ok.

  Right now all the posts follow a push model, in other words I push the content independently of whether there is a demand for it or not. That's actually natural, because the blog has a personal note. In the future I expect to move in the direction of a pull model, in other words to write on topics requested by readers, however for this I need more feedback from you, the reader. So please let me know what topics you'd like to read about!

  I close here, hoping that the coming year (2011) will be much better than the current one. I wish you all a Happy New Year!

24 December 2010

Data Warehousing: Data Lake (Definitions)

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, "Pentaho, Hadoop, and Data Lakes", 2010) [sorce] [first known usage]

"At its core, it is a data storage and processing repository in which all of the data in an organization can be placed so that every internal and external systems', partners', and collaborators' data flows into it and insights spring out. [...] Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze." (Beulah S Purra & Pradeep Pasupuleti, "Data Lake Development with Big Data", 2015) 

"Data lakes are repositories of raw source data in their native format that are stored for extended periods." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"A repository of data used to manage disparate formats and types of data for a variety of uses." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A storage system designed to hold vast amounts of raw data in its native (ingested) format, usually in a flat or semi-structured format. Extract, transform, and load (ETL) operations are usually applied to data lakes to extract local data marts for downstream computation." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)

"Data Lake is an analytics system that supports the storing and processing of all types of data." (Maritta Heisel et al, "Software Architecture for Big Data and the Cloud", 2017)

"A data lake is a central repository that allows you to store all your data—structured and unstructured - in volume […]" (Holden Ackerman & Jon King, "Operationalizing the Data Lake", 2019)

"A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning." (Piethein Strengholt, "Data Management at Scale", 2020)

"A repository for storing unstructured and structured data that is downloaded in its raw form and stored by a highly scalable, distributed files system known as open source." (Marcin Flotyński et al, "Non-Technological and Technological (SupTech) Innovations in Strengthening the Financial Supervision", 2021)

"Data lakes are massive repositories for original, raw and unstructured data which is collected from various sources across a smart city. The data from data lakes can be cleansed and transformed for further analytics and modeling." (Vijayaraghavan Varadharajan & Akanksha Rajendra Singh, "Building Intelligent Cities: Concepts, Principles, and Technologies", 2021)

"A data lake is a central location, that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data." (databricks) [source]

"A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale." (Amazon) [source]

"A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores." (Gartner)

"A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale. It is enabled by low-cost technologies that multiple downstream facilities can draw upon, including data marts, data warehouses, and recommendation engines." (Teradata) [source]

"A data lake is a large and diverse reservoir of corporate data stored across a cluster of commodity servers running software, most often the Hadoop platform, for efficient, distributed data processing." (Qlik) [source]

"A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources." (Oracle)

"A Data Lake is a service which provides a protective ring around the data stored in a cloud object store, including authentication, authorization, and governance support." (Cloudera) [source]

"A data lake is a type of data repository that stores large and varied sets of raw data in its native format." (Red Hat) [source]

"A data lake is an unstructured data repository that contains information available for analysis. A data lake ingests data in its raw, original state, straight from data sources, without any cleansing, standardization, remodeling, or transformation. It enables ad hoc queries, data exploration, and discovery-oriented analytics because data management and structure can be applied on the fly at runtime, unlike traditional structured data storage which requires a schema on write." (TDWI)

"A storage repository that holds a large amount of raw data in its native format until it is needed." (Solutions Review)

14 December 2010

SQL Reloaded: Pulling the Strings of SQL Server III: (Concatenation)

    Typically a database in general, and a table in particular, that follows the normalization rules is designed to have columns containing the smallest semantic chunks of data: a Street, a Zip Code, a City, a person's First or Last Name, but also a larger chunk of text like a Description or a Comment. No matter how well designed a database is, there will always be the need to perform various operations on strings, typically concatenation, extraction of a substring, insertion or deletion of characters, rearrangement of a string's characters, trimming, splitting it into substrings, or deriving various numeric values: length, position of a given text, number of non-empty characters, whether the text represents a valid numeric or date value, etc. In the following posts I will attempt to address the respective operations in the context of SELECT statements, and let's start with concatenation.

    Concatenation is the operation of joining two or more string values within an expression. In SQL Server the "+" operator is used for concatenation, and it can be used to concatenate two or more terms. In order to concatenate two terms of which only one is a string, the other term needs to be explicitly converted to a string data type, otherwise an error will occur. For readability or post-processing purposes, the strings are concatenated using a delimiter that marks the boundaries of the initial values; it could be a space, a comma, a semicolon, a pipe, a tab or any other character typically used to delimit columns.

-- concatenation of strings 
SELECT 'a string' + ' and ' + 'a second string' Example1  
, 'a string' + ',' + 'a second string' Example2  
, '1245' + '67890' Example3  
, '1245' + '.' + '67890' Example4 
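
   When one of the terms is not a string, an explicit conversion is needed, otherwise SQL Server attempts to evaluate the expression numerically and may raise a conversion error. A minimal sketch (the varchar lengths are my choice, any sufficiently long string type works):

-- concatenation involving non-string terms 
SELECT 'ID: ' + Cast(12345 as varchar(10)) Example1 
, Cast(1245 as varchar(10)) + '.' + Cast(67890 as varchar(10)) Example2 
--, 'ID: ' + 12345 Example3 -- error: 'ID: ' can't be converted to int 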

   The concatenation of string variables or columns is based on the same principles: 

-- concatenating string variables 
DECLARE @string1 varchar(50) 
DECLARE @string2 varchar(50) 
DECLARE @string3 varchar(50) 

SET @string1 = 'this is a string'  
SET @string2 = 'this is another string'  

SELECT @string1 + ' and ' + @string2 Example1 
, @string1 + char(31) + @string2 Example2 
, @string1 + ', ' + @string2 Example3 
, @string1 + ' ' + @string3 Example4 
, @string1 + IsNull(@string3, '!') Example5 
 
concatenation 2 

    Here’s another example based on the concatenation of columns coming from two joined tables of the AdventureWorks database:
 
-- concatenating columns of joined tables 
SELECT PAD.AddressID 
, IsNull(PAD.AddressLine1, '')  
+ IsNull(', ' + PAD.AddressLine2, '') 
+ IsNull(', ' + PAD.City, '') 
+ IsNull(', ' + PAD.PostalCode, '') 
+ IsNull(', ' + PSP.Name, '') Address 
FROM Person.Address PAD 
      JOIN Person.StateProvince PSP 
         ON PAD.StateProvinceID = PSP.StateProvinceID 
 
concatenation 3

   As stressed in the previous post, NULL values need to be handled adequately, either by initializing the variables or by using the IsNull or COALESCE functions. The concatenation of strings combined with the IsNull function can be used creatively in order to add a comma only when a value is not null, as in the above example.

   There are scenarios in which it is necessary to concatenate values belonging to the same column but coming from different records, for example concatenating email addresses in order to send a single email to all the users in one single action. Before the introduction of common table expressions (CTEs) it wasn't possible, at least not easily, to concatenate within a single query string values belonging to different records, this functionality being achieved by using cursors or loops, or simply performed on the client or an intermediate layer. As I already gave an example of how to use a cursor in order to loop through the values of a table and concatenate them (see the "Cursors and Lists" post), I will focus on the use of loops and simple CTEs.

    Loops are one of the basic constructs in programming languages, no matter their complexity or type. Whether we are talking about WHILE, FOR, foreach or do ... until loops, the principle is the same: perform a list of actions until one or more conditions are met. In this case the action performed is reduced to a set of concatenations based on the letters of the (English) alphabet:

-- concatenation within a loop 
DECLARE @list varchar(max) 
DECLARE @index int  
SET @list = '' 
SET @index = ASCII('a') 

WHILE (@index <= ASCII('z')) -- include 'z' as well 
BEGIN 
     SET @list = @list + ', ' + Char(@index) 
     SET @index = @index + 1 
END 
SELECT @list Result 

    There is more work to be done in order to remove the leading comma from the output, but that's a topic for the next post, when discussing the decomposition of strings.
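
    For the impatient, the leading delimiter can also be trimmed with the Stuff function; a one-liner sketch on a literal shaped like the output above:

-- removing the leading delimiter (a comma followed by a space) 
SELECT Stuff(', a, b, c', 1, 2, '') Result -- returns 'a, b, c' 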

    CTEs are a little more complex to use than loops, though the concatenation can be achieved across records within a single query rather than in procedural code as in the above example. In order to keep the test data apart from the actual logic, I made use of an additional CTE which simulates the existence of a given table:
 
-- concatenation across records 
;WITH Data(Column1, Ranking)  
AS  
( -- preparing test data 
       SELECT 'A' Column1, 0 Ranking  
       UNION ALL  
       SELECT 'B' Column1, 1 Ranking  
       UNION ALL  
       SELECT 'C' Column1, 2 Ranking 
) 
, Result(Column1, Ranking)  
AS 
(  -- performing the actual concatenation 
      SELECT Cast(Column1 as nvarchar(max)) Column1 , Ranking 
      FROM Data 
      WHERE Ranking = 0  
      UNION ALL 
      SELECT Cast(B.Column1 + ',' + A.Column1 as nvarchar(max)) Column1 , A.Ranking  
      FROM Data A  
         JOIN Result  B  
             ON A.Ranking - 1 = B.Ranking  
) SELECT Column1  
FROM Result  
WHERE Ranking IN (SELECT MAX(Ranking) FROM Result) 
  
    The logic for doing a simple concatenation may seem complicated, though the volume of work is not that big if we ignore the first CTE. On the other side I introduced an "index" within the Ranking column, which makes it easier to process the records. When dealing with records coming from an actual table it's probably much easier to derive the Ranking with whichever of the ranking functions suits best.
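
    A sketch of that idea, assuming the AdventureWorks Person.StateProvince table used earlier and ROW_NUMBER to derive a 0-based Ranking (the filter on Canadian provinces is only there to keep the list short):

-- concatenation across records, the Ranking derived with ROW_NUMBER 
;WITH Data(Column1, Ranking)  
AS 
( -- 0-based ranking over the source table 
       SELECT Cast(Name as nvarchar(max)) Column1 
       , ROW_NUMBER() OVER(ORDER BY Name) - 1 Ranking 
       FROM Person.StateProvince 
       WHERE CountryRegionCode = 'CA' 
) 
, Result(Column1, Ranking)  
AS 
( -- the same concatenation logic as above 
      SELECT Column1, Ranking 
      FROM Data 
      WHERE Ranking = 0  
      UNION ALL 
      SELECT Cast(B.Column1 + ',' + A.Column1 as nvarchar(max)) Column1, A.Ranking  
      FROM Data A  
         JOIN Result B  
             ON A.Ranking - 1 = B.Ranking  
) 
SELECT Column1  
FROM Result  
WHERE Ranking IN (SELECT MAX(Ranking) FROM Result) 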

11 December 2010

SQL Reloaded: Pulling the Strings of SQL Server II (Creation and Selection)

    It doesn’t make sense to talk about the creation of string data types without talking about their “selection”, as people typically want to see examples at work, not only learn theoretical facts. I’m saying that because programming often seems to be like putting together the pieces of a puzzle without knowing how the final image will look. Anyway, coming back to the topic, what I find great about SQL Server is that it allows you to “create” and select a string with minimum overhead within a simple SELECT statement:

-- selecting a string value 
SELECT 'this is a string' -- string 1 
, 'and this is a second string' --string2 

-- selecting a string value (named columns)  
SELECT 'this is a string' String1  
, 'and this is a second string' String2 
 
Strings - example1

    Pretty simple, isn’t it? This kind of script can be useful especially when debugging logic or testing an expression (e.g. functions at work), the advantage being that it is not necessary to build a table in order to test simple things. When performing multiple operations with the same values it can be handy to store the values in declared variables:  

-- selecting string variables 
DECLARE @string1 varchar(50) 
DECLARE @string2 char(50) 
DECLARE @string3 char(50) = 'this is a string'  

SET @string1 = 'this is a string'  

SELECT @string1 String1 , @string2 String2 
, @string3 String3 
   
Strings - example2

   As can be seen, a variable that has not been initialized has by default the value NULL, a fact that could lead to unexpected behavior when it is used in expressions involving multiple variables. Therefore it’s recommended to initialize the variables with the empty string or at least to handle the NULLs in expressions. Starting with SQL Server 2008 it’s possible to declare and initialize a variable within the same statement, so the above code won’t work in previous versions unless the third declaration is broken into two statements, as done for the first string.

    It looks like that’s all, at least in what concerns the direct declaration of a string, be it implicit or explicit. However a basic introduction into strings’ creation is incomplete without mentioning the creation of strings from other data types and the various functions that can be used for this purpose. Besides the Cast and Convert functions used to explicitly convert other data types to string data types, there are several other functions serving this purpose, namely Char, Space and Replicate.
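
   Before looking at those, a quick sketch of Cast and Convert at work; the Convert style numbers are two of the standard date formats:

-- creating strings from other data types 
SELECT Cast(12345 as varchar(10)) Example1 -- integer to string 
, Cast(123.45 as varchar(10)) Example2 -- decimal to string 
, Convert(varchar(10), GetDate(), 104) Example3 -- current date as dd.mm.yyyy 
, Convert(varchar(23), GetDate(), 121) Example4 -- current date in the ODBC canonical format 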

   Probably a few of you are already familiar with the ASCII (American Standard Code for Information Interchange) character-encoding scheme used to encode text files (as opposed to binary encoding). The ASCII code represents a mapping between numbers and characters, SQL Server supporting the transformation between the two sets through the ASCII and Char functions. If ASCII translates a non-unicode character into an integer, the Char function translates an integer value into a non-unicode character. The integer values range between 0 and 255, encompassing the 0-9 digits, the characters of English together with the diacritics of several other important languages, punctuation signs and several graphical characters. The mapping between the two sets is unique and, as can be seen from the below example based on a common table expression, the two functions are each other’s inverse:

-- ASCII character values 
WITH CTE 
AS (    
    SELECT 0 [Index]  
    UNION ALL 
    SELECT [Index]+1 [Index]  
    FROM CTE 
    WHERE [Index]<255 
) SELECT [Index] 
, CHAR([Index]) [Character] 
, ASCII(CHAR([Index])) [ASCII] 
FROM CTE 
OPTION (MAXRECURSION 0) 

    There is not much to say about the Space and Replicate functions: the Space function returns a string of repeated spaces, while the Replicate function builds a string as a repeated sequence of values. The Space function could be considered redundant, as the same output can be obtained by using the space as the repeating sequence of Replicate.

-- Space & Replicate at work 
SELECT Space(1) Example1 
, Space(5) Example2 
, Replicate(' ', 1) Example3 
, Replicate(' ', 5) Example4 
, Replicate('10', 1) Example5 
, Replicate('10', 2) Example6 
, Replicate('10', 3) Example7 
 
Replicate

   As per the last statement, the first and third examples, respectively the second and fourth examples, return the same values, unfortunately the output of the respective examples is not so easy to see. For exemplification, it could have been enriched by comparing or concatenating the strings, though that’s a topic for the next post. 
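
   For the curious, a tiny sketch that makes the whitespace visible by wrapping it in markers and measuring its storage:

-- making the whitespace visible 
SELECT '>' + Space(5) + '<' Example1 -- >     < 
, '>' + Replicate(' ', 5) + '<' Example2 -- same output as Example1 
, DataLength(Space(5)) Example3 -- 5 bytes, even if Len would report 0 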

08 December 2010

SQL Reloaded: Pulling the Strings of SQL Server IX (Special Characters)

    Under the special characters denomination are typically categorized the characters that don’t belong to the alphabet of a given language, typically English (a-z, A-Z), or to the numeric digits (0-9). Characters like umlauts (e.g. ä, ë, ö, ü, ÿ), accents (e.g. é, í, ó, ú) and other types of characters used in languages like German, French, Turkish, Hungarian, Romanian, Dutch or Swedish, together with punctuation signs or graphical characters, fall under the designation of special characters. It’s true that SQL Server, like many other databases, supports Unicode, a standard designed to encode such characters, though not all database designs take Unicode into account. Predominantly the older databases and other software solutions use the non-unicode string data types (char, varchar, text), a fact that makes the special characters display inadequately, sometimes undecipherably. In order to avoid this behavior one could decide to port the data types to Unicode or to use only the standard characters of the English language, both solutions with their positive and negative aspects. In such cases, especially during migration projects or ETL tasks, eventually as part of a Data Quality initiative, it’s preferable to identify and replace the special characters in a manual, automatic or semi-automatic fashion. In addition, there are also cases in which, for various reasons, some attributes are not allowed to or should not include special characters, and this aspect too could be included in a Data Quality initiative.

    During the assessment of Data Quality, in an organized or less organized manner, a first step resides in understanding the quality of the data. In the current case this comes down primarily to identifying how many records contain special characters, how the data could be cleaned, and the “cost” of this activity. Actually, before going that far, the character sets included in the special characters must be defined, and the definition could vary from case to case. For example in some cases the space or the important punctuation signs could also be considered valid characters, while in others they may not be allowed. Multiple scenarios could thus be identified, though I found out that the ranges of characters a-z, A-Z, 0-9 and the space are considered valid characters in most of the cases. For this purpose a simple function can be built that iterates through all the characters of a string and identifies whether there is any character not belonging to the before mentioned ranges of valid characters. In order to address this, a few years back I built a function similar to the one below:

-- checks if a string has special characters 
CREATE FUNCTION dbo.HasSpecialCharacters( 
@string nvarchar(1000)) 
RETURNS int 
AS  
    BEGIN 
         DECLARE @retval int 
         DECLARE @index int 
         DECLARE @char nchar(1)  

         SET @retval = 0 
         SET @index = 1  

         WHILE (@index <= IsNull(len(@string), 0) AND @retval=0) 
        BEGIN  
           SET @char = Substring(@string, @index, 1) -- the current character 
           IF NOT (ASCII(@char) BETWEEN 48 AND 57 -- numeric value 
             OR ASCII(@char) BETWEEN 65 AND 90 -- capital letters 
            OR ASCII(@char) BETWEEN 97 AND 122 -- small letters 
            OR ASCII(@char) = 32) --space 
           BEGIN 
                SET @retval = @index 
           END 
           ELSE 
               SET @index = @index + 1  
     END  

    RETURN (@retval) 
END 

    The function’s logic is based on the observation that the ASCII codes of the numeric digits can be found in the integer interval between 48 and 57, the capital letters between 65 and 90, and the small letters between 97 and 122. By adding the ASCII code for space, and eventually several other characters, the check of whether a character is valid comes down to only 4 constraints. Here’s the function at work:  

-- testing HasSpecialCharacters function 
SELECT dbo.HasSpecialCharacters('kj324h5kjkj3245k2j3hkj342jj4') Example1 
, dbo.HasSpecialCharacters('Qualität') Example2 
, dbo.HasSpecialCharacters('Änderung') Example3 
, dbo.HasSpecialCharacters('') Example4 
, dbo.HasSpecialCharacters(NULL) Example5 
, dbo.HasSpecialCharacters('Ä') Example6 
, dbo.HasSpecialCharacters('ä') Example7 
, dbo.HasSpecialCharacters('a') Example8 
 
special characters 1

    As can be seen, the function returns the position where a special character is found, a fact that enables users to identify the character that causes the problem. A similar function could also be built in order to count the number of special characters found in a string, the change residing in maintaining a counter rather than returning the position of the first occurrence of a special character.

    The function might not be perfect, though it solves the problem. There are also other alternatives, for example storing the special characters in a table and performing a simple join against the target table. Another solution could be based on RegEx functionality, either by using OLE automation or directly CLR functionality (a lighter pattern-based variation via PATINDEX is sketched at the end of this post). Variations on the above solution are possible too, for example limiting the check to whether the characters of a string fall within a single range of ASCII codes. That’s what the following function does:  

-- checks if a string has special characters falling in an interval  
CREATE FUNCTION dbo.HasCharNotInASCIIRange( 
@string nvarchar(1000) 
, @start int 
, @end int) 
RETURNS int 
AS  
BEGIN      
    DECLARE @retval int 
    DECLARE @index int 
    DECLARE @char nchar(1) 
 
    SET @retval = 0 
    SET @index = 1 
    WHILE (@index <= IsNull(len(@string), 0) AND @retval=0) 
    BEGIN  
         SET @char = Substring(@string, @index, 1) -- the current character 
         IF NOT (ASCII(@char) BETWEEN @start AND @end)  
        BEGIN 
              SET @retval = @index 
        END 
        ELSE 
             SET @index = @index + 1  
     END 
     RETURN (@retval) 
END 

      
   With this function 4 calls are necessary in order to identify whether a string contains special characters, though we lose the flexibility of identifying the first character that is invalid. We could still identify the first occurrence by taking the minimum of the values returned by the 4 calls, however, unlike Oracle (see the Least function), SQL Server doesn’t have such a function, so we would eventually have to build it. Anyway, here’s the above function at work:  

-- testing HasCharNotInASCIIRange function 
SELECT dbo.HasCharNotInASCIIRange('k12345', 48, 57) Example1 
, dbo.HasCharNotInASCIIRange('12k345', 48, 57) Example2 
, dbo.HasCharNotInASCIIRange('12345', 48, 57) Example3 
, dbo.HasCharNotInASCIIRange(' 12345', 48, 57) Example4 
, dbo.HasCharNotInASCIIRange('12345 ', 48, 57) Example5 
, dbo.HasCharNotInASCIIRange('', 48, 57) Example6 
, dbo.HasCharNotInASCIIRange(NULL, 48, 57) Example7 
, dbo.HasCharNotInASCIIRange('', 48, 57) Example8 
, dbo.HasCharNotInASCIIRange('a', 48, 57) Example9 
, dbo.HasCharNotInASCIIRange('Ä', 48, 57) Example10 
, dbo.HasCharNotInASCIIRange('ä', 32, 32) Example11 

special characters 2
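
    As a closing remark on the alternatives mentioned above, when the set of valid characters can be expressed as a pattern, PATINDEX with a negated character class can replace the character-by-character loop. A minimal sketch; the COLLATE clause forces the ranges to be evaluated on code points, because under typical Latin1_General collations the ranges follow the collation's sort order, where a character like ä may fall inside a-z:

-- position of the first character outside a-z, A-Z, 0-9 and space 
SELECT PATINDEX('%[^a-zA-Z0-9 ]%', 'kj324h5kjkj3245k2j3hkj342jj4' COLLATE Latin1_General_BIN) Example1 -- 0, no special character 
, PATINDEX('%[^a-zA-Z0-9 ]%', 'Qualität' COLLATE Latin1_General_BIN) Example2 -- 7, the position of ä 
, PATINDEX('%[^a-zA-Z0-9 ]%', 'Änderung' COLLATE Latin1_General_BIN) Example3 -- 1, the position of Ä 
, PATINDEX('%[^a-zA-Z0-9 ]%', NULL) Example4 -- NULL 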

07 December 2010

SQL Reloaded: Pulling the Strings of SQL Server I (Introduction)

    The (character) string, or simply the character data type as it is named in the MSDN documentation, is one of the primary data types available in a database and probably the most complex, given the fact that it can encompass any combination of numeric, literal, date or other primary data types. In fact it can include any chunk of text that can be written in any language, as SQL Server supports Unicode and thus most of the (written) languages. In this post I’m not intending to make a complete anthology of strings, and neither to retake the whole volume of information available in MSDN or other important books. My intent is to look at strings from a slightly different perspective, considering the various functions involving strings and how they can be used in order to provide various functionality, in fact the cornerstone of the everyday developer’s work.

   A few things to remember:

1. there are mainly two types of non-unicode strings: the char (or character) of fixed length, respectively the varchar (varying character) of variable length

2. if initially both types of strings had a maximum length of 8000 characters, starting with SQL Server 2005 it’s possible to have a varchar with maximum storage size, declared as varchar(max).

3. if in the past there were some strict recommendations concerning the use of char vs. varchar, nowadays varchar tends to be used almost everywhere, even for strings of length 1.

4. talking about length, it denotes the number of characters a string stores.

5. the trailing spaces, the spaces found at the right extremity of a string, are typically ignored (e.g. by the Len function), while the leading spaces, the spaces found at the left extremity, are not ignored (see the sketch after this list).

6. starting with SQL Server 2000, the corresponding Unicode data types, prefixed with n (national), were introduced for the two character data types: nchar, respectively nvarchar.

7. given the fact that a Unicode character needs more space to store than a non-unicode one (the number of bytes doubles), the maximum length of a Unicode string is only 4000 characters.

8. there are also the non-unicode text, respectively the Unicode ntext data types, designed to store strings of maximum length, though it seems they could soon become deprecated, so it might be a good idea to avoid them.

9. not-initialized variables, including strings, have the value NULL, referred to also as the NULL string, therefore it’s always a good idea to initialize your variables.

10. the empty string ('') designates the string containing no characters, and it has length 0.

11. there are several specific functions available for the creation, manipulation and conversion of strings from and to other data types.

12. not all of the aggregate functions work with string values (specifically the ones requesting a numeric value, like SUM, AVG or STDEV).

13. operations performed on strings of different (string) data types are generally not impacted by the difference in type, though there are some exceptions.

14. there are several (basic) operations on strings, typically concatenation, extraction of a substring, insertion, replacement or deletion of characters, rearrangement of a string’s characters, trimming, splitting it into substrings (decomposition), etc.

15. there are several numeric values that can be derived from strings: length, position of a given text within a text, number of non-empty characters, the encoding of a character, whether the text represents a valid numeric or date value, etc.
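
    A few of the above points at work, a minimal sketch covering points 5, 7, 9 and 10:

-- a few of the above points at work 
DECLARE @padded varchar(20) 
DECLARE @unicode nvarchar(20) 
DECLARE @notset varchar(20) 

SET @padded = 'abc   ' -- three trailing spaces 
SET @unicode = N'abc' 

SELECT Len(@padded) [Length] -- 3, the trailing spaces are ignored (point 5) 
, DataLength(@padded) [Bytes] -- 6, the storage still includes them 
, DataLength(@unicode) [UnicodeBytes] -- 6, two bytes per character (point 7) 
, @notset [NullString] -- NULL, the variable was not initialized (point 9) 
, Len('') [EmptyLength] -- 0, the empty string (point 10) 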

05 December 2010

Process Management: Business Process Management (Definitions)

"The general term for the services and tools that support explicit process management (such as process analysis, definition, execution, monitoring, and administration), including support for human and application-level interaction." (Tilak Mitra et al, "SOA Governance", 2008)

"Software that models an enterprise's human and machine tasks and the interactions between them as processes and can monitor these tasks in real time in order to trigger a unit of work or set off an alert when specified time limits are exceeded or a response is not received within a specified time." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

"Software products that support the design, execution, and monitoring of repetitive, day-to-day business processes. Can create data used in diagnostic or interactive control systems." (Leslie G Eldenburg & Susan K Wolcott, "Cost Management 2nd Ed", 2011)

"A popular management technique that includes methods and tools to support the design, analysis, implementation, management, and optimization of operational business processes." (Linda Volonino & Efraim Turban, "Information Technology for Management 8th Ed", 2011)

"Management approach focused on aligning all aspects of an organization with the wants and needs of clients. Business process management attempts to improve processes continuously and may be therefore described as a “process optimization process.” Business process management activities can be grouped into five categories: design, modeling, execution, monitoring, and optimization." (IQBBA, "Standard glossary of terms used in Software Engineering", 2011)

"A managerial discipline that is focused on execution. It is the art and science of how organizations do things and how they can do them better. BPM attempts to optimize process performance to achieve strategic business objectives consistently while adapting, when necessary, to change or to new opportunity." (Carl F Lehmann, "Strategy and Business Process Management", 2012)

"Managing the work steps and business activities of an organization's workers in an automated way." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"A discipline focused on continuous improvement and transformation of end-to-end cross-functional business processes." (Forrester)

"Business process management (BPM) is a discipline that uses various methods to discover, model, analyze, measure, improve and optimize business processes. A business process coordinates the behavior of people, systems, information and things to produce business outcomes in support of a business strategy. Processes can be structured and repeatable, or unstructured and variable." (Gartner)

"Business process management (BPM) is the discipline of improving a business process from end to end by analyzing it, modelling how it works in different scenarios, executing improvements, monitoring the improved process and continually optimizing it." (Techtarget)

"Business process management (BPM) involves the use of appropriate tools and techniques to design, analyse, and manage operational business processes and, where possible, to improve those processes. The term business process refers to repetitive activities performed in the context of an organization’s normal, everyday operations." (Institute for Competitive Intelligence)

SQL Reloaded: A Look into Reverse

Some time back I was mentioning Stuff as one of the functions rarely used in SQL Server. Here’s another one: Reverse. I remember I used such a function in other programming languages and maybe 1-2 times in SQL Server, and I can’t say I’ve seen it in action in other posts. As its name states and the MSDN documentation confirms, Reverse returns the reverse of a string value. The function takes as parameter any string value or value that can be implicitly converted to a string, otherwise you’ll have to do an explicit conversion. As can be seen from the first example below, the reverse of the “string” string is “gnirts”, and, as per the second example, the reverse of the reverse of a string is the initial string itself. The third and fourth examples use a numeric value instead.
 
-- reversing a string using Reverse function 
SELECT Reverse('string') Example1 
, Reverse(Reverse('string')) Example2 
, Reverse(1234567890) Example3 
, Reverse(Reverse(1234567890)) Example4 
Example1 Example2 Example3   Example4 
gnirts   string   0987654321 1234567890 

It seems there is not much to be said about the Reverse function, and in the few situations in which you need it, it’s great to have it, otherwise you would have to build it yourself. Let’s try to imagine there is no Reverse function in SQL Server; how would we provide such functionality using existing features? In general such an exercise might look like lost time, though it could be useful in order to present several basic techniques. In this case the Reverse functionality could be provided by using a simple loop that iterates through the string’s characters, changing their order. The below piece of code should return the same output as above:

-- reversing a string using looping 
DECLARE @String varchar(50) 
DECLARE @Retval varchar(50) 
DECLARE @Index int 
SET @String = 'string' 
SET @Index = 1 
SET @Retval = '' 
WHILE @Index <= Len(@String) 
BEGIN  
   SET @Retval = Substring(@String, @Index, 1) + @Retval   
   SET @Index = @Index + 1  
END 
SELECT @Retval 

Some types of problems whose solution involves the sequential processing of values within loops can also be solved using recursive techniques (recursion), in which the output of a given step is based on the previous step. Recursion implies a slightly different view of the same problem; it could prove to be less efficient in some cases and quite a useful technique in others. SQL Server provides two types of recursion: one at object level, involving functions, stored procedures or triggers, respectively one at query level, implemented in common table expressions. Recursive functions, functions that call themselves, have been in SQL Server for quite a while, though few of the database books I read really talk about them. Probably I will detail the topic in another post, though for the moment I will show only the technique at work. In this case the recursive approach is quite simple:

-- reversing a string using a recursive function 
CREATE FUNCTION dbo.ReverseString( 
@string varchar(50)) 
RETURNS varchar(50) 
AS 
BEGIN           
     RETURN (CASE WHEN Len(@string)>1 THEN dbo.ReverseString( Right(@string, len(@string)-1)) + Left(@string, 1) ELSE @string END) 
END 
 
-- testing 
SELECT dbo.ReverseString('string') Example1 
, dbo.ReverseString(dbo.ReverseString('string')) Example2 

Common table expressions (CTE), since their introduction with SQL Server 2005, have become quite an important technique in processing hierarchical structures like BOMs, which involve the traversal of the whole tree. A sequence is actually a tree with a width of 1 node, so it seems like a job for a CTE:
  
-- reversing a string using common table expression 
DECLARE @String varchar(50) 
SET @String = 'string' 
;WITH CTE(String, [Index]) 
AS (     SELECT Cast('' as varchar(50)) String 
    , 1 [Index] 
    UNION ALL 
    SELECT Cast(Substring(@String, [Index], 1) + String as varchar(50)) String 
    , [Index] + 1 [Index] 
    FROM CTE 
    WHERE [Index]<=Len(@String) 
) 
SELECT String Reversed 
FROM CTE 
WHERE [Index]-1=Len(@String) 

Unlike the previous two solutions, the CTE approach involves more work, work that maybe isn’t required just for the sake of using the technique. When multiple alternative approaches exist, I prefer using the (simplest) solution that offers the highest degree of flexibility. In this case that’s the loop, though in other cases it could be a recursive function or a CTE.

30 November 2010

Business Intelligence: How Many Reporting Systems Do You Need?


I hope nobody’s questioning the fact that an organization needs at least one reporting system in order to take advantage of the huge amount of data it has collected over time, to have an overview of what’s happening in the organization, to take better decisions supported by data, etc. In one way or another many organizations end up having two or more reporting systems in place, together with all the consequences deriving from that. From experience, I would expect that many professionals have asked themselves the above question at least once, often having to live with their decisions, and they are not always happy about them. 

On one side we have the demand, the needs of the various departments and groups existing in an organization, while on the other side of the balance we have the various types of data and the technologies existing on the market (which could be regarded as the supply), and the various constraints an organization might have: resources (financial, human, processing power, time), politics and philosophies, geography, complexity, etc. If they don’t seem that many, consider that the respective constraints can be further split into a long chain of causality; the books on Project Management, Architecture, Methodologies and other related topics treat them in detail, so no need to go there.

For practical reasons, I’d like to reformulate the above question as follows:
1. Is one reporting system enough, or do we need more than one?
2. What’s the optimum number of reporting systems in an organization?
 
A blunt answer to the first question would be that an organization might need as many reporting systems as are needed in order to satisfy the existing reporting needs, in other words the demand. A straightforward answer, but not really acceptable for a Manager, in particular, and an IT Professional, in general. Another blunt answer, this time to the second question, would be that an organization needs only “one and good” reporting system, though that contradicts the various constraints existing in the IT world. So, what should it be? 

Typically there are two views: the bottom-up view, in this case starting with the data and building up to the reporting demands, and the top-down view, starting from the reporting requirements and going down to the data. In this play-set the technologies in general, and the reporting systems in particular, play the role of middle point, the two views applying to data vs. reporting technologies, respectively reporting technologies vs. reporting requirements, which increases the complexity of the picture. Sometimes it is enough to focus only on the reporting requirements, so why do we also consider the data in this scenario? Remember, reporting is all about harnessing the value of your data, not only about finding the right reporting solution.

Often, in order to understand the requirements of an organization and the value of its data, it’s easier to look at the various types of data it stores (the bottom-up approach). Typically we can make a distinction between master vs. transactional data, raw vs. aggregated data, structured vs. semi-structured vs. non-structured data, cleaned vs. not cleaned data, historical vs. live data, mined, distributed or heterogeneous data, to mention just a few of the important ways of categorizing the data.

The respective types of data are important because they require special types of techniques or tools to process, report on and harness them. The master and transactional data are at the heart of business applications, the “fluid” that flows through and nourishes them, with an incredible potential when harnessed at its real value. Such unaltered data are typically structured at macro level, though they could also contain semi-structured or non-structured chunks of data, could have different levels of quality, could be distributed or heterogeneous, etc. The data repositories range from text or document systems to specialized databases, any mixture being possible in theory.

So we have a huge volume of data collected over time; how could we harness it in order to reveal its real value? Excepting the operational use, the usefulness of raw data, data found in its unaltered form, usually stops at the point where its complexity falls beyond people’s capacity to understand it. Thus we came to aggregate, recombine or extract chunks from the data, bringing the data to a more comprehensible form. We even came to extract more information out of the data using specific techniques known as data mining methods, algorithms that allow us to identify associations, cluster or interpolate the data, the complexity of such algorithms having evolved considerably in the past two decades.
 
There are few software applications which provide the whole range of data processing types, most of them being limited to recombining and presenting the data. Data presentation also offers a considerable range of formats (e.g. tabular, text, XML, Excel, charts and other diagrams) with complex navigational functionality (slice-and-dice, drill-down, drill-through, click-through), different ways of making the data available for consumption (direct access, cached, web services), different security levels, etc.

When it comes to managers and users, they want the data at the level of detail and in the form that allows the easiest understanding, and they want it “yesterday”, to use a word that reflects the urgency of the requirements. This could in theory be done by coming up with a whole set of reports covering all possible requirements, though that’s not so efficient as an investment and not always possible with a button click, as probably all of us would like. 

For the ones familiar with Manufacturing, it’s actually a disguised push vs. pull scenario: pushing the reports to the users without waiting for their demands vs. waiting for the users to forward the reporting requirements, from which new reports are eventually derived or changes are made to existing ones. In Manufacturing it’s all about finding the right balance between push and pull, and even if that field has found some (best) practices that lead to the middle way, in IT we still have to dig for it. The road seems to lead nowhere... but it helps getting a rough understanding of what a report involves, at least in what concerns the important non-programming stuff.

In the top-down approach, as the data remain relatively constant, it’s easier to focus on the set of requirements and the reporting tool(s) used. It makes sense to choose the reporting tool that covers at least the most important requirements, while for the others you have the choice of using workarounds or addressing them with two or more reporting solutions, attempting again to cover the most important requirements, and so on. The problem with having more than one reporting solution is that often data, logic and reports end up being replicated, the whole reporting infrastructure becoming more difficult to manage. On the other side, as long as there is a clear (partitioned) delimitation between the reporting frameworks, logic and data are replicated at a minimum, and having two or more reporting solutions seems to be an acceptable trade-off. Examples of such delimitations are the OLTP vs. OLAP solutions, to which the data mining reporting solution(s) can be added.

There are also attempts to extend an existing reporting framework beyond its initial functionality, though often people end up reinventing the wheel or creating little monsters they have to live with. This too is an acceptable approach, as long as the reporting framework allows it. Some attempts might succeed in providing what they were designed for, others not. There is also some good news from this perspective: the appearance of mash-ups and semantic technologies could in the future allow the integration of reporting systems, making possible things that nowadays are quite challenging. The future is open to unlimited possibilities, but better stick to the present! For that stay tuned for a second post!


About Me

IT Professional with more than 24 years of experience in IT, covering the full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.