31 December 2010

SQL Troubles - Past, Present and Future

Introduction

   Even if I started to blog 3-4 years ago, only this year (still 2010) I started to allocate more time for blogging, having two blogs on which I try to post something periodically: SQL Troubles and The Web of Knowledge plus a homonym Facebook supporting group (for the later blog). As a parenthesis, the two blogs are approaching related topics from different perspective, the first focusing on data related topics, while the second approaching data from knowledge and web perspective; because several posts qualify for both blogs, I was thinking to merge the two blogs, though given the different perspectives and types of domains that deal with them, at least for the moment I’ll keep them apart. Closing the parenthesis, I would like to point out that I would love to allocate more time though I have to balance between blogging, my professional and personal life, and even if the three have many points in common, some delimitation it’s necessary. Because it’s the end of a year, I was thinking that it’s maybe the best time to draw the line and analyze the achievements of the previous year and the expectations for the next year(s), for each of the two blogs. So here are my thoughts:

Past and Present

    There are already more than 10 years since I started to work with the various database systems, my work ranging from data modeling to database development, reporting, ERP systems, etc. I can’t consider myself an expert, though I’ve accumulated experience in a whole range of areas, fact that I think entitles me to say that I have something to write about, even if the respective themes are not rocket science. In addition, it’s the human endeavor of learning something new each day, and in IT that’s quite an imperative, the evolvement of various technologies requesting those who are working in this domain to spend extra hours in learning new things or of consolidating or reusing knowledge in new ways. I considered at that time, and I still do, that blogging helps the learning process, allowing me to externalize the old or new knowledge, clear my thoughts, have also some kind of testimony of what I know or at least a repository of information I could reuse when needed, and eventually receive some feedback. These are few of the reasons for which this blog was born, and I hope the information presented in here are useful also for other people.

  During the past year I made it to post on my blog more than 100 entries on various topics, the thematic revolving around strings, hierarchical queries, CLR functionality, Data Quality, SSIS, ERPs, Reports, troubleshooting, best practices, joins, etc. Not all the posts rose to my expectations, though that’s a start, hoping that I will find a personal style and the quality of the posts will increase. I can’t say I received lot of feedback, however based on the user access’ statistics provided by Clustrmaps and Google the number of visitors this year was somewhere around 8500, close to my expectations. Talking about the number of visitors, it’s nice to have also some visualization, so for this year’s statistics I’ll use Clustrmaps visualization, which provides a more detailed geographical overview than Google’s Stats, while for trending I show below Google’s Stats (contains data from May until today):

The Web of Knowledge - Clustrmaps 2010 statistics

The Web of Knowledge - Google 2010 statistics

     What I find great about Google’s Stats is that it provides also an overview of the most accessed posts and the traffic sources. There are also some statistics of the audience per browsers and OS, though they are less important for my blogging requirements, at least for the moment.

The Web of Knowledge - pageviews by OS The Web of Knowledge - pageviews by browsers

  What I find interesting is that most visited posts and searched keywords were targeting SSIS and Oracle vs. SQL Server-related topics. So, if for the future I want more traffic than maybe I should diversify my topics in this direction.

Future

  I realize that I started many topics, having in the next year to continue posting on the same, but also targeting new topics like Relational Theory, Business Intelligence, Data Mining, Data Management, Statistics, SQL Server internals, data technologies, etc. Many of the posts will be an extension of my research on the above topics, and I was thinking to post also my learning notes with the hope that I will receive more feedback. I realized that I need to be more active and provide more feedback to other blogs, using the respective comments as gateways to my blog and try to build a network around it. I was thinking also to start a Facebook “support group”, posting the links I discovered, quotes or impressions in a more condensed form, but again this will take me more time, so I’m not sure if it makes sense to do that. Maybe I should post them directly on the blog, however I wanted my posts to be a little more consistent than that. Anyway I know also that I won’t manage to post more than an average of one post per week though per current expectations is ok.

  Right now all the posts are following a push model, in other words I push the content independently of whether there is a demand or not for it. It’s actually natural because the blog is having a personal note. In the future I’m expecting to move in the direction of a pull model, in other words to write on topics requested by readers, however for this I need more feedback from you, the reader. So please let me know what topics you’d like to read!

  I close here, hoping that the coming year (2011) will be much better than the current one. I wish to all of you, a Happy New Year!

14 December 2010

Pulling the Strings of SQL Server – Part III: Concatenation

    Typically a database in general, and a table in particular, that follows the normalization rules, is designed to have the columns contain the smallest semantic chunks of data, it could be a Street, a Zip Code, City, a person’s First or Last Name, but also a large chunk of text like a Description or a Comment. No matter how well designed is a database, there will always be the need to do various operations with strings, typically concatenation, extraction of subpart of a string, insertion or deletion of characters, rearangement of string’s characters, trimming, splitting it in substrings, or of getting various numeric values: length, position of a given text, number of not empty characters, on whether the text represents a valid numeric or date values, etc. In the following posts I will attempt to address the respective operations in the context of select statements, and let’s start with concatenation.

    Concatenation is the operation of joining two or more string values within an expression. In SQL Server the “+” operator is used for concatenation, and it could be used to concatenate two or more members. In order to concatenate two members from which one of them is a string, the second term needs to be explicitly converted to a string data type, otherwise an error will occur. For readability or post-processing purposes, the strings are concatenated using a delimiter in order to delimit the boundaries of the initial value, it could be used a space, a comma, a semicolon, a pipe, a tab or any other character that could be used typically in delimiting columns.

-- concatenation of strings
SELECT 'a string' + ' and ' + 'a second string' Example1
, 'a string' + ',' + 'a second string' Example2

, '1245' + '67890' Example3

, '1245' + '.' + '67890' Example4


   The concatenation of string variables or columns functions based on the same principles:
-- concatenating string variables
DECLARE @string1 varchar(50)
DECLARE
@string2 varchar(50)

DECLARE
@string3 varchar(50)


SET @string1 = 'this is a string'
SET
@string2 = 'this is another string'


SELECT @string1 + ' and ' + @string2 Example1
,
@string1 + char(31) + @string2 Example2

,
@string1 + ', ' + @string2 Example3

,
@string1 + ' ' + @string3 Example4

,
@string1 + IsNull(@string3, '!') Example5

concatenation 2 

    Here’s another example based on the concatenation of columns coming from two joined tables from AdventureWorks database:
-- concatenating columns of joined tables
SELECT PAD.AddressID
,
IsNull(PAD.AddressLine1, '')

+
IsNull(', ' + PAD.AddressLine2, '')
+
IsNull(', ' + PAD.City, '')
+
IsNull(', ' + PAD.PostalCode, '')
+
IsNull(', ' + PSP.Name, '') Address
FROM
Person.Address PAD
      JOIN Person.StateProvince PSP
         ON PAD.StateProvinceID = PSP.StateProvinceID

concatenation 3
    As stressed in the previous post, the NULL values need to be adequately handled either by initializing values or by using the IsNull or COALESCE functions. The concatenation of strings combined with IsNull function could be used creatively in order to add a comma only when a value is not null, as in the above example.

   There are scenarios in which is needed to concatenate the values belonging to the same column but from different records, for example concatenating the email values in order to send a single email to all the users in one single action. Before the introduction of common table expressions (CTE), wasn’t possible to concatenate the string values belonging to different records, at least not in a query, this functionality being achieved by using cursors or loops, or simply performed on client or intermediate layers. As I already gave an example on how to use cursor in order to loop through the values of a table and concatenate them (see “Cursors and Lists” post), I will focus on the use of loops and simple CTEs.

    Loops are one of the basic functionality in programming languages, no matter of their complexity or type. Either if are talking about WHILE, FOR, foreach or do … until loops, the principles are the same: perform a list of actions until one or more conditions are met. In this case the actions performed is reduced to a set of concatenations based on the letters of the (English) alphabet:
-- concatenation within a loop
DECLARE @list varchar(max)
DECLARE
@index int

SET
@list = ''

SET
@index = ASCII('a')
WHILE (@index<ASCII('z'))
BEGIN

     SET @list = @list + ', ' + Char(@index)

     SET @index = @index + 1

END
SELECT
@list Result


    There is more work that needs to be performed in order to remove the leading comma from the output, but that’s a topic for the next post, when discussing about decomposition of strings.

    CTEs are a little more complex to use than the loops, though the concatenation could be achieved across records and this in one query and not in procedural language as in the above example. In order delimit the two components of a CTE, I made use of a second CTE which simulates the existence of a given table:
-- concatenation across records
WITH Data(Column1, Ranking)
AS 
(
-- preparing test data

       SELECT 'A' Column1, 0 Ranking
       UNION ALL 
       SELECT 'B' Column1, 1 Ranking
       UNION ALL
       SELECT 'C' Column1, 2 Ranking
)
, Result(Column1, Ranking)

AS
(
  -- performing the actual concatenation
      SELECT Cast(Column1 as nvarchar(max)) Column1 , Ranking FROM Data WHERE Ranking = 0
      UNION ALL
      SELECT Cast(B.Column1 + ',' + A.Column1 as nvarchar(max)) Column1 , A.Ranking
      FROM Data A
         JOIN Result  B
             ON A.Ranking - 1 = B.Ranking
)
SELECT Column1
FROM Result

WHERE Ranking IN (SELECT MAX(Ranking) FROM Result


    The logic for doing a simple concatenation seems maybe complicated, though the volume of work is not so big if we ignore the first CTE. On the other side I introduced an “index” within the Ranking column, fact that allows processing easier the records. When dealing with the records coming from a table it’s probably much easier to use one of the ranking functions that suits best.

10 December 2010

Pulling the Strings of SQL Server–Part II: Creation and Selection

    It doesn’t make sense to talk about the creation of string data types without talking about their “selection” as people typically want to see also examples at work, and not only learn about theoretical facts. I’m saying that because often programming seems to be like putting together the pieces of a puzzle without knowing how the final image would look like. Anyway, coming back to the topic, what I find great about SQL Server is that it allows to “create” and select a string with a minimum overhead within a simple select statement:
-- selecting a string value
SELECT 'this is a string' -- string 1
, 'and this is a second string' --string2
Here is the same statement but this time using named columns:
-- selecting a string value (named columns) 
SELECT 'this is a string' String1
, 'and this is a second string' String2
Strings - example1

    Pretty simple, isn’t it? This kind of scripts could be useful especially when debugging logic or testing an expression (e.g. functions at work), the advantage residing in the fact that is not necessary to built a table in order to test simple things. When performing multiple operations with the same values could be handy to store the values in declared variables:
-- selecting string variables
DECLARE @string1 varchar(50) DECLARE @string2 char(50)
DECLARE @string3 char(50) = 'this is a string'
SET @string1 = 'this is a string'

SELECT @string1 String1
, @string2 String2
, @string3 String3
Strings - example2
    As can be seen a not initialized variable has by default the value NULL, fact that could lead to unexpected behavior problems when used in expressions involving multiple variables. Therefore it’s recommended to initialize the values with the empty string or at least to handle the nulls in expressions. Starting with SQL Server 2008 it’s possible to declare and initialize variables within the same statement, so the above code won’t work in previous versions, unless the third declaration is broken into two pieces as per definition of first string.

    It looks like that’s all, at least in what concerns the direct declaration of a string, be it implicit or explicit. However a basic introduction into strings’ creation is incomplete without mentioning the creation of strings from other data types and the various functions that could be used for this purpose. Ignoring the Cast and Convert functions used to explicitly convert the other data types to string data types, there are several other functions for this purpose, namely Char, Space and Replicate.

   Probably a few of you are already familiar with the ASCII (American Standard Code for Information Interchange) character-encoding scheme (vs. binary encoding) used to encode files. ASCII code represents a mapping between number and characters, SQL Server supporting the transformation between two sets through the ASCII and Char functions. If ASCII translates a non-unicode character into an integer, the Char function translates an integer value into a non-unicode character. The integer values range between 0 and 255, they encompassing the 0-9 digits, the characters of English together with the diacritics of other important languages, punctuation signs and several graphical characters. The mapping between the two sets is unique, and as can be seen from the below example based on a common table expression, the functions are inverse:
-- ASCII character values
WITH CTE
AS
(
  
  SELECT 0 [Index]
    UNION
ALL

    SELECT
[Index]+1 [Index]

    FROM
CTE

    WHERE
[Index]<255

)
SELECT [Index]
,
CHAR([Index]) [Character]

,
ASCII(CHAR([Index])) [ASCII]

FROM
CTE

OPTION
(MAXRECURSION 0)


    There is not much to say about the Space and Replicate functions, the Space function returns a string of repeated spaces, while the Replicate function forms a string as a repeated sequence of values. The definition of the Space function could be considered as redundant as long the same output could be obtained by using the space as repeating sequence.
-- Space & Replicate at work
SELECT Space(1) Example1
,
Space(5) Example2

,
Replicate(' ', 1) Example3

,
Replicate(' ', 5) Example4

,
Replicate('10', 1) Example5

,
Replicate('10', 2) Example6

,
Replicate('10', 3) Example7

Replicate
    As per the last statement, the first and third examples, respectively the second and fourth example will return the same values, unfortunately the output of the respective examples it’s not so easy to see. For exemplification, it could have been enriched by comparing or concatenating the strings, though that’s a topic for the next post. 

08 December 2010

Pulling the Strings of SQL Server–Part IX: Special Characters

    Under special characters denomination are categorized typically the characters that don’t belong to the alphabet of a given language, typically English (a-z, A-Z), or the numeric digits (0-9). Characters like umlauts (e.g. ä, ë, ö, ü, ÿ), accents (e.g. é, í, ó, ú) and other type of characters used in specific languages like German, French, Turkish, Hungarian, Romanian, Dutch or Swedish, together with punctuation signs or graphical characters are falling under the designation of special characters. It’s true that SQL Server, like many other databases, supports Unicode, a standard designed to encode such characters, though not all database designs are taking Unicodes into account. Preponderantly the old database and other software solutions use the non-unicode string data types (char, varchar, text) fact that makes the special characters to be displayed inadequately, sometimes undecipherable. In order to avoid this behavior could be decided to port the data types to unicode or use only the standard characters of English language, both solutions with their positive and negative aspects. In such cases, especially during migration project or ETL tasks, eventually as part of a Data Quality initiative, it’s preferred to identify and replace the special characters in a manual, automatic or semi-automatic fashion. In addition, there are also cases in which, from various reasons, some attributes are not allowed or should not include special characters, and also this aspect could be included in a Data Quality initiative.

    During assessment of Data Quality, in an organized or less organized manner, a first step resides in understanding the quality of the data. In the current case this resumes primarily in identifying how many records contain special characters, how could be the data cleaned, and the “cost” for this activity. Actually, before going this far, must be defined the character sets included in special characters, the definition could vary from case to case. For example in some cases could be considered also the space or the important punctuation signs as valid characters, while in others they may not be allowed. There could be identified thus multiple scenarios, though I found out that the range of characters a-z, A-Z, 0-9 and the space are considered as valid character in most of the cases. For this purpose could be built a simple function that iterates through all the characters of a string and identifies if there is any character not belonging to the before mentioned range of valid characters. In order to address this, a few years back I built a function similar with the below one:
-- checks if a string has special characters
CREATE FUNCTION dbo.HasSpecialCharacters(
@string nvarchar(1000))

RETURNS
int
AS

    BEGIN
         DECLARE @retval int
         DECLARE @index int

         DECLARE @char nchar(1)
 

         SET @retval = 0
         SET @index = 1 


         WHILE (@index <= IsNull(len(@string), 0) AND @retval=0)

        BEGIN
           SET @char = Substring(@string, @index, @index+1)

           IF NOT (ASCII(@char) BETWEEN 48 AND 57 -- numeric value

             OR ASCII(@char) BETWEEN 65 AND 90 -- capital letters

            OR ASCII(@char) BETWEEN 97 AND 122 -- small letters

            OR ASCII(@char) = 32) --space

           BEGIN

                SET @retval = @index
           END

           ELSE

               SET @index = @index + 1
     END
 

    RETURN (@retval)

END

    Function’s logic is based on the observation that the ASCII of numeric values could be found in the integer interval between 48 and 57, the capital letters between 65 and 90, while the small letters between 97 and 122. By adding the ASCII for space and eventually several other characters, the check on whether an character is valid resuming thus to only 4 constraints. Here’s the function at work:

-- testing HasSpecialCharacters function
SELECT dbo.HasSpecialCharacters('kj324h5kjkj3245k2j3hkj342jj4') Example1
,
dbo.HasSpecialCharacters('Qualität') Example2

,
dbo.HasSpecialCharacters('Änderung') Example3

,
dbo.HasSpecialCharacters('') Example4

,
dbo.HasSpecialCharacters(NULL) Example5

,
dbo.HasSpecialCharacters('Ä') Example6

,
dbo.HasSpecialCharacters('ä') Example7

,
dbo.HasSpecialCharacters('a') Example8

special characters 1

    As can be seen, the function returns the position where a special character is found, fact that enables users to identify the character that causes the problem. A similar function could be built also in order to count the number of special characters found in a string, the change residing in performing a counter rather then returning the position at the first occurrence of a special character.

    The function might not be perfect though it solves the problem. There are also other alternatives, for example of storing the special characters in a table and performing a simple join against the target table. Another solution could be based on the use RegEx functionality, either by using OLE automation or directly CLR functionality. There could be done variations on the above solution too by limiting to check on whether the characters of a string are falling in the range projected by the ASCII function. That’s what the following function does:

-- checks if a string has special characters falling in an interval
CREATE FUNCTION dbo.HasCharNotInASCIIRange(
@string nvarchar(1000)
,
@start int

,
@end int)

RETURNS
int
AS

BEGIN
    
    DECLARE @retval int
    DECLARE @index int

    DECLARE @char nchar(1)
 
    SET @retval = 0
    SET @index = 1

    WHILE (@index <= IsNull(len(@string), 0) AND @retval=0)

    BEGIN
         SET @char = Substring(@string, @index, @index+1)

         IF NOT (ASCII(@char) BETWEEN @start AND @end)
        BEGIN

              SET @retval = @index
        END

        ELSE

             SET @index = @index + 1
     END

     RETURN (@retval)

END

   With this function are necessary 4 calls in order to identify if a string contains special characters, though we loose the flexibility of identifying the first character that is invalid. We could still identify the first occurrence by taking the minimum value returned by the 4 calls, however, unlike Oracle (see Least function), SQL Server doesn’t have such a function, so we’ll have eventually to built it. Anyway, here’s the above function at work:
-- testing HasCharNotInASCIIRange function
SELECT dbo.HasCharNotInASCIIRange('k12345', 48, 57) Example1
,
dbo.HasCharNotInASCIIRange('12k345', 48, 57) Example2

,
dbo.HasCharNotInASCIIRange('12345', 48, 57) Example3

,
dbo.HasCharNotInASCIIRange(' 12345', 48, 57) Example4

,
dbo.HasCharNotInASCIIRange('12345 ', 48, 57) Example5

,
dbo.HasCharNotInASCIIRange('', 48, 57) Example6

,
dbo.HasCharNotInASCIIRange(NULL, 48, 57) Example7

,
dbo.HasCharNotInASCIIRange('', 48, 57) Example8

,
dbo.HasCharNotInASCIIRange('a', 48, 57) Example9

,
dbo.HasCharNotInASCIIRange('Ä', 48, 57) Example10

,
dbo.HasCharNotInASCIIRange('ä', 32, 32) Example11


special characters 2

07 December 2010

Pulling the Strings of SQL Server–Part I: An Introduction

    The (character) string or simply the character data type, how is named by the MSDN documentation, is one of the primary data types available in a database and probably the most complex given the fact that it can encompass any combination of numeric, literal, date or any other primary data type. In fact it could include any type of chunk of text that could be written in any language as SQL Server supports Unicode and thus most of the (written) languages. In this post I’m not intending to make a complete anthology of strings, and neither to retake the whole volume of information available in MSDN or other important books. My intent is to look at strings from a slightly different perspective, considering the various functions involving strings and how they could be used in order to provide various functionality, in fact the cornerstone of everyday developer.

   A few things to remember:

1. there were mainly two types of non-unicode strings: the char (or character) of fixed length, respectively the varchar of variable length (varying character)

2. if initially both types of strings were having a maximum length of 8000 of characters, with SQL Server it’s possible to have a varchar with maximum storage size, declared as varchar(max).

3. if in the past there were some strict recommendations in what concerns the use of char or varchar, nowadays the varchar tends to be used almost everywhere, even for the strings of length 1.

4. talking about length, it denotes the number of chracters a string stores.

5. the trailing spaces, the spaces found at the right extremity of a string are typically ignored, while the leading spaces, the spaces found at the left extremity, are not ignored.

6. starting with SQL Server 2000, for the two character data types were introduced the corresponding unicode data types prefixed with n (national): nchar, respectively nvarchar.

7. given the fact that a unicode character needs more space to store the same non-unicode string, actually the number of bits doubles, the maximum length for an unicode string is only 4000.

8. there is also a non-unicode text, respectively ntext unicode data type, designed to store maximum length, though as it seems they could become soon deprecated, so might be a good idea to avoid it.

9. not-initialized variables, including strings, have the value NULL, referred also the NULL string, therefore it’s always a good idea to initialize your variables.

10. by empty string string is designated the string containing no character “’’”, and has the length 0.

11. there are several specific functions available for the creation, manipulation and conversion of strings from and to other data types.

12. not all of the aggregated functions work with string values (specifically the ones requesting a number value like SUM, AVG, STDV).

13. the operations performed on strings of different data types are generally not impacted by this aspect, though there are some exceptions.

14. there are several (basic) operations with strings, typically concatenation, extraction of subpart of a string, insertion, replacement or deletion of characters, rearangement of string’s characters, trimming, splitting it in substrings (decomposition), etc.

15. there are several numeric values based on strings: length, position of a given text in a text, number of not empty characters, encoding of a character, on whether the text represents a valid numeric or date values, etc.

 

Please note that this is work in progress so take it as such!!! I’m trying to put together things learned long time ago, so maybe I will have to reformulate and add new things on the list. Any additions are more than welcome!