About Me

IT Professional with more than 16 years of experience in IT, especially in the full life cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Saturday, September 22, 2007

6 out of 49 – Data Analysis (Part 2)

    In a posting I read a long time ago, it was mentioned that it's interesting to see in which value intervals the data of a data set fall. It was recommended to split the interval into 10 buckets; the only problem is that we don't know the right extremity of our interval. We could use the maximum value:

-- get the maximum number of consecutive drawings for which a number hasn't appeared
SELECT Max(DifferenceWeeks)
FROM vLottoConsecutiveDrawings

    But this doesn't guarantee that the extremity won't be exceeded by future drawings. From my point of view, if the maximum is 63, then I'll take 70 as the extremity and split it into 10 buckets: the first interval will be 1 … 7, the second 8 … 14, etc. Thus, I'm taking the GROUP BY query I used in the previous post and adding the additional logic for the split:

SELECT Number
, Min(NullIf(DifferenceWeeks, 0)) Minimum
, Max(NullIf(DifferenceWeeks, 0)) Maximum
, Avg(NullIf(DifferenceWeeks, 0)) Average
, count(*) NumberDrawings
, SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END) Range1_7
, SUM(CASE WHEN DifferenceWeeks BETWEEN 8 AND 14 THEN 1 END) Range8_14
, SUM(CASE WHEN DifferenceWeeks BETWEEN 15 AND 21 THEN 1 END) Range15_21
, SUM(CASE WHEN DifferenceWeeks BETWEEN 22 AND 28 THEN 1 END) Range22_28
, SUM(CASE WHEN DifferenceWeeks BETWEEN 29 AND 35 THEN 1 END) Range29_35
, SUM(CASE WHEN DifferenceWeeks BETWEEN 36 AND 42 THEN 1 END) Range36_42
, SUM(CASE WHEN DifferenceWeeks BETWEEN 43 AND 49 THEN 1 END) Range43_49
, SUM(CASE WHEN DifferenceWeeks BETWEEN 50 AND 56 THEN 1 END) Range50_56
, SUM(CASE WHEN DifferenceWeeks BETWEEN 57 AND 63 THEN 1 END) Range57_63
, SUM(CASE WHEN DifferenceWeeks BETWEEN 64 AND 70 THEN 1 END) Range64_70
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END)/count(*) Percentage1_7
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 8 AND 14 THEN 1 END)/count(*) Percentage8_14
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 15 AND 21 THEN 1 END)/count(*) Percentage15_21
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 22 AND 28 THEN 1 END)/count(*) Percentage22_28
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 29 AND 35 THEN 1 END)/count(*) Percentage29_35
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 36 AND 42 THEN 1 END)/count(*) Percentage36_42
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 43 AND 49 THEN 1 END)/count(*) Percentage43_49
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 50 AND 56 THEN 1 END)/count(*) Percentage50_56
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 57 AND 63 THEN 1 END)/count(*) Percentage57_63
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 64 AND 70 THEN 1 END)/count(*) Percentage64_70
FROM vLottoConsecutiveDrawings
GROUP BY Number
ORDER BY Number

    As can be seen, I just summed the number of cases in which a value fell in the given interval:

SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END) Range1_7

    Additionally, I wanted to see the percentage of occurrences falling in a given interval out of the total number of occurrences:

100*SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END)/count(*) Percentage1_7

    How can we benefit from such output?
    We can check for each number which bucket has the highest percentage; usually only 1-2 intervals qualify. The percentage is the key to finding useful information. For example, if there is only one interval with 75% of the occurrences, then most probably the number will appear in a drawing within that interval; if the number didn't appear in the respective interval, then the probability for that number to appear increases.
When we have multiple intervals with a considerably high percentage, and the right extremity of the highest interval is exceeded, the chances for a number to appear are considerably higher.
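    As a side note, the hardcoded BETWEEN ranges can also be derived generically with integer division, which avoids retyping the CASE branches when the bucket width changes. Below is a minimal sketch of this alternative against the same vLottoConsecutiveDrawings view; the bucket width of 7 and the upper extremity of 70 match the example above:

-- bucketing via integer division: bucket 1 covers 1..7, bucket 2 covers 8..14, etc.
SELECT Number
, (DifferenceWeeks - 1) / 7 + 1 Bucket
, count(*) NumberDrawings
FROM vLottoConsecutiveDrawings
WHERE DifferenceWeeks BETWEEN 1 AND 70
GROUP BY Number
, (DifferenceWeeks - 1) / 7 + 1
ORDER BY Number, Bucket

    The output is one row per number and bucket instead of one column per bucket, so it's better suited for further processing than for side-by-side reading.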

6 out of 49 – Data Analysis (Part 1)

    One of the interesting things to see in a data set of drawings is the distance between two drawings in which the same number appears. It doesn't look too easy to compare two consecutive records within the same query; actually this can be done easily with the help of the DENSE_RANK function (works only on SQL Server 2005+ and Oracle), which ranks the data within a partition, thus if two values are identical, they have the same ranking. Let's see how the DENSE_RANK function works:

SELECT Number
, DrawingDate
, DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking
FROM vLottoNumbers

    The partition is created by Number, while the ranking is driven by DrawingDate. Now all we have to do is to join two such queries by Number on two consecutive Rankings:

--creating vLottoConsecutiveDrawings view
CREATE VIEW vLottoConsecutiveDrawings
AS
SELECT A.Number
, A.DrawingDate
, B.DrawingDate NextDrawingDate
, IsNull(DateDiff(d, A.DrawingDate, B.DrawingDate), 0) DifferenceDays
, IsNull(DateDiff(wk, A.DrawingDate, B.DrawingDate), 0) DifferenceWeeks
FROM (
    SELECT Number
    , DrawingDate
    , DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking
    FROM vLottoNumbers
) A
LEFT JOIN (
    SELECT Number
    , DrawingDate
    , DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking
    FROM vLottoNumbers
) B
ON A.Number = B.Number
AND A.Ranking = B.Ranking - 1

--checking the data
SELECT *
FROM vLottoConsecutiveDrawings
ORDER BY Number, DrawingDate

    The first interesting things to find out are:
    What's the average number of occurrences?
    What's the minimum/maximum number of consecutive drawings in which a number hasn't appeared?
    How many times has a number appeared?
    The following query answers these questions by doing a simple grouping by Number over the vLottoConsecutiveDrawings output:

SELECT Number
, Min(NullIf(DifferenceWeeks, 0)) Minimum
, Max(NullIf(DifferenceWeeks, 0)) Maximum
, Avg(NullIf(DifferenceWeeks, 0)) Average
, count(*) NumberDrawings
FROM vLottoConsecutiveDrawings
GROUP BY Number
ORDER BY Number

    The output doesn't give too much information, but it's a good start. The study interval can be narrowed by using the DrawingDate in the GROUP BY and/or WHERE clause:

SELECT Number
, Year(DrawingDate) DrawingYear
, Min(NullIf(DifferenceWeeks, 0)) Minimum
, Max(NullIf(DifferenceWeeks, 0)) Maximum
, Avg(NullIf(DifferenceWeeks, 0)) Average
, count(*) NumberDrawings
FROM vLottoConsecutiveDrawings
--WHERE Year(DrawingDate) IN (2000, 2001)
GROUP BY Number
, Year(DrawingDate)
ORDER BY Number, DrawingYear
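    A side note, as a minimal sketch and not part of the original solution: on later SQL Server versions (2012 and above) the LEAD window function returns the following row's value within a partition directly, so the self-join on consecutive rankings becomes unnecessary:

-- LEAD-based variant of the view body (SQL Server 2012+ only)
SELECT Number
, DrawingDate
, LEAD(DrawingDate) OVER (PARTITION BY Number ORDER BY DrawingDate) NextDrawingDate
, IsNull(DateDiff(wk, DrawingDate, LEAD(DrawingDate) OVER (PARTITION BY Number ORDER BY DrawingDate)), 0) DifferenceWeeks
FROM vLottoNumbers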

6 out of 49 – Modeling the data (Part 1)

    In a previous posting I created a random set of data and stored them in the LottoExtracts table. The first step is done; now we have to model the data in a form which could be useful for our analysis. Thus it makes sense to split the table in two tables:
    LottoDrawings, containing mainly the DrawingDate and possibly additional information about the extraction (e.g. Country, System, etc.). For simplicity I'm including only the Drawing Date.
    LottoNumbers, containing the numbers and their order in the extraction.
    Here are the tables:

--creating LottoDrawings table
CREATE TABLE LottoDrawings(
DrawingID int IDENTITY(1,1) NOT NULL,
DrawingDate smalldatetime NULL
)

--creating LottoNumbers table
CREATE TABLE LottoNumbers(
NumberID int IDENTITY(1,1) NOT NULL,
DrawingID int NULL,
Position int NULL,
Number int NULL
)

    And the view which will be used as source for the analysis:

--creating vLottoNumbers view
CREATE VIEW vLottoNumbers
AS
SELECT LN.NumberID
, LN.Position
, LN.Number
, LN.DrawingID
, LD.DrawingDate
, Year(LD.DrawingDate) DrawingYear
, Month(LD.DrawingDate) DrawingMonth
, DatePart(dd, LD.DrawingDate) DrawingDay
, DatePart(wk, LD.DrawingDate) DrawingWeek
FROM LottoNumbers LN
JOIN LottoDrawings LD
ON LN.DrawingID = LD.DrawingID

    Now that we have the objects to model the data, we need to populate the tables with the data from the LottoExtracts table:

--populating LottoDrawings table
INSERT LottoDrawings (DrawingDate)
SELECT DISTINCT DrawingDate
FROM LottoExtracts

--populating LottoNumbers table - first drawing number
INSERT LottoNumbers (DrawingID, Position, Number)
SELECT DrawingID, 1, N1
FROM LottoExtracts LE
JOIN LottoDrawings LD
ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0

    The above query inserted only the first number of each drawing; in order to insert the other 5 numbers, all you have to do is to replace 1 with 2 in the second line and run the query again. The insert for the second number becomes:

--populating LottoNumbers table - second drawing number
INSERT LottoNumbers (DrawingID, Position, Number)
SELECT DrawingID, 2, N2
FROM LottoExtracts LE
JOIN LottoDrawings LD
ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0

Run it again after replacing 2 with 3, 4, 5, and 6 respectively. Then check the output (you should have 1200 records: 200 drawings with 6 numbers each):

--checking the data
SELECT *
FROM vLottoNumbers

    Now that we have the data, it's time for analysis, in a next posting!
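    As an alternative sketch (my addition), the six inserts can be collapsed into one statement by unpivoting the positions with a small derived table of position numbers:

--populating LottoNumbers table - all six numbers at once
INSERT LottoNumbers (DrawingID, Position, Number)
SELECT LD.DrawingID
, P.Position
, CASE P.Position WHEN 1 THEN LE.N1 WHEN 2 THEN LE.N2 WHEN 3 THEN LE.N3 WHEN 4 THEN LE.N4 WHEN 5 THEN LE.N5 ELSE LE.N6 END Number
FROM LottoExtracts LE
JOIN LottoDrawings LD
ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0
CROSS JOIN (
    SELECT 1 Position UNION ALL SELECT 2 UNION ALL SELECT 3
    UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
) P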

6 out of 49 – Getting the data (Part 1)

    To study numbers' occurrences we need some data, and sometimes it's difficult to get data that satisfy our purpose. Fortunately, each lottery has a website from which data can be taken drawing by drawing or, when possible, as a data dump. Some of the lotteries have variations of 6/49:
Lottery Canada
Lotto Germany
Lotto France (*)
Lotto Belgium (*)
Lotto Romania (*)
Lotto Austria (*)
Lotto UK
    For other lotteries you can check The Lottery Site.
    Because I don't want to be involved in legal litigations, I will use a self-made set of data; for this let's create a table called LottoExtracts as below:

--creating the LottoExtracts table
CREATE TABLE LottoExtracts(
    DrawingDate datetime NULL,
    N1 int NULL,
    N2 int NULL,
    N3 int NULL,
    N4 int NULL,
    N5 int NULL,
    N6 int NULL
)

    This format will be used to upload the data in a staging table which is used later as source for our data model. As can be seen, DrawingDate stores the drawing's date, while N1, …, N6 store the drawn numbers (in the order they were picked).

--creating the data
DECLARE @index int
DECLARE @counter int
DECLARE @N1 int
DECLARE @N2 int
DECLARE @N3 int
DECLARE @N4 int
DECLARE @N5 int
DECLARE @N6 int
SET @index = 0
WHILE @index < 200
BEGIN
-- creating first number
SET @N1 = Cast(1000 * RAND() as int) % 49 + 1
-- creating second number
SET @counter = 1
WHILE @counter < 50 or @N2 = @N1
BEGIN
SET @N2 = Cast(1000 * RAND() as int) % 49 + 1
SET @counter = @counter + 1
END
-- creating third number
SET @counter = 1
WHILE @counter < 50 or @N3 IN (@N1, @N2)
BEGIN
SET @N3 = Cast(1000 * RAND() as int) % 49 + 1
SET @counter = @counter + 1
END
-- creating fourth number
SET @counter = 1
WHILE @counter < 50 or @N4 IN (@N1, @N2, @N3)
BEGIN
SET @N4 = Cast(1000 * RAND() as int) % 49 + 1
SET @counter = @counter + 1
END
-- creating fifth number
SET @counter = 1
WHILE @counter < 50 or @N5 IN (@N1, @N2, @N3, @N4)
BEGIN
SET @N5 = Cast(1000 * RAND() as int) % 49 + 1
SET @counter = @counter + 1
END
-- creating sixth number
SET @counter = 1
WHILE @counter < 50 or @N6 IN (@N1, @N2, @N3, @N4, @N5)
BEGIN
SET @N6 = Cast(1000 * RAND() as int) % 49 + 1
SET @counter = @counter + 1
END
--inserting a new drawing
INSERT LottoExtracts(DrawingDate, N1, N2, N3, N4, N5, N6)
VALUES (DateAdd(d, @index*7, '01/01/2000'), @N1, @N2, @N3, @N4, @N5, @N6)
SET @index = @index + 1
END

--checking the data
SELECT *
FROM LottoExtracts

    The following statement creates a random number in the 1 … 49 range (note that the modulus must be 49, not 48, for the whole range to be covered):

SET @N1 = Cast(1000 * RAND() as int) % 49 + 1

    Thus can be created all 6 numbers; however, because the same number can't be extracted twice in the same drawing, I had to add additional duplication checks (see also the set-based sketch at the end of this post):

-- creating third number
SET @counter = 1
WHILE @counter < 50 or @N3 IN (@N1, @N2)
BEGIN
SET @N3 = Cast(1000 * RAND() as int) % 49 + 1
SET @counter = @counter + 1
END

    I created only 200 records, which equates to almost 4 years of weekly history and should be enough for our purpose. Of course, you can increase or decrease the number of drawings!
Notes:
    The links could change in time; I provided them just to make the search easier. If they change, then tough luck!
    (*) denotes that historical data can be downloaded.
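    And the set-based sketch referenced above (my addition, not part of the original script): drawing 6 distinct numbers amounts to shuffling the numbers 1 … 49 and taking the first 6. Assuming SQL Server 2005 and the well-known, though undocumented, master..spt_values numbers table:

--6 distinct numbers for one drawing, in the random order they were "picked"
SELECT TOP 6 number
FROM master..spt_values
WHERE [type] = 'P'
AND number BETWEEN 1 AND 49
ORDER BY NewId()

    Each execution returns a different set of 6 distinct values, so the per-number duplication checks disappear; only the loop over the 200 drawings would remain.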

6 out of 49 – The Magic of Numbers (Introduction)

    The ancients had a special fascination for numbers, be they just simple counters or numbers with special qualities, numbers that bring memories, numbers that have powers, numbers that bring emotions, numbers that bring luck, and I could continue here with many more examples based on the role the numbers play in our lives. Such numbers may create a special universe, and a lot of people spent a lot of time, or even their lives, trying to discover insights in such universes. The numbers we collect from one or multiple sources could offer us interesting information; for example, analysts may look at prices or companies' profit evolution to observe trends in data, trends which can be used to learn about the past or even predict the future.
    I'm working often with numbers, better said with data, and frankly I like it. With a few nicely written queries, there could be some valuable information in each set of data, some little knowledge which can be gained.
    Probably the numbers charged with the most powerful emotional component are the Lotto numbers. Those 6 or 7 magic numbers change destinies: there are people who became incredibly rich over night, and there are people keeping alive all their life the dream to win the big pot that will change their lives. I will try, in a set of a few postings, to show a few SQL techniques based on drawing numbers. They can be used only to get insights on numbers' occurrence in drawings over time; they'll not make anybody rich, not without a lot of luck or inspiration.
    So, let's get into it! Data analysis presumes:
1. Having a set of data which can be studied.
2. Transforming the data into a form which can model the problem; often the data must be cleaned so they can be used.
3. Using a technique or set of techniques to study the data through one or multiple iterations.
4. Making use of the data, and believe me, that's the most difficult task, because it is based on the understanding level of each person. Just to exemplify: stock information is available to many people, but only those who combine the data with knowledge derived from experience and luck are successful.
    This was just a "short" introduction, stay tuned for more!

Monday, September 17, 2007

The Creators of New Worlds

    My first Project Manager used to say that "programming is poetry", while a popular joke on the www was proclaiming that "the programmer is the God". I suppose this comparison derives from the fact that programmers create incredible worlds with powerful words coming from light, a nice parallelism. Remaining in the same context, another joke tries to answer a few theological questions:
Q: "Did God really create the world in seven days?"
A: "He did it in six days and nights while living on cola and candy bars. On the seventh day he went home and found out his girlfriend had left him." (Thousand Lakes Web Pages, 1995)
    The mighty power of Email, the "God of Knowledge", brought me an interesting posting from DeviZen about the less seen and unseen facets of programmers' life:
"Is programming the road to ruin? Or is it that those with a predilection for detail and mental gymnastics find themselves drawn to it. Perhaps it simply exacerbates a pre-existing mindset. There are certainly other traits (stereotypical or not) that most programmers seem to share." (DeviZen, 2007)
    Whether you are a programmer or not, it's worth paying tribute to all the nice applications you use and reading DeviZen's post; from my point of view it's worth it!
References:
1. DeviZen. 2007. Programming Can Ruin Your Life. [Online] Available from: http://devizen.com/blog/2007/09/11/ruin/ (Accessed: 17 September 2007)
2. Thousand Lakes Web Pages. 1995. God as Computer Programmer. [Online] Available from: http://www.columbia.edu/~sss31/rainbow/god.qa.html (Accessed: 17 September 2007)
Resources (just kidding):
1. Omri Weismann's Web page. n.d. In the Beginning. [Online] Available from: http://www.cs.bgu.ac.il/~omri/Humor/Creation.html (Accessed: 17 September 2007)
2. Chrono Tron. 2006. God Programmes. [Online] Available from: http://chronotron.wordpress.com/2006/03/31/god-programmes-jokes/ (Accessed: 17 September 2007)

Saturday, September 15, 2007

Updating Data With Values From The Same Table

    I sometimes have to deal with cases in which I have to update data from a table with values from the same table across records. A simple example I can sketch is the one of Standard Prices, which are normally valid during a year. For example, if I had to reset the Standard Prices of a year to the ones from the previous year, how could this be done? Actually that's pretty simple; for exemplification purposes I will create a table having only 3 columns: ProductID, Year and StandardPrice:

--creating StandardPrices table
CREATE TABLE StandardPrices(
ProductID int
, [Year] int
, StandardPrice decimal(13,2)
)

    We need some sample data for it:

--creating test data
DECLARE @index int
SET @index = 0
WHILE @index < 50
BEGIN
INSERT StandardPrices(ProductID, [Year], StandardPrice)
VALUES (@index % 25+1, 2000 + (@index/25), 100 * RAND())
SET @index = @index + 1
END

    Let's see the content; the RAND() function allows me to assign random prices to each product. As can be seen, there are 25 products with Standard Prices for the years 2000 and 2001.

--testing StandardPrices table's content
SELECT *
FROM StandardPrices
ORDER BY ProductID, [Year]

    Let's try to use a normal update:

--first update attempt
UPDATE StandardPrices
SET StandardPrice = A.StandardPrice
FROM StandardPrices A
WHERE StandardPrices.ProductID = A.ProductID
AND StandardPrices.[Year] = 2001
AND A.[Year] = 2000

    Oops, I got an error (SQL Server 2005):

Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "StandardPrices.ProductID" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "StandardPrices.Year" could not be bound.

    SQL Server 2000:

Server: Msg 107, Level 16, State 3, Line 1
The column prefix 'StandardPrices' does not match with a table name or alias name used in the query.
Server: Msg 107, Level 16, State 1, Line 1
The column prefix 'StandardPrices' does not match with a table name or alias name used in the query.

    It seems we can't reference the same table like this in such updates, but we can use a subquery or view which refers to the same table, so we can write:

--the update
UPDATE StandardPrices
SET StandardPrice = A.StandardPrice
FROM (
SELECT ProductID
, [Year]
, StandardPrice
FROM StandardPrices
WHERE [Year] = 2000
) A
WHERE StandardPrices.ProductID = A.ProductID
AND StandardPrices.[Year] = 2001

    Now there are no errors, 25 records updated. Let's see the output:

--testing StandardPrices table's content after update
SELECT *
FROM StandardPrices
ORDER BY ProductID, [Year]

That's all folks!
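    One more variant, as a minimal sketch: T-SQL also allows updating the table through an alias, which keeps both sides of the self-join on the base table and avoids the binding errors shown above:

--updating through an alias: both sides of the self-join use the base table
UPDATE A
SET A.StandardPrice = B.StandardPrice
FROM StandardPrices A
JOIN StandardPrices B
ON A.ProductID = B.ProductID
WHERE A.[Year] = 2001
AND B.[Year] = 2000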

PIVOT Operator Example

    Starting with SQL Server 2005, Microsoft introduced the PIVOT operator, which tries to implement the pivoting feature available in Excel; however, it does only static pivoting: the values on which the pivoting is done must be mentioned explicitly.
    In List Based Aggregations I gave an example of a query in which the On Hand was cumulated per Location; the same output can be obtained with the query below:
SELECT Name
, ProductNumber
, MakeFlag
, FinishedGoodsFlag
, IsNull([Paint Storage], 0) OnHandPaintStorage
, IsNull([Frame Forming], 0) OnHandFrameForming
, IsNull([Frame Welding], 0) OnHandFrameWelding
, IsNull([Debur and Polish], 0) OnHandDeburPolish
, IsNull([Specialized Paint], 0) OnHandSpecializedPaint
, IsNull([Subassembly], 0) OnHandSubassembly
, IsNull([Final Assembly], 0) OnHandFinalAssembly
, IsNull([Paint Storage], 0) + IsNull([Frame Forming], 0) + IsNull([Frame Welding], 0) + IsNull([Debur and Polish], 0) + IsNull([Specialized Paint], 0) + IsNull([Subassembly], 0) + IsNull([Final Assembly], 0) OnHand
, ListPrice
, StandardCost
, (IsNull([Paint Storage], 0) + IsNull([Frame Forming], 0) + IsNull([Frame Welding], 0) + IsNull([Debur and Polish], 0) + IsNull([Specialized Paint], 0) + IsNull([Subassembly], 0) + IsNull([Final Assembly], 0) ) * StandardCost Value
FROM
(
SELECT PP.Name
, PP.ProductNumber
, PP.MakeFlag
, PP.FinishedGoodsFlag
, PL.Name AS Location
, PP.ListPrice
, PP.StandardCost
, PPI.Quantity
FROM Production.ProductInventory PPI
JOIN Production.Product PP
ON PPI.ProductID = PP.ProductID
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
) AS A
PIVOT
(
SUM(Quantity)
FOR Location IN ([Paint Storage], [Frame Forming], [Frame Welding], [Debur and Polish], [Specialized Paint], [Subassembly], [Final Assembly])
) PVT

    The query is not rocket science, but it took me some time to figure out that when I need to use multiple tables, I have to first create a subquery, and only then apply the PIVOT operator. This is how I initially tried to write my query:
SELECT PP.Name
, PP.ProductNumber
, PP.MakeFlag
, PP.FinishedGoodsFlag
, IsNull([Paint Storage], 0) OnHandPaintStorage
, IsNull([Frame Forming], 0) OnHandFrameForming
, IsNull([Frame Welding], 0) OnHandFrameWelding
, IsNull([Debur and Polish], 0) OnHandDeburPolish
, IsNull([Specialized Paint], 0) OnHandSpecializedPaint
, IsNull([Subassembly], 0) OnHandSubassembly
, IsNull([Final Assembly], 0) OnHandFinalAssembly
, IsNull([Paint Storage], 0) + IsNull([Frame Forming], 0) + IsNull([Frame Welding], 0) + IsNull([Debur and Polish], 0) + IsNull([Specialized Paint], 0) + IsNull([Subassembly], 0) + IsNull([Final Assembly], 0) OnHand
, PP.ListPrice
, PP.StandardCost
, (IsNull([Paint Storage], 0) + IsNull([Frame Forming], 0) + IsNull([Frame Welding], 0) + IsNull([Debur and Polish], 0) + IsNull([Specialized Paint], 0) + IsNull([Subassembly], 0) + IsNull([Final Assembly], 0) ) * StandardCost Value
FROM Production.ProductInventory PPI
JOIN Production.Product PP
ON PPI.ProductID = PP.ProductID
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
PIVOT
(
SUM(Quantity)
FOR PL.Name IN ([Paint Storage], [Frame Forming], [Frame Welding], [Debur and Polish], [Specialized Paint], [Subassembly], [Final Assembly])
) PVT
and the error:
Msg 8156, Level 16, State 1, Line 1
The column 'ProductID' was specified multiple times for 'PVT'.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.Name" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.ProductNumber" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.MakeFlag" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.FinishedGoodsFlag" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.ListPrice" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.StandardCost" could not be bound.

Notes:
1. It's probably better to create a view for the subquery and use the view instead. I'm doing that most of the time; sometimes I prefer to write table-valued functions. More on this in another post.
2. It seems only one aggregation function can be used with PIVOT. That's pretty bad, because if you start writing your query using a PIVOT and then the requirements change (e.g. the number of records per Location is needed as well; a more appropriate example is the one of Purchase Orders, where aggregations could be applied on Ordered, Receipt and Open Quantity), you'll have to change the whole query; see the sketch after these notes. I'm expecting more from the next implementations of the PIVOT operator; I hope Microsoft has that on their list.
3. The benefit of the PIVOT operator is supposed to be that it allows writing shorter code; not necessarily. Of course, it allows eliminating the redundant code, the multiple CASE WHEN statements; however this works when there is only one column (Location in our example) used to do the split. What if there is additional logic which must be included in the CASEs?
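    As referenced in note 2, here is a minimal sketch of how both a sum and a record count per Location can be obtained at once by falling back on the CASE-based aggregations from the previous post (an illustration only, not the PIVOT syntax itself):

--sum and count per Location in one pass, no PIVOT involved
SELECT PP.Name
, SUM(CASE WHEN PL.Name = 'Paint Storage' THEN PPI.Quantity ELSE 0 END) OnHandPaintStorage
, SUM(CASE WHEN PL.Name = 'Paint Storage' THEN 1 ELSE 0 END) RecordsPaintStorage
, SUM(CASE WHEN PL.Name = 'Frame Forming' THEN PPI.Quantity ELSE 0 END) OnHandFrameForming
, SUM(CASE WHEN PL.Name = 'Frame Forming' THEN 1 ELSE 0 END) RecordsFrameForming
FROM Production.ProductInventory PPI
JOIN Production.Product PP
ON PPI.ProductID = PP.ProductID
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
GROUP BY PP.Name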

Friday, September 14, 2007

List Based Aggregations (On Hand example)

    Some time ago I found a query like the one below (I've written this kind of queries myself too):
SELECT PP.Name
, PP.ProductNumber
, PP.MakeFlag
, PP.FinishedGoodsFlag
, IsNull(PI1.Quantity, 0) OnHandPaintStorage
, IsNull(PI2.Quantity, 0) OnHandFrameForming
, IsNull(PI3.Quantity, 0) OnHandFrameWelding
, IsNull(PI1.Quantity, 0)+IsNull(PI2.Quantity, 0)+IsNull(PI3.Quantity, 0) OnHand
, PP.ListPrice
, PP.StandardCost
, (IsNull(PI1.Quantity, 0)+IsNull(PI2.Quantity, 0)+IsNull(PI3.Quantity, 0))*PP.StandardCost Value
FROM Production.Product PP
LEFT JOIN
(-- On Hand Paint Storage
SELECT SUM (Quantity) Quantity
, PPI.ProductID
FROM Production.ProductInventory PPI
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
WHERE PL.Name = 'Paint Storage'
GROUP BY PPI.ProductID
) PI1
ON PP.ProductID = PI1.ProductID
LEFT JOIN
(--On Hand Frame Forming
SELECT SUM (Quantity) Quantity
, PPI.ProductID
FROM Production.ProductInventory PPI
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
WHERE PL.Name = 'Frame Forming'
GROUP BY PPI.ProductID
) PI2
ON PP.ProductID = PI2.ProductID
LEFT JOIN
(--On Hand Frame Welding
SELECT SUM (Quantity) Quantity
, PPI.ProductID
FROM Production.ProductInventory PPI
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
WHERE PL.Name = 'Frame Welding'
GROUP BY PPI.ProductID
) PI3
ON PP.ProductID = PI3.ProductID
WHERE IsNull(PI1.Quantity, 0)+IsNull(PI2.Quantity, 0)+IsNull(PI3.Quantity, 0) <> 0
ORDER BY PP.Name

    I exemplified the above query by using tables from the AdventureWorks database; it's based on the Product, ProductInventory and Location tables.
    The query returns the On Hand from specific Locations (here I exemplified with only 3 locations, but there could be more). The query is not difficult even if it involves 3 tables; however, many people write this kind of lengthy queries. A much simpler solution is to use aggregations directly, as in the below query. Such a method is flexible enough to allow adding other Locations to the query in no time, and it performs better!
SELECT PP.Name
, PP.ProductNumber
, PP.MakeFlag
, PP.FinishedGoodsFlag
, SUM(CASE WHEN PL.Name = 'Paint Storage' THEN PPI.Quantity ELSE 0 END) OnHandPaintStorage
, SUM(CASE WHEN PL.Name = 'Frame Forming' THEN PPI.Quantity ELSE 0 END) OnHandFrameForming
, SUM(CASE WHEN PL.Name = 'Frame Welding' THEN PPI.Quantity ELSE 0 END) OnHandFrameWelding
, SUM(CASE WHEN PL.Name = 'Debur and Polish' THEN PPI.Quantity ELSE 0 END) OnHandDeburPolish
, SUM(CASE WHEN PL.Name = 'Specialized Paint' THEN PPI.Quantity ELSE 0 END) OnHandSpecializedPaint
, SUM(CASE WHEN PL.Name = 'Subassembly' THEN PPI.Quantity ELSE 0 END) OnHandSubassembly
, SUM(CASE WHEN PL.Name = 'Final Assembly' THEN PPI.Quantity ELSE 0 END) OnHandFinalAssembly
, SUM(PPI.Quantity) OnHand
, PP.ListPrice
, PP.StandardCost
, SUM(PPI.Quantity)*PP.StandardCost Value
FROM Production.ProductInventory PPI
JOIN Production.Product PP
ON PPI.ProductID = PP.ProductID
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
WHERE PL.Name IN ('Paint Storage', 'Frame Forming', 'Frame Welding', 'Debur and Polish', 'Specialized Paint', 'Subassembly', 'Final Assembly')
GROUP BY PP.Name
, PP.ProductNumber
, PP.MakeFlag
, PP.FinishedGoodsFlag
, PP.ListPrice
, PP.StandardCost
HAVING SUM(PPI.Quantity)<>0
ORDER BY PP.Name

Notes:
1. Probably you'll wonder why I put the constraint SUM(PPI.Quantity)<>0?! The explanation is simple: I've met cases in which ERP systems were allowing negative values for On Hand. Pretty cool, isn't it?
2. The AdventureWorks database can be downloaded from the Microsoft site.
3. I've met many other cases in which this type of queries is very handy; here I tried to exemplify the method in the easiest possible way, with minimum effort :D.

Some Stuff Here

    It's amazing: no matter how much experience we have in a programming language or a technology, there is always something new to learn. There are a few rarely used functions in SQL Server 2000, but they could be really useful in certain situations. One of these functions is Stuff; I discovered its use a long time after I started to play with SQL.

Stuff ( character_expression , start , length , character_expression )

--inserting a string inside another without doing a replacement
SELECT Stuff('This is just a test', 3, 0, 'x')
Output: Thxis is just a test

--replacing part of a string with another string
SELECT Stuff('This is just a test', 3, 5, 'at was')
Output: That was just a test

    So, it could be useful when we check whether a character is on a certain position, and replace it with another character. Normally we would have to write something like:

DECLARE @string varchar(50)
SET @string = 'DE1988299X8829'
SELECT CASE WHEN Substring(@string, 10, 1) = 'X' THEN Stuff(@string, 10, 1, 'Y') ELSE @string END
Output: DE1988299Y8829

    Another function I haven't seen too often in SQL logic is Replicate; yeah, it does exactly what its name suggests - it takes a string and replicates its content multiple times.

Replicate ( character_expression , integer_expression )

SELECT Replicate ('0', 10)
Output: 0000000000

SELECT Replicate ('tone', 3)
Output: tonetonetone

    The function could be useful when we need to put a number of characters in front of a value. For example, a table contains integer values, but in a report we need them with leading zeroes (e.g. 00056 instead of 56). I tried to implement such functionality as a function in a previous posting with the help of the Space function; using the Space and Replace functions, the same result can be obtained as with Replicate:

SELECT Replace(Space(3), ' ', 'tone')
Output: tonetonetone
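    To make the leading-zeroes scenario concrete, here is a minimal sketch combining Replicate with Right; the width of 5 characters is just an assumption for the example:

--padding an integer to 5 characters with leading zeroes
DECLARE @value int
SET @value = 56
SELECT Right(Replicate('0', 5) + Cast(@value as varchar(5)), 5)
Output: 00056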
Duplicates

    When analyzing table structures, it isn't only interesting to see the distinct values from a column or group of columns, but also to study whether there are any duplicates. In order to do that, I identify which of the columns should make a record unique within the view/table, and then I group by them and select only those records for which multiple occurrences were found. In this case there could be two reasons behind the duplication:
1. The chosen columns don't identify a unique record.
2. One of the JOINs is creating trouble; maybe I forgot to add a constraint or the JOIN is partially incorrect.

--checking for duplicates; <table_name> is a placeholder for a table/view, while <column_1>, …, <column_n> for columns from the given table
SELECT <column_1>, <column_2> --, <column_n>
FROM <table_name>
GROUP BY <column_1>, <column_2> --, <column_n>
HAVING count(*)>1

    To select the records which create the duplication, all we have to do is to JOIN the previous query with the base table on the chosen group of columns:

SELECT A.*
FROM <table_name> A
JOIN (
SELECT <column_1>, <column_2> --, <column_n>
FROM <table_name>
GROUP BY <column_1>, <column_2> --, <column_n>
HAVING count(*)>1
) B
ON A.<column_1> = B.<column_1>
AND A.<column_2> = B.<column_2>
--AND A.<column_n> = B.<column_n>

    Of course, the non-duplicated records can be shown with a LEFT JOIN:

SELECT A.*
FROM <table_name> A
LEFT JOIN (
SELECT <column_1>, <column_2> --, <column_n>
FROM <table_name>
GROUP BY <column_1>, <column_2> --, <column_n>
HAVING count(*)>1
) B
ON A.<column_1> = B.<column_1>
AND A.<column_2> = B.<column_2>
--AND A.<column_n> = B.<column_n>
WHERE B.<column_1> IS NULL

Note: <column_1>, …, <column_n> are supposed to contain no NULLs.
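    As a concrete illustration (my example, reusing the LottoNumbers table from the lotto postings), checking that no number was recorded twice within the same drawing would look like this:

--duplicates check: a number should appear at most once per drawing
SELECT DrawingID, Number
FROM LottoNumbers
GROUP BY DrawingID, Number
HAVING count(*)>1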

Thursday, September 13, 2007

Useful Functions in Data Conversion

    When working with data, be it about reporting, data conversion or data analysis, there are a few useful functions which make life easier. They deal with three types of basic variables and their variations: strings (e.g. char, varchar), numeric (e.g. int, decimal) and date (e.g. smalldatetime, datetime).

1. CASE input_expression WHEN when_expression THEN result_expression [ ...n ] [ ELSE else_result_expression ] END

    Judging from experience, CASE is the most used function in data conversions, when the output value is based on one or multiple constraints over the already existing values.

CASE WHEN column1 = 'yes' OR column2 = 'other' THEN 'yes/other' END

    Just be careful: when the ELSE is not provided, a NULL will be returned! Another important thing: the values returned must have the same base type; dates, strings or numeric values can't be mixed without conversion!

--example 1
CASE
WHEN DateDiff(d, DateValue, GetDate())>0 THEN 'past'
WHEN DateDiff(d, DateValue, GetDate())=0 THEN 'today'
ELSE 'future'
END

--example 2
CASE
WHEN column1 BETWEEN 1 AND 100 THEN '1st range'
WHEN column1 BETWEEN 101 AND 1000 THEN '2nd range'
ELSE '3rd range'
END

--example 3
CASE
WHEN column1 IN ('A', 'B', 'C') THEN '1st set'
WHEN column1 IN ('C', 'D', 'E') THEN '2nd set'
ELSE '3rd set'
END

    There is a saying for DBAs and not only: "Know your data!". As the WHEN branches are checked in order until one comparison evaluates to True, it is better to arrange them in descending order of occurrence.

2. IsNull ( check_expression , replacement_value )

    It's always a good idea to check for NULL values; they could occur anytime a default value is not provided for a value. One of the cases I often found was joining two tables on a column whose value was NULL.

SELECT *
FROM table1 A
LEFT JOIN table2 B
ON A.column1 = B.column1
AND IsNull(A.column2, ' ') = IsNull(B.column2, ' ')
WHERE B.column1 IS NULL

for string values, or

SELECT *
FROM table1 A
LEFT JOIN table2 B
ON A.column1 = B.column1
AND IsNull(A.column2, 0) = IsNull(B.column2, 0)
WHERE B.column1 IS NULL

for numeric values; this works when ' ' and 0 aren't found in the list of possible values or there is no impact on the result. Another frequently used replacement value for the IsNull function is 'N/A', to point out the missing value.

3. COALESCE ( expression [ ,...n ] )

    There are cases when it's needed to take the first not-NULL value from a list of columns, be they dates or values of any other type. Instead of writing something like

IsNull(Date1, IsNull(Date2, IsNull(Date3, GetDate())))

it's much easier to write

COALESCE(Date1, Date2, Date3, GetDate())

SELECT *
FROM table1
WHERE COALESCE(Column1, Column2, Column3) IS NOT NULL

or

SELECT *
FROM table1
WHERE Column1 IS NOT NULL
OR Column2 IS NOT NULL
OR Column3 IS NOT NULL

4. NullIf ( expression , expression )

    The NullIf function is handy especially when a certain value is not needed in the output, for example 0.

SELECT column1, NullIf(column2, 0)
FROM table_name

    When multiple values are not needed, it's maybe easier to write

SELECT column1
, CASE WHEN column2 IN ('A', 'B', 'V') THEN NULL ELSE column2 END
FROM table_name

instead of writing

SELECT column1, NullIf(NullIf(NullIf(column2, 'A'), 'B'), 'V')
FROM table_name

5. CONVERT ( data_type [ ( length ) ] , expression [ , style ] )

    I used to use CAST instead of CONVERT; however CONVERT is more powerful, especially when it comes to dates.
    SQL Server comes with a few date formats obtained by providing a value for the style parameter:

SELECT CONVERT(varchar(10), GetDate(), 104) -- dd.mm.yyyy
SELECT CONVERT(varchar(12), GetDate(), 105) -- dd-mm-yyyy
SELECT CONVERT(varchar(10), GetDate(), 103) -- dd/mm/yyyy
SELECT CONVERT(varchar(10), GetDate(), 102) -- yyyy.mm.dd
SELECT CONVERT(varchar(10), GetDate(), 101) -- mm/dd/yyyy
SELECT CONVERT(varchar(12), GetDate(), 106) -- dd mon yyyy

or

SELECT Upper(CONVERT(varchar(12), GetDate(), 106))

    The conversion of a date to a string is mainly required by the need to have the date in a special format, or to not allow Excel to reformat the date to its default.

6. IsNumeric ( expression )

    This function shouldn't be missing when casting string values to numeric values and doing operations with them.

SELECT CAST(column1 as decimal(12,2)) + 101
FROM table_name
WHERE IsNumeric(column1)=1

    Please note that when multiple constraints are used, the IsNumeric function should appear first after the WHERE clause; a query like

SELECT CAST(column1 as decimal(12,2)) + 101
FROM table_name
WHERE column2 = 'value'
AND IsNumeric(column1)=1

might raise an error if one of the column1 values is not numeric. So better write:

SELECT CAST(column1 as decimal(12,2)) + 101
FROM table_name
WHERE IsNumeric(column1)=1
AND column2 = 'value'

7. IsDate ( expression )

    It works the same way as IsNumeric, but on dates. Thus, if the provided value is a date, it returns 1, otherwise 0.

SELECT IsDate(GetDate())
SELECT IsDate('11/31/2007') --November has only 30 days

or with more readable values:

SELECT CASE IsDate('11/31/2007') WHEN 1 THEN 'Yes' ELSE 'No' END

Note:
    For more information on these functions it is always a good idea to consult the Microsoft technical documentation.
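    As a closing sketch (my addition), several of these functions often end up combined into one defensive expression; here the string column is converted to a decimal, and non-numeric content falls back to 0 instead of raising an error:

--defensive conversion: non-numeric values become 0 instead of raising an error
--note: IsNumeric has known quirks (e.g. it returns 1 for strings like '.' or '$')
SELECT CASE WHEN IsNumeric(column1)=1 THEN CAST(column1 as decimal(12,2)) ELSE 0 END column1
FROM table_name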
My first introduction to XML on SQL Server 2005

    That's the work I did for a homework at school (still studying, hoping to finish my Masters degree). Sorry, I have only short comments for it; hopefully it will be helpful for somebody!

/* creating a table for demonstration purposes
ID - identity field
Books - XML document where the books are stored */
CREATE TABLE XMLBooks(
ID int IDENTITY(1,1) PRIMARY KEY
, Books XML)

/* dropping the table */
--DROP TABLE XMLBooks

/* inserting a record - an XML document containing one book */
INSERT INTO XMLBooks
VALUES ('<books>
<book id="1" language="EN">
<title>Inside SQL Server 2005 Tools</title>
<authors>
<author>Michael Raheem</author>
<author>Dima Sonkin</author>
<author>Thierry D''Hers</author>
<author>Kami LeMonds</author>
</authors>
<year>2006</year>
<publisher>Addison Wesley</publisher>
<price></price>
<pages>816</pages>
<isbn10>0-321-39796-7</isbn10>
<isbn13>978-0-321-39796-6</isbn13>
</book>
</books>')

/* inserting a record - an XML document containing three books */
INSERT INTO XMLBooks
VALUES ('<books>
<book id="2" language="EN">
<title>Core C# and .NET</title>
<authors>
<author>Stephen C. Perry</author>
</authors>
<year>2005</year>
<publisher>Prentice Hall</publisher>
<price></price>
<pages>1008</pages>
<isbn10>0-13-147227-5</isbn10>
<isbn13></isbn13>
</book>
<book id="3" language="EN">
<title>Microsoft SQL Server 2005</title>
<authors>
<author>Andrew J. Brust</author>
<author>Stephen Forte</author>
</authors>
<year>2006</year>
<publisher>Microsoft Press</publisher>
<price></price>
<pages>600</pages>
<isbn10>0-7356-1923-9</isbn10>
<isbn13>978-0-7356-1923-4</isbn13>
</book>
<book id="4" language="EN">
<title>MCAD/MCSD: Visual Basic .NET XML Web Services and Server Components Study Guide</title>
<authors>
<author>Pamela Fanstill</author>
<author>Brian Reisman</author>
<author>Mitch Ruebush</author>
</authors>
<year>2003</year>
<publisher>Sybex</publisher>
<price></price>
<pages>598</pages>
<isbn10>0-7821-4193-5</isbn10>
<isbn13></isbn13>
</book>
</books>')

/* a classical SELECT */
SELECT *
FROM XMLBooks

/* a first XML-flavoured query */
SELECT Books.query('/.')
FROM XMLBooks

/* selecting an element's content */
SELECT Books.query('//book[@id=1]/title/text()')
FROM XMLBooks

/* selecting an element's content - this attempt raises a syntax error */
SELECT Books.value('data(//book[@id=1]/title/)','varchar(250)')
FROM XMLBooks

Msg 2256, Level 16, State 1, Line 2
XQuery [XMLBooks.Books.value()]: Syntax error near ')', expected a "node test".

/* selecting an element's content */
SELECT Books.value('data((//book[@id=1]/title)[1])','varchar(250)')
FROM XMLBooks

/* selecting only the occurrence */
SELECT Books.value('data((//book[@id=1]/title)[1])','varchar(250)')
FROM XMLBooks
WHERE Books.exist('//book[@id=1]')=1

/* updating a node by inserting a new child after an existing one */
UPDATE XMLBooks
SET Books.modify('
insert <author>Jack Daniels</author> after (//book[@id=1]//author)[1]')

/* updating a node by inserting a new child before an existing one */
UPDATE XMLBooks
SET Books.modify('
insert <author>Jack Daniels</author> before (//book[@id=1]//author)[1]')

/* deleting a child */
UPDATE XMLBooks
SET Books.modify('
delete //book[@id=1]//author[1]')

/* modifying an attribute */
UPDATE XMLBooks
SET Books.modify('
replace value of (//book/@id)[1] with "100"')

/* deleting a record */
DELETE
FROM XMLBooks
WHERE Books.exist('//book[@id=1]')=1

    And two examples with the nodes() method, the first one against an XML variable, the second against the table:

SELECT Motocross.Team.query('.') AS RESULT
FROM @xmlvar.nodes('/Motocross/Team') Motocross(Team)

SELECT Book.Authors.query('.') AS AuthorsList
FROM XMLBooks
CROSS APPLY Books.nodes('/books/book/authors') Book(Authors)
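    As a follow-up sketch (my addition, assuming the same XMLBooks table), nodes() combined with value() can shred the documents into a relational rowset, one row per book:

/* shredding the XML documents: one row per book */
SELECT B.Book.value('@id', 'int') BookID
, B.Book.value('(title)[1]', 'varchar(250)') Title
, B.Book.value('(year)[1]', 'int') [Year]
, B.Book.value('(publisher)[1]', 'varchar(100)') Publisher
FROM XMLBooks
CROSS APPLY Books.nodes('/books/book') B(Book)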

Null-ifying the World

    Do you remember the table used in the previous post? In the respective case we provided all the values; what happens when a value is missing? How do we deal with such cases? For such cases we have the value NULL. Let's recreate the table:

CREATE TABLE Table1(
ID int
, [Year] int
, Product varchar(30)
, Price decimal(13,2)
, Quantity decimal(13,3))

    We already inserted test data in the table; let's now try to insert a NULL for one of the values:

INSERT Table1(ID, [Year], Product, Price, Quantity)
VALUES (1, 2002, NULL, 199.23, 100)

--checking table's content
SELECT *
FROM Table1

    We can check for NULL values using the IS NULL clause:

SELECT *
FROM Table1
WHERE Product IS NULL

    If we don't like NULL values, we can replace them with other values with the help of the IsNull function:

SELECT ID, [Year], IsNull(Product, 'missing value') Product, Price, Quantity
FROM Table1

    Let's try something else:

SELECT ID, [Year], IsNull(Product, 'The Product value is missing! Please correct it!') Product, Price, Quantity
FROM Table1

    As you can see, because the maximum length of the Product column is 30, the text is trimmed after 30 characters. In order to solve this issue, we can reduce the error message to a maximum of 30 characters, or convert the Product column from varchar(30) to varchar(50), enough to hold the whole message:

SELECT ID, [Year], IsNull(Cast(Product as varchar(50)), 'The Product value is missing! Please correct it!') Product, Price, Quantity
FROM Table1

    For the next example we need to insert more NULL values:

INSERT Table1(ID, [Year], Product, Price, Quantity)
VALUES (1, 2002, 'Product 6', NULL, 100)
INSERT Table1(ID, [Year], Product, Price, Quantity)
VALUES (1, 2002, 'Product 7', 45.08, NULL)
INSERT Table1(ID, [Year], Product, Price, Quantity)
VALUES (1, 2002, 'Product 8', NULL, NULL)
INSERT Table1(ID, [Year], Product, Price, Quantity)
VALUES (1, NULL, 'Product 9', NULL, NULL)

    When we need to test for NULL values in multiple columns, we could write something like this:

SELECT *
FROM Table1
WHERE Price IS NULL
OR Quantity IS NULL

or, in a simpler way:

SELECT *
FROM Table1
WHERE IsNull(Price, Quantity) IS NULL

    These checks are useful when you need to check for two values which shouldn't be simultaneously NULL. For more columns with possible NULL values, when the values have the same type, we can rewrite the statement as follows:

SELECT *
FROM Table1
WHERE COALESCE([Year], Price, Quantity) IS NULL

    The NULL values can be replaced with values of the same type as the field:

SELECT ID
, [Year]
, IsNull(Product, 'The Product value is missing!') Product
, IsNull(Price, 0) Price
, IsNull(Quantity, 0) Quantity
FROM Table1

    Now, coming back to where we started, the base table. If we don't want NULL values in our data, we have to declare the columns as not allowing NULL:

--dropping the table
DROP TABLE Table1

CREATE TABLE Table1(
ID int NOT NULL
, [Year] int NOT NULL
, Product varchar(30) NOT NULL
, Price decimal(13,2) NOT NULL
, Quantity decimal(13,3) NOT NULL)

    Now, when we try to insert a NULL value

INSERT Table1(ID, [Year], Product, Price, Quantity)
VALUES (1, 2002, NULL, 199.23, 100)

    we get a specific error message:

Msg 515, Level 16, State 2, Line 1
Cannot insert the value NULL into column 'Product', table 'Learning.dbo.Table1'; column does not allow nulls. INSERT fails.
    Another way is to use defaults: if a value is not provided, it will be replaced with the DEFAULT value:

CREATE TABLE Table1(
ID int NOT NULL
, [Year] int NOT NULL
, Product varchar(30) DEFAULT 'Unknown'
, Price decimal(13,2) NOT NULL
, Quantity decimal(13,3) NOT NULL)

    Inserting some values in it:

INSERT Table1(ID, [Year], Price, Quantity)
VALUES (1, 2002, 199.23, 100)
INSERT Table1(ID, [Year], Product, Price, Quantity)
VALUES (1, 2002, NULL, 199.23, 100)

--testing the output
SELECT *
FROM Table1

    That's all for the moment. I have to work a little more on this post; I wanted initially to make a tutorial from it, but as nobody reads it, there is no point in doing that for the moment.

Preparing test data

    There are cases in which we need to create test data in order to check a certain functionality. In this post I'll show how data can be created with the help of loops. For this, let's create a simple table containing 5 fields:

CREATE TABLE Table1(
ID int NOT NULL
, [Year] int NOT NULL
, Product varchar(30) NOT NULL
, Price decimal(13,2) NOT NULL
, Quantity decimal(13,3) NOT NULL)
--DROP TABLE Table1
--TRUNCATE TABLE Table1

    Usually I'm commenting out the DROP and TRUNCATE statements so I can avoid modifying the data by mistake.
    And now the loop; with the help of RAND and the modulo operator the values can be varied:

DECLARE @index int
SET @index = 1
WHILE @index < 50 -- 50 is an assumed row count; adjust it to the number of test records needed
BEGIN
INSERT Table1(ID, [Year], Product, Price, Quantity)
VALUES (@index, 2000 + (@index % 4), 'Product ' + Cast(@index%10 as char(3)), 10 * RAND(), 100 * RAND())
SET @index = @index + 1
END

--testing the output
SELECT *
FROM Table1

    And now you can work with the data and write nice statements like the one below:

SELECT [Year], Product, SUM(Price * Quantity) Amount
FROM Table1
GROUP BY [Year], Product
ORDER BY [Year], Product

That's all folks!
Lists as Parameters in Stored Procedures

    In the previous post I've shown how we can transform a delimited list into a table with the help of table-valued functions; in this posting I'll show how we can use lists in stored procedures without the need to create the query as a string. Let's create a table which will be used for testing purposes:

CREATE TABLE Testing(
value varchar(50))

and populate it from a list:

INSERT Testing
SELECT value
FROM dbo.SplitList('1,5,7,56,67,34,33,54,76', ',')

    Let's see the values:

SELECT *
FROM Testing

    Normally in a stored procedure, if you want to limit your criteria to a dynamic range of values provided as a parameter, you would have to concatenate your script dynamically, like below:

CREATE PROCEDURE TestingListsAsParameters(
@listOfValues varchar(500))
AS
DECLARE @SQL varchar(1000)
SET @SQL = 'SELECT * FROM Testing WHERE Value IN (' + @listOfValues + ')'
EXEC (@SQL)

    Testing the stored procedure:

EXEC TestingListsAsParameters '5,7,56, 67'

    The same result can be obtained by using the following method:

CREATE PROCEDURE TestingListsAsParameters1(
@listOfValues varchar(500))
AS
SELECT A.*
FROM Testing A
JOIN dbo.SplitList(@listOfValues, ',') B
ON A.Value = B.Value

EXEC TestingListsAsParameters1 '5,7,56,67'

    Sometimes, if the list contains no values, we want to return all the records from the targeted table; we can rewrite the above procedure as follows:

CREATE PROCEDURE TestingListsAsParameters2(
@listOfValues varchar(500)='')
AS
SELECT A.*
FROM Testing A
WHERE (Value IN (SELECT value FROM dbo.SplitList(@listOfValues, ',')) AND Len(@listOfValues)>0)
OR Len(@listOfValues)=0

EXEC TestingListsAsParameters2 '5,7,56,67' --returns only the matches from the list
EXEC TestingListsAsParameters2 --returns all the values

    The method can thus be extended to support multiple lists as parameters.
Lists and a Little Math

    In the previous post I've shown how we can transform a delimited list into a table with the help of table-valued functions, and I gave as example a JOIN which provided the common elements between two lists:

SELECT A.*
FROM dbo.SplitList('23 34 50 71', ' ') A
JOIN dbo.SplitList('23,34,25,45', ',') B
ON A.Value = B.Value

    Playing with the function, but this time leaving out the JOIN condition, we obtain a Cartesian product:

--an example
SELECT *
FROM dbo.SplitList('1,2,3,4', ',') A
, dbo.SplitList('1,2,3,4', ',') B

--another example
SELECT *
FROM dbo.SplitList('1,2,3,4', ',') A
, dbo.SplitList('1,2,3,4', ',') B
, dbo.SplitList('1,2,3,4', ',') C
, dbo.SplitList('1,2,3,4', ',') D

    Apparently without any use, this could be handy when generating sets of test data, instead of creating nested loops. A more interesting example, at least for kids, is the following:

SELECT 'The ' + A.Value + ' ' + B.Value + ' ' + C.Value + '!'
FROM dbo.SplitList('cat,dog,bird,fish', ',') A
, dbo.SplitList('drinks,smells,eats,jumps', ',') B
, dbo.SplitList('loudly,nicely,grumpy,alone', ',') C

    We can even do permutations; of course this example makes sense only for a small number of elements:

SELECT *
FROM dbo.SplitList('1,2,3,4', ',') A
JOIN dbo.SplitList('1,2,3,4', ',') B
ON A.Value<>B.Value
JOIN dbo.SplitList('1,2,3,4', ',') C
ON B.Value<>C.Value
AND A.Value<>C.Value
JOIN dbo.SplitList('1,2,3,4', ',') D
ON C.Value<>D.Value
AND B.Value<>D.Value
AND A.Value<>D.Value
ORDER BY A.Value, B.Value, C.Value, D.Value

    That's all for the moment; a more useful application in a next posting.
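    In the same spirit (my addition, assuming the same SplitList function), replacing <> with < turns the permutations into combinations, since each set is then produced only once, in ascending order:

--combinations of 2 elements out of 4: each pair appears only once
--note: Value is varchar, so the comparison is alphabetical; fine for single-character elements
SELECT *
FROM dbo.SplitList('1,2,3,4', ',') A
JOIN dbo.SplitList('1,2,3,4', ',') B
ON A.Value < B.Value
ORDER BY A.Value, B.Value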

Table-valued Functions and List of Values

    Often I needed to transform a list of values into a table format in order to process the data more easily; it was easy to do that in a stored procedure, but the data were usable only inside the procedure. Once I jumped into the world of table-valued functions, new horizons opened. We can provide a list of values as parameter to a function and return the values in a tabular form. I will skip the introduction to table-valued functions; more information about them can be found in the SQL Server Books Online documentation.
    So, here is the function; the delimiter is provided as second parameter:

CREATE FUNCTION dbo.SplitList(
@ListValues varchar(500)
, @Delimiter char(1))
RETURNS @Temp TABLE(Value varchar(50))
/*
Purpose: translates a list of values delimited by @Delimiter into a tabular format
Parameters: @ListValues varchar(500) - list of values split by @Delimiter
, @Delimiter char(1) - delimiter
Notes: comma is used as default delimiter
Sample calls:
SELECT * FROM dbo.SplitList('24,34,34,56,23', ',')
SELECT * FROM dbo.SplitList('24 34 34 56 23', ' ')
SELECT * FROM dbo.SplitList(NULL, ',')
*/
AS
BEGIN
DECLARE @Index int
DECLARE @Length int
SET @Index = CharIndex(@Delimiter, @ListValues)
SET @Length = Len(@ListValues) - @Index
WHILE @Index > 0 --if the fetch was successful
BEGIN
INSERT @Temp
VALUES( Substring(@ListValues, 0, @Index))
SET @ListValues = Substring(@ListValues, @Index+1, @Length)
SET @Index = CharIndex(@Delimiter, @ListValues)
SET @Length = @Length - @Index
END
INSERT @Temp
VALUES(@ListValues)
RETURN
END

    It can be tested using the below calls:

SELECT *
FROM dbo.SplitList('24,34,34,56,23', ',')

    In this example the list of values is comma-delimited, while in the next one the values are delimited by a space:

SELECT *
FROM dbo.SplitList('24 34 34 56 23', ' ')

    The function works with NULLs too:

SELECT *
FROM dbo.SplitList(NULL, ',')

    Why is such a function handy? The main utility resides in the fact that it can be used in JOINs or subqueries, to filter or search for values. For example:

--using a JOIN; <table_name> is just a placeholder for a table
SELECT *
FROM <table_name>
JOIN dbo.SplitList('23,34,25,45', ',')
ON 1=1

--using an inner query
SELECT *
FROM <table_name>
WHERE <column_name> IN (SELECT Value FROM dbo.SplitList('23,34,25,45', ','))

    We can even use a join between two tables created from lists of values. The following query returns the intersection of two lists:

SELECT A.*
FROM dbo.SplitList('23 34 50 71', ' ') A
JOIN dbo.SplitList('23,34,25,45', ',') B
ON A.Value = B.Value

    Of course, the JOINs performed on such tables have low performance; we can speed them up by declaring the Value column as PRIMARY KEY. That's useful when we deal with long lists; however, it won't work with NULL values and duplicates.
    Thus, by replacing the RETURNS line with the following one, the queries will perform faster:

RETURNS @Temp TABLE(Value varchar(50) PRIMARY KEY)

    Now, trying to run again the following query:

SELECT *
FROM dbo.SplitList('24,34,34,56,23', ',')

here comes the error:

Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'PK__SplitList__0F975522'. Cannot insert duplicate key in object 'dbo.@Temp'.

    While the following statement:

SELECT *
FROM dbo.SplitList(NULL, ',')

returns:

Msg 515, Level 16, State 2, Line 1
Cannot insert the value NULL into column 'Value', table '@Temp'; column does not allow nulls. INSERT fails.

    That's no news; we get similar errors when working with normal tables. I will stop here, more in a next post.