23 September 2007

SQL Reloaded: 6 out of 49 (Part 4: More Basic Data Analysis)

In a post I read long time ago, it was mentioned that’s interesting to see in a data set in which intervals the data falls. It was recommended to split the interval in 10 buckets, though we don’t know which is the right most extremity for our interval. We could use the maximum value:

  -- get maximum number of consecutive drawings for which any number haven’t appeared 
SELECT Max(DifferenceWeeks) 
FROM dbo.vLottoConsecutiveDrawings  

But this doesn’t guarantees that there will be cases in which the extremity is exceeded. From my point of view, if the maximum is 63, then I’ll take the extremity 70 and split it in 10 buckets: first interval will be 1 … 7, second 8 … 14, etc. Thus, I’m taking the GROUP query I used in the previous copy and add to it additional logic for the split: 

SELECT Number 
, Min(NullIf(DifferenceWeeks, 0)) Minimum 
, Max(NullIf(DifferenceWeeks, 0)) Maximum 
, Avg(NullIf(DifferenceWeeks, 0)) Average 
, count(*) NumberDrawings 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END) Range1_7 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 8 AND 14 THEN 1 END) Range8_14 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 15 AND 21 THEN 1 END) Range15_21 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 22 AND 28 THEN 1 END) Range22_28 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 29 AND 35 THEN 1 END) Range29_35 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 36 AND 42 THEN 1 END) Range36_42 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 43 AND 49 THEN 1 END) Range43_49 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 50 AND 56 THEN 1 END) Range50_56 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 57 AND 63 THEN 1 END) Range57_63 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 64 AND 70 THEN 1 END) Range64_70 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END)/count(*) Percentage1_7 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 8 AND 14 THEN 1 END)/count(*) Percentage8_14 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 15 AND 21 THEN 1 END)/count(*) Percentage15_21 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 22 AND 28 THEN 1 END)/count(*) Percentage22_28 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 29 AND 35 THEN 1 END)/count(*) Percentage29_35 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 36 AND 42 THEN 1 END)/count(*) Percentage36_42 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 43 AND 49 THEN 1 END)/count(*) Percentage43_49 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 50 AND 56 THEN 1 END)/count(*) Percentage50_56 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 57 AND 63 THEN 1 END)/count(*) Percentage57_63 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 64 AND 70 THEN 1 END)/count(*) Percentage64_70 
FROM dbo.vLottoConsecutiveDrawings 
GROUP BY Number 
ORDER BY Number 

As can be seen I just summed the number of cases in which a value fallen in the given interval:

SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END) Range1_7     

Additionally, I wanted to see what was the percentage for a value to fall in a given interval from the total number of occurrences: 
    
 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END)/count(*) Percentage1_7 

How can we benefit from such output?     
We can check for each number what’s the bucket with the highest percentage, usually there are only 1-2 intervals which qualify. The percentage the is key in finding useful information. For example if there is only one interval with a 75% occurrences, then most probably the number will appear in a drawing from that interval, if the number didn’t appeared in the respective interval, then the probability for that number to appear increases. When we have multiple intervals with a considerable high percentage, if the right extremity of the highest interval is exceeded, the chances for a number to appear are considerably higher.

Previous Post <<||>> Next Post

SQL Reloaded: 6 out of 49 (Part 3: Basic Data Analysis)

The most basic information one can get is the number of drawings in which a number occurred, respectively which was its frequency:

-- number & frequency drawings by number
DECLARE @NumberDrawings int = 2000
SELECT Number 
, Cast(100.0*count(*)/@NumberDrawings as decimal(10,2)) Frequency 
, count(*) NumberDrawings 
FROM dbo.vLottoNumbers 
GROUP BY Number 
ORDER BY Number  

One of the interesting things to see in a data set of drawings is the distance between two drawings in which appears the same numbers. It doesn’t look too easy to compare two consecutive records within the same query; actually this can be done easy with the help of DENSE_RANK function (works only on SQL Server 2005+ and Oracle), which ranks the data within a partition, thus if two values are identical, they have the same ranking. Let’s see how DENSE_RANK function works: 

-- ranked drawings by number
SELECT Number 
, DrawingDate 
, DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking 
FROM dbo.vLottoNumbers   

The partition is created by Number, while the ranking is delimited by DrawingDate. Now all we have to do is to join two such queries by Number with two consecutive Rankings: 

--creating the view 
CREATE VIEW dbo.vLottoConsecutiveDrawings 
AS 
-- consecutive drawings by numbers
WITH DAT 
AS (
	SELECT Number     
	, DrawingDate     
	, DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking   
	FROM dbo.vLottoNumbers  
)
SELECT A.Number 
, A.DrawingDate 
, B.DrawingDate NextDrawingDate 
, IsNull(DateDiff(d, A.DrawingDate, B.DrawingDate), 0) DifferenceDays 
, IsNull(DateDiff(wk, A.DrawingDate, B.DrawingDate), 0) DifferenceWeeks 
FROM DAT A
     LEFT JOIN DAT B 
	   ON A.Number = B.Number 
	  AND A.Ranking = B.Ranking - 1 

The first interesting thing to find out is:     
  • What’s the average number of occurrences?     
  • What’s the minimum/maximum number of consecutive drawings in which the number hasn’t appeared?  
  • How many times a number appeared?      
The following query answers to these questions, by doing a simple grouping by Number over vLottoConsecutiveDrawings output:

-- minimum/maximum differences between drawings
SELECT Number 
, Min(NullIf(DifferenceWeeks, 0)) Minimum 
, Max(NullIf(DifferenceWeeks, 0)) Maximum 
, Avg(NullIf(DifferenceWeeks, 0)) Average 
, count(*) NumberDrawings 
FROM dbo.vLottoConsecutiveDrawings 
GROUP BY Number 
ORDER BY Number    

The output doesn’t give much information, but it’s a good start. The study interval can be broken down by using the DrawingDate in GROUP and/or WHERE clause: 

-- minimum/maximum differences between drawings by year
SELECT Year(DrawingDate) DrawingYear 
, Number 
, Min(NullIf(DifferenceWeeks, 0)) Minimum 
, Max(NullIf(DifferenceWeeks, 0)) Maximum 
, Avg(NullIf(DifferenceWeeks, 0)) Average 
, count(*) NumberDrawings 
FROM vLottoConsecutiveDrawings 
--WHERE Year(DrawingDate) IN (2000, 2001) 
GROUP BY Number 
, Year(DrawingDate) 
ORDER BY Number, DrawingYear

It would be also interesting to know how many numbers fall on a given row or column within the 7x7 matrix:

-- aggregating data by drawing & row
SELECT LN.DrawingDate 
, SUM(CASE WHEN LN.Number BETWEEN 1 AND 7 THEN 1 ELSE 0 END) Bucket1
, SUM(CASE WHEN LN.Number BETWEEN 8 AND 14 THEN 1 ELSE 0 END) Bucket2
, SUM(CASE WHEN LN.Number BETWEEN 15 AND 21 THEN 1 ELSE 0 END) Bucket3
, SUM(CASE WHEN LN.Number BETWEEN 22 AND 28 THEN 1 ELSE 0 END) Bucket4
, SUM(CASE WHEN LN.Number BETWEEN 29 AND 35 THEN 1 ELSE 0 END) Bucket5
, SUM(CASE WHEN LN.Number BETWEEN 36 AND 42 THEN 1 ELSE 0 END) Bucket6
, SUM(CASE WHEN LN.Number BETWEEN 43 AND 49 THEN 1 ELSE 0 END) Bucket7
FROM dbo.vLottoNumbers LN 
GROUP BY LN.DrawingDate 

-- aggregating data by drawing & column
SELECT LN.DrawingDate 
, SUM(CASE WHEN LN.Number %7 = 1 THEN 1 ELSE 0 END) Bucket1
, SUM(CASE WHEN LN.Number %7 = 2 THEN 1 ELSE 0 END) Bucket2
, SUM(CASE WHEN LN.Number %7 = 3 THEN 1 ELSE 0 END) Bucket3
, SUM(CASE WHEN LN.Number %7 = 4 THEN 1 ELSE 0 END) Bucket4
, SUM(CASE WHEN LN.Number %7 = 5 THEN 1 ELSE 0 END) Bucket5
, SUM(CASE WHEN LN.Number %7 = 6 THEN 1 ELSE 0 END) Bucket6
, SUM(CASE WHEN LN.Number %7 = 7 THEN 1 ELSE 0 END) Bucket7
FROM dbo.vLottoNumbers LN 
GROUP BY LN.DrawingDate 

The buckets can be aggregated as follows (observe the replacement of SUM with MAX):
 
-- consolidated drawings by 7 buckets (row)
 SELECT SUM(Bucket1) Bucket1
, SUM(Bucket2) Bucket2
, SUM(Bucket3) Bucket3
, SUM(Bucket4) Bucket4
, SUM(Bucket5) Bucket5
, SUM(Bucket6) Bucket6
, SUM(Bucket7) Bucket7
 FROM (
     -- aggregating data by drawing & row
	 SELECT LN.DrawingDate 
	, Max(CASE WHEN LN.Number BETWEEN 1 AND 7 THEN 1 ELSE 0 END) Bucket1
	, Max(CASE WHEN LN.Number BETWEEN 8 AND 14 THEN 1 ELSE 0 END) Bucket2
	, Max(CASE WHEN LN.Number BETWEEN 15 AND 21 THEN 1 ELSE 0 END) Bucket3
	, Max(CASE WHEN LN.Number BETWEEN 22 AND 28 THEN 1 ELSE 0 END) Bucket4
	, Max(CASE WHEN LN.Number BETWEEN 29 AND 35 THEN 1 ELSE 0 END) Bucket5
	, Max(CASE WHEN LN.Number BETWEEN 36 AND 42 THEN 1 ELSE 0 END) Bucket6
	, Max(CASE WHEN LN.Number BETWEEN 43 AND 49 THEN 1 ELSE 0 END) Bucket7
	 FROM dbo.vLottoNumbers LN 
	 GROUP BY LN.DrawingDate 
  ) DAT

Happy coding!

SQL Reloaded: 6 out of 49 (Part 2: Modeling the Data)

In a previous post I created a random set of data and stored them in LottoExtracts table, the first step is done, now we have to model the data in a form which could be useful for our analysis. Thus it makes sense to split the table in two tables:     
  • LottoExtracts containing mainly the DrawingDate and eventually additional information about extraction (e.g. Country, System, etc.). For simplicity I’m including only the Drawing Date.     
  • LottoNumbers containing the numbers and their order in extraction. 
Here are the tables:
 
-- dropping the LottoDrawings table
DROP TABLE IF EXISTS dbo.LottoDrawings
GO

--creating the LottoDrawings table 
CREATE TABLE dbo.LottoDrawings( 
  DrawingID int IDENTITY(1,1) NOT NULL
 , DrawingDate smalldatetime NULL) 
GO

 -- dropping the LottoNumbers table
DROP TABLE IF EXISTS dbo.LottoNumbers

 --creating LottoNumbers table 
 CREATE TABLE dbo.LottoNumbers(NumberID int IDENTITY(1,1) NOT NULL
 , DrawingID int NULL
 , Position smallint NULL
 , Number smallint NULL)   

 And, the view which will be used as source for analysis: 

 --creating vLottoNumbers view 
 CREATE VIEW dbo.vLottoNumbers AS 
 -- consolidated drawings by number
 SELECT LN.NumberID 
 , LN.Position 
 , LN.Number 
 , LN.DrawingID 
 , LD.DrawingDate 
 , Year(LD.DrawingDate) DrawingYear , Month(LD.DrawingDate) DrawingMonth 
 , DatePart(dd, LD.DrawingDate) DrawingDay 
 , DatePart(wk, LD.DrawingDate) DrawingWeek 
 FROM dbo.LottoNumbers LN 
      JOIN dbo.LottoDrawings LD 
	    ON LN.DrawingID = LD.DrawingID    

Now that we have the objects to model the data, we need to populate the tables with the data from LottoExtracts table: 

--populating LottoDrawings table 
INSERT dbo.LottoDrawings (DrawingDate) 
SELECT DISTINCT DrawingDate 
FROM dbo.LottoExtracts 

--populating the LottoNumbers table - first drawing number 
INSERT LottoNumbers (DrawingID, Position, Number) 
SELECT DrawingID, 1, N1 
FROM dbo.LottoExtracts LE 
     JOIN dbo.LottoDrawings LD 
	   ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0  

The above query inserted only the first number of the extraction, in order to insert the other 5 numbers, all you have to do is to replace 1 with 2 in the second line and run the query again, the insert for second number becomes: 

--populating the LottoNumbers table – second drawing number 
INSERT LottoNumbers (DrawingID, Position, Number) 
SELECT DrawingID, 2, N2 
FROM dbo.LottoExtracts LE 
       JOIN dbo.LottoDrawings LD
	     ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0

Here's the code for the remaining numbers:

--populating LottoNumbers table – the remaining drawing numbers
INSERT LottoNumbers (DrawingID, Position, Number) 
SELECT DrawingID, 3, N3 
FROM dbo.LottoExtracts LE 
       JOIN dbo.LottoDrawings LD
	     ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0
UNION ALL 
SELECT DrawingID, 4, N4
FROM dbo.LottoExtracts LE 
       JOIN dbo.LottoDrawings LD
	     ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0
UNION ALL 
SELECT DrawingID, 5, N5 
FROM dbo.LottoExtracts LE 
       JOIN dbo.LottoDrawings LD
	     ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0
UNION ALL 
SELECT DrawingID, 6, N6
FROM dbo.LottoExtracts LE 
       JOIN dbo.LottoDrawings LD
	     ON DateDiff(d, LE.DrawingDate, LD.DrawingDate) = 0

Checking the output, you should have 12000 records: 

--checking the data 
SELECT * 
FROM dbo.vLottoNumbers 
ORDER BY DrawingDate
, Position

Now we have the data for analysis (see the next post).

Happy coding!

Previous Post <<||>> Next Post

SQL Reloaded: 6 out of 49 (Part 1: Getting the data)

I was thinking to model the 6 out of 49 lottery drawing, so common across EU countries. To study numbers’ occurrences we need to import the historical data, though during my first attempts this proved to be challenging. Each lottery has a website from which data can be taken drawing by drawing or when possible, as a data dump (see notes). Moreover, some of the lotteries have variations for 6/49.       

To avoid any legal litigations, I will use a self-made set of data via the RAND function (which unfortunately is not a perfect randomizer). To store the generated data, I've created a table called LottoExtracts: 

-- cleaning the table
DROP TABLE IF EXISTS dbo.LottoExtracts

--creating the LottoExtracts table 
CREATE TABLE dbo.LottoExtracts(     
   DrawingDate date NULL
 , N1 smallint NULL
 , N2 smallint NULL
 , N3 smallint NULL
 , N4 smallint NULL
 , N5 smallint NULL
 , N6 smallint NULL)

This format will be used to upload the data in a staging table which is used later as source for our data model. As can be seen it has a Drawing Date which identifies uniquely an extraction, while in N1, …, N6 are stored the drawings’ numbers in the order they were picked. 

--creating random data 
DECLARE @index, @counter int 
DECLARE @N1, @N2, @N3, @N4, @N5, @N6 smallint 
  
SET @index = 0 
WHILE @index < 2000 
BEGIN 
  -- creating the first number 
  SET @N1 = Cast(1000 * RAND() as int) % 48 + 1 

  -- creating the second number 
  SET @counter = 1 
  WHILE @counter < 50 or @N1 = @N2 
  BEGIN 
	  SET @N2 = Cast(1000 * RAND() as int) % 48 + 1;
	  SET @counter = @counter + 1 
  END;

  -- creating the third number
  SET @counter = 1 
  WHILE @counter < 50 or @N3 IN (@N1, @N2) 
  BEGIN 
	  SET @N3 = Cast(1000 * RAND() as int) % 48 + 1 
	  SET @counter = @counter + 1 
  END 

  -- creating the fourth number 
  SET @counter = 1 
  WHILE @counter < 50 or @N4 IN (@N1, @N2, @N3) 
  BEGIN 
	  SET @N4 = Cast(1000 * RAND() as int) % 48 + 1 
	  SET @counter = @counter + 1 
  END 

  -- creating the fifth number 
  SET @counter = 1 
  WHILE @counter < 50 or @N5 IN (@N1, @N2, @N3, @N4) 
  BEGIN 
	  SET @N5 = Cast(1000 * RAND() as int) % 48 + 1 
	  SET @counter = @counter + 1 
  END 

  -- creating the sixth number 
  SET @counter = 1 
  WHILE @counter < 50 or @N6 IN (@N1, @N2, @N3, @N4, @N5) 
  BEGIN 
	  SET @N6 = Cast(1000 * RAND() as int) % 48 + 1 
	  SET @counter = @counter + 1 
  END 

  --inserting a new drawing 
  INSERT LottoExtracts(DrawingDate, N1, N2, N3, N4, N5, N6) 
  VALUES (DateAdd(d, @index*7, '01/01/2000'), @N1, @N2, @N3, @N4, @N5, @N6) 
  SET @index = @index + 1 
END 

I've created only 200 records, which equates with almost 4 years of history, and should be enough for testing purposes. Of course, one can increase or decrease the number of drawings as needed! 
 
--checking the data 
SELECT * 
FROM dbo.LottoExtracts   

In what concerns the code, the following statement creates a random number in 1 … 49 range:
 
-- creating first number 
SET @N1 = Cast(1000 * RAND() as int) % 48 + 1    

Thus, can be created all 6 numbers, however because the same number can't be extracted twice in the same drawing, I had to add duplication checks:
 
-- creating third number 
SET @counter = 1 
WHILE @counter < 50 or @N3 IN (@N1, @N2) 
BEGIN 
	SET @N3 = Cast(1000 * RAND() as int) % 48 + 1 
	SET @counter = @counter + 1 
END    

Notes:
(1) The code is verbose and probably suboptimal, though it's a must because of the behavior of RAND function as part of transactions.
(2) The below links could change in time. I provided them just to make easier the search. (*) denotes that historical data can be downloaded.
Previous Post <<||>> Next Post

15 September 2007

SQL Reloaded: Updating Data With Values From The Same Table

I have to deal with cases in which I have to update data from a table with values from the same table across records, a simple example I can sketch is the one of Standard Prices, which are normally valid during a year. For example if I would have to reset the Standard Prices from a year to the ones from the previous year, how this can be done? Actually that's pretty sample, for exemplification purposes I will create a table having only 3 columns, ProductID, Year and StandardPrice: 

-- creating the table 
CREATE TABLE dbo.StandardPrices( 
ProductID int 
, [Year] int 
, StandardPrice decimal(13,2) )      

--creating test data 
DECLARE @index int 
SET @index = 0 
  WHILE @index < 50 
  BEGIN 
      INSERT dbo.StandardPrices(ProductID, [Year], StandardPrice) 
      VALUES (@index % 25+1, 2000 + (@index/25), 100 * RAND()) 
      SET @index = @index + 1 
END     

Let's see the content, the RAND() function allows me to assign random prices to each product. As can be seen there are 25 products with Standard Prices for the Years 2000 and 2001.

 --testing StandardPrices table's content 
SELECT * 
FROM dbo.StandardPrices 
ORDER BY ProductID, [Year]     

Let's try to use a normal update: 

--first update attempt 
UPDATE dbo.StandardPrices 
SET StandardPrice = A.StandardPrice 
FROM StandardPrices A 
WHERE StandardPrices.ProductID = A.ProductID 
AND StandardPrices.[Year] = 2001 AND A.[Year] = 2000     

Ups, I got an error (SQL Server 2005):
Msg 4104, Level 16, State 1, Line 1 The multi-part identifier "StandardPrices.ProductID" could not be bound. Msg 4104, Level 16, State 1, Line 1 The multi-part identifier "StandardPrices.Year" could not be bound. 
SQL Server 2000:
Server: Msg 107, Level 16, State 3, Line 1 The column prefix 'StandardPrices' does not match with a table name oralias name used in the query. Server: Msg 107, Level 16, State 1, Line 1 The column prefix 'StandardPrices' does not match with a table name oralias name used in the query.     

It seems we can't use the same table in updates of this type, but we can use a subquery or view which refers to the same table, so we can write: 

  --the update 
UPDATE dbo.StandardPrices 
SET StandardPrice = A.StandardPrice 
FROM ( 
     SELECT ProductID , [Year] , StandardPrice 
     FROM dbo.StandardPrices 
     WHERE [Year] = 2000 
) A 
WHERE StandardPrices.ProductID = A.ProductID 
     AND StandardPrices.[Year] = 2001     

Now there are no errors, 25 records updated, let's see the output: 

--testing table's content after update 
SELECT * 
FROM dbo.StandardPrices 
ORDER BY ProductID
, [Year]
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.