23 September 2007

SQL Reloaded: 6 out of 49 (Part 4: More Basic Data Analysis)

In a post I read long time ago, it was mentioned that’s interesting to see in a data set in which intervals the data falls. It was recommended to split the interval in 10 buckets, though we don’t know which is the right most extremity for our interval. We could use the maximum value:

  -- get maximum number of consecutive drawings for which any number haven’t appeared 
SELECT Max(DifferenceWeeks) 
FROM dbo.vLottoConsecutiveDrawings  

But this doesn’t guarantees that there will be cases in which the extremity is exceeded. From my point of view, if the maximum is 63, then I’ll take the extremity 70 and split it in 10 buckets: first interval will be 1 … 7, second 8 … 14, etc. Thus, I’m taking the GROUP query I used in the previous copy and add to it additional logic for the split: 

SELECT Number 
, Min(NullIf(DifferenceWeeks, 0)) Minimum 
, Max(NullIf(DifferenceWeeks, 0)) Maximum 
, Avg(NullIf(DifferenceWeeks, 0)) Average 
, count(*) NumberDrawings 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END) Range1_7 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 8 AND 14 THEN 1 END) Range8_14 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 15 AND 21 THEN 1 END) Range15_21 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 22 AND 28 THEN 1 END) Range22_28 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 29 AND 35 THEN 1 END) Range29_35 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 36 AND 42 THEN 1 END) Range36_42 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 43 AND 49 THEN 1 END) Range43_49 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 50 AND 56 THEN 1 END) Range50_56 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 57 AND 63 THEN 1 END) Range57_63 
, SUM(CASE WHEN DifferenceWeeks BETWEEN 64 AND 70 THEN 1 END) Range64_70 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END)/count(*) Percentage1_7 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 8 AND 14 THEN 1 END)/count(*) Percentage8_14 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 15 AND 21 THEN 1 END)/count(*) Percentage15_21 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 22 AND 28 THEN 1 END)/count(*) Percentage22_28 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 29 AND 35 THEN 1 END)/count(*) Percentage29_35 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 36 AND 42 THEN 1 END)/count(*) Percentage36_42 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 43 AND 49 THEN 1 END)/count(*) Percentage43_49 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 50 AND 56 THEN 1 END)/count(*) Percentage50_56 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 57 AND 63 THEN 1 END)/count(*) Percentage57_63 
, 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 64 AND 70 THEN 1 END)/count(*) Percentage64_70 
FROM dbo.vLottoConsecutiveDrawings 
GROUP BY Number 
ORDER BY Number 

As can be seen I just summed the number of cases in which a value fallen in the given interval:

SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END) Range1_7     

Additionally, I wanted to see what was the percentage for a value to fall in a given interval from the total number of occurrences: 
    
 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END)/count(*) Percentage1_7 

How can we benefit from such output?     
We can check for each number what’s the bucket with the highest percentage, usually there are only 1-2 intervals which qualify. The percentage the is key in finding useful information. For example if there is only one interval with a 75% occurrences, then most probably the number will appear in a drawing from that interval, if the number didn’t appeared in the respective interval, then the probability for that number to appear increases. When we have multiple intervals with a considerable high percentage, if the right extremity of the highest interval is exceeded, the chances for a number to appear are considerably higher.

Previous Post <<||>> Next Post

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.