31 July 2010

SQL Reloaded: Self-Joins and Denormalized Data Loading in Normalized Models

    One of the scenarios in which I often make use of self-joins is when needing to load denormalized data into a normalized data model. A characteristic of the not-normalized or denormalized data is that there are repeating data, typically the so called header data, which need to be handled specifically.

Note:
   Header data it’s improper said because an entity could contain more than 2 levels of data, for example the Purchase Orders (POs) in an ERP system could be split in PO Headers, Lines, Shipments and Distributions, thus a (denormalized) extract based on the respective data at Distribution level will contain repeating data from the higher levels.

  So for this post I needed to find an entity that contains a parent-child or header-lines structure. Actually it’s not difficult to find such an example, the world of ERP systems is full of such examples – POs, Invoices, Customer Orders, Receipts, Payments, to mention some of the important transactional data, or Customer, Vendors and even the Items, when considering the master data. The difficulty is to simplify the example to a level that could be easier understood also by people who had less tangency with ERP systems or database models. For this I will consider the Receipts received when paying the goods in a (super-)market, if we take such a bill we could typically see the Receipt Number, the name of the Vendor from which we purchased was made, the Receipt Date, the Date when the purchase was made, the Items purchased together with the Quantity and Price. Such information could be easily entered in Excel and later loaded in  a denormalized table, or enter them directly in the respective denormalized table:

-- Receipts denormalized table 
CREATE TABLE [dbo].[Receipts]( 
[ReceiptNumber] [nvarchar](50) NULL, 
[Vendor] [nvarchar](100) NULL, 
[ReceiptDate] [smalldatetime] NULL, 
[CurrencyCode] [nvarchar](3) NULL, 
[ItemNumber] [nvarchar] (50) NULL,  
[Quantity] [decimal](12, 3) NULL, 
[Price] [numeric](12, 2) NULL 
) 
ON [PRIMARY] 

-- inserting test data 
INSERT INTO dbo.Receipts 
VALUES ('012034', 'TOOM', '2010-06-23', 'EUR', 'KABANOS PARIKA', 1, 2.19) 
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'ZITRONE', 1, 0.79) 
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'BREAKFAST BAKON', 1, 1.59) 
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'PILLOW', 1, 23.99) 
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'BED SHEET', 2, 11.99) 
 
-- checking the data 
SELECT * 
FROM dbo.Receipts  

   Supposing we have the above data and that we would like to load them in a normalized structure formed from the Header information – Receipt Number, Vendor Name and Receipt Date, and Line information – Item Number, Quantity and Price:

-- Receipt Headers (normalized) 
CREATE TABLE [dbo].[ReceiptHeaders]( 
[ReceiptHeaderID] [int]IDENTITY(1,1) NOT NULL, 
[ReceiptNumber] [nvarchar](50) NULL, 
[Vendor] [nvarchar](100) NULL, 
[ReceiptDate] [smalldatetime]NULL, 
[CurrencyCode] [nvarchar](3) NULL 
) ON [PRIMARY]  

-- Receipt Lines (normalized) 
CREATE TABLE [dbo].[ReceiptLines]( 
[ReceiptLineID] [int]IDENTITY(1,1) NOT NULL, 
[ReceiptHeaderID] int NOT NULL, 
[ItemNumber] [nvarchar] (50) NULL, 
[Quantity] [decimal] (12, 3) NULL, 
[Price] [numeric](12, 2) NULL 
) ON [PRIMARY]

   In order to load the denormalized data in a normalized structure we could write two queries, the first populates the ReceiptHeaders table, for this needing to select the distinct header attributes that make the header, while the second populates the ReceiptLines table:

-- inserting the Receipt Header data 
INSERT INTO dbo.ReceiptHeaders 
SELECT DISTINCT R.ReceiptNumber  
, R.Vendor  
, R.ReceiptDate  
, R.CurrencyCode 
FROM dbo.Receipts R 

-- inserting the Receipt Lines data 
INSERT INTO dbo.ReceiptLines 
SELECT RH.ReceiptHeaderID 
, R.ItemNumber  
, SUM(R.Quantity) Quantity 
, R.Price  
FROM dbo.Receipts R 
     JOIN dbo.ReceiptHeaders RH 
      ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
GROUP BY RH.ReceiptHeaderID 
, R.ItemNumber  
, R.Price  

    As can be seen from the second query, the Receipts table was joined with the ReceiptHeaders in order to retrieve the corresponding Header information for each line record. For this action to be possible we need an attribute or combination of attributes unique across the header data, in this case the ReceiptNumber in combination with the Vendor Name. If no such unique combination exists then the match between header and line data is not possible without resulting duplicated data, in such scenario it’s recommended to clean the data before loading them, for example by introducing an attribute that makes the combination unique. The same problem of uniqueness could be applied to the lines too, needing to be possible to identify uniquely a line in the source dataset. This could be done for example by introducing a Line Number in the source dataset or, as in this case, in case there are multiple lines with the same information then we could aggregate the quantities for the respective lines, the Price being moved in the GROUP BY clause as in theory the products with the same Item Number bought at the same time have the same price (that doesn’t necessarily happen in reality though for our exemplification will do).

    After inserting the data in the normalized model it makes sense to check the logic by comparing the data inserted against the initial dataset. It’s always a good idea to do that, in this way could be trapped for example the errors in logic. In Excel for SQL Developers – Part IV: Differences Between Two Datasets I shown how the query for comparing two datasets could be created in a semiautomatic manner and shown also the resulting query. A similar query could be written also in this case, for this purpose being useful to create a view which denormalizes our structure:  

-- Receipts View 
CREATE VIEW dbo.vReceipts 
AS 
SELECT RL.ReceiptLineID  
, RL.ReceiptHeaderID 
, RH.ReceiptNumber  
, RH.Vendor  
, RH.ReceiptDate  
, RH.CurrencyCode  
, RL.ItemNumber  
, RL.Quantity  
, RL.Price  
FROM dbo.ReceiptLines RL 
    JOIN dbo.ReceiptHeaders RH 
      ON RL.ReceiptHeaderID = RH.ReceiptHeaderID  

-- testing the view & updates 
SELECT * 
FROM dbo.vReceipts 
 
self-join normalization
    Until now we made it to load the data from a denormalized structure but no trace of a self-join! In many cases writing queries similar with the above ones is enough, though there are many cases when is needed to load the data successively, either incrementally or complete datasets. In both situations we could deal with data already loaded, so we have to avoid entering duplicates. Now it comes the self join into play, because in both insert queries we have to remove the records already loaded. Even if we deal with incremental data that form partitions (any record is provided only once) it’s safer and recommended to check for possible records provided again. So, we’ll have to modify the above two inserts to ignore the records already loaded:

-- inserting the Receipt Header data (with checking for loaded data) 
INSERT INTO dbo.ReceiptHeaders 
SELECT DISTINCT R.ReceiptNumber  
, R.Vendor  
, R.ReceiptDate  
, R.CurrencyCode 
FROM dbo.Receipts R 
     LEFT JOIN dbo.ReceiptHeaders RH 
       ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
WHERE RH.ReceiptNumber IS NULL 

-- inserting the Receipt Lines data (with checking for loaded data) 
INSERT INTO dbo.ReceiptLines 
SELECT RH.ReceiptHeaderID 
, R.ItemNumber  
, SUM(R.Quantity) Quantity 
, R.Price  
FROM dbo.Receipts R 
     JOIN dbo.ReceiptHeaders RH 
      ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
     LEFT JOIN dbo.vReceipts VR 
       ON R.ReceiptNumber = VR.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
    AND R.ItemNumber = VR.ItemNumber 
WHERE VR.ReceiptNumber IS NULL 
GROUP BY RH.ReceiptHeaderID 
, R.ItemNumber  
, R.Price 

    I hope the queries are easy to be understood, if not then might be a good idea to check the posts on left joins. In order to test the queries let’s insert new data in the Receipts table:

-- inserting new test data 
INSERT INTO dbo.Receipts 
VALUES ('012455', 'TOOM', '2010-07-25', 'EUR', 'KABANOS PARIKA', 1, 2.20) 
, ('012455', 'TOOM', '2010-07-25', 'EUR', 'ZITRONE', 1, 0.79) 

    After running the three inserts the same data should be available in input denormalized and normalized structures:

-- testing the updates 
SELECT * 
FROM dbo.vReceipts 
 
self-join normalization incremental 
Notes:
    In this post wasn’t considered the case in which the data are modified between loads, thus excepting the above inserts must be written also two update statements that should reflect the changes occurred in loaded data. I will try to approach this subject in another post.

Software Engineering: Believe and Not Doubt?!

Software Engineering
Software Engineering Series

Long time ago while reviewing a PowerPoint presentation on Oracle performance troubleshooting, on one of the pages was written with big letters something like “Believe and not doubt!”, this in the context of the knowledge transmitted from guru to other professionals. I do not believe in the approach of following blindly the techniques and knowledge of other professionals, and the professional path can’t be resembled entirely with the spiritual path in which mind itself is said to be the barrier of proceeding further on the way. One reason for my advice of not following blindly other professionals it that knowledge is relative, highly dependent on experience and understanding, no human holding the ultimate knowledge in a domain of any type. 
Even if the world of databases in particular, respectively of computer science in general, is highly conceptualized, full of mathematical models used to describe it, though ignoring the fact that all of them are just an approximation, the reality is that real-life problems are in fact reduced to optimization problems, the solution to a problem resuming in finding the optimal solution based on a whole range of parameters. The same applies also to performance troubleshooting when we have to balance between performance and usability, flexibility, maintenance, security, reusability and whatever might come around.

Please note that I’m saying “not following blindly”, this doesn’t mean that you shouldn’t listen and follow the advices of other professionals, on contrary, I recommend to sip their knowledge and words of wisdom, though try to understand the reasons behind the techniques used, primarily try to understand the benefits and downsides of each technique or feature, understand why, how, when, where, who and by what means. Without understanding these aspects, the information presented can’t rise to the level of knowledge and therefore it won’t reach the potential value it might have for you and other professionals. 

In addition by understanding the various contexts in which the techniques and features apply and shouldn’t be used, you could step on the road to becoming wise, though there is a long road until there. Anyway, that’s already philosophy, but you have to take into account that nobody was born wise, and even the wise people make mistakes but they learn out of them. No matter if you are beginner or an experienced professional, each could learn from each other, there is enough domains and knowledge in the world for everybody.

Don’t be confused of the various contradictions and interpretations met in the world, the truth is relative, and you’ll often hear experts saying simply “it depends…” - is just a syntagma to highlight the complexity of things, the dependence of the solution on the various parameters. Even if you don’t agree with others’ perspective be indulgent with them and yourself, there are various opinions/beliefs, balance their points of view and yours, somewhere in the middle is the true…

30 July 2010

SQL Reloaded: Self-Join in Update Query II

    In yesterday’s post on the same topic I tried to exemplify how a self-join update query could be written in order to avoid an error raised by the database engine. Even if this type of query is normally written for hierarchical structures consisting of one-level parent-child relations stored in the same table, my example was using a simple join based directly on the primary key. During the day I was thinking that might be straightforward to create a simple hierarchical structure based on a relatively realistic example. For this let’s consider a department in which the performance of the the manager is a function of the performance of its employees, while the number of hours of training allocated to each employee is proportional to the number of hours allocated to its manager. For this let’s consider the following table:  

--Employee table's script 
CREATE TABLE [dbo].[Employees]( 
[EmployeeID] [int]  NOT NULL 
, [ManagerID] [int]  NULL 
, [Name] [nvarchar] (50) NULL 
, [Performance] [numeric]  (5, 3) NULL 
, [TrainingHours] [numeric]  (5, 2) NULL ) 
ON [PRIMARY] 

-- inserting the test data 
INSERT INTO dbo.Employees 
VALUES (1, NULL, 'Joe', NULL, 10) 
, (2, 1, 'Jack', .65, NULL) 
, (3, 1, 'Mary', .45, NULL) 
, (4, 1, 'Ross', .54, NULL) 
, (5, 1, 'Scott', .48, NULL) 
, (6, NULL, 'Jane', NULL, 15) 
, (7, 6, 'Sam', .50, NULL) 
, (8, 6, 'Ron', .45, NULL) 

   In order to calculate Manager’s performance we could use a query similar with the following, of course the formula could be more complicated and not a simple average:  

-- updating Manager's Performance 
UPDATE dbo.Employees 
SET Performance = EMP.Performance 
FROM ( -- inline view 
     SELECT ManagerID 
     , AVG(Performance) Performance 
     FROM dbo.Employees 
     WHERE ManagerID IS NOT NULL 
     GROUP BY ManagerID 
) EMP 
WHERE dbo.Employees.ManagerID IS NULL 
AND dbo.Employees.EmployeeID = EMP.ManagerID 

    Let’s check the updates and their correctitude: 


-- Checking updated data 
SELECT * 
FROM dbo.Employees 
WHERE ManagerID IS NULL -- Verifying output SELECT ManagerID 
, AVG(Performance) Performance 
FROM dbo.Employees 
WHERE ManagerID IS NOT NULL 
GROUP BY ManagerID       

self-join output 1      

Note:     
    The average needs to be calculated in the inline view. In case are retrieved more records for each record from the updated table, the query will still work though the result is “unpredictable”:
 


-- updating Manager's Performance w/o aggregates 
UPDATE dbo.Employees 
SET Performance = EMP.Performance 
FROM ( -- inline view 
     SELECT ManagerID 
     , Performance 
     FROM dbo.Employees 
     WHERE ManagerID IS NOT NULL 
) EMP WHERE dbo.Employees.ManagerID IS NULL 
AND dbo.Employees.EmployeeID = EMP.ManagerID       

      
    The reverse update based on the number of hours of training could be written as follows: 
 

-- updating Employees' Training Hours 
UPDATE dbo.Employees 
SET TrainingHours = 0.75 * EMP.TrainingHours 
FROM ( -- inline view 
    SELECT EmployeeID 
    , TrainingHours 
    FROM dbo.Employees 
   WHERE ManagerID IS NULL 
) EMP 
WHERE dbo.Employees.ManagerID IS NOT NULL 
AND dbo.Employees.ManagerID = EMP.EmployeeID       
      
-- Checking updated data 
SELECT * 
FROM dbo.Employees 
WHERE ManagerID IS NOT NULL -- Verifying output SELECT EmployeeID 
, 0.75 * TrainingHours TrainingHours 
FROM dbo.Employees 
WHERE ManagerID IS NULL   

  
self-join output 2     



28 July 2010

SQL Reloaded: Self-Join in Update Query I

    While reading R. Scheldon’s article on “UPDATE Basics in SQL Server” I remembered about a problem I had long time ago while attempting to do a self-join in an Update query. Such a query is quite useful in hierarchical structures consisting of one-level parent-child relations stored in the same table, when needed to update the parent based on child information, or vice-versa. The problem I had could be also exemplified by using a simple table (no hierarchical structure), here’s the query:  

-- self-join update - problematic query 
UPDATE Production.Product 
SET StandardCost = ITM.StandardCost* (1+.012) 
FROM Production.Product ITM 
WHERE Production.Product.ProductID = ITM.ProductID 

 The above statement returns the following error:
Msg 4104, Level 16, State 1, Line 5
The multi-part identifier "Production.Product.ProductID" could not be bound.

    In order to avoid this error message, the table from the FROM clause could be included in an inline view, something like:

-- self-join update - solution 
UPDATE Production.Product 
SET StandardCost = ITM.StandardCost*(1+.012) 
FROM ( -- inline view 
    SELECT * 
    FROM Production.Product ITM 
    ) ITM  
WHERE Production.Product.ProductID = ITM.ProductID 

    The inline view could be replaced with a standard view, table-valued UDF or the update could be done through a view. For exemplification I will use only the first and third case, here’s the view based on Production.Product table and the queries corresponding to the two cases:

-- view based on Production.Product 
CREATE VIEW Production.vProductTest 
AS 
SELECT *  
FROM Production.Product   

-- self-join update through view update 
UPDATE Production.vProductTest 
SET StandardCost = ITM.StandardCost* (1+.012) 
FROM Production.Product ITM 
WHERE Production.vProductTest.ProductID = ITM.ProductID     

-- self-join update from view 
UPDATE Production.Product 
SET StandardCost = ITM.StandardCost* (1+.012) 
FROM Production.vProductTest ITM 
WHERE Production.Product.ProductID = ITM.ProductID 

Note:
1.    The first update query was written in this way only to exemplify the self-update within an hierarchical structure, if it’s needed to modify the StandardCost and nothing more, then the update could be written simply in any of the following forms:

-- Query 1: update statement 
UPDATE Production.Product 
SET StandardCost = StandardCost* (1+.012) 

--Query 2: update statement 
UPDATE ITM 
SET StandardCost = StandardCost * (1+.012) 
FROM Production.Product ITM 

2.    Because the join constraint is based on a hierarchical structure that considers in the join different columns, the query can’t be written as follows:

 
--simple update statement 

UPDATE Production.Product 
SET StandardCost = ITM.StandardCost* (1+.012) 
FROM Production.Product ITM 
 
   This query is nothing more than an alternative to the queries shown in the first note.

26 July 2010

Porting 32 bit CLR UDFs on 64 bit Platforms

    Today I tried to port on a 64 bit platform a few of the CLR UDFs created in the previous posts, this time being constrained to use Visual Basic Studio 2010 Express to create and build the assembly on a x86 platform, and install the assembly on a x64 SQL Server box. From the previous troubleshooting experience between the two platforms, I knew that there will be some challenges, fortunately there was nothing complex. Under SSIS 2008 it’s possible to choose the targeted platform, therefore I was expecting to have something similar also in VB Studio 2010 Express, and after a simple review of Project Properties, especially in what concerns the Compile settings, I found nothing relevant. I tried then the standard approach, so I built the solution, copied the .dll on the target server and tried to register the assembly though I got the following error:

Msg 6218: %s ASSEMBLY for assembly '%.*ls' failed because assembly '%.*ls' failed verification. Check if the referenced assemblies are up-to-date and trusted (for external_access or unsafe) to execute in the database. CLR Verifier error messages if any will follow this message%.*ls

    After several attempts to Google for a solution on how to port 32 bit CLR UDFs on 64 bit Platform or on how to configure VB Studio 2010 Express in order to target solutions for 64 bit platforms, I found a similar question (VB Express target x86 Platform) in MSDN,  Johan Stenberg’s answer completed by JohnWein’s hint, led me to the “Issues When Using Microsoft Visual Studio 2005” document, to be more specific to 1.44 section "References to 32-bit COM components may not work in VB and C# Applications running on 64-bit platforms", of importance being the part talking about "Express Editions". In the document is specified how to modify the project and add in the first PropertyGroup section a PlatformTarget tag with the text value x86, therefore what I had to do was to add the respective tag but with the value x64. After doing this change everything worked smoothly. It’s kind of a mystery why Microsoft hasn’t enabled this feature in Express versions, but in the end I can live with it as long there is a workaround for it.

24 July 2010

Software Engineering: Football and Software Development

 
Software Engineering
Software Engineering Series

I wanted to write this post during the South Africa World Cup 2010 though because of the lack of time and because I was waiting for some statistics I could use, here I am, two weeks after the final whistle of the game for the first place. Football and Software Development, two domains that seems to have nothing in common, even if many software developers like to play football, and many football players are spending lot of time in front of their laptops. There is actually an important coordinate in what concerns the two – team work. Of course, that’s common to many other sports, though there are some characteristics important mainly to soccer -the small rate of deliverables (goals), the rate of failures (wrong passes), and I bet there are other characteristics common to most of the team sports, like the division and specialization of work, migration of players, “project”-oriented work, flow of money, etc.

Looking back at the games from this World Cup, we have to notice that, with a few exceptions, there wasn’t a big difference between the teams anymore, trend that could be seen during the last championships too, and there were no more individual players gaining one game after the other. Nowadays it primes the collective work, the cohesion of a team, the way the players respect the tactical indications given to them, the way they communicate and feel each other on the field. It didn’t matter anymore that you were playing against a Ronaldo, a Messi, Lampard, Drogba or Rooney, small teams like Australia, Chile, New Zeeland or South Korea, fighting as equal against the favorite teams of this tournament. 

What is more important to notice, is that teams whose players cost and make millions, didn’t function well (as expected) because the team haven’t played as a team, because the sense of individuality primed, because there was no adequate communication inside the team, while the trainer didn’t knew how to make himself respected, how to select his team, how to make/take the best out of his players, or how to change the tactics to counteract the one of the adversary. So the team with the best paid/skilled players, the team which puts more effort or controls the game, the team which has the most dynamic, effective (from the number of goals), beautiful or pragmatic play doesn’t necessarily win the game, same as the best trainer can make mistakes too, can make himself easier misunderstood or become overnight a persona non grata for its team or public.

 The same observations could be applied also to software development, and with the risk of being criticized I would say that the team with the best developers/professionals does not necessarily make a project successful, especially when the sense of individuality primes for the one of team, when the team members don’t play as a team, when there is no adequate communication, when the managers doesn’t make himself respected and know how to make/take the best out of his players, how to make a team successful no matter of the team’s number and skill-set. I tend to believe that in software development, same as in football, it must matter the joy for playing in a team, the joy for playing, being an example of professionalism, collaborating in achieving the purpose, helping each other to become better, no matter if one is named player, trainer, masseur, doctor or federation member.

I think that trainers have to learn more about project management, given the fact that building and leading a competitive team has a lot to do with projects and project management, being driven by similar goals, objectives, scope, etc. On the other side I feel that managers have to learn more from the behavior and knowledge of a trainer, to know how to be authoritarian and when to be a friend. And I expect that there are many other aspects the two types of professionals share. 

Also IT professionals have to learn from football players, especially in what concerns the team spirit, what it takes to be/become a team, what it means to have your place in the field, do it right and for the best of the team. Of course, the self-sacrifice must not be brought to extreme, as some players do. In what concerns the football players, I think they could learn from developers the simplicity, abnegation for their work, the abnegation of becoming better, of learning something new, of finding and knowing their place in life, and overall of being humble.

As for the executives dealing with IT projects they gave to learn that a defender can’t become overnight a goal-getter or a goalkeeper, that a new comer in the team needs time to adapt himself, and must be helped to become integrant part, that the new comer needs to find its pace and place in the team. Executives have to learn that it takes time, effort and a good strategy to built good successful teams, and even if everything it turns around money (although it shouldn’t), there should be kept a balance between investments, effort and rewards, some continuity, respect and support toward achieving successful and competitive teams.

Note:
If you liked/found interesting this post, then you might be interested also in Yohasna’s post What can we learn form SPAIN's World Cup Victory in the world of Software? and my answer to it, Satya’s Football and Software teams.. How different are they?, or B. Dwolatzky’s article on Football and Software Development are both team sports.

SQL Reloaded: Some Notes on Dates

   The Date is one of the most difficult data types to handle, given the fact that it might take different formats (e.g. DD/MM/YYYY, MM/DD/YYYY, DD/MM/YY, DD-MON-YYYY, etc.), and even if the various formats make a limited list, the fact that each database, machine or application could work with a different date format or set of date formats, this makes developers’ life a nightmare. The representational complexity resides also in the fact that each country adopts one or more standard date formats, a mixture between day, month, year and the various delimiters used (e.g. “.”, “/”, “-”). For example the SQL Server Books Online lists several country-specific date formats that could be used with Convert function, while in Date and Time Formats tutorial could be found a list with the formats and literals used in Oracle. Even .

     The various frameworks have attempted to introduce modalities of handling dates (e.g. culture info), though none of them is bulletproof given the interaction/integration between the various layers. Sometimes I prefer to store the date as a (standard) text string and transmit it thus between layers, following to transform (convert) it as needed by the various data consumers (UI, components, databases, etc.). What format to choose is depends entirely on habitude, special requests/constraints or on the flexibility of converting a given format to other formats. A flexible approach is the ISO 8601 format that combines date and time representations (<date>T<time>) as YYYY-MM-DDThh:mi:ss.mmm (e.g. 2010-12-22T23:10:29.300). It’s important to note that in SQL Server this format is not affected by the SET DATEFORMAT or SET LANGUAGE settings (see ISO 8601 Format), fact that makes the ISO format a good candidate for representing dates.

   The ISO format allows also to sort lexicographically a date, fact that makes it valuable when combining strings with dates, for example when naming files, though then some of the delimiters need to be removed. I prefer to remove all the delimiters thus storing a file with the timestamp in the format <meaningful_name><timestamp>.<extension> (e.g. ‘Report 20101222T231029.rpt’).

    When the time is not tracked, the date could be stored also as an integer in YYYYMMDD format, thus for example 2010-12-22 could be stored as 20101222. I’ve seen old databases making use of this technique, I found it awkward at the beginning though I discovered its performance benefits when doing comparisons, the difference of days between two dates reducing at a simple subtraction. Actually storing the date as integer is just a simplification of the way on how dates are stored, for example in SQL Server the date is stored as two integers,  in which the first integer stores the date itself and the second the time in clock ticks after midnight (see here). The danger comes when attempting to work with the integer format in order to express the difference between two dates in higher time units than the day (e.g. weeks, months, years), in such cases being recommended to work with built-in date functions. Given the fact that dates on SQL Server could be expressed also as float values (see Joe Celko’s article on Temporal Data Types in SQL Server), the same danger occurs when using the numeric values in calculations in the detriment of date functions.

    In ERP systems or other “activity-based” systems is useful to store a calendar together with other calendar specific information – whether it’s a working date, the period in which the date falls, the fiscal week, month, years, etc. If in OLTP systems such structures are used to store the time dimension, thus each time record being referenced directly, in OLAP systems there could be a mix between between direct referencing of calendar records or of storing directly the date, following in developers’ attributions to retrieve additional information about the date when needed. As it doesn’t make sense to argue in here the benefits of the two approaches, I limit myself to note that the integer value of the date could be used as a foreign key when the time component is not needed to be stored. By storing a date as integer or to a shorter date format  (e.g. YYYYMM, YYYYWW, YYYY) could be impacted applications’ flexibility. Now it depends, there are cases and cases, different problems requiring different solutions. Just think twice when choosing on how to store a date, think not only about the current requirements but also how the requirements could change given a change in the process/business dynamics!

Performance Management: Over-Educated, Yet Under-Qualified?

I know this blog is attempting to treat subjects related to data and databases, though the database developers/programmers, same as other type of developers and other professionals, and the various aspects of their professional life are part of the landscape too. Alfred Thompson, in his post “Over-Educated, Yet Under-Qualified?”, approaches quite an important topic for many graduates, but especially the ones working in IT – even if they graduated an university, they are considered as unqualified for a profession. There are various aspects about the road an IT professional has to cover from the banks of school to the professional life, here are my thoughts on the topic as posted as comment to A. Thompson’s post:

In schools the accent is on theory and algorithms, the small projects target the learning of a technology, their complexity and difficulties involved being quite small when compared with real life applications. Taking an application from design to production and later during support phase requires time and the mix of knowledge from different fields, thing quite difficult to do in a school project, while the structural context in an organization, the requirements and work in a team, is again quite different. Going above the basic features of a programming language takes time, it depends on the learning curve of the programming language and the capacity of the learner, on the complexity of the tasks approached and on the knowledge (made) available. I can’t say that schools can do much in this direction because it’s quite difficult to cover all the aspects in just 8-20 classes, in which the students are introduced into the concepts and some basic applications. What the schools could do in order to support their students is to provide the required infrastructure (mainly computers), bring the technologies and learning material up to date, direct gradually the focus from theory to applicability, and eventually support users getting some additional experience in organizations. It’s in students’ attribution to make most of the learning experience in schools, though often even if the want, need and infrastructure is there, fighting with the lack of time is quite hard.

One of the tough realities in IT is that it takes time to link the dots, and as you already highlighted, it takes about a year before a college/university graduate to become really productive. Now I have to say that this depends also on the organization’s culture/environment, on how it supports the learning process, how it helps the new comer to become part of the team and become productive. I’m saying that because I’ve seen companies doing minimum in this direction, just expecting the new comer to catch everything on the fly and be productive in a matter of weeks. Those working in IT for a longer time know that is not entirely possible, though there are also some exceptions. There are also organizations that train the new comers, introduce them into tasks evolving in complexity based on each person’s skills, provide resources (software tools, books, courses and other type of learning material) and an environment that facilitates learning. Having time allocated for learning new things, participating in activities that allow the distribution of knowledge within a team, having professionals whom you could ask questions or who could mentor you through the learning process, I consider all these as being essential for a modern IT organization.

The theory learned in schools need to be supported by hand-on experience in order to make most of the learning process, IT organizations are maybe the best places to do that, though I’m not sure how much that is possible. There are schools, organizations and governments that support this type of learning, though, unfortunately is not everywhere possible to do that or at least not for everybody. I think it’s in everybody’s interest to make most of the learning process, for schools to have highly skilled graduates, for organizations to have productive employees, a pool of college graduates resources from where they could select potential employees, for students to be skilled, and thus have higher chances of finding a job, while for governments this could lead in theory to a smaller unemployment rate. I find important the constructive involvement of all parties; now, I wonder how many schools, organizations or governments are trying to do something, change something into this direction.

22 July 2010

SQL Reloaded: Running a Statement for Each Database (CLR Version)

    I continue the series of posts on the use of the dbo.ExecuteScalarTo  CLR functions created in a previous post, this time, as is mentioned in the title, by running a statement for each database. The old-fashion method of a statement for each database is with the help of undocumented stored procedure sp_MSforeachdb, see for example ReplTalk’s post found recently. Actually the respective example based on the use of DatabaseProperty function has been given only for exemplification, the values available through the DatabaseProperty function could be retrieved directly from the sys.databases system table: 

 -- querying directly the values 
SELECT database_id 
, name DatabaseName 
, Is_Fulltext_enabled IsFulltextEnabled 
, Is_Ansi_Null_Default_on IsAnsiNullDefault 
, Is_Ansi_Nulls_on IsAnsiNullsEnabled 
, Is_Ansi_Warnings_on IsAnsiWarningsEnabled 
FROM master.sys.databases  
WHERE name LIKE 'AdventureWorks%' 

    Here’s the same query rewritten using the DatabaseProperty function:

 -- DatabaseProperty function at work 
SELECT database_id 
, name DatabaseName 
, DATABASEPROPERTY( name ,'IsFulltextEnabled') IsFulltextEnabled 
, DATABASEPROPERTY( name ,'IsAnsiNullDefault') IsAnsiNullDefault  
, DATABASEPROPERTY( name ,'IsAnsiNullsEnabled') IsAnsiNullsEnabled  
, DATABASEPROPERTY( name ,'IsAnsiWarningsEnabled') IsAnsiWarningsEnabled 
FROM master.sys.databases  
WHERE name LIKE 'AdventureWorks%' 

    The same query could be rewritten using the dbo.ExecuteScalarToString function:  

 -- DatabaseProperty function within CLR function 
SELECT database_id 
, name DatabaseName 
, dbo.ExecuteScalarToString('SELECT DATABASEPROPERTY( ''' + name + ''',''IsFulltextEnabled'')') IsFulltextEnabled 
, dbo.ExecuteScalarToString('SELECT DATABASEPROPERTY( ''' + name + ''',''IsAnsiNullDefault'')') IsAnsiNullDefault  
, dbo.ExecuteScalarToString('SELECT DATABASEPROPERTY( ''' + name + ''',''IsAnsiNullsEnabled'')') IsAnsiNullsEnabled  
, dbo.ExecuteScalarToString('SELECT DATABASEPROPERTY( ''' + name + ''',''IsAnsiWarningsEnabled'')') IsAnsiWarningsEnabled 
FROM master.sys.databases  
WHERE name LIKE 'AdventureWorks%' 

    The usefulness of the dbo.ExecuteScalarToString is minimal in this case, however it could be considered more “complex” examples. For example retrieving a count of the number of tables, views, assemblies or users existing in a database:

-- running a statement for each database 
SELECT database_id 
, name DatabaseName 
, dbo.ExecuteScalarToString('SELECT count(1) FROM ' + name + '.sys.tables') NumberTables 
, dbo.ExecuteScalarToString('SELECT count(1) FROM ' + name + '.sys.views') NumberViews 
, dbo.ExecuteScalarToString('SELECT count(1) FROM ' + name + '.sys.assemblies') NumberAssemblies 
, dbo.ExecuteScalarToString('SELECT count(1) FROM ' + name + '.sys.sysusers') NumberTables 
FROM master.sys.databases  
WHERE name LIKE 'AdventureWorks%' 

CLR  UDF - for each database

   Now it depends on each developer’s inspiration and on the problem attempting to solve. I find dbo.ExecuteScalarToString CLR function quite useful when attempting to run for each database multiple queries that returns a single value. On the other side sp_MSforeachdb provides more flexibility in what concerns the types of queries run, though specific techniques (e.g. temporary tables or table variables) need to be used in order to retrieve sp_MSforeachdb’s output in a single result-set (see for example the Locally Fragmented Indexes example from MSDN’s article on Uncover Hidden Data To Optimize Application Performance).

17 July 2010

SQL Reloaded: Number of Records II (The DMV Approach)

    In SQL Server 2000 the safest way to return the total number of records from a table was to do a simple COUNT. I’m say the safest way because (for big tables) there was a faster but approximate way to obtain the same information by using sysindexes table, which, as its name denotes, was storing one record for each index and table in a given database. When I started to use SQL Server 2005 and further 2008 I naturally followed the same approach, from several reasons not attempting to find whether there is a better alternative. Yesterday, while scanning roughly the titles from MSDN blogs, my attention was caught by Martinjnh’s post SQL Server–HOW-TO: quickly retrieve accurate row count for table, in which, in addition to COUNT and sys.sysindexes view kept from compatibility reasons,  he gives 2 other alternatives that use two sys.partitions and sys.dm_db_partition_stats DMV (dynamic management views), that contain “a row for each partition of all the tables and most types of indexes in the database”, respectively the “page and row-count information for every partition in the current database”.

    Of course, I tested the queries and their accuracy by inserting/deleting a few records in one of the AdventureWorks tables. The “statistics” seems to be updated in real time, thing certified also by the SQL Server 2005 documentation. I wanted to see the definition of the respective views, though they are not browsable by using the Management Studio. Some time ago I discovered that the definition of system views, functions and stored procedures are available though the sys.all_sql_modules view, the union join between sys.sql_modules and sys.system_sql_modules views. Here’s the query I used:

-- retrieving the definition of system views used to return the number of records 
SELECT * 
FROM master.sys.all_sql_modules 
WHERE CHARINDEX ('sys.dm_db_partition_stats', definition)>0 
      OR CHARINDEX ('sys.sysindexes', definition)>0 
       OR CHARINDEX ('sys.partitions', definition)>0 

Notes:
1.   Another way to get the definition of an object is to use the Object_Definition function:

SELECT object_definition(object_id('sys.sysindexes')) 

2.   Following Mladen Prajdic’s post, an SQL Server MVP, I found out that all the system objects are stored in the (hidden) read-only MsSqlSystemResource database that complements the master database.

3.   Diving into the three view’s definition, can be seen that all of them use the OpenRowSet function in order to make calls to INDEPROP, PARTITIONCOUNTS, ALUCOUNT internal SQL Server tables. As it seems this functionality of OpenRowSet function can’t be (mis-)used by users.

4. I found several interesting posts on sys.dm_db_partition_stats, for example Cindy Gross’ post, a Support Engineer at Microsoft, the blog containing several scripts and links to more resources.
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.