About Me

IT Professional with more than 16 years of experience in IT, especially in the areas of full life-cycle Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Saturday, July 31, 2010

Self-Joins and Denormalized Data Loading in Normalized Models

    One of the scenarios in which I often make use of self-joins is when I need to load denormalized data into a normalized data model. A characteristic of non-normalized or denormalized data is that they contain repeating data, typically the so-called header data, which need to be handled specifically.

Note:
   "Header data" is actually an imprecise term, because an entity could contain more than two levels of data; for example, the Purchase Orders (POs) in an ERP system could be split into PO Headers, Lines, Shipments and Distributions, thus a (denormalized) extract based on the respective data at Distribution level will contain repeating data from all the higher levels.

  So for this post I needed to find an entity with a parent-child or header-lines structure. Actually it's not difficult to find such an example: the world of ERP systems is full of them (POs, Invoices, Customer Orders, Receipts, Payments, to mention some of the important transactional data, or Customers, Vendors and even Items, when considering the master data). The difficulty is to simplify the example to a level that can be easily understood also by people who have had little contact with ERP systems or database models. For this I will consider the receipt received when paying for goods in a (super)market; on such a bill we can typically see the Receipt Number, the name of the Vendor from which the purchase was made, the Receipt Date (the date when the purchase was made), the Currency, and the Items purchased together with their Quantity and Price. Such information could easily be entered in Excel and later loaded into a denormalized table, or entered directly into the respective denormalized table:
-- Receipts denormalized table
CREATE TABLE [dbo].[Receipts](
  [ReceiptNumber] [nvarchar](50) NULL
, [Vendor] [nvarchar](100) NULL
, [ReceiptDate] [smalldatetime] NULL
, [CurrencyCode] [nvarchar](3) NULL
, [ItemNumber] [nvarchar](50) NULL
, [Quantity] [decimal](12, 3) NULL
, [Price] [numeric](12, 2) NULL
) ON [PRIMARY]


-- inserting test data
INSERT INTO dbo.Receipts
VALUES ('012034', 'TOOM', '2010-06-23', 'EUR', 'KABANOS PARIKA', 1, 2.19)
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'ZITRONE', 1, 0.79)
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'BREAKFAST BAKON', 1, 1.59)
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'PILLOW', 1, 23.99)
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'BED SHEET', 2, 11.99)


-- checking the data
SELECT *
FROM dbo.Receipts


   Supposing we have the above data and that we would like to load them into a normalized structure formed from the header information (Receipt Number, Vendor Name, Receipt Date and Currency Code) and the line information (Item Number, Quantity and Price):
-- Receipt Headers (normalized)
CREATE TABLE [dbo].[ReceiptHeaders](
  [ReceiptHeaderID] [int] IDENTITY(1,1) NOT NULL
, [ReceiptNumber] [nvarchar](50) NULL
, [Vendor] [nvarchar](100) NULL
, [ReceiptDate] [smalldatetime] NULL
, [CurrencyCode] [nvarchar](3) NULL
) ON [PRIMARY]


-- Receipt Lines (normalized)
CREATE TABLE [dbo].[ReceiptLines](
  [ReceiptLineID] [int] IDENTITY(1,1) NOT NULL
, [ReceiptHeaderID] [int] NOT NULL
, [ItemNumber] [nvarchar](50) NULL
, [Quantity] [decimal](12, 3) NULL
, [Price] [numeric](12, 2) NULL
) ON [PRIMARY]
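    Note that, for brevity, the above scripts omit the key constraints a normalized model would normally carry; here is a minimal sketch of what could be added (the constraint names are illustrative):

-- adding key constraints (illustrative)
ALTER TABLE dbo.ReceiptHeaders
ADD CONSTRAINT PK_ReceiptHeaders PRIMARY KEY CLUSTERED (ReceiptHeaderID)

ALTER TABLE dbo.ReceiptLines
ADD CONSTRAINT PK_ReceiptLines PRIMARY KEY CLUSTERED (ReceiptLineID)

ALTER TABLE dbo.ReceiptLines
ADD CONSTRAINT FK_ReceiptLines_ReceiptHeaders FOREIGN KEY (ReceiptHeaderID)
REFERENCES dbo.ReceiptHeaders (ReceiptHeaderID)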


   In order to load the denormalized data into the normalized structure we could write two queries: the first populates the ReceiptHeaders table, needing for this to select the distinct attributes that make up the header, while the second populates the ReceiptLines table:
-- inserting the Receipt Header data
INSERT INTO dbo.ReceiptHeaders
SELECT DISTINCT R.ReceiptNumber
, R.Vendor
, R.ReceiptDate
, R.CurrencyCode
FROM dbo.Receipts R


-- inserting the Receipt Lines data
INSERT INTO dbo.ReceiptLines
SELECT RH.ReceiptHeaderID
, R.ItemNumber
, SUM(R.Quantity) Quantity
, R.Price
FROM dbo.Receipts R
     JOIN dbo.ReceiptHeaders RH
       ON R.ReceiptNumber = RH.ReceiptNumber
      AND R.Vendor = RH.Vendor
GROUP BY RH.ReceiptHeaderID
, R.ItemNumber
, R.Price


    As can be seen from the second query, the Receipts table was joined with ReceiptHeaders in order to retrieve the corresponding header record for each line. For this match to be possible we need an attribute, or combination of attributes, that is unique across the header data, in this case the ReceiptNumber in combination with the Vendor name. If no such unique combination exists, then the match between header and line data will produce duplicates; in such a scenario it's recommended to clean the data before loading them, for example by introducing an attribute that makes the combination unique. The same uniqueness requirement applies to the lines too: it must be possible to identify each line in the source dataset uniquely. This could be done, for example, by introducing a Line Number in the source dataset or, as in this case, by aggregating the quantities of lines that carry the same information, the Price being moved into the GROUP BY clause since, in theory, products with the same Item Number bought at the same time have the same price (that doesn't necessarily happen in reality, though for our exemplification it will do).
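    Before the load it may be worth verifying that the chosen combination is indeed unique per receipt. Here is a minimal sketch of such a check, under the assumption that a header should carry a single Receipt Date and Currency Code; any combination returned would need cleaning first:

-- checking header uniqueness before load
SELECT R.ReceiptNumber
, R.Vendor
FROM dbo.Receipts R
GROUP BY R.ReceiptNumber
, R.Vendor
HAVING COUNT(DISTINCT R.ReceiptDate) > 1
    OR COUNT(DISTINCT R.CurrencyCode) > 1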

    After inserting the data into the normalized model it makes sense to check the logic by comparing the inserted data against the initial dataset. It's always a good idea to do that, as it can trap, for example, errors in logic. In Excel for SQL Developers – Part IV: Differences Between Two Datasets I showed how the query for comparing two datasets can be created in a semiautomatic manner, along with the resulting query. A similar query could be written in this case too (a sketch follows the view test below), for this purpose being useful to create a view which denormalizes our structure:
-- Receipts View
CREATE VIEW dbo.vReceipts
AS
SELECT RL.ReceiptLineID
, RL.ReceiptHeaderID
, RH.ReceiptNumber
, RH.Vendor
, RH.ReceiptDate
, RH.CurrencyCode
, RL.ItemNumber
, RL.Quantity
, RL.Price
FROM dbo.ReceiptLines RL
    JOIN dbo.ReceiptHeaders RH
      ON RL.ReceiptHeaderID = RH.ReceiptHeaderID


-- testing the view & updates
SELECT *
FROM dbo.vReceipts
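    As a minimal comparison sketch, the two datasets could be matched with an EXCEPT, the aggregation on the source side mirroring the one used when loading the lines; any row returned indicates a difference:

-- comparing the source against the loaded data
SELECT R.ReceiptNumber
, R.Vendor
, R.ReceiptDate
, R.CurrencyCode
, R.ItemNumber
, SUM(R.Quantity) Quantity
, R.Price
FROM dbo.Receipts R
GROUP BY R.ReceiptNumber
, R.Vendor
, R.ReceiptDate
, R.CurrencyCode
, R.ItemNumber
, R.Price
EXCEPT
SELECT VR.ReceiptNumber
, VR.Vendor
, VR.ReceiptDate
, VR.CurrencyCode
, VR.ItemNumber
, VR.Quantity
, VR.Price
FROM dbo.vReceipts VR

    Swapping the two SELECTs checks the opposite direction, i.e. records loaded but missing from the source.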

[screenshot: self-join normalization]
    So far we have managed to load the data from a denormalized structure, but no trace of a self-join! In many cases writing queries similar to the above ones is enough, though there are many cases when the data need to be loaded successively, either incrementally or as complete datasets. In both situations we could be dealing with data already loaded, so we have to avoid inserting duplicates. Now the self-join comes into play, because in both insert queries we have to exclude the records already loaded. Even if we deal with incremental data that form partitions (any record is provided only once), it's safer and recommended to check for records provided again. So, we'll have to modify the above two inserts to ignore the records already loaded:
-- inserting the Receipt Header data (with checking for loaded data)
INSERT INTO dbo.ReceiptHeaders
SELECT DISTINCT R.ReceiptNumber
, R.Vendor
, R.ReceiptDate
, R.CurrencyCode
FROM dbo.Receipts R
     LEFT JOIN dbo.ReceiptHeaders RH
       ON R.ReceiptNumber = RH.ReceiptNumber
      AND R.Vendor = RH.Vendor
WHERE RH.ReceiptNumber IS NULL


-- inserting the Receipt Lines data (with checking for loaded data)
INSERT INTO dbo.ReceiptLines
SELECT RH.ReceiptHeaderID
, R.ItemNumber
, SUM(R.Quantity) Quantity
, R.Price
FROM dbo.Receipts R
     JOIN dbo.ReceiptHeaders RH
       ON R.ReceiptNumber = RH.ReceiptNumber
      AND R.Vendor = RH.Vendor
     LEFT JOIN dbo.vReceipts VR
       ON R.ReceiptNumber = VR.ReceiptNumber
      AND R.Vendor = VR.Vendor
      AND R.ItemNumber = VR.ItemNumber
WHERE VR.ReceiptNumber IS NULL
GROUP BY RH.ReceiptHeaderID
, R.ItemNumber
, R.Price


    I hope the queries are easy to understand; if not, it might be a good idea to check the posts on left joins. In order to test the queries, let's insert new data into the Receipts table:
-- inserting new test data
INSERT INTO dbo.Receipts
VALUES ('012455', 'TOOM', '2010-07-25', 'EUR', 'KABANOS PARIKA', 1, 2.20)
, ('012455', 'TOOM', '2010-07-25', 'EUR', 'ZITRONE', 1, 0.79)


    After running the three inserts, the same data should be available in the input denormalized structure and in the normalized structures:
-- testing the updates
SELECT *
FROM dbo.vReceipts

[screenshot: self-join normalization incremental]
Notes:
    This post didn't consider the case in which the data are modified between loads; besides the above inserts, two update statements would also have to be written to reflect the changes occurring in the already loaded data (a minimal sketch for the headers follows below). I will try to approach this subject in more detail in another post.
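    As a sketch only, and assuming that changed header attributes should simply overwrite the already loaded values, such an update might look as follows, with a similar statement needed for the lines:

-- updating already loaded Receipt Header data (sketch)
UPDATE dbo.ReceiptHeaders
SET ReceiptDate = R.ReceiptDate
, CurrencyCode = R.CurrencyCode
FROM ( -- inline view
     SELECT DISTINCT ReceiptNumber
     , Vendor
     , ReceiptDate
     , CurrencyCode
     FROM dbo.Receipts
) R
WHERE dbo.ReceiptHeaders.ReceiptNumber = R.ReceiptNumber
  AND dbo.ReceiptHeaders.Vendor = R.Vendor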

Friday, July 30, 2010

Believe and Not Doubt?!

    A long time ago, while reviewing a PowerPoint presentation on Oracle performance troubleshooting, I saw written in big letters on one of the pages something like "Believe and do not doubt!", this in the context of knowledge transmitted from a guru to other professionals. I do not believe in the approach of blindly following the techniques and knowledge of other professionals, and the professional path can't be entirely likened to the spiritual path, in which the mind itself is said to be the barrier to proceeding further along the way. One reason for my advice of not blindly following other professionals is that knowledge is relative, highly dependent on experience and understanding, no human holding the ultimate knowledge in any domain. Even if the world of databases in particular, and of computer science in general, is highly conceptualized, full of mathematical models used to describe it, all of them are just approximations; real-life problems reduce in fact to optimization problems, the solution resting on finding the optimum across a whole range of parameters. The same applies to performance troubleshooting, where we have to balance performance against usability, flexibility, maintenance, security, reusability and whatever else might come around.

    Please note that I'm saying "not following blindly"; this doesn't mean that you shouldn't listen to and follow the advice of other professionals. On the contrary, I recommend sipping their knowledge and words of wisdom, though try to understand the reasons behind the techniques used; primarily try to understand the benefits and downsides of each technique or feature, understand why, how, when, where, who and by what means. Without understanding these aspects, the information presented can't rise to the level of knowledge and therefore won't reach the potential value it might have for you and other professionals. In addition, by understanding the various contexts in which the techniques and features apply or shouldn't be used, you could step onto the road to becoming wise, though it's a long road to there. Anyway, that's already philosophy, but you have to take into account that nobody was born wise, and even wise people make mistakes, but they learn from them. No matter whether you are a beginner or an experienced professional, each can learn from the other; there are enough domains and enough knowledge in the world for everybody.

    Don't be confused by the various contradictions and interpretations met in the world; the truth is relative, and you'll often hear experts simply saying "it depends…", just a syntagm to highlight the complexity of things, the dependence of the solution on various parameters. Even if you don't agree with others' perspectives, be indulgent with them and with yourself; there are various opinions and beliefs, so balance their points of view against yours: somewhere in the middle lies the truth…

Self-Join in Update Query - Update

    In yesterday's post on the same topic I tried to exemplify how a self-join update query can be written in order to avoid an error raised by the database engine. Even if this type of query is normally written for hierarchical structures consisting of one-level parent-child relations stored in the same table, my example used a simple join based directly on the primary key. During the day I was thinking that it might be straightforward to create a simple hierarchical structure based on a relatively realistic example. For this let's consider a department in which the performance of the manager is a function of the performance of his employees, while the number of hours of training allocated to each employee is proportional to the number of hours allocated to his manager. For this let's consider the following table:

-- Employees table's script
CREATE TABLE [dbo].[Employees](
  [EmployeeID] [int] NOT NULL
, [ManagerID] [int] NULL
, [Name] [nvarchar](50) NULL
, [Performance] [numeric](5, 3) NULL
, [TrainingHours] [numeric](5, 2) NULL
) ON [PRIMARY]

-- inserting the test data
INSERT INTO dbo.Employees
VALUES (1, NULL, 'Joe', NULL, 10)
, (2, 1, 'Jack', .65, NULL)
, (3, 1, 'Mary', .45, NULL)
, (4, 1, 'Ross', .54, NULL)
, (5, 1, 'Scott', .48, NULL)
, (6, NULL, 'Jane', NULL, 15)
, (7, 6, 'Sam', .50, NULL)
, (8, 6, 'Ron', .45, NULL)

    In order to calculate the managers' performance we could use a query similar to the following; of course, the formula could be more complicated than a simple average:
-- updating Managers' Performance
UPDATE dbo.Employees
SET Performance = EMP.Performance
FROM ( -- inline view
     SELECT ManagerID
     , AVG(Performance) Performance
     FROM dbo.Employees
     WHERE ManagerID IS NOT NULL
     GROUP BY ManagerID
) EMP
WHERE dbo.Employees.ManagerID IS NULL
  AND dbo.Employees.EmployeeID = EMP.ManagerID

    Let's check the updates and their correctness:
-- Checking updated data
SELECT *
FROM dbo.Employees
WHERE ManagerID IS NULL

-- Verifying output
SELECT ManagerID
, AVG(Performance) Performance
FROM dbo.Employees
WHERE ManagerID IS NOT NULL
GROUP BY ManagerID

[screenshot: self-join output 1]
Note:
    Please note that the average needs to be calculated in the inline view. If multiple records are retrieved for each record of the updated table, the query will still work, though the result is "unpredictable":

-- updating Managers' Performance w/o aggregates
UPDATE dbo.Employees
SET Performance = EMP.Performance
FROM ( -- inline view
     SELECT ManagerID
     , Performance
     FROM dbo.Employees
     WHERE ManagerID IS NOT NULL
) EMP
WHERE dbo.Employees.ManagerID IS NULL
  AND dbo.Employees.EmployeeID = EMP.ManagerID

    The reverse update, based on the number of hours of training, could be written as follows:
-- updating Employees' Training Hours
UPDATE dbo.Employees
SET TrainingHours = 0.75 * EMP.TrainingHours
FROM ( -- inline view
     SELECT EmployeeID
     , TrainingHours
     FROM dbo.Employees
     WHERE ManagerID IS NULL
) EMP
WHERE dbo.Employees.ManagerID IS NOT NULL
  AND dbo.Employees.ManagerID = EMP.EmployeeID

    Let's check the changes:

-- Checking updated data
SELECT *
FROM dbo.Employees
WHERE ManagerID IS NOT NULL

-- Verifying output
SELECT EmployeeID
, 0.75 * TrainingHours TrainingHours
FROM dbo.Employees
WHERE ManagerID IS NULL

  
[screenshot: self-join output 2]

Wednesday, July 28, 2010

Self-Join in Update Query

    While reading R. Scheldon's article on "UPDATE Basics in SQL Server" I remembered a problem I had a long time ago while attempting to do a self-join in an UPDATE query. Such a query is quite useful in hierarchical structures consisting of one-level parent-child relations stored in the same table, when the parent needs to be updated based on child information, or vice versa. The problem I had can also be exemplified using a simple table (no hierarchical structure); here's the query:

-- self-join update - problematic query
UPDATE Production.Product
SET StandardCost = ITM.StandardCost * (1+.012)
FROM Production.Product ITM
WHERE Production.Product.ProductID = ITM.ProductID


    The above statement returns the following error:
Msg 4104, Level 16, State 1, Line 5
The multi-part identifier "Production.Product.ProductID" could not be bound.

    In order to avoid this error message, the table from the FROM clause could be included in an inline view (once the table is aliased in the FROM clause, the unaliased multi-part name can no longer be bound), something like:
-- self-join update - solution
UPDATE Production.Product
SET StandardCost = ITM.StandardCost * (1+.012)
FROM ( -- inline view
     SELECT *
     FROM Production.Product
) ITM
WHERE Production.Product.ProductID = ITM.ProductID

    The inline view could be replaced with a standard view or a table-valued UDF, or the update could be done through a view. For exemplification I will use only the first and third cases (a sketch of the UDF variant follows after them); here's the view based on the Production.Product table:
-- view based on Production.Product
CREATE VIEW Production.vProductTest
AS
SELECT *
FROM Production.Product


    And here are the queries corresponding to the two cases:
-- self-join update through view update
UPDATE Production.vProductTest
SET StandardCost = ITM.StandardCost * (1+.012)
FROM Production.Product ITM
WHERE Production.vProductTest.ProductID = ITM.ProductID

-- self-join update from view
UPDATE Production.Product
SET StandardCost = ITM.StandardCost * (1+.012)
FROM Production.vProductTest ITM
WHERE Production.Product.ProductID = ITM.ProductID
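    For completeness, here is also a minimal sketch of the table-valued UDF variant mentioned above (the function name is illustrative):

-- inline table-valued UDF based on Production.Product (illustrative)
CREATE FUNCTION Production.fProductTest()
RETURNS TABLE
AS
RETURN (
     SELECT ProductID
     , StandardCost
     FROM Production.Product
)

-- self-join update from UDF
UPDATE Production.Product
SET StandardCost = ITM.StandardCost * (1+.012)
FROM Production.fProductTest() ITM
WHERE Production.Product.ProductID = ITM.ProductID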


Note:
1.    The first update query was written this way only to exemplify the self-update within a hierarchical structure; if all that's needed is to modify the StandardCost and nothing more, then the update could be written simply as:
-- simple update statement
UPDATE Production.Product
SET StandardCost = StandardCost * (1+.012)

    Bruce H. posted, as a comment to R. Scheldon's article, a surprising rewriting of the above query:
-- simple update statement rewritten with an alias
UPDATE ITM
SET StandardCost = StandardCost * (1+.012)
FROM Production.Product ITM


2.    When the join constraint is based on a hierarchical structure that joins on different columns, the query can't be reduced to the following form:
-- simple update statement (no join constraint)
UPDATE Production.Product
SET StandardCost = ITM.StandardCost * (1+.012)
FROM Production.Product ITM

   This query is nothing more than an alternative to the queries shown in the first note.

Monday, July 26, 2010

Porting 32 bit CLR UDFs on 64 bit Platforms

    Today I tried to port to a 64-bit platform a few of the CLR UDFs created in the previous posts, this time being constrained to use Visual Basic 2010 Express to create and build the assembly on an x86 platform, and to install the assembly on an x64 SQL Server box. From the previous troubleshooting experience between the two platforms, I knew there would be some challenges; fortunately there was nothing complex. Under SSIS 2008 it's possible to choose the targeted platform, therefore I was expecting something similar in VB 2010 Express, but after a review of the Project Properties, especially the Compile settings, I found nothing relevant. I then tried the standard approach: I built the solution, copied the .dll to the target server and tried to register the assembly, though I got the following error:

Msg 6218: %s ASSEMBLY for assembly '%.*ls' failed because assembly '%.*ls' failed verification. Check if the referenced assemblies are up-to-date and trusted (for external_access or unsafe) to execute in the database. CLR Verifier error messages if any will follow this message%.*ls

    After several attempts to Google for a solution on how to port 32-bit CLR UDFs to a 64-bit platform, or on how to configure VB 2010 Express to target solutions for 64-bit platforms, I found a similar question (VB Express target x86 Platform) on MSDN. Johan Stenberg's answer, completed by JohnWein's hint, led me to the "Issues When Using Microsoft Visual Studio 2005" document, more specifically to section 1.44, "References to 32-bit COM components may not work in VB and C# Applications running on 64-bit platforms", of importance being the part talking about "Express Editions". The document specifies how to modify the project file and add, in the first PropertyGroup section, a PlatformTarget tag with the text value x86; therefore what I had to do was add the respective tag but with the value x64. After doing this change everything worked smoothly. It's kind of a mystery why Microsoft hasn't enabled this feature in the Express versions, but in the end I can live with it as long as there is a workaround.
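    For reference, here is a minimal sketch of the change in the .vbproj file; only the PlatformTarget line is added, the rest of the PropertyGroup stays as generated (the ellipsis stands for the existing content):

<!-- first PropertyGroup of the .vbproj file -->
<PropertyGroup>
   ...
   <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>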

Saturday, July 24, 2010

Football and Software Development

    I wanted to write this post during the South Africa World Cup 2010, though because of the lack of time and because I was waiting for some statistics I could use, here I am, two weeks after the final whistle of the match for first place. Football and software development: two domains that seem to have nothing in common, even if many software developers like to play football, and many football players spend a lot of time in front of their laptops. There is actually an important coordinate shared by the two: teamwork. Of course, that's common to many other sports, though there are some characteristics important mainly to soccer, like the small rate of deliverables (goals) and the rate of failures (wrong passes), and I bet there are other characteristics common to most team sports, like the division and specialization of work, migration of players, "project"-oriented work, flow of money, etc.

    Looking back at the games of this World Cup, we have to notice that, with a few exceptions, there wasn't a big difference between the teams anymore, a trend that could be seen during the last few championships too, and there were no more individual players winning one game after another. Nowadays what primes is the collective work, the cohesion of a team, the way the players respect the tactical indications given to them, the way they communicate and feel each other on the field. It didn't matter anymore that you were playing against a Ronaldo, a Messi, a Lampard, a Drogba or a Rooney; small teams like Australia, Chile, New Zealand or South Korea fought as equals against the favorite teams of this tournament. What's more important to notice is that teams whose players cost and make millions didn't function well (as expected) because the team didn't play as a team, because the sense of individuality primed, because there was no adequate communication inside the team, while the trainer didn't know how to make himself respected, how to select his team, how to make or take the best out of his players, or how to change tactics to counteract those of the adversary. So the team with the best paid or most skilled players, the team which puts in more effort or controls the game, the team which has the most dynamic, effective (in number of goals), beautiful or pragmatic play doesn't necessarily win the game, just as the best trainer can make mistakes too, can easily be misunderstood, or can become overnight a persona non grata for his team or the public.

   The same observations can be applied to software development, and at the risk of being criticized I would say that the team with the best developers/professionals does not necessarily make a project successful, especially when the sense of individuality primes over that of the team, when the team members don't play as a team, when there is no adequate communication, when the manager doesn't make himself respected and doesn't know how to make or take the best out of his players, how to make a team successful no matter the team's size and skill set. I tend to believe that in software development, same as in football, what must matter is the joy of playing in a team: the joy of playing, being an example of professionalism, collaborating in achieving the purpose, helping each other become better, no matter whether one is called player, trainer, masseur, doctor or federation member.

    I think that trainers have to learn more about project management, given that building and leading a competitive team has a lot to do with projects and project management, being driven by similar goals, objectives, scope, etc. On the other side, I feel that managers have to learn more from the behavior and knowledge of a trainer: to know how to be authoritarian and when to be a friend. And I expect there are many other aspects the two types of professionals share. IT professionals also have to learn from football players, especially in what concerns team spirit, what it takes to be or become a team, what it means to have your place in the field and to fill it right, for the best of the team. Of course, self-sacrifice must not be taken to the extreme, as some players do. In what concerns football players, I think they could learn from developers simplicity, dedication to their work, the drive to become better, to learn something new, to find and know their place in life, and overall to be humble.

    As for the executives dealing with IT projects, they have to learn that a defender can't become overnight a goal-getter or a goalkeeper, that a newcomer to the team needs time to adapt and must be helped to become an integral part of it, that the newcomer needs to find his pace and place in the team. Executives have to learn that it takes time, effort and a good strategy to build good, successful teams, and even if everything turns around money (although it shouldn't), a balance should be kept between investments, effort and rewards, with some continuity, respect and support toward achieving successful and competitive teams.

Note:
    If you liked this post or found it interesting, then you might also be interested in Yohasna's post What can we learn from SPAIN's World Cup Victory in the world of Software? and my answer to it, Satya's Football and Software teams.. How different are they?, or B. Dwolatzky's article Football and Software Development are both team sports.