Sunday, January 13, 2008

Advice on SQL logic split in Web Applications

Yesterday I remembered my first two small fights on SQL-related topics, though in both situations I had to temporarily give up my opinions for the sake of peace. It’s understandable, we can all make mistakes; unfortunately, what we know hurts us more than what we don’t know. It’s amusing to think now about the two issues we argued over, though at the time it was a little disturbing. What were they about? Both issues relate to web applications, my first “professional encounter” with programming.

Issue 1 – using JOINs or backend vs middle tier processing

JOINs are a powerful feature of SQL for combining related data from multiple tables in a single query, at the cost of a little (or more) overhead on the database server side.
Web applications make heavy use of data access operations: data is pulled from a database each time a user requests a page, provided the page needs data from the database(s) or executes commands against it, the whole CRUD (Create/Read/Update/Delete) range. That can become costly in time, depending on the requirements and on how data access was architected. The target is to pull the smallest chunk of data possible (rule 1), with a minimum of trips to the database (rule 2).

Suppose we need Employee data from a database for a summary screen listing all employees; it could contain First Name, Last Name, Department and contact information: City, Country, Email Address and Phone Number. Normally the information would be stored in 4 tables (Employees, Departments, Addresses and Countries), as in the diagram below.
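The four tables could be defined roughly as follows. This is only a minimal sketch: the column names are the ones used in the queries of this post, while the data types, the keys, the nullability and the AddressID column are assumptions made for illustration.

```sql
-- Column names come from the queries in this post; data types, keys,
-- nullability and the AddressID column are assumptions.
CREATE TABLE Departments (
    DepartmentID int NOT NULL PRIMARY KEY
  , Department   nvarchar(50) NOT NULL
);

CREATE TABLE Countries (
    CountryID int NOT NULL PRIMARY KEY
  , Country   nvarchar(50) NOT NULL
);

CREATE TABLE Employees (
    EmployeeID   int NOT NULL PRIMARY KEY
  , DepartmentID int NULL REFERENCES Departments (DepartmentID)
  , FirstName    nvarchar(50) NOT NULL
  , LastName     nvarchar(50) NOT NULL
  , EmailAddress nvarchar(100) NULL
);

CREATE TABLE Addresses (
    AddressID  int NOT NULL PRIMARY KEY  -- hypothetical surrogate key
  , EmployeeID int NOT NULL REFERENCES Employees (EmployeeID)
  , CountryID  int NULL REFERENCES Countries (CountryID)
  , City       nvarchar(50) NULL
  , Phone      nvarchar(25) NULL
);
```

DepartmentID and CountryID are left nullable here on purpose, so that an Employee without a Department or an Address without a Country can exist.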



The easiest and best way to pull the Employee needed data is to do a JOIN between tables:
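A minimal sketch of such a query, assuming the table and column names used in the step-by-step queries further down and EmployeeID, DepartmentID and CountryID as the join keys:

```sql
-- One roundtrip: the database server matches the rows of all four tables
SELECT E.EmployeeID
     , E.FirstName
     , E.LastName
     , D.Department
     , A.City
     , C.Country
     , E.EmailAddress
     , A.Phone
FROM Employees E
     JOIN Departments D
       ON E.DepartmentID = D.DepartmentID
     JOIN Addresses A
       ON E.EmployeeID = A.EmployeeID
     JOIN Countries C
       ON A.CountryID = C.CountryID;
```

An Employee with more than one Address would appear once per Address in the result.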



It is likely that two or more employees will share the same country or department, resulting in “duplication” of small pieces of information within the whole data set, which contradicts rule 1. Smaller chunks of data can be pulled by targeting the content of one table at a time: first pull all the Employees or the ones matching a set of constraints, then all the Departments or only the ones for which an Employee was returned, and the same with Addresses and Countries. In the end we will have 4 queries and the same number of roundtrips (or more). In the web page the code will have to follow the steps below:

Step 1: Pull the Employee data matching the query:
SELECT E.EmployeeID
, E.DepartmentID
, E.FirstName
, E.LastName
, E.EmailAddress
FROM Employees E
WHERE <constraints>

Step 2: Build the (distinct) list of Department IDs and a (distinct) list of Employee IDs.

Step 3: Pull the Department data matching the query:
SELECT D.DepartmentID
, D.Department
FROM Departments D
WHERE DepartmentID IN (<list of Department IDs>)

Step 4: Pull the Address data matching the query:
SELECT A.EmployeeID
, A.CountryID
, A.City
, A.Phone
FROM Addresses A
WHERE EmployeeID IN (<list of Employee IDs>)

Step 5: Build the (distinct) list of Country IDs.

Step 6: Pull the Country data matching the query:
SELECT C.CountryID
, C.Country
FROM Countries C
WHERE CountryID IN (<list of Country IDs>)

And if this doesn’t look like overhead to you, take into account that for each Employee the code has to look up the right Department in the data set returned in Step 3, and the same goes for Addresses and Countries. It’s exactly what the database server does, but done on the web server, which has no built-in capabilities for data matching.
To get around the matching problem, somebody could execute, for each employee returned in Step 1, a query like the one defined in Step 4 but limited to the respective Employee, adding as many roundtrips as there are Employees. Quite a monster, isn’t it? Please don’t do something like this!

It’s true that we always need to balance a minimum of data against a minimum of roundtrips to the database, though we also have to take into account the overhead created by going to either extreme, and implement the logic on the right tier. So do data matching on the database server as much as possible, because it was designed for that, and do data enrichment (e.g. formatting), when possible, only on the web server.

In theory the easiest way of achieving something is the best as long as the quality remains the same, so try to avoid writing expensive code that’s hard to write, maintain and debug!

Issue 2 – LEFT vs INNER JOINs

Normally each employee should be linked to a Department and have at least one Address, and the Address should be linked to a Country. That can be enforced at database and application level, though it’s not always the case. There could be Employees that are not assigned to a Department, or without an Address; in such cases, instead of an INNER JOIN (which returns only the rows that have a match in both tables) you have to consider a LEFT or, as the case may be, a RIGHT (OUTER) JOIN. So, I’ve rewritten the first query, this time using LEFT JOINs.
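A minimal sketch of the LEFT JOIN version, again assuming the table and column names from the step-by-step queries of Issue 1:

```sql
-- Every Employee is returned, even without a Department or an Address;
-- the columns of the missing rows come back as NULL
SELECT E.EmployeeID
     , E.FirstName
     , E.LastName
     , D.Department
     , A.City
     , C.Country
     , E.EmailAddress
     , A.Phone
FROM Employees E
     LEFT JOIN Departments D
       ON E.DepartmentID = D.DepartmentID
     LEFT JOIN Addresses A
       ON E.EmployeeID = A.EmployeeID
     LEFT JOIN Countries C
       ON A.CountryID = C.CountryID;
```

Note that Countries is LEFT JOINed as well: an inner join on A.CountryID would discard the Employees without an Address again, because their A.CountryID is NULL.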



You don’t always need LEFT JOINs; use them only when the business case requires it, and don’t abuse them, as they can come with a performance decrease!

Saturday, January 12, 2008

SQL Server and Excel Data

Looking for information about SQL Server 2008, I stumbled upon Bob Beauchemin’s blog; the first posting I read, “Another use for SQL Server 2008 row constructors”, demonstrates a new use of the VALUES clause that allows inserting multiple rows into a table with a single statement, or grouping a set of values into a table and using them as the source for a JOIN. I was hoping for this feature to be available in SQL Server 2005, though better late than never!

The feature is useful when you need to limit the output of a query based on matrix (tabular) values coming from an Excel or text file. As an example I’ll use the HumanResources.vEmployee view from the AdventureWorks database that comes with SQL Server 2005; you can download it from CodePlex in case you don’t have it for SQL Server 2008.
Let’s suppose that you have an Excel file with Employees for which you need contact information from a table available on SQL Server. You have the FirstName, MiddleName and LastName, and you need the EmailAddress and Phone. In SQL Server 2008 you can do that by creating a temporary table-like structure on the fly using the VALUES clause, and then using it in JOIN or INSERT statements.

Query 1
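A minimal sketch of what such a query could look like in SQL Server 2008, assuming the HumanResources.vEmployee columns named above; the three names in the VALUES list are sample values used only for illustration:

```sql
-- Sample names for illustration; in practice they come from the Excel file
SELECT A.FirstName
     , A.MiddleName
     , A.LastName
     , A.EmailAddress
     , A.Phone
FROM HumanResources.vEmployee A
     JOIN ( VALUES ('Guy', 'R', 'Gilbert')
                 , ('Kevin', 'F', 'Brown')
                 , ('Rob', 'J', 'Walters')
          ) AS B(FirstName, MiddleName, LastName)
       ON A.FirstName = B.FirstName
      AND A.MiddleName = B.MiddleName
      AND A.LastName = B.LastName;
```

With plain equality a row whose MiddleName is NULL on either side will not match, so names without a middle name need separate handling (for example by comparing IsNull(A.MiddleName, '') with IsNull(B.MiddleName, '')).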

The heart of the query is the structure below, where B(FirstName, MiddleName, LastName) is the new table, each row in its definition being specified by a comma-delimited triple of the form ('FirstName', 'MiddleName', 'LastName'):
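As a sketch, here is the structure on its own, queried as a derived table, followed by the same rows used as the source of an INSERT. The sample names and the #Names temporary table are hypothetical, introduced only for illustration.

```sql
-- The row constructor queried on its own as derived table B
SELECT B.FirstName
     , B.MiddleName
     , B.LastName
FROM ( VALUES ('Guy', 'R', 'Gilbert')
            , ('Kevin', 'F', 'Brown')
     ) AS B(FirstName, MiddleName, LastName);

-- The same rows inserted with a single statement into a
-- hypothetical temporary table
CREATE TABLE #Names (
    FirstName  nvarchar(50)
  , MiddleName nvarchar(50)
  , LastName   nvarchar(50)
);

INSERT INTO #Names (FirstName, MiddleName, LastName)
VALUES ('Guy', 'R', 'Gilbert')
     , ('Kevin', 'F', 'Brown');
```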



The construct is time-consuming to build manually, especially when the number of rows is considerable, though you can generate it in Excel with the help of a simple formula.



The formula from column D is = ", ('" & A2 & "','" & B2 & "','" & C2 & "')" and it can be applied to the other rows too. You now just need to copy the data from column D to SQL Server and use them in the query with a few small changes (for instance, removing the leading comma from the first row). Of course, you can also create a custom function (macro) in Excel to obtain the whole structure in a single cell.

You can do something similar in older versions of SQL Server (or in other databases) using a simple trick: concatenate the values from each column, row by row, using a delimiter like “/”, “~” or “|”, making sure that the delimiter isn’t found in your data sources (Excel and table). Using “/” the formula is = ", '" & A2 & "/" & B2 & "/" & C2 & "'".




Then you have to use the same trick and concatenate the columns from the table, the query becoming:

Query 2
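A minimal sketch of such a query, which should also work on SQL Server 2005; it assumes the same HumanResources.vEmployee columns, and the concatenated names in the list are sample values for illustration:

```sql
-- The same "/" delimiter is used on both sides of the comparison
SELECT A.FirstName
     , A.MiddleName
     , A.LastName
     , A.EmailAddress
     , A.Phone
FROM HumanResources.vEmployee A
WHERE A.FirstName + '/' + IsNull(A.MiddleName, '') + '/' + A.LastName
   IN ( 'Guy/R/Gilbert'
      , 'Kevin/F/Brown'
      , 'Roberto//Tamburello' );
```

IsNull is used on MiddleName because, with the default settings, concatenating a NULL yields NULL for the whole expression; an empty cell in Excel produces the two consecutive delimiters seen in the last value.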

This technique involves small difficulties when:
• The data used for searching are of a type other than string; that can be overcome by casting the values to string before concatenation.
• The string values contain leading or trailing spaces, so it’s better to trim the values using the LTrim and RTrim functions.
• The values from the two sources are slightly different, for example diacritics vs. their standard Latin equivalents, in which case the values need to be transformed to the same format.
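The three points above could be handled along these lines. This is only a sketch: the EmployeeID values in the list are hypothetical, and for the third point it relies on an accent-insensitive collation (Latin1_General_CI_AI) instead of transforming the values, which is a different technique from the one described in the list.

```sql
-- Hypothetical key list of the form EmployeeID/FirstName/LastName
SELECT A.EmployeeID
     , A.FirstName
     , A.LastName
     , A.EmailAddress
     , A.Phone
FROM HumanResources.vEmployee A
WHERE ( Cast(A.EmployeeID AS varchar(10))     -- non-string value cast to string
        + '/' + LTrim(RTrim(A.FirstName))     -- spaces trimmed at both ends
        + '/' + LTrim(RTrim(A.LastName))
      ) COLLATE Latin1_General_CI_AI          -- accent- and case-insensitive comparison
   IN ( '1/Guy/Gilbert'
      , '4/Rob/Walters' );
```

The values built in Excel need the same treatment on their side, for example by trimming the cells before concatenation.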