About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Monday, December 01, 2008

BI issues

    BI projects are noble in intent though many managers and technical consultants ignore its implications and prerequisites – data availability and quality, cooperation, maturity, infrastructure, adequate tools and knowledge.

Data availability and quality

    The problem with data starts at the source, usually ERP and other IS. In theory they should cover all the requirements existing in an enterprise, though it’s not always possible to rich that state of art. Such needs arrive to be covered by ad-hoc developed tools which often include Excel & MS Access based solutions and whose data are often difficult to integrate.

    Data Quality is maybe the most ignored component in the attempt of achieving an effective BI solution. The quality of data is based primarily on the validation implemented in source systems and the mechanisms used to cleanse the data before being reported, and secondarily on the efficiency and effectiveness of existing business processes and best practices.

    Data Quality must be guaranteed for accurate decisions, if the quality is not proved, no wonder that users are reluctant of using the reports! Normally the data must be validated, and for this are needed users with adequate knowledge to do that and enough data to back up the BI reports. I would advance here that each BI report needs corresponding supporting OLTP reports, a good synchronization between data sets could offer the base for an adequate validation.

    The quality of decisions is based on the degree to which data were understood and presented to decisional factors, though that’s not enough; is need also a complete perspective, and maybe that’s why some professionals prefer to prepare and aggregate data by themselves, this process allowing to get a deeper feeling of what’s happening (even that’s not enough).


    A BI initiative doesn’t depends only on the effort of a department (usually IT), but of the business as a whole. Unfortunately the so called partnership is more a theoretical term than a fact, same as managers’ and users involvement.

    BI implementations are also dependant on consultants’ skills, the degree to which they understood business’ requirements, team’s cohesion and other project (management) related issues. Everything culminates with the knowledge transfer and training which happens to miss just because of I don’t know what occult reason.

Adequate tools

    Most of the BI tools available on the market don’t satisfy all business/users’ requirements, even if they excel in some features, they lack in others. I would say that more that one BI tool is needed to cover most of the requirements. When features are not available or they are not handy enough, there is no wonder that users prefer to use tools they already know and see the immediate benefit.

    Another important consideration is that BI tools rely on a data model, often inflexible from the point of the information it provides, but also of integrating additional data sets, algorithms and customization. It’s not an accident that customers are finding the concept of data cloud more appealing – it’s a requirement for nowadays business dynamics. Maturity Jumping over the complexity of CMM/CMMI and other such models, I would simplify and say that organizations lack in that knowledge of transforming data in knowledge, understanding it and evolving it further in wisdom and competitive advantage. Most of the fancy words used by Sales People to sell a product don’t become reality over night; of course a BI tool might have the potentiality of fulfilling such goals, though it’s a long way to go until reaching that level. Infrastructure Infrastructure here refers to human and technical components and the way they interact in getting the job done. Some of the issues were already covered in the previous subsections, while many others can be subject of more elaborated work. I resume in saying that it's not only about “breaking the habits” but aligning people and technologies to the desired level of performance, of retaining and diffusing knowledge. Unfortunately we deal with too much philosophy and less action.

Tuesday, November 11, 2008

Data Quality in Information Systems

    One of the LinkedIn users brought to our attention the fact that “according to top managers in the IT business the top 2 reasons why CRM investments fails is:
1. Managing resistance within the organisation
2. Bad data quality” (B. Hellgren, 2008) and I think that’s valid not only to CRM or BI solutions but also to other information system solutions.

    Especially in ERP systems the data quality continues to be a problem and from my experience I can point here the following reasons:
- processes span across different functions, each of them maintaining the data they are interested in, being rare the cases in which a full agreement is met between functions, and here is also management’s fault.
- within an enterprise many systems arrive to be integrated, the quality of the data depending on integrations’ stability and scope; if stability can be obtained in time, scope resumes to the volume of data available, the delta needing to be maintained mainly manually.
- there are systems which are not integrated but using the same data, users needing to duplicate their effort so they often focus on their immediate needs. - lack of knowledge about the system used, the processes in place and best practices (they can be further detailed).
- basic or inexistent validation for data entry in each important entry point (User Interface, integration interfaces, bulk upload), system permissiveness (allowing workarounds), stability and reliability (bugs/defects).
- inexistence of data quality control mechanisms or quality methodologies.

    Data quality is usually ignored in BI projects, and this because few are the ones that go and search the causes, being easier to blame the BI solution or the technical team. This is ones of the reasons for which users are reticent in using a BI solution, to which add up solution’s flexibility and the degree up to which the solution satisfies users’ needs.     On the other side BI solutions are often abused, including also reports which have OLTP characteristics or of providing too much, unstructured or inadequate content that needs to be further reworked.

    I think that data quality comes on manager’s dashboards especially during ERP implementations, when the destination system has some strong validation. And then there is no wonder when each record contains at minimum a defect and the whole dashboard is in nuances of red (the red must be "en vogue").

Saturday, November 08, 2008

Dealing with data duplicates on SQL Server

    Duplicates or duplications can be defined as “result rows that are identical to one or more other rows” (S.Hull, 1998), they occur frequently in data collections (e.g. Excel, Access, etc.) that have minimum or no validation, more users, higher the risks of having duplicated data. Duplicates occur also in modern architectures when validation wasn’t addressed correspondingly or from usage related issues, for example in information systems, the huge amount of data entry makes it hard to overlook documents already processed (e.g. purchase orders, invoices, payments, etc.).

    Subject to duplication are whole records, a group of attributes (fields) or only single attributes, it depends from case to case. Often duplicates are easy to identify, it’s enough to let somebody who has the proper knowledge to look over them, but what you do when the volume of data is too large or when is need to automate the process as much as possible. Using the DISTINCT keyword in a SELECT statement might do the trick, while other times it requires more complicated validation, ranging from simple checks to data mining techniques.

    I will try to exemplify the techniques I use to deal with duplicates with the help of a self-constructed example, a table having the following structure, which tracks information about Assets:

CREATE TABLE [dbo].[Assets](
[ID] [int] NOT NULL,
[CreationDate] smalldatetime NOT NULL,
[Vendor] [varchar](50) NULL,
[Asset] [varchar](50) NULL,
[Model] [varchar](50) NULL,
[Owner] [varchar](50) NULL,
[Tag] [varchar](50) NULL,
[Quantity] [decimal](13, 2) NULL

    Next step is to provide the test data:
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
VALUES ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
VALUES ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
VALUES ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
VALUES ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

    Now let’s check table’s content:

SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM Assets

Output: All records selection     Normally a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case duplicates’ identification will resume to the records in which the Tag-Vendor combination appears more than once:

SELECT Vendor, Tag
FROM Assets A
GROUP BY Vendor, Tag

Output: Vendor-Product duplications     In many cases showing only the duplicated values would be enough, however there are cases in which is required to see the whole record in which the duplicates appeared, for example to clarify the context in which the issues appear. A simple JOIN with the base table would provide the affected records:

SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM Assets A
SELECT Vendor, Tag
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag

selection of duplicates     There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally is enough to use the DISTINCT keyword to remove duplicated rows,

SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM Assets

selecting distinct records     In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore is needed another approach. First of all we need to identify which one is the most reliable record, in some cases the latest records entry should be the most accurate or closer to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which the record that is selected is, but from experience these cases are few.

    Oracle and SQL Server introduced the dense_rank() analytic function, which “returns the rank of rows within the partition of a result set, without any gaps in the ranking” (MSDN, 2008 a). In our case the partition is determined by Vendor and Tag, following to identify which the logic used for raking. Supposing that we are always interested in the last record entered, the query would look like this:

SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
, dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING
FROM Assets
) A

using dense_rank() function     Things are simple when records’ creation date is available, because a descending order would allow selecting the last record first, the Id being added to the clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number() analytic function, which “returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition” (MSDN, 2008 b). Actually the dense_rank function assigns the same rank for the same occurrence of the values appearing in ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a field unique within a partition, while row_number() will return a sequence no matter of the attributes used in ORDER BY clause. In this case I suppose that both functions should have the same performance, on the other side, when the selection of records within a partition is not important, it’s better to use the row_number() function, which would need to do less processing and thus better performance.

Unfortunately this technique doesn’t work in SQL Server 2000, so a different approach is requested. In most of the cases the unique identifier for a record is a running number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:

SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
SELECT Vendor, Tag, MAX(Id) MaxId
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND A.ID = B.MaxId
SQL Server 2000 technique The same technique can be used to delete the duplicates from a table:
SELECT Vendor, Tag, MAX(Id) MaxId
FROM Assets A
GROUP BY Vendor, Tag
) B
WHERE Assets.Vendor = B.Vendor
AND Assets.Tag = B.Tag
AND Assets.ID <> B.MaxId
    When an Id is not available, then we have most probably to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria we chose for selection, and then apply the technique based on Id as above.
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy, however for versions that doesn’t supports it, a creation date attribute saves the day, when available:
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

record matching extreme values 2. Duplicated records can be created by merging two sets of data with UNION ALL, in such cases a simple UNION would remove the duplicates introduced by the queries coming after the first one however this will burden the database with more checks. It is recommended to find other solutions, when possible!

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial portable also on SQL Server 2000. Here’s the single multi-row insertion statement:

INSERT INTO Assets VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should work also in Oracle with two amendments, the attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function as below:
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

S.Hull. 1998. Business Objects Glossary. [Online] Available from:http://planning.ucsc.edu/IRPS/dwh/BOBGLOSS.HTM (Accessed: 7 November 2008)
MSDN. 2008 a. SQL Server 2008 Books Online, dense_rank(). [Online] Available from: http://msdn.microsoft.com/en-us/library/ms173825.aspx(Accessed: 7 November 2008)
MSDN. 2008 b. SQL Server 2008 Books Online, dense_rank(). [Online] Available from: http://msdn.microsoft.com/en-us/library/ms186734.aspx (Accessed: 7 November 2008)

Learning about Oracle APPS internals - Part I

How can I learn more about Oracle APPS (Oracle Applications) and from where? a normal question for every beginner but also for experienced professionals.

Oracle made available documentation about their products through Oracle Technology Network and Metalink. The first source contains documents mainly as pdf files, while Metalink provides richer content and it’s easier to use, however in order to access it, your company has to purchase an Oracle Support Identifier.

In Metalink, Oracle Applications’ documentation is grouped under eTRM (Electronic Technical Reference Manuals) section, while the pdf documents can be found under Oracle 11i Documentation Library, and many of them, especially for older versions, can be found also on the web, and revealed with a simple search by using tables' name or file’s name.
Both sources are by far incomplete, there are many gaps, not to forget that many of the Oracle implementations involve also some customization, information about these changes could find maybe in the documentation made during implementation/customization process.

Lately have appeared many blogs on Oracle Applications internals, and even if many of them resume by copying some material from Metalink or other documents, there are also professionals who respect themselves.

People can learn a lot by checking the objects that unveils the APPS internals, APPS.FND_TABLES providing the list of tables used, while APPS.FND_VIEWS provides the list of views, the problem with the later being that can't be done a search using the field that stores views' script, but the data can be exported to a text file and do the search in there (it won’t work to export the data completely to Excel). In time developers arrive to intuit how the views could be named, so a search on their name could help narrowing down the search.

Other professionals might be willing to help, so often it's a good idea to post questions on blogs, forums or social networks for professionals. Not all the questions get answered so rather than waiting for indirect enlightment, it’s better to do some research in parallel too.

There will be cases in which none of the specified sources will help you, most probably you'll have to reengineer Oracle Applications' internals by studying various business scenarios, and in this case the experimented users could help a lot.

Friday, November 07, 2008

Enterprise reporting

    Supposing that your company invested lot of money in an ERP system, moreover around the nice expensive setup it needed to customize many of the existing functionality. So you end up with an ERP system which provides more or less the functionality you needed, though in order to increase its value and make accurate decisions you need some reports out of it. What you do then?

    From my point of view there are 4 types of reporting needs: - OLTP (On-Line Transaction Processing) system providing reports with actual (live) data; - OLAP (On-Line Analytical Processing) reports with drill-down, roll-up, slice and dice or pivoting functionality, working with historical data, the data source(s) being refreshed periodically;
- ad-hoc reports – reports provided on request, often satisfying one time reports or reports with sporadic needs;
- Data Mining tool(s) focusing on knowledge discovery;
    Actually I would have put a 5th requirement related to direct data access and analysis.

    ERP systems like Oracle Applications or SAP come by default with a set of (predefined) standard reports, which in theory they satisfy the prerequisites of an OLTP system. Unfortunately the standard reports satisfy only basic needs and are not as flexible as expected – it could happen that they can be exported only to text, moreover they being in a non-tabular format, and therefore impossible to reuse for detailed analysis. Another problem could be caused by inadequate filtering constraints, behavior or scope. If existing functionality has been customized, most probably existing reports need to be adapted to the new logic. In the end you realize that you either change the existing reports or adopt an OLTP solution.
    The bad news don’t stop here, as the vendors try to keep the secrecy about their solutions, the information about ERP’s internals is limited, good developers hard to find or really expensive, and often they needing to reinvent the wheel. It’s true that Oracle provided through Metalink a nice set of documentation about Oracle Application’s internals, and even if they updated currently web site’s look and feel, there are still having many gaps concerning tables’ structure and functionality. Fortunately armed with enough patience and some knowledge about existing business processes and databases, a developer can reengineer an important part of the logic. Other good news is that the increasing number of posting focusing on ERP solution, however few are the source that bring something new.

    OLAP solutions presume the existence of a data warehouse that reflects the business model, and when intelligently built it can satisfy an important percentage from the business intelligence requirements. Building a data warehouse or a set of data marts it’s an expensive and time consuming endeavor and rarely arrives to satisfy everybody’s needs. There are also vendors that provide already made models, and at a first view they look like an important deal, however in time excepting the fact that vendor detains the ownership of the model, I bet that such models have many gaps. Again you end up by customizing and extending the model, running in all kind of issues involving model’s design, flexibility, quality, resources and costs.     There are many ways of making things go wrong, one of the interesting situations being when an OLAP system is used to satisfy OLTP reporting needs. It’s like using a city car in a country cross race – you might make it to compete or even end the race, if you are lucky enough, but don’t expect to make a success out of it!

    The need for ad-hoc reports will be there no matter how complete and flexible are your existing reports. There are always new requirements that must be fulfilled in utile time and not rely on the long cycle time needed for an OLTP/OLAP report. Actually many of the reports start as ad-hoc reports and once their scope and logic stabilized they are moved to the reporting solution. Talking about new reports requirements, it worth to mention that many of the users don’t know exactly what they want, what is possible to get and what information it makes sense to show and at what level of detail in order to have a report that reflects the reality. In theory is needed a person who facilitate the communication between users and development team, especially when the work is outsourced. Such a person should have in theory a deep understanding of the business, of the ERP system and reporting possibilities, deeper the knowledge, shorter the delivery cycle time. Maybe such a person could be dispensable if the users and development have the required skill set and knowledge to define and interpret clearly the requirements, however I doubt that’s achievable on large scale. On the other side such attributions could be taken by the IM or functional leaders that support the ERP system, it might work, at least in theory.

    Data Mining tools and models are supposed to leverage the value of an ERP system beyond the functionality provided by analytic reports by helping to find hidden patterns and trends in data, to elaborate predictions and estimates. Here I resume only saying that DM makes sense only when the business reached a certain maturity, and I’m considering here mainly the costs/value ratio (the expected benefits needing to be greater than the costs) and effort required from business side in pursuing such a project.

    There are situations in which the functionality provided by reporting tools doesn’t fulfill users’ requirements, one of such situations being when users need to analyze data by themselves, to link data from different sources, especially Excel sheets. It’s true that vendors tried to address such requirements, though I don’t think they are mature enough, easy to use or allow users to go beyond their skills and knowledge.
    Most of such scenarios resume in accessing various sources over ODBC or directly using Excel or MS Access, such solutions being adequate more for personal use. The negative side is that people arrive to misuse them, often ending up by having a multitude of such solution which maybe would make sense to have implemented as a report.
    There are managers who believe that such tools would allow eliminating the need for ad-hoc reports, it might be possible in isolated cases though don’t expect from users to be a Bill Inmon or Bill Gates!

    All the tools have their limitations, no matter how complex they are, and I believe that not always a single tool addresses all requirements. Each of such tools need a support team and even a center of excellence, so assure yourself that you have the resources, knowledge and infrastructure to support them!

Sunday, August 31, 2008

ROWCOUNT in action Especially when working with big tables, the default behaviour of Query Analyzer is to not show the output until the last record has been fetched. This can be time and resource consuming and therefore I’ve appreciated the fact that TOAD and SQL Developer are fetching only a certain number of records. Now I can see that same can be done starting with SQL Server 2005 onward by modifying ROWCOUNT server property using Query/Query Options menu functionality. Query Options under SQL Server 2008 Query Options under SQL Server 2008 or by running the command: SET ROWCOUNT <number of records>; Of course somebody may limit the number of records returned by a query using TOP function when working with SQL Server and ROWNUM in Oracle, though I find it not always handy – it depends from case to case. There are also technical implications between the two types of usage, according SQL Server Books online it is recommended to TOP with SELECT over ROWCOUNT with regard to scope and query optimization, however in this context only the later makes sense:
"As a part a SELECT statement, the query optimizer can use the value of expression in the TOP clause as part of generating an execution plan for a query. Because SET ROWCOUNT is used outside a statement that executes a query, its value cannot be used to generate a query plan for a query."
Notes: 1. Do not mix the ROWNUM with @@ROWNUM function which returns the number of rows affected by the last statement. 2. Some of us list all the records in order to see the number of records returned by a query, though that’s totally not advisable!

Document everything!

    One of the lessons I learned on the hard way is that I have to document everything: meeting minutes, ideas, location of artefacts, code, steps done, and even error messages received when installing or using different pieces of software. Of course “everything” has the connotation of “everything important”, though you never know what’s important or not in the near future. Further, if I can put together the ideas in a list of best practices, then I can bring more value to my notes and maybe could help other people too.

    It’s also important how you document everything, though I believed that a question of habitude and preference. I’m still exploring this using flat documents and mind maps or cognitive maps when possible.

    Talking about best practices, it is interesting to see which are other people or companies’ best practices and, in the end, they reflect their level of knowledge reached. I actually target to check first the best practices of a technology or language when learning something new. It can be helpful and a fast way to learn the most appropriate techniques and avoid possible mistakes.
AdventureWorks requires FILESTREAM enabled Surprises, surprises, surprises, programmers’ world is full of them! When you say that everything is ok, you just discover that something went wrong. I was expecting to have Adventure Works database installed though I haven’t checked that. I realized today that it’s missing, so I tried to reinstall it enabling this time the “Restore AdventureWorks DBs” feature, though I got another nice error: Setup failed for MSSQLSERVER. The following features are missing: FILESTREAM Fix the problems and re-run setup. Guy Burstein, in his blog, wrote that the STREAM support can be enabled using the following SQL command: exec [dbo.sp_filestream_configure] @enable_level = 3; I tried that and another error came in: Msg 2812, Level 16, State 62, Line 1 Could not find stored procedure 'sp_filestream_configure' Checking my local installation of SQL Server Books Online, I found no track of sp_filestream_configure stored procedure, but I found that I can enable the STREAM support using sp_configure stored procedure as below: EXEC sp_configure filestream_access_level, 2 RECONFIGURE GO Once I executed the 3 lines together, I got the following confirmation message which, amusingly, still recommands me to run the RECONFIGURE statement even if I did that. Anyway better more redundant information than nothing… Configuration option 'filestream access level' changed from 2 to 2. Run the RECONFIGURE statement to install.

Saturday, August 30, 2008

No records returned by queries No records returned by a query even if there should be results? Usually I’m using the following checklist: 1. check if the tables contain data. Silly but effective, especially in Oracle APPS in which some tables got deprecated and were replaced by tables with similar names (PA_PROJECTS_ALL vs. PA_PROJECTS), though that could happen in other environments too; 2. check if the JOIN syntax is correct; 3. check if one of the columns use in JOIN has only NULL values; 4. check if the constraints used in WHERE clause causes makes sense (e.g. wrong values or syntax); 5. for Oracle flavored queries, check if in WHERE clause there is a column not referenced with the table name or alias, and the column is available in more than one table used in the query. This Oracle bug is really dangerous when doing fast query checks! 6. for Oracle (APPS), check whether the query or view uses USERENV function with LANG or LANGUAGE text parameter, normally a constraint like: TABLE1.LANGUAGE = USERENV(‘LANG’). The problem with such queries comes when user’s system language is other than the one expected, and thus query’s output might not be as expected. Usually it is preferable to hardcode the value, when possible: TABLE1.LANGUAGE = ‘US’ Note: Actually, also the tools you are using to run a query could create issues, for example a query run under Oracle’s SQL Developer was not returning records even if in TOAD did that. The problem was solved with the installation of a newer SQL Developer version.
Oracle ANSI 92 JOIN syntax error Lately I’m working a lot with Oracle APPS, doing mainly ad-hoc reporting. One of my nightmares is an Oracle bug related to ANSI 92 syntax: “ORA-01445: cannot select ROWID from, or sample, a join without a key-preserved table” Unfortunately, even if the bug was solved by Oracle, it seems the update was missed on some servers and the bug haunts my queries almost on a daily basis. Having an SQL Server background and, for code clearness, I prefer ANSI 92 JOIN syntax: SELECT A.column1, B.column2 FROM table1 A JOIN table2 B ON A.column1 = B.column2 instead of using the old fashioned writing: SELECT A.column1, B.column2 FROM table1 A , table2 B WHERE A.column1 = B.column2 In theory the two queries should provide the same output and have, hopefully, similar performance, but that’s not in scope of this post. The problem with ANSI 92 syntax is that, on some Oracle installations, when the number of joins exceeds a certain limit, usually greater than 7, I get the above error. I can avoid that by using a subquery or a view which would allow me to split the logic and limit the number of direct joins to a table. That can be helpful sometimes, but most of the times it’s cumbersome because it requires more development time and awkward design solutions.
Adventure Works installation error on Vista I tried to install the Adventure Works OLTP & DW on SQL Server 2008 RTM from CodePlex though I got an error: “The installer has encountered an unexpected error installing this package. This may indicate a problem with this package. The error code is 2738.” Initially I thought that the problem was caused by existing Adventure Works installations made on a previous CTP version of SQL Server 2008, forgetting to uninstall them when I uninstalled the CTP version. Totally wrong! Doing a little research, I found first a post on CodePlex Discussion forum mentioning that the problem could be caused by VBScript runtime because, as Toms’Tricks blog highlights, VBScript and Jscript are not registered on Windows Vista. Wonderful! I just run regsvr32 vbscript.dll command and it worked! Another situation in which regsvr32 saved the day! I wonder why I haven’t got the same problem when I previously installed Adventure Works database on CTP version! Could it be because of Windows Vista SP1 changes (I installed Windows Vista SP1 after SQL Server CTP)?
Microsoft SQL Server 2008 installation error This week I tried to install SQL Server 2008 however I got the following error: “A previous release of Microsoft Visual Studio 2008 is installed on this computer. Upgrade Microsoft Visual Studio 2008 to the SP1 before installing SQL Server 2008.” It’s true that I have previously installed Visual Studio 2008, though once I did that I checked if there are any updates and thus I installed SP1 too. I did a quick search on Google and the first results pointed me to an article o Microsoft Help and Support website: Visual Studio 2008 SP1 may be required for SQL Server 2008 installations. It didn’t make sense; in the end I’ve installed the server but enabled the installation of the following components: • Management Tools (Basic or Complete) • Integration Services • Business Intelligence Development Studio The server was installed without problems, so I tried to install the remaining components getting the same error as above. I had to stop at that point and today, giving more thought to the problem, I realized that the error could be caused by Microsoft Visual Studio 2008 Express edition, which I managed to install a few months back. Instead of uninstalling Microsoft Visual Studio 2008 it looked easier to uninstall the Express version, and once I did that, I managed to install the remaining components. Actually I checked before if there is a SP1 for Microsoft Visual Studio 2008 Express, I arrived at Microsoft Visual Studio 2008 Express Editions with SP1 page, though I remembered that I have to install the Web Developer, Visual Basic and C# 2008 separately and in the end I presumed that maybe it would be easier to uninstall the existing versions and try then to install SQL Server remaining components. I haven’t tried to install the Express editions with SP1 as now I have the Professional edition.

Sunday, January 13, 2008

Advices on SQL logic split in Web Applications

Yesterday I remembered about my first two small fights on SQL related topics, though in both situations I had to renounce temporarily to my opinions for the sake of armistice. It’s understandable, we can all make mistakes, unfortunately what we know hurts us more then what we don’t know. It’s amusing when I’m thinking about the two issues in discussion at those times, though then it was a little disturbing. What was about?! Actually both issues are related to web applications, my first “professional encounter” with programming.

Issue 1 – using JOINs or backend vs middle tier processing

JOINs are a powerful feature of SQL in combing related data from multiple tables in only one query, this coming with a little (or more) overhead from the database server side.
Web applications make use of lot of data access operations, data being pulled from a database each time a user requests a page, of course that happening when the page needs data from database(s) or execute commands on it, the CRUD (Create/Read/Update/Delete) gamma. That can become costly in time, depending on how data access was architected and requirements. The target is to pull smallest chunk of data possible (rule 1), with a minimum of trips to the database (rule 2).

Supposing that we need Employees data from a database for a summary screen with all employees, it could contain First Name, Last Name, Department and Contact information – City, Country, Email Address and Phone Number. Normally the information could be stored in 4 tables – Employees, Departments, Address and Countries, like in the below diagram.

The easiest and best way to pull the Employee needed data is to do a JOIN between tables:

More likely that two or more employees will have the same country or department, resulting in “duplication” of small pieces of information within the whole data set, contradicting rule 1. Can be pulled smaller chunks of data targeting only the content of a table, that meaning that we have to pull first all Employees or the ones matching a set of constraints, then all the departments or the only the ones for which an Employee was returned, and same with Addresses and Countries. In the end will have 4 queries and same number of roundtrips (or more). In the web page the code will have to follow the below steps:

Step 1: Pull the Employee data matching the query:
, E.DepartmentID
, E.FirstName
, E.LastName
, E.EmailAddress
FROM Employees E

Step 2: Build the (distinct) list of Department IDs and a (distrinct) list of Employee IDs.

Step 3: Pull the Department data matching the query:
SELECT D.DepartmentID
, D.Department
FROM Departments D
WHERE DepartmentID IN (<list of Department IDs>)

Step 4: Pull the Address data matching the query:
, A.CountryID
, A.City
, A.Phone
FROM Addresses A
WHERE EmployeeID IN (<list of Employee IDs>)

Step 5: Build the (distinct) list of Country IDs.

Step 6: Pull the Country data matching the query:
, C.Country
FROM Countries C
WHERE CountryID IN (<list of Country IDs>)

And if this doesn’t look like an overhead for you, you have to take into account that for each Employee is needed to search the right Department from the set of data returned in Step 3, and same thing for Addresses and Countries. It’s exactly what the database server does but done on the web server, with no built in capabilities for data matching.
In order to overcome the problems raised by matching, somebody could go and execute for each employee returned in Step 1 a query like the one defined in Step 4, but limited only to the respective Employee, thus resulting an additional number of new roundtrips matching the number of Employees. Quite a monster, isn’t it? Please don’t do something like this!

It’s true that we always need to mitigate between minimum of data and minimum of roundtrips to a web server, though we have to take into account also the overhead created by achieving extremities and balance them in an optimum manner, implementing the logic on the right tier. So, do data matching as much as possible on the database server because it was designed for that, and do, when possible, data enrichment (e.g. formatting) only on the web server.

In theory the easiest way of achieving something it’s the best as long the quality remains the same, so try to avoid writing expensive code that’s hard to write, maintain and debug!

Issue 2: - LEFT vs FULL JOINs

Normally each employee should be linked to a Department, have at least one Address, and the Address should be linked to a Country. That can be enforced at database and application level, though it’s not always the case. There could be Employees that are not assigned to a Department, or without an Address; in such cases then instead of a FULL JOIN you have to consider a LEFT or after case a RIGHT (OUTER) JOIN. So, I’ve rewritten the first query, this time using LEFT JOINs.

You don’t need to use always LEFT JOINs unless the business case requires it, and don’t abuse of them as they come with performance decrease!

Saturday, January 12, 2008

SQL Server and Excel Data

Looking after information about SQL Server 2008, I stumble over Bob Beauchemin’s blog, the first posting I read Another use for SQL Server 2008 row constructors demonstrating a new use of VALUES clause that allows to insert multiple lines in a table by using only one query, or to group a set of values in a table and use them as source for a JOIN. I was waiting for this feature to be available under SQL Server 2005, though better later than never!

The feature is useful when you need to limit the output of a query based on matrix (tabular) values coming from an Excel or text file. For exemplification I’ll use HumanResources.vEmployee view from AdventureWorks database that co/mes with SQL Server 2005, you can download it from Code Plex in case you don’t have it for SQL Server 2008.
Let’s suppose that you have an Excel file with Employees for which you need contact information from a table available on SQL Server. You have the FirstName, MiddeName and LastName, and you need the EmailAddress and Phone. In SQL Server 2008 you can do that by creating a temporary table-like structure on the fly using VALUES clause, and use it then in JOIN or INSERT statements.

Query 1

The heart of the query is the below structure, where B(FirstName, MiddleName, LastName) is the new table, each row in its definition being specified by comma delimited triples of form ('FirstName ', 'MiddleName ', ' LastName'):

The construct it’s time consuming to build manually, especially when the number of lines is considerable big, though you can get the construct in Excel with the help of an easy formula.

The formula from column D is = ", ('" & A2 & "','" & B2 & "','" & C2 & "')" and it can be applied to the other lines too. You just need to copy now the data from Column D to SQL Server and use them in Query with a few small changes. Of course, you can create also a custom function (macro) in Excel to obtain the whole structure is a singe cell.

You can do something alike under older versions of SQL Server (or other databases) using a simple trick – concatenating the values from each column by row by using a delimiter like “/”, “~”, “|” or any other delimiter, though you have to be sure that the delimiter isn’t found in your data sources (Excel and table). Using “/” the formula is = ", '" & A2 & "/" & B2 & "/" & C2 & "'".

Then you have to use the same trick and concatenate the columns from the table, the query becoming:

Query 2

This technique involves small difficulties when:
• The data used for searching have other type than string derived data types, however that can be overcome by casting the values to string before concatenation.
• The string values contain spaces at extremities, so it’s better to trim the values using LTrim and RTrim functions.
• The values from the two sources are slightly different, for example diacritics vs. Latin standard characters equivalents, for this being necessary a transformation of the values to the same format.