SQL Troubles: April 2010

24 April 2010

Software Engineering: A Problem Solving Approach to Troubleshooting I

Software Engineering Series

Introduction

In several occasions I observed that there are SQL and non-SQL developers who don’t know how to troubleshoot a programming problem in general, respectively a SQL-related issue in particular, and I’m not referring here to the complex problems that typically require the expertise of a specialist, but simple day to day situations: troubleshooting an error thrown by the database engine, an error in the logic, a performance issue, unavailability of resources, etc. I’m not necessarily talking here about the people posting questions on forums, professional networks or blogs, even if in many situations they could have found an answer to the problem by doing a little research, but seeing developers actually at work. It’s true that there are also many cases in which the software throws an error message that you don’t know from where to start or that the error is pointed as appearing at other line than at the line where actually the error occurs, leading the developer to a false thread.

Before going into detail let’s take a short look at troubleshooting and what it means! Paraphrasing Wikipedia’s general definition for troubleshooting, troubleshooting in IT is a type of problem solving applied to software and infrastructure related issues. Software issues refer not only to the various types of errors thrown by software applications, but also to functional, rendering or configuration errors, performance issues, data quality issues, etc. Infrastructure related issues could refer to the IT infrastructure – network, information systems, processes, methods or methodologies used. In this post I will refer only to the software issues even if the technique(s) for troubleshooting this kind of issues could be applied also to infrastructure issues.

Polya’s Approach to Problem Solving

In his book 'How To Solve It', G. Polya, a well known Hungarian mathematician, advances a 4 step natural approach in solving a problem: 1. understanding the problem, 2. devising a plan, 3. carrying out the plan, and 4. looking back [1]. G. Polya’s approach could be used for all types of problems, including IT problems, and even if I find this approach too high level for solving this type of problems, it’s actually a cornerstone on which more detailed approaches could be used. Let’s look shortly at each of Polya’s four steps!

1. Understanding the problem

Understanding the problem resumes in identifying what is known, the data, the actual facts, and what is not known, what causes the issue and how it will be solved. Somebody was saying that a problem well understood is half solved, and there are quite good chances to arrive to the wrong solution if the problem is not well understood. If in Mathematics the problem is defined beforehand together with the whole range of constraints, in IT for example, when troubleshooting the problem needs to be defined, in the context of this post the problem revolving around a technical or business issue appearing in the form of an error message, un unexpected/wrong application behavior, wrong process, etc. Thus the actual facts could resume to the error message, current vs. expected behavior, tools used, high/low level design, business logic, affected objects, specific constraints, etc.

Defining the issue/problem might not be as simple as it seems, especially when the issue is pointed by other people in informal non-technical terminology, fuzzy formulations like “there is an error in the XYZ screen” without actually detailing what the issue is about, the steps followed and the input that resulted in the respective issue, and other such aspects that need to be addressed in order to understand the problem. All these aspects are not known by the developer though with a little investigation they are transformed in known information, this involving communication with the users, looking in documentation, and gathering any other facts. Actually we could group all this actions under “gathering the facts” syntagm, and this type of operations could be considered as part of this step because they are intrinsic in what concerns problem understanding.

2. Devising a plan

In this step is attempted to find the connection between the data and the unknown, looking at the problem from different angles in order to obtain an idea of the solution, to make a plan [1]. We have a plan when we know which steps we have to follow in order to identify the issue (solve the problem), they don’t have to be too detailed, but addressable, not necessarily complete but as a base that could be evolved with time, for example when new information/results are found. It could be multiple directions to look into, for example based on possible list of causes, constraints the various features comes with, different features for implementing the same thing, etc.

Naturally the first question a developer should ask: have I seen this issue before in actual or slightly modified form? Could be the problem broken down to smaller (known) problems? Could be derived anything useful from the data, have been considered all essential notions involved in the problem [1]? Essential notions, that’s always a thing to look into, mainly because I would say that many issues derive from feature constraints or from misuse of features. There could be used tools brainstorming, check lists, root-cause analysis, conceptual mapping, in fact any tool which could help us to track the essential notions and the relations between them.

3. Carrying out the plan

Once the plan sketched, we could go on and approach each of the branches of the plan, performing the successive steps in one branch until we find an end-point (a point in which we can’t go further). There could be branches going nowhere, multiple solutions, or no apparent solution to the problem. Everything is possible… More likely while advancing in carrying out the plan, we could discover other intermediary steps, other branches (alternatives of arriving to the same result or to approach different constraints).

4. Looking back

According to Polya, this step resumes to examining the solution [1], reviewing the argumentation used, solution’s construction, on whether the solution is optimal, on whether it could be reused to solve other types of problems or whether it could be improved/refactored. Actually this is a step many developers completely ignore, they found a solution, it’s working so their work is done! No, even when pressed by time should be considered also these aspects of problem solving, and from my point of view this step includes also steps like documenting the issue, and in special cases communicating the solution found to the circle of professionals (e.g. in terms of best practices or lessons learned, why not a blog post, etc.). Topics like optimality and refactoring and are quite complex and deserve a post of their own, therefore I will resume myself to mention only the fact that they typically consider the solution from the point of view of performance, complexity, (re)usability and design, the developer having to trade between them and other similar (quality) dimensions.

Beyond Polya’s Approach

A natural question: do we really have to follow this approach?! Come on, there will be cases when you’ll have the solution without actually attempting to define the problem (explicitly) or devise a plan (explicitly), or only by listing the scope and the constraints! Unconsciously we are actually following the first three steps, but forget or complete ignore the fourth, and I feel that applying Polya’s approach brings some “conscious thought” in this process that could help us make the most of it.

In many cases the solution will be there in documentation, giving developers some explicit or implicit hints about the areas in which to search, for example in case of an error related to a query a first input is the error message returned by the database engine. Fortunately RDBMS vendors like Microsoft and Oracle provide also a longer description for each error, allowing thus to understand what the error message is about. This is the happiest case, there are also many software tools that after they run half of hour, they return a fuzzy error message (e.g. ‘an error occurred!’), and nothing more.

Thank God for the internet, a dynamic knowledge repository, in which lot of knowledge could be find with just a simple click, but also sensitive to the input. In many cases I could found one or more solutions or hints for an error I had, usually just by copy pasting the error number of the error description, or when the description is too long, only the most important part. I observed, that there is an quite important number of professionals that prefer to post their issue in a forum or professional group instead of doing some research by themselves, this lack of effort helping to increase the volume of redundant information on the web, this coming with negative but also positive implications. When we perform such a search on internet we actually rely on the solution provided by other users, shortcutting the troubleshooting process, and with the risk of repeating the same syntagm, it comes with negative but also positive implications, for example a negative aspect is that people don’t learn how to troubleshoot by themselves relying instead on ready-available solution, while a positive aspect is that less time is spent within troubleshooting process, at least in theory. Actually, considering the mentioned positive aspect, that’s also why I consider as important the “looking back” step, and I’m referring especially at documenting the issue action.

References:
[1] G. Polya (1973) How To Solve It: A New Aspect of Mathematical Method, 2nd Ed.Stanford University. ISBN: 0-691-08097-6.

21 April 2010

Software Engineering: To get or not Certified?!

Software Engineering Series

To get or not certified?! That’s a question I asked myself several times along the years, and frankly it doesn’t have an easy answer because there are many aspects that need to be considered: previous education, targeted certification, availability of time, financial resources or learning material, required software, hand-on experience, certification’s costs, duration/frequency, objectives, value (on the market) or requirements, contexts, etc.

In many occasions when I had most of the conditions met then I didn’t had the time to do it, or I waited to appear the requirements for the new set of certifications, referring mainly to SQL Server 2005 and 2008 versions, or I preferred to continue my “academic” studies, so here I am after almost 10 years of experience in the world of SQL without any certification, but, I would say, with a rich experience covering mainly full-life cycle development of applications, reporting, data quality and data integration, ETL, etc.

Enough with the talking about myself and get to the subject. I’ve seen recently this topic appearing again in 1-2 professional groups, so I’ll try to approach this topic from a general point of view because most of the characteristics could apply also to database-related certifications like Microsoft MCITP (Microsoft Certified IT Professional) or MCTS (Microsoft Certified Technology Specialist) for SQL Server.

Naturally, in what concerns the certification, the opinions between professionals are split, an often met argument against it is the believe that a certification is just a piece of paper having a limited value without being backed-up by adequate hand-on experience, while the pro-argumentation is that some companies, employers and customers altogether, are valuing a certification, considering it as a personal achievement reflecting not only owners’ commitment to approach and take a certification exam, but also a basic level of knowledge. Both views are entirely correct from their perspective, weighting differently from person to person, community to community or from one domain of expertise to another, and they have positive and negative aspects, many subjective aspects as they are related to people’s perception.

From a global perspective an IT “certification fulfills a great need by providing standardized exams for the most current and important technologies” [3], allowing judging people’s knowledge on the topics encompassed by it[1], being thus a way to quantify knowledge especially related to general tasks. Certifications offer a path to guide the study of a domain [2], are developed around agreed-upon job tasks [3] and consider a basic knowledge base made of vocabulary, definitions, models, standards, methods, methodologies, guidelines or best practices.

Given the fact that a certification covers most of the topics from a given domain, in theory it provides a wide but of superficial depth coverage of the respective domain, in contrast with the hand-on experience, the professional experience accumulated by solving day-to-day tasks, which provides a narrower (task-based) but deeper coverage of the respective domain. Therefore, from my point of view the two are not necessarily complementary but could offer together a wide and deep coverage of the domain, a certification needs somehow to be based on a certain number of years of hand-on experience in order to get more value out of it.

On the other side, the variety in hand-on experience could offer wider coverage of the domain, though I suppose that could be accomplished fully by had-on experience but in a longer unit of time. These suppositions are fully theoretical because there are many other parameters that need to be considered, for example a person’s capacity of learning by doing vs. theoretical learning (this involves also understanding of concepts), the learning curve and particularities of the technologies, methods or methodologies involved, the forms of training used, etc.

A certification is not meaningless, as several professionals advance (e.g. J. Shore, T. Graves & others), even when considered from employers’ perspective, and the fact that it doesn’t count for some employers or professionals, that’s another story. A certification could be considered eventually useless, though also that’s not fully true. Maybe the certification itself is useless for a third party, though it’s not from the point of view of the learning process, as long the the knowledge accumulated is further used and the certification is not an end in itself.

A certification is not or it shouldn’t be an end in itself, it should be a continuous learning process in which knowledge is perpetually discovered, integrated and reused. Most probably in order to keep the learning process continuous several certifications, including MCITP, require to be recertified after a number of years.

There are professional certifications that require provable experience in the respective domain before actually being accepted for a certification, it’s the example of PMP (Project Management Professional) and CAPM (Certified Associate in Project Management) certifications from PMI (Project Management Institute) that require a considerable amount of hours of non-overlapping direct or indirect PM experience, and the example is not singular, if I’m not mistaking also the CISP (Certified Information System Security Professional) certification requires a certain number of years of experience. This type of requirement allows in theory to make most of the learning process being facilitated the integration of knowledge with experience.

How useful is a certification for the certified person?! It depends also how much a certification succeeds in covering the knowledge, skills and abilities required by an actual job, how much of the knowledge acquired will be later used. There are people who focus only on taking the exam, nothing more, though I would say that might come with other downsides on the long term. There are even organizations that encourage and even sponsor their employees’ certification either by providing training material, courses, partial or full-expenses, such initiatives being often part of their strategic effort of creating value and a knowledge-based environment, the professional certification being also a form of recognition, being valued in what concerns employees performance, eventually associated also with a form of remuneration.

I think that a certification could be beneficial for a person with relatively small or no professional experience in a certain domain, the certification bridging to a small degree the gap to hand-on experience. It could be interesting to study whether the on-hand experience could be compensated to some degree by attempting to (re)use the learned concepts in self-driven applications or several examples.

When learning something new I found it useful to try writing a tutorial or a blog post using a well-defined example, though this won’t replace entirely the on-hand experience, the difference between the two being the limited vs. the global scope of handling tasks, in dealing with real-life situations. Most probably it could be also useful to learn about the use of a technique/technology in several contexts, though this equates with lot of research and effort spent in that direction. Does it worth to do that?!

A certification is an opportunity to enter in a “select” circle of professionals, though now it depends also how each vendor or group of certificates takes advantage of this “asset” and what other benefits are derived out of it. For example by publishing domain related content certificates could be kept up-to-date with new features, trends, best practices, etc., the professional network thus created could benefit of the potential such networks offer especially when considering problem solving, the creation, propagation and mapping of knowledge, etc. Of course, such networks could have also side effects, for example the creation of exclusivist networks. I would say that the potential of professional networks is still theoretic, but with the evolution of the Web new possibilities will emerge.

A person taking such a certification arrives in theory to cover most of the important topics related to a given domain, however this doesn’t guarantee that the person is actually capable of applying (successfully) the concepts and techniques in real life scenarios, the many “brain dumps” and other easy ways of taking a certification decreasing certification’s credibility and value. There are domains over-flooded by people with certifications but not having the skills to approach a real project, a company that gives too much credit to a certification could end up stuck with resources that can’t be used, this aspect impacting negatively other professionals too.

I’m coming back to the idea that a certification is subject of people’s perception and I have to say that the most important opinion in this direction is not necessarily professionals’ opinion activating in the respective domain, but of the people from HR, PM and partially headhunters, because they are the ones who are making the selection, deciding who’s hired and who’s not. Considering that there are few professionals from HR and PM that are coming from the IT domain, there are lot of false and true presumptions when evaluating such candidates, people arriving to come with their own methods of filtering the candidates, and even if such methods are efficient from the result perspective, many good professional could feel kind of “discriminated”.

In theory it’s easier to identify a person who has a certification than to navigate through the huge collection of related projects and tasks, or to search in a collection of CVs for the various possible combinations or significant terms related to a job description. Somebody (sorry, I don’t remember who) was saying that a manager spends on average 20-30 seconds for each CV, now it depends also how eye-catching is a certification in a simple CV scanning.

From a semantic point of view I would say that a certification is richer in meaning than any type of written experience, though now it depends also on reviewer’s knowledge about the respective certification. Sure is that when choosing between two professionals with similar experience there are high chances for the one having a certification to be hired. In addition, considering that there are hundreds of applicants for the good jobs on the market, I would say that a certification could allow a candidate, between many other criteria, to distinguish himself from the crowd.

Given the explosion of technologies from IT, domain’s dynamics, segmentation and other intrinsic characteristics , the IT certifications are more specialized, more segmented and less standardized, making difficult their evaluation, especially when domains intersect each other or when the vendors emitting the certifications are competing against each other. Compared with other domains, an IT professional needs to be always up-to-date, cover multiple related domains in order to do his work efficiently, for example in order to provide a full-life cycle solution a developer would have to be kind of expert in Software Engineering, UI and database programming, security, testing, etc. The high segmentation in IT could be seen also in the denominations for the various roles, lot of confusion deriving from this, especially when matching the job descriptions with the roles.

Must be considered also the bottom line: in IT as also in other domains, the knowledge and experience is relative because it depends also on person’s skills and ability of assimilating, using, reusing (creatively) the knowledge given in a domain; a person could in theory accumulate in one year same experience as others in 2 or more years, same as a person who got certified could in theory handle day-to-day tasks without any difficulty, same as in theory a student with no professional experience could handle programming tasks like a professional with several years of experience.

At least in my country, there are many domains in University that provide also IT-related curricula within non-IT domains (e.g. Mathematics, Economics, Engineering), a number of programming courses being thought also in high school or even lower grades, the theory learned and the small projects facilitating theoretically the certification for a programming language (e.g. C#, Java or C++) or of directly handing day-to-day tasks. It’s true that in school is insisted more on the syntax, basic features and algorithmic nature of programming, but this doesn’t diminish the value of this type of learning when done adequately. Such educational experience is not considered as professional experience at all, even if it provides a considerable advantage when approaching a certification or a job.

It must be highlighted that taking a certification comes with no guarantees for getting a job or being successful in your carrier/profession. You have to ask yourself honestly what you want to achieve with a certification, how you’ll use the learning process in order to get most of it. You actually have to enjoy the road to the final destination rather than dreaming about the potential success brought by such a certification. It could take actually more time until you’ll recover your investment or you’ll see that the actual invested time worth, and, as always some risks need to be assumed. Consider the positive and negative aspects altogether and decide by yourself if it makes sense to go for a certification.

There is actually a third choice – continuing the academic studies, for example pursuing a bachelor, masters or why not, a doctoral degree. The approach of a such a degree imposes similar questions as in the case of a certification, though academic degrees are in theory better accepted by the society even if they come with no guarantees too, require more effort and financial resources.

References:
[1] K. Forsberg, H. Mooz, H. Cotterman. (2005). Visualizing Project Management: Models and Frameworks for Mastering Complex Systems. John Wiley & Sons. ISBN: 0-978-0-471-64848-2.
[2] D. Gibson (2008). MCITP SQL Server 2005 Database Developer All-In-One Exam Guide. McGraw-Hill. ISBN: 978-0071546690.
[3] L.A. Snyder, D.E. Rupp. G.C. Thornton (2006). Personnel Selection of Information Technology Workers: The People, The Jobs, and Issues for Human Resources Management. Research in Personnel and Human Resources Management, Vol. 25, Martocchio J.J. (Ed.). JAI Press. ISBN: 978-0762313273.

17 April 2010

Software Engineering: Learning SQL

Software Engineering Series

Introduction

One of the topics that appears in many database related groups or forums is from where and how new comers should start to learn SQL (Structured Query Language) in general, and specific vendor SQL features in particular. Two years back, in Learning something new post from my programming-grasps blog attempt, I wrote down my thoughts on how to approach learning new subjects in general, many of the respective observations could be applied to SQL learning too, and in addition could be added also domain-specific recommendations.

I would say that when discussing a topic like learning it makes sense to approach the subject from users level of knowledge and skill-set, different recommendations could apply for beginners, intermediate and expert level. No, also when you are an expert the learning process doesn’t stops there as SQL and RDBMS (Relational Database Management System) evolve with time, new features and techniques being added, same as new contexts of applicability could emerge (e.g. very large databases, geographic information systems, Web-related SQL, etc.).

The Beginner Level

Approaching a topic from beginner’s perspective is never an easy task given the fact that many new comers don’t know from where to start, what to learn first, what RDBMS to target. I was and I’m still at that point as new subjects appear on my learning schedule. One important decision is what RDBMS solution to choose, and for this you might need to consider several aspects: market share, goals/objectives, functional area (e.g. database administration, database development, data warehousing, business intelligence, ETL, data quality), personal preferences, salaries and jobs postings, UI, features, tools, learning curve, performance, programmability, extensibility, integration and compatibilities with other technologies, availability of documentation, books, tutorials and other learning material, etc.

I hope I’m not annoying anybody when I’m saying that Oracle, Microsoft and IBM are the leaders in RDBMS and related data technologies, this being also the ranking in what concerns the market share and job postings (see for example B. Nicolich’s SQL Server Job Trends article). If you aren’t having any preferences I would recommend you to focus on at least one of the RDBMS of the mentioned vendors: Oracle, SQL Server and MySQL. Unfortunately, given researches’ subjectivity, I can’t delimitate which of the three has better performance or is better to use, being important the context in which the studies are made, the overall architecture and their particularities, etc.

Having a Microsoft technologies background I like SQL Server’s easy to use UI and integrated solution approach it provides, on the other hand Oracle seems to provide more SQL-related features (e.g. functions), while MySQL is still trying to catch the other two vendors and become more mature. All three vendors implement SQL ANSI 92 standard, so learning SQL on one of the platforms would allow you to write queries also on the other platforms, the difference residing in the functions used, DML (Data Manipulation Language) or DDL (Data Definition Language) syntax and specific functionality, architecture and tools.

For SQL Server and Oracle you could download the trial versions, or opt for the Express versions that could be used free but with certain constraints, while for MySQL you could download one of the free community editions, quite stable solutions. If SQL Server comes with a nice Management Studio that could be used to write queries and administrate your databases, for Oracle and MySQL you might have to consider tools like Toad for OracleOracle or SQL Developer, respectively Toad for MySQL, that allow to be more productive on the respective platforms. If you are intending to use a RDBMS more for personal use (e.g. data analysis), what I call a personal database, then MS Access and MS Excel could prove to be enough, especially when is needed to develop personal or small scale solutions in an organization with a limited knowledge on SQL, though a more evolved RDBMS (actually MS Excel is not a RDBMS but it could be used as a data source or data analysis tool) is recommended when providing solution for multiple users.

All three vendors provide rich documentation, with a plus for SQL Server and Oracle, through the online MSDN or Technet, respectively Oracle Database Documentation Library, and MySQL Reference Manual, the knowledge provided by the documentation being complemented by the many published books, magazines, tutorials, blogs, webcasts, podcasts, studies, best practices and other type of technical literature, many of them easily retrievable with just a simple search using your favorite search engine, important being to know what you are looking for. You have to consider that the technical documentation focuses more on the features, syntax and technical details, while for specific purposes like development, architecture integration or problem solving you might have to consider the other sources of knowledge with a plus for books, tutorials and blogs, the later two being actually more up-to-date than the books, given their shorter “time to market” and informal character.

If your interest in SQL and databases derives from pure educational purposes then you could also consider the free online course materials (presentations, video lectures) posted within initiatives like the ones of YouTube EDU or MIT Open CourseWare, many other lecture notes being available online from various well-known universities. Other important source of knowledge that gains field in the past years is the one of social and professional networks centered around the important technologies, so joining such a network could bring some benefit but you don’t expect miracles!

In what concerns the books maybe it makes sense to list at least my preferences given the huge volume of books on this topic. I’m a fan of “for Dummies” series, therefore I find Allen Taylor’s SQL for Dummies and its more extensive version SQL All-In-One Desk Reference for Dummies or Database Development for Dummies as a good starting point. If you are targeting SQL Server then you might need to check also Microsoft SQL Server 2008 For Dummies, Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies, Microsoft SQL Server 2005 Programming For Dummies or Microsoft SQL Server 2005 Express Edition For Dummies. In exchange the list of choices for Oracle or MySQL is quite smaller, as far I know there are only two books on the market on Oracle - Oracle PL/SQL For Dummies and Oracle 11g For Dummies, while for MySQL the books from “for Dummies” cycle target the use of MySQL in combination with other scripting programming languages: Apache, MySQL, and PHP Web Development All-in-One Desk Reference, PHP & MySQL For Dummies or PHP & MySQL Everyday Apps For Dummies. Another popular series are the “for Beginners” or “Beginning” series, having a more narrow focus, for example Beginning T-SQL with Microsoft SQL Server 2005 and 2008 or Beginning PL/SQL: From Novice to Professional could be two of the books you could start with.

Choosing the right book is highly dependent on what you want to achieve, your learning style and author preferences, books’ availability and costs, in general each professional having his/her own list of preferences. Years back I was quite stressed because many of the examples written in books were not working, nowadays hopefully that changed positively, in addition many publishers making the code available through books’ supporting sites or publishers websites together with a few sample chapters from the respective books. If you don’t want to buy the whole book or you’d like to read it online, you could become member of online libraries like Safari Books Online, Skillsoft’s Books 24x7, eBrary or SpringerLink. Many books are starting to become available for purchasing in PDF directly from publishers or in proprietary format on digital book readers like Amazon’s Kindle or Sony’s Reader, the digital books being in theory with 10-30% cheaper than the paperback products.

The Intermediate Level

The intermediate level is a relative point in the evolution of a professional, in theory this level being reached after several projects or prolonged use of SQL in various scenarios. During this stage many developers are starting to approach topics like performance or best practices, going thus above the basics, or interest themselves in taking a certification, attempting thus to approach the whole range of topics related to databases. Some of the books listed above qualify as well to the intermediate or expert levels, it depends also on books’ level of detail and approach on the given topics. The books’ series closer to this level are the ones for Professionals, for example Pro SQL Server 2008 Relational Database Design and Implementation, Pro T-SQL 2008 Programmer’s Guide, Professional Oracle Programming, Pro MySQL or Professional MySQL.

For those targeting the MCTS certification, it makes sense to consider the Self-Paced Training Kit Exams: (Exam 70-432) Microsoft SQL Server 2008-Implementation and Maintenance, (Exam 70-433) Microsoft SQL Server 2008-Database Development, while for those approaching the MCITP certification then MCITP SQL Server 2005 Database Developer All-in-One Exam Guide is quite useful. As I haven’t approached the Oracle or MySQL certifications, I think it’s more indicated to do some research by yourself.

The Expert Level

At this level developers are starting to go deeper into the database internals, learning how things are handled in the background, differentiate between multiple solutions and choose the optimum solution. Books like Microsoft SQL Server 2008 Internals, SQL Server 2008 Query Performance Tuning Distilled or Professional SQL Server 2008 Internals and Troubleshooting shouldn’t miss from a developer’s library! I would expect there are similar books also for Oracle and MySQL, but as I’m not aware of them so I will stop here.

At this level a developer could also approach the database topics from a higher level, making an incursion in the mathematical/conceptual theory of relational databases, books like Information Modeling and Relational Databases or Applied Mathematics for Database Professionals could become really handy in understanding such topics.

Note:

Most probably there are even better books out there though, as I highlighted above, these are my recommendations based on the books that have fallen in my hand along the years. In many cases, instead of buying a book, I’m looking on whether I could find on the web the information I need, though the quality of the material varies, sometimes same as the quality of the books. There are even cases when the information are contradictory, however in most of the cases I would go with the experts that have the information from the source.

12 April 2010

Data Warehousing: Pivot/Unpivot (Definitions)

"To rotate rows to columns, and columns to rows, in a cross-tabular data browser. Also refers to choosing dimensions from the set of available dimensions in a multidimensional data structure for display in the rows and columns of a cross-tabular structure." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of rotating the view of data. For example, viewing what was the x-axis in the y-axis’s position and vice versa." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The UNPIVOT operator is used within a SELECT statement to create a normalized data report from data that is stored as a spreadsheet." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The PIVOT operator is used within a SELECT statement. It is used to create cross-tab reports (similar to a spreadsheet) from normalized data." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The act of rotating rows to columns, and columns to rows." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

[unpivot:] "To expand values from multiple columns in a single record into multiple records with the same values in a single column." (Microsoft, "SQL Server 2012 Glossary", 2012)

"To rotate a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output, and perform aggregations where they are required on any remaining column values that are wanted in the final output." (Microsoft, "SQL Server 2012 Glossary", 2012)

"In the Lean start-up world, a pivot is a structured, often rapid, course correction on the basis of new market, customer, and development information." (Pamela Schure & Brian Lawley, "Product Management For Dummies", 2017)

"1. To rotate rows to columns, and columns to rows, in a cross-tabular data browser. 2. To choose dimensions from the set of available dimensions in a multidimensional data structure for display in the rows and columns of a cross-tabular structure." (Microsoft Technet)

[unpivot:] "In Integration Services, the process of creating a more normalized dataset by expanding data columns in a single record into multiple records." (Microsoft Technet)

05 April 2010

SQL Reloaded: The Power of Joins VI (Not-Equal Joins)

SQL Server 2008 documentation mentions the not-equal joins without giving an explicit definition, the term appearing within the context of joining values in two columns that are not equal. On the other side the not-equal join denomination could be used to refer more generally to the joins whose join constraints involve comparison operators other than equal (“=”), for example <, >, <>, BETWEEN, IN, and its negation NOT IN, LIKE, etc. The difference between the two views resides on the fact that some of the operators (e.g. IN, BETWEEN) could imply the equality of values from joined columns. Another objection could be made on whether the not-equality join constraint could be based only on one of the table’s attributes participating in the join, the other thus missing, such constraints could be considered also as join constraints even if in some scenarios they could have been written in the main WHERE clause as well. For the sake of simplicity and flexibility I would consider a not-equal join or not-equality join as a join in which at least one of the join constraints is based on a operator other than the equal operator. Could be discussed thus about not-equal join constraints referring to the join constraints that involve other operators than equal, and equal join constraints, the two types of joins could coexist in the same join. A not-equal join constraint often involves the possibility of being returned more records then if a equal join constraint was used.

If we talk about join constraints then more likely we refer to vertical joins, though also anti-joins and semi-joins could include sporadically not-equality operators. In a normal join based on equalities for the set of attributes participating in the join from the left table are matched all the records from the right table, even if this means retrieving more than one record. Same happens also for not-equal constraints even if the business scenarios in which such needs arises are not so intuitive. One relatively simple example I often met is the case when is needed to calculate the incremental aggregation of values, for example to calculate the incremental volume of PO placed over time:

SELECT DII.DateSequence 
, IsNull(POH.SubTotal, 0) SubTotal 
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII 
    CROSS APPLY (--incremental PO Volume 
     SELECT SUM(POH.SubTotal) SubTotal 
    FROM Purchasing.PurchaseOrderHeader POH 
   WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
       AND DATEDIFF(d, POH.OrderDate, DII.DateSequence)&gt;=0 --not-equal constraint 
    ) POH

In order to simplify the query I used the dbo.fGetDatesInInterval table-valued function that returns a sequence of dates within a given time interval, and a CROSS APPLY returning the aggregated value for each date returned by the function, for the calculations being considered all the records having an Order Date smaller or equal with the given (sequence) date. As can be seen from the above query the not-equal constraint is based on DateDiff function, instead we could have written POH.OrderDate<=DII.DateSequence, one of the reasons for its use is the fact that the constraint could be easily modified to show for example only the aggregated values for the past 90 days, the BETWEEN operator entering in scene:

SELECT DII.DateSequence 
, IsNull(POH.SubTotal, 0) SubTotal 
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII 
     CROSS APPLY (--incremental PO Volume 
    SELECT SUM(POH.SubTotal) SubTotal 
    FROM Purchasing.PurchaseOrderHeader POH 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND DATEDIFF(d, POH.OrderDate, DII.DateSequence) BETWEEN 0 AND 90--not-equal constraint 
    ) POH

The not-equal constraint from the query is part of interval-based constraints category which covers the scenarios in which one of the attributes from the tables participating in a join is checked whether it falls within a given interval, typically that being valid for numeric values as above or date value, as in the next query that shows the Last Standard Cost Details per Product.

-- Products &amp; Last Standard Cost details SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, PCH.StartDate 
, PCH.EndDate 
FROM Production.Product ITM 
    JOIN Production.ProductCostHistory PCH 
     ON ITM.ProductID = PCH.ProductID AND GETDATE() BETWEEN PCH.StartDate AND IsNull(PCH.EndDate, GETDATE())

    Such queries need often to be written when the records in a table have a given validity, no overlap existing between intervals.

    Using list of values within a join constraint and thus the IN or NOT IN operators occurs when for example the base table stores data coming from multiple sources and more encodings are used for the same source, each of them caring its meaning. For example using the Production.TransactionHistory and several other tables coming with AdventureWorks database, and supposing that for Purchase Orders are two transaction types P and B, and for Sales Orders S and I, a general scope query on Transactions could be written as follows:

-- Transaction History 

SELECT PTH.TransactionID 
, PTH.ProductID 
, PTH.ReferenceOrderID 
, CASE PTH.TransactionType 
     WHEN 'W' THEN 'Work Order' 
    WHEN 'S' THEN 'Sales Order' 
    WHEN 'I' THEN 'Internal Sales Order' 
    WHEN 'P' THEN 'Purchase Order' 
    WHEN 'B' THEN 'Blanket Purchase Order' 
    ELSE 'n.a' 
  END TransactionType , CASE  
    WHEN PTH.TransactionType = 'W' THEN StartDate 
    WHEN PTH.TransactionType IN ('S', 'I') THEN SOH.OrderDate 
    WHEN PTH.TransactionType IN ('P', 'B') THEN POH.OrderDate 
  END ReferenceDate FROM Production.TransactionHistory PTH 
    LEFT JOIN Purchasing.PurchaseOrderHeader POH       ON PTH.ReferenceOrderID = POH.PurchaseOrderID   AND PTH.TransactionType IN ('P', 'B') --not-equal constraint 
   LEFT JOIN Sales.SalesOrderHeader SOH 
     ON PTH.ReferenceOrderID = SOH.SalesOrderID 
  AND PTH.TransactionType IN ('S', 'I')  --not-equal constraint 
    LEFT JOIN Production.WorkOrder PWO 
      ON PTH.ReferenceOrderID = PWO.WorkOrderID 
   AND PTH.TransactionType = 'W'

    The need for writing such a query is seldom and could have been written as 3 distinct queries whose results are joined with two unions, both approaches having their pluses and minuses.

   In one of the above queries was used a constraint based on DateDiff function, and in theory any other function could be used in a join constraint, including user-defined functions. Such function-based join constraints are handy but should be used with caution because they could impact query’s performance. Pattern-based join constraints used with LIKE operator could be used as well.

    There are cases in which the constraints that typically would be included in the WHERE clause are added in the body of the join, apparently without any good reason. Such based denatured join constraints are base only on the attributes of one of the tables involved in the join, like in the next query:

-- PO for Products with StandardCost&gt;10 SELECT POD.PurchaseOrderDetailID  
, POD.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, POD.UnitPrice 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
    JOIN Production.Product ITM       ON POD.ProductID = ITM.ProductID 
  AND ITM.StandardCost&gt;10  -- denaturated join constraint

03 April 2010

SQL Reloaded: Power of Joins V (Semi-Joins and Anti-Joins)

     An incursion in the world of joins is incomplete without approaching two other important techniques for writing queries: semi-joins and anti-joins, some of the techniques specific to them allowing to speed up queries and reduce queries’ complexity. An anti-join returns rows from the left table where no matches is found in the right table, while an semi-join returns returns rows from the left table if at least a match is found in the second table. The typical techniques for creating semi-joins and anti-joins are based on IN or EXISTS operators, respectively their negations the NOT EXISTS and NOT IN, both operators being available in SQL Server and Oracle. According to the above definitions also the EXCEPT and INTERSECT operators introduced in horizontal joins post could be used to create anti-joins, respectively semi-joins too, they being introduced by Microsoft starting with SQL Server 2005 in order to compare whole rows rather then a smaller set of attributes. Another operators that allows creating anti-joins and semi-joins and introduced with SQL Server 2005 is the CROSS APPLY and OUTER APPLY operators.

     Because the EXCEPT and INTERSECT operators were introduced in a previous post, this post is focused on the other mentioned operators, for exemplification of the techniques following to use again the Production.Product and Purchasing.PurchaseOrderDetail tables from AdventureWorks database.

      IN operator allows to test whether a given (attribute) value is in a list of values, the respective list could be based also on the values returned by another query, called actually a subquery. The output is not influenced by the eventual duplicates in the list of values, though for the sake of simplicity and testability it makes sense to provide a list of distinct values.

      EXISTS operator provides a similar functionality like the IN operator, however it checks only whether a record exists in what is called a correlated suquery (also known as repeated subquery, is a subquery making direct reference to one or more of the attributes of the main query). The subquery used with an IN operator could be easily translated to a correlated subquery by bringing the attribute used to create the list of values into the WHERE clause.

     Let’s start with a simple problem to exemplify the use of a semi-join, for example to select only the products that have a PO (Purchase Order) placed, the query could be written as follows:

-- Products with PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID  
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )

The above query could be rewritten using the EXISTS operator, the correlated query being based on ProductID attribute found in both tables:

-- Products with PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)

Notes:
1.    As usual special attention must be given to the handling of NULL values either by replacing them with a default value or by handing the NULL case in the WHERE clause.
2.   The Execution Plan for the above two queries is the same, SQL Server using a Hash Match (Left Semi Join) for both operators. I was trying to find information on whether there are any recommendations that advices on the use of one of the two operators in the detriment of the other, however I found nothing in this direction for SQL Server. R. Schrag offers a rule of dumb rooted in Oracle documentation and based on the selectivity of a query, a query being highly selective if it returns the least number of rows:
     “If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate.” [1]

     Both above solutions are ideal when no data are needed from the tables used in the subquery, however quite often such a need arises, for example to show the volume of PO placed, the last PO Date, and so on, in such cases an inner join could be used instead:

-- Products with PO placed (inner join) SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, POD.TotalOrderQty 
, POD.LastOrderDate 
FROM Production.Product ITM  
JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    , SUM(POD.OrderQty) TotalOrderQty 
    , Max(OrderDate) LastOrderDate 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID

For the same purpose could be used also a CROSS APPLY taking care to eliminate the Products for which no record is retrieved in the subquery, this approach is actually a mixture between an inner join and a correlated query:

 -- Products with PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  
, POD.TotalOrderQty  
, POD.LastOrderDate  
FROM Production.Product ITM  
     CROSS APPLY (--aggregated PO details  
     SELECT SUM(POD.OrderQty) TotalOrderQty  
     , Max(OrderDate) LastOrderDate  
     FROM Purchasing.PurchaseOrderDetail POD  
           JOIN Purchasing.PurchaseOrderHeader POH  
              ON POD.PurchaseOrderID = POH.PurchaseOrderID  
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD  
WHERE POD.TotalOrderQty IS NOT NULL

Notes:
1. The IN and EXISTS query versions seems to provide better performance than the use of vertical joins, though from my experience is often needed to provide information also from the subqueries used with the respective operators, therefore the inner join provides more flexibility in the detriment of relatively poorer performance.
2. Microsoft’s implementation for IN operator allows using only single list of values, in contrast with Oracle that allows working with lists of n-uples. This problem in theory could be solved using the IN operators with a correlated subquery. An example of such a problem is finding the last PO placed for each Product:

 -- Last PO placed per Product (IN + correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH         ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
   AND POH.OrderDate IN (--Last PO Date per Product 
      SELECT Max(POH1.OrderDate) LastOrderDate 
      FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
    WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
AND POD1.ProductID = POD.ProductID -- correlated constraint 
)

This might not be the best approach to solve this problem though the query is intended only to show a technique. The query could be rewritten using an inner join or the EXISTS operator, but I will show only the later approach because it comes with an interesting technique – because the LastOrderDate and ProductID are used as pair of values in order to retrieve the last PO per Product, given the fact that LastOrderDate is an aggregate value, the correlated constraint needs to be added in HAVING clause:

-- Last PO placed per Product (correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     AND EXISTS(--Last PO Date per Product 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
     WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
          AND POD1.ProductID = POD.ProductID -- correlated constraint 
     GROUP BY POD1.ProductID 
     HAVING Max(POH1.OrderDate) = POH.OrderDate 
     )

The NOT IN and NOT EXISTS versions of the operators simply negates the use of IN and EXISTS operators, showing thus the records for which a match is not found. The complementary problem of showing the Products with a PO placed is to show the Products without a PO placed, for this, as previously mentioned, being enough to add the NOT keyword in from of IN, respectively EXISTS in the first two queries from this post:

-- Products without PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID NOT IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )  

-- Products without PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE NOT EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)

An alternative to NOT IN and NOT EXISTS is the use of a LEFT JOIN transposing the project of A\B problem:

  -- Products without PO placed (left join) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost FROM Production.Product ITM  
    LEFT JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID   
WHERE POD.ProductID IS NULL

The CROSS APPLY could be used too, this time the WHERE clause from the main query being modified to show only the records without a match in CROSS APPLY, for this needing to be used a "disguised" aggregate too:

 -- Products without PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  FROM Production.Product ITM  
    CROSS APPLY (-- PO details  
    SELECT Max(1) HasPOPlaced 
    FROM Purchasing.PurchaseOrderDetail POD  
          JOIN Purchasing.PurchaseOrderHeader POH  
             ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD 
WHERE POD.HasPOPlaced IS NULL

As can be seen there are multiple alternative ways of solving the same problem, none of the above described patters could be considered as best approach because in the end must be weighted between performance, flexibility, reusability and complexity. In many cases the performance of a query is the most important factor, though the performance of any of the above techniques is relative because it needs to be considered factors like table structures and relations’ cardinality, server’s configuration and available resources, etc.

References:
[1] Schrag R. (2005). Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN. [Online] Available from: http://www.dbspecialists.com/files/presentations/semijoins.html (Accessed: 3 Apr 2010)

SQL Troubles

Pages