11 November 2008

Data Management: Data Quality - Information Systems' Perspective

 
Data Management

   One of the LinkedIn users brought to our attention the fact that “according to top managers in the IT business the top 2 reasons why CRM investments fails is:
1. Managing resistance within the organisation
2. Bad data quality” (B. Hellgren, 2008), and I think that’s valid not only for CRM or BI solutions but also for other information systems.

    Especially in ERP systems data quality continues to be a problem, and from my experience I can point out the following reasons:
- processes span different functions, each of them maintaining the data they are interested in; the cases in which full agreement is reached between functions are rare, and part of the fault lies with management.
- within an enterprise many systems end up being integrated, the quality of the data depending on the integrations’ stability and scope; while stability can be achieved in time, the scope is limited to the volume of data available, the delta needing to be maintained mainly manually.
- there are systems which are not integrated but use the same data, so users need to duplicate their effort and often focus only on their immediate needs.
- lack of knowledge about the system used, the processes in place and the best practices (these can be further detailed).
- basic or inexistent validation for data entry at each important entry point (user interface, integration interfaces, bulk upload), system permissiveness (allowing workarounds), stability and reliability (bugs/defects).
- inexistent data quality control mechanisms or quality methodologies.

    Data quality is usually ignored in BI projects, and this because few are the ones who go and search for the causes, it being easier to blame the BI solution or the technical team. This is one of the reasons for which users are reticent to use a BI solution, to which add up the solution’s flexibility and the degree to which the solution satisfies the users’ needs. On the other side, BI solutions are often abused, for example by including reports with OLTP characteristics, or by providing too much, unstructured or inadequate content that needs to be further reworked.

    I think that data quality reaches managers’ dashboards especially during ERP implementations, when the destination system has some strong validation. And then it is no wonder when each record contains at minimum one defect and the whole dashboard is in shades of red (red must be "en vogue").

08 November 2008

SQL Reloaded: Dealing with data duplicates on SQL Server

Duplicates or duplications can be defined as "result rows that are identical to one or more other rows" [1]. They occur frequently in data collections (e.g. Excel, Access, etc.) with concurrent access and minimal or no validation at all. The more users there are and/or the more frequent the updates on the same information, the higher the risk of having duplicated data. Duplicates also occur in modern architectures when validation wasn’t addressed correspondingly, or from usage-related issues, for example when the huge amount of data entry in information systems makes it hard to spot documents already processed (e.g. purchase orders, invoices, payments, etc.).

Subject to duplication are whole records, a group of attributes (fields) or only single attributes; it depends from case to case. Often duplicates are easy to identify - it’s enough to let somebody with the proper knowledge look over them. But what do you do when the volume of data is too large or when the process needs to be automated as much as possible? Sometimes using the DISTINCT keyword in a SELECT statement does the trick, while other times more complicated validation is required, ranging from simple checks to Data Mining techniques.

I will try to exemplify the techniques I use to deal with duplicates with the help of a simple example based on a table that tracks information about Assets:

-- create test table
CREATE TABLE [dbo].[Assets](
 [ID] [int] NOT NULL,
 [CreationDate] smalldatetime NOT NULL,
 [Vendor] [varchar](50) NULL,
 [Asset] [varchar](50) NULL,
 [Model] [varchar](50) NULL,
 [Owner] [varchar](50) NULL,
 [Tag] [varchar](50) NULL,
 [Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

Here's some test data:

-- insert test data (SQL Server 2000+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
INSERT INTO dbo.Assets
VALUES  ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
INSERT INTO dbo.Assets
VALUES  ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES  ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
INSERT INTO dbo.Assets
VALUES  ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

    
Let’s check the table’s content:

-- review the data
SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
ID CreationDate Vendor Asset Model Owner Tag Quantity
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Normally, a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case identifying the duplicates reduces to finding the records in which the Tag and Vendor combination appears more than once:

-- retrieve the duplicates
SELECT Vendor, Tag
FROM dbo.Assets A    
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output:
Vendor Tag
Dell DD0001
Microsoft WX0001

In many cases showing only the duplicated values would be enough, however there are cases in which it is required to see the whole record in which the duplicates appear, for example to clarify the context in which the issues occur. A simple JOIN with the base table provides the affected records:

-- retrieve duplicates with details
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- duplicates
	  SELECT Vendor, Tag
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  HAVING COUNT(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:

-- select unique records
SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
CreationDate Vendor Asset Model Owner Tag Quantity
1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore another approach is needed. First of all we need to identify which record is the most reliable; in some cases the latest entered record should be the most accurate or the closest to reality, but that’s not necessarily true. There are also cases in which we don’t care which record is selected, but from experience these cases are few.

Oracle and SQL Server introduced the dense_rank() analytic function, which returns the rank of rows within the partition of a result set, without any gaps in the ranking [2]. In our case the partition is determined by Vendor and Tag; what remains is to identify the logic used for ranking. Supposing that we are always interested in the last record entered, the query would look like this:

-- nonduplicated records via ranking functions (SQL Server 2005+)
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery 
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING 
 FROM dbo.Assets 
) A 
WHERE RANKING = 1

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Things are simple when the records’ creation date is available and is unique for the key used to validate the data, because a descending order allows selecting the last record first. An Id can be added to the ORDER BY clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number() function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition [3]. Actually, the dense_rank() function assigns the same rank for the same combination of values appearing in the ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a record unique within a partition, while row_number() returns a sequence regardless of the attributes used in the ORDER BY clause.

Both functions should provide similar performance. That said, when it’s not important which record within a partition is selected, it’s better to use the row_number() function, which needs to do less processing.
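
For illustration, here's a minimal sketch of the same query rewritten with row_number(), using the same partition and ordering as above:

-- nonduplicated records via row_number() (SQL Server 2005+)
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery 
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , row_number() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC, Id DESC) RANKING 
 FROM dbo.Assets 
) A 
WHERE RANKING = 1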

Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most cases the unique identifier for a record is a sequential unique number, the highest Id corresponding to the latest entered record. This allows selecting the latest entered record by using the MAX function:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

The same technique can be used to delete the duplicates from a table:

-- delete the duplicates, keeping the last entry (SQL Server 2000+)
-- note: this modifies the test data; rerun the inserts to restore it
DELETE A
FROM dbo.Assets A
     JOIN ( -- last entry per Vendor/Tag among the duplicated groups
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets
	  GROUP BY Vendor, Tag
	  HAVING COUNT(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID <> B.MaxId

When an Id is not available, then most probably we have to create a temporary table with a sequence (a numeric unique identifier), sort the records based on the criteria chosen for selection, and then apply the Id-based technique as above.
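
Here's a minimal sketch of such a workaround on SQL Server (#Assets is just an illustrative name, and the ORDER BY expresses the intended sequence rather than a strict guarantee of the identity assignment order):

-- create a temporary copy with a sequence (SQL Server 2000+)
SELECT IDENTITY(int, 1, 1) RowId
, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
INTO #Assets
FROM dbo.Assets
ORDER BY Vendor, Tag, CreationDate

-- the RowId now plays the role of the Id in the technique above
SELECT A.RowId, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM #Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(RowId) MaxRowId
	  FROM #Assets
	  GROUP BY Vendor, Tag
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.RowId = B.MaxRowId

-- cleanup
DROP TABLE #Assets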

Notes:
1. In other scenarios it’s important to select all the records matching an extreme value (first, last), the dense_rank function coming in handy; however, for versions that don’t support it, a creation date attribute saves the day, when it’s available and unique:

-- nonduplicated records (SQL server 2000+)  
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- last entry
	  SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
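
For comparison, here's a sketch of the dense_rank-based equivalent for versions that support it: with only the CreationDate in the ORDER BY, ties share the first rank, so all records matching the latest CreationDate within a partition are returned:

-- records matching the latest CreationDate per Vendor/Tag (SQL Server 2005+)
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery 
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC) RANKING 
 FROM dbo.Assets 
) A 
WHERE RANKING = 1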

2. Duplicated records can also be introduced by merging two sets of data with UNION ALL; in such cases a simple UNION removes the duplicated rows from the combined result, however this burdens the database with additional checks. It is recommended to find other solutions, when possible!
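
A minimal illustration based on the same Assets table (the second query is just a stand-in for another data source):

-- UNION ALL keeps all rows, including the duplicates coming from the two sets
SELECT Vendor, Tag
FROM dbo.Assets
UNION ALL
SELECT Vendor, Tag
FROM dbo.Assets

-- UNION removes the duplicated rows from the combined result
SELECT Vendor, Tag
FROM dbo.Assets
UNION
SELECT Vendor, Tag
FROM dbo.Assets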

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000. Here’s the single multi-row insertion statement:

-- insert test data (SQL Server 2005+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should also work in Oracle with two amendments: the attributes’ types must be adapted to Oracle’s, and instead of the SQL Server GetDate() function the corresponding Oracle SYSDATE function must be used, as below:
 
-- insert test data (Oracle)
INSERT INTO dbo.Assets
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
[1] S Hull (1998) Business Objects Glossary
[2] Microsoft Learn (2023) dense_rank() (link)
[3] Microsoft Learn (2023) row_number() (link)

ERP Systems: Learning about Oracle APPS internals I

How can I learn more about Oracle APPS (Oracle Applications), and from where? A normal question for every beginner, but also for experienced professionals.

Oracle made documentation about their products available through the Oracle Technology Network and Metalink. The first source contains documents mainly as pdf files, while Metalink provides richer content and is easier to use; however, in order to access it, your company has to purchase an Oracle Support Identifier.

In Metalink, Oracle Applications’ documentation is grouped under the eTRM (Electronic Technical Reference Manuals) section, while the pdf documents can be found under the Oracle 11i Documentation Library; many of them, especially for older versions, can also be found on the web with a simple search by table name or file name.

Both sources are far from complete and there are many gaps, not to forget that many Oracle implementations also involve some customization; information about such changes can maybe be found in the documentation produced during the implementation/customization process.

Lately many blogs on Oracle Applications internals have appeared, and even if many of them amount to copying material from Metalink or other documents, there are also professionals who respect themselves.

People can learn a lot by checking the objects that unveil the APPS internals: APPS.FND_TABLES provides the list of tables used, while APPS.FND_VIEWS provides the list of views. The problem with the latter is that a search can’t be done on the field that stores the views’ script, but the data can be exported to a text file and the search done there (exporting the data completely to Excel won’t work). In time developers come to intuit how the views could be named, so a search on their names can help narrow things down.
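
A minimal sketch of such metadata queries (the 'PO%' pattern is only an example, and the column names come from the FND data dictionary, so they should be checked against the version in use):

-- tables registered in the data dictionary whose name starts with PO
SELECT TABLE_NAME, DESCRIPTION
FROM APPS.FND_TABLES
WHERE TABLE_NAME LIKE 'PO%'
ORDER BY TABLE_NAME;

-- narrowing down the views by name (the column storing the views' script can't be searched directly)
SELECT VIEW_NAME
FROM APPS.FND_VIEWS
WHERE VIEW_NAME LIKE 'PO%'
ORDER BY VIEW_NAME;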

Other professionals might be willing to help, so it’s often a good idea to post questions on blogs, forums or social networks for professionals. Not all questions get answered, so rather than waiting for indirect enlightenment, it’s better to do some research in parallel too.

There will be cases in which none of the specified sources helps you; most probably you'll have to reengineer Oracle Applications' internals by studying various business scenarios, and in this case experienced users can help a lot.

Business Intelligence: Enterprise Reporting

Business Intelligence
Business Intelligence Series

Introduction

Let's suppose that your company invested a lot of money in an ERP system, and besides the complex setup many customizations were made. To increase the ERP system's value, monitor the operations and make accurate decisions you'll need some reports out of it. What do you do then?

In general, there are 5 types of reporting needs: 
  • OLTP (On-Line Transaction Processing) system providing reports with actual (live) data;
  • OLAP (On-Line Analytical Processing) reports with drill-down, roll-up, slice and dice or pivoting functionality, working with historical data, the data source(s) being refreshed periodically;
  • ad-hoc reports – reports provided on request, often satisfying one-time or sporadic reporting needs;
  • Data Mining tool(s) focusing on knowledge discovery (aka Data Science);
  • direct data access and analysis (aka self-service BI).

Standard Reports 

ERP systems like Oracle Applications, Dynamics AX or SAP come by default with a set of (predefined) standard reports, which in theory cover the basic reporting needs. Unfortunately the standard reports are not as flexible as expected, e.g. they can be exported only to text and/or in a non-tabular format and are therefore impossible to reuse for detailed analysis, or they have inadequate filtering parameters/constraints, behavior or scope. If existing functionality has been customized, most probably the existing reports need to be adapted to the new logic. In the end customers need to change the existing reports or adopt an OLAP solution.
    
Vendors tend to keep their solutions secret and/or don't invest much time in documenting their systems' functionality. Therefore, the information about an ERP’s internals is limited, while good developers are hard to find or really expensive, and often they need to reinvent the wheel. ERP vendors do provide documentation about their systems' internals, though there are still many gaps concerning tables’ structure and functionality. Fortunately, armed with enough patience and some knowledge about the existing business processes and databases, a developer can reengineer an important part of the logic, though there's always a shade of doubt whether the logic is entirely correct or complete. The other good news is that more and more professionals blog on ERP topics, however few are the sources that bring something new.

OLAP Reporting  

OLAP solutions presume the existence of a data warehouse that reflects the business model, and when intelligently built it can satisfy an important percentage of the BI requirements. Building a data warehouse or a set of data marts is an expensive and time-consuming endeavor and rarely manages to satisfy everybody’s needs. There are also vendors that provide commercial off-the-shelf data models and solutions, and at first view they look like a good deal, however such models are inflexible and seldom cover all requirements. One can end up customizing and extending the model, running into all kinds of issues involving the model’s design, flexibility, quality, resources and costs.
 
There are many ways in which things can go wrong or be misused. One such scenario is when an OLAP system is used to satisfy OLTP reporting needs. It’s like using a city car in a cross-country race – you might manage to compete or even finish the race, if you are lucky enough, but don’t expect to make a success out of it!

Ad-hoc Reporting   

The need for ad-hoc reports will be there no matter how complete and flexible your existing reports are. There are always new requirements that must be fulfilled in a timely manner and cannot rely on the long cycle time needed for an OLTP/OLAP report. Actually many reports start as ad-hoc reports, and once their scope and logic have stabilized they are moved to the reporting solution. Talking about new report requirements, it’s worth mentioning that many users don’t know exactly what they want, what is possible to get, what information it makes sense to show and at what level of detail in order to have a report that reflects reality.

In theory a person is needed to facilitate the communication between the users and the development team, especially when the work is outsourced. Such a person should in theory have a deep understanding of the business, of the ERP system and of the reporting possibilities; the deeper the knowledge, the shorter the delivery cycle time. Maybe such a person could be dispensable if the users and developers have the required skill set and knowledge to define and interpret the requirements clearly, however I doubt that’s achievable on a large scale. On the other side such attributions could be taken over by the IM or functional leaders who support the ERP system; it might work, at least in theory.

Data Mining   

Data Mining tools and models are supposed to leverage the value of an ERP system beyond the functionality provided by analytic reports, by helping to find hidden patterns and trends in data and to elaborate predictions and estimates. Here I’ll only say that DM makes sense only when the business has reached a certain maturity, and I’m considering here mainly the costs/value ratio (the expected benefits needing to be greater than the costs) and the effort required from the business side in pursuing such a project.

Self-Service BI   

There are situations in which the functionality provided by reporting tools doesn’t fulfill the users’ requirements, one such situation being when users (aka data citizens) need to analyze data by themselves, linking data from different sources, especially Excel sheets. It’s true that vendors have tried to address such requirements, though I don’t think the tools are mature enough, easy to use, or able to take users beyond their skills and knowledge.
 
Most such scenarios come down to accessing various sources over ODBC or directly using Excel or MS Access, such solutions being adequate mostly for personal use. The negative side is that people come to misuse them, often ending up with a multitude of such solutions which maybe would have made more sense implemented as a report.

There are managers who believe that such tools would eliminate the need for ad-hoc reports; it might be possible in isolated cases, though don’t expect users to be a Bill Inmon or a Bill Gates!

Conclusion   

All tools have their limitations, no matter how complex they are, and I believe that a single reporting tool or platform will not always address all requirements. Each such tool needs a support team and even a center of excellence, so make sure that you have the resources, knowledge and infrastructure to support them!

