11 November 2008

Data Quality in Information Systems

    One of the LinkedIn users brought to our attention the fact that “according to top managers in the IT business the top 2 reasons why CRM investments fails is:
1. Managing resistance within the organisation
2. Bad data quality” (B. Hellgren, 2008) and I think that’s valid not only to CRM or BI solutions but also to other information system solutions.

    Especially in ERP systems the data quality continues to be a problem and from my experience I can point here the following reasons:
- processes span across different functions, each of them maintaining the data they are interested in, being rare the cases in which a full agreement is met between functions, and here is also management’s fault.
- within an enterprise many systems arrive to be integrated, the quality of the data depending on integrations’ stability and scope; if stability can be obtained in time, scope resumes to the volume of data available, the delta needing to be maintained mainly manually.
- there are systems which are not integrated but using the same data, users needing to duplicate their effort so they often focus on their immediate needs. - lack of knowledge about the system used, the processes in place and best practices (they can be further detailed).
- basic or inexistent validation for data entry in each important entry point (User Interface, integration interfaces, bulk upload), system permissiveness (allowing workarounds), stability and reliability (bugs/defects).
- inexistence of data quality control mechanisms or quality methodologies.

    Data quality is usually ignored in BI projects, and this because few are the ones that go and search the causes, being easier to blame the BI solution or the technical team. This is ones of the reasons for which users are reticent in using a BI solution, to which add up solution’s flexibility and the degree up to which the solution satisfies users’ needs.     On the other side BI solutions are often abused, including also reports which have OLTP characteristics or of providing too much, unstructured or inadequate content that needs to be further reworked.

    I think that data quality comes on manager’s dashboards especially during ERP implementations, when the destination system has some strong validation. And then there is no wonder when each record contains at minimum a defect and the whole dashboard is in nuances of red (the red must be "en vogue").

08 November 2008

Dealing with data duplicates on SQL Server

    Duplicates or duplications can be defined as “result rows that are identical to one or more other rows” (S.Hull, 1998), they occur frequently in data collections (e.g. Excel, Access, etc.) that have minimum or no validation, more users, higher the risks of having duplicated data. Duplicates occur also in modern architectures when validation wasn’t addressed correspondingly or from usage related issues, for example in information systems, the huge amount of data entry makes it hard to overlook documents already processed (e.g. purchase orders, invoices, payments, etc.).

    Subject to duplication are whole records, a group of attributes (fields) or only single attributes, it depends from case to case. Often duplicates are easy to identify, it’s enough to let somebody who has the proper knowledge to look over them, but what you do when the volume of data is too large or when is need to automate the process as much as possible. Using the DISTINCT keyword in a SELECT statement might do the trick, while other times it requires more complicated validation, ranging from simple checks to data mining techniques.

    I will try to exemplify the techniques I use to deal with duplicates with the help of a self-constructed example, a table having the following structure, which tracks information about Assets:

CREATE TABLE [dbo].[Assets](
[ID] [int] NOT NULL,
[CreationDate] smalldatetime NOT NULL,
[Vendor] [varchar](50) NULL,
[Asset] [varchar](50) NULL,
[Model] [varchar](50) NULL,
[Owner] [varchar](50) NULL,
[Tag] [varchar](50) NULL,
[Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

    Next step is to provide the test data:
INSERT INTO Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
INSERT INTO Assets
VALUES ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
INSERT INTO Assets
VALUES ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO Assets
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO Assets
VALUES ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
INSERT INTO Assets
VALUES ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

    Now let’s check table’s content:

SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM Assets

Output: All records selection     Normally a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case duplicates’ identification will resume to the records in which the Tag-Vendor combination appears more than once:

SELECT Vendor, Tag
FROM Assets A
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output: Vendor-Product duplications     In many cases showing only the duplicated values would be enough, however there are cases in which is required to see the whole record in which the duplicates appeared, for example to clarify the context in which the issues appear. A simple JOIN with the base table would provide the affected records:

SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM Assets A
JOIN (
SELECT Vendor, Tag
FROM Assets A
GROUP BY Vendor, Tag
HAVING COUNT(*)>1
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag

Output:
selection of duplicates     There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally is enough to use the DISTINCT keyword to remove duplicated rows,

SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM Assets

Output:
selecting distinct records     In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore is needed another approach. First of all we need to identify which one is the most reliable record, in some cases the latest records entry should be the most accurate or closer to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which the record that is selected is, but from experience these cases are few.

    Oracle and SQL Server introduced the dense_rank() analytic function, which “returns the rank of rows within the partition of a result set, without any gaps in the ranking” (MSDN, 2008 a). In our case the partition is determined by Vendor and Tag, following to identify which the logic used for raking. Supposing that we are always interested in the last record entered, the query would look like this:

SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
, dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING
FROM Assets
) A
WHERE RANKING = 1

Output:
using dense_rank() function     Things are simple when records’ creation date is available, because a descending order would allow selecting the last record first, the Id being added to the clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number() analytic function, which “returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition” (MSDN, 2008 b). Actually the dense_rank function assigns the same rank for the same occurrence of the values appearing in ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a field unique within a partition, while row_number() will return a sequence no matter of the attributes used in ORDER BY clause. In this case I suppose that both functions should have the same performance, on the other side, when the selection of records within a partition is not important, it’s better to use the row_number() function, which would need to do less processing and thus better performance.

Unfortunately this technique doesn’t work in SQL Server 2000, so a different approach is requested. In most of the cases the unique identifier for a record is a running number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:

SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
JOIN (
SELECT Vendor, Tag, MAX(Id) MaxId
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND A.ID = B.MaxId
Output:
SQL Server 2000 technique The same technique can be used to delete the duplicates from a table:
DELETE Assets
FROM (
SELECT Vendor, Tag, MAX(Id) MaxId
FROM Assets A
GROUP BY Vendor, Tag
) B
WHERE Assets.Vendor = B.Vendor
AND Assets.Tag = B.Tag
AND Assets.ID <> B.MaxId
    When an Id is not available, then we have most probably to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria we chose for selection, and then apply the technique based on Id as above.
Notes:
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy, however for versions that doesn’t supports it, a creation date attribute saves the day, when available:
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
JOIN (
SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output:
record matching extreme values 2. Duplicated records can be created by merging two sets of data with UNION ALL, in such cases a simple UNION would remove the duplicates introduced by the queries coming after the first one however this will burden the database with more checks. It is recommended to find other solutions, when possible!

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial portable also on SQL Server 2000. Here’s the single multi-row insertion statement:

INSERT INTO Assets VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should work also in Oracle with two amendments, the attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function as below:
INSERT INTO Assets
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
S.Hull. 1998. Business Objects Glossary. [Online] Available from:http://planning.ucsc.edu/IRPS/dwh/BOBGLOSS.HTM (Accessed: 7 November 2008)
MSDN. 2008 a. SQL Server 2008 Books Online, dense_rank(). [Online] Available from: http://msdn.microsoft.com/en-us/library/ms173825.aspx(Accessed: 7 November 2008)
MSDN. 2008 b. SQL Server 2008 Books Online, dense_rank(). [Online] Available from: http://msdn.microsoft.com/en-us/library/ms186734.aspx (Accessed: 7 November 2008)

Learning about Oracle APPS internals - Part I

How can I learn more about Oracle APPS (Oracle Applications) and from where? a normal question for every beginner but also for experienced professionals.

Oracle made available documentation about their products through Oracle Technology Network and Metalink. The first source contains documents mainly as pdf files, while Metalink provides richer content and it’s easier to use, however in order to access it, your company has to purchase an Oracle Support Identifier.

In Metalink, Oracle Applications’ documentation is grouped under eTRM (Electronic Technical Reference Manuals) section, while the pdf documents can be found under Oracle 11i Documentation Library, and many of them, especially for older versions, can be found also on the web, and revealed with a simple search by using tables' name or file’s name.
Both sources are by far incomplete, there are many gaps, not to forget that many of the Oracle implementations involve also some customization, information about these changes could find maybe in the documentation made during implementation/customization process.

Lately have appeared many blogs on Oracle Applications internals, and even if many of them resume by copying some material from Metalink or other documents, there are also professionals who respect themselves.

People can learn a lot by checking the objects that unveils the APPS internals, APPS.FND_TABLES providing the list of tables used, while APPS.FND_VIEWS provides the list of views, the problem with the later being that can't be done a search using the field that stores views' script, but the data can be exported to a text file and do the search in there (it won’t work to export the data completely to Excel). In time developers arrive to intuit how the views could be named, so a search on their name could help narrowing down the search.

Other professionals might be willing to help, so often it's a good idea to post questions on blogs, forums or social networks for professionals. Not all the questions get answered so rather than waiting for indirect enlightment, it’s better to do some research in parallel too.

There will be cases in which none of the specified sources will help you, most probably you'll have to reengineer Oracle Applications' internals by studying various business scenarios, and in this case the experimented users could help a lot.

07 November 2008

Enterprise reporting

    Supposing that your company invested lot of money in an ERP system, moreover around the nice expensive setup it needed to customize many of the existing functionality. So you end up with an ERP system which provides more or less the functionality you needed, though in order to increase its value and make accurate decisions you need some reports out of it. What you do then?

    From my point of view there are 4 types of reporting needs: - OLTP (On-Line Transaction Processing) system providing reports with actual (live) data; - OLAP (On-Line Analytical Processing) reports with drill-down, roll-up, slice and dice or pivoting functionality, working with historical data, the data source(s) being refreshed periodically;
- ad-hoc reports – reports provided on request, often satisfying one time reports or reports with sporadic needs;
- Data Mining tool(s) focusing on knowledge discovery;
    Actually I would have put a 5th requirement related to direct data access and analysis.


    ERP systems like Oracle Applications or SAP come by default with a set of (predefined) standard reports, which in theory they satisfy the prerequisites of an OLTP system. Unfortunately the standard reports satisfy only basic needs and are not as flexible as expected – it could happen that they can be exported only to text, moreover they being in a non-tabular format, and therefore impossible to reuse for detailed analysis. Another problem could be caused by inadequate filtering constraints, behavior or scope. If existing functionality has been customized, most probably existing reports need to be adapted to the new logic. In the end you realize that you either change the existing reports or adopt an OLTP solution.
    The bad news don’t stop here, as the vendors try to keep the secrecy about their solutions, the information about ERP’s internals is limited, good developers hard to find or really expensive, and often they needing to reinvent the wheel. It’s true that Oracle provided through Metalink a nice set of documentation about Oracle Application’s internals, and even if they updated currently web site’s look and feel, there are still having many gaps concerning tables’ structure and functionality. Fortunately armed with enough patience and some knowledge about existing business processes and databases, a developer can reengineer an important part of the logic. Other good news is that the increasing number of posting focusing on ERP solution, however few are the source that bring something new.

    OLAP solutions presume the existence of a data warehouse that reflects the business model, and when intelligently built it can satisfy an important percentage from the business intelligence requirements. Building a data warehouse or a set of data marts it’s an expensive and time consuming endeavor and rarely arrives to satisfy everybody’s needs. There are also vendors that provide already made models, and at a first view they look like an important deal, however in time excepting the fact that vendor detains the ownership of the model, I bet that such models have many gaps. Again you end up by customizing and extending the model, running in all kind of issues involving model’s design, flexibility, quality, resources and costs.     There are many ways of making things go wrong, one of the interesting situations being when an OLAP system is used to satisfy OLTP reporting needs. It’s like using a city car in a country cross race – you might make it to compete or even end the race, if you are lucky enough, but don’t expect to make a success out of it!

    The need for ad-hoc reports will be there no matter how complete and flexible are your existing reports. There are always new requirements that must be fulfilled in utile time and not rely on the long cycle time needed for an OLTP/OLAP report. Actually many of the reports start as ad-hoc reports and once their scope and logic stabilized they are moved to the reporting solution. Talking about new reports requirements, it worth to mention that many of the users don’t know exactly what they want, what is possible to get and what information it makes sense to show and at what level of detail in order to have a report that reflects the reality. In theory is needed a person who facilitate the communication between users and development team, especially when the work is outsourced. Such a person should have in theory a deep understanding of the business, of the ERP system and reporting possibilities, deeper the knowledge, shorter the delivery cycle time. Maybe such a person could be dispensable if the users and development have the required skill set and knowledge to define and interpret clearly the requirements, however I doubt that’s achievable on large scale. On the other side such attributions could be taken by the IM or functional leaders that support the ERP system, it might work, at least in theory.

    Data Mining tools and models are supposed to leverage the value of an ERP system beyond the functionality provided by analytic reports by helping to find hidden patterns and trends in data, to elaborate predictions and estimates. Here I resume only saying that DM makes sense only when the business reached a certain maturity, and I’m considering here mainly the costs/value ratio (the expected benefits needing to be greater than the costs) and effort required from business side in pursuing such a project.

    There are situations in which the functionality provided by reporting tools doesn’t fulfill users’ requirements, one of such situations being when users need to analyze data by themselves, to link data from different sources, especially Excel sheets. It’s true that vendors tried to address such requirements, though I don’t think they are mature enough, easy to use or allow users to go beyond their skills and knowledge.
    Most of such scenarios resume in accessing various sources over ODBC or directly using Excel or MS Access, such solutions being adequate more for personal use. The negative side is that people arrive to misuse them, often ending up by having a multitude of such solution which maybe would make sense to have implemented as a report.
    There are managers who believe that such tools would allow eliminating the need for ad-hoc reports, it might be possible in isolated cases though don’t expect from users to be a Bill Inmon or Bill Gates!

    All the tools have their limitations, no matter how complex they are, and I believe that not always a single tool addresses all requirements. Each of such tools need a support team and even a center of excellence, so assure yourself that you have the resources, knowledge and infrastructure to support them!