SQL Troubles: 2008

02 December 2008

Business Intelligence: General Issues in Business Intelligence

Business Intelligence Series

Introduction

BI projects are noble in intent though many managers and data professionals ignore their implications and prerequisites – data quality (incl. availability), cooperation, maturity, infrastructure, adequate tools and knowledge.

Data Quality

The problem with data usually starts at the source - ERP and other information systems (IS). In theory the system should cover all the basic reporting requirements existing in an enterprise, though that's seldom the case. Therefore, basic reporting needs end up being covered by tools developed ad hoc, often MS Excel/Access solutions, which are difficult to integrate and manage across the organization.

Data Quality (DQ) is maybe the most ignored component in the attempt to build flexible, secure and reliable BI solutions. DQ is based on the validation implemented in source systems and the mechanisms used to cleanse the data before being reported, respectively on the efficiency and effectiveness of existing business processes and best practices.

DQ must be guaranteed for accurate decisions. If the quality is not validated and reviewed periodically, users will be reluctant to use the reports! The reports must be validated as part of the UAT process. Aggregated BI reports need detailed reports that can be used for validation, while the logic and data need to be synchronized accordingly.

The quality of decisions depends on the degree to which the data were understood and presented to the decision makers, though that’s not enough; a complete perspective is also needed, and maybe that’s why some business users prefer to prepare and aggregate the data by themselves, the process allowing them, in theory, to get a deeper understanding of what’s happening.

Cooperation

A BI initiative doesn’t depend only on the effort of one department (usually IT), but on the business as a whole. Unfortunately, the so-called partnership is more a theoretical term than a fact, while the involvement of managers and business users is often suboptimal.

BI implementations also depend on the consultants’ skills and the degree to which they understand the business requirements, on the team’s cohesion and other project (management) related prerequisites, as well as on knowledge transfer and training.

Tools

Most of the BI tools available on the market don’t satisfy all business, respectively users’ requirements. Even if they excel in some features, they lack in others. Usually, more than one BI tool is needed to cover (most of) the requirements. When features are not available, or they are not mature enough, or they are difficult to learn, users will prefer to use tools they already know.

Another important consideration is that BI tools rely on data models that are often inflexible in terms of the data they provide and that make it hard to integrate additional datasets, algorithms and customizations. More recently, the overall requirements also need to be considered from the perspective of cloud computing technologies, which are steadily becoming a requirement for today’s business dynamics.

Maturity

Besides the fact that Capability Maturity Models (CMMs) are difficult to implement, organizations lack the know-how for transforming data into knowledge, respectively for understanding the data and evolving it further into wisdom and competitive advantage.

Most of the fancy words used by salesmen to sell a product don’t become reality overnight. Of course, a BI tool might have the potential to fulfill the various technical and nontechnical goals, though between a theoretical potential and harnessing it there is a long road that needs to be addressed at strategic, tactical and operational levels.

Infrastructure

Infrastructure refers to the human and technical components and the way they interact in getting the job done. It's not only about "breaking habits" and using the best tools, but also about aligning people and technologies to the desired level of performance, and about retaining and diffusing knowledge.


11 November 2008

Data Management: Data Quality - Information Systems' Perspective

Data Management Series

One LinkedIn user brought to attention the fact that, according to top IT managers, the top two reasons why CRM investments fail are: (1) managing resistance within the organization; (2) bad data quality.

The two reasons are common not only to CRM or BI solutions but also to other information systems, though of the two, data quality usually has the biggest impact. Especially in ERP systems data quality continues to be a problem, and here are a few reasons:

Processes span different functions and/or roles, each of them maintaining the data they are interested in, without any agreement or coordination on ownership. The lack of ownership is in general management’s fault.
Within an enterprise many systems end up being integrated, the quality of the data depending on the quality and scope of the integrations, whether they were addressed fully or only superficially. Few integrations are stable and properly designed. While stability can be obtained in time, the scope is seldom changed as it involves further investments, and thus the remaining data need to be maintained manually, while the issues need to be troubleshot or are left to accumulate in the backlog.
There are systems which are not integrated but use the same data, users needing to duplicate their effort, so they often focus on their immediate needs. Moreover, the lack of mappings between systems makes data analysis and review difficult.
Users lack knowledge about the systems used in terms of processes, procedures, best practices, policies, etc. They usually try to do their best based on the knowledge they have, and despite their best intent, the systems end up being misused just to get things done.
Basic or nonexistent validation for data entry at each important entry point (UI, integration interfaces, bulk upload functionality), system permissiveness (allowing workarounds), and issues with stability and reliability (bugs/defects).
The absence of data quality control mechanisms or quality methodologies, respectively of a Data and/or Quality Management strategy. If the data quality is not kept under review, it can easily decrease over time.
The lack of a data culture and processes that support data quality.
People lack the consistency and/or the self-discipline to follow the processes and update the data as the processes require, and not only the data needed to move to the next or final step. Therefore, the gap between reality and the picture presented by the system is considerable.
People are not motivated to improve data quality even if they may recognize the importance of doing so.

Data quality is usually ignored in BI projects because few people go and search for the causes, it being easier to blame the BI solution or the technical team than to do something about it. This is one of the reasons why users are reluctant to use a BI solution, to which add up the solution’s flexibility and the degree to which it satisfies the users’ needs. On the other side, BI solutions are often abused, for example by including reports that have OLTP characteristics or by providing too much unstructured or inadequate content that needs to be further reworked.

Data quality comes on the managers' agenda especially during ERP implementations. Unfortunately, it disappears almost as soon as it appears, even though managers are warned of the consequences poor data quality might have on the implementation and on further data use. An ERP implementation is supposed to be an opportunity for improving the data quality, though for many organizations it remains only an opportunity. Once it passes, organizations need more financial and human resources to achieve a fraction of what was missed.

The above topics are complex and need further discussion (see [1], [2]).


Written: Nov-2008, Last Reviewed: Mar-2024

Resources:
[1] SQL-Troubles (2010) Data Management: Data Quality - An Introduction
[2] SQL-Troubles (2012) Data Migration: Data Quality’s Perspective I - A Bird’s-Eye View

08 November 2008

SQL Reloaded: Dealing with data duplicates on SQL Server

Duplicates or duplications can be defined as "result rows that are identical to one or more other rows" [1]. They occur frequently in data collections (e.g. Excel, Access, etc.) with concurrent access that have minimal or no validation at all. The more users and/or the more frequent the updates on the same information, the higher the risk of having duplicated data. Duplicates also occur in modern architectures when validation wasn’t addressed correspondingly, or from usage related issues; for example, in information systems the huge amount of data entry makes it easy to overlook documents already processed (e.g. purchase orders, invoices, payments, etc.).

Subject to duplication are whole records, a group of attributes (fields) or only single attributes. It depends from case to case. Often duplicates are easy to identify - it’s enough to let somebody who has the proper knowledge look over them. But what do you do when the volume of data is too large or when the process needs to be automated as much as possible? Using the DISTINCT keyword in a SELECT statement might do the trick, while other times more complicated validation is required, ranging from simple checks to Data Mining techniques.

I will try to exemplify the techniques I use to deal with duplicates with the help of a simple example based on a table that tracks information about Assets:

-- create test table
CREATE TABLE [dbo].[Assets](
 [ID] [int] NOT NULL,
 [CreationDate] smalldatetime NOT NULL,
 [Vendor] [varchar](50) NULL,
 [Asset] [varchar](50) NULL,
 [Model] [varchar](50) NULL,
 [Owner] [varchar](50) NULL,
 [Tag] [varchar](50) NULL,
 [Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

Here's some test data:

-- insert test data (SQL Server 2000+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
INSERT INTO dbo.Assets
VALUES  ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
INSERT INTO dbo.Assets
VALUES  ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES  ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
INSERT INTO dbo.Assets
VALUES  ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

Let’s check table’s content:

-- review the data
SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:

ID	CreationDate	Vendor	Asset	Model	Owner	Tag	Quantity
1	1/29/2024 10:46:00 PM	IBM	Laptop 1	Model 1	Owner 1	XX0001	1
2	1/30/2024 10:46:00 PM	IBM	Laptop 2	Model 2	Owner 2	XX0002	1
3	1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1
4	1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1
5	1/31/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 3	DD0001	1
6	2/2/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 4	DD0001	1

Normally, a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case the identification of duplicates reduces to finding the records in which the Tag and Vendor combination appears more than once:

-- retrieve the duplicates
SELECT Vendor, Tag
FROM dbo.Assets A    
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output:

Vendor	Tag
Dell	DD0001
Microsoft	WX0001

In many cases showing only the duplicated values is enough, however there are cases in which it is required to see the whole records in which the duplicates appear, for example to clarify the context in which the issues occur. A simple JOIN with the base table provides the affected records:

-- retrieve duplicates with details
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- duplicates
	  SELECT Vendor, Tag
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  HAVING COUNT(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag

Output:

Id	CreationDate	Vendor	Asset	Model	Owner	Tag	Quantity
5	1/31/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 3	DD0001	1
6	2/2/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 4	DD0001	1
3	1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1
4	1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1

There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:

-- select unique records
SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:

CreationDate	Vendor	Asset	Model	Owner	Tag	Quantity
1/29/2024 10:46:00 PM	IBM	Laptop 1	Model 1	Owner 1	XX0001	1
1/30/2024 10:46:00 PM	IBM	Laptop 2	Model 2	Owner 2	XX0002	1
1/31/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 3	DD0001	1
1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1
2/2/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 4	DD0001	1

In our example only some combinations of attributes are duplicated while the other attributes might differ slightly, and therefore another approach is needed. First of all, we need to identify which record is the most reliable. In some cases the latest record entered should be the most accurate or closest to reality, but that’s not necessarily true. There are also cases in which we don’t care which record is selected, but from experience these cases are few.

Oracle and SQL Server provide the dense_rank() analytic function, which returns the rank of rows within the partition of a result set, without any gaps in the ranking. In our case the partition is determined by Vendor and Tag, and what remains is to identify the logic used for ranking. Supposing that we are always interested in the last record entered, the query would look like this:

-- retrieve nonduplicated records via ranking functions (SQL Server 2005+)
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery 
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING 
 FROM dbo.Assets 
) A 
WHERE RANKING = 1

Output:

Id	CreationDate	Vendor	Asset	Model	Owner	Tag	Quantity
6	2/2/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 4	DD0001	1
1	1/29/2024 10:46:00 PM	IBM	Laptop 1	Model 1	Owner 1	XX0001	1
2	1/30/2024 10:46:00 PM	IBM	Laptop 2	Model 2	Owner 2	XX0002	1
4	1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1

Things are simple when the records’ creation date is available and it's unique for the key used to validate the data, because a descending order allows selecting the last record first. An Id can be added to the ORDER BY clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number() analytic function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. The difference is that dense_rank() assigns the same rank to the rows that have the same values for the attributes in the ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a record unique within a partition, while row_number() returns a sequence no matter what attributes are used in the ORDER BY clause.

Both functions should provide similar performance. However, when it's not important which of the records within a partition is selected, it’s better to use the row_number() function, which should need less processing.
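For comparison, here is a minimal sketch of the row_number() variant described above. It reuses the dbo.Assets test table and the same partition and ordering as the dense_rank() query; the RowNumber alias is only an illustrative name. With the test data it should return the records with the Ids 1, 2, 4 and 6:

```sql
-- nonduplicated records via row_number (SQL Server 2005+)
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (-- number the records within each Vendor/Tag partition
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , row_number() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC, Id DESC) RowNumber
 FROM dbo.Assets
) A
WHERE RowNumber = 1
```

Because row_number() never returns ties, exactly one record per Vendor and Tag combination is kept even when the ORDER BY attributes are not unique; in that case, which of the tied records gets selected is arbitrary.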

Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most of the cases the unique identifier for a record is a sequential unique number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

Output:

Id	CreationDate	Vendor	Asset	Model	Owner	Tag	Quantity
4	1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1
2	1/30/2024 10:46:00 PM	IBM	Laptop 2	Model 2	Owner 2	XX0002	1
1	1/29/2024 10:46:00 PM	IBM	Laptop 1	Model 1	Owner 1	XX0001	1
6	2/2/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 4	DD0001	1

The same technique can be used to delete the duplicates from a table, by removing all the records except the last entry:

-- delete the duplicates, keeping the last entry (SQL Server 2000+)
DELETE A
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID <> B.MaxId
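On versions that support ranking functions, the deletion could also be sketched with row_number() and a common table expression. This is only one possible approach, not necessarily the one used originally; Ranked and RowNumber are illustrative names. With the test data the statement should remove the records with the Ids 3 and 5:

```sql
-- delete the duplicates via row_number (SQL Server 2005+)
-- the previous statement in the batch must be terminated with a semicolon
WITH Ranked AS (
 SELECT Id
 , row_number() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC, Id DESC) RowNumber
 FROM dbo.Assets
)
DELETE FROM Ranked -- the deletion is applied to the underlying dbo.Assets table
WHERE RowNumber > 1
```

It's safer to run the inner query as a SELECT first and review the records with RowNumber > 1 before deleting them.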

When an Id is not available, we most probably have to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria chosen for selection, and then apply the Id-based technique as above.
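A minimal sketch of the temporary table approach follows, assuming for the sake of the example that dbo.Assets had no Id column. The #Assets table and the SeqId and MaxSeqId names are hypothetical, and the sorting criteria (Vendor, Tag, CreationDate) are just one possible choice:

```sql
-- temporary table with a sequence (SQL Server 2000+)
CREATE TABLE #Assets(
 SeqId int IDENTITY(1,1) NOT NULL,
 CreationDate smalldatetime NOT NULL,
 Vendor varchar(50) NULL,
 Asset varchar(50) NULL,
 Model varchar(50) NULL,
 Owner varchar(50) NULL,
 Tag varchar(50) NULL,
 Quantity decimal(13, 2) NULL
)

-- the ORDER BY determines how the identity values are assigned
INSERT INTO #Assets (CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity)
SELECT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets
ORDER BY Vendor, Tag, CreationDate

-- nonduplicated records based on the highest sequence
SELECT A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM #Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(SeqId) MaxSeqId
	  FROM #Assets
	  GROUP BY Vendor, Tag
     ) B
      ON A.SeqId = B.MaxSeqId

-- clean up
DROP TABLE #Assets
```

When two records share the same Vendor, Tag and CreationDate, which of them receives the higher sequence is arbitrary, so further attributes should be added to the ORDER BY clause if the choice matters.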

Notes:
1. In other scenarios it’s important to select all the records matching the extreme values (first, last), in which case the dense_rank function comes in handy. For versions that don’t support it, a creation date attribute saves the day, when it's available and unique:

-- nonduplicated records (SQL server 2000+)  
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- last entry
	  SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output:

Id	CreationDate	Vendor	Asset	Model	Owner	Tag	Quantity
6	2/2/2024 10:46:00 PM	Dell	Laptop 4	Model 4	Owner 4	DD0001	1
1	1/29/2024 10:46:00 PM	IBM	Laptop 1	Model 1	Owner 1	XX0001	1
2	1/30/2024 10:46:00 PM	IBM	Laptop 2	Model 2	Owner 2	XX0002	1
3	1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1
4	1/31/2024 10:46:00 PM	Microsoft	Laptop 3	Model 3	Owner 2	WX0001	1

2. Duplicated records can also be created by merging two sets of data with UNION ALL. In such cases a simple UNION removes the duplicates from the combined result, however this burdens the database with more checks. It is recommended to find other solutions, when possible!
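The difference between the two operators can be sketched on the same test table. The two filters below are chosen only for illustration, because the record with Id 2 satisfies both of them:

```sql
-- UNION ALL keeps all the rows returned by both queries (5 rows with the test data)
SELECT Vendor, Tag
FROM dbo.Assets
WHERE Vendor = 'IBM'
UNION ALL
SELECT Vendor, Tag
FROM dbo.Assets
WHERE Owner = 'Owner 2'

-- UNION returns only the distinct rows (3 rows with the test data)
SELECT Vendor, Tag
FROM dbo.Assets
WHERE Vendor = 'IBM'
UNION
SELECT Vendor, Tag
FROM dbo.Assets
WHERE Owner = 'Owner 2'
```

Note that UNION also collapses the duplicates coming from within a single query (here the two Microsoft records), which is not always the intended behavior.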

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000 and 2005. Here’s the single multi-row insertion statement:

-- insert test data (SQL Server 2008+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should also work in Oracle with two amendments: the attributes’ types must be adapted to the Oracle ones, and the corresponding Oracle SYSDATE function should be used instead of SQL Server's GetDate() function, as below:

-- insert test data (Oracle)
INSERT INTO dbo.Assets
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
[1] S Hull (1998) Business Objects Glossary
[2] Microsoft Learn (2023) dense_rank()
[3] Microsoft Learn (2023) row_number()

ERP Systems: Learning about Oracle APPS internals I

How can I learn more about Oracle APPS (Oracle Applications) and from where? It's a normal question for every beginner, but also for experienced professionals.

Oracle made available documentation about its products through Oracle Technology Network and Metalink. The first source contains documents mainly as pdf files, while Metalink provides richer content and is easier to use; however, in order to access it, your company has to purchase an Oracle Support Identifier.

In Metalink, the documentation for Oracle Applications is grouped under the eTRM (Electronic Technical Reference Manuals) section, while the pdf documents can be found under the Oracle 11i Documentation Library. Many of them, especially for older versions, can also be found on the web with a simple search by table or file name.
Both sources are far from complete and there are many gaps, not to forget that many Oracle implementations also involve some customization; information about these changes can maybe be found in the documentation made during the implementation/customization process.

Lately many blogs on Oracle Applications internals have appeared, and even if many of them limit themselves to copying material from Metalink or other documents, there are also professionals who respect themselves.

People can learn a lot by checking the objects that unveil the APPS internals: APPS.FND_TABLES provides the list of tables used, while APPS.FND_VIEWS provides the list of views. The problem with the latter is that a search can't be done on the field that stores the views' script, but the data can be exported to a text file and searched there (exporting the data completely to Excel won’t work). In time developers come to intuit how the views could be named, so a search on their names can help narrow down the search.
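A search by name could be sketched as below. The TABLE_NAME and VIEW_NAME columns are assumptions that should be checked against the structure of the two objects (e.g. with DESCRIBE), while the PO prefix and the _V suffix are hypothetical naming patterns used only for illustration:

```sql
-- list the tables whose name starts with a given prefix (Oracle)
-- assumption: FND_TABLES exposes a TABLE_NAME column
SELECT TABLE_NAME
FROM APPS.FND_TABLES
WHERE TABLE_NAME LIKE 'PO\_%' ESCAPE '\'
ORDER BY TABLE_NAME;

-- narrow down the views by name (Oracle)
-- assumption: FND_VIEWS exposes a VIEW_NAME column
SELECT VIEW_NAME
FROM APPS.FND_VIEWS
WHERE VIEW_NAME LIKE 'PO\_%\_V' ESCAPE '\'
ORDER BY VIEW_NAME;
```

The ESCAPE clause is needed because the underscore is itself a single-character wildcard in LIKE patterns.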

Other professionals might be willing to help, so it's often a good idea to post questions on blogs, forums or social networks for professionals. Not all questions get answered, so rather than waiting for indirect enlightenment, it’s better to do some research in parallel too.

There will be cases in which none of the specified sources will help, and most probably you'll have to reverse-engineer Oracle Applications' internals by studying various business scenarios, in which case experienced users could help a lot.

Business Intelligence: Enterprise Reporting

Business Intelligence Series

Introduction

Let's suppose that your company invested a lot of money in an ERP system, and besides the complex setup many customizations were made. To increase the ERP system's value, monitor the operations and make accurate decisions you'll need some reports out of it. What do you do then?

In general, there are 5 types of reporting needs:

OLTP (On-Line Transaction Processing) system providing reports with actual (live) data;
OLAP (On-Line Analytical Processing) reports with drill-down, roll-up, slice and dice or pivoting functionality, working with historical data, the data source(s) being refreshed periodically;
ad-hoc reports – reports provided on request, often satisfying one time reports or reports with sporadic needs;
Data Mining tool(s) focusing on knowledge discovery (aka Data Science);
direct data access and analysis (aka self-service BI).

Standard Reports

ERP systems like Oracle Applications, Dynamics AX or SAP come by default with a set of (predefined) standard reports, which in theory cover the basic reporting needs. Unfortunately, the standard reports are not as flexible as expected, e.g. they can be exported only to text and/or in a non-tabular format, and are therefore impossible to reuse for detailed analysis, or they have inadequate filtering parameters/constraints, behavior or scope. If existing functionality has been customized, most probably the existing reports need to be adapted to the new logic. In the end customers need to change the existing reports or adopt an OLAP solution.

Vendors tend to keep their solutions secret and/or don't invest much time in documenting the systems' functionality. Therefore, the information about an ERP’s internals is limited, while good developers are hard to find or really expensive, and they often need to reinvent the wheel. ERP vendors do provide documentation about their systems' internals, though there are still many gaps concerning the tables’ structure and functionality. Fortunately, armed with enough patience and some knowledge about the existing business processes and databases, a developer can reverse-engineer an important part of the logic, though there's always a shade of doubt whether the logic is entirely correct or complete. The other good news is that more and more professionals blog on ERP topics, however few are the sources that bring something new.

OLAP Reporting

OLAP solutions presume the existence of a data warehouse that reflects the business model, and when intelligently built it can satisfy an important percentage of the BI requirements. Building a data warehouse or a set of data marts is an expensive and time-consuming endeavor and rarely manages to satisfy everybody’s needs. There are also vendors that provide commercial off-the-shelf data models and solutions, and at first sight they look like a good deal, however such models are inflexible and seldom cover all requirements. One can end up customizing and extending the model, running into all kinds of issues involving the model’s design, flexibility, quality, resources and costs.

There are many ways in which things can go wrong or be misused. One such scenario is when an OLAP system is used to satisfy OLTP reporting needs. It’s like using a city car in a cross-country race: you might manage to compete or even finish the race, if you are lucky enough, but don’t expect to make a success out of it!

Ad-hoc Reporting

The need for ad-hoc reports will be there no matter how complete and flexible your existing reports are. There are always new requirements that must be fulfilled in a timely manner and that can't rely on the long cycle time needed for an OLTP/OLAP report. Actually, many reports start as ad-hoc reports, and once their scope and logic have stabilized they are moved to the reporting solution. Talking about requirements for new reports, it's worth mentioning that many users don’t know exactly what they want, what is possible to get, what information it makes sense to show and at what level of detail in order to have a report that reflects reality.

In theory, a person is needed to facilitate the communication between the users and the development team, especially when the work is outsourced. Such a person should have a deep understanding of the business, of the ERP system and of the reporting possibilities; the deeper the knowledge, the shorter the delivery cycle time. Maybe such a person could be dispensable if the users and the developers have the required skill set and knowledge to define and interpret the requirements clearly, however I doubt that’s achievable on a large scale. On the other side, such responsibilities could be taken over by the IM or functional leaders that support the ERP system; it might work, at least in theory.

Data Mining

Data Mining tools and models are supposed to leverage the value of an ERP system beyond the functionality provided by analytic reports by helping to find hidden patterns and trends in data, and to elaborate predictions and estimates. Here I limit myself to saying that DM makes sense only when the business has reached a certain maturity, and I’m considering here mainly the cost/value ratio (the expected benefits needing to be greater than the costs) and the effort required from the business side in pursuing such a project.

Self-Service BI

There are situations in which the functionality provided by reporting tools doesn’t fulfill users’ requirements, one such situation being when users (aka data citizens) need to analyze data by themselves and to link data from different sources, especially Excel sheets. It’s true that vendors have tried to address such requirements, though I don’t think the solutions are mature enough, easy to use or allow users to go beyond their skills and knowledge.

Most such scenarios come down to accessing various sources over ODBC or directly using Excel or MS Access, such solutions being adequate more for personal use. The negative side is that people end up misusing them, often having a multitude of such solutions which would maybe make more sense to implement as reports.

There are managers who believe that such tools would eliminate the need for ad-hoc reports. It might be possible in isolated cases, though don’t expect users to be a Bill Inmon or Bill Gates!

Conclusion

All tools have their limitations, no matter how complex they are, and I believe that a single reporting tool or platform will not always address all requirements. Each such tool needs a support team and even a center of excellence, so make sure that you have the resources, knowledge and infrastructure to support them!


29 October 2008

W3: Resource Description Framework (Definitions)

"A framework for constructing logical languages that can work together in the Semantic Web. A way of using XML for data rather than just documents." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"An application of XML that enables the creation of rich, structured, machine-readable resource descriptions." (J P Getty Trust, "Introduction to Metadata" 2nd Ed., 2008)

"An example of ‘metadata’ language (metadata = data about data) used to describe generic ‘things’ (‘resources’, according to the RDF jargon) on the Web. An RDF document is a list of statements under the form of triples having the classical format: <object, property, value>, where the elements of the triples can be URIs (Universal Resource Identifiers), literals (mainly, free text) and variables. RDF statements are normally written into XML format (the so-called ‘RDF/XML syntax’)." (Gian P Zarri, "RDF and OWL for Knowledge Management", 2011)

"The basic technique for expressing knowledge on The Semantic Web." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A graph model for describing formal Web resources and their metadata, to enable automatic processing of such descriptions." (Mahdi Gueffaz, "ScaleSem Approach to Check and to Query Semantic Graphs", 2015)

"Specified by W3C, is a conceptual data modeling framework. It is used to specify content over the World Wide Web, most commonly used by Semantic Web." (T R Gopalakrishnan Nair, "Intelligent Knowledge Systems", 2015)

"Resource Description Framework (RDF) is a framework for expressing information about resources. Resources can be anything, including documents, people, physical objects, and abstract concepts." (Fu Zhang & Haitao Cheng, "A Review of Answering Queries over Ontologies Based on Databases", 2016)

"Resource Description Framework (RDF) is a W3C (World Wide Web Consortium) recommendation which provides a generic mechanism for representing information about resources on the Web." (Hairong Wang et al, "Fuzzy Querying of RDF with Bipolar Preference Conditions", 2016)

"Resource Description Framework (RDF) is a W3C recommendation that provides a generic mechanism for giving machine readable semantics to resources. Resources can be anything we want to talk about on the Web, e.g., a single Web page, a person, a query, and so on." (Jingwei Cheng et al, "RDF Storage and Querying: A Literature Review", 2016)

"The Resource Description Framework (RDF) metamodel is a directed graph, so it identifies one node (the one from which the edge is pointing) as the subject of the triple, and the other node (the one to which the edge is pointing) as its object. The edge is referred to as the predicate of the triple." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"Resource description framework (RDF) is a family of world wide web consortium (W3C) specifications originally designed as a metadata data model." (Senthil K Narayanasamy & Dinakaran Muruganantham, "Effective Entity Linking and Disambiguation Algorithms for User-Generated Content (UGC)", 2018)

"A framework for representing information on the web." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"Resource description framework (RDF) is a W3C (World Wide Web Consortium) recommendation which provides a generic mechanism for representing information about resources on the web." (Zongmin Ma & Li Yan, "Towards Massive RDF Storage in NoSQL Databases: A Survey", 2019)

"It is a language that allows to represent knowledge using triplets of the subject-predicate-object type." (Antonio Sarasa-Cabezuelo & José Luis Fernández-Vindel, "A Model for the Creation of Academic Activities Based on Visits", 2020)

"The RDF is a standard for representing knowledge on the web. It is primarily designed for building the semantic web and has been widely adopted in database and datamining communities. RDF models a fact as a triple which consists of a subject (s), a predicate (p), and an object (o)." (Kamalendu Pal, "Ontology-Assisted Enterprise Information Systems Integration in Manufacturing Supply Chain", 2020)

"It is a language that allows to represent knowledge using triplets of the subject-predicate-object type." (Antonio Sarasa-Cabezuelo, "Creation of Value-Added Services by Retrieving Information From Linked and Open Data Portals", 2021)

"Resource Description Framework, the native way of describing linked data. RDF is not exactly a data format; rather, there are a few equivalent formats in which RDF can be expressed, including an XML-based format. RDF data takes the form of ‘triples’ (each atomic piece of data has three parts, namely a subject, predicate and object), and can be stored in a specialised database called a triple store." ("Open Data Handbook")

26 October 2008

GSCM: Kanban (Definitions)

"In lean cellular manufacturing, a visual device, such as a card, floor space (kanban square), or production bin, which communicates to a cell that additional materials or products are demanded from the subsequent cell." (Leslie G Eldenburg & Susan K Wolcott, "Cost Management" 2nd Ed., 2011)

"A card-based techniques for authorizing the replenishment of materials." (Daryl Powell, "Integration of MRP Logic and Kanban Shopfloor Control", 2014)

"A just-in-time technique that uses kanban cards to indicate when a production station needs more parts. When a station is out of parts (or is running low), a kanban card is sent to a supply station to request more parts." (Rod Stephens, "Beginning Software Engineering", 2015)

"A note, card, or signal, a Kanban used to trigger a series of processes, usually downstream in the supply chain, in order complete tasks, products, and/or services. As part of a workflow management systems, timely Kanbans allow for efficient operations that enable agile, just-in-time (JIT), and lean philosophies to work." (Alan D Smith, "Lean Principles and Optimizing Flow: Interdisciplinary Case Studies of Best Business Practices", 2019)

"Agile method to manage work by limiting work in progress. Team members pull work as capacity permits, rather than work being pushed into the process when requested. Stimulates continuous, incremental changes. Aims at facilitating change by minimizing resistance to it." (Jurgen Janssens, "Managing Customer Journeys in a Nimble Way for Industry 4.0", 2019)

"This tool is used in pull systems as a signaling device to trigger action. Traditionally it used cards to signal the need for an item. It can trigger the movement, production, or supply of a unit in a production chain." (Parminder Singh Kang et al, "Continuous Improvement Philosophy in Higher Education", 2020)

"A signal that communicates a requirement for a quantity of product." (Microsoft, "Dynamics for Finance and Operations Glossary")

"A signaling device that gives instruction for production or conveyance of items in a pull system. Can also be used to perform kaizen by reducing the number of kanban in circulation, which highlights line problems." (Lean Enterprise Institute)

25 October 2008

GSCM: Supply Chain Management (Definitions)

"The practice of designing and optimizing supply chain business processes to provide superior service to those customers who drive the bulk of one’s profit." (Steve Williams & Nancy Williams, "The Profit Impact of Business Intelligence", 2007)

"The management of business units in the provision of products and services. It spans the movement and storage of raw materials, work-in-process inventory, and finished goods from point-of-origin to point-of-consumption." (Tony Fisher, "The Data Asset", 2009)

"Software tools or modules used in the planning, scheduling, and control of supply chain transactions (spanning raw materials to finished goods from point of origin to point of consumption), managing supplier relationships, and controlling associated business processes." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

"To provision products or services to a network of interconnected businesses." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"The management of all of the activities along the supply chain, from suppliers, to internal logistics within a company, to distribution, to customers. This includes ordering, monitoring, and billing." (Linda Volonino & Efraim Turban, "Information Technology for Management 8th Ed", 2011)

"The process of ensuring optimal flow of inputs and outputs." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"In basic terms, supply chain is the system of organizations, people, activities, information and resources involved in moving a product or service from supplier to customer. The configuration and management of supply chain operations is a key way companies obtain and maintain a competitive advantage." (Alan D Smith, "Lean Principles and Optimizing Flow: Interdisciplinary Case Studies of Best Business Practices", 2019)

"Supply chain management (SCM) refers to the processes of creating and fulfilling demands for goods and services. It encompasses a trading partner community engaged in the common goal of satisfying end customers." (Gartner)

24 October 2008

GSCM: Supply Chain (Definitions)

"Fulfillment process from customer purchase through manufacturing, factory, raw material, and component supplier." (Timothy J Kloppenborg et al, "Project Leadership", 2003)

"The network of suppliers that provides raw materials, components, subassemblies, subsystems, software, or complete systems to your company." (Clyde M Creveling, "Six Sigma for Technical Processes: An Overview for R&D Executives, Technical Leaders, and Engineering Managers", 2006)

"The supply chain refers to the processes and methods supporting the physical existence of a product from the procurement of materials through the production, storage (creating inventory), and movement (logistics) of the product into its chosen distribution channels." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"A pipeline composed of multiple companies that perform any of the following functions: procurement of materials, transformation of materials into intermediate or finished products, distribution of finished products to retailers or customers, recycling or disposal in a landfill." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"Flow of resources from the initial suppliers (internal or external) through the delivery of goods and services to customers and clients." (Leslie G Eldenburg & Susan K Wolcott, "Cost Management" 2nd Ed, 2011)

"The optimal flow of product from site of production through intermediate locations to the site of final use." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The people and processes involved in the production and distribution of goods or services." (DK, "The Business Book", 2014)

"The channel of distribution that enables products to be delivered from the supplier to the final buyer." (Gökçe Ç Ceyhun, "An Assessment for Classification of Distribution Network Design", 2020)

"A system of organizations, people, activities, information, and resources, possibly international in scope, that provides products or services to consumers." (CNSSI 4009-2015)

"Linked set of resources and processes between multiple tiers of developers that begins with the sourcing of products and services and extends through the design, development, manufacturing, processing, handling, and delivery of products and services to the acquirer." (NIST SP 800-37)

"The network of retailers, distributors, transporters, storage facilities, and suppliers that participate in the sale, delivery, and production of a particular product." (NIST SP 800-98)

28 September 2008

W3: Semantic Web (Definitions)

"The Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it." (Tim Berners-Lee, "Weaving the Web", 1999)

"An evolving, collaborative effort led by the W3C whose goal is to provide a common framework that will allow data to be shared and re-used across various applications as well as across enterprise and community boundaries." (J P Getty Trust, "Introduction to Metadata" 2nd Ed, 2008)

"Communication protocols and standards that would include descriptions of the item on the Web such as people, documents, events, products, and organizations, as well as, relationship between documents and relationships between people." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"The Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it. The principle that one should represent separately the essence of a document and the style is presented." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"A machine-processable web of smart data, [where] smart data is data that is application-independent, composeable, classified, and part of a larger information ecosystem (ontology)." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"An evolving extension of the Web in which Web content can be expressed not only in natural language but also in a form that can be understood, interpreted, and used by intelligent computer software agents, permitting them to find, share, and integrate information more easily." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The next-generation Internet in which all content is tagged with semantic tags defined in published ontologies. Interlinking these ontologies will allow software agents to reason about information not directly connected by document creators." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"is a term coined by World Wide Web Consortium (W3C) director Sir Tim Berners-Lee. It describes methods and technologies to allow machines to understand the meaning - or 'semantics'- of information on the World Wide Web." (Jingwei Cheng et al, "RDF Storage and Querying: A Literature Review", 2016)

"The vision of a Semantic Web world builds upon the web world, but adds some further prescriptions and constraints for how to structure descriptions. The Semantic Web world unifies the concept of a resource as it has been developed in this book, with the web notion of a resource as anything with a URI. On the Semantic Web, anything being described must have a URI. Furthermore, the descriptions must be structured as graphs, adhering to the RDF metamodel and relating resources to one another via their URIs. Advocates of Linked Data further prescribe that those descriptions must be made available as representations transferred over HTTP." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"A collaborative effort to enable the publishing of semantic machine-readable and shareable data on the Web." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

16 September 2008

W3: Cyberspace (Definitions)

"A term used to describe the nonphysical, virtual world of computers." (Andy Walker, "Absolute Beginner’s Guide To: Security, Spam, Spyware & Viruses", 2005)

"A metaphoric abstraction for a virtual reality existing inside computers and on computer networks." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The online world of computer networks where people can interact with others without physically being with them. People commonly interact with cyberspace via the Internet." (Darril Gibson, "Effective Help Desk Specialist Skills", 2014)

"The interdependent network of information technology infrastructures, which includes the Internet, telecommunications networks, computer systems, and embedded processors and controllers." (Olivera Injac & Ramo Šendelj, "National Security Policy and Strategy and Cyber Security Risks", 2016)

"A complex hyper-dimensional space involving the state of many mutually dependent computer and network systems with complex and often surprising properties as compared to physical space." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"Artifacts based on or dependent on computer and communications technology; the information that these artifacts use, store, handle, or process; and the interconnections among these various elements." (William Stallings, "Effective Cybersecurity: A Guide to Using Best Practices and Standards", 2018)

"Refers to a physical and non-physical terrain created by and/or composed of some or all of the following: computers, computer systems, networks, and their computer programs, computer data, content data, traffic data, and users." (Thokozani I Nzimakwe, "Government's Dynamic Approach to Addressing Challenges of Cybersecurity in South Africa", 2018)

"Cyberspace, is supposedly 'virtual' world/network created by links between computers, Internet-enabled devices, servers, routers, and other components of the Internet’s infrastructure." (Sanjeev Rao et al, "Online Social Networks Misuse, Cyber Crimes, and Counter Mechanisms", 2021)

31 August 2008

SQL Server New Features: ROWCOUNT in action

Especially when working with big tables, the default behaviour of Query Analyzer is not to show any output until the last record has been fetched. This can be time- and resource-consuming, so I’ve appreciated the fact that TOAD and SQL Developer fetch only a certain number of records. Now I can see that the same can be done from SQL Server 2005 onward by changing the ROWCOUNT setting, either from the Query/Query Options menu (Query Options under SQL Server 2008) or by running the command: SET ROWCOUNT <number of records>;

Of course, one can also limit the number of records returned by a query using the TOP clause in SQL Server or the ROWNUM pseudocolumn in Oracle, though I don't always find that handy; it depends from case to case. There are also technical differences between the two approaches: according to SQL Server Books Online, TOP with SELECT is recommended over SET ROWCOUNT with regard to scope and query optimization, although in this context only the latter makes sense. Books Online explains the recommendation as follows:

"As a part of a SELECT statement, the query optimizer can use the value of expression in the TOP clause as part of generating an execution plan for a query. Because SET ROWCOUNT is used outside a statement that executes a query, its value cannot be used to generate a query plan for a query."

Notes:

1. Do not confuse SET ROWCOUNT with the @@ROWCOUNT function, which returns the number of rows affected by the last statement.

2. Some of us list all the records just to see the number of records returned by a query, though that’s not advisable at all!
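To make the difference between the options concrete, here is a minimal T-SQL sketch. The table dbo.BigTable is hypothetical and stands for any large table; the Oracle line at the end is only a comment showing the ROWNUM counterpart.

```sql
-- dbo.BigTable is a hypothetical table used only for illustration.

-- Session-level limit: stays in effect for the following statements
-- of the session until it is changed or reset.
SET ROWCOUNT 100;
SELECT * FROM dbo.BigTable;          -- returns at most 100 rows
SELECT @@ROWCOUNT AS RowsReturned;   -- rows affected by the previous statement

SET ROWCOUNT 0;                      -- 0 switches the limit off again

-- Statement-level limit: part of the query itself,
-- so the optimizer can take it into account.
SELECT TOP (100) * FROM dbo.BigTable;

-- Counting the records without listing them (see note 2).
SELECT COUNT(*) AS NumberOfRecords FROM dbo.BigTable;

-- Oracle counterpart of the statement-level limit:
-- SELECT * FROM big_table WHERE ROWNUM <= 100;
```

Because SET ROWCOUNT keeps applying to later statements in the same session, it is worth resetting it to 0 once the quick check is done.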

AdventureWorks requires FILESTREAM enabled

Surprises, surprises, surprises, the programmer's world is full of them! Just when you say that everything is OK, you discover that something went wrong. I was expecting to have the AdventureWorks database installed, though I hadn't checked that. I realized today that it’s missing, so I tried to reinstall it, this time enabling the “Restore AdventureWorks DBs” feature, though I got another nice error:

Setup failed for MSSQLSERVER. The following features are missing: FILESTREAM Fix the problems and re-run setup.

Guy Burstein wrote in his blog that FILESTREAM support can be enabled using the following SQL command: exec [dbo.sp_filestream_configure] @enable_level = 3; I tried that and got another error:

Msg 2812, Level 16, State 62, Line 1 Could not find stored procedure 'sp_filestream_configure'

Checking my local installation of SQL Server Books Online, I found no trace of the sp_filestream_configure stored procedure, but I found that FILESTREAM support can be enabled using the sp_configure stored procedure, as below:

EXEC sp_configure filestream_access_level, 2
RECONFIGURE
GO

Once I executed the 3 lines together, I got the following confirmation message which, amusingly, still recommends that I run the RECONFIGURE statement even though I did; presumably the message is printed by sp_configure itself, before RECONFIGURE runs in the same batch. Anyway, better redundant information than none…

Configuration option 'filestream access level' changed from 2 to 2. Run the RECONFIGURE statement to install.
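Rather than relying on the message, the setting can be verified directly. The following is a minimal sketch, assuming a SQL Server 2008 or later instance; it only reads the configuration and changes nothing.

```sql
-- Configured value vs. the value currently in use for the option
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'filestream access level';

-- FILESTREAM levels as reported by the instance
SELECT SERVERPROPERTY('FilestreamConfiguredLevel') AS ConfiguredLevel
     , SERVERPROPERTY('FilestreamEffectiveLevel')  AS EffectiveLevel;
```

If the effective level stays below the configured one, a likely cause is that FILESTREAM has not been enabled for the instance at the Windows level (SQL Server Configuration Manager).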

30 August 2008

Oracle Troubleshooting: No records returned by queries (Checklist)

No records returned by a query even if there should be results? Usually I’m using the following checklist:

1. check if the tables contain data. Silly but effective, especially in Oracle APPS, in which some tables were deprecated and replaced by tables with similar names (PA_PROJECTS_ALL vs. PA_PROJECTS), though that can happen in other environments too;

2. check if the JOIN syntax is correct;

3. check if one of the columns used in the JOIN has only NULL values;

4. check if the constraints used in the WHERE clause make sense (e.g. wrong values or syntax);

5. for Oracle-flavored queries, check if the WHERE clause contains a column that is not qualified with the table name or alias while being available in more than one table used in the query. This Oracle bug is really dangerous when doing fast query checks!

6. for Oracle (APPS), check whether the query or view uses the USERENV function with the LANG or LANGUAGE text parameter, normally a constraint like: TABLE1.LANGUAGE = USERENV('LANG').

The problem with such queries comes when the user’s system language is other than the one expected, and thus the query’s output might not be as expected. Usually it is preferable to hardcode the value, when possible: TABLE1.LANGUAGE = 'US'

Note: The tools used to run a query can also create issues. For example, a query run under Oracle’s SQL Developer was not returning records even though it did in TOAD. The problem was solved by installing a newer SQL Developer version.
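Several of the checks above can be done with quick queries. The following is a minimal sketch in Oracle SQL; table1, table2 and their columns (column1, column2, status) are hypothetical placeholders, not objects from Oracle APPS.

```sql
-- table1, table2 and their columns are hypothetical placeholders.

-- 1. Do the tables contain data at all?
SELECT COUNT(*) FROM table1;

-- 3. Does the JOIN column contain anything other than NULL?
--    COUNT(column) ignores NULLs, COUNT(*) does not.
SELECT COUNT(*)       AS total_rows
     , COUNT(column1) AS non_null_values
FROM table1;

-- 5. Qualify every column with its table alias.
SELECT A.column1, B.column2
FROM table1 A
     JOIN table2 B
       ON A.column1 = B.column1
WHERE B.status = 'ACTIVE';   -- B.status rather than just status

-- 6. Which language does the current session report?
SELECT USERENV('LANG') FROM dual;
```

If the second query returns non_null_values = 0, every join on that column will come back empty, no matter how correct the rest of the query is.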

Oracle Troubleshooting: ANSI 92 JOIN syntax error

Lately I’ve been working a lot with Oracle APPS, doing mainly ad-hoc reporting. One of my nightmares is an Oracle bug related to ANSI 92 syntax:

“ORA-01445: cannot select ROWID from, or sample, a join without a key-preserved table”

Unfortunately, even if the bug was solved by Oracle, it seems the update was missed on some servers and the bug haunts my queries almost on a daily basis.

Having a SQL Server background, and for the sake of code clarity, I prefer the ANSI 92 JOIN syntax:

SELECT A.column1, B.column2
FROM table1 A JOIN table2 B
 ON A.column1 = B.column2

instead of the old-fashioned syntax:

SELECT A.column1, B.column2
FROM table1 A , table2 B
WHERE A.column1 = B.column2

In theory the two queries should provide the same output and have, hopefully, similar performance. The problem with ANSI 92 syntax is that, on some Oracle installations, when the number of joins exceeds a certain limit, usually greater than 7, the above error is thrown.

What one can do is reduce the number of joins to the main table by restructuring the query: group several tables into subqueries, which are then joined to the main table. For tables from which only one column is returned, one can move the lookup into the SELECT list as a correlated subquery.
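The restructuring could look like the sketch below. All table and column names (table1 to table4, column1 to column3, description) are hypothetical, and whether this actually avoids the error depends on the installation.

```sql
-- All table and column names are hypothetical placeholders.
SELECT A.column1
     , BC.column2
     , BC.column3
     , (-- single-column lookup moved from the FROM clause into the SELECT list
        SELECT D.description
        FROM table4 D
        WHERE D.column1 = A.column1) AS description
FROM table1 A
     JOIN (-- two tables grouped into one subquery, so the main
           -- table takes part in one join instead of two
           SELECT B.column1, B.column2, C.column3
           FROM table2 B
                JOIN table3 C
                  ON B.column2 = C.column2
          ) BC
       ON A.column1 = BC.column1
```

Note that the subquery in the SELECT list must return at most one row per record of the main table, and that it behaves like a LEFT JOIN rather than an inner join: when no match exists, the column is NULL instead of the record being dropped.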

Happy coding!
