23 January 2010

Data Warehousing: ETL and Synchronization


Data Warehousing
Data Warehousing Series

In a previous post on ETL: SSIS packages vs. SQL code I mentioned that there are three types of synchronization - of scope, business logic and data. The first type targets synchronizing the filters used to select the data, the second to synchronize the logic used in data processing, when is the case, and third to work with the same unaltered copy of the data.

The synchronization of scope is achieved by enforcing the same set of constraints to related data elements with the purpose of keeping the referential integrity between the various data elements. They involve parent-child relations like Purchase orders (PO) Headers/Lines or reference/referenced relations like PO Lines and Products or Vendors. Such relationships can involve datasets coming from distinct systems, usually involving different architecture - e.g., Products or Bill of Materials master data that could come from a Product Management Information (PMI) system.

The synchronization of business logic is usually a consequence of the synchronization of scope and concerns the set of business rules that must be applied in the logic used. For example, is need to consider Products or Vendors only for the open POs. Therefore, that would involve the duplication of logic for open POs across multiple database objects, logic that needs to be synchronized accordingly. One can reference directly the object encapsulating the logic for open POs, though that would create a recursive reference as the logic references the Products and/or Vendors as well. With one or two tricks this can be avoided.

Synchronizations can involve local logic, e.g. using exchange rates conversions and other transformations where a unique user-defined function could allow achieve this. No matter of the techniques used, one must make sure that all the logic is kept in synch.

The synchronization of data requires working with the same unaltered data in the various threads of processing, either SSIS packages or SQL code; that's quite simple as concept, though not always straightforward to achieve and occasionally ignored. For this purpose, it makes sense to load the data into an intermediary database or staging area, even for slowly changes data sources like data warehouses, and base the business logic on the respective local dataset(s) rather than loading the data repeatedly for each package from the source, which can also involve considerable additional network traffic. Why the alternative is not the best approach?

Supposing that two packages A and B are scheduled to run at the same time, even if the requests are sent simultaneously to the database, the simultaneously is relative, because most of the time the requests are queued, in general following to be processed in the order they came, though this depends on legacy system’s architecture and settings. Supposing that there is some time elapsed between the times the two requests are processed, there are good chances that one record was created, deleted, or updated in between.

The impact of changes in data could be minimum though strange situations might result with unpredictable impact. There are chances to find the issue when loading the data in the destination system, this if adequate validation is performed, though there are good chances for the issues to remain undiscovered until later, with all consequences resulting from this. Therefore, one should also build validation logic separately when feasible.

It should be targeted to cover all three types of synchronization even if it complicates the design of the solutions, as it's recommended in general to apply defensive architecting/programming. The increase in complexity is relative if one considers the effort needed for troubleshooting, plumbing or even redesign that needs to be done to fix the issues.


Created: Jan-2010, Last Reviewed: Mar-2024

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.