11 March 2010

SSIS: Data Flow Task (Third Magic Class)

     Independently of the data sources or destinations used, all ETL tasks have as their cornerstone the creation of a Data Flow Task. While in the previous two tutorials I showed how to export data from an Excel file (see First Magic Class), respectively from an Oracle database (see Second Magic Class), using the SQL Server Import and Export Wizard, in this tutorial I will show how you can create an SSIS package from scratch using the SQL Server 2008 Business Intelligence Development Studio.

Step 1: Create the View
   Before starting to create the SSIS package, and especially when dealing with multiple tables, it’s useful to encapsulate the logic in a view. This step allows you primarily to capture, test, refine and stabilize the requirements before creating the package (a quick test of the view is sketched right after its code), thus minimizing the changes made to the package itself, and secondly to reuse the logic in other packages or database objects. Here’s the source code for the vProducts view that will be used in this tutorial: 
 
CREATE VIEW [Production].[vProducts] 
-- Products (view) 
AS 
SELECT ITM.ProductID 
,ITM.ProductSubcategoryID 
,ITM.ProductModelID 
,ITM.Name ProductName 
,ITM.ProductNumber 
,PPM.Name ProductModel 
,PPM.CatalogDescription 
,PPS.Name ProductSubcategory 
,PPC.Name ProductCategory 
,ITM.MakeFlag 
,ITM.FinishedGoodsFlag 
,ITM.Color 
,ITM.SafetyStockLevel 
,ITM.ReorderPoint 
,ITM.StandardCost 
,ITM.ListPrice 
,ITM.Size 
,ITM.SizeUnitMeasureCode 
,SUOM.Name SizeUnitMeasure 
,ITM.WeightUnitMeasureCode 
,WUOM.Name WeightUnitMeasure 
,ITM.Weight 
,ITM.DaysToManufacture 
,ITM.ProductLine 
,ITM.Class 
,ITM.Style 
,ITM.SellStartDate 
,ITM.SellEndDate 
,ITM.DiscontinuedDate 
,ITM.ModifiedDate 
FROM Production.Product ITM 
    LEFT JOIN Production.UnitMeasure SUOM 
       ON ITM.SizeUnitMeasureCode = SUOM.UnitMeasureCode 
    LEFT JOIN Production.UnitMeasure WUOM 
        ON ITM.WeightUnitMeasureCode = WUOM.UnitMeasureCode 
    LEFT JOIN Production.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
    LEFT JOIN Production.ProductSubcategory PPS 
        ON ITM.ProductSubcategoryID = PPS.ProductSubcategoryID 
    LEFT JOIN Production.ProductCategory PPC 
       ON PPS.ProductCategoryID = PPC.ProductCategoryID 
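
    Once the view is created it’s worth testing it directly, before any package work starts; a quick sanity check could be as simple as the following query (just an illustrative check against the view defined above): 

SELECT TOP 10 * 
FROM Production.vProducts 
ORDER BY ProductName 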

Note:
    The reuse of database objects like views, stored procedures and user-defined functions comes with its costs: even if they help simplify the maintenance of the business logic, a change impacts the whole set of objects and applications referencing the changed object, so, for example, some of the SSIS packages using the respective object could be invalidated. It’s therefore useful to keep a list of the objects and (especially) applications using your database objects, just in case you want to easily validate the impact of a change.
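
     For example, within the database itself the objects referencing the view can be listed with the sys.dm_sql_referencing_entities dynamic management function introduced with SQL Server 2008 (just a sketch; SSIS packages and external applications won’t show up here and still have to be tracked separately): 

SELECT referencing_schema_name 
, referencing_entity_name 
FROM sys.dm_sql_referencing_entities ('Production.vProducts', 'OBJECT') 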

Step 2: Create the SSIS Project
     From the Start/Microsoft SQL Server 2008 folder open SQL Server Business Intelligence Development Studio and from its menu choose File/New/Project. In the New Project dialog choose Integration Services Project as Template, enter a meaningful (Project) Name (e.g. SSIS Package Template), specify the Location (e.g. D:\Work\SQL Server) and optionally change the default Solution Name, though that’s not normally required, then click ‘OK’. 
 
SSIS - New Project
 
  The newly created project shell will look as below, with the default Package.dtsx (you can optionally rename the package): 
 
SSIS - Project Shell
 
Step 3: Create the Data Source
     In the Solution Explorer click on Data Sources/New Data Source in order to open the Data Source Wizard, in which you can create a new Data Source pointing to the AdventureWorks database using the Connection Manager as in the previous two tutorials or, in case the source is already available, just select it.

Note:   
      It is not strictly required to create a Data Source explicitly, though this approach allows you to manage the connections more easily, especially when the same connection will be reused in multiple packages.
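
     Behind the scenes such a Data Source is stored as a .ds file within the project and reduces to a connection string referenced by the packages’ connection managers; for a local AdventureWorks database it would look something like the one below (the server name and provider version are assumptions that depend on your environment): 

Data Source=.;Initial Catalog=AdventureWorks;Provider=SQLNCLI10.1;Integrated Security=SSPI;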

Step 4: Create the Data Flow
     Within Package.dtsx, double click on the Data Flow Task in the Toolbox, the task being added to the Control Flow tab; then, in order to add components to it, click on the Data Flow tab and from the Toolbox add an ‘OLE DB Source’ and an ‘OLE DB Destination’.
     Double click on the ‘OLE DB Source’ and in the ‘OLE DB Source Editor’ dialog just opened select the defined Data Source (e.g. AdventureWorks) as the OLE DB connection manager, choose ‘SQL command’ from the list of available options for the ‘Data access mode’, and in the ‘SQL command text’ box below paste the following query: 

SELECT ITM.ProductID 
,ITM.ProductName 
,ITM.ProductNumber 
,ITM.ProductModel 
,ITM.ProductSubcategory 
,ITM.ProductCategory 
,ITM.MakeFlag 
,ITM.FinishedGoodsFlag 
,ITM.Color 
,ITM.StandardCost 
,ITM.ListPrice 
,ITM.Class 
,ITM.Style 
,ITM.SellStartDate 
,ITM.SellEndDate 
,ITM.DiscontinuedDate 
,ITM.ModifiedDate 
FROM Production.vProducts ITM 
WHERE ProductCategory = 'Bikes' 
 
SSIS - OLE DB Source Editor 
Notes:
1.    The vProducts view could have included only the above attributes and the required constraint(s), though you always have to consider possible changes in scope; it is typically preferable to include all the attributes the business is interested in, restricting the list only when the number of attributes is quite large or some of them are not used.
2.     Instead of using the ‘SQL command’ data access mode you could have chosen ‘Table or view’ and thus used the vProducts view directly, in the end selecting only the needed columns in the Columns tab of the ‘OLE DB Source Editor’.
SSIS - OLE DB Source Editor Columns

   Once the data source has been created, drag the green connector over the ‘OLE DB Destination’ and double click on the latter in order to access the ‘OLE DB Destination Editor’. In ‘OLE DB connection manager’ select the destination connection, which can be the same connection used for the source, and select ‘Table or view – fast load’ as ‘Data access mode’. As the data need to be dumped to a table, either use an existing table or create one on the fly by clicking on the ‘New’ button next to the ‘Name of the table or the view’ control. Based on the source attributes, SQL Server already provides the script needed to create the table; all that’s needed is to change the default table name (e.g. OLE DB Destination) to a meaningful one (e.g. Production.Bikes):
CREATE TABLE [Production].[Bikes] (
[ProductID] int,
[ProductName] nvarchar(50),
[ProductNumber] nvarchar(25),
[ProductModel] nvarchar(50),
[ProductSubcategory] nvarchar(50),
[ProductCategory] nvarchar(50),
[MakeFlag] bit,
[FinishedGoodsFlag] bit,
[Color] nvarchar(15),
[StandardCost] money,
[ListPrice] money,
[Class] nvarchar(2),
[Style] nvarchar(2),
[SellStartDate] datetime,
[SellEndDate] datetime,
[DiscontinuedDate] datetime,
[ModifiedDate] datetime
)

    Because it is useful to keep the identity key (just in case you want to validate the logic) and the null values (unless you want to replace them with default values), check the ‘Keep identity’ and ‘Keep nulls’ checkboxes.
SSIS - OLE DB Destination Editor

Step 5: Testing the Data Flow Task
     Save the package and from the menu select Debug/Start Debugging in order to run it. Normally the package should run without problems, the tasks/components turning green as they are processed:
SSIS - Running the Package
     As always, it is a good idea to check whether the data were saved as expected: 
SELECT * 
FROM [Production].[Bikes] 
ORDER BY ProductName 
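
     A quick row-count comparison between the filtered view and the destination table offers an additional confirmation that nothing was lost on the way (a minimal check based on the objects created above): 

SELECT (SELECT count(*) 
FROM Production.vProducts 
WHERE ProductCategory = 'Bikes') SourceRecords 
, (SELECT count(*) 
FROM [Production].[Bikes]) TargetRecords 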


Step 6: Delete the Previous Records
     The next time the package is run the data will be appended to the respective table regardless of the already existing records, therefore the previous records need to be deleted before the data are dumped to the table. For this, in the Control Flow tab add an Execute SQL Task and in its editor select the AdventureWorks Data Source as Connection, while in the SQL Statement add the following statement: 
 
TRUNCATE TABLE [Production].[Bikes]

SSIS - Delete Previous Records 
 
Notes:
1.    It’s preferable to TRUNCATE the table instead of using a DELETE statement: the latter logs the deletion of each individual row in the transaction log, while TRUNCATE deallocates the table’s pages, no individual record deletion being actually performed (see the comparison sketched after these notes).
2.    If you click on the ‘Build Query’ button, because the Query Builder doesn’t support DDL statements you’ll get the following syntax error: “The TRUNCATE TABLE SQL construct or statement is not supported”. Just ignore the error message!
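
    For illustration, here are the two cleanup alternatives side by side; the TRUNCATE form is the one used in the Execute SQL Task above: 

-- deallocates the table's pages, minimally logged 
TRUNCATE TABLE [Production].[Bikes] 

-- logs each deleted row in the transaction log 
--DELETE FROM [Production].[Bikes] 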
 
    Now link the Execute SQL Task’s connector to the Data Flow Task and, by double clicking on the connector, review its properties in the Precedence Constraint Editor and make sure that the Constraint option value is set to ‘Success’ (the default value), meaning that the next step in the package will be executed only if the preceding step succeeded. 
 
SSIS - Precedence Constraint Editor

Note:
     When data are deleted or other important actions are performed in the current step, it is always a good idea to process the next step only if the current task succeeded, otherwise your data/logic might be invalidated.

Step 7: Testing the Package
     Now you are ready to test (debug) the whole package, but before this don’t forget to save your project! 
 
SSIS - Template Package

   Typically you’ll need to run the package at least twice in order to assure yourself that the previous data are deleted. It also helps to run a SELECT against the [Production].[Bikes] table again (see the check sketched below), just to make sure that everything went as expected!
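
    After the second run a simple duplicate check on the ProductID also confirms that the Execute SQL Task really removed the previous records (a minimal sketch, assuming ProductID uniquely identifies a product; the query should return no rows): 

SELECT ProductID 
, count(*) Records 
FROM [Production].[Bikes] 
GROUP BY ProductID 
HAVING count(*)>1 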

Notes:
1.    Actually you should save your work each time an important step is completed, because you never know when a runtime error occurs and your application closes unexpectedly, thus losing your work.
2.    Usually when dealing with multiple tasks and transformations it makes sense to rename the respective objects, this allowing you to identify them more easily.
