Friday, 2 January 2015

SSIS Transformations – Blocking, Partially Blocking and Non blocking


The type of transformation that we choose to achieve particular task has a great impact on performance of SSIS. Hence we should have  clear understanding of different types of transformations available in SSIS and be careful enough while deciding among them.
Here, I am going to mention few points that help in better understanding of transformation:

Blocking
Partially\ Semi Blocking
Non-Blocking
Need to read all the inputs before they produce some output. Clearly, it consumes time Need not read all inputs Need not read all inputs
It creates a new buffer for the output It creates a new buffer for the output It uses the existing buffer. This introduces huge performance benefits
These transformations may create additional records i.e. the number of input records might be different from number of output records. Hence, they are asynchronous in nature. Number of output records may or may not match the number of input records. They are also asynchronous in nature Number of output records are same as number of input records. These transformations are synchronous in nature
New thread is  introduced in Data flow New thread might introduce in data flow No additional thread is created
Available resources are impacted heavily Less impact on available resources than Blocking transformations It has the minimum impact to available resources out of all the transformation types
Ex: Sort, Aggregate, Ex: Merge, Merge Join Ex: Data Conversion, Derive Column, Look-up

Thursday, 1 January 2015

Code to divide a column data into Range


DECLARE @NoOfIterations int, –Indicates in how many chunks you want to pull the data
@RangeAdd int,
@ColSIDStart int,
@ColSIDEnd int,
@Counter int ,
@MaxColSID int,
@MinColSID int
DECLARE @ColSIDRANGETBL TABLE
(
Iteration int,
ColSIDStart int,
ColSIDEnd int
)
SET @NoOfIterations = ?
SET @Counter = 1
SELECT @MinColSID = MIN(ColSID) FROM [dbo].[Table] (NOLOCK)
SELECT @MaxColSID = MAX(ColSID) FROM [dbo].[Table]] (NOLOCK)
SET @RangeAdd = (@MaxColSID – @MinColSID)/@NoOfIterations
SET @ColSIDStart = @MinColSID
SET @ColSIDEnd = @ColSIDStart + @RangeAdd
WHILE ( (@Counter<=@NoOfIterations) and (@ColSIDStart <= @ColSIDEnd) )
BEGIN
INSERT INTO @ColSIDRANGETBL
SELECT @Counter,@ColSIDStart,@ColSIDEnd
SET @ColSIDStart = @ColSIDEnd + 1
SET @ColSIDEnd = @ColSIDStart + @RangeAdd
IF(@ColSIDEnd>@MaxColSID)
BEGIN
SET @ColSIDEnd = @MaxColSID
END
SET @Counter = @Counter + 1
END
SELECT ColSIDStart, ColSIDEnd FROM @ColSIDRANGETBL

For Analyze Data or Using Data Profiling Task



Sometimes, you may need to review the data source before initiating processing of the data from that source. You may want to review the source for NULL values in columns, distinct rows etc. This initial analysis of data before processing it may help the developers in building the logic based on that data. To perform such an analysis, Data Profiling Task of SSIS can be used.
To use Data Profiling Task, drag the same from SSIS Tool Box to the Control flow. Double click the task and following window will be displayed:
Untitled
Click the drop down ,create a new File connection as shown below and Click Ok:
Untitled4
Now, since the destination connections are set, we will proceed to set the Requests.
Note: Requests indicate the different ways in which we want to analyze source.
Click the “Quick Profile” button shown in below image:
Untitled1
Configure settings as shown in below figure:
Untitled2
Now click Ok and a dialogue box will open showing the requests selected for analyzing data. Refer figure below:
Untitled3
Now run the Task. Once the execution is completed go to Data Profiler Viewer in Windows or double click the Data Profiling Task and open Data Profiler Viewer as shown:
Untitled5
After clicking the button, following window will open, Click Open as shown in below screenshot:
Untitled7
Browse to the location of XML file as specified in Destination of Data Profile Task. Click OK and following window will open:
Untitled6
Now select any of the options pointed in above figure and fetch the source information.

Shortest code to get comma separated values from a column

Suppose you have a table named ‘tblCountry’ with following data:Untitled
Now you want your countries to be in comma separated format like India,Australia,US,UK.
Following snippet will give you the desired result:
DECLARE @SQL nvarchar(MAX)
SELECT @SQL = COALESCE(@SQL + ‘,’,”) + CountryName FROM [dbo].[tblCountry]
SELECT @SQL
The result is a string of characters that can be used further in the code

Multicast versus Conditional Split

Multicast and conditional split transformations are quite often used while developing an SSIS package. I have compiled some of the differences between the two transformations that might help understanding these transformations in a better way:
Multicast Conditional Split
Replicate source data to multiple destinations so that different logic can be applied to same source data Distribute the source data into multiple destinations based on conditions
Directs row to every output Directs row to single output
Does not contain any error output It has one error output
Do not require expressions Expressions are used for specifying conditions
Do not have default output It provides default output  so that if a row matches no expression it is directed to default output

Performance optimizations in SSIS

Developing an ETL is a part of challenge but the real challenge lies in developing an optimized one. In this very blog I will try to explain different performance optimizations technique that can be incorporated in ETL packages so as to run ETL in effective and efficient way. I have collected the material by reading MSBI books, googling out different blogs and from my own experience with ETL.
Here we goes..
  1. At connections level: By default, the network packet size has 0 value which means 4KB of data will be transferred at a time. Change this value to 32767 so as to enable 32 KB of data transfer to be transferred at one time
Untitled0
2. Using ‘Fast Parse’ property: This option can be used for very fast loading of data from flat files. To enable this option, perform following steps:
a. Right click the flat file source, and click ‘Show Advance Editor’:
Untitled1
b. Go to ‘Input and Output properties’ tab. In the Inputs and Outputs, expand the Output columns and select the column for which you want to enable ‘fast parse’.  After selecting the column, go to Custom Properties and set ‘FastParse’ property to ‘True’ as shown:
Untitled2
3. Avoid ‘Select *’: Instead of using ‘Select *’ while pulling data from SQL server table, use the column list and try to pull only required columns of table. SSIS uses buffer oriented architecture to transfer data wherein data from source first comes in buffers, required transformations are performed and then the data is dumped to destinations. However, size of the buffer depends on the estimated row size which is calculated as the maximum size of all the columns in a row. So if there are more number of columns in a row, they will occupy more space in buffer resulting in less number of rows in buffer. Therefore, always try to pull only required columns not all.
4. Avoid Asynchronous transformations: As far as possible, try to avoid them. Asynchronous transformations like sort needs to read and process all input data before creating any output records. As a result, data transfer process gets slower.
5. Considering data types: The data types should be as narrow as possible since they will be allocated less memory. Allocate the required number of bytes to the columns.
6. Indexes: If you are pulling large volume of data, then drop all the indexes if exists. After your load is complete, then re-create the indexes
7. OLEDB Destination settings:
Untitled3
  1. Keep Identity: By default, this property is unchecked, which means source identity values will not be copied to destination so destination table will create its own identity values if it has an identity column. If you check this property, then DFT engine will preserve the values of source identity and pass it to destination. If possible, keep it unchecked
  2. Check Constraints: By default, this property is checked. If you are sure that incoming data will not violate destination table constraints then uncheck this property. This setting specifies that the dataflow pipeline engine will validate the incoming data against the constraints of target table. If you un-check this option it will improve the performance of the data load.
  3. Rows per batch: By default, the value of this property is -1 or empty which means all the incoming rows will be treated as single batch. Change the value of this property to a positive value that specifies the number of rows in batch.
  4. Insert commit size: By default, the value is ‘2147483647’  as you can see in the above screenshot which means all the incoming data will be committed in one shot after successful completion of the DFT. We can specify a positive value for this setting to indicate that commit will be done for those number of records. Changing the default value for this setting will put overhead on the dataflow engine to commit several times but at the same time it will release the pressure on the transaction log and tempdb to grow specifically during high volume data transfers.
8. Using Delay Validation property: By default, this property is set to ‘False’. SSIS performs early validation that validates the package and its components before starting execution of package. If you are using some tasks that are dynamically passing some values to other tasks then set this property to ‘True’ for such tasks otherwise your package will fail.
9. Use Performance Counters to monitor SSIS performance
10. Using checkpoints
11. Parallel execution of tasks can also lead to performance benefits

Drilldown Matrix Reports


Matrix reports are very useful, and it allows total flexibility too.
For instance we could easily convert a matrix report to tabular.

In this tutorial, lets see how to add the drill down feature to an existing matrix report.
Lets take the same matrix report, which we created in the last tutorial.

This is how it looked:



Lets add some child groupings to this report, and explore some visibility toggling features.

Step 1: Right click on the Territory field and add a child group:




Step 2: Choose 'CountryRegionCode' to group by:


Step 3: Your design view would look something like this:


Step 4: Hit preview.


Congratulations - your matrix report is ready.

Lets proceed to see some visibility toggling features

Step 5: Click on the row groups - 'CountryRegionRegion' and select the group properties:



Step 6:  In the visibility pane, check 'Hide' for  'When the report is initially run'
Also, check the 'display can be toggled by this report item' for Territory


Step 7: Select OK and hit preview.
You should be able to toggle the report.