Thursday, 1 January 2015

Performance optimizations in SSIS

Developing an ETL is a part of challenge but the real challenge lies in developing an optimized one. In this very blog I will try to explain different performance optimizations technique that can be incorporated in ETL packages so as to run ETL in effective and efficient way. I have collected the material by reading MSBI books, googling out different blogs and from my own experience with ETL.
Here we goes..
  1. At connections level: By default, the network packet size has 0 value which means 4KB of data will be transferred at a time. Change this value to 32767 so as to enable 32 KB of data transfer to be transferred at one time
Untitled0
2. Using ‘Fast Parse’ property: This option can be used for very fast loading of data from flat files. To enable this option, perform following steps:
a. Right click the flat file source, and click ‘Show Advance Editor’:
Untitled1
b. Go to ‘Input and Output properties’ tab. In the Inputs and Outputs, expand the Output columns and select the column for which you want to enable ‘fast parse’.  After selecting the column, go to Custom Properties and set ‘FastParse’ property to ‘True’ as shown:
Untitled2
3. Avoid ‘Select *’: Instead of using ‘Select *’ while pulling data from SQL server table, use the column list and try to pull only required columns of table. SSIS uses buffer oriented architecture to transfer data wherein data from source first comes in buffers, required transformations are performed and then the data is dumped to destinations. However, size of the buffer depends on the estimated row size which is calculated as the maximum size of all the columns in a row. So if there are more number of columns in a row, they will occupy more space in buffer resulting in less number of rows in buffer. Therefore, always try to pull only required columns not all.
4. Avoid Asynchronous transformations: As far as possible, try to avoid them. Asynchronous transformations like sort needs to read and process all input data before creating any output records. As a result, data transfer process gets slower.
5. Considering data types: The data types should be as narrow as possible since they will be allocated less memory. Allocate the required number of bytes to the columns.
6. Indexes: If you are pulling large volume of data, then drop all the indexes if exists. After your load is complete, then re-create the indexes
7. OLEDB Destination settings:
Untitled3
  1. Keep Identity: By default, this property is unchecked, which means source identity values will not be copied to destination so destination table will create its own identity values if it has an identity column. If you check this property, then DFT engine will preserve the values of source identity and pass it to destination. If possible, keep it unchecked
  2. Check Constraints: By default, this property is checked. If you are sure that incoming data will not violate destination table constraints then uncheck this property. This setting specifies that the dataflow pipeline engine will validate the incoming data against the constraints of target table. If you un-check this option it will improve the performance of the data load.
  3. Rows per batch: By default, the value of this property is -1 or empty which means all the incoming rows will be treated as single batch. Change the value of this property to a positive value that specifies the number of rows in batch.
  4. Insert commit size: By default, the value is ‘2147483647’  as you can see in the above screenshot which means all the incoming data will be committed in one shot after successful completion of the DFT. We can specify a positive value for this setting to indicate that commit will be done for those number of records. Changing the default value for this setting will put overhead on the dataflow engine to commit several times but at the same time it will release the pressure on the transaction log and tempdb to grow specifically during high volume data transfers.
8. Using Delay Validation property: By default, this property is set to ‘False’. SSIS performs early validation that validates the package and its components before starting execution of package. If you are using some tasks that are dynamically passing some values to other tasks then set this property to ‘True’ for such tasks otherwise your package will fail.
9. Use Performance Counters to monitor SSIS performance
10. Using checkpoints
11. Parallel execution of tasks can also lead to performance benefits

No comments:

Post a Comment