Showing posts with label flow. Show all posts
Showing posts with label flow. Show all posts

Friday, March 23, 2012

Performance problems with SQL commands in data flow task

SQL statement within an OLE DB Command component is extremely slow (hours, days). Same SQL statement executed within a query window of SQL Server Management Studio takes only a few seconds. Using a fairly simple SQL UPDATE statement against a table with only 21,000 rows. Query:

UPDATE Pearson_Load
SET Process_Flag = 'E',
Error_Msg = 'Error: Missing address elements Address_Line_1, City, and/or State'
WHERE (Address_Line_1 = ' '
OR City = ' '
OR State = ' ')
AND Process_Flag = ' '

Any suggestions on how to improve the performance of this task or an alternate solution are appreciated. Thank you.

Jeff-B wrote:

SQL statement within an OLE DB Command component is extremely slow (hours, days). Same SQL statement executed within a query window of SQL Server Management Studio takes only a few seconds. Using a fairly simple SQL UPDATE statement against a table with only 21,000 rows. Query:

UPDATE Pearson_Load
SET Process_Flag = 'E',
Error_Msg = 'Error: Missing address elements Address_Line_1, City, and/or State'
WHERE (Address_Line_1 = ' '
OR City = ' '
OR State = ' ')
AND Process_Flag = ' '

Any suggestions on how to improve the performance of this task or an alternate solution are appreciated. Thank you.

You should redirect those rows destined for update to a table and then use an Execute SQL task in the control flow to perform a set-based update. What you've got now is a new, distinct update command for every row on the update path. This is costly.|||

Thank you Phil! I just moved the queries (I actually had 4 separate queries) that I was executing as separate OLE DB Command components in the data flow task into an Execute SQL task in the control flow and the process ran in seconds. I don't think that is exactly what you meant, but I wasn't sure what you meant by the suggestion to "redirect those rows destined for update to a table and then use an Execute SQL task in the control flow to perform a set-based update".

If you have time to comment so I understand the problem correctly, what I was doing wrong by using a data flow task with a table as an OLE DB source was executing the SQL statement in each OLE DB Command component I defined 21,000 times - once for each row in the table. So instead of executing 4 distinct queries, I was really executing 84,000 queries. If that is the case, when is it OK (if ever) to use such a scenario? Should the SQL command being executed be defined to only work on the current table entry? What would the syntax look like?

|||

Jeff-B wrote:

Thank you Phil! I just moved the queries (I actually had 4 separate queries) that I was executing as separate OLE DB Command components in the data flow task into an Execute SQL task in the control flow and the process ran in seconds. I don't think that is exactly what you meant, but I wasn't sure what you meant by the suggestion to "redirect those rows destined for update to a table and then use an Execute SQL task in the control flow to perform a set-based update".

If you have time to comment so I understand the problem correctly, what I was doing wrong by using a data flow task with a table as an OLE DB source was executing the SQL statement in each OLE DB Command component I defined 21,000 times - once for each row in the table. So instead of executing 4 distinct queries, I was really executing 84,000 queries. If that is the case, when is it OK (if ever) to use such a scenario? Should the SQL command being executed be defined to only work on the current table entry? What would the syntax look like?

My suggestion of moving the data to a table was assuming you were doing a parameter-based update query.

Your understanding is correct. You were executing 84,000 updates, and generally there is never a good time to do that. If you need to perform an update in the data flow on all of those rows, it would be best to insert the changes into a separate table, to be used later in a set-based update.|||Thank you for your latest response and your help with this problem.|||

Jeff-B wrote:

Thank you for your latest response and your help with this problem.

Jeff,

According to your post you managed to achieve this with an Execute SQL Task. Am I correct?

If using an Execute SQL Task is an option for you then I would go with that over a data-flow every time. SSIS will almost never be able to perform quicker than a RDBMS engine.

-Jamie

|||

Jamie,

Yes, I did solve this using an Execute SQL Task. It was a rather straightforward solution with this particular package because I wasn't using a parameterized query. I may have to use what Phil initially suggested above for another, similar package but one that one uses parameters in the query. One parameter needs to be referenced in a sub-query which it isn't allowed. That limitation is what led me to use a data flow task. I just wasn't aware of the inefficiency of that tack. Thanks.

|||

Jeff-B wrote:

Jamie,

Yes, I did solve this using an Execute SQL Task. It was a rather straightforward solution with this particular package because I wasn't using a parameterized query. I may have to use what Phil initially suggested above for another, similar package but one that one uses parameters in the query. One parameter needs to be referenced in a sub-query which it isn't allowed. That limitation is what led me to use a data flow task. I just wasn't aware of the inefficiency of that tack. Thanks.

Caveat that with the fact that its efficient in certain circumstances - unfortunetely doing updates is one of those scenarios. That's due to the vary nature of updates.

-Jamie

sql

Wednesday, March 7, 2012

Performance of Fuzzy Lookup?

Are the any ways to improve the performance of Fuzzy Lookup transformation?

I've already "sharpened" my data flow as much as possible in that I'm performing an exact lookup prior to the fuzzy and only passing the rows I need to compare in the inputs and in the reference table.

In one of my address match flows I have around 2,000 input rows matching against about 600,000 reference rows and its taking several HOURS to complete on a 2 proc, 4GB win2003 server (32 bit).

I've looked at perfmon and isolated that the operation is bogged during the actual lookups against tempdb - not the building of fuzzy match indexes or loading the reference data. Buffers on TempDB show the same statement being executed every few seconds?

I'm using the June CTP because we have too much code to migrate and decided to wait until Nov. 7 - has this component been improved in releases after June?

Any help on this would be greatly appreciated.

There is a whitepaper about Fuzzy Lookup and Fuzzy Grouping. It has a section for performance analysis. You might find it useful. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/FzDTSSQL05.asp