
Processing large XML data from SQL
Published date: Friday, September 3, 2010
On: Moer and Éric Moreau's web site

No VB or C# code this month! Instead, something that many programmers are facing as more and more stored procedures accept XML arguments.

Lately, I was faced with a serious performance problem in a stored procedure. This stored procedure receives an XML argument that may contain multiple rows to be parsed and processed. During the testing period, this argument contained maybe 10 rows, but later we found somebody using this stored procedure and passing around 12,000 rows of data in the argument. This is where the problem began!

If no VB or C#, what is the technology then?

This month, it is all about Microsoft SQL Server. I have tested this solution on SQL Server 2005 and also SQL Server 2008 R2. Both give about the same results.

The script I use to measure performance can be downloaded. It contains 2 xml variables with the format explained below. The first one (@data1) contains 9999 rows. The second variable (@data2) contains 999 rows (a smaller subset to check if it is working).
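
If you prefer not to download the script, here is a minimal sketch of how such a test variable might be built (the variable name @data2 and the row count come from the description above; building the XML in a string loop is my own assumption, not necessarily how the downloadable script does it):

DECLARE @data2 xml
DECLARE @i int
DECLARE @buffer nvarchar(max)

SET @buffer = N''
SET @i = 1

-- Build 999 rows in the <data><ID/><Name/></data> format shown below, then cast to xml
WHILE @i <= 999
BEGIN
	SET @buffer = @buffer
		+ N'<data><ID>' + CAST(@i AS nvarchar(10)) + N'</ID>'
		+ N'<Name>A' + CAST(@i AS nvarchar(10)) + N'</Name></data>'
	SET @i = @i + 1
END

SET @data2 = CAST(@buffer AS xml)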

The only tool used for this column is SSMS (Microsoft SQL Server Management Studio).

The origin of the problem

As I already started to explain, in one project I am currently working on, there is a stored procedure that has an XML argument. We use this argument to pass multiple rows of data, much like this:

<data><ID>1</ID><Name>A1</Name></data> 
<data><ID>2</ID><Name>A2</Name></data>
<data><ID>3</ID><Name>A3</Name></data>

This is pretty straightforward, and we thought we were doing the right thing by passing one big chunk to SQL Server to be processed instead of doing many round trips (passing one row at a time). It worked as expected until somebody found this parameter and sent about 12,000 rows in this argument. The operation lasted something like 35 minutes (do I really need to say that the processing was a bit more complex than simply counting the rows?)!
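
For context, the shape of such a procedure and of the call that passes the whole chunk might look like the sketch below (the procedure and parameter names are made up for illustration, and the real procedure obviously does much more than this):

-- Hypothetical procedure receiving all the rows in a single xml parameter
CREATE PROCEDURE dbo.uspProcessData
	@data xml
AS
BEGIN
	-- ... shred @data and process each <data> row ...
	RETURN
END
GO

-- One call carries the whole batch instead of one call per row
EXEC dbo.uspProcessData
	@data = N'<data><ID>1</ID><Name>A1</Name></data>
	          <data><ID>2</ID><Name>A2</Name></data>
	          <data><ID>3</ID><Name>A3</Name></data>'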

I was told about the problem and put in charge of investigating it.

The original method

The first method used to process all the rows from the XML variable was really easy (as suggested on an MSDN site): we always read the first row, then deleted it once processed, until no more rows were available.

-- @data is the XML variable and @counter the row counter, both declared in the downloadable script
WHILE (@data.exist('/data[1]') = 1)
BEGIN
	-- Process the first <data> row (here we only count it)
	SET @counter = @counter + 1

	-- Remove first data block & start again
	SET @data.modify('delete /data[1]')
END

For simplicity, I do nothing with the data. I just increment a variable that is output once the loop has completed, to report the number of rows processed and the time it took, as shown here.

        Rows        TimeRequired
------- ----------- --------------------
Method1 999         00:00:04:730

So as you can see, if we just want to go through all the rows, deleting each one as it is processed, it takes about 5 seconds to process 999 rows (your mileage may vary). When we first developed the procedure with that code, it ran in an acceptable time because we were receiving fewer than 100 rows. The problems only began when we got thousands of rows. To process my 9999-row data set, it takes something like 8 minutes, which was unacceptable for us.
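
For reference, the Rows/TimeRequired output above comes from the downloadable script; a minimal timing harness along these lines would produce the same kind of result (the variable names and the CONVERT style are my assumptions, not necessarily what the script uses):

DECLARE @start datetime
DECLARE @counter int
SET @counter = 0
SET @start = GETDATE()

-- ... run the WHILE loop of the method being measured here ...

-- Style 114 formats the elapsed time as hh:mi:ss:mmm
SELECT 'Method1' AS Method
     , @counter AS [Rows]
     , CONVERT(varchar(12), GETDATE() - @start, 114) AS TimeRequired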

A second method

While investigating, I was using a query that returned the data very quickly. To give you an idea, it takes less than a second to return a result set from my 9999 rows.

-- .nodes() shreds the XML into one row per <data> element; .value() extracts typed columns
select t.n.value('ID[1]', 'int') as ID
     , t.n.value('Name[1]', 'varchar(100)') as Name
from @data.nodes('/data') AS t(n)

My first reaction was to create a cursor with the result of this query and to loop through the cursor. This is a classic reaction!

If you want to test the second method, here is the snippet you can try:

--@ID, @Name and @counter are declared earlier in the downloadable script
--Open the cursor directly on the XML variable
declare curData CURSOR FOR
	select t.n.value('ID[1]', 'int') as ID
	     , t.n.value('Name[1]', 'varchar(100)') as Name
	from @data.nodes('/data') AS t(n)
open curData

--Loop through the data
fetch next from curData
into @ID, @Name

WHILE @@fetch_status = 0
BEGIN
	SET @counter = @counter + 1

	fetch next from curData
	into @ID, @Name
END

close curData
deallocate curData

With a query that fast to return a result set, and a cursor to do the looping, how do you think the performance compares to the first method? I was really disappointed to find that the performance was exactly the same! Apparently, a cursor declared directly over an XML query is not really efficient, presumably because the XML has to be shredded over and over as the cursor is consumed.

The final method

Still having in mind the query that executes lightning fast, and having had all my dreams crushed by the cursor, I had to find something else. The query runs quickly but the cursor seems to kill it. I then had a funny idea: what if we store the result set in a temp table and open the cursor on that temp table? At the point where I was, it was worth a try.

So I slightly modified my previous T-SQL code to insert the result of the lightning fast query into a temp table and open the cursor on that temp table. The resulting T-SQL code looks like this:

--Create a temp table from the XML data (SELECT ... INTO creates #temp on the fly)
select t.n.value('ID[1]', 'int') as ID
     , t.n.value('Name[1]', 'varchar(100)') as Name
into #temp
from @data.nodes('/data') AS t(n)

--Open the cursor on the temp table
declare curData CURSOR FOR
	select * from #temp
open curData

--Loop through the data
fetch next from curData
into @ID, @Name

WHILE @@fetch_status = 0
BEGIN
	SET @counter = @counter + 1

	fetch next from curData
	into @ID, @Name
END

close curData
deallocate curData

DROP TABLE #temp

I then took a deep breath and hit the F5 button. A result came back immediately (in less than a second):

        Rows        TimeRequired
------- ----------- --------------------
Method3 9999        00:00:00:630

I first thought I had made an error somewhere, but reading what was on the screen, I discovered the results were good!

Conclusion

Every programmer or DBA knows that testing with real data is always better, but sometimes it is hard to estimate how many rows will be sent when you have an open structure like the XML argument I had.

The other lesson I learned is that XML is great but seems to have a lot of overhead.

