Don’t use Query of Queries for simple datasets!

I’ve just finished a whole batch of optimisation routines for an Excel import/export that had to run on our CF8 based CRM system (thanks to Ben Nadel for his amazing POI CFC).

The export had a tricky requirement whereby each row (of some 56,000 rows) had n items which needed to be set according to a lookup in a link table in the DB. Rather that running each comparison as a query against the live DB (56,000 rows x 20+ items = 1,120,000 queries!) I figured I could pull the whole link table into memory using a simple query and then use Query of Queries to save a bit of time.

Trouble is, benchmarking the initial run, the QoQ was restricting throughput to some 500 rows a minute and, whilst this was an improvement on querying the live DB server, it was still way too slow.

Bear in mind the queries were on 2 integer based primary keys. E.g:

SELECT col1
FROM linkTable
WHERE col1 = #val1#
AND col2 = #val2#

So it occurred to me that I could look at pulling the whole link table into an array as a series of values in the form “val1-val2″ and using the indexOf method to determine if the link was present. The results were amazing! Throughput jumped from 500 a minute to 15,000 a minute and memory usage dropped massively despite in the new array variable holding all values from the link table.

I decided to benchmark Arrays vs. Query of Queries over a series of data set sizes running from 1000 to 10,000 in 1000 item increments and recorded the following results:

Seeding source query done in 7812ms

Val
Count
Array
Seed
Array
Query
Array
Total
QofQ
Total
Ratio
(Array to QoQ)
Total
1000 578ms 63ms 641ms 3703ms 17% 4344ms
2000 390ms 172ms 562ms 7672ms 7% 8234ms
3000 438ms 531ms 969ms 12281ms 8% 13250ms
4000 453ms 1125ms 1578ms 16547ms 10% 18125ms
5000 422ms 2172ms 2594ms 22515ms 12% 25109ms
6000 406ms 3484ms 3890ms 27235ms 14% 31125ms
7000 343ms 5485ms 5828ms 32890ms 18% 38718ms
8000 359ms 8282ms 8641ms 39062ms 22% 47703ms
9000 344ms 12375ms 12719ms 45046ms 28% 57765ms
10000 344ms 19703ms 20047ms 52062ms 39% 72109ms

Run of 10 iterations completed in 324.31 seconds

As you can see, the difference in processing times is rather amazing!

Of course QoQ is still a better choice when dealing with “proper” tables rather than 2 value link tables but this technique is definitely going into my “batch” toolkit for CF8.

Things to bear in mind when considering the above

  • Whilst the differences in processing times are significant this is still a pretty slow process with a 10,000 matches against 10,000 rows taking 20 seconds using the array method.
  • This benchmark was performed on our dev server, a quad core Xeon 1.60GHz with 1GB of RAM
  • This is only a portion of the processing that needed to be done on the original report.
  • I’ve also run this benchmark on my local workstation which runs CF9 and found the results to be similar

For those who like this kind of thing the benchmark script is included below

</code>
<cfset scriptStartTime = gettickcount() />
<cfflush interval="1" />

<!--- Turn on some debug and increase timeout --->
<cfsetting showdebugoutput="true" requestTimeOut="6000" />
<cfoutput>
<html>
<head>
<title>Array vs Query of Query Benchmark</title>
</head>
<body>

<!--- Set working Directory --->
<cfset workingDir = GetDirectoryFromPath(getcurrenttemplatepath()) />

<!--- Test for a file to allow us to kill this script mid run --->
<cfif fileExists("#workingDir#kill.die")>
	<p>Die File Found - Aborting</p><cfabort />
</cfif>

<!--- Setup the seed query --->
<cfset demoQuery = queryNew('col1, col2') />
<p>Seeding source query
<!--- Seed our source Query with 100,000 rows --->
<cfset tickBegin = GetTickCount() />
<cfloop from="1" to="100000" index="i">

	<!--- Generate 2 random values --->
	<cfset rand1 = randRange(1,999) />
	<cfset rand2 = randRange(1,99999) />

	<!--- Add new row to our source query --->
	<cfset QueryAddRow(demoQuery) />
	<cfset demoQuery["col1"][demoQuery.RecordCount] = rand1 />
	<cfset demoQuery["col2"][demoQuery.RecordCount] = rand2 />
</cfloop>

<cfset tickEnd = GetTickCount() /><cfset loopTime = tickEnd - tickBegin />

done in #loopTime#ms</p>

<!--- Setup our demo array --->
<cfset demoArray = arrayNew(1) />
<table border="1" cellspacing="0" cellpadding="2">
	<tr>
		<th>Val Count</th>
		<th>Array Seed</th>
		<th>Array Query</th>
		<th>Array Total</th>
		<th>QofQ Total</th>
		<th>Ratio (array to QoQ)</th>
		<th>Total</th>
	</tr>

	<cfset loopCount = 0 />
<!--- Start to loop to 10,000 at 1,000 intervals --->
<cfloop from="1000" to="10000" step="1000" index="i">
	<cfset iterationStartTime = gettickcount() />
	<tr>
	<!--- Test to see if the kill file has been created --->
	<cfif fileExists("#workingDir#kill.die")>
		<td colspan="7">Die File Found - Aborting</td></tr></table>
		<cfabort />
	</cfif>
	<td>#i#</td>

	<cfset arrayTickBegin = GetTickCount() />

	<cfquery name="getDemoDataSet" dbtype="query" maxrows="#i#">
		SELECT *
		FROM demoQuery
	</cfquery>

	<!--- Populate our demo array --->
	<cfloop query="getDemoDataSet">
		<cfset arrayValue = "#getDemoDataSet.col1#-#getDemoDataSet.col2#" />
		<cfset arrayAppend(demoArray,arrayValue) />
	</cfloop>
	<cfset tickEnd = GetTickCount() />
	<cfset popArrayTime = tickEnd - arrayTickBegin />

	<!--- Search the array--->
	<cfset arraySearchTickBegin = gettickcount() />
	<cfloop from ="1" to="#i#" index="s">
		<cfset searchVal1 = randRange(1,999) />
		<cfset searchVal2 = randRange(1,99999) />

		<cfset searchValCat = "#searchVal1#-#searchVal2#" />

		<cfset foundKey = demoArray.indexOf(searchValCat) />

	</cfloop>
	<cfset tickEnd = GetTickCount() />
	<cfset searchArrayTime = tickEnd - arraySearchTickBegin />
	<cfset totalArrayTime = tickend - arrayTickBegin />

	<td>#popArrayTime#ms</td>
	<td>#searchArrayTime#ms</td>
	<td>#totalArrayTime#ms</td>

	<!---Now to loop through and search our query--->
	<cfset querySearchTickBegin = gettickcount() />
	<cfloop from ="1" to="#i#" index="s">
		<cfset searchVal1 = randRange(1,999) />
		<cfset searchVal2 = randRange(1,99999) />

		<cfquery name="searchQuery" dbtype="query">
			SELECT 	col1
			FROM 	getDemoDataSet
			WHERE 	col1 = #searchVal1#
			AND 	col2 = #searchVal2#
		</cfquery>
	</cfloop>
	<cfset tickEnd = GetTickCount() />
	<cfset searchQueryTime = tickEnd - querySearchTickBegin />
	<cfset iterationTime = tickEnd - iterationStartTime />
	<td>#searchQueryTime#ms</td>

	<!--- Calc the ratio of times --->

	<cfset iterationRatio = round((totalArrayTime/searchQueryTime)*100) />

	<td>#iterationRatio#%</td>

	<td>#iterationTime#ms</td>

	</tr>
	<cfset loopCount++ />
</cfloop>
</table>

<cfset tickEnd = GetTickCount() />
<cfset runTime = (tickEnd - scriptStartTime)/1000 />
<p>Run of #loopCount# iterations completed in #runTime# seconds</p>
</body>
</html>
</cfoutput>

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>