I’ve just finished a whole batch of optimisation routines for an Excel import/export that had to run on our CF8 based CRM system (thanks to Ben Nadel for his amazing POI CFC).
The export had a tricky requirement whereby each row (of some 56,000 rows) had n items which needed to be set according to a lookup in a link table in the DB. Rather that running each compairson as a query against the live DB (56,000 rows x 20+ items = 1,120,000 queries!) I figured I could pull the whole link table into memory using a simple query and then use Query of Queries to save a bit of time.
Trouble is, benchmarking the initial run, the QoQ was restricting throughput to some 500 rows a minute and, whilst this was an improvement on querying the live DB server, it was still way too slow.
Bear in mind the queries were on 2 integer based primary keys. E.g:
SELECT col1
FROM linkTable
WHERE col1 = #val1#
AND col2 = #val2#
So it occurred to me that I could look at pulling the whole link table into an array as a series of values in the form “val1-val2″ and using the indexOf method to determine if the link was present. The results were amazing! Throughput jumped from 500 a minute to 15,000 a minute and memory usage dropped massively despite in the new array variable holding all values from the link table.
I decided to benchmark Arrays vs. Query of Queries over a series of data set sizes running from 1000 to 10,000 in 1000 item increments and recorded the following results:
Seeding source query done in 7812ms
Val
Count |
Array
Seed |
Array
Query |
Array
Total |
QofQ
Total |
Ratio
(Array to QoQ) |
Total |
| 1000 |
578ms |
63ms |
641ms |
3703ms |
17% |
4344ms |
| 2000 |
390ms |
172ms |
562ms |
7672ms |
7% |
8234ms |
| 3000 |
438ms |
531ms |
969ms |
12281ms |
8% |
13250ms |
| 4000 |
453ms |
1125ms |
1578ms |
16547ms |
10% |
18125ms |
| 5000 |
422ms |
2172ms |
2594ms |
22515ms |
12% |
25109ms |
| 6000 |
406ms |
3484ms |
3890ms |
27235ms |
14% |
31125ms |
| 7000 |
343ms |
5485ms |
5828ms |
32890ms |
18% |
38718ms |
| 8000 |
359ms |
8282ms |
8641ms |
39062ms |
22% |
47703ms |
| 9000 |
344ms |
12375ms |
12719ms |
45046ms |
28% |
57765ms |
| 10000 |
344ms |
19703ms |
20047ms |
52062ms |
39% |
72109ms |
Run of 10 iterations completed in 324.31 seconds
As you can see, the difference in processing times is rather amazing!
Of course QoQ is still a better choice when dealing with “proper” tables rather than 2 value link tables but this technique is definitely going into my “batch” toolkit for CF8.
Things to bear in mind when considering the above
- Whilst the differences in processing times are significant this is still a pretty slow process with a 10,000 matches against 10,000 rows taking 20 seconds using the array method.
- This benchmark was performed on our dev server, a quad core Xeon 1.60GHz with 1GB of RAM
- This is only a portion of the processing that needed to be done on the original report.
- I’ve also run this benchmark on my local workstation which runs CF9 and found the results to be similar
For those who like this kind of thing the benchmark script is included below
<cfset scriptStartTime = gettickcount() />
<cfflush interval="1" />
<!--- Turn on some debug and increase timeout --->
<cfsetting showdebugoutput="true" requestTimeOut="6000" />
<cfoutput>
<html>
<head>
<title>Array vs Query of Query Benchmark</title>
</head>
<body>
<!--- Set working Directory --->
<cfset workingDir = GetDirectoryFromPath(getcurrenttemplatepath()) />
<!--- Test for a file to allow us to kill this script mid run --->
<cfif fileExists("#workingDir#kill.die")>
<p>Die File Found - Aborting</p><cfabort />
</cfif>
<!--- Setup the seed query --->
<cfset demoQuery = queryNew('col1, col2') />
<p>Seeding source query
<!--- Seed our source Query with 100,000 rows --->
<cfset tickBegin = GetTickCount() />
<cfloop from="1" to="100000" index="i">
<!--- Generate 2 random values --->
<cfset rand1 = randRange(1,999) />
<cfset rand2 = randRange(1,99999) />
<!--- Add new row to our source query --->
<cfset QueryAddRow(demoQuery) />
<cfset demoQuery["col1"][demoQuery.RecordCount] = rand1 />
<cfset demoQuery["col2"][demoQuery.RecordCount] = rand2 />
</cfloop>
<cfset tickEnd = GetTickCount() /><cfset loopTime = tickEnd - tickBegin />
done in #loopTime#ms</p>
<!--- Setup our demo array --->
<cfset demoArray = arrayNew(1) />
<table border="1" cellspacing="0" cellpadding="2">
<tr>
<th>Val Count</th>
<th>Array Seed</th>
<th>Array Query</th>
<th>Array Total</th>
<th>QofQ Total</th>
<th>Ratio (array to QoQ)</th>
<th>Total</th>
</tr>
<cfset loopCount = 0 />
<!--- Start to loop to 10,000 at 1,000 intervals --->
<cfloop from="1000" to="10000" step="1000" index="i">
<cfset iterationStartTime = gettickcount() />
<tr>
<!--- Test to see if the kill file has been created --->
<cfif fileExists("#workingDir#kill.die")>
<td colspan="7">Die File Found - Aborting</td></tr></table>
<cfabort />
</cfif>
<td>#i#</td>
<cfset arrayTickBegin = GetTickCount() />
<cfquery name="getDemoDataSet" dbtype="query" maxrows="#i#">
SELECT *
FROM demoQuery
</cfquery>
<!--- Populate our demo array --->
<cfloop query="getDemoDataSet">
<cfset arrayValue = "#getDemoDataSet.col1#-#getDemoDataSet.col2#" />
<cfset arrayAppend(demoArray,arrayValue) />
</cfloop>
<cfset tickEnd = GetTickCount() />
<cfset popArrayTime = tickEnd - arrayTickBegin />
<!--- Search the array--->
<cfset arraySearchTickBegin = gettickcount() />
<cfloop from ="1" to="#i#" index="s">
<cfset searchVal1 = randRange(1,999) />
<cfset searchVal2 = randRange(1,99999) />
<cfset searchValCat = "#searchVal1#-#searchVal2#" />
<cfset foundKey = demoArray.indexOf(searchValCat) />
</cfloop>
<cfset tickEnd = GetTickCount() />
<cfset searchArrayTime = tickEnd - arraySearchTickBegin />
<cfset totalArrayTime = tickend - arrayTickBegin />
<td>#popArrayTime#ms</td>
<td>#searchArrayTime#ms</td>
<td>#totalArrayTime#ms</td>
<!---Now to loop through and search our query--->
<cfset querySearchTickBegin = gettickcount() />
<cfloop from ="1" to="#i#" index="s">
<cfset searchVal1 = randRange(1,999) />
<cfset searchVal2 = randRange(1,99999) />
<cfquery name="searchQuery" dbtype="query">
SELECT col1
FROM getDemoDataSet
WHERE col1 = #searchVal1#
AND col2 = #searchVal2#
</cfquery>
</cfloop>
<cfset tickEnd = GetTickCount() />
<cfset searchQueryTime = tickEnd - querySearchTickBegin />
<cfset iterationTime = tickEnd - iterationStartTime />
<td>#searchQueryTime#ms</td>
<!--- Calc the ratio of times --->
<cfset iterationRatio = round((totalArrayTime/searchQueryTime)*100) />
<td>#iterationRatio#%</td>
<td>#iterationTime#ms</td>
</tr>
<cfset loopCount++ />
</cfloop>
</table>
<cfset tickEnd = GetTickCount() />
<cfset runTime = (tickEnd - scriptStartTime)/1000 />
<p>Run of #loopCount# iterations completed in #runTime# seconds</p>
</body>
</html>
</cfoutput>