Processing Large CSV in Coldfusion

7 Nov, 2007

I was investigating on the CSV file processor that I did a couple of months ago, the client got back to us saying the application was timing out when processing CSV with more than 7000 records.

Basically what the application does is: allow user to upload a CSV file that contains various columns, but one of the column is a phone number column. The application will then validate these numbers according to some pre-determined criterias, the valid numbers are then stored in the database, the invalid ones are ignored.

The logic that I used was:

After the file is uploaded, put the data into an array, I used Ben Nadel’s CSV to Array utility to do this.
Loop through the array and do the number validation, if the number is valid put to the database (the call to the database is inside the loop).

Found out that this works fine for CSV files with records under 5000, but it gets slower and slower for bigger files. When first coding it we sort of knew this will become a problem in the future, I even put a comment on the code warning myself or other developers of it, however I guess I was looking for a quick solution so that we can quickly show progress to our client.

Now that the client has done some proper testing with the CSV files that they expected to receive from the users, the client requested that the application should be able to handle 30000 records in an acceptable processing time. Now how can I optimise the application in order to improve the performance five folds? Seems like an impossible task..

I have attempted few solutions that might solve this problem:

Using Java FileReader, read the file line by line and do the processing this way. I am dying to try to leverage the power of Java in Coldfusion, as in 1 year Coldfusion programming, I haven’t found myself in the position of desperately needed something that Coldfusion doesn’t provide.
Using ODBC to connect to the CSV directly, this is actually a suggestion from the client’s tech people. When I tried this, I ran into a problem with the way ODBC “guesses” the size of a column by sampling the first few rows. For example: numbers in rows 1 - 10 has a length of 8, and the 11th row the number has a length of 10, ODBC returns the 11th row as empty string! If I put the longest number into on one of the first rows, I don’t have this problem.

But I am convinced that in the end that the real bottleneck is the validation and the subsequent database insert inside the loop and there’s no way (that I can think of) to avoid this..

Having discussed this with one of my colleagues, it seems the only sensible way to solve this, is by changing the workflow of the application. We can delay the processing of the list and we can utilize Coldfusion scheduler to schedule the processing in batches.. Hm…