Faster File Consumption With Camel

One of the use cases I’m asked about most often out in the field is ingesting file-based data (batch or otherwise) and then validating, transforming, processing, and storing it. Luckily, Apache Camel’s File and FTP components make this extremely easy, so much so that it takes very little thought to get up and running. If you’re consuming small numbers of larger batch files, the defaults are probably good enough and you don’t need to put much thought into it. If, however, you’re consuming large numbers of smaller files and you want the highest possible performance, there are a few configurations that you might want to consider.

When writing a file poller, one of the most commonly overlooked requirements is some sort of mechanism to determine when the writer is done writing. If you grab a file before it’s complete, you’ll truncate it and end up with garbage data. Unfortunately, the way you make that determination on one filesystem does not necessarily work on all filesystems. In addition, the different strategies have different performance characteristics, and the one you pick usually ends up being your biggest bottleneck (outside of the actual processing of the data). Luckily, Camel provides several strategies out-of-the-box and even allows you to create your own if none of them meet your needs. So how to choose…

First, let’s cover the absolute fastest, most generic solution. If you control the process writing the file as well as the process reading it, you can use a “temp file” strategy. That is, the writer writes the file under some sort of temporary name (e.g., appending an “.inprogress” extension) and then atomically moves it to its final name when the write is complete (e.g., removing the “.inprogress” extension). The consumer can then easily filter out the temporary files and only consume files that are complete. Camel can do all of this work for you, so no need to panic over writing a bunch of extra code. Simply set the appropriate options on the producer (e.g., tempFileName) and the consumer (e.g., exclude or antExclude) and call it a day. :)
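To show what the temp file handoff boils down to at the filesystem level, here is a minimal sketch in plain Java. It does not use Camel; the class name TempFileHandoff and its methods are made up for illustration, and it assumes the writer and reader share a directory on a filesystem that supports atomic renames.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Camel.
final class TempFileHandoff {
    // Marker extension from the example above; any suffix both sides agree on works.
    static final String IN_PROGRESS = ".inprogress";

    // Writer side: write under the temporary name, then rename in one step.
    static Path write(Path dir, String name, String content) throws IOException {
        Path temp = dir.resolve(name + IN_PROGRESS);
        Path target = dir.resolve(name);
        Files.write(temp, content.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE fails with AtomicMoveNotSupportedException if the
        // filesystem cannot perform the rename atomically.
        return Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Reader side: list only regular files that do not carry the temporary extension.
    static List<Path> listComplete(Path dir) throws IOException {
        List<Path> complete = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                String fileName = file.getFileName().toString();
                if (Files.isRegularFile(file) && !fileName.endsWith(IN_PROGRESS)) {
                    complete.add(file);
                }
            }
        }
        return complete;
    }
}
```

The reader never has to guess whether a file is finished: a file under its final name is complete by construction, which is why this approach needs no waiting or repeated checks.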

Another similar solution is the “done file” strategy. Here you write the file and, when it is complete, write out an additional file with the same name plus some “marker” extension (e.g., “.done”). You then instruct your consumer to only pick up a file if it finds the corresponding “done” file. Again, Camel makes this a simple matter of configuration: just set the doneFileName option on both the producer and the consumer. To me, this seems a bit clunkier than the previous solution, but the end result is the same.
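For comparison, the done file handoff can be sketched the same way in plain Java. Again this is not Camel code; DoneFileHandoff and its methods are hypothetical names, and the only assumption is that the marker is written strictly after the data file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Camel.
final class DoneFileHandoff {
    // Marker extension from the example above.
    static final String DONE = ".done";

    // Writer side: write the data file first, then create the empty marker.
    static void write(Path dir, String name, String content) throws IOException {
        Files.write(dir.resolve(name), content.getBytes(StandardCharsets.UTF_8));
        Files.createFile(dir.resolve(name + DONE));
    }

    // Reader side: return only data files whose marker file exists.
    static List<Path> listReady(Path dir) throws IOException {
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                String fileName = file.getFileName().toString();
                if (fileName.endsWith(DONE)) {
                    continue; // markers are not data files
                }
                if (Files.exists(dir.resolve(fileName + DONE))) {
                    ready.add(file);
                }
            }
        }
        return ready;
    }
}
```

Compared with the temp file sketch, the reader does one extra existence check per file and somebody has to clean up the markers afterwards, which is part of why it feels clunkier.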

Both of the previous strategies are extremely fast and will work on pretty much any filesystem. However, as previously stated, they require you to have control over the producer and consumer sides. So what if you only control the consumer? Well… you could use one of the readLock options. Unfortunately, most of the available readLock options are more concerned with making sure no other consumers pick up the same file than they are with making sure the writer is done writing. And since we have other ways to make sure other consumers don’t step on our toes, we’ll just concentrate on the options that help us with the latter issue.

The most robust option that’s available out-of-the-box is the “changed” strategy. Basically, the way it works is that the consumer will grab a file and check its last-modified timestamp. It will then continue to check the last-modified timestamp (at the configured readLockCheckInterval) until it determines that the file has stopped changing (i.e., the previous check’s last-modified matches the current one). Once it has determined that the file is no longer changing, it will consume it and pass it to the rest of the route for processing. This strategy is an excellent option because it works pretty much anywhere (e.g., local filesystem, FTP, SFTP, …) and is configurable enough to handle the case of “slow writers” (by increasing the readLockCheckInterval option). And if you’re getting small numbers of larger files, it’s probably fast enough.

But if you’re trying to consume large numbers of smaller files, you will quickly see the bottleneck. The current implementation loops through the files and, for each one, checks the timestamp. It keeps checking the timestamp on that one file until it either detects that the file has stopped changing or hits its timeout (configured via the readLockTimeout option). It will not move on to the next file until one of those conditions is satisfied. That means that if I have lots of producers writing files, those files could all be stuck waiting to be consumed because of a single slow producer.

In practice, I’ve actually seen this happen, and it leads to a very bad situation where the polling itself starts to take too long (at the filesystem level and outside of the control of Camel) because of the sheer number of files in the directory. Once this starts to happen, it snowballs, so it’s difficult to recover from and usually requires manual intervention. So what do we do?
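To make the bottleneck concrete, here is a simplified sketch of the blocking check as described above. It is not Camel’s implementation: BlockingChangedCheck is a hypothetical name, the timestamp is read through a supplier so the logic can be tried without real files, and the two parameters merely stand in for readLockCheckInterval and readLockTimeout.

```java
import java.util.function.LongSupplier;

// Hypothetical illustration of a per-file blocking "changed" check.
final class BlockingChangedCheck {
    // Returns true once two consecutive samples match,
    // false if timeoutMillis elapses first or the thread is interrupted.
    static boolean waitUntilStable(LongSupplier lastModified,
                                   long checkIntervalMillis,
                                   long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long previous = lastModified.getAsLong();
        while (System.currentTimeMillis() < deadline) {
            try {
                // The calling thread is parked here; no other file is looked at meanwhile.
                Thread.sleep(checkIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            long current = lastModified.getAsLong();
            if (current == previous) {
                return true; // unchanged since the previous sample
            }
            previous = current;
        }
        return false; // still changing when the timeout hit
    }
}
```

In this sketch every file costs at least one check interval, and a file that never settles costs the full timeout, all on the same thread that should be moving on to the next file.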

Well… Luckily, Camel is awesome enough that it allows us to extend it whenever its out-of-the-box options don’t meet our needs. Suck on that, competition! :) Specifically, in the previous scenario, we actually solved the problem by creating our own custom version of the “changed” strategy. Only, in our version, we didn’t pause and repeat checks on a single file. Instead, we looped through the files and (for each file) checked its stats. We then added those stats to a cache and moved on to the next file. On each subsequent poll, we would check the file’s stats against the cached ones to determine if it had stopped changing (for at least the readLockCheckInterval amount of time). This allowed us to continue processing any files that were ready without having to wait behind a single one that wasn’t. In practice, we were able to use this strategy to consume very large numbers of files with only a single server. Take a look at the sample source code if you’d like to give it a try:

Worth noting that this is a recreation of the original work (as best as I could remember) that I did with my awesome colleague Scott Van Camp whose awesome coding skills are only rivaled by his awesome beard growing skills. So he gets to share in the credit/blame… :)
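
If you just want the shape of the cached approach without digging through the sample, here is a minimal sketch. It is not the sample code and it does not plug into Camel: CachedChangedCheck is a hypothetical name, the “stats” are assumed to be file size and last-modified time, and the caller passes in the current time so the logic can be exercised without real files or sleeping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a non-blocking, cache-based "changed" check.
final class CachedChangedCheck {
    // Snapshot of a file as seen on an earlier poll.
    private static final class Stats {
        final long size;
        final long lastModified;
        final long firstSeenMillis;

        Stats(long size, long lastModified, long firstSeenMillis) {
            this.size = size;
            this.lastModified = lastModified;
            this.firstSeenMillis = firstSeenMillis;
        }
    }

    private final Map<String, Stats> cache = new HashMap<>();
    private final long minStableMillis; // stands in for readLockCheckInterval

    CachedChangedCheck(long minStableMillis) {
        this.minStableMillis = minStableMillis;
    }

    // Called once per file per poll. Never sleeps, so one slow writer
    // cannot hold up the files behind it.
    boolean isReady(String path, long size, long lastModified, long nowMillis) {
        Stats seen = cache.get(path);
        if (seen == null || seen.size != size || seen.lastModified != lastModified) {
            // New file or still changing: remember it and restart the clock.
            cache.put(path, new Stats(size, lastModified, nowMillis));
            return false;
        }
        if (nowMillis - seen.firstSeenMillis < minStableMillis) {
            return false; // unchanged, but not for long enough yet
        }
        cache.remove(path); // about to be consumed, so forget it
        return true;
    }
}
```

On each poll you would call isReady for every file in the listing and hand only the ready ones to the rest of the route. A real implementation would also need to evict cache entries for files that disappear before they are consumed, which this sketch leaves out.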