Non-Guarantied Logging for High Throughput Applications


Alexander Liss




            Every file system is limited in the speed at which it can write data to durable storage (disk). File systems use caches of varying sophistication to ease the load that their users place on the hardware and to improve overall performance. Still, there is an obvious limit on sustained write speed; one can observe it by writing a large amount of data into one file as fast as possible for a long time. A cache usually handles rare, short bursts of writes well, but it can become an impediment when many files are written continuously, as is the case with data logging.

            As long as a file system never reaches 100% utilization, the different modes of writing into it are practically equivalent from a user's point of view, and the type of data caching is irrelevant. However, when there are periods of 100% utilization, it is most likely that, at least for some users, writing to a file becomes the bottleneck of the application.

            Some applications cannot tolerate such bottlenecks. Among them are applications that have to process data with high throughput and log these data at the same time.

            When such a data log is used for audit, the only solution is a fast data storage system. However, when the log is used for application monitoring and troubleshooting and not for audit (the data are logged for audit elsewhere), an alternative solution is available: Non-Guarantied Logging.

            With Non-Guarantied Logging, some pieces of data may be dropped from the log when an attempt to log them would slow down the main processing. Often, the dropped pieces are complete messages, so that the rest of the log is still readable.

            Drops of data from the log indicate that the file system is overloaded; hence, such overload should be reported periodically to some event log (syslog).

            A preferred implementation of such a logging facility could include a message queue (FIFO) of limited size, into which the application puts messages to be logged, and a thread that reads the queue and writes the messages into a file. When the queue limit is reached, a new message is dropped instead of being placed in the queue, and reporting of this event is triggered. The reporting state could include a reporting period, a counter of drop events, and a timestamp of the last attempted report.
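
            The scheme described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a reference implementation: the class name NonGuarantiedLogger, its parameters (max_queue, report_period, report) and the choice of passing an already opened stream are all hypothetical, and error handling in the writer thread is omitted.

```python
import queue
import threading
import time


class NonGuarantiedLogger:
    """Bounded-queue logger that drops messages rather than block the caller.

    All names and defaults here are illustrative assumptions.
    """

    def __init__(self, stream, max_queue=10000, report_period=60.0, report=print):
        self._stream = stream                 # any object with write() and flush()
        self._queue = queue.Queue(maxsize=max_queue)
        self._report = report                 # callable that receives overload reports
        self._report_period = report_period   # minimum seconds between reports
        self._dropped = 0                     # drops not yet reported
        self._last_report = None              # monotonic time of the last report
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, message):
        """Queue one complete message; return False if it was dropped."""
        try:
            self._queue.put_nowait(message)   # never blocks the main processing
            return True
        except queue.Full:
            self._note_drop()
            return False

    def _note_drop(self):
        now = time.monotonic()
        with self._lock:
            self._dropped += 1
            recent = (self._last_report is not None
                      and now - self._last_report < self._report_period)
            if recent:
                return                        # counted now, reported later
            count, self._dropped = self._dropped, 0
            self._last_report = now
        self._report(f"log overload: {count} message(s) dropped")

    def _run(self):
        # Writer thread: takes messages in FIFO order and writes them out.
        while True:
            message = self._queue.get()
            if message is None:               # sentinel placed by close()
                break
            self._stream.write(message + "\n")
        self._stream.flush()

    def close(self):
        """Drain the queue, stop the writer, and report any unreported drops."""
        self._queue.put(None)                 # blocking put is acceptable at shutdown
        self._thread.join()
        with self._lock:
            count, self._dropped = self._dropped, 0
        if count:
            self._report(f"log overload: {count} message(s) dropped")
```

            An application would create one such object, for example NonGuarantiedLogger(open("app.log", "a")), and call log() from its main processing path. On a Unix system the report argument could be set to syslog.syslog so that overload reports reach the system event log. In this sketch, drops that occur within the reporting period are only counted; they are reported with the next drop after the period expires, or at close().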

            While the implementation of this strategy is straightforward, Non-Guarantied Logging is an unusual idea for the specialists who support applications, so the introduction of such functionality has to be explained to them.

            Logging is a part of the more general idea of reporting, and this strategy can be extended to reporting into a set of report destinations (files, screens, databases, etc.). Because each destination can cause an independent delay, each destination should be equipped with its own message queue and thread. A single message presented for reporting to multiple destinations is copied separately into each queue, and the threads send the copies independently.
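
            The extension to several destinations could look roughly like the following Python sketch. The names Destination and Reporter, the send callable and the queue sizes are hypothetical, chosen only for illustration; the drop counter is not synchronized, so the sketch assumes a single reporting thread.

```python
import queue
import threading


class Destination:
    """One report destination with its own bounded queue and sender thread."""

    def __init__(self, name, send, max_queue=1000):
        self.name = name
        self.dropped = 0          # messages dropped because this queue was full
        self._send = send         # callable that delivers one message
        self._queue = queue.Queue(maxsize=max_queue)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def offer(self, message):
        """Queue the message without blocking; return False if dropped."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1     # assumes one reporting thread (no lock)
            return False

    def _run(self):
        # Sender thread: a slow destination delays only its own queue.
        while True:
            message = self._queue.get()
            if message is None:   # sentinel placed by close()
                return
            self._send(message)

    def close(self):
        self._queue.put(None)
        self._thread.join()


class Reporter:
    """Places each message separately into the queue of every destination."""

    def __init__(self, destinations):
        self._destinations = list(destinations)

    def report(self, message):
        """Return the names of the destinations that dropped the message."""
        # Python strings are immutable, so sharing the reference acts as a copy.
        return [d.name for d in self._destinations if not d.offer(message)]

    def close(self):
        for d in self._destinations:
            d.close()
```

            With this arrangement a stalled database connection, for instance, fills and overflows only its own queue, while the file and screen destinations continue to receive every message.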