In his recent blog post, our COO Don Brown outlined his perspective on REAL Big Data requirements -- Reliable, Extensible, Analytics-enabled, Limitless. In this series of posts I am going to dive deeper in the Reliable requirement and talk in some detail about failure points and how to deal with them.
There are several popular logging mechanisms -- notably syslog, log4j, and file-based logging -- each with its own benefits and drawbacks. Since we are talking about reliably collecting log data, not actually logging, we'll skip that discussion and drill-down into how these logging mechanisms affect collection. Specifically, today we'll focus on issues around file-based logging.
The collection of log data is done with a lightweight agent, which is pointed to the log file(s) of interest. A single agent can monitor multiple log files at once; we've tested collecting from hundreds of log files using a single agent. The basic mechanics of the Rocana agent are to watch a log file and pick up changes since the last read, then forward to the Rocana Ops server. There are a couple other additional steps, which are critical components of the error handling process that are detailed later.
Figure 1: Sequence of Steps for Log File Processing with the Rocana agent
Easy enough, what could go wrong with that? Well, since you asked:
What if the agent crashes?
The Rocana agent tracks which log messages are acknowledged by Rocana Ops and keeps a checkpoint of last known successfully transmitted and acknowledged event. Well, actually, by default it keeps track in batches of 100, but that is configurable. When the agent is restarted, log data will then be picked up from the last checkpoint.
What if the agent can't write the checkpoint file?
The Rocana agent tracks which log messages are acknowledged and keeps a checkpoint of last known, successfully transmitted and acknowledged event. The Rocana agent stops data collection operations until the issue is resolved. Log data will then be picked up from the last checkpoint.
What if the agent doesn't get an acknowledgement from Rocana Ops?
This should be an incredibly rare scenario as the Rocana Ops event bus is a highly-available, fault-tolerant application built using Apache Kafka. But fans of Chaos Monkey know these obscure situations can occur. When they do, the checkpoint and acknowledgement mechanisms utilized by the Rocana agent come into play. The Rocana agent will resend events that have not been acknowledged until it receives an ACK back. Obviously this means the same event may be sent to Rocana Ops several times.
What if the agent can't forward the data to Rocana Ops?
When the Rocana agent detects a communication issue with the consumer on the Rocana Ops server, the agent stops data collection activities. When connectivity is restored the agent resumes collection, picking up all file-based log data since the last checkpoint.
What if someone deletes the log file before you read all the lines?
Obviously we don’t recommend deleting files while we are still monitoring them, but we do make best efforts to collect the data. But file deletion is one the more likely cases to result in data loss. That said, specific to our Linux agent, the agent maintains a file handle to the log file and as long as we have a handle we are able to retrieve the data. When the file is deleted, the agent gets a delete notification via inotify but continues to tail the data until reaching the end of the file, then releases the file handle.
This all sounds good, are you saying you never lose data?
No. With file-based logs, data loss scenario exists, especially when log files are deleted while the Rocana agent is down. For example, if the Rocana agent is off-line for enough time that a rolling log files expire, that data would be lost. That's a specific example of a general class of problem, if a log file is deleted while the Rocana agent is off-line, then data past the last checkpoint cannot be retrieved.
Wait a minute, if I understand this checkpoint mechanism, the same event can be sent to Rocana Ops several times?
Yes, that's true. Which introduces the notions of “exactly once” and “at least once” delivery. “Exactly once” delivery is impossible to guarantee (a bravenewgeek.com has a good write-up on this topic), so we use an “at least once” approach. On the Rocana agent side of the connection, for each event the Rocana agent generates a unique key based on the contents of the event (not a serial event id), which is transmitted to Rocana Ops along with the event data. The process of ensuring this event is only processed once is a feature of Rocana Ops, which will be covered in another blog
This pretty much covers how we handle file-based logs. We'll cover other options like syslog and metrics soon. In the meantime, anything missing? Are there some failure scenarios for which you would like more detail?
Want to learn more about reliable collection from log files and how it pertains to finding the right big data solution? Watch the replay of REAL Big Data webinar.