Greetings! Last week at work consisted of a lot of drinking from that sweet, delicious, absinthe-laced Kool-Aid that is programming in a Unix-like environment. Linux system administrators typically only turn to writing code when there is a problem to solve and adding an entry to cron or an alias in .bashrc just isn't going to cut it. I had one such problem presented to me by the project manager this week.
We're migrating from self-hosted servers to Amazon Web Services. Amazon's pricing model, at least for our use case, essentially just charges you for outgoing data. Data passed between your own servers is ignored, as is all incoming data. The PM asked for an estimate of how much data we're likely to be charged for.
Due to the specific requirements of what traffic needs to be monitored, and how difficult it is to get new software approved for our systems, I opted to go with the old quick-and-dirty script approach. It turns out Red Hat Enterprise Linux Server (and probably most other distros) ships with the tcpdump utility. You can use it with filters to monitor very specific traffic at the packet level (like if Wireshark were a command line tool).
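For example, a filter along these lines would capture only traffic leaving your own network (the interface and networks here are made up, so adjust them to whatever you actually need):

tcpdump -i eth0 -nn 'not (src net 10.0.0.0/8 and dst net 10.0.0.0/8)'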
Well that's great, but now I need to log and eventually aggregate the data into a number, or set of numbers, the PM will find useful. That's where Python comes in with all of its included batteries. The problem at this point is how to get the output of tcpdump parsed and logged in real time so I have a data set to crunch later. And that's where we get back to the title of this post.
NOTE: This probably doesn't work on Windows. I don't know or care; I didn't test it. Try it in a Linux VM if you're using a different OS.
The gist of the approach is that you create a pipe, then your Python script forks, creating a second process which is an exact copy of itself. The copy is referred to as the child, and the original process is the parent. Both resume execution from that point in the program, and both have access to the same pipe, which has a reading end and a writing end. The child will be handling the writing end, and the parent will be reading. In the child you point its standard output at the write end of the pipe, and then you exec whatever you want (in my case tcpdump), which replaces the Python process and inherits that pipe as its standard output. Cool stuff.
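To make that concrete, here is a bare-bones sketch of the pattern. It isn't my work code; the tcpdump arguments are just an example (-l keeps its output line-buffered so lines show up promptly), and you'll need the privileges to capture packets:

import os
import sys

read_fd, write_fd = os.pipe()
if not os.fork():
    # Child: stdout now goes into the pipe, then we become tcpdump
    os.close(read_fd)
    os.dup2(write_fd, sys.stdout.fileno())
    os.execlp('tcpdump', 'tcpdump', '-l', '-nn')
# Parent: read whatever the child writes into its end
os.close(write_fd)
with os.fdopen(read_fd) as pipe_reader:
    for line in pipe_reader:
        print('got:', line, end='')

The rest of this post walks through the same plumbing with tail instead of tcpdump, since that's something you can safely play with at home.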
Unfortunately, I can't share the exact code I used to solve this problem. I don't own it. You can skip ahead to the fun stuff now if you would like; I'm going to explain in brief how I solved my particular problem. The output of tcpdump contains a lot of fluff which might be useful in other circumstances, but all I wanted was the timestamp and the packet size. That's easy enough to parse out. I created an SQLite database with a single table (for now) holding three columns: date, time, and packet size. And that's it, really. As the script gets lines from its child, they get parsed, and that data gets loaded into the database as a row. That wraps up the monitoring end of the solution in under 100 lines of code, including empty lines and numerous comments. In a couple of weeks, after I've gathered enough data, I'll write something else to make that data useful. For now, we wait.
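I can't share the real thing, but purely to illustrate the idea, the parsing and storage end could look something like the sketch below. The table layout matches what I described above; the 'length' parsing is an assumption about default tcpdump output (TCP and UDP lines normally end in something like ", length 1448"), so check what your traffic actually looks like. Each line read from the child would get handed to store():

import sqlite3
from datetime import date

def store(db, line):
    # One tcpdump line in, one row out
    fields = line.split()
    if 'length' not in fields[:-1]:
        return                                    # not a packet line we understand
    size = int(fields[fields.index('length') + 1].rstrip(':,'))
    db.execute('INSERT INTO packets VALUES (?, ?, ?)',
               (date.today().isoformat(), fields[0], size))
    db.commit()

db = sqlite3.connect('traffic.db')                # hypothetical file name
db.execute('CREATE TABLE IF NOT EXISTS packets (date TEXT, time TEXT, size INTEGER)')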
If there is a better solution, and there usually is, let me know so I can learn and be a better admin. However, keep in mind I'm working with what Red Hat provides in its baseline server distro. If you would like a feel for what that includes, spin up a CentOS 7 virtual machine.
Try It At Home
You can also use this method to monitor logs with tail. In fact, if you're savvy with polling and such you can use one script to monitor all of your system's logs with a bunch of tails. But remember kids, forking leads to child processes, and too many children around can end up being a denial of service if you know what I mean. Fork responsibly.
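If you want to see what that looks like, here's a rough sketch that forks one tail per file and then waits on all of the pipes at once with select. The log paths are hypothetical, and on a stock Red Hat box you'd need root to read them:

import os
import select
import sys

LOG_FILES = ['/var/log/messages', '/var/log/secure']   # hypothetical paths

def spawn_tail(path):
    # Fork a 'tail -f' child for one file and hand back the read end of its pipe
    read_fd, write_fd = os.pipe()
    if not os.fork():
        os.close(read_fd)
        os.dup2(write_fd, sys.stdout.fileno())
        os.execlp('tail', 'tail', '-f', path)
    os.close(write_fd)
    return read_fd

fds = [spawn_tail(path) for path in LOG_FILES]
while True:
    # Sleep until at least one of the tails has output ready, then drain it
    ready, _, _ = select.select(fds, [], [])
    for fd in ready:
        print(os.read(fd, 4096).decode(), end='')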
For the sake of consistency between my machine and yours, in case you actually want to play with this stuff, I've written a short script which logs numbers in sequence. At work we're using Python 2, but at home I use Python 3, and since I'm doing this at home I went with 3 for the example. Save it in a file, give it execute permission, and fire it off as a background process. Just don't forget to kill it when you're done, or modify the range to a more reasonable value (1,000 is probably sufficient).
#!/usr/bin/env python3
# Log an increasing number to example.log once per second.
import logging
import sys
import time

def main():
    # The default format gives lines that look like "INFO:root:42"
    logging.basicConfig(filename='example.log', level=logging.INFO)
    logger = logging.getLogger()
    for i in range(sys.maxsize):
        logger.info(i)
        time.sleep(1)

if __name__ == '__main__':
    main()
We set up the logger, and in a for loop we log a number every second. Easy peasy lemon squeezy: a steady stream of lines that look like INFO:root:42. The log file will be "example.log" in our current working directory. Now that we have a constant stream of output to monitor, let's monitor it.
You can see the script working with the "tail -f" command. Maybe we would like to do something when the log outputs a specific message. Since we're working with numbers, how about a FizzBuzz monitor? The FizzBuzz problem is an interview question for software developers. You can read about it here if you're unaware of how it works. Here is my simple FizzBuzz function for this example:
def fizzbuzz(number):
    fizz = not number % 3
    buzz = not number % 5
    if fizz:
        print('Fizz', end='', flush=False)
    if buzz:
        print('Buzz', end='', flush=False)
    if not fizz and not buzz:
        print(number, end='', flush=False)
    print(flush=True)   # finish the line and push it out
Now we need data to apply our function to. This is the fun part. We're going to tail the log, and as our program receives a log line from tail, it will parse it and run the number through fizzbuzz. So how do we get tail to do this for us? Here's the main function of our log watcher:
import os
import sys

def main():
    read_pipe_fd, write_pipe_fd = os.pipe()
    if not os.fork():
        # Child: point stdout and stderr at the write end of the pipe, then become tail
        os.close(read_pipe_fd)
        os.dup2(write_pipe_fd, sys.stdout.fileno())
        os.dup2(write_pipe_fd, sys.stderr.fileno())
        os.execlp('tail', 'tail', '-f', './example.log')
    # Parent: read whatever tail writes into the other end
    os.close(write_pipe_fd)
    read_buffer = str()
    with os.fdopen(read_pipe_fd) as rdpipe:
        while True:
            read_buffer += rdpipe.read(64)
            if '\n' not in read_buffer:
                continue
            lines = read_buffer.split('\n')
            read_buffer = lines.pop()
            for line in lines:
                number = int(line.split(':')[-1])
                fizzbuzz(number)
There's quite a bit going on. First, the pipe is created. The return value from the call to fork is checked to see whether we're the child process (it will be 0 in the child); if we are, we close the read end of the pipe, duplicate the write end onto stdout and stderr, then replace Python with tail.
Meanwhile, in the parent process, we close the write end and wrap the read end in a file object. The reading happens in arbitrary 64-byte chunks, and the read call just blocks until tail has produced output. When we do get data, we make sure we're holding at least one complete line before parsing. Splitting on newlines, the last item in the list (whatever trails the final newline) is popped off to become the new read_buffer, and each complete line gets its number pulled out and finished off with a call to fizzbuzz. Stick the fizzbuzz function on top, call main() at the bottom, and that's the whole watcher.
I could have reinvented the wheel and implemented my own version of 'tail -f': check the file size, reopen the file if it has changed since the last check, seek to the previous offset, and read everything from there to the end. Why bother with all of that, though? Tail is a great little utility which can do all of it for me. The only overhead is having another process spawned. I certainly wouldn't want to try that with tcpdump either, because that command reads and organizes information in a far more complex manner.
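For the curious, that wheel might look roughly like this as a generator (no handling for log rotation, and the path is just the example log from earlier):

import os
import time

def follow(path, interval=1.0):
    # Watch the file size; whenever it grows, read everything from the
    # previous offset to the new end and hand it back as a string
    offset = os.path.getsize(path)
    while True:
        if os.path.getsize(path) > offset:
            with open(path, 'rb') as log:
                log.seek(offset)
                chunk = log.read()
            offset += len(chunk)
            yield chunk.decode()
        time.sleep(interval)

You would use it like "for chunk in follow('./example.log'):" and split out the lines yourself, which is exactly the kind of bookkeeping tail already does better.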
If I left some things out or you would like clarification on anything, let me know. Maybe I'll write up some more of these if reception is positive. Working as a Linux sysadmin I'm always learning cool new stuff to do with Python, and I would love to share more newly acquired knowledge with you geeks.