We frequently talk about the Paglo API but do not have much in the way of example code on how to use it. This post is more an exploration of the Paglo submitFile API than a way to submit logs to be indexed from Python. For a simpler way to submit logs from Unix machines please consult Searching Ruby on Rails Production Log Files with Paglo.
Here I am going to cover how to use the bulkSubmit API from Python to capture log files on Unix-based systems and submit them up to Paglo. This article assumes some basic familiarity with Python and Unix administration concepts. Although the actual Python implementation of the bulkSubmit API will work fine under all operating systems that Python runs on, these example scripts were written specifically for a Unix environment. As such they run under FreeBSD, Mac OS X, and Linux (Ubuntu). Also note these examples work with Python versions 2.5 and 2.6.
You can find all of the code referred to here in our public Subversion repository at: https://svn.paglo.com/paglo_open_source/paglo_for_python/trunk
You will need to either install the paglo Python module or make sure that it is included in your PYTHONPATH environment variable before you can make use of the scripts and APIs that it provides. A setup.py file is provided so that installing the paglo Python module is a simple matter of running
python setup.py install
Consult Python’s documentation on the module search path for more information.
The basic problem we want to solve in this post is simple:
Watch log files in an ongoing fashion, and through log file rotation. Submit their contents up to Paglo as new log messages are appended to these files.
I split the task of actually monitoring the log files from the task of submitting data up to Paglo. This is done by the two scripts in the scripts/ directory in our Subversion repository: scripts/paglo_submitter.py and scripts/paglo_logwatcher.py
Paglo Submitter
Let us look at the script which watches a directory for submission files and submits the data up to Paglo. This script is designed to submit two kinds of data: PQL ‘merge’ statements, and log file data. This is done by using the submitFile API Paglo provides. This is the same API that the Paglo Crawler uses when it submits data it has gathered to Paglo. For other APIs that Paglo offers please check our documentation.
The submitFile API for Paglo takes two required inputs: the processor parameter, which indicates what subsystem is going to process the submitted data, and a file attachment containing the data being submitted. This API is good for cases where you are bulk submitting data to Paglo that does not require any further interaction from the client once it is submitted.
In keeping with its ‘fire and forget’ nature, it can submit a whole file of data at once. The file being submitted can also be compressed, saving bandwidth along the way, which is very useful for quickly growing log files.
There are currently only two supported ‘processors’ for bulk submission: bulk_store and log.
The bulk_store processor is for submitting a series of PQL ‘merge’ statements. This populates the PQL database directly. Please see the merge statement documentation for more information about the PQL ‘merge’ statement. There are no other parameters required when submitting a file of ‘merge’ statements to the bulk_store processor.
The log processor is for submitting log file fragments to Paglo. It requires additional arguments such as source and source_type. These are covered below.
To keep the submitter agnostic about all this, whatever wants to submit data encodes the parameters to pass to submitFile as the first line of the files it drops off. Every 10 seconds the paglo_submitter.py script looks in a specific directory for files whose names consist only of the digits 0-9, “_”, “-”, and “.”.
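The file name rule lends itself to a regular expression. The following is a minimal sketch (written for Python 3; the original scripts target 2.5 and 2.6). The expression and the names is_submission_file and find_submission_files are assumptions made for illustration, not the ones used in paglo_submitter.py.

```python
import os
import re

# Assumed pattern: names made up solely of digits, "_", "-", and ".".
SUBMISSION_NAME = re.compile(r"^[0-9_.\-]+\Z")


def is_submission_file(name):
    """Return True if a bare file name looks like a finished drop-off file."""
    return SUBMISSION_NAME.match(name) is not None


def find_submission_files(directory):
    """Return the full paths of matching regular files, sorted by name."""
    names = sorted(n for n in os.listdir(directory) if is_submission_file(n))
    paths = [os.path.join(directory, n) for n in names]
    return [p for p in paths if os.path.isfile(p)]
```

Anything whose name contains a letter, such as a file that is still being written under a temporary name, is simply skipped by the sweep.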
When the paglo_submitter.py script starts up, after doing various logging and daemon program setup, it will create a ‘paglo session object.’ For this we need the Paglo API key for your company’s Paglo Index. To find your API key, go to https://app.paglo.com/user/edit_company and look for the section titled Data Key. By default the paglo_submitter.py script will look for your data key in the file /usr/local/etc/paglo_api_key, which should be protected so that only authorized entities can read its contents.
submitter = PagloSubmitter(session, options.directory, logger = logger)
After it creates the ‘paglo session object’ the paglo_submitter.py will loop forever sweeping the drop off directory for new files.
The sweeping process loops through all the files that match the pattern mentioned above. When it finds one it reads the first line of that file. This line is parsed according to CGI parameter encoding rules. We then take the remainder of the file after the first line and submit it using the ‘Paglo session object’s’ submitFile method, which takes as required parameters the processor to use and either the name of a file to submit or an open file handle with the read head positioned at the beginning of the data to submit. Any other parameters that were defined in the first line of the file are passed as a Python keyword argument dict to the submitFile method.
def submit_file(self, file_name):
    # Open the file, read out the first line to find the parameters
    # to submit to Paglo. (We need to turn each value from a list into
    # a single value since we will never have repeated parameters.)
    #
    f = open(file_name, "r")
    params = cgi.parse_qs(f.readline().strip())
    for key in params.keys():
        params[key] = params[key][0]
    processor = params['processor']
    del params['processor']
    try:
        resp = self.session.submitFile(processor, f, **params)
        os.remove(file_name)
        ...
    finally:
        f.close()
    return
We check for various kinds of errors that may arise from this submission attempt, remove the file if the submission succeeded, and continue with our loop until we have looked at every file that matches our pattern. Sleep, rinse, and repeat.
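To make the first-line convention concrete, here is a small sketch that builds a drop-off file in memory and then splits it back into a processor, extra parameters, and payload. It is written for Python 3 and uses urllib.parse, since the cgi.parse_qs function is not available in current Python 3 releases. The function names format_header and read_header and all of the sample values are made up for illustration.

```python
import io
from urllib.parse import parse_qs, urlencode


def format_header(params):
    """Encode submission parameters as the one-line header of a drop-off file."""
    return urlencode(sorted(params.items())) + "\n"


def read_header(handle):
    """Read the header line and return (processor, other_params).

    The handle is left positioned at the first line of the payload.
    """
    parsed = parse_qs(handle.readline().strip())
    # parse_qs returns a list per key; parameters are never repeated here.
    params = dict((key, values[0]) for key, values in parsed.items())
    processor = params.pop("processor")
    return processor, params


# Example values are invented for illustration only.
drop_off = io.StringIO(
    format_header({"processor": "log",
                   "source": "/var/log/system.log",
                   "source_type": "syslog",
                   "host_name": "mac1.example.com"})
    + "Jun  1 12:00:00 mac1 backupd[42]: Backup completed successfully.\n"
)
processor, extra = read_header(drop_off)
payload = drop_off.read()  # everything after the header line
```

Because the header is ordinary URL encoding, values such as file paths survive the round trip even though they contain characters like “/”.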
This provides us a robust way of submitting arbitrary data up to Paglo and only needing to hand our Paglo Data Key to one program to limit exposure. One thing you need to make sure of is that this program has the permission to read the files that are dropped in its directory and delete them after it is done.
For more information about how the Paglo session object’s submitFile method works you can check out the source.
Paglo Log Watcher
The other half of our dynamic duo is the program that monitors various log files. It also knows how to parse a maillog file and how to read /proc/meminfo for memory statistics to submit specific statistics as PQL merge statements. So paglo_logwatcher.py shows usage of both the log and bulk_store processors for the submitFile API.
In the main() function of paglo_logwatcher.py, after we set up various daemon properties and logging, we make use of some Paglo provided utility objects. Both of the objects are defined in the module paglo.daemon_utils. You can look at the source in our Subversion repository at: https://svn.paglo.com/paglo_open_source/paglo_for_python/trunk/paglo/daemon_utils.py
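For the /proc/meminfo part, the reading step mostly amounts to splitting “Name: value kB” lines. A rough sketch (Python 3) follows. parse_meminfo is a hypothetical name, not the get_meminfo() method in paglo_logwatcher.py, and the sample text is invented.

```python
def parse_meminfo(text):
    """Turn /proc/meminfo style text into a dict of name -> value string.

    Only the first field after the colon is kept, so a trailing unit
    such as "kB" is dropped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields:
            values[name.strip()] = fields[0]
    return values


# Invented sample; on Linux you would read open("/proc/meminfo").read().
sample = (
    "MemTotal:       16384256 kB\n"
    "MemFree:         1234567 kB\n"
    "HugePages_Total:       0\n"
)
meminfo = parse_meminfo(sample)
```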
We create an instance of the paglo.daemon_utils.FileSubmitter to submit a file of data every 60 seconds, if any data has been written. This is used by both the MaillogParser and the MemWatch objects since those are both going to submit PQL merge statements:
bulk_store_submitter = paglo.daemon_utils.FileSubmitter(
    options.directory,
    { 'processor' : 'bulk_store' },
    logger)
bulk_store_submitter.start_auto_submit(60)
With an instance of the paglo.daemon_utils.FileSubmitter class you specify the parameters to submit to the bulkSubmit API when you create it. You then call the object’s write() method to output whole lines. Either at some automatic interval, or whenever you choose, the file is closed and renamed so that our paglo_submitter.py process will pick it up and submit it to Paglo with the parameters you specified. A new file for further submissions is created automatically the next time you call write(). All you need to do is provide the data to write. For example, here is the snippet of code that submits a PQL ‘merge’ statement for updating data gathered from /proc/meminfo:
def submit(self):
    meminfo = self.get_meminfo()
    # Construct the sub-tree for the values in the 'meminfo' node
    # that we have gleaned from the OS.
    #
    meminfo = ("system",
               ("meminfo",
                ([("%s" % k, "%s" % v) for k,v in meminfo.iteritems()])))
    t = paglo.utils.device_merge_prefix(self.intf_info, meminfo)
    self.submitter.write(paglo.tree_builder.render_as_merge(t,datetime.utcnow()) + ";\n")
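The close-and-rename hand-off described for FileSubmitter can be sketched as follows (Python 3). This is not the paglo.daemon_utils implementation: the class name DropOffWriter, the “pending-” prefix, and the timestamp-based final name are assumptions, chosen so that an unfinished file never matches the digits-only pattern the submitter sweeps for.

```python
import os
import time
from urllib.parse import urlencode


class DropOffWriter(object):
    """Hypothetical stand-in for the write/submit cycle described above."""

    def __init__(self, directory, params):
        self.directory = directory
        self.header = urlencode(sorted(params.items())) + "\n"
        self.handle = None
        self.tmp_path = None
        self.size = 0
        self.counter = 0

    def write(self, line):
        # Lazily open a new file. Its name contains letters, so a sweeper
        # looking for digit/"_"/"-"/"." names ignores it for now.
        if self.handle is None:
            self.counter += 1
            self.tmp_path = os.path.join(
                self.directory, "pending-%d-%d" % (os.getpid(), self.counter))
            self.handle = open(self.tmp_path, "w")
            self.handle.write(self.header)
        self.handle.write(line)
        self.size += len(line)

    def submit(self):
        # Nothing written since the last submit: nothing to hand over.
        if self.handle is None:
            return None
        self.handle.close()
        final_path = os.path.join(
            self.directory,
            "%d_%d-%d" % (int(time.time()), os.getpid(), self.counter))
        # rename() is atomic on POSIX file systems, so the sweeper never
        # sees a half-written file under the final name.
        os.rename(self.tmp_path, final_path)
        self.handle = None
        self.size = 0
        return final_path
```

Writing under one name and renaming at the end is what lets the writer and the sweeper share a directory without any locking.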
The other Paglo provided utility object we use is the paglo.daemon_utils.LogWatcher class. The job of this class is to make it easy to watch some file. Whenever a file being watched by a LogWatcher gets a new line of data it will call the provided function. In the case of our log files this is the submitter.process_line() method of the LogSubmitter object defined in paglo_logwatcher.py.
submitter = LogSubmitter(options.directory, log_file_name, stype,
                         hostname, logger)
watcher = paglo.daemon_utils.LogWatcher(None,
                                        log_file_name,
                                        submitter.process_line,
                                        logger = logger)
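Following a file through rotation usually comes down to noticing that the path now points at a different file. The sketch below (Python 3) is a polling follower that compares inode numbers and calls a function for each new line. It is not how LogWatcher is implemented; the class name RotationAwareTail and its poll() method are hypothetical.

```python
import os


class RotationAwareTail(object):
    """Hypothetical polling follower: calls callback(line) for each new line."""

    def __init__(self, path, callback):
        self.path = path
        self.callback = callback
        self.buffer = b""
        self._open(seek_to_end=True)  # only report lines written from now on

    def _open(self, seek_to_end):
        # Binary mode keeps tell() a plain byte offset, comparable to st_size.
        self.handle = open(self.path, "rb")
        self.inode = os.fstat(self.handle.fileno()).st_ino
        if seek_to_end:
            self.handle.seek(0, os.SEEK_END)

    def _drain(self):
        # Deliver complete lines; keep a trailing partial line buffered.
        self.buffer += self.handle.read()
        lines = self.buffer.split(b"\n")
        self.buffer = lines.pop()
        for line in lines:
            self.callback(line.decode("utf-8", "replace") + "\n")

    def poll(self):
        """Call periodically, for example once a second."""
        self._drain()
        try:
            current = os.stat(self.path)
        except OSError:
            return  # rotated away and not recreated yet; retry next poll
        rotated = current.st_ino != self.inode
        truncated = current.st_size < self.handle.tell()
        if rotated or truncated:
            self._drain()  # pick up anything appended just before rotation
            if self.buffer:
                self.callback(self.buffer.decode("utf-8", "replace"))
                self.buffer = b""
            self.handle.close()
            self._open(seek_to_end=False)  # read the new file from the start
            self._drain()
```

A caller would construct it with something like a process_line method as the callback and invoke poll() from its main loop.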
The LogSubmitter class creates an instance of the paglo.daemon_utils.FileSubmitter class so that it can submit data to Paglo. The arguments for submitting a file are a bit different from the previous FileSubmitter we created. We specify processor to be ‘log’, and we provide the additional arguments needed by Paglo’s log processing to categorize the log messages we submit: source indicates the name of the file we are submitting log messages for. source_type is used by Paglo to determine how to parse the log messages, and finally host_name is used to classify what host produced this log message. We also set this FileSubmitter to automatically submit the file to the paglo_submitter.py process every minute if there is any data to submit.
self.submitter = paglo.daemon_utils.FileSubmitter(directory,
                                                  { 'processor' : 'log',
                                                    'source' : log_file,
                                                    'source_type': log_type,
                                                    'host_name' : hostname },
                                                  logger)
self.submitter.start_auto_submit(60)
The LogSubmitter class defines the process_line() method. This is the method that will be invoked by the paglo.daemon_utils.LogWatcher object whenever a watched file has a new line of data.
def process_line(self, line = None, ign = None):
    # Append our newly gotten line to our file.
    #
    self.submitter.write(line)
    # If the file is larger than <m> bytes submit it to paglo.
    #
    if self.submitter.size >= self.size_limit:
        self.submitter.submit()
    return
Running the Programs
All that is left is to decide what parameters to run these programs with and set up their environment. For the paglo_submitter.py we need to make sure that an API key file exists, and that a directory for file submissions exists.
The defaults for these are /usr/local/etc/paglo_api_key and /var/tmp/paglo_submitter. You should set the permissions on the submission directory so that paglo_submitter.py can delete files from it while other processes can still drop files for submission into it. I use the mode 1770, making sure that the directory is owned by the same uid that paglo_submitter.py is running as, and that the directory’s group is the same as the one the paglo_logwatcher.py script is running as.
With this you can just run paglo_submitter.py without any arguments.
The paglo_logwatcher.py script requires at least the list of log files to watch. For example on my OS X Macs I run the command:
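Setting up the directory can also be done from Python. This is a small sketch (Python 3); prepare_drop_directory is a hypothetical helper, and changing the owner and group is left out because it normally requires root.

```python
import os
import stat


def prepare_drop_directory(path, mode=0o1770):
    """Create the drop-off directory if needed and apply the given mode."""
    if not os.path.isdir(path):
        os.makedirs(path)
    # chmod explicitly: the mode passed to makedirs() is filtered by the umask.
    os.chmod(path, mode)
    # Report the permission bits that are actually in effect.
    return stat.S_IMODE(os.stat(path).st_mode)
```

The leading 1 in 1770 is the sticky bit. With it set, a file can be removed only by its owner, by the directory’s owner, or by root, which is why the directory should belong to the uid that paglo_submitter.py runs as.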
/usr/local/sbin/paglo_logwatcher.py \
--log_files=syslog:/var/log/install.log,syslog:/var/log/secure.log,\
syslog:/var/log/system.log,syslog:/var/log/windowserver.log,\
http_common:/var/log/cups/access_log,http_common:/var/log/cups/error_log,\
generic:/var/log/hdiejectd.log,http_common:/var/log/apache2/access_log,\
http_common:/var/log/apache2/error_log
The format of the log_files parameter is a comma separated list. Each element in the list is a tuple of “log file format” followed by “log file name” separated with a ‘:’ character.
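Splitting such a value into pairs is straightforward. The sketch below (Python 3) shows one way to do it; parse_log_files is a hypothetical name and not necessarily how paglo_logwatcher.py handles its option.

```python
def parse_log_files(value):
    """Split "format:path,format:path" into a list of (format, path) tuples."""
    pairs = []
    for element in value.split(","):
        element = element.strip()
        if not element:
            continue  # tolerate a trailing comma
        # Split on the first ":" only, so paths may contain colons.
        log_format, sep, path = element.partition(":")
        if not sep or not log_format or not path:
            raise ValueError("expected format:path, got %r" % element)
        pairs.append((log_format, path))
    return pairs


watched = parse_log_files(
    "syslog:/var/log/system.log,http_common:/var/log/apache2/access_log")
```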
If you watch the contents of /var/tmp/paglo_submitter/ you will see files appear as the logs you are monitoring change. After a minute, you will see each of these renamed to the pattern that paglo_submitter.py expects, and shortly after, it will vanish, indicating it has been sent up to Paglo for indexing.
Indexing these log files lets us do things like see when ‘Time Machine’ recently ran by using the query “backupd completed”.

Stay tuned for an upcoming article on how to use the API methods that Paglo provides for your use.


