The Power of Correlating Data with PQL Queries
Scanner
08/28/2009 at 16:35

Using the Service Health app, we observed a problem: a mail server was failing to respond to SMTP almost more often than it was up! Although this would not lose mail, it could seriously delay messages sent through the service, so we had to figure out why it was happening. Looking through the data for the company, we started to see a pattern. Now, if only we had a way to correlate the different bits of historical data we had gathered.


The red bars show where the service was down. A short red bar indicates that it was only down for a part of that time interval.

One of the big advantages of Paglo is that it gathers so many different kinds of data. Since it is all stored in a single tree-structured database for each company, you can correlate different pieces of data. Our SQL-like query language, PQL, allows you to extract disparate pieces of information and plot them together in the same chart.

For example, on Unix hosts we can gather the 1-minute, 5-minute, and 15-minute load averages over time. We also gather the status of various network services offered by a host. You can learn more about the Server Health app in our earlier blog post Server Health, the Service Health Crawler Plugin, Alerts, and You.

One service that we monitor is SMTP. One of the most common Mail Transfer Agent programs on Unix is sendmail. sendmail has an anti-denial-of-service feature: if the machine it is running on is too busy, it will not answer requests, effectively appearing to be down to other hosts. So we should be able to correlate the time periods when SMTP on the host is down with the host's load average. Here we unleash the power of PQL:

select service_id||' on '||../../../../../system/dns_name as caption,
mean(constant(success))*10 as success,
../../../../../system/stats/loadavg_1min
from /network/device[system/dns_name='<your>']/apps/com/paglo/service_health/service
where service_id='smtp'
history from '1 day ago' to 'now'

We are not going to go into too much detail on the specifics of the PQL statement, but in short we are selecting a service ID combined with the host's DNS name, the 'success' state of that service as monitored by the Service Health Plugin in the crawler (multiplied by 10), and the 1-minute load average of that host. We want these values over the history from one day ago to now.
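To make the `mean(...)*10` term more concrete, here is a minimal Python sketch of the kind of calculation it suggests. This is not Paglo code, and we are not claiming this is how PQL's `mean` or `constant` functions work internally. It assumes that each probe records success as 1 or 0, that samples are averaged per time interval, and that the average is then scaled by 10. The function name, the interval length, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def scaled_success_by_interval(samples, interval_seconds=3600, scale=10):
    """Average 0/1 success samples per time bucket and scale the result.

    samples: iterable of (timestamp_in_seconds, success) pairs, where
    success is 1 if the service answered and 0 if it did not.
    Returns {bucket_start: mean_success * scale}.
    """
    buckets = defaultdict(list)
    for timestamp, success in samples:
        # Round the timestamp down to the start of its interval.
        bucket_start = timestamp - (timestamp % interval_seconds)
        buckets[bucket_start].append(success)
    return {start: mean(values) * scale for start, values in sorted(buckets.items())}


# Hypothetical probe results: (seconds since start, 1 = answered, 0 = no answer)
samples = [(0, 1), (600, 1), (1200, 0), (3600, 0), (4200, 0), (7200, 1)]
scaled = scaled_success_by_interval(samples)
```

Under these assumptions, an interval in which every probe succeeded comes out at 10, one in which every probe failed comes out at 0, and a partly failing interval lands somewhere in between.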

This query gives us a large table of values. But if we click on the link to convert this to a chart, for a host that is frequently very busy we may see a chart something like this:

The blue line shows the 1-minute load average of the host over the past day. The red line, when it is at '10', represents when the host is accepting mail via SMTP (the query multiplies the success value by 10, so a fully successful interval plots at 10). Whenever the red line dips down to 0, the host is refusing mail.

You can quickly see that whenever the load average is below 10, the host accepts mail, and when it is above 10, it refuses mail. In many other charting and reporting systems, gathering the right data and then generating such a chart would take a fair amount of custom code. With Paglo's rich data gathering and PQL, it took us only a few minutes to come up with a query to chart the correlated data.
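If you export the table of values rather than reading the chart by eye, the same check can be done in a few lines. The following is a rough Python sketch, not part of Paglo: the function name and the sample data are hypothetical, and it assumes you already have pairs of (1-minute load average, whether SMTP was accepting) for each interval. It finds the highest load at which mail was still accepted and the lowest load at which it was refused; if the first is below the second, a single load threshold between them explains the outages.

```python
def refusal_threshold_bounds(points):
    """Return (highest load while accepting, lowest load while refusing).

    points: iterable of (load_average, accepting) pairs, accepting a bool.
    """
    accepting_loads = [load for load, accepting in points if accepting]
    refusing_loads = [load for load, accepting in points if not accepting]
    if not accepting_loads or not refusing_loads:
        raise ValueError("need at least one accepting and one refusing sample")
    return max(accepting_loads), min(refusing_loads)


# Hypothetical samples: (1-minute load average, SMTP accepting?)
points = [
    (3.2, True), (8.9, True), (11.4, False),
    (14.0, False), (9.7, True), (10.6, False),
]
highest_accepting, lowest_refusing = refusal_threshold_bounds(points)
print(highest_accepting, lowest_refusing)  # prints: 9.7 10.6
```

With this made-up data the two bounds bracket 10, which is the pattern the chart above shows for our busy mail host.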

For more information on PQL and how to use it, consult our PQL documentation.
