HomeSection_sub_breakBlogsSection_sub_breakTechnical Blog
Icon_blog_forum_small Using consolidation queries to analyze large log data sets
Icon_person
Chris
Icon_time
07/01/2009 at 11:44
Icon_post
0 comments

My previous post showed how PQL can be used with log searches to perform sophisticated analysis. If you are dealing with up to a few hundred thousand log events then queries can complete within a second or so. However with larger log data sets it may take many minutes to perform a PQL query that involves tens of millions of events. This makes these queries impractical to use in dashboards where you want to be able to see and navigate the data more quickly. To support analytics over massive data sets Paglo support “consolidation queries”.

A consolidation query is a PQL statement which executes a PQL query and stores the results back into PQL. This is analogous to the practice in SQL of storing intermediate results into a temporary table. The syntax of consolidation queries follows the form:


  consolidate into pql_path (key_field*) pql_select_statement

The consolidation query executes the PQL select statement and stores the results at the point in the tree specified by the PQL path. Optional key fields can be specified to created a nested structure. Consolidation queries run periodically (approximately every 15 minutes). Each time a consolidation query runs history is maintained in the resulting tree so that PQL history statements can be used to see how data has changed over time. As a special case, if one of the columns in the query is named ‘timestamp’ then that column will not be stored, but instead will be used to timestamp the data history for each of the other columns. If the select statement includes a log search then the select statement only considers events which have been indexed since the previous time the consolidation statement was run.

Some examples might make all of this clearer.

To compute the number of log events received with a resolution of 5 minutes a query like this could be used:


  select timestamp(round(timestamp(timestamp), '5 minutes')) as timestamp, count(*) as total_volume 
  from %*% 
  group by timestamp(round(timestamp(timestamp), '5 minutes'))

Since summing across tens of millions of events can be time consuming we can use a consolidation query to pre-compute the result for us, and also allow us to use PQL’s HISTORY syntax to then chart the result. This query


  consolidate into /consolidation/log_volume 
  select timestamp(round(timestamp(timestamp), '5 minutes')) as timestamp, count(*) as total_volume 
  from %*% 
  group by timestamp(round(timestamp(timestamp), '5 minutes'))

computes the log volume and stores it into the single tree node /consolidation/log_volume/total_volume but with timestamped history every 5 minutes. We can then use another PQL query to see the log volume over time:


  select sum(total_volume) as vol
  from /consolidation/log_volume
  history to 'now' from '1 week ago'

This query only takes a fraction of a second regardless of how many log events are involved because it is dealing with the aggregated data, not the raw events any more. To store the log volume by source we need to key on the source field:


  consolidate into /consolidation/log_volume/sources 
  keys (source) 
  select timestamp(round(timestamp(timestamp), '5 minutes')) as timestamp, source, count(*) as total_volume 
  from %*% 
  group by timestamp(round(timestamp(timestamp), '5 minutes')), source

This will create multiple /consolidation/log_volume/sources nodes, each one with source and total_volume sub nodes containing the selected values. Then we can use a query like this to get the history of a source:


  select sum(total_volume) as vol
  from /consolidation/log_volume/sources[source = 'source::/var/log/messages’]
  history to 'now' from '1 week ago'

So consolidation queries can be used to distill massive log data sets down to summary information that can be viewed in real-time.

Add a Comment