A good old-fashioned Perl log analyzer - The Phoenix Trap - JOYK Joy of Geek, Geek News, Link all geek

A recent Lobsters post lauding the virtues of AWK reminded me that although the language is powerful and lightning-fast, I usually find myself exceeding its capabilities and reaching for Perl instead. One such application is analyzing voluminous log files such as the ones generated by this blog. Yes, WordPress has stats, but I’ve never let reinvention of the wheel get in the way of a good programming exercise.

So I whipped this script up on Sunday night while watching RuPaul’s Drag Race reruns. It parses my Apache web server log files and reports on hits from week to week.

#!/usr/bin/env perl

use strict;
use warnings;
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-\w]*sitemap[-\w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my ( %count, %week_of );
while (<>) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} = 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    my $dt  = DateTime::Format::HTTP->parse_datetime( $log{ts} );
    my $key = sprintf '%u-%02u', $dt->week;

    # get first date of each week
    $week_of{$key} ||= $dt->date;
    $count{$key}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

Here’s some sample output:

Week of 2021-07-31:      2,672
Week of 2021-08-02:     16,222
Week of 2021-08-09:     12,609
Week of 2021-08-16:     17,714
Week of 2021-08-23:     14,462
Week of 2021-08-30:     11,758
Week of 2021-09-06:     14,811
Week of 2021-09-13:        407

I first started prototyping this on the command line as if it were an awk one-liner by using the perl -n and -a flags. The former wraps code in a while loop over the <> “diamond operator”, processing each line from standard input or files passed as arguments. The latter splits the fields of the line into an array named @F. It looked something like this while I was listing URIs (locations on the website):

gunzip -c ~/logs/phoenixtrap.com-ssl_log-*.gz | \
perl -anE 'say $F[6]'

But once I realized I’d need to filter out a bunch of URI patterns and do some aggregation by date, I turned it into a script and turned to CPAN.

There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log format and its timestamp strings without having to write even more complicated regular expressions myself. (As noted above, this was already a wheel-reinvention exercise; no need to compound that further.)

Regexp::Log::Common builds a compiled regular expression based on the log format and fields you’re interested in, so that’s the constructor on lines 10 through 13. The expression then returns those fields as a list, which I’m assigning to a hash slice with those field names as keys in line 28. I then skip over requests that aren’t successful or browser cache hits, skip over requests that don’t GET web pages or other assets (e.g., POSTs to forms or updating other resources), and skip over the URI patterns mentioned earlier.

(Those patterns are worth a mention: they include the robots.txt and sitemap XML files used by search engine indexers, WordPress administration pages, files used by RSS newsreaders subscribed to my blog, and routes used by the Jetpack WordPress add-on. If you’re adapting this for your site you might need to customize this list based on what software you use to run it.)

Lines 37 and 38 parse the timestamp from the log into a DateTime object using DateTime::Format::HTTP and then build the key used to store the per-week hit count. The last lines of the loop then grab the first date of each new week (assuming the log is in chronological order) and increment the count. Once finished, lines 45 and 46 provide a report sorted by week, displaying it as a friendly “Week of date” and the hit counts aligned to the right with sprintf. Number::Format’s format_number function displays the totals with thousands separators.

Room for improvement

DateTime is a very powerful module but this comes at a price of speed and memory. Something simpler like Date::WeekNumber should yield performance improvements, especially as my logs grow (here’s hoping). It requires a bit more manual massaging of the log dates to convert them into something the module can use, though:

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct 'regex-named-capture-group';
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-\w]*sitemap[-\w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my %month = (
    Jan => '01',
    Feb => '02',
    Mar => '03',
    Apr => '04',
    May => '05',
    Jun => '06',
    Jul => '07',
    Aug => '08',
    Sep => '09',
    Oct => '10',
    Nov => '11',
    Dec => '12',
);

my ( %count, %week_of );
while (<>) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} = 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    # convert log timestamp to YYYY-MM-DD
    # for Date::WeekNumber
    $log{ts} =~ m!^
      (?<day>\d\d) /
      (?<month>...) /
      (?<year>\d{4}) : !x;
    my $date = "$+{year}-$month{ $+{month} }-$+{day}";

    my $week = iso_week_number($date);
    $week_of{$week} ||= $date;
    $count{$week}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

It looks almost the same as the first version, with the addition of a hash to convert month names to numbers and the actual conversion (using named regular expression capture groups for readability, using Syntax::Construct to check for that feature). On my server, this results in a ten- to eleven-second savings when processing two months of compressed logs.

What’s next? Pretty graphs? Drilling down to specific blog posts? Database storage for further queries and analysis? Perl and CPAN make it possible to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.

Like this:

Loading…

A good old-fashioned Perl log analyzer - The Phoenix Trap

Post navigation

Room for improvement

Like this:

Related

Recommend

App Annie 8 月中国厂商及应用出海收入榜：《PUBG MOBILE》第一《原神》第二

Why even giant ships can't solve the shipping crisis

Love Island star Amy Hart says she got death threat from 13-year-old

Amazon CEO Andy Jassy urges Congress to raise debt ceiling

Apple iPhone 13 brings portrait mode for video

Hushpuppi: Social media influencer who funded his luxurious lifestyle by cybercr...

Tesla Autopilot to be compared with 12 other systems in NHTSA probe

Ex-Love Islander Amy Hart on online abuse and trolling

中青旅：在线旅游行业如何选型数据分析平台？

What have people used for an E-mail send relay?

About Joyk