Reading Apache Logs with Python

This post parses and reads Apache (httpd) logs with Python. More specifically, the logs come from a Sakura Internet rental server.

Apache Configuration on the Sakura Internet Rental Server

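The official site explains the setup. Quoting from it, the following items are recorded in the log. The recorded logs are gzip-compressed, and log rotation discards the oldest files first.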

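Domain name that was accessed (*)
Host name or IP address of the client
Client-side user name (usually blank)
User name used for authentication (usually blank)
Date and time of access
Request sent by the client
Status code returned for the request
Amount of data actually transferred
Referer (the link the visitor followed)
User agent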

(*) Whether this item is recorded is configurable. This article assumes that it is recorded.
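Because the logs are stored gzip-compressed, they can be read from Python without decompressing them first. The following is a minimal sketch using only the standard library. The function name iter_log_lines, the demo file name, the placeholder lines, and the UTF-8 encoding are assumptions made for illustration.

```python
import gzip
import os
import tempfile


def iter_log_lines(path):
    """Yield one decoded log line at a time from a gzip-compressed file."""
    # errors='replace' keeps going if a request contains undecodable bytes.
    with gzip.open(path, mode='rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            yield line.rstrip('\n')


# Self-contained demonstration with a temporary file and placeholder lines.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'access_log_demo.gz')
    with gzip.open(demo_path, mode='wt', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')
    lines = list(iter_log_lines(demo_path))

print(lines)
```

Reading line by line like this avoids holding a whole day's log in memory, which may matter for busy sites.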

Parsing the Log with Python

If you open a raw log file, the data looks like a series of fields separated by spaces.
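Splitting on spaces alone is not quite enough, though, because the timestamp and the quoted fields contain spaces themselves. The sketch below shows what hand-parsing might look like with only the standard library. The sample line, the regular expression, and the group names are assumptions made up for illustration, and a line containing escaped quotes would not match this pattern.

```python
import re

# Hypothetical log line in the format described above (not from a real log).
sample = (
    'example.com 192.0.2.1 - - [28/Jul/2019:00:00:10 +0900] '
    '"GET /index.html HTTP/1.1" 200 1234 "-" '
    '"Mozilla/5.0 (compatible; ExampleBot/1.0)"'
)

# A plain split cuts through the timestamp, the request, and the user agent.
naive_fields = sample.split(' ')

# One named group per field; quoted fields run up to the closing quote.
LOG_RE = re.compile(
    r'(?P<server_name>\S+) (?P<remote_host>\S+) (?P<remote_logname>\S+) '
    r'(?P<remote_user>\S+) \[(?P<time_received>[^\]]+)\] '
    r'"(?P<request_first_line>[^"]*)" (?P<status>\d{3}) '
    r'(?P<response_bytes>\S+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = LOG_RE.match(sample)
fields = match.groupdict() if match else {}

print(len(naive_fields), 'pieces from a naive split')
print(fields['request_first_line'], fields['status'])
```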

You could parse it yourself, but there is a handy module called apache_log_parser, so I will use that.

> pip install apache_log_parser

import apache_log_parser

line_parser = apache_log_parser.make_parser(pattern)
ret = line_parser(log)

Variable	Type	Description
pattern	str	Apache log format string.
log	str	One line of the Apache log.
ret	dict	The result of parsing the log line.

Pass the Apache log format string to the make_parser function to create a line_parser object, then pass a log line to that parser. It returns the parsed result as a dictionary.

So how should the format string be specified?

The syntax of the Apache log format is documented online. I will map it onto the description from Sakura Internet's official site.

How about this?

%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
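To spell out the mapping behind this string, the sketch below pairs each directive with the item it stands for and joins them in log order. The descriptions paraphrase the item list from the hosting provider; the names DIRECTIVES and log_format are assumptions introduced for illustration.

```python
# Each Apache directive paired with the item it records, in log order.
DIRECTIVES = [
    ('%v', 'domain name that was accessed'),
    ('%h', 'host name or IP address of the client'),
    ('%l', 'client-side user name'),
    ('%u', 'user name used for authentication'),
    ('%t', 'date and time of access'),
    ('"%r"', 'first line of the request'),
    ('%>s', 'final status code'),
    ('%b', 'bytes transferred'),
    ('"%{Referer}i"', 'Referer request header'),
    ('"%{User-Agent}i"', 'User-Agent request header'),
]

# Joining the directives with single spaces rebuilds the format string.
log_format = ' '.join(directive for directive, _ in DIRECTIVES)
print(log_format)
```

The backslashes in the string shown above are only needed when the format is written inside a quoted Apache directive. In a single-quoted Python string, \" is simply a double quote, so both spellings produce the same text.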

Let's try it.

I will take the first line from a web server log file on the Sakura rental server, parse it, and print the result.

import gzip
import apache_log_parser
import pprint

with gzip.open('access_log_20190728.gz', mode='rt') as f:
    log_data = f.read().splitlines()

line_parser = apache_log_parser.make_parser('%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"')

pprint.pprint(line_parser(log_data[0]))

The output looks like this.

{'remote_host': 'xx.xxx.xx.xxx',
 'remote_logname': '-',
 'remote_user': '-',
 'request_first_line': 'GET /rum/post/python_trim_csv_row/ HTTP/1.1',
 'request_header_referer': '-',
 'request_header_user_agent': 'Mozilla/5.0 (compatible; xxxxxxbot/2.1; '
                              '+http://www.sample.com/bot.html)',
 'request_header_user_agent__browser__family': 'xxxxxbot',
 'request_header_user_agent__browser__version_string': '2.1',
 'request_header_user_agent__is_mobile': False,
 'request_header_user_agent__os__family': 'Other',
 'request_header_user_agent__os__version_string': '',
 'request_http_ver': '1.1',
 'request_method': 'GET',
 'request_url': '/rum/post/python_trim_csv_row/',
 'request_url_fragment': '',
 'request_url_hostname': None,
 'request_url_netloc': '',
 'request_url_password': None,
 'request_url_path': '/rum/post/python_trim_csv_row/',
 'request_url_port': None,
 'request_url_query': '',
 'request_url_query_dict': {},
 'request_url_query_list': [],
 'request_url_query_simple_dict': {},
 'request_url_scheme': '',
 'request_url_username': None,
 'response_bytes_clf': '39015',
 'server_name': 'water2litter.net',
 'status': '200',
 'time_received': '[28/Jul/2019:00:00:10 +0900]',
 'time_received_datetimeobj': datetime.datetime(2019, 7, 28, 0, 0, 10),
 'time_received_isoformat': '2019-07-28T00:00:10',
 'time_received_tz_datetimeobj': datetime.datetime(2019, 7, 28, 0, 0, 10, tzinfo='0900'),
 'time_received_tz_isoformat': '2019-07-28T00:00:10+09:00',
 'time_received_utc_datetimeobj': datetime.datetime(2019, 7, 27, 15, 0, 10, tzinfo='0000'),
 'time_received_utc_isoformat': '2019-07-27T15:00:10+00:00'}

Looks like a bot dropped by.

The result is a dictionary, so you can get a value by specifying its key.
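As a small illustration of key access, the sketch below uses a trimmed-down copy of the parsed result shown above rather than calling the parser. Note that the status and byte count come back as strings in that output, so they are converted before use. The variable names are assumptions made for illustration.

```python
from datetime import datetime

# A trimmed-down copy of the parsed result shown above.
parsed = {
    'remote_host': 'xx.xxx.xx.xxx',
    'request_method': 'GET',
    'request_url_path': '/rum/post/python_trim_csv_row/',
    'response_bytes_clf': '39015',
    'status': '200',
    'time_received_isoformat': '2019-07-28T00:00:10',
}

status = int(parsed['status'])

# %b logs '-' instead of 0 when no body was sent, so guard the conversion.
raw_size = parsed['response_bytes_clf']
size = 0 if raw_size == '-' else int(raw_size)

received = datetime.fromisoformat(parsed['time_received_isoformat'])

print(parsed['request_url_path'], status, size, received.hour)
```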

Next, I will extract the log entries that returned 404 and count how many times 404 was returned on this day.

import gzip
import apache_log_parser

with gzip.open('access_log_20190728.gz', mode='rt') as f:
    log_data = f.read().splitlines()

line_parser = apache_log_parser.make_parser('%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"')

# Parse every line of the log.
logs = [line_parser(line) for line in log_data]

# Keep only the entries whose status code is 404 and count them.
not_found = [entry for entry in logs if entry['status'] == '404']
print(len(not_found))

Huh, that is quite a lot.
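One way to follow up on a high count is to tally which paths produced the 404 responses. The sketch below does this with collections.Counter. The entries are hypothetical stand-ins for the logs list built above, using the same status and request_url_path keys that appear in the parsed output.

```python
from collections import Counter

# Hypothetical parsed entries standing in for the `logs` list built above.
logs = [
    {'status': '404', 'request_url_path': '/wp-login.php'},
    {'status': '200', 'request_url_path': '/'},
    {'status': '404', 'request_url_path': '/wp-login.php'},
    {'status': '404', 'request_url_path': '/favicon.ico'},
]

# How often each status code occurred.
status_counts = Counter(entry['status'] for entry in logs)

# Which paths were requested in the 404 responses, most frequent first.
not_found_paths = Counter(
    entry['request_url_path'] for entry in logs if entry['status'] == '404'
)

print(status_counts.most_common())
print(not_found_paths.most_common(10))
```

Sorting by frequency makes it easier to tell whether the 404s come from a few missing pages or from many different probing requests.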


Published 2019-09-09
