Monitoring a Hadoop Cluster with Nagios


HP Japan Open Source/Linux Technical Document


Hewlett-Packard Japan, Ltd. April 5, 2011


Contents
[About This Document]
System Configuration
Preparing the Hadoop Cluster
Installing Nagios (NN)
Installing the Nagios Plugins
Testing the Nagios Plugins (NN)
Monitoring in the Nagios GUI


List of Figures
Figure 1. Example system configuration of Cloudera Distribution for Hadoop on HP ProLiant SL6500
Figure 2. Monitoring a Hadoop cluster with Nagios
Figure 3. Nagios detecting a DataNode service failure
Figure 4. Nagios detecting a tasktracker service failure on a DataNode


[About This Document]

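In this document, work performed only on the NameNode server is marked (NN), work performed only on the DataNode servers is marked (DN), and work performed on both the NameNode and DataNode servers is marked (NN,DN).

Example 1) "Copy the file. (NN)" means that the file is copied on the NameNode only.
Example 2) "Install the package with the rpm command. (NN,DN)" means that the package is installed with the rpm command on both the NameNode and the DataNodes.

When a long command line has to be wrapped to fit the page, a "\" symbol is inserted and the command is shown across multiple lines, as below. Where a command is shown on multiple lines but must actually be entered as a single line, "(enter as a single line)" is appended at the end.

Example 3)
# alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf \
/etc/hadoop-0.20/conf.hp001 20 (enter as a single line)

The contents of this document have been checked carefully, but their accuracy is not guaranteed. The contents may also change in the future without notice. Users are responsible for any results arising from the use of this document. Hewlett-Packard Japan, Ltd. accepts no responsibility for the contents of this document. The figures in the technical information in this document can vary considerably with the operating environment, including hardware configuration, OS, and applications, so we strongly recommend carrying out sufficient testing of your own. Company names, service names, product names, and the like shown in this document are trademarks or registered trademarks of their respective owners. The materials provided in this document are protected by copyright under the Copyright Act of Japan, treaties, and the copyright laws of other countries.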


This document is a guide to installing Nagios on a system in which Cloudera Distribution for Hadoop 3 beta 4 (commonly known as CDH3b4) has been set up on HP ProLiant servers, and to monitoring the Hadoop cluster with it. CDH3b4 is the latest beta release at the time of writing. The Nagios plugins have been partially modified so that the Hadoop environment can be monitored correctly. Note that using a CDH beta release and the modified Nagios plugins in commercial production requires careful evaluation.

System Configuration

The environment in which CDH is installed is as follows.

Hardware: HP ProLiant SL6500 (Server: HP ProLiant SE2170s x8)
OS: Red Hat Enterprise Linux 5.6 x86-64 (NameNode and DataNode)
JDK: 1.6u24 x64 (package provided by Oracle)
CDH: CDH3b4 (RPM packages for Red Hat-based operating systems)
Nagios: 3.2.3
Nagios plugins: 1.4.15

The hardware is shown below. This configuration does not take NameNode availability into account, so be aware that the NameNode is a single point of failure (SPOF): data is lost if the NameNode fails.

Figure 1. Example system configuration of Cloudera Distribution for Hadoop on HP ProLiant SL6500

- Proof of Concept configuration
- A single Hadoop NameNode
- The NameNode also serves as the JobTracker
- No NameNode availability
- System availability can be covered by HDFS replicas, but this configuration is intended for functional verification where data loss is acceptable
- SATA disks are installed to secure capacity
- Consider SAS disks to reduce how often failed disks must be replaced
- Use a Top of Rack switch with a large number of Gigabit ports
- The HP A5800 provides 48 ports


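Preparing the Hadoop Cluster

Before carrying out the Nagios configuration steps in this document, the Hadoop cluster must be installed and prepared. For Hadoop cluster setup, refer to the separate document "Cloudera Distribution for Hadoop Installation Procedure and Usage Examples" and complete the setup in advance.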

Installing Nagios (NN)

Obtain Nagios and its related packages from the download sites at the following URLs.

http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/RPMS/fping-2.4-1.b2.2.el5.rf.x86_64.rpm
ftp://fr2.rpmfind.net/linux/dag/redhat/el5/en/x86_64/dag/RPMS/perl-Crypt-DES-2.05-3.2.el5.rf.x86_64.rpm
http://apt.sw.be/redhat/el5/en/i386/rpmforge/RPMS/perl-Net-SNMP-5.2.0-1.2.el5.rf.noarch.rpm
ftp://fr2.rpmfind.net/linux/dag/redhat/el5/en/x86_64/dag/RPMS/nagios-3.2.3-3.el5.rf.x86_64.rpm
http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/RPMS/nagios-plugins-1.4.15-2.el5.rf.x86_64.rpm

Install Nagios on the NameNode with the rpm command. (NN)

# rpm -vhi fping-2.4-1.b2.2.el5.rf.x86_64.rpm
# rpm -vhi perl-Crypt-DES-2.05-3.2.el5.rf.x86_64.rpm
# rpm -vhi perl-Net-SNMP-5.2.0-1.2.el5.rf.noarch.rpm
# rpm -vhi nagios-3.2.3-3.el5.rf.x86_64.rpm
# rpm -vhi nagios-plugins-1.4.15-2.el5.rf.x86_64.rpm

Installing the Nagios Plugins

Download the Nagios plugins from the following URL.

http://exchange.nagios.org/directory/Plugins/Others

The plugins to download are as follows.

check_hadoop-datanode.pl
check_hadoop_hdfs.sh
check_hadoop_tasktracker.pl
check_hadoop_metrics.sh
get-dfsreport.sh
check_port.pl

# ls -1 /root/NagiosPlugins
check_hadoop-datanode.pl
check_hadoop_hdfs.sh
check_hadoop_metrics.sh
check_hadoop_tasktracker.pl
get-dfsreport.sh
check_port.pl


Configure the Nagios plugins on the Hadoop NameNode. (NN)

# cd /etc/nagios
# cp nagios.cfg nagios.cfg.org
# vi nagios.cfg
...
cfg_dir=/etc/nagios/servers
...
date_format=iso8601
...

Create the configuration file for the Hadoop NameNode. Note that the check_hadoop_metrics settings, which check the HDFS made up of the DataNodes, are written in this NameNode configuration file. The check_hadoop_metrics script retrieves the NameNode metrics from port 50070 of the NameNode (hd01 here) and reads the value of CapacityRemainingGB (the remaining HDFS disk capacity) from them. The check_hadoop_metrics entry in hd01.cfg below is an example that reports a warning to Nagios when the remaining HDFS disk capacity falls below 998 GB and a critical state when it falls below 300 GB.

# mkdir /etc/nagios/servers
# cd /etc/nagios/servers
# vi hd01.cfg
define host {
    use        linux-server
    host_name  hd01
    alias      hd01
    address    172.16.1.1
}

define service {
    use                  generic-service
    host_name            hd01
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}

# Specify a name node on check_hadoop_datanode.
# Modify value 0 and 1 in proportion to the number of data nodes.
define service {
    use                  generic-service
    host_name            hd01
    service_description  DATANODE
    check_command        check_hadoop_datanode!hd01!0!1
}

# Specify Name Node hd01 for monitoring the HDFS.
define service{
    use                    generic-service
    host_name              hd01
    service_description    HDFS
    check_command          check_hadoop_hdfs
    notifications_enabled  1
}

# This example is that the hadoop secondary name node is hd01.
define service{
    use                    generic-service
    host_name              hd01
    service_description    SECONDARY
    check_command          secondarynamenode_check
    notifications_enabled  1
}

# Specify the name node (hd01) on check_hadoop_tasktracker
# for monitoring tasktracker that runs on data nodes.
# Modify value 0 and 1 in proportion to the number of data nodes.
define service {
    use                  generic-service
    host_name            hd01
    service_description  TASKTRACKER
    check_command        check_hadoop_tasktracker!hd01!0!1
}

define service {
    use                  generic-service
    host_name            hd01
    service_description  HDFS DISK USAGE
    check_command        check_hadoop_metrics!hd01!50070!CapacityRemainingGB!lt!998!300
}
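As a rough illustration of how the lt (less-than) bias in the HDFS DISK USAGE service maps a CapacityRemainingGB value to a Nagios state, the following sketch applies the thresholds from hd01.cfg (warning 998, critical 300). The function name evaluate_lt and the sample values are assumptions made for this example; it mirrors the comparison logic of the plugin but is not the plugin itself.

```shell
#!/bin/bash
# Sketch of the "lt" (less-than) evaluation. evaluate_lt is a hypothetical
# helper that mirrors the plugin's comparisons; it is not the plugin itself.
evaluate_lt() {
    local value="$1" warn="$2" crit="$3"
    if [ "$crit" -ge "$value" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$warn" -ge "$value" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}

# Thresholds from hd01.cfg: warning 998, critical 300.
# The remaining-capacity values below are made-up samples.
for remaining in 1200 998 500 300 120; do
    state=$(evaluate_lt "$remaining" 998 300) || true
    echo "CapacityRemainingGB=${remaining} -> ${state}"
done
```

With these thresholds the sketch reports OK for 1200, WARNING for 998 and 500, and CRITICAL for 300 and 120. Because the comparison is -ge, a value exactly equal to a threshold already triggers that state.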

Create the configuration files for the Hadoop DataNodes. Note that the check_hadoop_metrics settings, which check the HDFS made up of the DataNodes, are not written in the DataNode configuration files but in the NameNode configuration file hd01.cfg.

# pwd
/etc/nagios/servers
# vi hd02.cfg
define host {
    use        linux-server
    host_name  hd02
    alias      hd02
    address    172.16.1.2
}

define service {
    use                  generic-service
    host_name            hd02
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}

define service {
    use                  generic-service
    host_name            hd02
    service_description  DATANODE PORT
    check_command        check_hadoop_datanode_port
}

define service {
    use                  generic-service
    host_name            hd02
    service_description  DATANODE TASKTRACKER PORT
    check_command        check_hadoop_datanode_tasktracker_port
}

Create configuration files for DataNodes hd03 through hd08 in the same way as hd02.cfg above.

# vi hd03.cfg
# vi hd04.cfg
...
# vi hd08.cfg

Define the commands for the Nagios plugins. The definitions go in the configuration file commands.cfg. (NN)

# pwd
/etc/nagios/objects/
# vi commands.cfg
...

# 'check_hadoop_hdfs' command definition
define command{
    command_name  check_hadoop_hdfs
    command_line  $USER1$/check_hadoop_hdfs
}

# 'check_hadoop_metrics' command definition
define command{
    command_name  check_hadoop_metrics
    command_line  $USER1$/check_hadoop_metrics -H $ARG1$ -p $ARG2$ -C $ARG3$ -b $ARG4$ -w $ARG5$ -c $ARG6$
}

# 'secondarynamenode_check' command definition
define command{
    command_name  secondarynamenode_check
    command_line  $USER1$/secondarynamenode_check
}

# 'check_hadoop_datanode' command definition
define command{
    command_name  check_hadoop_datanode
    command_line  $USER1$/check_hadoop_datanode $ARG1$ $ARG2$ $ARG3$
}

# 'check_hadoop_tasktracker' command definition
define command{
    command_name  check_hadoop_tasktracker
    command_line  $USER1$/check_hadoop_tasktracker $ARG1$ $ARG2$ $ARG3$
}

# Check Hadoop datanode port 50075
define command{
    command_name  check_hadoop_datanode_port
    command_line  $USER1$/check_port -p 50075 -h $HOSTADDRESS$ -c 1.5 -w 1.0
}

# Check Hadoop datanode's tasktracker port 50060
define command{
    command_name  check_hadoop_datanode_tasktracker_port
    command_line  $USER1$/check_port -p 50060 -h $HOSTADDRESS$ -c 1.5 -w 1.0
}

Copy the Nagios plugin files into the Nagios plugin directory. The plugin saved in the destination directory plugins is named check_hadoop_hdfs. (NN)

# cd /root/NagiosPlugins/
# cp check_hadoop_hdfs.sh /usr/lib64/nagios/plugins/check_hadoop_hdfs
# chmod 755 /usr/lib64/nagios/plugins/check_hadoop_hdfs
# dos2unix /usr/lib64/nagios/plugins/check_hadoop_hdfs
# ls -l /usr/lib64/nagios/plugins/check_hadoop_hdfs

-rwxr-xr-x 1 root root 216 Feb 25 19:49 /usr/lib64/nagios/plugins/check_hadoop_hdfs
# cp secondarynamenode_check.sh /usr/lib64/nagios/plugins/secondarynamenode_check
# chmod 755 /usr/lib64/nagios/plugins/secondarynamenode_check
# dos2unix /usr/lib64/nagios/plugins/secondarynamenode_check
# ls -l /usr/lib64/nagios/plugins/secondarynamenode_check
-rwxr-xr-x 1 root root 309 Feb 25 19:51 /usr/lib64/nagios/plugins/secondarynamenode_check
# cp check_hadoop_datanode.pl /usr/lib64/nagios/plugins/check_hadoop_datanode
# chmod 755 /usr/lib64/nagios/plugins/check_hadoop_datanode
# ls -l /usr/lib64/nagios/plugins/check_hadoop_datanode
-rwxr-xr-x 1 root root 1416 Feb 25 19:53 /usr/lib64/nagios/plugins/check_hadoop_datanode
# cp check_hadoop_tasktracker.pl /usr/lib64/nagios/plugins/check_hadoop_tasktracker
# chmod 755 /usr/lib64/nagios/plugins/check_hadoop_tasktracker
# ls -l /usr/lib64/nagios/plugins/check_hadoop_tasktracker
-rwxr-xr-x 1 root root 1389 Feb 25 19:55 /usr/lib64/nagios/plugins/check_hadoop_tasktracker

Modify the Nagios plugin check_port.pl by adding "exit $rc;" as its last line. This modification to the currently released check_port.pl plugin is required for Nagios to monitor Hadoop correctly.


# cd /root/NagiosPlugins/
# vi check_port.pl
...
exit $rc;

Check the contents of the Nagios plugin check_port.pl.

# cat check_port.pl
#!/usr/bin/perl
use strict;
use warnings;
use Socket;
use Getopt::Long;
use Time::HiRes qw(gettimeofday tv_interval);

my ($crit, $warn, $timeout, $host, $portnum, $verbose);
GetOptions(
    'critical=s' => \$crit,
    'warning=s'  => \$warn,
    'timeout=s'  => \$timeout,
    'host=s'     => \$host,
    'port=s'     => \$portnum,
    'verbose'    => \$verbose) or HELP_MESSAGE();

# Try a TCP connection; return (1, elapsed) on success, (0, elapsed) on failure.
sub testport {
    my ($host,$port,$protocol,$timeout) = @_;
    my $startsec;
    my $elapsed;
    if (!defined $timeout) { $timeout = 10 }
    if (!defined $protocol) { $protocol = "tcp" }
    my $proto = getprotobyname($protocol);
    my $iaddr = inet_aton($host);
    my $paddr = sockaddr_in($port, $iaddr);
    $startsec = [gettimeofday()];
    socket(SOCKET, PF_INET, SOCK_STREAM, $proto) or die "socket: $!";
    eval {
        local $SIG{ALRM} = sub { die "timeout" };
        alarm($timeout);
        # A failed connect raises an error that the enclosing eval catches
        connect(SOCKET, $paddr) or die "connect: $!";
        alarm(0);
    };
    if ($@) {
        close SOCKET || die "close: $!";
        $elapsed = tv_interval ($startsec, [gettimeofday]);
        return 0,$elapsed;
    } else {
        close SOCKET || die "close: $!";
        $elapsed = tv_interval ($startsec, [gettimeofday]);
        return 1,$elapsed;
    }
}

sub HELP_MESSAGE {
    print "$0 -p <port> -h <host> (-c <critical> -w <warning> -v)\n";
    print "\t -p <port> # port number to examine\n";
    print "\t -h <hostname> # hostname or ip address to contact\n";
    print "\t -c <seconds> # the number of seconds to wait before a going critical\n";
    print "\t -w <seconds> # the number of seconds to wait before a flagging a warning\n";
    print "\t -v # displays nagios performance information\n";
    print "\te.g $0 -p 80 -h www.google.com -c 1.5 -w 1.0 -v\n";
    exit(4);
}

sub printperf {
    my ($warning,$critical,$elapsed) = @_;
    if ((defined $warning) && (defined $critical)) {
        print "|rta=$elapsed" . "s;$warning;$critical;0;$critical";
    } else {
        print "|rta=$elapsed"
    }
}

# Return (state, elapsed): 0 = OK, 1 = WARNING, 2 = CRITICAL.
sub test {
    my ($critical,$warning,$host,$portnum,$timeout) = @_;
    my $proto = "tcp";
    my ($rc,$elapsed) = testport($host,$portnum,$proto,$timeout);
    if ($rc) {
        if (defined $critical) {
            # Numeric comparison of the threshold against elapsed seconds
            if ($critical <= $elapsed) {
                return 2,$elapsed;
            }
        }
        if (defined $warning) {
            if ($warning <= $elapsed) {
                return 1,$elapsed;
            }
        }
        return 0,$elapsed;
    } else {
        return 2,$elapsed;
    }
}

unless ((defined $portnum) && (defined $host)) {
    HELP_MESSAGE();
    exit 1;
}

if ((defined $crit) && (defined $warn)) {
    if ($crit <= $warn) {
        print "Error: warning is greater than critical will never reach warning\n";
        exit 4;
    }
}

my @mess = qw(OK WARNING CRITICAL);
my @mess2 = ("is responding","is slow responding","is not responding");
my ($rc,$elapsed) = test($crit,$warn,$host,$portnum,$timeout);
print "PORT $portnum $mess[$rc]: $host/$portnum $mess2[$rc]";
if (defined $verbose) {
    printperf($warn,$crit,$elapsed);
}
print "\n";
exit $rc;
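Nagios decides the state of a service from the exit status of the plugin (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN), not from the text the plugin prints. A Perl script that reaches its end without calling exit finishes with status 0, which is why the final "exit $rc;" matters. The sketch below shows the effect with two stand-in shell functions; nagios_state_name, plugin_without_exit and plugin_with_exit are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Map a plugin exit status to the Nagios state name.
# Any status other than 0-2 is reported as UNKNOWN in this sketch.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Two stand-in "plugins": both print a CRITICAL message, but only the
# second one sets its exit status accordingly.
plugin_without_exit() (
    echo "PORT 50075 CRITICAL: hd02/50075 is not responding"
)
plugin_with_exit() (
    echo "PORT 50075 CRITICAL: hd02/50075 is not responding"
    exit 2
)

for plugin in plugin_without_exit plugin_with_exit; do
    rc=0
    "$plugin" >/dev/null || rc=$?
    echo "${plugin}: exit status ${rc} -> $(nagios_state_name "$rc")"
done
```

The first stand-in is seen as OK despite its message, while the second is seen as CRITICAL.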

Copy the modified check_port.pl plugin into the Nagios plugin directory plugins. The plugin is named check_port in the destination directory. (NN)

# cp check_port.pl /usr/lib64/nagios/plugins/check_port
# chmod 755 /usr/lib64/nagios/plugins/check_port
# ls -l /usr/lib64/nagios/plugins/check_port
-rwxr-xr-x 1 root root 4082 Feb 25 21:33 /usr/lib64/nagios/plugins/check_port

Modify the Nagios plugin check_hadoop_metrics.sh. In each of the two places shown below, the original one-line threshold validation is commented out and an if block is added in its place. This modification to the currently released check_hadoop_metrics.sh plugin is required for Nagios to monitor Hadoop correctly.

# cd /root/NagiosPlugins/
# vi check_hadoop_metrics.sh
...
# ##
# [[ $warn -le $crit ]] || echo -e "Critical level must be greater than Warning level." && exit 3
if [[ $warn -ge $crit ]]; then
    echo -e "Critical level must be greater than Warning level."
    exit 3
fi
...
# ##
# [[ $warn -ge $crit ]] || echo -e "Warning level must be greater than Critical level." && exit 3
if [[ $warn -le $crit ]]; then
    echo -e "Warning level must be greater than Critical level."
    exit 3
fi
...

Check the contents of the Nagios plugin check_hadoop_metrics.sh.


# cat check_hadoop_metrics.sh
#!/bin/bash
#
# FILE: check_hadoop_metrics.sh
# SYNOPSIS: Check a Hadoop metric via HTTP. Assumes metrics have been
# exposed in $HADOOP_CONF/hadoop-metrics.properties
# LICENSE: GPL, Copyright 2010 William Stiern <wstiern (at) gmail {dot} com>
#
# Declare variables.
declare hostname
declare port
declare counter
declare bias
declare warn
declare crit
declare metricIsNumeric
declare -a hadoopMetrics
declare -a hadoopMetricName
declare -a hadoopMetricValue

while [[ $# -gt 0 ]]
do
    case "$1" in
        -H | --hostname )
            shift
            hostname="$1"
            ;;
        -p | --port )
            shift
            port="$1"
            ;;
        -C | --counter )
            shift
            counter="$1"
            ;;
        -b | --bias )
            shift
            bias="$1"
            ;;
        -w | --warn )
            shift
            warn="$1"
            ;;
        -c | --crit )
            shift
            crit="$1"
            ;;
        -* | --help )
            echo -e >&2 \
"
Usage: ${0} [-H hostname] [-p port] [-C counter] ...
...
[-b | --bias] \t\t Evaluation bias: Greater-than (gt), Less-than (lt), Informative (na).
[-w | --warn] \t\t Warning threshold.
[-c | --crit] \t\t Critical threshold.
"
            exit 3
            ;;
        *)
            break
            ;;
    esac
    shift
done

[[ -z $hostname ]] && echo -e "Hostname not specified." && exit 3
[[ -z $port ]] && echo -e "Port number not specified." && exit 3
[[ -z $counter ]] && echo -e "Counter name not specified." && exit 3
[[ -z $bias ]] && echo -e "Bias not specified." && exit 3
[[ -z $crit ]] && echo -e "Crit level not specified." && exit 3
[[ -z $warn ]] && echo -e "Warn level not specified." && exit 3

# Grab metrics.
...
    grep -i [a-z0-9_]=[0-9] | grep -v \{ ) )

# Grab all the metrics up into arrays.
for (( m = 0 ; m < ${#hadoopMetrics[@]} ; m++ ))
do
    hadoopMetricName=( $( echo "${hadoopMetrics[${m}]}" | cut -d = -f 1 ) )
    hadoopMetricValue=( $( echo "${hadoopMetrics[${m}]}" | cut -d = -f 2 ) )
    metricIsNumeric=$( echo "${hadoopMetricValue} == ${hadoopMetricValue}" | bc )

    # If the metric exists
    if [[ "${counter}" == "${hadoopMetricName}" ]]; then
        # If the metric is a number
        if [ $metricIsNumeric -eq 1 ]; then
            # Round off decimals
            hadoopMetricValue=$( printf "%0.f" "${hadoopMetricValue}" )
            # Set up performance data
            perf="${counter}=${hadoopMetricValue};${warn};${crit}"

            # If the bias is greater-than
            if [[ "${bias}" == "gt" ]]; then
                # Validate crit/warn levels
                # ##
                # [[ $warn -le $crit ]] || echo -e "Critical level must be greater than Warning level." && exit 3
                if [[ $warn -ge $crit ]]; then
                    echo -e "Critical level must be greater than Warning level."
                    exit 3
                fi
                # If the crit value is exceeded, crit
                if [ $crit -le $hadoopMetricValue ]; then
                    echo -e "CRITICAL: ${hadoopMetricName} = ${hadoopMetricValue} | ${perf}"
                    exit 2
                # If the warn value is exceeded, warn
                elif [ $warn -le $hadoopMetricValue ]; then
                    echo -e "WARNING: ${hadoopMetricName} = ${hadoopMetricValue} | ${perf}"
                    exit 1
                # Otherwise, everything is fine
                else
                    echo -e "OK: ${hadoopMetricName} = ${hadoopMetricValue} | ${perf}"
                    exit 0
                fi

            # If the bias is less-than
            elif [[ "${bias}" == "lt" ]]; then
                # Validate crit/warn levels
                # ##
                # [[ $warn -ge $crit ]] || echo -e "Warning level must be greater than Critical level." && exit 3
                if [[ $warn -le $crit ]]; then
                    echo -e "Warning level must be greater than Critical level."
                    exit 3
                fi
                # If the crit value is under threshold, crit
                if [ $crit -ge $hadoopMetricValue ]; then
                    echo -e "CRITICAL: ${hadoopMetricName} = ${hadoopMetricValue} | ${perf}"
                    exit 2
                # If the warn value is under threshold, warn
                elif [ $warn -ge $hadoopMetricValue ]; then
                    echo -e "WARNING: ${hadoopMetricName} = ${hadoopMetricValue} | ${perf}"
                    exit 1
                # Otherwise, everything is fine
                else
                    echo -e "OK: ${hadoopMetricName} = ${hadoopMetricValue} | ${perf}"
                    exit 0
                fi

            # If the bias is not applicable
            elif [[ "${bias}" == "na" ]]; then
                # Display metric info
                ...
            fi
        else
            echo -e "UNKNOWN: Metric is not a number."
            exit 3
        fi
    fi
done

# Error if the metric is not found
echo -e "UNKNOWN: Metric not found."
exit 3
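One plausible reason the original one-line validation had to be replaced (an interpretation, since this document does not state it) is how the shell evaluates "A || B && C": the two operators have equal precedence and associate left to right, so the line behaves like "(A || B) && C" and exit 3 runs even when the thresholds are valid. The following is a minimal sketch of that behavior for the gt case. The function names validate_original and validate_modified are hypothetical, the tests use [ ] rather than [[ ]] for portability, and subshells keep exit from ending the calling shell.

```shell
#!/bin/bash
# "A || B && C" groups as "(A || B) && C", so C runs whenever A succeeds.

# One-line validation in the style of the commented-out original (gt bias).
validate_original() {
    local warn="$1" crit="$2"
    (
        [ "$warn" -le "$crit" ] || echo "Critical level must be greater than Warning level." && exit 3
    )
}

# Explicit if block in the style of the modified plugin.
validate_modified() {
    local warn="$1" crit="$2"
    (
        if [ "$warn" -ge "$crit" ]; then
            echo "Critical level must be greater than Warning level."
            exit 3
        fi
    )
}

# warn=10, crit=20 is a valid pair for the gt bias.
for fn in validate_original validate_modified; do
    rc=0
    "$fn" 10 20 || rc=$?
    echo "${fn} 10 20 -> exit status ${rc}"
done
```

For the valid pair (warning 10, critical 20) the one-line form ends with status 3, which Nagios would treat as UNKNOWN, while the if block ends with status 0 and lets the check continue.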

Copy the modified check_hadoop_metrics.sh plugin shown above into the Nagios plugin directory plugins. The plugin is named check_hadoop_metrics in the destination directory. (NN)

# cp check_hadoop_metrics.sh /usr/lib64/nagios/plugins/check_hadoop_metrics
# chmod 755 /usr/lib64/nagios/plugins/check_hadoop_metrics
# ls -l /usr/lib64/nagios/plugins/check_hadoop_metrics
-rwxr-xr-x 1 root root 4511 Feb 28 15:33 /usr/lib64/nagios/plugins/check_hadoop_metrics

Testing the Nagios Plugins (NN)

Run the installed Nagios plugins from the command line as a simple test. (NN)

# cd /usr/lib64/nagios/plugins/
# ./check_hadoop_hdfs
OK - HDFS is healthy
# ./secondarynamenode_check
OK - checkpoint is healthy
# ./check_hadoop_datanode hd01 0 1
OK - Datanodes up and running: 2
# ./check_hadoop_tasktracker hd01 0 1
OK - TaskTracker up and running: 2
# ./check_port -p 50075 -h hd02
PORT 50075 OK: hd02/50075 is responding

Define a hostgroup containing the NameNode and the DataNodes in the localhost.cfg file. (NN)

# cd /etc/nagios/objects/
# vi localhost.cfg
...
define hostgroup{
    hostgroup_name  linux-servers ; The name of the hostgroup
    alias           Linux Servers ; Long name of the group
    members         localhost,hd01,hd02,hd03,hd04,hd05,hd06,hd07,hd08
}
...
define service{
    use                    local-service ; Name of service template to use
    host_name              localhost
    service_description    SSH
    check_command          check_ssh
    notifications_enabled  1
}
...
define service{
    use                    local-service ; Name of service template to use
    host_name              localhost
    service_description    HTTP
    check_command          check_http
    notifications_enabled  1
}
...

Check whether the Nagios configuration is correct. (NN)

# nagios -v /etc/nagios/nagios.cfg

Create the Nagios administrative user nagiosadmin. (NN)

# htpasswd -c /etc/nagios/htpasswd.users nagiosadmin
New password: xxxxxxxx
Re-type new password: xxxxxxxx
Adding password for user nagiosadmin

Start the Apache service on the NameNode. (NN)

# cd /var/www/html
# touch index.html
# /etc/init.d/httpd start
# chkconfig httpd on

Start the Nagios service. (NN)

# /etc/init.d/nagios start
# chkconfig nagios on
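When the configuration is changed later, it can help to repeat the nagios -v check before every restart. The following is a minimal sketch under that assumption. verify_then_run, NAGIOS_BIN and NAGIOS_CFG are names invented for this example, and the sketch relies on nagios -v returning a non-zero exit status when the check finds errors.

```shell
#!/bin/bash
# NAGIOS_BIN and NAGIOS_CFG are overridable so the sketch can be tried
# without a Nagios installation (an assumption made for illustration).
NAGIOS_BIN=${NAGIOS_BIN:-nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}

# Run the configuration check first; run the given command only if it passes.
verify_then_run() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        "$@"
    else
        echo "Configuration check failed: ${NAGIOS_CFG}" >&2
        return 1
    fi
}

# Example (not executed here):
#   verify_then_run /etc/init.d/nagios restart
```

Keeping the check and the restart in one helper avoids restarting Nagios with a configuration that it would refuse to load.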

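Monitoring in the Nagios GUI

Confirm that the Nagios administration screen can be accessed from a web browser. Log in as the Nagios administrative user nagiosadmin. (NN)

# firefox http://hd01/nagios

Check that the Nagios plugins are monitoring whether the Hadoop-related services on the NameNode and the DataNodes are running normally. From the Nagios main screen, click "Services" and review each monitored item. After clicking "Services", note the "Next scheduled check" time and verify that the check runs again at that time.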


Figure 2. Monitoring a Hadoop cluster with Nagios

Copy files onto HDFS, for example with the hadoop command as shown below, and check whether Warning and Critical alerts for the disk usage defined with check_hadoop_metrics appear under "HDFS DISK USAGE". "HDFS DISK USAGE" is the string defined in service_description for check_hadoop_metrics in hd01.cfg. (NN)

# dd if=/dev/zero of=/tmp/file10GB bs=1024M count=10
# su - hdfs
$ whoami
hdfs
$ hadoop dfs -put /tmp/file10GB /user/hdfs/BIGFILEDIR/
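To estimate how many 10 GB files have to be copied before an alert fires, a rough calculation can be sketched as follows. It assumes that each file consumes its size multiplied by the HDFS replication factor (3 is assumed here) and that CapacityRemainingGB starts at 1200. Both figures, and the helper name files_needed, are assumptions for illustration rather than values measured in this environment.

```shell
#!/bin/bash
# files_needed REMAINING_GB THRESHOLD_GB FILE_GB REPLICATION
# Prints the smallest number of files that brings the remaining capacity
# down to the threshold or below (integer arithmetic, rounded up).
files_needed() {
    local remaining="$1" threshold="$2" file_gb="$3" replication="$4"
    local per_file=$(( file_gb * replication ))
    local excess=$(( remaining - threshold ))
    if [ "$excess" -le 0 ]; then
        echo 0
        return 0
    fi
    echo $(( (excess + per_file - 1) / per_file ))
}

# Hypothetical starting point: 1200 GB remaining, replication factor 3.
echo "Files to reach WARNING (998 GB):  $(files_needed 1200 998 10 3)"
echo "Files to reach CRITICAL (300 GB): $(files_needed 1200 300 10 3)"
```

Under these assumptions, 7 files bring the remaining capacity to the 998 GB warning threshold and 30 files bring it to the 300 GB critical threshold. Actual consumption also depends on non-DFS usage, so treat the result as an estimate.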

Stop /etc/init.d/hadoop-0.20-datanode, the service that uses port 50075 on a DataNode, and test whether Nagios detects the failure. First, log in to the DataNode hd02 and stop the hadoop-0.20-datanode service. (DN)

# ssh hd02
Password: xxxxxxxx
[root@hd02 ~]# /etc/init.d/hadoop-0.20-datanode stop

In the Nagios GUI, click "DATANODE PORT" for the DataNode and note the "Next Scheduled Check" time. The status does not change until that time has passed, so wait until then. Once the time shown in "Next Scheduled Check" has passed, click "Services" in the Nagios GUI again and check "DATANODE PORT".


Figure 3. Nagios detecting a DataNode service failure

Confirm that the Status of "DATANODE PORT" for the DataNode is "CRITICAL", as in Figure 3. When the port failure test has been completed on all DataNodes, start the service and confirm that the Status returns to "OK", as in Figure 2.

[root@hd02 ~]# /etc/init.d/hadoop-0.20-datanode start
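While waiting for the next scheduled check, the port can also be probed by hand. The sketch below is a simplified stand-in for check_port written for bash. It assumes a bash build with /dev/tcp redirection and the GNU coreutils timeout command, which may not be available on older releases. probe_port is a hypothetical name, and unlike check_port it does not measure the response time.

```shell
#!/bin/bash
# probe_port HOST PORT [TIMEOUT_SECONDS]
# Prints a check_port-style message; returns 0 (OK) or 2 (CRITICAL).
probe_port() {
    local host="$1" port="$2" limit="${3:-10}"
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT.
    if timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "PORT ${port} OK: ${host}/${port} is responding"
        return 0
    fi
    echo "PORT ${port} CRITICAL: ${host}/${port} is not responding"
    return 2
}

# Example (not executed here):
#   probe_port hd02 50075
```

Running it before and after stopping the hadoop-0.20-datanode service gives a quick indication of what Nagios should report at the next check.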

Test in the same way whether a tasktracker failure is detected. To test tasktracker failure detection, stop the hadoop-0.20-tasktracker service on the DataNode. (DN)

[root@hd02 ~]# /etc/init.d/hadoop-0.20-tasktracker stop

In the Nagios GUI, click "DATANODE TASKTRACKER PORT" for the DataNode and note the "Next Scheduled Check" time. The status does not change until that time has passed, so wait until then. Once the time shown in "Next Scheduled Check" has passed, click "Services" in the Nagios GUI again and check "DATANODE TASKTRACKER PORT".

"DATANODE TASKTRACKER PORT" as shown in the Nagios GUI is the string defined in service_description for check_hadoop_datanode_tasktracker_port in the DataNode configuration files hd02.cfg through hd08.cfg. (DN)


Figure 4. Nagios detecting a tasktracker service failure on a DataNode

When the tasktracker port failure test has been completed on all DataNodes, start the service and confirm that the Status returns to "OK", as in Figure 2.

[root@hd02 ~]# /etc/init.d/hadoop-0.20-tasktracker start
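Repeating the stop and start test on every DataNode is easier with a list of commands generated in advance. The following sketch only prints the ssh commands for hd02 through hd08 rather than executing them. datanode_hosts and print_service_commands are hypothetical helper names, and authentication for ssh is left to the operator.

```shell
#!/bin/bash
# datanode_hosts FIRST LAST: print host names hdNN for the given range.
datanode_hosts() {
    local n="$1" last="$2"
    while [ "$n" -le "$last" ]; do
        printf 'hd%02d\n' "$n"
        n=$((n + 1))
    done
}

# Print (do not execute) the init script command for each DataNode.
print_service_commands() {
    local service="$1" action="$2" host
    for host in $(datanode_hosts 2 8); do
        echo "ssh ${host} /etc/init.d/${service} ${action}"
    done
}

print_service_commands hadoop-0.20-tasktracker stop
```

Printing the commands first makes it possible to review them, and to wait for the next scheduled check between hosts, before running anything.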