How to Set Up TCP Connectivity Checks in Datadog

2019-10-29

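Overview

Introduction

  • This post explains how to configure Datadog's TCP connectivity check (TCP Check).
  • Datadog is a system monitoring tool delivered as SaaS, and I use it on my current project to monitor servers and applications. For an overview of Datadog, see Hattori-san's article published on Logmi Tech.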
 

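TCP Connectivity Checks with Datadog

  • Datadog's TCP Check monitors whether a TCP connection to a given host and port succeeds, and how long the connection takes to respond.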
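
To make concrete what "TCP connectivity and response time" means, here is a minimal Python sketch of the same kind of probe. It is not the Agent's implementation; the function name `tcp_check` and its `timeout` parameter are assumptions chosen for illustration.

```python
import socket
import time


def tcp_check(host, port, timeout=10.0):
    """Try one TCP connection and return (can_connect, response_time_seconds)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Refused, timed out, or unresolvable: all are treated as "cannot connect".
        return False, None
```

For example, `tcp_check("AAA.example.com", 443, timeout=3.0)` returns a pair such as `(True, <seconds>)` when the port accepts connections and `(False, None)` otherwise.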

Create the TCP check monitor under Network

  • To create a monitor for the TCP connectivity check (TCP Check), choose "New Monitor" ⇒ "monitor type: Network".
  • However, if no Agent has been configured to run an HTTP or TCP check yet, the following message is displayed:
    • "Please ensure that HTTP and/or TCP checks have been configured in the agent."
 

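Choose the source of the TCP check

  • First, choose a probe server that will periodically send requests to the monitored server.
  • The probe server must have the Datadog Agent installed, and it should be a different server from the one that provides the monitored TCP service. If both ran on the same host, an outage of that host would stop the check itself.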
 

Configuring the Datadog Agent

  • On Linux, place the Datadog Agent's TCP Check configuration at:
    • /etc/datadog-agent/conf.d/tcp_check.d/conf.yaml
  • On Windows, place the Datadog Agent's TCP Check configuration at:
    • C:\ProgramData\Datadog\conf.d\tcp_check.d\conf.yaml
 
  • A sample conf.yaml follows.
  • Besides web services (ports 80 and 443), you can also check essential network services such as sshd (port 22) and RDP (port 3389).

init_config:

instances:
  - name: TCP-443_check_AAA.example.com
    host: AAA.example.com
    port: 443
    collect_response_time: true

  - name: TCP-80_check_BBB.example.com
    host: BBB.example.com
    port: 80
    collect_response_time: true
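
When the list of targets grows (for example, to add sshd and RDP as mentioned above), the instances can be generated rather than typed by hand. The following Python sketch renders text in the same layout as the sample conf.yaml. The helper name `render_tcp_check_conf` and the hosts CCC.example.com and DDD.example.com are hypothetical, and the naming pattern `TCP-<port>_check_<host>` simply follows the sample.

```python
def render_tcp_check_conf(targets, collect_response_time=True):
    """Render tcp_check conf.yaml text from an iterable of (host, port) pairs."""
    lines = ["init_config:", "", "instances:"]
    for host, port in targets:
        lines += [
            f"  - name: TCP-{port}_check_{host}",
            f"    host: {host}",
            f"    port: {port}",
            # YAML booleans are lowercase: true / false.
            f"    collect_response_time: {str(collect_response_time).lower()}",
            "",
        ]
    return "\n".join(lines)


# Hypothetical targets: sshd on one host, RDP on another.
conf_text = render_tcp_check_conf([("CCC.example.com", 22), ("DDD.example.com", 3389)])
```

The resulting `conf_text` can be written to the conf.yaml path for your platform before restarting the Agent.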
   

Restarting the Datadog Agent

  • On Linux, create conf.yaml and restart the Datadog Agent with the following commands.


$ cd /etc/datadog-agent/conf.d/tcp_check.d
$ sudo vi conf.yaml

$ sudo systemctl restart datadog-agent.service

$ sudo datadog-agent status
 
  • On Windows, place conf.yaml and restart the Datadog Agent as follows.


← Place conf.yaml in "C:\ProgramData\Datadog\conf.d\tcp_check.d"

← Start PowerShell as an administrator

PS C:\Users\Administrator> cd "C:\Program Files\Datadog\Datadog Agent\embedded"
PS C:\Program Files\Datadog\Datadog Agent\embedded> ./agent.exe restart-service
PS C:\Program Files\Datadog\Datadog Agent\embedded> ./agent.exe status
 
  • Below is sample output from agent.exe status. (On Linux, sudo datadog-agent status produces similar output.)
  • The TCP Check results appear under "tcp_check".

===============
Agent (v6.14.1)
===============

  Status date: 2019-10-28 00:36:34.429243 JST
  Agent start: 2019-10-28 00:27:48.177974 JST
  Pid: 2200
  Go Version: go1.12.9
  Python Version: 2.7.16
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: C:\ProgramData\Datadog\datadog.yaml
    conf.d: C:\ProgramData\Datadog\conf.d
    checks.d: C:\ProgramData\Datadog\checks.d

  Clocks
  ======
    NTP offset: -3.887ms
    System UTC time: 2019-10-28 00:36:34.429243 JST

  Host Info
  =========
    bootTime: 2019-10-07 16:58:43.000000 JST
    os: windows
    platform: Windows Server 2019 Datacenter
    platformFamily: Windows Server 2019 Datacenter
    platformVersion: 10.0 Build 17763
    procs: 109
    uptime: 487h29m4s

  Hostnames
  =========
    ec2-hostname: ip-XX-XX-XX-XX.ap-northeast-1.compute.internal
    hostname: hostname1
    instance-id: i-0123456789abcdefg
    socket-fqdn: hostname1
    socket-hostname: hostname1
    hostname provider: os
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

=========
Collector
=========



  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\cpu.d\conf.yaml.default
      Total Runs: 35
      Metric Samples: Last Run: 7, Total: 239
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    disk (2.5.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\disk.d\conf.yaml.default
      Total Runs: 34
      Metric Samples: Last Run: 6, Total: 204
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 2ms


    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\file_handle.d\conf.yaml.default
      Total Runs: 35
      Metric Samples: Last Run: 1, Total: 35
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms


    http_check (4.2.0)
    ------------------
      Instance ID: http_check:AAA.example.com:18c391eda9aa3f3d [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\http_check.d\conf.yaml
      Total Runs: 2
      Metric Samples: Last Run: 5, Total: 10
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 2, Total: 4
      Average Execution Time : 419ms

      Instance ID: http_check:BBB.example.com:86495e385cafb9dd [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\http_check.d\conf.yaml
      Total Runs: 2
      Metric Samples: Last Run: 5, Total: 10
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 2, Total: 4
      Average Execution Time : 52ms


    io
    --
      Instance ID: io [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\io.d\conf.yaml.default
      Total Runs: 34
      Metric Samples: Last Run: 5, Total: 170
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\memory.d\conf.yaml.default
      Total Runs: 35
      Metric Samples: Last Run: 17, Total: 595
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    network (1.11.4)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\network.d\conf.yaml.default
      Total Runs: 35
      Metric Samples: Last Run: 12, Total: 420
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms


    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\ntp.d\conf.yaml.default
      Total Runs: 35
      Metric Samples: Last Run: 1, Total: 35
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 35
      Average Execution Time : 0s


    tcp_check (2.3.0)
    -----------------
      Instance ID: tcp_check:TCP-443_check_AAA.example.com:801b3307e5549a11 [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\tcp_check.d\conf.yaml
      Total Runs: 35
      Metric Samples: Last Run: 2, Total: 70
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 35
      Average Execution Time : 8ms

      Instance ID: tcp_check:TCP-80_check_BBB.example.com:cb01989ad30c1161 [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\tcp_check.d\conf.yaml
      Total Runs: 35
      Metric Samples: Last Run: 2, Total: 70
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 35
      Average Execution Time : 19ms


    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\uptime.d\conf.yaml.default
      Total Runs: 35
      Metric Samples: Last Run: 1, Total: 35
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    windows_service (2.1.0)
    -----------------------
      Instance ID: windows_service:89b0f21a4c4f699f [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\windows_service.d\conf.yaml
      Total Runs: 34
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 5, Total: 170
      Average Execution Time : 50ms


    winproc
    -------
      Instance ID: winproc [OK]
      Configuration Source: file:C:\ProgramData\Datadog\conf.d\winproc.d\conf.yaml.default
      Total Runs: 35
      Metric Samples: Last Run: 2, Total: 70
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 35
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 2
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 72
    TimeseriesV1: 35

  API Keys status
  ===============
    API key ending with 123ab: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 123ab

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Checks Metric Sample: 3,321
  Dogstatsd Metric Sample: 2,360
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 35
  Series Flushed: 3,393
  Service Check: 959
  Service Checks Flushed: 993

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 2,359
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 152,117
  Udp Packet Reading Errors: 0
  Udp Packets: 2,360
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0
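
The status output is long, so it can help to extract only the tcp_check instances and their states. The Python sketch below does this with a regular expression written against the "Instance ID:" line format seen in the v6.14.1 sample above; other Agent versions may format these lines differently, and the function name `tcp_check_states` is an assumption for illustration.

```python
import re

# Matches lines such as:
#   Instance ID: tcp_check:TCP-443_check_AAA.example.com:801b3307e5549a11 [OK]
INSTANCE_RE = re.compile(
    r"Instance ID:\s+tcp_check:(?P<name>.+):[0-9a-f]+\s+\[(?P<state>\w+)\]"
)


def tcp_check_states(status_text):
    """Return {instance name: state} for every tcp_check line in the status text."""
    return {m.group("name"): m.group("state") for m in INSTANCE_RE.finditer(status_text)}
```

Feeding it the captured output of the status command yields a small dictionary, which makes it easy to spot any instance whose state is not OK.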

 

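Creating the TCP check monitor

  • Once the Datadog Agent is configured, create the monitor for the TCP connectivity check.
  • After the Agent's conf.yaml is in place, "tcp" becomes selectable. Next, select the endpoint (URL) configured in conf.yaml. The original post shows a screenshot of the monitor creation screen here.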
 

Reference information

Description of the network.tcp.response_time metric

  • The response time of a given host and TCP port, tagged with url, e.g. 'url:192.168.1.100:22'.
    (Shown as second)
  • The related service check, tcp.can_connect, returns DOWN if the Agent cannot connect to the configured host and port, otherwise UP.

Description of the network.tcp.can_connect metric

  • Value of 1 if the Agent can successfully establish a connection to the URL, 0 otherwise.
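
To show how the descriptions above relate to one another, here is a small Python sketch that maps one connection result to the values described. It only restates the descriptions in code; the function name `to_datadog_values` and the dictionary layout are assumptions, not an Agent API.

```python
def to_datadog_values(host, port, connected, response_time=None):
    """Map one connection result to the metric and status values described above."""
    values = {
        # Results are tagged with url:<host>:<port>, e.g. url:192.168.1.100:22.
        "tags": [f"url:{host}:{port}"],
        # 1 if the connection succeeded, 0 otherwise.
        "network.tcp.can_connect": 1 if connected else 0,
        # UP if the connection succeeded, DOWN otherwise.
        "status": "UP" if connected else "DOWN",
    }
    if connected and response_time is not None:
        # Response time is reported in seconds.
        values["network.tcp.response_time"] = response_time
    return values
```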
 

Other

  • See the materials below.