Zabbix: More On Dependencies

After a week of struggling with Zabbix and getting trigger dependencies done, I figured I was all set.  Everything was set to be dependent on something else (usually ping checks to indicate that something is down).

Then we lost connectivity to a school.  My email LIT UP with alerts.  Almost every single device behind those switches reported that they were down.  ARGH!

After doing some more research on it, I found the culprit: how often the data is checked.

I was reading the Zabbix documentation here and the line that got me was: “Before changing the status of the ‘Host is down’ trigger, Zabbix will check for corresponding trigger dependencies. If found, and one of those triggers is in ‘Problem’ state, then the trigger status will not be changed and thus actions will not be executed and notifications will not be sent.”

I interpreted this to mean that if I have two hosts, A and B, with B depending on A, and B went down, Zabbix would IMMEDIATELY turn around and check A.  As a real-world example:

MAS Main Office IDF Ping Check depends on MAS MDF Ping Check.  In theory I expected the following:

  1. Zabbix checks MAS Main Office IDF Ping via ICMP.
  2. Result of Step 1 is: 0 (Down).
  3. Zabbix looks at the trigger and says: this trigger depends on MAS MDF Ping Check.
  4. Zabbix immediately runs the MAS MDF Ping Check via ICMP.
    1. Result of Step 4 is: 0 (Down).
      1. Do not alert about MAS Main Office IDF being down.
      2. Alert about MAS MDF being down.
    2. Result of Step 4 is: 1 (Up).
      1. Alert about MAS Main Office IDF being down.
      2. Do not alert about MAS MDF (it’s up).

This is not what happens.

Zabbix DOES check the parent trigger, but it does not run a fresh check.  It only looks at the last recorded value.  So the steps are:

  1. Zabbix checks MAS Main Office IDF Ping via ICMP.
  2. Result of Step 1 is: 0 (Down).
  3. Zabbix looks at the trigger and says: this trigger depends on MAS MDF Ping Check.
  4. Zabbix immediately looks up the last recorded value of the MAS MDF Ping Check (no new ICMP ping is sent).
    1. Result of Step 4 is: 0 (Down).
      1. Do not alert about MAS Main Office IDF being down.
      2. Alert about MAS MDF being down.
    2. Result of Step 4 is: 1 (Up).
      1. Alert about MAS Main Office IDF being down.
      2. Do not alert about MAS MDF (it’s up).

There is a big difference between these scenarios.  If the last MAS MDF Ping Check ran before the MAS Main Office IDF Ping Check, and everything was OK at that particular point in time, the stored MDF value still says "Up" and you're going to get unnecessary alerts.  Grrrr.
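To make the difference concrete, here is a minimal Python sketch of the behavior described above.  It is a toy model and not Zabbix code: the `Check` class and the `should_alert` function are names I made up for illustration, and the only thing it captures is that the dependency is evaluated against the parent's stored value.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """Hypothetical stand-in for a ping item and its trigger."""
    name: str
    last_value: int  # 1 = up, 0 = down; the last RECORDED result, not a live probe


def should_alert(child: Check, parent: Check) -> bool:
    """Return True if this toy model would send a 'child is down' alert."""
    if child.last_value == 1:
        return False  # child is up, nothing to report
    # The dependency is judged on the parent's stored value only.
    # No fresh ping of the parent happens here.
    return parent.last_value == 1


# MDF was last polled just before the outage, so its stored value is stale.
mdf = Check("MAS MDF Ping Check", last_value=1)
idf = Check("MAS Main Office IDF Ping Check", last_value=0)
print(should_alert(idf, mdf))  # True: the unwanted alert

# Once the MDF check has run again and recorded the outage, the child is suppressed.
mdf.last_value = 0
print(should_alert(idf, mdf))  # False
```

The first call is the situation that filled my inbox: the child saw the outage before the parent's stored value caught up.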

The solution, as I’ve found, is to stagger your dependent checks.  That is to say: parent checks should run more frequently than child checks, so the parent's stored value is more likely to be current by the time a child notices the outage.

For example:

Core -> MDF -> IDF -> Devices.  Camera depends on IDF, IDF depends on MDF, MDF depends on Core.  There are 4 layers of checks here.

Level 0 Check (Core): Check every 30 seconds.

Level 1 Check (MDF; Anything else immediately behind the core): Check every 60/90 seconds.

Level 2 Check (IDF; Anything else immediately behind the MDF): Check every 120/180 seconds.

Level 3 Check (Devices; Anything else immediately behind the IDF): Check every 240/360 seconds.
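As a rough way to see why the staggering helps, the sketch below estimates how often a child check would notice an outage before its parent does.  It rests on several assumptions of mine that are not from the Zabbix documentation: polls happen on a fixed schedule, one failed poll is enough to flip a trigger, and a tie is processed parent first.  The function names and the `child_offset` parameter are hypothetical.

```python
def next_check(t: int, interval: int, offset: int = 0) -> int:
    """Time of the first poll at or after t, for polls at offset + k * interval."""
    k = -((offset - t) // interval)  # integer ceiling of (t - offset) / interval
    return offset + k * interval


def false_alert_fraction(parent_interval: int, child_interval: int,
                         child_offset: int = 0, horizon: int = 3600) -> float:
    """Fraction of outage start times (one per second over the horizon) for which
    the child is polled before the parent, i.e. the parent's stored value is stale."""
    bad = 0
    for t in range(horizon):
        parent_seen = next_check(t, parent_interval)
        child_seen = next_check(t, child_interval, child_offset)
        if child_seen < parent_seen:  # ties assumed to be handled parent first
            bad += 1
    return bad / horizon


# Same interval for MDF and IDF, polls 30 seconds apart.
print(false_alert_fraction(60, 60, child_offset=30))   # 0.5

# Staggered: MDF every 60 seconds, IDF every 120 seconds.
print(false_alert_fraction(60, 120, child_offset=30))  # 0.25
```

In this toy model the staggering shrinks the window for a stray alert but does not close it completely, so treat the numbers as an illustration of the trend and not as a prediction of what Zabbix will do on your network.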

I set up the staggered checks by creating a different ping template for each level.  I then applied the templates to the hosts.

This morning I arrived at work and lo and behold: I had a single alert in my inbox.  The CHS Science Wing 1st Floor switch was down.  Zabbix indicated a bunch of Cameras and Access Points were down too.  But I didn’t get an email about a single one.

Perfection.

On to the next bit: Monitoring RAID arrays and RAID disks.  More on that later.

-M, out.

5 comments

  1. Mike,

    First of all, thank you for taking the time to write up this blog. You’ve certainly touched on one of the biggest issues with Zabbix that most other monitoring programs handle out of the box. I would love to hear more about how you’ve been able to configure trigger dependencies that are host based. Would you be able to share some of your thoughts and collaborate on a bigger scale to simplify replicating this configuration? Would certainly be up for a discussion.

    Thanks,

    Jeff

    1. Hello Jeff,

      What did you have in mind? If you elaborate we can talk it out and figure it out. Reach out to me via mike at talesofatech dot com

      -Mike

  2. Hello, very good article. However, how do you change the times (30 seconds, 60/90 seconds …)? I use the same template for my router and my hosts. These have a dependency on the router. So if the router goes down, I should receive only one message: the one about the router.
