We just received new UPS units (battery backups) for our office. We got new ESXi infrastructure (Cisco HyperFlex, which is baller, frankly), and the power requirements changed as a result. Plus, the old batteries were starting to die en masse (after not being replaced for 3+ years…), which was leading to headaches up to and including loss of power in the datacenter. Not cool.
These new Smart-UPS X 3000 units are VERY cool, though.
We just finished configuring the management interfaces for them, and that got me thinking: running ping checks on these is all well and good, but what kind of data can we pull from them via SNMP? If it’s on the network, I can grab data from it. That’s my story and I’m sticking to it.
Thankfully, the management cards support SNMP v1 through v3, and configuring it is easy enough. That’s an exercise for the reader: if you can’t figure out the 3 clicks it takes, the rest of this will probably be way over your head.
So I set out to write my first set of truly self-written monitoring templates. It was surprisingly easy once I started to read the documentation for Zabbix 3.0.
It’s got three “Applications” with different items, about 15 total:
- UPS Connectivity
  - Packet Loss
  - Response Time
- UPS Information
  - Battery Installation Date
  - Firmware Revision
  - Serial Number
  - System Model
- UPS Status
  - Battery Remaining Capacity (in %, the charge left on the battery)
  - Battery Run Time Remaining (in minutes, the time until bad things happen)
  - Battery Supplying Voltage (in V, the voltage the battery is supplying to the battery backup system)
  - Battery Temperature (in °C, the temperature of the battery)
  - Input Frequency (in Hz, to check the quality of inbound power)
  - Input Voltage (in V, to check the quality of inbound power)
  - Output Frequency (in Hz, to check how the UPS regulators are working)
  - Output Voltage (in V, to check how the UPS regulators are working)
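One wrinkle with the "in minutes" item: SNMP agents often report durations as TimeTicks, which are hundredths of a second, so the raw value needs scaling before it reads as minutes. Here is a minimal Python sketch of that conversion. It assumes the run time arrives as TimeTicks (check your own card's MIB, I'm not claiming that's how every UPS does it), and the function name `timeticks_to_minutes` is just something I made up for illustration.

```python
def timeticks_to_minutes(ticks: int) -> float:
    """Convert SNMP TimeTicks (hundredths of a second) to minutes.

    Assumption: the UPS reports run time remaining as TimeTicks.
    """
    if ticks < 0:
        raise ValueError("TimeTicks cannot be negative")
    seconds = ticks / 100  # TimeTicks are 1/100 of a second
    return seconds / 60


if __name__ == "__main__":
    # 180000 ticks = 1800 seconds = 30 minutes
    print(timeticks_to_minutes(180000))
```

In Zabbix itself you would get the same effect with a multiplier on the item rather than any external code.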
It has a bunch of triggers too.
- Battery Remaining Capacity 1% (Disaster Alert)
- Battery Remaining Capacity 10% (High Alert)
- Battery Remaining Capacity 25% (Average Alert)
- Battery Temperature Excessive >36C (High Alert)
- Battery Temperature High >30C, <36C (Average Alert)
- Inbound Power Quality Warning (±20 V off spec or ±10 Hz off spec, while power is still being provided (>0 V input))
- Inbound Power Failure (0 V being supplied)
- Outbound Power Quality Warning (±20 V off spec or ±10 Hz off spec, while power is still being provided (>0 V output))
- Packet Loss
- Response Time
- Network Unavailable
All the triggers depend on the network being available, so the dependency is set there. Additionally, Inbound Power Quality Warning depends on Inbound Power Failure. I do not like duplicate alerts. These were fairly interesting to write up and the expression constructor was VERY helpful, as was the expression tester.
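If the dependency idea is fuzzy, here is a rough Python sketch of the logic. This is not how Zabbix evaluates anything internally and it is not lifted from the template. The `Reading` class, the `evaluate` function, and the 120 V / 60 Hz nominal values are all assumptions I picked for illustration, as are the "at or below" comparisons on the capacity thresholds and the choice to report only the most severe capacity and temperature alert.

```python
from dataclasses import dataclass
from typing import List

# Assumed nominal input values; substitute whatever your site's spec is.
NOMINAL_VOLTAGE = 120.0
NOMINAL_FREQUENCY = 60.0


@dataclass
class Reading:
    """One hypothetical poll of a UPS."""
    reachable: bool
    capacity_pct: float
    temperature_c: float
    input_voltage: float
    input_frequency: float


def evaluate(r: Reading) -> List[str]:
    """Return the alerts that would fire for one reading."""
    # Every other trigger depends on the network being available.
    if not r.reachable:
        return ["Network Unavailable"]

    alerts = []

    # Battery capacity: report only the most severe level (assumption).
    if r.capacity_pct <= 1:
        alerts.append("Battery Remaining Capacity 1% (Disaster)")
    elif r.capacity_pct <= 10:
        alerts.append("Battery Remaining Capacity 10% (High)")
    elif r.capacity_pct <= 25:
        alerts.append("Battery Remaining Capacity 25% (Average)")

    # Battery temperature.
    if r.temperature_c > 36:
        alerts.append("Battery Temperature Excessive (High)")
    elif r.temperature_c > 30:
        alerts.append("Battery Temperature High (Average)")

    # The quality warning depends on the failure trigger, so a dead
    # input only ever raises one alert.
    if r.input_voltage <= 0:
        alerts.append("Inbound Power Failure")
    elif (abs(r.input_voltage - NOMINAL_VOLTAGE) > 20
          or abs(r.input_frequency - NOMINAL_FREQUENCY) > 10):
        alerts.append("Inbound Power Quality Warning")

    return alerts


if __name__ == "__main__":
    print(evaluate(Reading(True, 8, 37, 0, 0)))
```

The point is the ordering: the unreachable check short-circuits everything, and the power failure check runs before the quality check, so each bad situation shows up exactly once.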
Here is the Zabbix template, in case you’d like to use it. 🙂
Here’s a screenshot of the monitoring so far:
Cheers, and may your Zabbix instance alerts be few!