For a long time we had been using Nagios to monitor services and equipment in our shop. During one of our I.T. services commission meetings, a discussion about monitoring came up and a bunch of ideas were thrown around. We talked about the advantages and disadvantages of a base Nagios installation like ours (managing devices, templates, etc. is not exactly easy since it’s all a bunch of text files). The other I.T. managers dropped a number of names for replacements, and my boss suggested I take a look and see if any of them could do the job we needed.
Suggestions included Nagios & Cacti with Weathermap Plugin, Eyes of Network, PRTG, and Zabbix. After looking at all the options, Zabbix looked like the easiest to get rolling (which turned out to be wrong!), so I went with it. I spent about a week setting up the VM and it was going great until I added some switches and enabled SNMP discovery for interfaces. Suddenly, the server ground to a halt. The process count went through the roof, the server itself was overloaded, and the housekeeper process was stuck at 100% for over 4 hours at a time, every hour. After some digging on the Zabbix forums, I discovered that there are a LOT of configuration tweaks that should be done to keep the machine happy.
To that end, I decided to write up a guide about how to get an optimal setup (it has been working SO much better for me). I’ll also briefly touch on making Zabbix communicate with Cachet for a public landing page.
- Install and configure an instance of Ubuntu x64 Server edition (in this case, Ubuntu 14.04 LTS)
- For reference, the specifications I used were:
- RAM: 8 GB
- CPU: 4 CPUs, 2 Cores
- Storage: 128 GB
 
- Be sure to install SSH Server and LAMP Server during the installation process.
 
- Do updates (always a good idea as a general rule of thumb):
 sudo apt-get update && sudo apt-get upgrade
- Now we need to configure the MySQL server
- Enable innodb_file_per_table
- sudo nano /etc/mysql/my.cnf
- Under the [mysqld] heading, add the line:
 innodb_file_per_table
 
- Generic tweaks
- From this link we gathered the following tweaks for the my.cnf, again under the [mysqld] heading:
- innodb_buffer_pool_size = 4G (set this to 50% RAM if running the entire server on this box, 75% if you’re only running the database on this box).
- innodb_buffer_pool_instances = 4 (change to 8 or 16 on MySQL 5.6)
- innodb_flush_log_at_trx_commit = 0
- innodb_flush_method = O_DIRECT
- innodb_old_blocks_time = 1000
- innodb_io_capacity = 600 (400-800 for standard drives, >= 2000 for SSD drives)
- sync_binlog = 0
- query_cache_size = 0
- query_cache_type = 0
- event_scheduler = ENABLED
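If you want to work out the 50%/75% buffer pool rule for a different amount of RAM, here is a rough sketch. The function name buffer_pool_mb is made up for this example (it is not part of MySQL or Zabbix), and the percentages are simply the ones suggested above.

```shell
#!/bin/bash
# Sketch: suggest an innodb_buffer_pool_size from total RAM.
# buffer_pool_mb is a hypothetical helper, not part of MySQL or Zabbix.
# $1 = total RAM in MB
# $2 = percentage of RAM to give InnoDB (50 for an all-in-one box,
#      75 for a database-only box, per the rule of thumb above)
buffer_pool_mb() {
  local total_mb=$1 percent=$2
  echo $(( total_mb * percent / 100 ))
}

# Example: the 8 GB box from this guide, running everything on one machine.
echo "innodb_buffer_pool_size = $(buffer_pool_mb 8192 50)M"
```

For the 8 GB machine this gives 4096M, which is the same as the 4G used above.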
 
- Run the MySQL Tuner utility
- wget https://raw.githubusercontent.com/major/MySQLTuner-perl/master/mysqltuner.pl
- chmod +x mysqltuner.pl
- ./mysqltuner.pl
- We can ignore the query_cache_type (we set it to 0 for a reason)
- Ignore “InnoDB is enabled but isn’t being used” (we don’t have any tables yet!)
- Ignore -FEDERATED (this is deprecated in MySQL > 5.5)
- Ignore Key buffer hit rate (since we JUST started the server)
 
- Keep in mind, this utility is best used after you’ve got some data in your tables.
 
 
- Get and install the Zabbix Server and Agent
- wget http://repo.zabbix.com/zabbix/2.4/ubuntu/pool/main/z/zabbix-release/zabbix-release_2.4-1+trusty_all.deb
- sudo dpkg -i zabbix-release_2.4-1+trusty_all.deb
- sudo apt-get update
- sudo apt-get install zabbix-server-mysql zabbix-frontend-php zabbix-agent
 
- Time to do the Web Installation
- sudo nano /etc/php5/apache2/php.ini
- Uncomment ;date.timezone =
- Set date.timezone appropriately (for me: “America/New_York”)
- sudo service apache2 restart
 
- Now do the web installation. That part you can do without me guiding you through it. 🙂
- Test your login with Admin/zabbix (the username is case-sensitive).
 
- Set up partitioning of the MySQL database
- There’s a guide for it here.
- mysql -u <your mysql login> -p (login appropriately)
- use zabbix;
- ALTER TABLE housekeeper ENGINE = BLACKHOLE;
- From the “Getting ready” section:
- ALTER TABLE `acknowledges` DROP PRIMARY KEY, ADD KEY `acknowledges_0` (`acknowledgeid`);
- ALTER TABLE `alerts` DROP PRIMARY KEY, ADD KEY `alerts_0` (`alertid`);
- ALTER TABLE `auditlog` DROP PRIMARY KEY, ADD KEY `auditlog_0` (`auditid`);
- ALTER TABLE `events` DROP PRIMARY KEY, ADD KEY `events_0` (`eventid`);
- ALTER TABLE `service_alarms` DROP PRIMARY KEY, ADD KEY `service_alarms_0` (`servicealarmid`);
- ALTER TABLE `history_log` DROP PRIMARY KEY, ADD INDEX `history_log_0` (`id`);
- ALTER TABLE `history_log` DROP KEY `history_log_2`;
- ALTER TABLE `history_text` DROP PRIMARY KEY, ADD INDEX `history_text_0` (`id`);
- ALTER TABLE `history_text` DROP KEY `history_text_2`;
- ALTER TABLE `acknowledges` DROP FOREIGN KEY `c_acknowledges_1`, DROP FOREIGN KEY `c_acknowledges_2`;
- ALTER TABLE `alerts` DROP FOREIGN KEY `c_alerts_1`, DROP FOREIGN KEY `c_alerts_2`, DROP FOREIGN KEY `c_alerts_3`, DROP FOREIGN KEY `c_alerts_4`;
- ALTER TABLE `auditlog` DROP FOREIGN KEY `c_auditlog_1`;
- ALTER TABLE `service_alarms` DROP FOREIGN KEY `c_service_alarms_1`;
- ALTER TABLE `auditlog_details` DROP FOREIGN KEY `c_auditlog_details_1`;
 
- Create the managing partition table:
- CREATE TABLE `manage_partitions` (
 `tablename` VARCHAR(64) NOT NULL COMMENT 'Table name',
 `period` VARCHAR(64) NOT NULL COMMENT 'Period - daily or monthly',
 `keep_history` INT(3) UNSIGNED NOT NULL DEFAULT '1' COMMENT 'For how many days or months to keep the partitions',
 `last_updated` DATETIME DEFAULT NULL COMMENT 'When a partition was added last time',
 `comments` VARCHAR(128) DEFAULT '1' COMMENT 'Comments',
 PRIMARY KEY (`tablename`)
 ) ENGINE=INNODB;
 
- Create the maintenance procedures
- Guide here, we need the “Stored Procedures”.
- DELIMITER $$
 CREATE PROCEDURE `partition_create`(SCHEMANAME VARCHAR(64), TABLENAME VARCHAR(64), PARTITIONNAME VARCHAR(64), CLOCK INT)
 BEGIN
     /*
        SCHEMANAME = The DB schema in which to make changes
        TABLENAME = The table with partitions to potentially delete
        PARTITIONNAME = The name of the partition to create
     */
     /*
        Verify that the partition does not already exist
     */
     DECLARE RETROWS INT;
     SELECT COUNT(1) INTO RETROWS
     FROM information_schema.partitions
     WHERE table_schema = SCHEMANAME AND TABLE_NAME = TABLENAME AND partition_description >= CLOCK;
     IF RETROWS = 0 THEN
         /*
            1. Print a message indicating that a partition was created.
            2. Create the SQL to create the partition.
            3. Execute the SQL from #2.
         */
         SELECT CONCAT( "partition_create(", SCHEMANAME, ",", TABLENAME, ",", PARTITIONNAME, ",", CLOCK, ")" ) AS msg;
         SET @SQL = CONCAT( 'ALTER TABLE ', SCHEMANAME, '.', TABLENAME, ' ADD PARTITION (PARTITION ', PARTITIONNAME, ' VALUES LESS THAN (', CLOCK, '));' );
         PREPARE STMT FROM @SQL;
         EXECUTE STMT;
         DEALLOCATE PREPARE STMT;
     END IF;
 END$$
 DELIMITER ;
- DELIMITER $$
 CREATE PROCEDURE `partition_drop`(SCHEMANAME VARCHAR(64), TABLENAME VARCHAR(64), DELETE_BELOW_PARTITION_DATE BIGINT)
 BEGIN
     /*
        SCHEMANAME = The DB schema in which to make changes
        TABLENAME = The table with partitions to potentially delete
        DELETE_BELOW_PARTITION_DATE = Delete any partitions with names that are dates older than this one (yyyy-mm-dd)
     */
     DECLARE done INT DEFAULT FALSE;
     DECLARE drop_part_name VARCHAR(16);
     /*
        Get a list of all the partitions that are older than the date
        in DELETE_BELOW_PARTITION_DATE. All partitions are prefixed with
        a "p", so use SUBSTRING TO get rid of that character.
     */
     DECLARE myCursor CURSOR FOR
         SELECT partition_name
         FROM information_schema.partitions
         WHERE table_schema = SCHEMANAME AND TABLE_NAME = TABLENAME AND CAST(SUBSTRING(partition_name FROM 2) AS UNSIGNED) < DELETE_BELOW_PARTITION_DATE;
     DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
     /*
        Create the basics for when we need to drop the partition. Also, create
        @drop_partitions to hold a comma-delimited list of all partitions that
        should be deleted.
     */
     SET @alter_header = CONCAT("ALTER TABLE ", SCHEMANAME, ".", TABLENAME, " DROP PARTITION ");
     SET @drop_partitions = "";
     /*
        Start looping through all the partitions that are too old.
     */
     OPEN myCursor;
     read_loop: LOOP
         FETCH myCursor INTO drop_part_name;
         IF done THEN
             LEAVE read_loop;
         END IF;
         SET @drop_partitions = IF(@drop_partitions = "", drop_part_name, CONCAT(@drop_partitions, ",", drop_part_name));
     END LOOP;
     IF @drop_partitions != "" THEN
         /*
            1. Build the SQL to drop all the necessary partitions.
            2. Run the SQL to drop the partitions.
            3. Print out the table partitions that were deleted.
         */
         SET @full_sql = CONCAT(@alter_header, @drop_partitions, ";");
         PREPARE STMT FROM @full_sql;
         EXECUTE STMT;
         DEALLOCATE PREPARE STMT;
         SELECT CONCAT(SCHEMANAME, ".", TABLENAME) AS `table`, @drop_partitions AS `partitions_deleted`;
     ELSE
         /*
            No partitions are being deleted, so print out "N/A" (Not applicable) to indicate
            that no changes were made.
         */
         SELECT CONCAT(SCHEMANAME, ".", TABLENAME) AS `table`, "N/A" AS `partitions_deleted`;
     END IF;
 END$$
 DELIMITER ;
- DELIMITER $$
 CREATE PROCEDURE `partition_maintenance`(SCHEMA_NAME VARCHAR(32), TABLE_NAME VARCHAR(32), KEEP_DATA_DAYS INT, HOURLY_INTERVAL INT, CREATE_NEXT_INTERVALS INT)
 BEGIN
     DECLARE OLDER_THAN_PARTITION_DATE VARCHAR(16);
     DECLARE PARTITION_NAME VARCHAR(16);
     DECLARE LESS_THAN_TIMESTAMP INT;
     DECLARE CUR_TIME INT;
     CALL partition_verify(SCHEMA_NAME, TABLE_NAME, HOURLY_INTERVAL);
     SET CUR_TIME = UNIX_TIMESTAMP(DATE_FORMAT(NOW(), '%Y-%m-%d 00:00:00'));
     SET @__interval = 1;
     create_loop: LOOP
         IF @__interval > CREATE_NEXT_INTERVALS THEN
             LEAVE create_loop;
         END IF;
         SET LESS_THAN_TIMESTAMP = CUR_TIME + (HOURLY_INTERVAL * @__interval * 3600);
         SET PARTITION_NAME = FROM_UNIXTIME(CUR_TIME + HOURLY_INTERVAL * (@__interval - 1) * 3600, 'p%Y%m%d%H00');
         CALL partition_create(SCHEMA_NAME, TABLE_NAME, PARTITION_NAME, LESS_THAN_TIMESTAMP);
         SET @__interval=@__interval+1;
     END LOOP;
     SET OLDER_THAN_PARTITION_DATE=DATE_FORMAT(DATE_SUB(NOW(), INTERVAL KEEP_DATA_DAYS DAY), '%Y%m%d0000');
     CALL partition_drop(SCHEMA_NAME, TABLE_NAME, OLDER_THAN_PARTITION_DATE);
 END$$
 DELIMITER ;
- DELIMITER $$
 CREATE PROCEDURE `partition_verify`(SCHEMANAME VARCHAR(64), TABLENAME VARCHAR(64), HOURLYINTERVAL INT(11))
 BEGIN
     DECLARE PARTITION_NAME VARCHAR(16);
     DECLARE RETROWS INT(11);
     DECLARE FUTURE_TIMESTAMP TIMESTAMP;
     /*
      * Check if any partitions exist for the given SCHEMANAME.TABLENAME.
      */
     SELECT COUNT(1) INTO RETROWS
     FROM information_schema.partitions
     WHERE table_schema = SCHEMANAME AND TABLE_NAME = TABLENAME AND partition_name IS NULL;
     /*
      * If partitions do not exist, go ahead and partition the table
      */
     IF RETROWS = 1 THEN
         /*
          * Take the current date at 00:00:00 and add HOURLYINTERVAL to it. This is the timestamp below which we will store values.
          * We begin partitioning based on the beginning of a day. This is because we don't want to generate a random partition
          * that won't necessarily fall in line with the desired partition naming (ie: if the hour interval is 24 hours, we could
          * end up creating a partition now named "p201403270600" when all other partitions will be like "p201403280000").
          */
         SET FUTURE_TIMESTAMP = TIMESTAMPADD(HOUR, HOURLYINTERVAL, CONCAT(CURDATE(), " ", '00:00:00'));
         SET PARTITION_NAME = DATE_FORMAT(CURDATE(), 'p%Y%m%d%H00');
         -- Create the partitioning query
         SET @__PARTITION_SQL = CONCAT("ALTER TABLE ", SCHEMANAME, ".", TABLENAME, " PARTITION BY RANGE(`clock`)");
         SET @__PARTITION_SQL = CONCAT(@__PARTITION_SQL, "(PARTITION ", PARTITION_NAME, " VALUES LESS THAN (", UNIX_TIMESTAMP(FUTURE_TIMESTAMP), "));");
         -- Run the partitioning query
         PREPARE STMT FROM @__PARTITION_SQL;
         EXECUTE STMT;
         DEALLOCATE PREPARE STMT;
     END IF;
 END$$
 DELIMITER ;
- DELIMITER $$
 CREATE PROCEDURE `partition_maintenance_all`(SCHEMA_NAME VARCHAR(32))
 BEGIN
     CALL partition_maintenance(SCHEMA_NAME, 'history', 28, 24, 14);
     CALL partition_maintenance(SCHEMA_NAME, 'history_log', 28, 24, 14);
     CALL partition_maintenance(SCHEMA_NAME, 'history_str', 28, 24, 14);
     CALL partition_maintenance(SCHEMA_NAME, 'history_text', 28, 24, 14);
     CALL partition_maintenance(SCHEMA_NAME, 'history_uint', 28, 24, 14);
     CALL partition_maintenance(SCHEMA_NAME, 'trends', 730, 24, 14);
     CALL partition_maintenance(SCHEMA_NAME, 'trends_uint', 730, 24, 14);
 END$$
 DELIMITER ;
 
- Create the new timing event
- DELIMITER $$
 CREATE EVENT IF NOT EXISTS `zabbix-maint`
 ON SCHEDULE EVERY 7 DAY
 STARTS '2015-04-29 01:00:00'
 ON COMPLETION PRESERVE
 ENABLE
 COMMENT 'Creating and dropping partitions'
 DO BEGIN
 CALL partition_maintenance_all('zabbix');
 END$$
 DELIMITER ;
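- This will run the partition maintenance procedure on the Zabbix history and trends tables every 7 days, creating 14 days of future partitions each time. Since each run reaches 14 days ahead, the weekly runs overlap and there is always a partition waiting for incoming data.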
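To get a feel for what a maintenance run asks MySQL to create, here is a rough shell sketch that mirrors the naming loop in partition_maintenance. The function list_partitions is made up for this example, it needs GNU date, and it works in UTC, whereas the stored procedure uses whatever time zone the MySQL server is set to, so treat the output as an illustration only.

```shell
#!/bin/bash
# Sketch: list the partitions one partition_maintenance run would ask for.
# list_partitions is a hypothetical helper that mirrors the procedure's loop.
# $1 = epoch seconds at today's midnight
# $2 = interval in hours
# $3 = number of intervals to create
# Requires GNU date (for -d @epoch). Uses UTC; MySQL uses the server time zone.
list_partitions() {
  local start=$1 hours=$2 count=$3 i=1 name less_than
  while [ "$i" -le "$count" ]; do
    # The name comes from the start of the interval, the boundary from its end.
    name=$(date -u -d "@$(( start + hours * (i - 1) * 3600 ))" +p%Y%m%d%H00)
    less_than=$(( start + hours * i * 3600 ))
    echo "$name $less_than"
    i=$(( i + 1 ))
  done
}

# Example: 24-hour partitions starting 2015-04-29 00:00:00 UTC (epoch 1430265600).
list_partitions 1430265600 24 3
```

Each output line is a partition name followed by the VALUES LESS THAN timestamp for that partition. With the settings in partition_maintenance_all (24-hour interval, 14 intervals), a real run would cover two weeks.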
 
 
- Tweak the Zabbix instance
- Disable Housekeeping in Admin -> General -> Housekeeping (the partitioning above now takes care of history and trends cleanup)
- Install snmp utilities
- sudo apt-get install snmp snmp-mibs-downloader
 
- Tweak the Zabbix config files
- sudo nano /etc/zabbix/zabbix_server.conf
- Set the number of pingers: StartPingers = 20 (we currently have 350 hosts; with 20 pingers, pinger utilization sits at ~10.52%)
- Set the number of DB syncers: StartDBSyncers = 4
- Enable the SNMP trapper: StartSNMPTrapper = 1
- Increase CacheSizes
- CacheSize = 1G
- HistoryCacheSize = 256M
- TrendCacheSize = 256M
- HistoryTextCacheSize = 128M
- ValueCacheSize = 256M
 
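- Raise the kernel’s shared memory limit so the larger caches can be allocated
- sudo nano /etc/sysctl.conf
- Add: kernel.shmmax = 1342177280
- Apply it with sudo sysctl -p, then restart the Zabbix server (sudo service zabbix-server restart)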
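In case the shmmax number looks like magic: kernel.shmmax is the largest single shared memory segment the kernel will allow, in bytes, and 1342177280 works out to 1280 MB (1.25 GB). Here is a small sketch to check the cache sizes against it. The to_bytes helper is made up for this example, and I am assuming each Zabbix cache lives in its own segment, which I have not verified, so compare sizes one by one rather than as a total.

```shell
#!/bin/bash
# Sketch: convert Zabbix-style sizes (K, M, G suffixes) to bytes and
# compare each one against kernel.shmmax.
# to_bytes is a hypothetical helper used only for this illustration.
to_bytes() {
  local value=$1 number
  number=${value%[KMG]}
  case $value in
    *K) echo $(( number * 1024 )) ;;
    *M) echo $(( number * 1024 * 1024 )) ;;
    *G) echo $(( number * 1024 * 1024 * 1024 )) ;;
    *)  echo "$number" ;;
  esac
}

# The value added to /etc/sysctl.conf above.
shmmax=1342177280

# The cache sizes from zabbix_server.conf above.
# Assumption: each cache is allocated as its own shared memory segment.
for size in 1G 256M 256M 128M 256M; do
  bytes=$(to_bytes "$size")
  if [ "$bytes" -le "$shmmax" ]; then
    echo "$size = $bytes bytes (fits under shmmax)"
  else
    echo "$size = $bytes bytes (exceeds shmmax)"
  fi
done
```

If you bump CacheSize past 1G, raise kernel.shmmax to match.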
 
 
 
- Optional: Enable ldap
- sudo apt-get install php5-ldap
- sudo service apache2 restart
 
 
- Getting Zabbix to throw data to Cachet
- Create a file “notifyCachet” in /usr/lib/zabbix/alertscripts and make it executable (sudo chmod +x notifyCachet)
- #!/bin/bash
 # Zabbix passes three arguments to an alert script: recipient, subject, message.
 # The action below puts the Cachet component ID in the subject and the status ID in the message.
 to=$1
 compID=$2
 statusID=$3
 # Comment this next line out for Production environments
 #echo "curl -H 'Content-Type: application/json' -H 'X-Cachet-Token: <your cachet API token>' http://<cachet server ip>/api/components/$compID -d '{\"status\":$statusID}' -X PUT"
 # Uncomment this next line for Production
 curl -H 'Content-Type: application/json' -H 'X-Cachet-Token: <your cachet API token>' "http://<cachet server ip>/api/components/$compID" -d "{\"status\":$statusID}" -X PUT
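If you want the script to refuse anything that is not a valid Cachet component status, something along these lines could go in front of the curl call. This is only a sketch: cachet_status_name and cachet_payload are names I made up, and the numbers are Cachet’s four component statuses (1 through 4).

```shell
#!/bin/bash
# Sketch: validate the status ID before sending it to Cachet.
# cachet_status_name and cachet_payload are hypothetical helpers.

# Map a Cachet component status ID to its name; fails on anything else.
cachet_status_name() {
  case $1 in
    1) echo "Operational" ;;
    2) echo "Performance Issues" ;;
    3) echo "Partial Outage" ;;
    4) echo "Major Outage" ;;
    *) return 1 ;;
  esac
}

# Build the JSON body for the component update; fails on an unknown status.
cachet_payload() {
  cachet_status_name "$1" > /dev/null || return 1
  printf '{"status":%d}\n' "$1"
}

# Example: a valid status, then one that gets rejected.
cachet_payload 4
cachet_payload 9 || echo "rejected status 9"
```

The output of cachet_payload can be passed straight to curl with -d, which also keeps the JSON quoting in one place.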
 
- From Zabbix: go to Admin -> Media Types -> Create Media Type
- Set Name to whatever
- Type is Script
- Script Name is “notifyCachet”
 
- Go to Config -> Actions -> Create Action
- Action Settings:
- Default/Recovery Subject: {$CACHET} (this becomes the component ID in the script)
- Default Message: 4 (Cachet’s status ID for a major outage)
- Recovery Message: 1 (Cachet’s status ID for operational)
 
- Conditions: Add Trigger Severity >= Average
- Operations: Add a User Group, Send ONLY to Cachet_Notify (the media type created above)
 
- On every Host you want reflected in Cachet, you MUST set a Macro {$CACHET} whose value is the Cachet component ID number
 
I know, this is a lot of stuff to process, but honestly it’s worth going through and setting it up properly. Zabbix is running flawlessly for us right now. This is a bit messy right now (yay WordPress), so in a day or so here’s a PDF version of the guide.
Cheers,
-M
