<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Carlos&#039; Corner &#187; monitoring</title>
	<atom:link href="http://cars.lostroncos.org/tag/monitoring/feed/" rel="self" type="application/rss+xml" />
	<link>http://cars.lostroncos.org</link>
	<description>The tired geek-dad in the corner</description>
	<lastBuildDate>Wed, 12 May 2010 19:46:13 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Simple Neverfail Monitoring with Zabbix part 2</title>
		<link>http://cars.lostroncos.org/2010/02/23/simple-neverfail-monitoring-with-zabbix-part-2/</link>
		<comments>http://cars.lostroncos.org/2010/02/23/simple-neverfail-monitoring-with-zabbix-part-2/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 17:53:36 +0000</pubDate>
		<dc:creator>cars</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[neverfail]]></category>
		<category><![CDATA[zabbix]]></category>
		<category><![CDATA[neverfail for Exchange]]></category>
		<category><![CDATA[neverfail heartbeat]]></category>
		<category><![CDATA[reg_dword_big_endian]]></category>

		<guid isPermaLink="false">http://cars.lostroncos.org/?p=363</guid>
		<description><![CDATA[Recap
<p>So in the previous post I put together a simple script for getting the data out of a specified registry entry that handled the REG_DWORD_BIG_ENDIAN data type.  In this one I&#8217;ll go over the general process of getting the registry-based perf data into Zabbix and setting up alerting based on it.</p>
Setting [...]]]></description>
			<content:encoded><![CDATA[<h2>Recap</h2>
<p>So in the <a href="http://cars.lostroncos.org/2009/05/31/simple-monitoring-of-neverfail-with-zabbix-part-1/">previous post</a> I put together a simple script for getting the data out of a specified registry entry that handled the REG_DWORD_BIG_ENDIAN data type.  In this one I&#8217;ll go over the general process of getting the registry-based perf data into Zabbix and setting up alerting based on it.</p>
<h2>Setting up Zabbix</h2>
<p>I won&#8217;t cover the actual installation of Zabbix here, but before we can put data into Zabbix we need to add the counters/items that I will be populating in the future. The first thing I need to do is determine exactly what those counters are and which of the nodes they need to come from.</p>
<table border="1">
<tbody>
<tr>
<th>Registry Path/Value</th>
<th>Node</th>
<th>Description</th>
</tr>
<tr>
<td>\Neverfail\R2\Performance\CurrentThroughput</td>
<td>Active</td>
<td>Nominally the throughput  of data between the two nodes</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\MegaBytessent</td>
<td>Active</td>
<td>Number of megabytes sent</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\MegabytesReceived</td>
<td>Active</td>
<td>Number of megabytes received</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\OldestUnsafeupdatequeueentry</td>
<td>Active</td>
<td>Age of the oldest item in the Unsafe Queue</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\UnsafeUpdateQueueSize</td>
<td>Active</td>
<td>How much data is in the Unsafe Queue waiting to be passed to the passive node</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\UnsafeUpdateQueueSize (dup)</td>
<td>Active</td>
<td>Same as above but I want to measure the rate of growth as a possible factor in alerting</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\KBDispatchedFromUnsafeQueue</td>
<td>Active</td>
<td>How much total data has been sent from the unsafe queue</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\Oldestsafeupdatequeueentry</td>
<td>Passive</td>
<td>The age of the oldest item in the safe queue</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\safeUpdateQueueSize</td>
<td>Passive</td>
<td>Size of the Safe Queue</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\safeUpdateQueueSize (dup)</td>
<td>Passive</td>
<td>Same as above but I want to measure the rate of growth as a possible factor in alerting</td>
</tr>
<tr>
<td>\Neverfail\R2\Performance\KBDispatchedFromsafeQueue</td>
<td>Passive</td>
<td>How much total data has been written from the Safe Queue</td>
</tr>
<tr>
<td>\JavaSoft\Prefs\Neverfail\current\/Registry/State/Manager\/Status\/Value</td>
<td>Active</td>
<td>Current status of the registry synchronization.</td>
</tr>
<tr>
<td>\JavaSoft\Prefs\Neverfail\current\/New/File/State/Mgr\/Synchronization/Status\/Tag</td>
<td>Active</td>
<td>Current file synchronization status.</td>
</tr>
<tr>
<td>\JavaSoft\Prefs\Neverfail\current\/Controller\/Is/Primary/Server</td>
<td>Active</td>
<td>Is the active server the primary or not. From this I can tell which node is active.</td>
</tr>
</tbody>
</table>
<p>Because I have multiple Neverfail clusters in my environment, I will create a template in Zabbix with all the necessary counters, which I can then apply to the hosts rather than adding the counters manually to each host.  Since a host can have multiple templates assigned to it, I&#8217;ll also include a new &#8220;application&#8221; called Neverfail to help separate Neverfail counters from any other counters that might be associated with a host (e.g. Exchange counters).</p>
<p>To help with some of the drudgery associated with manually creating all the items, I&#8217;ve provided  <a href="http://cars.home.lostroncos.org/wp-uploads/2010/02/zbx_Template_NeverfailCluster.xml">a version of the template</a> that can simply be imported into Zabbix. The template includes all of the counters from above as well as a couple of basic triggers for alerting.</p>
<p>Here are a couple of short videos that walk through manually creating a template, and importing the one I&#8217;ve provided.</p>
<table border="0" width="550">
<tbody>
<tr>
<td><a href="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_create_template.swf" target="_blank"><img class="alignnone size-full wp-image-493" title="Creating a template" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/create_video.png" alt="Creating a template" /><br />
Creating a template</a></td>
<td><a href="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_import_and_create.swf" target="_blank"><img class="alignnone size-full wp-image-493" title="Importing a template" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/import_video.png" alt="Importing a template" /><br />
Importing the Neverfail Template into Zabbix</a></td>
</tr>
</tbody>
</table>
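<p>Sharp eyes might notice that I&#8217;m capturing both UnsafeUpdateQueueSize and safeUpdateQueueSize twice.  The two copies of each value are treated differently. The first is a simple measurement of how much data is in the queue. The second, as noted in the table above, is there so I can measure the rate of growth as a possible factor in alerting.</p>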
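<p>To make the difference between the two items concrete, here is a minimal Python sketch (Python is used purely for illustration and is not part of the scripts in this post) of turning two successive queue-size readings into a rate of growth. The function name and the sample numbers are made up, and the units are whatever the counter happens to report.</p>

```python
# Hypothetical sketch: derive a growth rate from two successive queue-size samples.
def growth_rate(prev_value, prev_time, curr_value, curr_time):
    """Return the change per second between two samples, or None if no time elapsed."""
    elapsed = curr_time - prev_time
    if elapsed == 0:
        return None
    return (curr_value - prev_value) / elapsed


# Two made-up samples taken 60 seconds apart.
print(growth_rate(1000, 0, 1600, 60))  # 10.0
```

<p>A steadily positive rate says the queue is filling faster than it drains, which can be worth knowing well before the absolute size looks alarming.</p>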
<h2>About Zabbix_sender</h2>
<p>Now, turning our attention to how we get the info into Zabbix, let&#8217;s look at Zabbix_sender.  It&#8217;s available as a pre-compiled binary for Windows from <a href="http://www.zabbix.com/download.php">Zabbix&#8217;s website</a>. Getting it ready is as simple as unzipping the download and putting the executable somewhere. Running <em>zabbix_sender -h</em> shows that it takes a number of options.</p>
<div class="codecolorer-container text blackboard" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;height:300px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">C:\Temp&gt;zabbix_sender -h<br />
ZABBIX send v1.6.2 (16 January 2009)<br />
<br />
usage: zabbix_sender [-Vhv] {[-zpsI] -ko | [-zpI] -i &lt;file&gt;} [-c &lt;file&gt;]<br />
<br />
Options:<br />
-c Specify configuration file<br />
-z Hostname or IP address of ZABBIX Server.<br />
-p Specify port number of server trapper running on the server. Default is 10051.<br />
-s Specify hostname or IP address of a host.<br />
-I Specify source IP address<br />
-k Specify metric name (key) we want to send.<br />
-o Specify value of the key.<br />
-i &lt;file&gt; Load values from input file.<br />
Each line of file contains:<br />
.<br />
-v Verbose mode<br />
Other options:<br />
-h Give this help.<br />
-V Display version number.</div></div>
<p>The ones I use  are -s, -z, -k and -o.  So a typical command line for me would look something like:</p>
<div class="codecolorer-container text blackboard" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">C:\temp\Zabbix_sender -z zabbix.crtcorp.com -s neverfail01 -k&quot;nf_cluster[file_sync_status]&quot; -o &quot;/Synchronized&quot;</div></div>
<p>Breaking down the command line:</p>
<ul>
<li><em><strong>zabbix.crtcorp.com</strong></em> is the Zabbix server we&#8217;re sending this data to</li>
<li><strong><em>neverfail01</em></strong> is the Neverfail node we&#8217;re sending information about</li>
<li><strong><em>nf_cluster[file_sync_status]</em></strong> is the key for the Zabbix item (i.e. counter) we want the information associated with</li>
<li>&#8220;<strong><em>/Synchronized</em></strong>&#8221; is the value we want stored under that key</li>
</ul>
<p>In the example the value we&#8217;re putting into Zabbix is a string rather than a numerical value. Here&#8217;s an example with a numeric value being put into Zabbix:</p>
<p>C:\temp\Zabbix_sender -z zabbix.crtcorp.com -s neverfail01 -k&quot;nf_cluster[throughput]&quot; -o &quot;103453&quot;</p>
<p>Here we&#8217;re specifying the item with key <strong><em>nf_cluster[throughput]</em></strong> and giving it a value of <strong><em>103453</em></strong>.</p>
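<p>The same command line can be assembled programmatically. The Python sketch below is only an illustration (the function name is hypothetical, and it is not the VBScript used in this post); it uses nothing beyond the -z, -s, -k and -o options described above and builds an argument list rather than a single quoted string, which sidesteps most quoting problems.</p>

```python
def build_sender_args(sender_path, server, host, key, value):
    """Build an argument list using only the -z, -s, -k and -o options."""
    return [sender_path, "-z", server, "-s", host, "-k", key, "-o", str(value)]


args = build_sender_args(r"C:\temp\zabbix_sender", "zabbix.crtcorp.com",
                         "neverfail01", "nf_cluster[throughput]", 103453)
print(args)
# To actually send, the list could be handed to subprocess.run(args);
# that step is left out because it needs a reachable Zabbix server.
```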
<h3>Adding Zabbix_Sender</h3>
<p>Now what I needed to do was combine the script I wrote earlier with zabbix_sender to actually put the registry data into Zabbix. So I added a new function to the GetRegValue.vbs script to execute the actual zabbix_sender call. It is pretty straightforward: it builds a formulaic command line and then executes it. You&#8217;ll notice there is no error checking.</p>
<div class="codecolorer-container text blackboard" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">'###################################################################################<br />
Function Zabbix_Send(ZabbixKey,Value)<br />
Dim WshShell, oExec, CommandLine<br />
Set WshShell = CreateObject(&quot;WScript.Shell&quot;)<br />
'Build Our Command line so we can also echo it to console<br />
'Ex zbx send cmd line = C:\temp\Zabbix_sender -z&quot;wv-zabbix-01&quot; -s&quot;neverfail01&quot; -k&quot;nf_cluster[file_sync_status]&quot; -o &quot;/Synchronized&quot;<br />
CommandLine = ZBXSend &amp; &quot; -v -z&quot;&quot;&quot; &amp; ZBXServer &amp; &quot;&quot;&quot; -s&quot;&quot;&quot; &amp; ZBXClient &amp; &quot;&quot;&quot; -k&quot;&quot;&quot; &amp; ZabbixKey &amp; &quot;&quot;&quot; -o &quot;&quot;&quot; &amp; Value &amp; &quot;&quot;&quot;&quot;<br />
WScript.Echo &quot;Commandline is [&quot; &amp; CommandLine &amp; &quot;]&quot;<br />
'Execute our command line<br />
Set oExec = WshShell.Exec(CommandLine)<br />
End Function</div></div>
<p>The next step is to modify the main body of the original GetRegValue script to turn it into a function. I then changed the WScript.Echo calls so that the function returns the registry value rather than simply writing it to the console (WScript.Echo HexToDec(HexValue) -&gt; GetRegValue = HexToDec(HexValue), WScript.Echo strValue -&gt; GetRegValue = strValue, and so on).  The result is a script that reads <strong><em>one</em></strong> specified registry value and then inserts it into Zabbix.</p>
<p>Since there are a number of values we want to put into Zabbix, we need to think about how to approach this given that the script only handles one value at a time.  What I settled on was a batch file that uses a <strong><em>for</em></strong> loop to go through a file with a list of registry-based perf counters related to Neverfail.  The script as it now stands needs three arguments passed to it: the Zabbix key, the registry key path, and the registry value name.  For values I want to get from the passive node, the registry path needs to include the private IP address of the passive node (e.g. \\10.0.0.2\HKLM\Software\Neverfail\R2\Performance) so that reg.exe knows where to go to get them.  The script can then query the registry using the path and value combination to get the data, which it then sends to Zabbix using the key specified on the command line.  So having the list of registry values from the part 1 post, I&#8217;m able to put together my file.</p>
<p>Because  I need to specify a delimiter to the <strong><em>for</em></strong> command and I use commas &#8216;,&#8217; in the Zabbix keys that I&#8217;ve defined, I need to use something else as a delimiter for my input file, so I&#8217;ve settled on using a pipe symbol as shown below.</p>
<div class="codecolorer-container text blackboard" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">nf_cluster[throughput]|HKLM\Software\Neverfail\R2\Performance|CurrentThroughput<br />
nf_cluster[MB_sent]|HKLM\Software\Neverfail\R2\Performance|MegaBytessent<br />
nf_cluster[MB_recvd]|HKLM\Software\Neverfail\R2\Performance|MegabytesReceived<br />
nf_q[unsafe,age]|HKLM\Software\Neverfail\R2\Performance|OldestUnsafeupdatequeueentry<br />
nf_q[unsafe,size]|HKLM\Software\Neverfail\R2\Performance|UnsafeUpdateQueueSize<br />
nf_q[unsafe,rate]|HKLM\Software\Neverfail\R2\Performance|UnsafeUpdateQueueSize<br />
nf_q[unsafe,total_kb_sent]|HKLM\Software\Neverfail\R2\Performance|KBDispatchedFromUnsafeQueue<br />
nf_q[safe,age]|\\10.0.0.2\HKLM\Software\Neverfail\R2\Performance|Oldestsafeupdatequeueentry<br />
nf_q[safe,size]|\\10.0.0.2\HKLM\Software\Neverfail\R2\Performance|safeUpdateQueueSize<br />
nf_q[safe,rate]|\\10.0.0.2\HKLM\Software\Neverfail\R2\Performance|SafeUpdateQueueSize<br />
nf_q[safe,total_kb_sent]|\\10.0.0.2\HKLM\Software\Neverfail\R2\Performance|KBDispatchedFromsafeQueue<br />
nf_cluster[reg_sync_status]|HKLM\Software\JavaSoft\Prefs\Neverfail\current\/Registry/State/Manager\/Status|/Value<br />
nf_cluster[file_sync_status]|HKLM\Software\JavaSoft\Prefs\Neverfail\current\/New/File/State/Mgr\/Synchronization/Status|/Tag<br />
nf_cluster[primary]|HKLM\Software\JavaSoft\Prefs\Neverfail\current\/Controller|/Is/Primary/Server</div></div>
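<p>Each line of this file has three pipe-separated fields: the Zabbix key, the registry path, and the value name. As a rough illustration only (Python, with a hypothetical function name; the real work is done by the batch file and VBScript), a line breaks apart like this:</p>

```python
def parse_key_line(line):
    """Split one 'zabbix_key|registry_path|value_name' line into its three fields."""
    parts = line.strip().split("|")
    if len(parts) != 3:
        raise ValueError("expected 3 pipe-delimited fields: " + repr(line))
    return tuple(parts)


sample = [
    r"nf_q[unsafe,size]|HKLM\Software\Neverfail\R2\Performance|UnsafeUpdateQueueSize",
    r"nf_q[safe,size]|\\10.0.0.2\HKLM\Software\Neverfail\R2\Performance|safeUpdateQueueSize",
]
for line in sample:
    key, path, name = parse_key_line(line)
    print(key, path, name)
```

<p>Note that the comma inside a key such as nf_q[safe,size] survives untouched, which is the whole point of choosing the pipe as the delimiter.</p>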
<p>While my batch file  is about 35 lines, it really boils down to one line which does all the real work:</p>
<div class="codecolorer-container text blackboard" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">for /F &quot;tokens=1-3 delims=|&quot; %%I in (%ZBXKEYS%) do cscript %SENDVALUES% &quot;%%I&quot; &quot;%%J&quot; &quot;%%K&quot;</div></div>
<p>With the environment variables expanded it would look more like:</p>
<div class="codecolorer-container text blackboard" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">for /F &quot;tokens=1-3 delims=|&quot; %%I in (zbx_keys_to_send.txt ) do cscript SENDVALUES.vbs &quot;%%I&quot; &quot;%%J&quot; &quot;%%K&quot;</div></div>
<p>This for loop reads each line of the text file zbx_keys_to_send.txt and, using the pipe symbol as a delimiter, takes the first three tokens/strings of each line and calls the SENDVALUES.vbs script with them as arguments.  The script and input file worked fine when I ran them on the primary node, but not so well when I ran them while the secondary was active. After some troubleshooting I realized that one thing I didn&#8217;t think through at first was that I actually need two lists/input files. Since the private IP address I want to use to get data from the passive node&#8217;s registry will change depending on which node is active, I&#8217;ll need one list for when the script is sending from the primary node (10.0.0.1) and another for when the secondary (10.0.0.2) is active. The lists should essentially be identical, with the only difference being the IP address specified for the passive node.</p>
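<p>Since the two lists differ only in the passive node&#8217;s address, the second one can be derived from the first instead of being maintained by hand. The following is a minimal Python sketch under that assumption; the function name is hypothetical and the two sample lines are taken from the input file format used in this post.</p>

```python
def retarget_passive(lines, old_ip, new_ip):
    """Rewrite remote registry paths that start with old_ip so they use new_ip."""
    old_prefix = "\\\\" + old_ip + "\\"
    new_prefix = "\\\\" + new_ip + "\\"
    out = []
    for line in lines:
        fields = line.split("|")
        # Only the registry path (second field) carries the passive node address.
        if len(fields) == 3 and fields[1].startswith(old_prefix):
            fields[1] = new_prefix + fields[1][len(old_prefix):]
        out.append("|".join(fields))
    return out


primary_active = [
    r"nf_q[unsafe,size]|HKLM\Software\Neverfail\R2\Performance|UnsafeUpdateQueueSize",
    r"nf_q[safe,size]|\\10.0.0.2\HKLM\Software\Neverfail\R2\Performance|safeUpdateQueueSize",
]
# When the secondary is active, the passive node is the primary at 10.0.0.1.
secondary_active = retarget_passive(primary_active, "10.0.0.2", "10.0.0.1")
for line in secondary_active:
    print(line)
```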
<p><img class="alignnone size-full wp-image-369" title="A generic Neverfail cluster" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/nf_cluster.png" alt="A generic Neverfail cluster" width="474" height="275" /></p>
<p>Now all that is left to do is copy the batch file, the VBScript file and the appropriate input file to each node in the cluster. Prior to setting up the scheduled task I like to manually run the batch file a few times to make sure the data is getting populated into Zabbix. To do this I need to use a local account that exists on both nodes (in my case I use the local Administrator account). This is so that the reg.exe util can seamlessly get values from the passive node (assuming the account has the same password on both nodes).</p>
<h3>A little troubleshooting hint.</h3>
<p>When running the script manually I can see each time the VBScript file calls zabbix_sender and whether or not that submission was successful. Running zabbix_sender and mistyping the key was not an uncommon issue when I was putting this together. Fortunately zabbix_sender lets me know what happened when I attempted to submit data.  As an example, below is the output I get when trying to submit a value for the nf_q[safe,size] key if I mistype the key as nfq[safe,size].</p>
<p><img class="alignnone size-full wp-image-390" title="zbx_send_failed" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_send_failed.png" alt="zbx_send_failed" width="739" height="86" /></p>
<p>I can see that it reports that I have 1 failed item, and no Processed items. When I run it without any typos (intentional or otherwise) I get:</p>
<p><img class="alignnone size-full wp-image-389" title="zbx_send_good" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_send_good.png" alt="zbx_send_good" width="718" height="85" /></p>
<p>Now I can see that I had 1 item processed successfully and no Failed ones.</p>
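<p>If the scheduled task should notice failed submissions on its own, the processed and failed counts could be pulled out of the captured output. The Python sketch below is an assumption-heavy illustration: the exact wording of the summary line depends on the zabbix_sender version, so the sample text and the patterns are guesses to be adjusted against real output, and the function name is hypothetical.</p>

```python
import re


def parse_sender_summary(text):
    """Return (processed, failed) counts found in a summary line, or None."""
    processed = re.search(r"processed:?\s+(\d+)", text, re.IGNORECASE)
    failed = re.search(r"failed:?\s+(\d+)", text, re.IGNORECASE)
    if processed is None or failed is None:
        return None
    return int(processed.group(1)), int(failed.group(1))


# Assumed example line; compare with what zabbix_sender really prints.
print(parse_sender_summary('Info from server: "Processed 0 Failed 1 Total 1"'))
```

<p>A result with a non-zero failed count usually means a mistyped key or host name, as in the example above.</p>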
<h2>Setting up Alerting</h2>
<p>If you import the template I&#8217;ve provided it should have also created four triggers that can be used to generate actions within Zabbix.</p>
<p><img class="alignnone size-full wp-image-392" title="template_triggers" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/template_triggers.png" alt="template_triggers" width="851" height="151" /></p>
<p>These triggers are based on situations I&#8217;ve run into in my environment that I want to be aware of.  The first is when the size of either the Safe or Unsafe queue has been above 2GB for over an hour. Neverfail was great at letting me know the queue was full and that it was going to stop replicating, but not so good at warning me while it was happening.  I generally wanted to be aware well before we got to the state where it stopped replicating, and these triggers are a way of warning me something is going on.  The second situation is when data to be replicated has been sitting in one of the queues for more than a specified amount of time.  This is similar to watching the queue get beyond a certain size as the first two triggers do, but is helpful in situations where there isn&#8217;t a whole lot of data changing on the active node (i.e. over weekends).</p>
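<p>The &#8220;above 2GB for over an hour&#8221; condition boils down to checking that every reading in the last hour exceeds the threshold. This is not the trigger expression from the template, just a minimal Python sketch of the logic; it assumes the counter is reported in bytes, uses made-up readings, and does not check that the samples actually span the full hour.</p>

```python
TWO_GB = 2 * 1024 ** 3  # assumes the counter is reported in bytes


def sustained_above(samples, threshold, window_seconds, now):
    """True if the window holds at least one sample and all of them exceed threshold."""
    recent = [value for (stamp, value) in samples
              if window_seconds >= now - stamp >= 0]
    return bool(recent) and min(recent) > threshold


# Made-up readings as (seconds, value) pairs, one every 30 minutes.
readings = [(0, 3 * TWO_GB), (1800, 3 * TWO_GB), (3600, 3 * TWO_GB)]
print(sustained_above(readings, TWO_GB, 3600, now=3600))  # True
```

<p>Testing the minimum over the window, rather than the latest value, is what keeps a single brief spike from raising an alert.</p>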
<p>It is of course  possible to change these and set them to what fits for your environment and even to add other triggers. In later versions of this monitoring I&#8217;ve added some other counters/keys related to the task state using the nfcmd.exe command line tool. This allows me to see when a server is doing a full system check or even the dreaded &#8220;internal system task&#8221; as well as how much progress it&#8217;s made.  Some example screenshots are included below.</p>
<table border="0" width="100%">
<tbody>
<tr align="center">
<td><div id="attachment_394" class="wp-caption alignnone" style="width: 160px"><br />
<a rel="lightbox[nf]" href="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_nf_data.png"><img class="size-thumbnail wp-image-394" title="Sample Data for one cluster" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_nf_data-150x150.png" alt="Sample Data for one cluster" width="150" height="150" /></a><p class="wp-caption-text">Sample Data for one cluster</p></div></td>
<td><div id="attachment_396" class="wp-caption alignnone" style="width: 160px"><a rel="lightbox[nf]" href="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_nf_landscape.png"><img class="size-thumbnail wp-image-396" title="Overview of all the clusters in my environment" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_nf_landscape-150x150.png" alt="Overview of all the clusters in my environment" width="150" height="150" /></a><p class="wp-caption-text">Cluster Overview</p></div></td>
<td><div id="attachment_397" class="wp-caption alignnone" style="width: 160px"><a rel="lightbox[nf]" href="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_nf_safeq_size_graph.png"><img class="size-thumbnail wp-image-397 " title="Graph of the Safe Queue size" src="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_nf_safeq_size_graph-150x150.png" alt="Graph of the Safe Queue size" width="150" height="150" /></a><p class="wp-caption-text">Sample Graph of the Safe Queue size</p></div></td>
<td><div id="attachment_395" class="wp-caption alignnone" style="width: 160px"><a rel="lightbox[nf]" href="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_nf_fullcheck.png"><img class="size-thumbnail wp-image-395" title="Enhanced view showing a Full System Check that is 3% complete." src="http://cars.lostroncos.org/wp-content/uploads/2010/02/zbx_nf_fullcheck-150x150.png" alt="Enhanced view showing a Full System Check that is 3% complete." width="150" height="150" /></a><p class="wp-caption-text">Enhanced view</p></div></td>
</tr>
</tbody>
</table>
<p>The three files I use are included here:</p>
<ul>
<li><a href="http://cars.lostroncos.org/wp-content/uploads/2010/02/Do_Zabbix.cmd.txt">DO_Zabbix.cmd</a> &#8211; The batch file that reads the input file with reg values &amp; zabbix keys and calls SendRegValue.vbs</li>
<li><a href="http://cars.lostroncos.org/wp-content/uploads/2010/02/SendRegValue.vbs.txt">SendRegValue.vbs</a> &#8211; The vbscript file that actually reads the registry entry and does any necessary conversions to send the value to Zabbix</li>
<li><a href="http://cars.lostroncos.org/wp-content/uploads/2010/02/zabbix_keys_to_send.txt">zabbix_keys_to_send.txt</a> &#8211; the input file used by DO_Zabbix.cmd. This version is the one I run when the primary node is active. The IP addresses would need to be changed for the copy that runs when the secondary node is active.</li>
</ul>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;</p>
<p><em>A few additional notes:</em></p>
<p><em>Because Neverfail  is continually pushing the perf data to the registry it does happen on occasion that the script will catch spuriously large or odd values for some counters. </em></p>
<p><em>If I were to use the zabbix_agent on my Neverfail nodes, it would be possible to include all this same monitoring within the agent&#8217;s configuration so that the agent pushes the data rather than using zabbix_sender via a scheduled task. <em>However that&#8217;s a post for some other time&#8230;<br />
-crt</em></em></p>
]]></content:encoded>
			<wfw:commentRss>http://cars.lostroncos.org/2010/02/23/simple-neverfail-monitoring-with-zabbix-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simple Neverfail monitoring with Zabbix part 1</title>
		<link>http://cars.lostroncos.org/2009/05/31/simple-monitoring-of-neverfail-with-zabbix-part-1/</link>
		<comments>http://cars.lostroncos.org/2009/05/31/simple-monitoring-of-neverfail-with-zabbix-part-1/#comments</comments>
		<pubDate>Mon, 01 Jun 2009 05:01:54 +0000</pubDate>
		<dc:creator>cars</dc:creator>
				<category><![CDATA[VMware]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[neverfail]]></category>
		<category><![CDATA[zabbix]]></category>
		<category><![CDATA[neverfail for Exchange]]></category>
		<category><![CDATA[neverfail heartbeat]]></category>
		<category><![CDATA[reg_dword_big_endian]]></category>
		<category><![CDATA[windows registry]]></category>

		<guid isPermaLink="false">http://cars.lostroncos.org/?p=157</guid>
		<description><![CDATA[Background
<p>This is the first of a couple of posts on how I&#8217;ve cobbled together some basic monitoring of Neverfail&#8217;s  Neverfail Heartbeat H/A software which is also now the basis for VMWare&#8217;s vCenter Server Heartbeat. Since Neverfail seems to consider their command lines privileged information I will only cover how to do some simple monitoring using [...]]]></description>
			<content:encoded><![CDATA[<h2>Background</h2>
<p>This is the first of a couple of posts on how I&#8217;ve cobbled together some basic monitoring of <a href="http://www.neverfailgroup.com/">Neverfail&#8217;s  Neverfail Heartbeat H/A</a> software which is also now the basis for <a href="http://www.vmware.com/products/vcenter-server-heartbeat/">VMWare&#8217;s vCenter Server Heartbeat</a>. Since Neverfail seems to consider their command lines privileged information I will only cover how to do some simple monitoring using the registry. When starting on this effort internally I was initially only interested in figuring out a quick and simple way to get the info I needed, and not so much in how to get it into a monitoring system.</p>
<p>I&#8217;ve been working with another team where I work to look at Zabbix as an alternative for some of the monitoring we do in our environment. We use Microsoft Operations Manager 2005 (MOM) but haven&#8217;t fully cut over from our previous monitoring solution. I had looked at Zabbix earlier as a potential solution for monitoring a bunch of VMware ESX boxes but another team ended up getting tasked with that particular duty. So I had had some experience with Zabbix but hadn&#8217;t done too much with it since.</p>
<p>One of the things that&#8217;d been rattling around in my brain is using the zabbix_sender feature/client to monitor some of the other components/things we can&#8217;t easily get into MOM.  Zabbix_Sender is a utility that is available for use with Zabbix that allows one to &#8220;send&#8221; information to Zabbix. In my case it was appealing because we&#8217;re already running two different monitoring agents on the Exchange servers where we have Neverfail installed.  Since I only wanted to use Zabbix to monitor a small set of data related specifically to Neverfail, zabbix_sender lets me do that without having to run the full-blown zabbix_agent as a service on the boxes.</p>
<p><span id="more-157"></span></p>
<h2>Getting the Data</h2>
<p>Neverfail (at least the versions we have installed) doesn&#8217;t obviously expose performance data. However if you look in the registry on each Neverfail server you will find some registry values (see <strong><em>HKLM\Software\Neverfail\R2\Performance</em></strong>) that get updated on a regular and frequent basis and that correspond to data presented in the Neverfail GUI. Because of the way Neverfail works some of this data (Unsafe Queue info) is available on the Active node and some of it (Safe Queue info) is in the registry on the Passive node. This presents a couple of issues when trying to put together the solution (at least in my environment).</p>
<p>The first of these is trying to find a single consistent way to get the data out of the registry, especially since all the counters involved are of the REG_DWORD_BIG_ENDIAN variety (you can see a <a href="http://cars.lostroncos.org/2009/03/09/big_endian-registry-values/">previous entry related to BIG_ENDIAN here</a>).  I ended up settling on using the Reg.exe util available in Windows.  This utility lets you manipulate the registry locally and remotely. While it doesn&#8217;t necessarily deal happily with REG_DWORD_BIG_ENDIAN (RDBE) entries in the registry, it is able to extract the data which we can then manipulate to get the correct values.</p>
<p>As an example, suppose I have the following two values in the registry, as shown by RegEdit:</p>
<p><a href="http://cars.lostroncos.org/wp-content/uploads/2009/05/reg_example_01.png"><img class="alignnone size-full wp-image-159" title="reg_example_01" src="http://cars.lostroncos.org/wp-content/uploads/2009/05/reg_example_01.png" alt="reg_example_01" width="462" height="166" /></a></p>
<p>When I run <strong><em>reg.exe</em></strong> I get the following output&#8230;</p>
<p><a href="http://cars.lostroncos.org/wp-content/uploads/2009/05/reg_query_rdword_rdbe.png"><img class="alignnone size-full wp-image-160" title="reg_query_rdword_rdbe" src="http://cars.lostroncos.org/wp-content/uploads/2009/05/reg_query_rdword_rdbe.png" alt="reg_query_rdword_rdbe" width="576" height="129" /></a></p>
<p>So while Dword_example and DWORD_BE_Example nominally have the same value <strong><em>reg.exe</em></strong> doesn&#8217;t get the data out correctly for the latter. However as I said earlier once we have the data out we can actually do some magic to get the right value.</p>
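<p>The &#8220;magic&#8221; is essentially a byte swap. The Python sketch below is only an illustration of that idea, not the VBScript used in these posts; it assumes the raw value comes back as a 32-bit number whose four bytes are simply in the opposite order, and the function name is hypothetical.</p>

```python
def swap_dword(value):
    """Reverse the byte order of a 32-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


# 0x01000000 read with the wrong byte order is really 0x00000001.
print(hex(swap_dword(0x01000000)))  # 0x1
```

<p>Swapping twice returns the original number, which makes for a handy sanity check when testing a conversion routine.</p>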
<p>We can also use <strong><em>reg.exe</em></strong> to get values on a remote machine (i.e. our Passive Neverfail node) by prepending the host info to the registry path in the query. So in this case, to reach the passive secondary node over the private channel at 10.0.0.2, I can do something like  reg.exe Query \\10.0.0.2\HKLM\Software\CRT_CORP\Performance. Testing this out leads us to a second issue: an &#8220;Access is denied&#8221; error.</p>
<p><a href="http://cars.lostroncos.org/wp-content/uploads/2009/05/reg_error_access_denied.png"><img class="alignnone size-full wp-image-158" title="reg_error_access_denied" src="http://cars.lostroncos.org/wp-content/uploads/2009/05/reg_error_access_denied.png" alt="reg_error_access_denied" width="592" height="176" /></a></p>
<p>Since my passive Neverfail node is essentially off-net but still thinks the network cable is live, I can&#8217;t use a domain-based account to run the reg.exe command because it can&#8217;t contact a domain controller to authenticate my domain account. However if I use the local Administrator account, which has a common password on both nodes, I can get this to work just fine. (It may be possible to use an account other than the local Administrator, but in my case, where I also run some Neverfail command lines, I need an account that&#8217;s authorized in Neverfail.)</p>
<p><a href="http://cars.lostroncos.org/wp-content/uploads/2009/05/reg_remote_as_admin.png"><img class="alignnone size-full wp-image-161" title="reg_remote_as_admin" src="http://cars.lostroncos.org/wp-content/uploads/2009/05/reg_remote_as_admin.png" alt="reg_remote_as_admin" width="528" height="188" /></a></p>
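<p>Building the remote form of a registry path is just a matter of putting the host in front of the key. As a minimal Python sketch (the function names are hypothetical, and this only constructs the reg.exe arguments rather than running anything):</p>

```python
def remote_reg_path(host, key_path):
    """Prefix a registry key path with two backslashes, the host, and a backslash."""
    return "\\\\" + host + "\\" + key_path


def reg_query_args(key_path, value_name=None):
    """Build the argument list for 'reg.exe query', optionally for a single value."""
    args = ["reg.exe", "query", key_path]
    if value_name is not None:
        args += ["/v", value_name]
    return args


path = remote_reg_path("10.0.0.2", r"HKLM\Software\Neverfail\R2\Performance")
print(reg_query_args(path, "SafeUpdateQueueSize"))
```

<p>The account running the command still has to be accepted by the remote node, which is why the local Administrator account with a matching password matters here.</p>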
<p>Given this info I was able to put together a <a href="http://cars.lostroncos.org/?attachment_id=172">VBScript that takes two arguments</a>: a reg path and a value name. It returns the data value to the console, converting REG_DWORD and REG_DWORD_BIG_ENDIAN to the correct decimal value. Using <a href="http://cars.lostroncos.org/wp-content/uploads/2009/06/getregvaluevbs.txt">this script</a> it&#8217;s then possible to get any of the counters we&#8217;re interested in on either the active or passive node.  So based on the example above where I ran <em>reg.exe query hklm\software\CRT_CORP\Performance /s</em> we can run the script for each of the values and see that we do in fact get the right decimal value for each one.</p>
<p><a href="http://cars.lostroncos.org/wp-content/uploads/2009/06/getregvalue_example_01.png"><img class="alignnone size-full wp-image-176" title="getregvalue_example_01" src="http://cars.lostroncos.org/wp-content/uploads/2009/06/getregvalue_example_01.png" alt="getregvalue_example_01" width="702" height="213" /></a></p>
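<p>To make the byte-order conversion concrete, here is a minimal Python sketch of the idea. It is not the GetRegValue.vbs script itself: the function name is hypothetical, and it assumes the raw four bytes of the value are already in hand, in the order the registry stores them.</p>

```python
# Sketch only: not the author's GetRegValue.vbs. Assumes the raw 4 bytes of
# the value are already available, in the order the registry stores them.

def dword_to_int(raw, reg_type):
    """Convert 4 raw registry bytes to an unsigned decimal integer."""
    if len(raw) != 4:
        raise ValueError("a DWORD is exactly 4 bytes")
    if reg_type == "REG_DWORD":
        # REG_DWORD is stored least-significant byte first.
        return int.from_bytes(raw, "little")
    if reg_type == "REG_DWORD_BIG_ENDIAN":
        # REG_DWORD_BIG_ENDIAN is stored most-significant byte first.
        return int.from_bytes(raw, "big")
    raise ValueError("unsupported registry type: " + reg_type)


# The same four bytes decode to different numbers depending on the type.
sample = bytes.fromhex("0000012c")
print(dword_to_int(sample, "REG_DWORD_BIG_ENDIAN"))  # 300
print(dword_to_int(sample, "REG_DWORD"))             # 738263040
```

<p>Reading a big-endian value as if it were an ordinary REG_DWORD gives a wildly wrong number, which is why the type has to be checked before converting.</p>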
<p>So now the trick is to figure out which of the registry-based perf values we want to use and which host we need to draw them from. Each of the Neverfail nodes has the same set of values present, even though they&#8217;re not all populated the same way. That is to say, the counters related to the Safe Queue are not updated on the active node, since the Safe Queue exists on the passive node, and the converse is true for the UnsafeQueue counters. As I was mostly interested in alerting on an issue that occurs occasionally, I really wanted to get the SafeQueue and UnsafeQueue related counters (OldestSafeUpdateQueueEntry, SafeUpdateQueueSize, etc.). But since the other counters are equally easy to get, I decided to include several more. The image below shows the available values.</p>
<p><a href="http://cars.lostroncos.org/wp-content/uploads/2009/05/nf_perf_reg_values.png"><img class="alignnone size-full wp-image-164" title="nf_perf_reg_values" src="http://cars.lostroncos.org/wp-content/uploads/2009/05/nf_perf_reg_values.png" alt="nf_perf_reg_values" width="481" height="337" /></a></p>
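<p>The rule for which node to poll can be written down as a small lookup. The Python sketch below is only an illustration: the function name is made up, the UnsafeQueue counter name in the example is a hypothetical stand-in (only the two Safe Queue names come from this post), and the "either" fallback for non-queue counters is an assumption.</p>

```python
# Sketch only. Apart from the two Safe Queue names quoted in the post, the
# counter names used here are hypothetical stand-ins.

def node_for_counter(counter_name):
    """Return which Neverfail node holds live data for a counter."""
    lowered = counter_name.lower()
    # Check "unsafe" first, because the word "unsafe" also contains "safe".
    if "unsafe" in lowered:
        return "active"   # the UnsafeQueue counters update on the active node
    if "safe" in lowered:
        return "passive"  # the Safe Queue exists on the passive node
    # Assumption: counters not tied to a queue can be read from either node.
    return "either"


for name in ("OldestSafeUpdateQueueEntry",
             "SafeUpdateQueueSize",
             "UnsafeUpdateQueueSize"):  # hypothetical name
    print(name, node_for_counter(name))
```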
<p>So now that I have a simple way of getting the information I want, I can focus on how to get it into whatever system I want to monitor with, whether it&#8217;s Zabbix (now) or System Center Operations Manager 2007 (later). In the next article(s) I&#8217;ll talk about setting up the Zabbix part of this monitoring.</p>
<p><strong><em>Acknowledgement: The hex to decimal routine in the GetRegValue.vbs script is lifted directly from </em></strong><strong><em><a href="http://www.sonofsofaman.com/hobbies/code/hextodec.asp">http://www.sonofsofaman.com/hobbies/code/hextodec.asp</a> Thanks to Joel for keeping me from having to reinvent the wheel. -crt</em></strong></p>
<p><strong>Addendum</strong>: While traipsing through the registry figuring this stuff out, I also discovered that there&#8217;s a bunch of configuration information stored in a whole different key under HKLM\Software\Javasoft\Prefs\neverfail\current\*. It&#8217;s also possible to watch a few entries here to help monitor the file and registry synchronization status, even though the information is not as granular, descriptive, or timely as what can be obtained from the command line.</p>
<p>The two items I&#8217;ve found that might be of interest are the <strong>/Registry/State/Manager\/Status</strong> key and the <strong>/Value</strong> entry</p>
<p><a href="http://cars.lostroncos.org/wp-content/uploads/2009/06/reg_java_prefs_reg.png"><img class="alignnone size-full wp-image-183" title="reg_java_prefs_reg" src="http://cars.lostroncos.org/wp-content/uploads/2009/06/reg_java_prefs_reg.png" alt="reg_java_prefs_reg" width="725" height="212" /></a></p>
<p>and the <strong>/New/File/State/Mgr\/Synchronization/Status</strong> key and <strong>/Tag</strong> entry.</p>
<p><a href="http://cars.lostroncos.org/wp-content/uploads/2009/06/reg_java_prefs_file.png"><img class="alignnone size-full wp-image-182" title="reg_java_prefs_file" src="http://cars.lostroncos.org/wp-content/uploads/2009/06/reg_java_prefs_file.png" alt="reg_java_prefs_file" width="713" height="221" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://cars.lostroncos.org/2009/05/31/simple-monitoring-of-neverfail-with-zabbix-part-1/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A Simple Replay Report</title>
		<link>http://cars.lostroncos.org/2009/04/30/a-simple-replay-report/</link>
		<comments>http://cars.lostroncos.org/2009/04/30/a-simple-replay-report/#comments</comments>
		<pubDate>Fri, 01 May 2009 06:08:41 +0000</pubDate>
		<dc:creator>cars</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[powershell]]></category>
		<category><![CDATA[Replay]]></category>

		<guid isPermaLink="false">http://cars.lostroncos.org/?p=122</guid>
		<description><![CDATA[<p>Where I work we use AppAssure&#8217;s Replay product to back up some of our Exchange servers.  Because the servers in question are very geographically dispersed we have multiple servers running Replay.  Monitoring and keeping an eye on them to assure backups are happening properly was requiring more time than I wanted to spend because we [...]]]></description>
			<content:encoded><![CDATA[<p>Where I work we use AppAssure&#8217;s Replay product to back up some of our Exchange servers. Because the servers in question are very geographically dispersed, we have multiple servers running Replay. Keeping an eye on them to ensure backups were happening properly was taking more time than I wanted to spend, because we had different versions of Replay running in the environment. I ended up having to RDP to multiple machines on a regular basis to ensure things were going smoothly.</p>
<p>In poking around the install directory I came across the <a href="https://support.appassure.com/ics/support/KBAnswer.asp?questionID=119" target="_blank">Replayc.exe command</a>. Replayc is a command-line utility that offers information about the Replay server and a way to manually mount and dismount Recovery Points (RPs). After playing with it a little, and being the very lazy person that I am, I decided to write a PowerShell script to give me a high-level status overview of my servers. The script runs on each server at about the same time (relative to me here in Oregon) every day and emails me the output. So instead of having to muck around in the console, I only have to spend a few seconds on each server to make sure everything&#8217;s running properly.</p>
<p>The <a href="http://cars.lostroncos.org/?attachment_id=145">script is available here</a> and needs to be renamed appropriately.</p>
<p>When the script runs, the HTML-formatted email I get looks like the one below. It tells me a number of things:</p>
<ul>
<li>The status of the Replay Server (running/not running)</li>
<li>The name of the server that&#8217;s being protected</li>
<li>How much disk space is available and being used for RPs for that protected server</li>
<li>The size of the disk where those RPs are being stored</li>
<li>The number of valid and invalid RPs</li>
<li>The timespan between the first and last valid RP</li>
<li>The last time an RP occurred</li>
</ul>
<p>Example Email:</p>
<p style="padding-left: 60px;">Starting Script at 04/30/2009 23:20:12</p>
<p style="padding-left: 60px;">Replay Service is running</p>
<p style="padding-left: 60px;">Server <strong><em>mailserver.company.com</em></strong> snapshots are being stored on R: and currently using 818.54GB. This is 99.98% of the used space(818.68GB) on the volume which is 1,360.22GB</p>
<p style="padding-left: 60px;">The drive currently has 39.81% free space (e.g. 541.54GB)</p>
<p style="padding-left: 60px;">Number of reported Recovery Points is 395 of these 395 are valid, and 0 are invalid (100.00%).<br />
The valid RPs span 23.98 days</p>
<p style="padding-left: 60px;">The most recent valid RP was taken 1 Minutes ago</p>
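<p>The disk figures in the example email are simple ratios. The Python sketch below reproduces that arithmetic from the three sizes in the sample. It is not the PowerShell script itself, and the function and parameter names are hypothetical.</p>

```python
# Sketch only: reproduces the arithmetic behind the sample email, not the
# author's PowerShell script. All names here are hypothetical.

def usage_summary(rp_gb, used_gb, volume_gb):
    """Return (RP share of used space %, free GB, free space %)."""
    pct_of_used = rp_gb / used_gb * 100   # how much of the used space is RPs
    free_gb = volume_gb - used_gb         # space left on the volume
    pct_free = free_gb / volume_gb * 100  # free space as a share of the volume
    return round(pct_of_used, 2), round(free_gb, 2), round(pct_free, 2)


# Sizes taken from the example email above (all in GB).
print(usage_summary(818.54, 818.68, 1360.22))  # (99.98, 541.54, 39.81)
```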
]]></content:encoded>
			<wfw:commentRss>http://cars.lostroncos.org/2009/04/30/a-simple-replay-report/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
