A while back I posted about a Replay report that I wrote to help me monitor the multiple Replay servers we have deployed globally. It was a good first effort and was useful, but having to engage my brain first thing in the morning to read (and, more importantly, actually comprehend) the emailed reports before my second cup of coffee was less than ideal.
The original idea behind generating the report was to have the info come to me rather than logging into multiple servers and firing up the console multiple times (what can I say, I’m lazy).
The report in the first version of the script was straightforward text. Recently I’ve been looking into different ways to present the information in the report so I could just glance at it and get the status. The disk-related portion of the report wasn’t initially where I was focusing my attention. I was more interested in being able to get a quick idea of where we stood with the number of Recovery Points (RPs) we were expecting to have. An example of one of the simple reports is below. From this we can see that we’re in pretty good shape with 100% valid RPs spanning about 24 days.
Starting Script at 04/30/2009 23:20:12
Replay Service is running
Server mailserver.company.com snapshots are being stored on R: and currently using 818.54GB. This is 99.98% of the used space(818.68GB) on the volume which is 1,360.22GB
The drive currently has 39.81% free space (e.g. 541.54GB)
Number of reported Recovery Points is 395 of these 395 are valid, and 0 are invalid (100.00%).
The valid RPs span 23.98 days
The most recent valid RP was taken 1 Minutes ago
The issue becomes less clear when invalid RPs occur for whatever reason. If I have 395 RPs and only 250 of them are valid, is that a good or bad state? It’s not immediately clear, but one can log in to the Replay server and get a better idea of how things stand. It might be the case that there was a network issue during the day and, instead of 96 RPs (that’s an RP every 15 minutes * 24 hrs) for each of the last three days, we’ve only gotten 40 RPs on each of those days, which while less than ideal might still be an okay state. Or it could be that there are several days for which we don’t have any RPs.
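To make the distinction between those two situations concrete, here is a minimal Python sketch (not part of the actual script, which is PowerShell). The function name `summarize_days` and the sample counts are hypothetical; the only figures taken from the post are the 15-minute interval and the 40-RPs-per-day example.

```python
def summarize_days(expected, actual):
    """Label each day 'ok', 'short' (some RPs missing) or 'missing' (no RPs at all)."""
    labels = []
    for want, got in zip(expected, actual):
        if got == 0 and want > 0:
            labels.append("missing")
        elif got < want:
            labels.append("short")
        else:
            labels.append("ok")
    return labels


# One RP every 15 minutes is 4 per hour, so 4 * 24 = 96 per day.
per_day = (60 // 15) * 24

expected = [per_day] * 3
network_blip = [40, 40, 40]  # fewer RPs than hoped for, but every day is covered
real_gap = [96, 0, 0]        # two days with nothing at all (made-up numbers)

print(summarize_days(expected, network_blip))
print(summarize_days(expected, real_gap))
```

A single "valid vs. total" number can't tell these two cases apart, which is why a per-day view is more useful.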
I was trying to think of a way to visualize this information. Because of the retention schedule, some days we’d expect a large number of RPs (~90) and other days we’d expect to have just one. I looked into the possibility of using sparklines, even going so far as to download a C#-based version of a PHP-based sparkline web service from Joe Gregorio.
I tried several different iterations of the script using sparklines, trying to use the data I had in different ways (ex: percentages of expected RPs, diffs between expected and actual), but wasn’t able to find a good way to represent the state using those. In digging around I came across the Google Chart API and looked at different ways of using bar graphs to represent the info I wanted: either side-by-side bar graphs,
or overlapping ones with green and red, where a lot of red would be a bad thing.
Either of these would have been an improvement over what I was getting in the text based report. While trying to refine the overlaid version I came across an example in the Google documentation on line styles of this graph:
This caught my eye as a possible solution to my problem about how to present this data because of the ability to show both sets of data overlaid on each other while still managing to keep both of them visible.
In my particular scenario the retention schedule is:
- RPs every 15 minutes which are kept for 4 days
- These roll up to hourly RPs which are kept for 5 days
- Hourlies roll up to dailies which are kept for ~25 days
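As a rough Python sketch of how a schedule like this turns into a list of expected RPs per day (newest day first): it assumes the three retention periods sit end to end, which is one reading of the schedule and may not match exactly how Replay ages RPs. The `expected_rp_counts` helper and the `tiers` layout are hypothetical, not taken from the script.

```python
def expected_rp_counts(tiers):
    """Expand (rps_per_day, days_kept) tiers into one expected count per day."""
    counts = []
    for rps_per_day, days in tiers:
        counts.extend([rps_per_day] * days)
    return counts


tiers = [
    (96, 4),   # every 15 minutes -> 96 per day, kept for 4 days
    (24, 5),   # hourly -> 24 per day, kept for 5 days
    (1, 25),   # daily -> 1 per day, kept for roughly 25 days
]

expected = expected_rp_counts(tiers)
print(len(expected), "days,", sum(expected), "RPs expected in total")
```

A list of this shape is the kind of thing the script's $ExpectedRPCount variable is meant to hold.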
Our goal is to keep about 30 consecutive days worth of RPs on hand. When plotting out the # of expected RPs per day we get a graph that looks like the one below.
As one can see the number of Recovery Points per day decreases over time. When adding the line showing the number of actual RPs it can be hard to tell what the status is for the days where there’s only one RP per day. If things are going well the green bars will be obscured by the red line.
In the rare instance where we might be missing a few daily RPs the green bars do become somewhat visible as shown in the blue box below.
I experimented with a couple of different ways to make this more obvious, including altering the width and height of the chart (see below). Using the Google Chart API one is limited to an image of 300,000 pixels (e.g. 500×600).
But for me personally, making the chart a lot bigger like this didn’t really add all that much to being able to see what was going on. So I stuck with 600×250 for the graph.
It should also be noted that in the case where you aren’t taking snapshots every 15 minutes but every 30 minutes or maybe even every hour, it becomes easier to see missed daily RPs. Here’s an example:
After going through all of this with the RPs I went back, almost as an afterthought, and added the logic to graph the disk usage data as well. It shows the size of the Replay archive data, the free space and other used space on the drive by generating something like this:
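For anyone curious what a request to the chart API looks like, here is a hedged Python sketch that checks the pixel limit and builds a URL. It draws plain side-by-side bars (`cht=bvg`), not the bar-plus-line overlay the report ended up with, and the base URL is an assumption about the endpoint in use at the time. The helper names and sample data are hypothetical.

```python
from urllib.parse import urlencode

MAX_PIXELS = 300000  # the image size limit mentioned above


def chart_size_ok(width, height):
    """True if a width x height image fits within the pixel limit."""
    return width * height <= MAX_PIXELS


def chart_url(expected, actual, width=600, height=250,
              base="http://chart.apis.google.com/chart"):  # assumed endpoint
    """Build a grouped bar chart URL: expected in green, actual in red."""
    if not chart_size_ok(width, height):
        raise ValueError("chart is larger than %d pixels" % MAX_PIXELS)
    top = max(expected + actual) or 1  # avoid a 0-to-0 scale
    params = {
        "cht": "bvg",                      # vertical bars, grouped
        "chs": "%dx%d" % (width, height),  # image size
        "chd": "t:" + ",".join(map(str, expected))
               + "|" + ",".join(map(str, actual)),
        "chds": "0,%d" % top,              # scale the data to its maximum
        "chco": "00AA00,FF0000",           # one color per series
    }
    return base + "?" + urlencode(params, safe=":,|")


print(chart_url([96, 24, 1], [96, 20, 0]))
```

Both 600×250 and 500×600 pass the size check; anything over 300,000 pixels raises an error before a URL is built.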
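As a rough illustration of the arithmetic behind that disk graph, the Python sketch below splits a volume into the three segments using the figures from the sample text report earlier in the post. The `disk_segments` helper is hypothetical and is not how the PowerShell script is written.

```python
def disk_segments(archive_gb, used_gb, total_gb):
    """Return archive, other-used and free space as percentages of the volume."""
    parts = {
        "archive": archive_gb,
        "other": used_gb - archive_gb,  # used space that isn't Replay data
        "free": total_gb - used_gb,
    }
    return {name: round(100 * size / total_gb, 2) for name, size in parts.items()}


# Sizes in GB from the sample report: archive, total used, volume size.
segments = disk_segments(818.54, 818.68, 1360.22)
for name, pct in segments.items():
    print(name, pct)
```

The free-space figure this produces matches the 39.81% shown in the text report.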
Here’s a “real-life” example of the whole report.
The script is available here. If it’s of any use to you please drop me a line and let me know.
To use it rename it to something like ReplayReport.ps1. You’ll need to modify the variables at the beginning of the file:
- $ReportRecipients – array of recipient email addresses.
- $MailServer – The SMTP server to use to send the report out
- $ReportSender – Address the email should appear to come from.
- $replay_exe – Path to the Replayc.exe file. May differ on x64 vs x86 servers.
- $ExpectedRPCount – array containing the number of expected RPs for the last X days. Used to generate the graph of expected vs present RPs. (See Note below)
The script doesn’t take any arguments to run. I run it via a scheduled task on the replay server.
In reference to $ExpectedRPCount the time of day that the report is run will affect how the first data point on the graph appears. RPs are tracked by the date that they were taken. If you run the script just before midnight there will obviously be a lot more RPs for “today” than if you run it at 1 am.
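To put rough numbers on that, here is a small Python sketch that estimates how many RPs "today" should have by the time the report runs. It assumes one RP every 15 minutes and ignores whether a snapshot taken exactly at midnight is counted. The function name `rps_expected_so_far` is hypothetical.

```python
from datetime import datetime


def rps_expected_so_far(now, interval_minutes=15):
    """Estimate how many RPs should exist for the current calendar day."""
    minutes_since_midnight = now.hour * 60 + now.minute
    return minutes_since_midnight // interval_minutes


late_run = rps_expected_so_far(datetime(2009, 4, 30, 23, 20))
early_run = rps_expected_so_far(datetime(2009, 5, 1, 1, 0))
print(late_run, early_run)
```

With a run at 23:20 the first entry in $ExpectedRPCount should be close to a full day's worth, while a 1 am run would want a much smaller first value.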