No more insane clicking in ESXi – set up your testbed in a minute

Hello all!

Very often, I find myself in the situation where I have to quickly set up a test environment. In my case that usually means that I will quickly set up:

  • A VMware ESXi (nowadays vSphere Hypervisor) virtual machine
  • 1GB RAM
  • 16GB Disk divided into /, /boot, swap
  • Ubuntu Server, 64 bit, 12.04.2 LTS

You might have your own favorite setup; this is mine. The goodies on top of a minimal server installation are:

  • vi as the default editor
  • luxury items like ksh, zsh
  • necessary tools like pv, sysstat, open-vm-tools

I’ve done this so many times by now, that I just cannot bear the thought of doing it again.

Why? Well, doing it by hand is summarised by the following:

  1. Connect to my office over VPN
  2. Startup my Windows VM (I am a Mac owner; live with it)
  3. Startup the vCenter Client
  4. Right click the proper resource group -> New virtual machine…
  5. Typical
  6. Give the VM a decent name
  7. Choose datastore
  8. Linux/Ubuntu 64-bit
  9. Network -> VM Network/VMXNET 3
  10. 16GB disk
  11. Modify VM properties
  12. Choose the CD -> ISO file -> browse, browse (got my own quick install, modified Ubuntu ISO)
  13. Connect on power-on
  14. Power on

I mean, that is 14 steps that I could live without. So, I started looking at ways to do this from a terminal window.

For this story, you need to know that in my office environment I’ve got the following setup:

  • synology02 – nfs/cifs files, a Synology DS1511+
    • Sharing a handful of nfs filesystems
  • esxi01 – My ESXi 5.something server which hosts all my VMs
    • 2 resource groups – NFSDev, NFSProd
    • 2 data stores: NFSDev and NFSProd – nfs mounted datastores located on the synology02
  • guran – my “central” server for more or less all and nothing

In my environment, I can browse my datastores when logged into “guran”, since all my virtual machines are located in the datastores on NFS. Look here:

malu@kmg-guran-0001:/mnt/synology02/files/vmware/datastores $ls -la
total 24
drwxr-xr-x 6 malu malu 4096 2012-10-02 12:49 .
drwxr-xr-t 9 malu malu 4096 2012-12-05 14:58 ..
drwxr-xr-x 19 malu malu 4096 2013-02-19 20:38 dev
drwxr-xr-x 8 malu malu 4096 2013-02-04 18:42 prd

malu@kmg-guran-0001:/mnt/synology02/files/vmware/datastores $ls -la dev
total 76
drwxr-xr-x 19 malu malu 4096 2013-02-19 20:38 .
drwxr-xr-x 6 malu malu 4096 2012-10-02 12:49 ..
drwxr-xr-x 2 root root 4096 2013-02-04 17:27 jira-v001fry
drwxr-xr-x 2 root root 4096 2013-01-04 11:11 kmg-buildbox-0001
drwxr-xr-x 2 malu malu 4096 2012-11-24 15:12 kmg-op5-0003
drwxr-xr-x 2 root root 4096 2013-01-04 11:10 kmg-op5-0004
drwxr-xr-x 2 root root 4096 2012-12-09 21:56 kmg-sandbox-0005
drwxr-xr-x 2 root root 4096 2012-12-09 19:49 kmg-sandbox-0005.save
drwxr-xr-x 2 root root 4096 2013-01-04 11:11 kmg-web-0001
drwxr-xr-x 2 root root 4096 2013-01-04 11:11 kmg-web-0002
drwxr-xr-x 2 root root 4096 2012-12-28 13:24 kmg-zenLoadbalancer-0001
drwxr-xr-x 2 malu malu 4096 2013-02-07 14:27 nexenta-v001test
drwxr-xr-x 2 root root 4096 2013-01-16 16:24 op5-v001test
drwxr-xr-x 2 root root 4096 2013-01-04 11:10 openstack-v001fry

This is very useful, I must say. The goal for me is to be able to create new virtual machines without even thinking of starting up my Windows machine. I accomplished this by doing the following:

  1. I created a template.vmx file with the size of the VM I needed (which mounts my specially adapted Ubuntu ISO)
  2. Replaced all references to the name of the VM (in my case the hostname of the system) with the unique string “XXX_HOST_NAME_XXX”
  3. Figured out how to use this template properly

The basic recipe in my environment is:

  1. Create a new directory in the NFSDev datastore with the same name as the hostname of the new system
  2. Create a new _host_name_.vmx file from the template
  3. Create a new vmdk for the VM
  4. Register the VM in my one and only hypervisor/ESXi
  5. Startup the VM
    1. If there already is a VM with the same uuid/mac address, tell ESXi that I copied the VM

 

My template.vmx file looks like this:

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "8"
pciBridge0.present = "TRUE"
pciBridge4.present = "TRUE"
pciBridge4.virtualDev = "pcieRootPort"
pciBridge4.functions = "8"
pciBridge5.present = "TRUE"
pciBridge5.virtualDev = "pcieRootPort"
pciBridge5.functions = "8"
pciBridge6.present = "TRUE"
pciBridge6.virtualDev = "pcieRootPort"
pciBridge6.functions = "8"
pciBridge7.present = "TRUE"
pciBridge7.virtualDev = "pcieRootPort"
pciBridge7.functions = "8"
vmci0.present = "TRUE"
hpet0.present = "TRUE"
nvram = "XXX_HOST_NAME_XXX.nvram"
virtualHW.productCompatibility = "hosted"
powerType.powerOff = "default"
powerType.powerOn = "hard"
powerType.suspend = "default"
powerType.reset = "default"
displayName = "XXX_HOST_NAME_XXX"
extendedConfigFile = "XXX_HOST_NAME_XXX.vmxf"
floppy0.present = "TRUE"
scsi0.present = "TRUE"
scsi0.sharedBus = "none"
scsi0.virtualDev = "lsilogic"
memsize = "1024"
scsi0:0.present = "TRUE"
scsi0:0.fileName = "XXX_HOST_NAME_XXX.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"
ide1:0.present = "TRUE"
ide1:0.fileName = "/vmfs/volumes/c262ee3b-00d1a1ed/images/kmg-ubuntu-12.04.2.LTS.iso"
ide1:0.deviceType = "cdrom-image"
floppy0.startConnected = "FALSE"
floppy0.fileName = ""
floppy0.clientDevice = "TRUE"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet0.networkName = "VM Network"
ethernet0.addressType = "generated"
chipset.onlineStandby = "FALSE"
guestOS = "ubuntu-64"
uuid.location = "56 4d 5d 9a c5 dc 8e a1-45 76 3d 90 34 83 82 d1"
uuid.bios = "56 4d 5d 9a c5 dc 8e a1-45 76 3d 90 34 83 82 d1"
vc.uuid = "52 75 89 2c 80 59 17 93-b9 0b 33 49 04 8c c8 a3"
snapshot.action = "keep"
sched.cpu.min = "0"
sched.cpu.units = "mhz"
sched.cpu.shares = "normal"
sched.mem.min = "0"
sched.mem.shares = "normal"

Notice all the occurrences of “XXX_HOST_NAME_XXX”? I picked that string since it is very unlikely to be used by VMware, and it is easy to replace using “sed”.

To make things a bit easier, I first set up my ESXi host to accept my public key, so I can log in as root:

ssh root@myESXiHost

vi /etc/ssh/keys-root/authorized_keys

After this, the recipe is easy:

newHostName=testbox-v003fry
templateDir=/mnt/synology02/files/vmware/datastores/dev/template
datastoreDir=/mnt/synology02/files/vmware/datastores/dev
cd $datastoreDir
mkdir $newHostName
cat $templateDir/template.vmx | sed -e 's/XXX_HOST_NAME_XXX/'$newHostName'/' > $datastoreDir/$newHostName/$newHostName.vmx
ssh root@192.168.2.204 vmkfstools -c 16g /vmfs/volumes/NFSDev/$newHostName/$newHostName.vmdk -a lsilogic
vmID=$(ssh root@192.168.2.204 vim-cmd solo/registervm /vmfs/volumes/NFSDev/$newHostName/$newHostName.vmx $newHostName pool0)
#-- turn on VM
ssh root@192.168.2.204 vim-cmd vmsvc/power.on $vmID &
sleep 1
#-- check there is a message and choose 2 (default, moved it)
[ -z "`ssh root@192.168.2.204 vim-cmd vmsvc/message $vmID _vmx1 | grep 'No message'`" ] && ssh root@192.168.2.204 vim-cmd vmsvc/message $vmID _vmx1 2

 

The ampersand (&) after the power.on is there because the command will hang if there already is (most likely) a VM with the same uuid (defined in the vmx file) in the system. After this we sleep for one second, since there will be no message in the message buffer until the ESXi host realizes that there is a conflict. Then I check whether there is a message; if there is, I pick it up and tell ESXi to use the default alternative (2 – I copied the VM).

P.S. I found the “pool0” reference to my NFSDev resource pool by browsing this page: http://communities.vmware.com/message/1114467

cat /etc/vmware/hostd/pools.xml | grep "YOUR-RESOURCE-POOL-NAME" -A1 | grep "<objID>" | sed 's/<objID>//;s/<\/objID>//g' | sed -e 's/^[[:blank:]]*//;s/[[:blank:]]*$//'

Done. All for today.

Alias is silver, function() is gold!

Everyone who has reached the first advancement in any meaningful martial art knows that in order to get that first level you need to learn some very basic, yet useful, techniques.

Being a unix admin is the same. The more tools you know, and the more you know how to put them together, the fancier the belt to keep your trousers up will be.

I sometimes claim to be a black belt unix admin; or at least I was once upon a time. Some people might disagree, but after almost 20 years of hacking around, I think I know my way around. I have Niklas to thank for my yellow belt, since he lent me a Sun SPARCstation 2 so that I could cram in the basics of Solaris after accepting a job offer. It helped, and afterwards I practiced a lot to get better.

Now over to this post.

Everyone who claims to know anything about UNIX/Linux/OSX knows about alias. It is very useful to simplify commands you use a lot. I often use it to manage ssh connections for customers.

For example, in my .bash_profile, I have an alias for each customer I am working for at the moment:

alias set_custA=". ~/.aliases.custA"

alias set_custB=". ~/.aliases.custB"

alias set_kmg=". ~/.aliases.kmggroup"

Every time I start a new terminal window, if I would like to work for, say, my own company, I type set_kmg, which sources the alias file ~/.aliases.kmggroup. That file could look like this:

alias op5-v001fry="ssh malu@192.168.2.34"

alias guran="ssh malu@192.168.2.37"

This is decently OK, but there is one drawback: aliases do not accept any parameters. For example, I could not do:

guran ls -la

At least not if I expect it to resolve to “ssh malu@192.168.2.37 ls -la”. And I often wish to do so, or something more complex. But this will serve as a good example.

So, how do I solve this, then? Enter: function()

function() beats alias in any cage match there ever will be. Bash has a wonderful implementation of functions which you can use in all its glory. I will use guran again to show how to implement this. Another good example I have is from one customer where I sometimes want to do the same thing on multiple servers, as root. But let’s start with guran. I would put this in my ~/.aliases.kmggroup file, next to the aliases.

function guran(){

ssh malu@192.168.2.37 "$@"

}

Voilà!

Again, if I would like to use my example of the multiple hosts:

function sshAll(){
hosts="100 101 102 103 104"

for host in $hosts
do

echo "Host: $host" 1>&2

ssh -t malu@10.0.0.$host "$@"
done
}

Using this function I could, for example, edit /etc/hosts on these five hosts by issuing “sshAll sudo vi /etc/hosts“.

That’s it, for today!

 

Edit: Fredrik Roubert just mentioned that I should change the functions to “ssh xxx xxx $@” instead of $* to better handle parameters with spaces. Thanks!

Edit2: Fredrik came back to me, pointing out that there is a difference between $@ and “$@”, which there of course is. He was not sure about the implications in bash, since he is a zsh guy, but I made this quick hack as an example:

 

malu@KMG-Hotspot.local:/Users/malu/test $cat functions
function mother(){
echo Parameter 1: $1
echo Parameter 2: $2
echo Parameter 3: $3
echo Parameter 4: $4
}
function function1(){
mother $@
}

function function2(){
mother "$@"
}
malu@KMG-Hotspot.local:/Users/malu $function1 "parameter 1" "parameter 2"
Parameter 1: parameter
Parameter 2: 1
Parameter 3: parameter
Parameter 4: 2
malu@KMG-Hotspot.local:/Users/malu $function2 "parameter 1" "parameter 2"
Parameter 1: parameter 1
Parameter 2: parameter 2
Parameter 3:
Parameter 4:

 

Sawtooth – The power of a waveform!

Triangles are nice. They are robust, the strongest shape of them all. A triangle will also help you spot anomalies in contextually complex situations. Today we will use this shape to make sure that your backups are running properly, as well as showing you one of the amazing capabilities of the human brain; pattern recognition.

This is just one example of how you can use arithmetic on timestamps to get more or less anything under control. Here is a good example of something I am trying to achieve today:

The triangle is simple, you know what to expect from it. And that is the whole point of this blog entry.

Like in any good cooking show, I prepared the dish beforehand. This is an example of what I can see in my OP5 monitoring. At this point it does not matter what the graph shows. Look at it for a few seconds, then answer the following questions:

  1. When did I have a problem with my backups (it did not run)?
  2. When could my monitoring system _not_ get any information from my backup system?

 

You see? You could answer these two questions. If you by any chance could not come up with the answers, you are either tired, or not really the target group of this blog. Without knowing anything about my system, you could easily spot the exceptions in the pattern.

The graph shows the age in seconds of the last successful backup of my file share data. My backup policy is to make a backup (incremental; more about that in a different blog entry) every four hours. But even so, it doesn’t really matter what my backup schedule is. Given the inherent human skill of pattern recognition, the two abnormalities just popped right out at you.

If you didn’t see it, and still find this blog interesting, the answer is: just before midnight on the 31st of whatever month it displays, my monitoring system could not gather this data (empty spot in the graph). And all of a sudden, just before midnight between the 1st and the 2nd, my backups stopped running (or failed; remember, the graph shows the age of the last successful backup).

And now over to the long, interesting explanation on how I got there.

THE SETUP

I am using rsnapshot for my backups. There are several reasons and considerations behind this, but the interesting point is that I really want to know that this is working (disclaimer: this type of monitoring does not guarantee anything), and my implementation outputs logs into /var/log/rsnapshot.log where a successful backup looks like this:

[14/Aug/2012:16:27:54] /usr/bin/rsnapshot hourly: completed successfully

So, basically, since I am interested in the age of the last successful backup, I can simply filter the logfile for this (grep), use the last line of the output (tail -1), and get the timestamp (awk, tr -d "[]").

cur_output=$(grep "successfully" /var/log/rsnapshot.log | tail -1 | awk '{print $1}' | tr -d "[]" )

This, of course, has to be scrubbed a bit, since the output is a timestamp that is not really machine readable. And, it is a timestamp, not an age. I know, my way of doing it is a little bit complicated. But it is the way I learned to do it many years ago, and it is hard to teach an old dog how to sit.

#--- the timestamp is in a really weird format; [14/Aug/2012:16:27:54]
curDate=$(echo $cur_output | awk -F":" '{print $1}')
curTime=$(echo $cur_output | awk -F":" '{print $2":"$3":"$4}')

#--- split the date part into day month year
echo $curDate | awk -F"/" '{print $1, $2, $3}' | read curDay curMonth curYear

#--- get the age of the last successful backup
#--- %s returns the number of seconds since 1.1.1970, epoc
lastBackupTime=$(date -d "$curMonth $curDay $curYear $curTime" "+%s")
nowTime=$(date "+%s")
(( lastOkBackupAge=$nowTime - $lastBackupTime ))

In principle, what I do is convert the time to a unix timestamp (seconds from epoch, 1st of January 1970), then subtract it from the current time. This gives me the number of seconds that have passed since the last successful backup. I do this through OP5/Nagios every 5 minutes, and the backups are supposed to run every 4 hours. In between backups, the output of my script will show an increasing age of the last successful backup, until just after a new backup, where the age is close to 0 seconds.
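For reference, under the assumption of GNU date (which Ubuntu ships), the same age can be computed without splitting the fields by hand, by massaging the timestamp into a form date -d accepts. The hard-coded timestamp below is just for illustration:

```shell
#!/bin/sh
# Sketch assuming GNU date (as shipped with Ubuntu): rewrite
# "14/Aug/2012:16:27:54" as "14 Aug 2012 16:27:54" and let date -d parse it.
ts="14/Aug/2012:16:27:54"           # would normally come from the rsnapshot log
lastBackupTime=$(date -d "$(echo "$ts" | sed 's|/| |g; s|:| |')" "+%s")
lastOkBackupAge=$(( $(date "+%s") - lastBackupTime ))
echo "$lastOkBackupAge"
```

The first sed expression swaps the slashes for spaces; the second replaces only the first colon, which is the one separating the date from the time.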

So, the configuration for the whole setup is done on:

  • OP5 server
  • The backup server

On the OP5 server, checkcommands.cfg:

# command 'kmgBackup'
define command{
command_name kmgBackup
command_line $USER1$/check_nrpe -H kmg-sandbox-0003 -c kmg_backup
}

On the OP5 server, services.cfg:

# service 'Rsnapshot backup'
define service{
use default-service
host_name kmg-sandbox-0003
service_description Rsnapshot backup
check_command kmgBackup
}

On the backup server, /etc/nrpe.d/kmg_commands.cfg (any filename ending with .cfg will do):

malu@kmg-sandbox-0003:/etc/nrpe.d $cat kmg_commands.cfg
#--- kmg backup

command[kmg_backup]=/app/prd/op5/bin/checkBackup

And, at last, the script that checks the backups:

malu@kmg-sandbox-0003:/app/prd/op5/bin $cat /app/prd/op5/bin/checkBackup
#!/bin/ksh

logFile=/var/log/rsnapshot.log
logHost=kmg-sandbox-0003.localdomain

getLastComplete() {

#--- We are looking for lines like this. Only the timestamp of the last one is interesting
#--- [14/Aug/2012:16:27:54] /usr/bin/rsnapshot hourly: completed successfully
cur_output=$(grep "successfully" /var/log/rsnapshot.log | tail -1 | awk '{print $1}' | tr -d "[]" )

#--- the timestamp is in a really weird format; [14/Aug/2012:16:27:54]
curDate=$(echo $cur_output | awk -F":" '{print $1}')
curTime=$(echo $cur_output | awk -F":" '{print $2":"$3":"$4}')

#--- split the date part into day month year
echo $curDate | awk -F"/" '{print $1, $2, $3}' | read curDay curMonth curYear

#--- get the age of the last successful backup
#--- %s returns the number of seconds since 1.1.1970, epoc
lastBackupTime=$(date -d "$curMonth $curDay $curYear $curTime" "+%s")
nowTime=$(date "+%s")
(( lastOkBackupAge=$nowTime - $lastBackupTime ))

#--- echo the age of the last successful backup, in seconds
echo $lastOkBackupAge
}
backupAge=$(getLastComplete)

#--- hard coded crit and warn
# - 18000 seconds is 5 hours
# - 86400 seconds is 24 hours
retMessage="OK"
returnCode=0
[ $backupAge -gt 18000 ] && {
retMessage="WARN"
returnCode=1
}

[ $backupAge -gt 86400 ] && {
retMessage="CRIT"
returnCode=2
}

echo "$retMessage - Backup $backupAge seconds old"
echo "| backupAge=$backupAge"

Web server load balancing on (in the end) Raspberry Pi for less than 100 bucks

So…

The other day I got very interested in web server load balancing.

Again.

One of these things that most people would just never care about, but as always… I got obsessed with it.

Again.

Not so much because the technology is particularly complex (it is), or because you need so much knowledge of low-level IP protocols to understand it well (you do). It was more that someone at work told me that I didn’t understand it. Such remarks alone have made smarter people than me insane; some of them brought elephants over the Alps.

But in an ancient workplace of mine, some 10–12 years ago, I ran into this topic for the first time. I was working at a company called Inserve, and a good old friend, Rober Carlsson, introduced me to the topic. Shortly after, we started working together with a company called RadWare, selling their take on the issue. I can still remember how fascinated I was by their, at the time, so simple way of implementing so-called “triangulation”, where a load balancer would get a TCP packet, change the MAC address in the headers and put it back on the network, so that a back end web server would fetch it and send the reply directly over the router to the client.

I am getting way too detailed for this post already.

The other day, as I said, I realized that I was a bit out of the loop. I had not researched the topic in quite a while, and all those company and product names I was using fluently a few years ago were a bit faint. Extreme Networks, F5, RadWare.

At work, we are using fairly old F5 boxes. They work well, and there are a couple of features I really enjoy. Setting up a load balanced web server environment is fairly straightforward.

  • Set up a virtual host
  • Set up a pool
  • Add back end members to the pool

Done.

And you can add rules to the pool (monitors), which are run against all back end members. If all monitors return OK, the targeted back end server deserves to stay in the configuration.

This is a very neat way of making sure that only correctly configured back end servers are serving web pages to the public.

A simple test can be to check if port 80 is responding. A more complex test would connect, request a web page, and parse the content for some keywords. This way, one can easily set up a web page that represents an end-to-end view of the stack, for example by embedding a function in the web page that connects to the back end database and checks for a certain value. If all goes well, the web page would eventually show “Database connection OK”. A monitor could parse the output for this string, and voilà, the test proves that both the web server and its database connectivity are OK.
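Outside the F5, the same end-to-end idea can be sketched with curl and grep. This is only an illustration of the principle, not F5 monitor syntax; the URL and the marker string are made-up assumptions:

```shell
#!/bin/sh
# Sketch of an end-to-end web check: fetch a page and look for a marker
# string that only appears when the database connection worked.

page_ok() {
    # Succeeds only if the page content contains the marker string
    echo "$1" | grep -q "Database connection OK"
}

check_url() {
    content=$(curl -s --max-time 5 "$1") || { echo "CRITICAL - no answer"; return 2; }
    if page_ok "$content"; then
        echo "OK - end to end check passed"; return 0
    else
        echo "CRITICAL - marker string missing"; return 2
    fi
}

# Example: check_url "http://192.168.2.21/status.html"
```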

I love such features.

But, and there is always a but: a major corporation can afford the big players. A clustered F5 system is expensive, the “brand new BMW” kind of expensive. They have really good products, don’t get me wrong, but it is not for the startup where a free lunch consists of half a sandwich.

So, I started looking around a bit and got fascinated by two products that I looked at a bit more:

  • http://Loadbalancer.org
  • http://zenloadbalancer.org

The first one is decently priced, but commercial. The second one, zenloadbalancer.org, is free as in beer.

I started looking into the zenloadbalancer.org product, and downloaded their ISO image (the product is based on Debian). There was a good video available on Youtube, so I got started in no time at all.

The magic in this sauce is the web frontend that the guys at ZenLoadbalancer deliver. The load balancing itself is based on the open source project Pen, which is fairly heavy to get into. But with the help of the web GUI, I had a working solution up and running in an hour or so.

Basically, my configuration was:

  1. A forward of port 8080 on my firewall to port 80 on my internal IP 192.168.2.49, which would be the virtual IP on my load balancer
  2. The virtual IP on eth0:1 on my load balancer
  3. The two web servers, 192.168.2.21 and 192.168.2.22, which are the two Apache web servers I had configured on the “inside”

I configured a monitor to check my back end web servers (more about this in a separate post). It worked. Not that I had expected anything else.

But then I had a chat with another friend, Karl Gilén, whom I hooked up with a couple of Raspberry Pis. We started talking about a “Raspberry Pi only” clustered web farm: a load balancer, a couple of web servers, perhaps also the MySQL database, all on Raspberry Pi. Well, all is possible, since it is all in the Debian distro. But now it had to be done.

I started thinking of how I could reproduce the load balancing on the Raspberry Pi, and after just a few seconds it came to me: “Just rip it off the zenloadbalancer box!”

Connecting to the zenloadbalancer, I did the following:
root@kmg-zenlb-0001:~# ps -ef | grep pen
root 1045 1 0 13:52 ? 00:00:00 /usr/local/zenloadbalancer/app/pen/bin/pen -S 10 -c 2049 -x 257 -F /usr/local/zenloadbalancer/config/Kalle_pen.cfg -C 127.0.0.1:10448 192.168.2.49:80
root 1055 1 0 13:52 ? 00:00:00 /usr/local/zenloadbalancer/app/pen/bin/pen -S 10 -c 2049 -x 257 -F /usr/local/zenloadbalancer/config/1200kcaltcp_pen.cfg -C 127.0.0.1:12162 192.168.2.48:80
root 27021 12111 0 17:30 pts/0 00:00:00 grep pen
root@kmg-zenlb-0001:~# cat /usr/local/zenloadbalancer/config/Kalle_pen.cfg
# Generated by pen 2012-09-21 13:43:20
# pen -S 10 -c 2049 -x 257 -F '/usr/local/zenloadbalancer/config/Kalle_pen.cfg' -C 127.0.0.1:10448 192.168.2.49:80
no acl 0
no acl 1
no acl 2
no acl 3
no acl 4
no acl 5
no acl 6
no acl 7
no acl 8
no acl 9
acl 9 deny 0.0.0.0 0.0.0.0
no ascii
blacklist 30
no block
client_acl 0
control_acl 0
debug 0
no delayed_forward
no hash
no http
no log
no roundrobin
server 0 acl 0 address 192.168.2.21 port 80 max 0 hard 0
server 1 acl 0 address 192.168.2.22 port 80 max 0 hard 0
server 2 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 3 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 4 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 5 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 6 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 7 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 8 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 9 acl 0 address 0.0.0.0 port 0 max 0 hard 0
no stubborn
timeout 5
tracking 0
no web_stats
no weight
no prio
root@kmg-zenlb-0001:~# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:0c:29:62:9c:e1
inet addr:192.168.2.47 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe62:9ce1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:50817 errors:0 dropped:0 overruns:0 frame:0
TX packets:44155 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:32617453 (31.1 MiB) TX bytes:3929003 (3.7 MiB)

eth0:1 Link encap:Ethernet HWaddr 00:0c:29:62:9c:e1
inet addr:192.168.2.49 Bcast:192.168.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

This could not be hard to reproduce on the Raspberry Pi, no? I brought down the IP address 192.168.2.49 on the zenloadbalancer box and just did the following on my Raspberry Pi.

pi@raspberrypi /tmp $ sudo apt-get install pen
pi@raspberrypi /tmp $ vi /tmp/kalle.cfg
pi@raspberrypi /tmp $ cat /tmp/kalle.cfg
# Generated by pen 2012-09-21 13:43:20
# pen -S 10 -c 2049 -x 257 -F '/usr/local/zenloadbalancer/config/Kalle_pen.cfg' -C 127.0.0.1:10448 192.168.2.49:80
no acl 0
no acl 1
no acl 2
no acl 3
no acl 4
no acl 5
no acl 6
no acl 7
no acl 8
no acl 9
acl 9 deny 0.0.0.0 0.0.0.0
no ascii
blacklist 30
no block
client_acl 0
control_acl 0
debug 0
no delayed_forward
no hash
no http
no log
no roundrobin
server 0 acl 0 address 192.168.2.21 port 80 max 0 hard 0
server 1 acl 0 address 192.168.2.22 port 80 max 0 hard 0
server 2 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 3 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 4 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 5 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 6 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 7 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 8 acl 0 address 0.0.0.0 port 0 max 0 hard 0
server 9 acl 0 address 0.0.0.0 port 0 max 0 hard 0
no stubborn
timeout 5
tracking 0
no web_stats
no weight
no prio
pi@raspberrypi /tmp $ sudo ifconfig eth0:1 192.168.2.49 netmask 255.255.255.0
pi@raspberrypi /tmp $ sudo pen -S 10 -c 2049 -x 257 -F /tmp/kalle.cfg -C 127.0.0.1:10448 192.168.2.49:80

And… It works! I now run my site 1200kcal.com on port 8080 load balanced on a Raspberry Pi. If you are lucky, it is still running (I don’t aim to run this config for very long).

Ask around for how much your fellow web admins paid for their web load balancing (a lot), and compare it to my 35 CHF Raspberry Pi with a 10 CHF power supply. My load balancer maxes out quite easily, though. =)

 

Integrating a traffic light with OP5 through a Raspberry Pi

The last couple of months, I have been amazed about the Raspberry Pi. I won’t go into great detail on this wonderful device, other than the obvious.

  • It is cheap
  • It comes with Ethernet, HDMI
  • It runs Debian Linux
  • It has GPIO, general purpose input output, pins on the board

So, the device itself is just screaming for some interesting projects, interfacing with stuff that normally is not integrated with your IT stack. I’ve tried it out in a few different ways already, and I’ve bought a handful of devices. One of them has a fixed position, connected to my TV, running XBMC.

Well, let’s go back a few years in time. Quite a few years, that is. Back to the days when I had just started growing a beard, and was just as naïve as only youngsters freshly out of high school can be. I headed off to university, and on one of the first days of the introduction week we went to the student pub. There I saw one of the coolest things since the invention of sliced bread: a pedestrian crossing light.

Normally, such a thing would not leave much of an impression on me, but this one was special. Our own student pub was one of three student pubs in the same basement corridor, but there was only one WC. In principle, this was not a problem early in the evening, but as the night went on, the line to the WC became an annoyance. In our pub, the one with the crossing light, we were less stressed about it than people in the other pubs, since our crossing light was connected to the lock of the WC door. Whenever the lock was locked, our crossing light was red; when the door was unlocked, it was green.

When we relocated the pub, we ended up in a location where the WC situation was much better. I had personally made sure that we brought the crossing light with us, but I could never come up with as good of a use for it. This was a long time ago, but the bright idea (pun intended) stuck with me over the years.

When I implemented Nagios/OP5 at my new workplace, I started playing with the thought that I wanted some way of raising my guys’ attention to the monitoring. The classical “big screen on the wall” was of course one of the first things that crossed our minds. This was easy, and people notice it – goal achieved. But I wanted something else, something more… catchy.

Just by coincidence, I got in contact with a guy who is responsible for replacing old bulb-based traffic lights with new LED-based ones. He gave me a handful of 3-light (red, yellow, green) pedestrian crossing lights, and now I was able to get going with my latest project.

A Nagios/OP5 installation that shows service states on a traffic light.

Note that you are on your own when connecting the things together. If you are not completely sure of what you are doing, then don’t. I’ve been doing things like this for a while, so I trust myself. But I do not take any responsibility whatsoever for any mistakes you make yourself. This project involves 220V, which can be lethal. Don’t blame me if things go bo-boo. End of disclaimer.

 

This is the conceptual view:

All in all, I’ve got a Nagios/OP5 installation in a virtual machine. In this example, one of my services is configured with an event handler, pointing to a command configured in checkcommands.cfg. I decided that the communication between my monitoring system and the Raspberry Pi would go over nrpe, which is the standard Nagios agent anyway. Adding a couple of scripts, on the server to take care of events and on the client side to control the GPIO, was fairly simple.

Let us go through the whole chain on both hosts, the Nagios/OP5 system and the Raspberry Pi. I will start in reverse order, as the Nagios configuration is simple and does not require much explanation. Which service to choose for your traffic light is up to you. There are a couple of interesting constructs that come with OP5 that are not readily available out of the box in a Nagios installation. One of these is the “Business process view”, which allows you to perform logical operations on service states. You can also create new services from these aggregations. I will leave this part out of this blog entry, but the principle is so easy that if you manage the rest in this article, you should not have a hard time figuring that part out either.

In principle, I just chose one service and added an event_handler to it. That’s it. The event_handler is configured as any other command in the _checkcommands.cfg_ file. An event handler should perform a couple of checks, so that you are sure that you really want to do something when it is triggered, as it is triggered quite often. Basically, check the state (OK, WARNING, CRITICAL) and the state type (hard, soft) and make up your mind.
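A rough sketch of that decision logic could look like the following. This is hypothetical, not the actual script: the check_nrpe path and the remote command names (kmg_ampel_*) are made-up assumptions.

```shell
#!/bin/sh
# Hypothetical event handler sketch. Arguments arrive in the same order as
# in the event_ampel command definition: state, state type, host, service,
# attempt number.

state_to_color() {
    # Map a Nagios service state to a traffic light color
    case "$1" in
        OK)       echo green ;;
        WARNING)  echo yellow ;;
        CRITICAL) echo red ;;
        *)        echo none ;;
    esac
}

event_ampel() {
    state=$1 statetype=$2 host=$3 service=$4 attempt=$5
    # Only act on hard states, so soft retries do not flap the light
    [ "$statetype" = "HARD" ] || return 0
    color=$(state_to_color "$state")
    [ "$color" = "none" ] && return 0
    # Assumed plugin path and remote command names
    /opt/plugins/check_nrpe -H pi-s001 -c "kmg_ampel_$color"
}

# Example: event_ampel CRITICAL HARD pi-s001 "remote debug" 3
```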

As mentioned before, I chose to use _check_nrpe_ to integrate Nagios with my Raspberry Pi, since it is readily available and very simple to configure. All I need to do, is to remotely run a script on the Raspberry Pi, which nrpe allows me to do after a very simple configuration. I just had to come up with a name for my remote command, add it to the nrpe configuration on the Raspberry Pi, restart nrpe, and start using check_nrpe on the Nagios/OP5 server.

On the Nagios/OP5 server, you need to get the following into your configuration files. When using OP5, there is a very simple web GUI for this. Otherwise, just fire up your favorite editor (which should be _vi_).

Here is the service in the Nagios/OP5 services.cfg for my dummy service to monitor and display on the traffic light. The magic is in the sauce, I mean event_handler:

# service 'remote debug'
define service{
  use default-service
  host_name pi-s001
  service_description remote debug
  check_command check_nrpe!-H $HOSTADDRESS$ -c kmg_dummy
  event_handler event_ampel
}

This event handler is configured in the Nagios/OP5 configuration file checkcommands.cfg. Note the arguments I am sending to the script.

  • $SERVICESTATE$ – Nagios macro for the current state of the service
  • $SERVICESTATETYPE$ – Nagios macro for the current type (soft or hard) of the service
  • $HOSTNAME$ – Nagios macro for the service’s host
  • $SERVICEDESC$ – Nagios macro for the name of the service
  • $SERVICEATTEMPT$ – Nagios macro for the number of check attempts so far
# command 'event_ampel'
define command{
  command_name event_ampel
  command_line $USER1$/kmg/event_ampel $SERVICESTATE$ $SERVICESTATETYPE$ $HOSTNAME$ $SERVICEDESC$ $SERVICEATTEMPT$
}

The $SERVICEATTEMPT$ macro, together with $SERVICESTATETYPE$ and $SERVICESTATE$, gives you the possibility to actually try to fix a problem before Nagios/OP5 has even notified a sysadmin. In a default installation/config, Nagios tests a service a number of times before it becomes a hard CRITICAL, which is what gets notified to the outside world. If you write your event handler so that it tests these parameters, e.g. service state = CRITICAL, state type = soft, service attempt = 3, then you can do something like restarting a service before waking up the sysadmin. My event handler is a bit simpler than that. I just need to trigger a relay controlling a light bulb, so I skipped some of that. This is the event handler script:
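A minimal sketch of such a guard, using the same macros as the command definition above (the function name and the commented-out fix-it command are mine, purely illustrative):

```shell
#!/bin/sh
# Sketch of the guard described above: act only on the third soft
# CRITICAL check, i.e. just before Nagios escalates to a hard state
# and starts notifying people.
should_act() {
  state=$1       # $SERVICESTATE$
  statetype=$2   # $SERVICESTATETYPE$
  attempt=$3     # $SERVICEATTEMPT$
  [ "$state" = "CRITICAL" ] && [ "$statetype" = "SOFT" ] && [ "$attempt" = "3" ]
}

# Typical use inside an event handler:
#   should_act "$state" "$statetype" "$serviceattempt" && restart_the_service
# where restart_the_service stands for a hypothetical fix-it command.
```

Called this way, the guard succeeds exactly once per outage, on the last soft attempt.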
#!/usr/bin/ksh

state=$1
statetype=$2
serviceHost=$3
service=$4
serviceattempt=$5

raspberryPi=10.64.150.5

logfile=/opt/monitor/var/eventhandler.log

# Sep 25 14:53:14
date=`date +"%b %d %H:%M:%S"`

case "$state" in
  OK)
    command=kmg_ampel_green
    ;;
  WARNING)
    command=kmg_ampel_yellow
    ;;
  CRITICAL)
    command=kmg_ampel_red
    ;;
esac
/bin/echo -en "$date; restart_windows_service.sh; $serviceHost:$service; Got a $statetype $state at try $serviceattempt, sending $command to host $raspberryPi " >> $logfile

/opt/plugins/check_nrpe -H $raspberryPi -c $command >> $logfile
echo "Set Ampel ok"

If you already have the Raspberry Pi set up and running, you can test the event handler and the integration from the Nagios/OP5 host:

#--- check that nrpe works (i.e. allowed_hosts on the pi is set properly)
op5$ /opt/plugins/check_nrpe -H 10.64.150.5
#--- check that the command config on the pi works
op5$ /opt/plugins/check_nrpe -H 10.64.150.5 -c kmg_ampel_green
#--- check that the whole stack works
op5$ ./event_ampel CRITICAL HARD pi-s001 "remote debug" 3

So far, the Nagios configuration. We also need the relay control and the Raspberry Pi configuration. This is the hardware you need to complete this project:

  • One Raspberry Pi
  • 5V DC source (Micro USB charger)
  • 12V DC source (to drive the relays)
  • 3 x 5kOhm resistors
  • 3 x BC237C transistors
  • 3 x relays, 12V DC activation/250V AC switching
  • A few cables.

I used a 12V DC power source to drive the relays, since I could not quickly find any 5V DC relays. If you do find such relays, just connect Raspberry Pi pin 2 (5V) to the driver, and you will save yourself a couple of bucks.

Install Debian Squeeze on an SD card, enable SSH, and expand the root filesystem at first boot. Then log into your box and install ksh and nagios-nrpe-server.
sudo apt-get install ksh
sudo apt-get install nagios-nrpe-server

That is basically it. You will survive without ksh, but since I am an old fart, I tend to stick to what I know from before. Ksh was there at the dawn of UNIX. In the good old days, it was the bread and butter of decent scripting. Nowadays you’ll manage with bash and won’t miss ksh a lot. More about that in a different blog post. If you don’t care for ksh, just change the references to it in my examples to bash.

Now we will configure the nrpe daemon correctly.

pi@raspberrypi /etc/nagios $ grep allowed_hosts nrpe.cfg
allowed_hosts=127.0.0.1,192.168.2.34,194.40.128.84
pi@raspberrypi /etc/nagios $ cat nrpe.d/kmg_commands.cfg
command[kmg_ampel_red]=/app/prd/op5/bin/ampel red
command[kmg_ampel_yellow]=/app/prd/op5/bin/ampel yellow
command[kmg_ampel_green]=/app/prd/op5/bin/ampel green
command[kmg_ampel_blink]=/app/prd/op5/bin/ampel blink
command[kmg_ampel]=/app/prd/op5/bin/ampel blink
command[kmg_dummy]=/app/prd/op5/bin/dummy
pi@raspberrypi /etc/nagios $ sudo /etc/init.d/nagios-nrpe-server restart
[ ok ] Stopping nagios-nrpe: nagios-nrpe.
[ ok ] Starting nagios-nrpe: nagios-nrpe.

With that taken care of, the integration between the Nagios/OP5 server and the Raspberry Pi is set up. The best way to test it is with the commands from the examples above (check_nrpe -H 10.64.150.5). If that doesn’t work, kill the nrpe process on the Raspberry Pi with “kill” and start it again. That usually solves most problems.

Interfacing with the GPIO ports is extremely easy on Debian/Raspbian Squeeze. You just echo some commands to files in the /sys filesystem. First you’ve got to enable some GPIO ports and set their direction. I’ve done this in a startup script, /etc/init.d/ampel, to which I also symlinked /etc/rc2.d/S99ampel so that it runs automatically at reboot. Debian is a weird beast, though. Note that I put the startup script in rc2.d, which is the default runlevel for Debian, whereas I would have put it in rc3.d on an Ubuntu box.

Here is my /etc/init.d/ampel script:

#!/usr/bin/ksh
# /etc/init.d/ampel
#

### BEGIN INIT INFO
# Provides: ampel
# Required-Start:
# Required-Stop:
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Ampel
# Description: Ampel
### END INIT INFO
################################################################################

TS=$(date "+%Y%m%d %H:%M:%S")
echo "$TS; $0 $1" >> /var/log/ampel.log

case $1 in
  start)
    echo "$TS; setup gpio" >> /var/log/ampel.log

    echo "17" > /sys/class/gpio/export
    echo "18" > /sys/class/gpio/export
    echo "21" > /sys/class/gpio/export

    echo "out" > /sys/class/gpio/gpio17/direction
    echo "out" > /sys/class/gpio/gpio18/direction
    echo "out" > /sys/class/gpio/gpio21/direction

#--- to rid the sudo part - change the permissions

    chmod 666 /sys/class/gpio/gpio17/value
    chmod 666 /sys/class/gpio/gpio18/value
    chmod 666 /sys/class/gpio/gpio21/value
    ;;
esac

This makes sure that the GPIO pins are correctly set, which you can verify by checking the /sys/class/gpio directory. Note that I am changing the file permissions of the “value” files. If you don’t do this, you must write to these files as root, which is easily done using “sudo”. But since the Raspberry Pi isn’t a state-of-the-art box when it comes to security anyway, I decided that if you are logged in to the system at all, you should be able to set these values, no matter which user you are.

If you see “your” gpio pins there, you should be fine.
pi@raspberrypi ~ $ ls -la /sys/class/gpio/
total 0
drwxr-xr-x 2 root root 0 Aug 16 16:58 .
drwxr-xr-x 26 root root 0 Aug 16 16:58 ..
--w------- 1 root root 4096 Aug 16 16:58 export
lrwxrwxrwx 1 root root 0 Aug 16 16:58 gpio17 -> ../../devices/virtual/gpio/gpio17
lrwxrwxrwx 1 root root 0 Aug 16 16:58 gpio18 -> ../../devices/virtual/gpio/gpio18
lrwxrwxrwx 1 root root 0 Aug 16 16:58 gpio21 -> ../../devices/virtual/gpio/gpio21
lrwxrwxrwx 1 root root 0 Aug 16 16:58 gpiochip0 -> ../../devices/virtual/gpio/gpiochip0
--w------- 1 root root 4096 Aug 16 16:58 unexport

The nrpe configuration in /etc/nagios/nrpe.d/kmg_commands.cfg points to the /app/prd/op5/bin/ampel script, which is merely a wrapper, in case there is anything clever to be done before actually switching the relays. In this example, I am not logging anything, and there is very little magic around it. It is usually wise to have such a wrapper between nrpe and whatever you want to perform on the system; for me it comes naturally to do it this way.
#!/usr/bin/ksh

ampel_bin=/app/prd/ampel/bin
#ampel="sudo $ampel_bin/setAmpel_a1"
ampel="$ampel_bin/setAmpel_a1"

blink(){
  $ampel all
  sleep 0.2s
  $ampel none
  sleep 0.2s
  $ampel all
  sleep 0.2s
  $ampel none
  sleep 0.2s
}

red(){
  blink
  echo "  - Set ampel to: red"
  $ampel red
}
yellow(){
  blink
  echo "  - Set ampel to: yellow"
  $ampel yellow
}
green(){
  blink
  echo "  - Set ampel to: green"
  $ampel green
}

echo "Ampel"
case $1 in
  red)
    red
    ;;
  yellow)
    yellow
    ;;
  green)
    green
    ;;
  blink)
    blink
    ;;
esac

exit 0

In the end, this wrapper executes the “$ampel” script, which is set to “/app/prd/ampel/bin/setAmpel_a1”, with one parameter: red, yellow, green, or blink. This is the setAmpel_a1 script:
#!/usr/bin/ksh

#-----------------------------
# Base configuration
#-----------------------------
this_dir=$(cd `dirname $0`; pwd)
base_dir=$(cd `dirname $0`/..;pwd)

. $base_dir/conf/gpio.conf

off=0
on=1

red(){
  echo "$on" > /sys/class/gpio/gpio${a1red}/value
}

yellow(){
  echo "$on" > /sys/class/gpio/gpio${a1yellow}/value
}

green(){
  echo "$on" > /sys/class/gpio/gpio${a1green}/value
}

none(){
  echo "$off" > /sys/class/gpio/gpio${a1red}/value
  echo "$off" > /sys/class/gpio/gpio${a1yellow}/value
  echo "$off" > /sys/class/gpio/gpio${a1green}/value
}

all(){
  echo "$on" > /sys/class/gpio/gpio${a1red}/value
  echo "$on" > /sys/class/gpio/gpio${a1yellow}/value
  echo "$on" > /sys/class/gpio/gpio${a1green}/value
}

#================================
# MAIN
#================================
case $1 in
  red)
    none
    red
    ;;
  yellow)
    none
    yellow
    ;;
  green)
    none
    green
    ;;
  none)
    none
    ;;
  all)
    all
    ;;
esac

That’s all folks!

Supersize io operations in Linux

Hi all,

This time I will tell you something that most of you really don’t care about: Max io size.

More specifically: how large an IO operation can I issue to a storage system in one go. This might not make much of a difference for most people, but when you are tweaking a system for databases, distributed computing etc., there is a case for being able to tune this properly.

About a year ago I was testing storage performance on different platforms, i.e. Solaris and Linux. One of the tests was to measure the throughput and number of IOPS when ramping one to many processes over a range of IO operation sizes (8, 32, 64, 128, 256, 512, 1024, 2048k).

To finalize this blog post, I had to go back to some emails I sent to myself (my kind of book-keeping), so I am lacking some of the screen dumps. But the basic setup is the same on both Ubuntu and SUSE using the lpfc driver (Emulex).

On Linux, I just could not get the box to issue IO operations larger than 256k, which I found quite disturbing, as I really tried hard to bash the system with large IOs. No matter how hard I tried (using different tools to generate the IO, and pressing the enter button on my keyboard really, really hard), I just could not get iostat to show IOs larger than 256k. Since I trust iostat, I deduce that my system does not produce larger IOs.

When it comes to reading the output in this post, there are a couple of things to keep in mind so that you don’t get confused.

  • Even if you start iostat with -k to show kilobytes, avgrq-sz is shown as a number of 512-byte disk blocks, hence you need to multiply the number seen in iostat by 512 to get the IO size
  • Some config parameters need to be multiplied by the 4kB memory page size (I will get to that later)
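The first point is easy to sanity-check with a one-liner; for instance, an avgrq-sz of 512.00 (a value that shows up later in this post) works out to 256k IOs:

```shell
# avgrq-sz is reported in 512-byte sectors; multiply by 512 to get
# bytes, then divide by 1024 for KiB: 512.00 sectors -> 256 KiB.
echo "512.00" | awk '{ printf "%d KiB\n", $1 * 512 / 1024 }'
```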

To get to my point, I need to show you a couple of things. For example, I can see that dd is issuing 1MB IOs like this:

malu@kmg-sandbox-0001:~$ sudo strace dd if=/dev/zero of=/dev/sdc bs=1024k count=10000
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576

But when checking the iostat output, it only shows me 256k IOs.

dm-7              0.00     0.00    0.00 1072.00     0.00 274432.00   512.00     3.07    2.85   0.93  99.20

Remember to multiply the 512.00 by 512 to get the IO size.

To find out how to tune this, I really had to bend over backwards. In the bad old days, pre-2.6 kernels, there was the MAXPHYS kernel parameter, which was probably not optimal, since it was removed in the 2.6 kernel. Don’t ask me too much about the history, but I know it was a pain to get this set correctly, and often a kernel recompile was required, which in turn voided any support.

When investigating this topic, most people I talked to either told me “can’t be done”, or “don’t bother”.

The first thing I found out was that there is a tweakable parameter per device, which controls the max physical IO size. It is not very convenient to use, and it only affects the LUN layer: how large an IO you can send to a LUN.

malu@kmg-guran-0001:/sys/block/sda/queue $cat max_hw_sectors_kb
4096

Setting this (echo 128 > max_sectors_kb, in the same directory) changes the IO size sent to that device, _up to_ some magic limit (256k), and not up to the 4MB suggested by the tunable above.

So… I could only limit the max IO size to something smaller than or equal to 256k, regardless of the max_hw_sectors_kb content. It is quite easy to check this, as the system reacts immediately when you tune this parameter. Here is an example:

In one terminal window, run the following which will run “dd” reading 1M blocks over and over again:

malu@poc01:/sys/block/sda/queue> while true; do echo "Restarting dd" ; sudo dd iflag=direct if=/dev/sda of=/dev/null bs=1024k; done

In another terminal window, do the following:

malu@poc01:/sys/block/sda/queue> for bs in 8 16 32 64 128 256 512 1024 2048; do date "+%Y%m%d %H:%M:%S"; echo "Setting max_sectors_kb to $bs"; sudo sh  -c "echo $bs > max_sectors_kb"; sleep 10; done
20110923 09:05:05
Setting max_sectors_kb to 8
20110923 09:05:15
Setting max_sectors_kb to 16
20110923 09:05:25
Setting max_sectors_kb to 32
20110923 09:05:35
Setting max_sectors_kb to 64
20110923 09:05:45
Setting max_sectors_kb to 128
20110923 09:05:55
Setting max_sectors_kb to 256
20110923 09:06:05
Setting max_sectors_kb to 512
20110923 09:06:15
Setting max_sectors_kb to 1024
20110923 09:06:25
Setting max_sectors_kb to 2048

In a third terminal window, run “iostat -xtc 1” or similar, and you will see the block size read from the device change from 8k to 256k (don’t forget to multiply avgrq-sz by 512 to get the IO size).

This got me very frustrated. I really wanted to be able to tune this properly (to be able to issue larger IOs) and none of my favorite contacts at different vendors could help me out. After quite some googling, I came across a discussion about Lustre, where someone had a similar issue. This directed me to Bug 22850, where I finally found the configuration parameter I needed, to change the size of IOs sent through the driver stack.

https://bugzilla.lustre.org/show_bug.cgi?id=22850

Voila!

Knowing this, I could get the following information from my system:

malu@poc01:/sys/class/scsi_host> cat host*/lpfc_sg_seg_cnt
64
64
64
64

The Emulex driver was limited to a 256k IO size, which is easily tuned by changing the _lpfc_sg_seg_cnt_ parameter to 256 in /etc/modprobe.conf.local:

malu@poc01:/sys/class/scsi_host> cat /etc/modprobe.conf.local

options lpfc lpfc_lun_queue_depth=16
options lpfc lpfc_sg_seg_cnt=256
options lpfc lpfc_link_speed=4

malu@poc01:/sys/class/scsi_host> sudo mkinitrd
malu@poc01:/sys/class/scsi_host> sudo shutdown -r now
...
malu@poc01:/sys/class/scsi_host> cat host*/lpfc_sg_seg_cnt
256
256
256
256

Now the max IO size is 1MB and I am happy again. Try it out yourself!
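The arithmetic ties back to the 4kB memory page size mentioned at the top of this post: assuming each scatter/gather segment maps one 4kB page (an assumption of mine that matches the observed behaviour), the segment count translates directly into the max IO size. A quick sketch:

```shell
# Max IO size = lpfc_sg_seg_cnt * 4 KiB (one page per s/g segment).
for cnt in 64 256; do
  echo "lpfc_sg_seg_cnt=$cnt -> max IO $(( cnt * 4 )) KiB"
done
```

With the default of 64 segments this gives the 256k ceiling I was stuck at, and with 256 segments the 1MB IOs I was after.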

Monitoring the progress through a pipe

Small Linux command, but still cooler than sliced bread: pv

When performing operations that produce output, plenty of it, I sometimes wonder: “how fast is it, really?”

Normally, I just kick in a “time” before the command, check the size of the output and divide it by the time it took to perform the operation.

The other day, I was troubleshooting NFS on my Synology DS1511+, as I suspected it was not very responsive. This is a side note, though: it _was_ slow as a datastore for my ESXi box, since I had forgotten to enable asynchronous mode in the NFS rule settings for that share, which left me with only a 6MB/sec write rate. So it was only slow on my ESXi box, useless for storing virtual machines; all other use cases were fine. Turning on asynchronous under Shared Folder/NFS Privileges solved the problem.

Back to pv and my examples. First, this is what I used to do:

malu@kmg-sandbox-0001:/mnt/synology02/files$ ls -la testfile
-rw-r--r-- 1 malu malu 448790528 2011-09-16 02:50 testfile
malu@kmg-sandbox-0001:/mnt/synology02/files$ time cat testfile > /dev/null

real 0m3.914s
user 0m0.020s
sys 0m0.910s

So, the ca 450 MB was transferred in 4 seconds, just over 100MB/sec. Not bad over a 1Gbit ethernet fileshare. But what if I want to see this _during_ the transfer? Enter pv:

malu@sandbox $ time cat testfile | pv -b -r > /dev/null
 428MB [ 110MB/s]

real    0m3.902s
user    0m0.040s
sys     0m1.270s

The parameters:

  • -b shows the number of bytes (MB)
  • -r shows the speed/transfer-rate through the pipe

pv simply eats stdin, counts the bytes, and sends the stream on to stdout without modifying it. It costs a few CPU cycles, but gives me some well-needed info during the transfer. You can also tell it the size of the expected stream, and it will kick in a progress bar for you as well.
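For that progress bar, pv needs to be told the expected size via -s; a sketch along the lines of the example above (stat -c %s assumes GNU coreutils):

```shell
# Pass the expected byte count with -s so pv can render a progress
# bar and an ETA instead of just a running byte counter.
cat testfile | pv -s "$(stat -c %s testfile)" > /dev/null
```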
Sweet!

Could not chdir to home directory

I was at a customer’s site the other day, and ran into an issue that I could not really understand.

When logging in on my Linux box, a server I was setting up for a small application, the first thing I got on my terminal was the following error message:

Could not chdir to home directory /app/prd/kmggroup: Permission denied

The background is that the application I am setting up has its home directory in a non-standard location. Let us call the user kmggroup, just for kicks, and say that its home directory is /app/prd/kmggroup. Logging into this user directly using a password should be banned anyway, as it is an anonymous user owning an application. I will write about my preferred way of logging in as anonymous users (e.g. oracle, apache, kmgapp, whatever) in a different post.

At this point, my user “landed” in “/”, but it was still possible to do a “cd /app/prd/kmggroup” to go to that directory. Very annoying, though.

It took me a little while to figure out, as I had just ordered a virtual machine, no preference of flavor. I got a RedHat server, and for me there is not much to say about that.

kmggroup@server.org:/usr/local/samba/etc $cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 (Santiago)

I debugged my .bash_profile, the /etc/passwd file, and /etc/profile, tweaking them a bit (adding “echo bappen” to the startup scripts here and there). I realized that the error message appeared even before the /etc/profile script was run, so I got a bit curious.

After searching the topic on the world wide information network, also known as the Internet, I slowly realized that this has to do with SELinux, namely the context settings of the directories.

SELinux is dreaded by the uninitiated, and there are few admins out there who really know how to set it up and live with it properly. (I am mainly one of the uninitiated.)

Enough said about that. Here is how I solved the problem, without messing up someone else’s system.

My directories were set up like this:

kmggroup@server.org:/app/prd/kmggroup $ls --context -d /app /app/prd /app/prd/kmggroup
drwxr-xr-x. root root unconfined_u:object_r:default_t:s0 /app
drwxr-xr-x. kmggroup kmggroup unconfined_u:object_r:default_t:s0 /app/prd
drwxr-xr-x. kmggroup kmggroup unconfined_u:object_r:default_t:s0 /app/prd/kmggroup

Normally, /home is set to the following context:

kmggroup@server.org:/app/prd/kmggroup $ls --context -d /home /home/*
drwxr-xr-x. root root system_u:object_r:home_root_t:s0 /home
drwx------. apa apa unconfined_u:object_r:user_home_dir_t:s0 /home/apa

My “/app/prd/kmggroup” directory is “special”, as we set it up in a non-default location, where the context was not set yet.

So, a couple of chcon later, the problem was solved:

sudo chcon -t home_root_t /app
sudo chcon -t home_root_t /app/prd
sudo chcon -t user_home_dir_t /app/prd/kmggroup

kmggroup@server.org:/app/prd/kmggroup $ls --context -d /app /app/prd /app/prd/kmggroup
drwxr-x---. kmggroup kmggroup unconfined_u:object_r:home_root_t:s0 /app
drwxr-x---. kmggroup kmggroup unconfined_u:object_r:home_root_t:s0 /app/prd
drwx------. kmggroup kmggroup unconfined_u:object_r:user_home_dir_t:s0 /app/prd/kmggroup

The error message no longer appears, and my user ends up in his home directory. When I told the sysadmin at the site, he said that they do not use SELinux (for good reasons in their environment); he had just forgotten to turn it off before giving me the box.

We both had a good laugh about it.

Have a nice day!
//magnus

Nagios and OP5 – writing a nrpe check script

Long time no see…

One of my main interests in working with production systems is being able to sleep well at night. A very important component in making sure I can, is knowing when things go bad, which they will, sooner or later. It is just part of life. Just like any car or mechanical thing, a computer system will eventually have a hiccup. It is better to know yourself when and what went wrong than having a customer call to tell you that something in your shop is broken.

Be proactive, not reactive.

In my world, where a small shop has a minimum of a handful of servers, and a large shop has hundreds or perhaps thousands of servers and services, there is no way one can know for sure whether something is working or broken. A single server, no matter which brand/make/OS, has more than one service running, and everything running can break. So, unless you are willing to constantly log in to each and every system, you need to automate the monitoring of your stuff. For decades there have been monitoring systems around, ranging from very cheap to very expensive.

Short story: You can implement quite a mature and powerful monitoring even with a very small budget. Even large corporations are looking into cost effective solutions.

Today, I checked out the OP5 Monitor, which is a commercial but very attractive extension of Nagios. It has many bells and whistles that are not part of the standard issue, mainly when it comes to reporting and configuration. It still took me a couple of hours to set it up the way I wanted. But man, the configuration is a walk in the park in comparison. After the first hit, there is almost no way back to plain vanilla Nagios.

I have used Nagios quite a lot in the past, but it is ugly (eh, the GUI honestly looks like crap, but it surely fulfills its purpose) and there is a horde of config files to keep track of.

Well, being an old school Nagios hacker, I already know the basic concepts. Perhaps configuring OP5 Monitor is easier for me than for many others, but I will put that aside. Here, I will just give you a quick glance at how easy it is to extend the Nagios NRPE (Nagios Remote Plugin Executor), so that the monitoring server (Nagios or OP5) can execute remote scripts on a host without having to deal with weird home-grown ssh scripts and keys.

First, I have to give you a short introduction to how Nagios checks a service. It is simple, really simple.

If you want to write your own check-script, you need to know what you want to check. A good example is to look for the presence of a file, e.g /tmp/foo.bar. Let us say, that your whole corporation is depending on knowing whether this file exists. A simple way to check this, is to write a script.

#!/bin/ksh
[ ! -f /tmp/foo.bar ] && echo "The file does not exist"

This will just echo a warning if the file does not exist.

If you would like for Nagios to understand this, you need to tell it just a little more; a return code.

  • 0 – OK – all is fine, just go on as before
  • 1 – Warning – something is not really ok
  • 2 – Critical – this is bad, call for the fire brigade
  • 3 – Unknown – the check itself could not determine the state

So, to extend this script into a fully fledged Nagios module, you just need to send back the correct return code:

#!/bin/ksh

if [ ! -f /tmp/foo.bar ]
then
  msg="CRITICAL - The file does not exist"
  rc=2
else
  msg="OK - The file is here!"
  rc=0
fi

echo $msg
exit $rc

It is as simple as that (plus that you have to go through the tedious job of configuring the _checkcommands.cfg_ file and your Nagios services). With this you have a simple Nagios module.
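For reference, that configuration could look roughly like this (a hedged sketch; the host name is made up, and a check_nrpe command definition like this one usually already exists in an OP5 installation):

```
# checkcommands.cfg -- generic NRPE check, remote command name as argument
define command{
  command_name check_nrpe
  command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

# services.cfg -- run the remote "myfile" command configured in nrpe
define service{
  use default-service
  host_name remote-host
  service_description foo.bar presence
  check_command check_nrpe!myfile
}
```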

To make this an NRPE module, which is remotely executed by the Nagios or OP5 server on the server of choice, you just have to put this script somewhere on your monitored server, e.g. in /opt/plugins/check_myfile, and set up the NRPE configuration.

remote host $> sudo chmod 755 /opt/plugins/check_myfile
remote host $> grep check_myfile /etc/nrpe.d/my_config.cfg

command[myfile]=/opt/plugins/check_myfile

remote host $> sudo /etc/init.d/nrpe restart

On the Nagios server, check that your script works (my remote host has the IP address 192.168.2.90):

OP5 $> /opt/plugins/check_nrpe -H 192.168.2.90 -c myfile

CRITICAL - The file does not exist

remote_host $> touch /tmp/foo.bar

OP5 $> /opt/plugins/check_nrpe -H 192.168.2.90 -c myfile

OK - The file is here!

That is basically it! Now, go ahead and configure a new _nrpe_ service for a host in your OP5 environment, put the word “myfile” in the “check_command_args” field, and you are done. Two minutes of work, and you save yourself tons of headache.

DEBUG: The script _has to_ send at least something to stdout; it doesn’t really matter what. Otherwise you will get an error message from the server-side _check_nrpe_ script:

remote host $> grep echo /opt/plugins/check_myfile
#  echo $msg
OP5 $> /opt/plugins/check_nrpe -H 192.168.2.90 -c myfile
NRPE: Unable to read output

Benchmarking – to impress or not impress; is not even a question

I love benchmarking. Benchmarking is what I do well.

There is something shimmering about running tests on a system, trying to find out what it can do. After all, who is to say that something is fast or slow? Who defines “fast”, and how bad is “slow”?

And there is a but: the word “benchmarking” is badly misused and misunderstood. The goal when performing a benchmark is not to produce impressive numbers (that is called performance tuning). A benchmark shows you the metrics that an isolated system, with a predefined and given set of parameters, can produce. It is, of course, always satisfying to show impressive metrics. I mean, who keeps statistics of the losing team in last year’s series of your favorite sport?

The graph below is not impressive at all, showing a write performance of ca 30MB/sec. Had this been 15 to 20 years ago, many companies would have paid good money to reach these numbers, but these figures were measured just recently, on a not that impressive hardware stack.

So, what does it show?

  • Server: AsRock ION 330
  • CPU: Intel Atom 330@1.6GHz
  • RAM: 4GB RAM
  • Storage: 1TB IOMega USB Drive, 8MB Cache

And this is how I produced the numbers:

1) Start collecting io data

nohup iostat -d -t -k -x 10 > iostat.out &

2) Do something to produce an io load

cp -r /export/stuff/new/Favorite_tv_series.S08E1* /export/stuff/tv_serier/temp/

From here, we are set to go. We’ve collected data on a given system, and we produced a well-defined and reproducible workload. But what about the data? Can just about anyone graph it so nicely and be able to read what it means? For sure not. As Kevin Closson once said in a very good talk I attended: “The solution is simple, but it is not easy”.

The output of iostat data looks like this:

2010-07-22T12:28:28+0200
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.40     0.50    0.70    0.70     6.80     4.80    16.57     0.07   47.86   7.86   1.10
sda1              0.40     0.50    0.70    0.70     6.80     4.80    16.57     0.07   47.86   7.86   1.10
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    1.10    1.20     6.80     4.80    10.09     0.09   39.57   4.78   1.10
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     3.60    0.00   15.90     0.00    78.00     9.81     1.99  125.22   2.45   3.90
sdc1              0.00     3.60    0.00   15.90     0.00    78.00     9.81     1.99  125.22   2.45   3.90

...

2010-07-23T10:37:49+0200
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.30    0.00    0.50     0.00     3.20    12.80     0.01   30.00  10.00   0.50
sda1              0.00     0.30    0.00    0.50     0.00     3.20    12.80     0.01   30.00  10.00   0.50
sda2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.80     0.00     3.20     8.00     0.01   18.75   6.25   0.50
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

You have to be quite darn good to graph something like this. Even though I consider myself to possess a black belt in scripting, or perhaps because of it, I would not get at this data right away. You first have to transform it into something your favorite graph-plotting tool (read: gnuplot) can handle. I like semicolon-separated files, and I like ISO-style date formats. This is what I did:

device=sdc
cat iostat.out | \
awk -v DEVICE="$device" '
#--- remember the most recent timestamp, converted to ISO "date time"
$1 ~ /^2010/ {
    ts = $1;
    gsub("T", " ", ts);
    gsub(/\+0200/, "", ts);
}

#--- keep only the data lines for the chosen device
$0 !~ /^$/ &&
$1 !~ /^2010/ &&
$0 !~ /Device/ &&
$1 == DEVICE {
    #--- set semicolon as output field separator
    OFS = ";";
    #--- recalculate $0 with OFS
    $1 = $1;
    print ts, $0
}' > a.out

This is what the data looks like now. Notice that I filtered the data to keep only the “sdc” rows as well:

2010-07-22 12:34:08;sdc;0.00;0.00;0.00;0.00;0.00;0.00;0.00;0.00;0.00;0.00;0.00
2010-07-22 12:34:18;sdc;0.00;0.00;0.10;0.00;0.40;0.00;8.00;0.00;20.00;20.00;0.20
2010-07-22 12:34:28;sdc;0.00;5001.40;0.00;164.30;0.00;19068.80;232.12;40.23;218.65;4.08;67.00
2010-07-22 12:34:38;sdc;0.00;6382.60;0.20;235.20;0.80;26948.80;228.97;109.58;472.74;4.00;94.20
2010-07-22 12:34:48;sdc;0.00;7032.70;0.20;250.30;0.80;29018.80;231.69;112.57;445.85;3.99;100.00
2010-07-22 12:34:58;sdc;0.00;5140.10;0.10;184.80;0.40;21379.20;231.26;47.91;265.46;4.05;74.80
2010-07-22 12:35:08;sdc;0.00;4776.10;0.20;204.80;0.80;21086.00;205.72;58.86;297.32;3.77;77.30
2010-07-22 12:35:18;sdc;0.00;4133.80;0.10;148.50;0.40;16593.20;223.33;44.39;288.05;3.84;57.10
2010-07-22 12:35:28;sdc;0.00;4056.70;0.10;149.90;0.40;17242.40;229.90;41.07;273.85;3.96;59.40
2010-07-22 12:35:38;sdc;0.00;4093.00;0.10;155.70;0.40;17115.20;219.71;40.54;269.99;3.87;60.30
2010-07-22 12:35:48;sdc;0.00;4284.60;0.00;161.30;0.00;17783.20;220.50;41.93;258.28;3.89;62.80
2010-07-22 12:35:58;sdc;0.00;0.00;0.00;0.80;0.00;3.60;9.00;0.01;350.00;1.25;0.10
2010-07-22 12:36:08;sdc;0.00;0.00;0.00;0.00;0.00;0.00;0.00;0.00;0.00;0.00;0.00

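Before plotting, a quick sanity check on the transformed file never hurts. This one-liner is my own addition, not part of the original workflow: it scans a.out (the file produced by the awk script above) and prints the timestamp of the peak write throughput in the 8th semicolon-separated field, wkB/s.

```shell
# Find the peak wkB/s (field 8) in the semicolon-separated file a.out,
# and the timestamp at which it occurred.
awk -F';' '$8+0 > max { max = $8; ts = $1 } END { print ts, max }' a.out
```

Against the sample above, it reports the burst at 12:34:48 with 29018.80 kB/s, which is a good hint for choosing the gnuplot yrange later.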
The interesting data (you know, the column with the highest values) is in the 8th column: kB written per second. So the only thing I need to do now is run it through gnuplot. To keep things simple, I show the hardcoded version here. In my day-to-day work I just don’t have the time to rewrite gnuplot scripts every time I use them, so I have written a wrapper around the whole thing so that I can reuse it.

malu@ml-sst7-0001:/home/malu/public_html/temp/quick_display/iostat $cat quickplot.gplot
reset

set terminal png size 800,400
set xdata time
set timefmt "%Y-%m-%d %H:%M:%S"
set output "pretty_picture.png"

#---  time range must be in same format as data file
set yrange [-1000:40000]
set xlabel "Date-Time"
set ylabel "kB/s"
set title "ml-sst7-0001 - kB write - from date this and that _to_ date this and that"

set datafile separator ";"

set grid
set grid front

set key right

filename="a.out"

#--- for shading offset (below the plotted line)
a=41000/20

#--- Plot the data 6 times with different shades of gray, black and green
#--- to get the illusion of having a green line with a black frame (last two plots)
#--- and a shade (the first 4 plots)
plot ["2010-07-22 12:33:00":"2010-07-22 12:40:00"] \
filename u 1:($8 - a)  w l lc rgb "#eeeeee" lw 9 t '' ,\
filename u 1:($8 - a)  w l lc rgb "#dddddd" lw 7 t '', \
filename u 1:($8 - a)  w l lc rgb "#cccccc" lw 5 t '', \
filename u 1:($8 - a)  w l lc rgb "#bbbbbb" lw 3 t '', \
filename u 1:8         w l lc rgb "#555555" lw 3 t '', \
filename u 1:8         w l lc rgb "#00ff00" lw 1 t "kB/s"

And… to make the magic happen, make sure a.out is in the same directory and fire off gnuplot:

cat quickplot.gplot | gnuplot

That’s it, that’s how I produced the graph above.
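The reusable wrapper mentioned above is not shown in the post, but a minimal sketch might look like the following. Everything here (the function name, the parameters, the reduced plot without the shading trick) is my assumption of how such a wrapper could work: it emits a gnuplot script for a semicolon-separated file, so the parameters live on the command line instead of inside a hand-edited .gplot file.

```shell
# Hypothetical reusable wrapper (a sketch, not the author's actual script).
# Emits a gnuplot script to stdout; pipe the result to gnuplot.
quickplot() {
    datafile=$1   # e.g. a.out
    column=$2     # data column to plot, e.g. 8 for wkB/s
    title=$3      # plot title
    output=$4     # PNG file to write

    cat <<EOF
reset
set terminal png size 800,400
set xdata time
set timefmt "%Y-%m-%d %H:%M:%S"
set output "$output"
set datafile separator ";"
set grid
set key right
set xlabel "Date-Time"
set ylabel "kB/s"
set title "$title"
plot "$datafile" u 1:$column w l lc rgb "#00ff00" lw 1 t "kB/s"
EOF
}
```

Usage would then be a single line per plot, e.g. `quickplot a.out 8 "sdc - kB write" sdc_write.png | gnuplot`.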

Go ahead: define your own set of tests, find out how to collect the metrics, transform the collected data into something useful, plot it, write a presentation, and make a decent amount of money from it. People are eager to read benchmark papers that confirm or disprove an idea, and they are very often willing to pay you a fair fee for them as well.