Monday, February 28, 2011

When marketing and technical information meet: Hyper-V

While reading an article about Hyper-V per-VM CPU settings, I saw this in the FAQ:


Why do you use percentage for the limit and reserve – and not MHz / GHz?

Many people find it easier to think in MHz / GHz rather than percentage of a physical computer. They also argue that using a percentage means that as you move a virtual machine from computer to computer you may get different amounts of resource depending on the underlying capability.

This is something that has been discussed extensively on the Hyper-V team, and while I do believe there is some merit in this approach, there are a number of reasons why we chose to use a percentage instead. Two key ones are:

  1. Predictable mobility

    If all your virtual machines have a reserve of 10% – you know that you can run 10 of them on any of your servers. The same would not be true if they all had a reserve of 250Mhz. Given how important virtual machine mobility is to our users – we believe that this is something that needs to be easy to manage.
  2. Not all MHz are the same

    1GHz on a Pentium IV is much slower than 1GHz on a Core i7. Furthermore – newer processors tend to be more efficient at virtualization than older processors, so the difference between the “bang for buck” that you get out of each MHz varies greatly between processor types. This means that in reality – defining a reserve or limit in MHz / GHz does not really give you a great performance guarantee anyway.

Even though this seems to be a list of technical arguments, the claims made are nonsensical:
  1. "We use a relative percentage instead of a fixed unit because we want you to be sure you can run a certain number of guests on any CPU." What? Who says that my VMs will actually still run when they suddenly get only half of the power they needed, because they were moved to a CPU with half the horsepower? A reserve is supposed to be a guarantee; a limit is supposed to be just that: a limit. Even the examples they give for using a reserve or a limit would fail: a misbehaving app that sucks CPU will suddenly be allowed to use even more, just because it's now running on a faster CPU.
  2. "Not all MHz are the same." That's not a very good reason to use percentages instead, is it? Are they claiming that every % _is_ the same?
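The first objection is easy to make concrete with a bit of arithmetic (the clock speeds below are illustrative, not from the original article): the same percentage reserve shrinks in absolute terms as soon as the VM lands on a slower host.

```shell
# Worked example with made-up clock speeds: a fixed 10% reserve gives a
# very different absolute guarantee depending on the host CPU.
for host_mhz in 3000 1500; do
    reserve_mhz=$((host_mhz * 10 / 100))
    echo "10% reserve on a ${host_mhz} MHz core = ${reserve_mhz} MHz"
done
```

A VM that needed the full 300 MHz on the first host is guaranteed only half of that after moving to the second, even though its reserve "looks" unchanged.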
Dear Microsoft (and any other company reading this), please make your technical information technical, and correct. Do whatever you want with your marketing docs, but don't let the marketing seep into the technical documentation.

Every error is a DNS error.

Newly installed RHEL5 machine in an existing network. Users opening Firefox on the machine got an error: "The bookmarks and history system will not be functional". The googlesphere suggested renaming places.sqlite and such, but that didn't help. Things began to clear up when I found errors on the NFS server that exports the home directory: "lockd: failed to monitor newmachine.companydomain". I checked the nfslock service, but it was running fine, and the configuration files for NFS and autofs were identical to those on machines that didn't show the problem. Then, like a bolt of lightning, it hit me: I had forgotten to create a reverse DNS entry for the new machine's IP. Forward DNS was OK, but reverse wasn't. That caused the NFS lock error, and since Firefox keeps its bookmarks and history in an SQLite database in the NFS-mounted home directory, that in turn caused the Firefox error... The old saying is confirmed once more: every error is a DNS error.
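A quick sanity check for this class of problem is to resolve the hostname forward and then resolve the resulting address back again. A sketch (using `getent`, which is available on RHEL5; "localhost" stands in here for the real hostname, so the check always has something to resolve):

```shell
# Sketch of a forward/reverse DNS consistency check. "localhost" is a
# stand-in; on the machine above you would use its FQDN instead.
host=localhost
ip=$(getent hosts "$host" | awk '{print $1; exit}')    # forward lookup
rev=$(getent hosts "$ip" | awk '{print $2; exit}')     # reverse lookup
echo "forward: $host -> $ip"
echo "reverse: $ip -> $rev"
# If the reverse line comes back empty, or with a different name than you
# started with, lockd (and anything else depending on NFS locking) is
# likely to misbehave.
```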

Saturday, February 19, 2011

Link aggregation and VLANs on QNAP with firmware 3.4.0

The new QNAP firmware (3.4.0) supports 802.1q VLAN tagging, but you can't create multiple interfaces in different VLANs on the same physical interface through the web interface.
In the case of link aggregation (LACP 802.3ad for example), that means only 1 VLAN and 1 IP address can be used.
Fortunately, QNAP allows full access to the underlying Linux system. Adding a VLAN interface goes like this (the example uses VLAN 234; substitute your own address, broadcast address and netmask):
# /usr/local/bin/vconfig add bond0 234
# ifconfig bond0.234 <ip-address> broadcast <broadcast-address> netmask <netmask> up

Of course, this change is not permanent: a reboot will not automatically start this interface. I'll blog about making it permanent later.
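To verify that the kernel picked up the new interface, you can inspect the kernel's VLAN table with `cat /proc/net/vlan/config` on the NAS. The snippet below just replays an illustrative sample of that file's contents, so you can see what a successful result looks like:

```shell
# Illustrative sample of /proc/net/vlan/config after the vconfig command
# above; on the NAS itself you would run:  cat /proc/net/vlan/config
vlan_table='VLAN Dev name    | VLAN ID
Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
bond0.234      | 234  | bond0'
echo "$vlan_table"
```

The last line shows the new bond0.234 interface, its VLAN ID, and the physical (here: bonded) device it rides on.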

Software RAID on old vs. new CPUs

The Linux kernel has several software RAID algorithms, and selects the one that is fastest on your CPU. Isn't that always the same algorithm, then? No, definitely not. Newer CPUs have additional instructions that help speed things up. And it's not just clock speed that matters; memory bandwidth plays an important role too.
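You can see which algorithm the kernel picked in the boot messages, where it logs its benchmark of each candidate. The snippet below replays an illustrative sample of those lines (the message format is what the kernel prints; the MB/s figures are invented for the example) and filters out the two "using" lines a live system would show with `dmesg | grep using`:

```shell
# Illustrative sample of the kernel's boot-time RAID benchmark output.
# On a live system:  dmesg | grep -E 'raid6|xor'
boot_log='raid6: sse2x2    2893 MB/s
raid6: sse2x4    3261 MB/s
raid6: using algorithm sse2x4 (3261 MB/s)
xor: using function: generic_sse (6523 MB/s)'
echo "$boot_log" | grep 'using'
```

The "raid6:" lines are the per-algorithm benchmark results; the "using" lines name the winners for the raid6 syndrome and raid5 xor calculations respectively.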

  • On an old Pentium II Xeon 450 MHz, raid5 uses p5_mmx, and raid6 uses mmxx2. Software raid6 calculations are 72% slower than raid5.
  • On a Pentium IV Xeon 1.5 GHz, raid5 uses pIII_sse, and raid6 uses sse2x2. Software raid6 calculations are 12% slower than raid5.
  • On an AMD Athlon XP2000+ (1.6 GHz), raid5 uses pIII_sse, raid6 uses sse1x2. Software raid6 calculations are 42% faster than raid5.
On 64-bit systems, no instructions relevant to the algorithm choice have differed between generations so far:
  • On an AMD Athlon64 XP3400 (2.4 GHz), raid5 uses generic_sse, raid6 uses sse2x4 (raid6 44% slower than raid5).
  • On a Xeon 5160 (3 GHz), raid5 uses generic_sse, raid6 uses sse2x4 (raid6 15% slower than raid5).
  • Same algorithms on a Xeon X5450 (3 GHz): raid6 20% slower than raid5.
  • Same algorithms on a Xeon E5430 (2.66 GHz): raid6 18% slower than raid5.
  • Same algorithms again on a Xeon X5650 (2.66 GHz): raid6 15% slower than raid5.