Tuesday, August 12, 2008

Oh No! VMware ESXi bug.

Interesting details on the bug affecting VMware. For me, it's specifically ESXi; I just killed ntpd for now.

--

As of tomorrow morning, VMs on all hosts running ESX 3.5U2 in enterprise configurations will not power on. VMotion/HA/DRS will probably also not work.

Boom.

Apparently, there is some bug in the VMware license management code. VMware is scrambling to figure out what happened and put out a patch.

Running VMs will not be immediately impacted.

There is a major discussion going on in the VMware Communities about the issue: http://communities.vmware.com/thread/162377?tstart=0

OK, while we're all remaining calm... just imagine the implications of bugs like this getting past QA testing. Five years down the road, if you believe a lot of forecasts, nearly all server apps worldwide will be running in VMs. Some country decides to initiate cyberwarfare and manages to get a backdoor into whatever the prevailing hypervisor of the day is... boom. All your VMs are belong to us.

I honestly think a lot of the hype from those who want to build a VM security industry is crap, but god protect us if the baseline code for critical hypervisors like ESX isn't kept secure and regularly audited.

I'd love to find out what happened here.

What regression testing on new releases does VMware do to check for date-based bugs? I'd think they'd at least check for simple things like changing the date to 1 year or 1 month in the future.
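To make that concrete, here is a minimal sketch of the kind of check I mean. Everything in it is made up for illustration: `license_valid`, `first_failing_date`, and the dates are hypothetical, and I have no idea how VMware's licensing code is actually structured. The only point is that the check takes "today" as an argument instead of reading the system clock, so a test can walk it through future dates.

```shell
#!/bin/sh
# Hypothetical date check, NOT VMware's code. Dates are YYYYMMDD strings,
# so they can be compared as plain integers.
license_valid() {
    today="$1"
    expires="$2"
    [ "$today" -le "$expires" ]
}

# Print the first simulated date on which the check fails.
# Returns non-zero if every simulated date passes.
first_failing_date() {
    expires="$1"
    shift
    for day in "$@"; do
        if ! license_valid "$day" "$expires"; then
            echo "$day"
            return 0
        fi
    done
    return 1
}
```

Calling `first_failing_date` with a cutoff and a list of dates a month and a year ahead is enough to expose a cutoff that somebody forgot about before the release ships.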

UPDATE: Frank Wegner has posted the following suggestions:

You can see the latest status here: http://kb.vmware.com/kb/1006716 Please check back often, because it will notify you when this issue has been fixed. Until then the best workaround I can think of is:

* Do nothing
* Turn DRS off
* Avoid VMotion
* Avoid powering off VMs

I'd counsel against turning DRS off, as that actually deletes resource pool settings. Instead, set sensitivity to 5, which should effectively disable it with minimal impact.

UPDATE 2: VMware Website appears to be having trouble keeping up with people requesting updates.

UPDATE 3: VMware has stated they will have fixes available in 36 hours at the earliest.

UPDATE 4: Anand Mewalal comments:

We used the following workaround to power on the VMs.
Find the host where a VM is located.
Run 'vmware-cmd -l' to list the VMs.
Issue the commands:
    service ntpd stop
    date -s 08/01/2008
    vmware-cmd /vmfs/volumes/vm path/vmname.vmx start
    service ntpd start
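
If you have more than a couple of VMs to bring up, the steps above could be wrapped in a small script. This is only a sketch: the `DRY_RUN` variable and the function names are my own invention, not part of any VMware tooling, and the .vmx path is whatever 'vmware-cmd -l' reports on your host. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the workaround above. DRY_RUN, run, and power_on_with_old_date
# are illustrative names, not VMware commands.
DRY_RUN="${DRY_RUN:-1}"   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

power_on_with_old_date() {
    vmx="$1"
    if [ -z "$vmx" ]; then
        echo "usage: power_on_with_old_date /path/to/vm.vmx" >&2
        return 1
    fi
    # Stop ntpd first so it does not interfere with the manual date change.
    run service ntpd stop || return 1
    run date -s 08/01/2008
    run vmware-cmd "$vmx" start
    status=$?
    # Restart ntpd even if the power-on failed.
    run service ntpd start
    return $status
}
```

Check the dry-run output first, then set DRY_RUN=0 on the host once it looks right. The function returns the exit status of the vmware-cmd call, so a failed power-on is still visible to whatever calls it.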

UPDATE 5: Apparently, there are no easily seen warnings in logs/etc or VC prior to hitting the bug. VC will continue to show the hosts as licensed and no errors will appear in vmkernel log file until you try to start up a new vm, reboot a vm, or reboot the host.

UPDATE 6: Welcome Slashdot readers! I've temporarily disabled comments to allow the server VM to handle the load. Apparently Movable Type 4.1 executes a separate Perl CGI script to handle comments on each page load. Load times might have been slow for the last 45 minutes, but should be OK now.

UPDATE 7: I made some minor corrections to this entry that others have requested.
