A forum for reverse engineering, OS internals and malware analysis 

Forum for discussion about kernel-mode development.
 #11263  by Kamala
 Wed Jan 25, 2012 1:51 pm
Hi,

I've run into an issue with "VMLaunch" that I am trying to track down. If I set a breakpoint at the first instruction in the guest before calling "VMLaunch" (opcode 0f 01 c2), I hit that breakpoint and the guest code then executes successfully. If I don't set that breakpoint, the guest hangs the moment I run "VMLaunch". Obviously, the int 3 trap handler in Windows is doing something that helps. Can anyone think of a reason why it might help? Thanks.

Kamala
 #11296  by feryno
 Thu Jan 26, 2012 1:34 pm
I think you must review VMCS settings - something is missing or misconfigured - at first try to check all guest-state fields

I encountered also some strange behaviour with my hypervisor, luckily solved that already.
Does your hypervisor share paging tables with the guest (host CR3 = guest CR3 at the first execution of VMLAUNCH)?
I always had hangs while sharing virtual address spaces. It was not only my hypervisor: I also tried another one with host_CR3=guest_CR3 and had hangs there too.
That ugly thing disappeared after I created private paging tables and private system structures for the hypervisor.
I'm loading it as a driver from a running ms win. Later I was also able to turn it off and resume the OS without a reboot (much less effort, although it is not documented in as much detail as creating and launching a hypervisor) - it saves a lot of time if you recompile several versions every day after implementing every small step.

My hypervisor driver (after I made it stable) does these things:
[0] preparing a few MB of nonpaged, aligned memory (MmAllocateContiguousMemorySpecifyCache + MmProbeAndLockPages + zeroing the memory): 2 MB without virtualization of guest memory, 4 MB when virtualizing guest memory.
[1] creating all necessary hypervisor things shared among all CPUs (paging tables, IDT, GDT, TSS, ...) + copying the hypervisor code from driver memory into the hypervisor memory allocated in step 0
[2] now the per-CPU private things:
2a. detecting CPU capabilities
2b. setting some MSRs
2c. setting up the vmxon region and doing vmxon
2d. setting up the VMCS region
2e. vmlaunch
2f. the guest continues under control of the hypervisor
[3] when all CPUs have finished step 2, unmapping the hypervisor memory from the guest paging tables (MmFreeContiguousMemory) so the guest cannot see it easily (the guest is still able to access the host if it knows the host physical memory address and calls MmGetSystemAddressForMdlSafe, or in another way using a DMA transfer; that can be defeated using memory virtualization and protection against DMA)

What helped in my case, it seems, was completely splitting the hypervisor memory from the guest. It was years ago, but to this day I still don't know why.
Maybe I only had a bug in the first design and rewriting the source file corrected the previous mistake, but I was unable to find anything suspicious.
The disadvantage of that design is the separation: it is not easy to detect what happened in the host in case of a malfunction. I solved that with a not very clever solution (I didn't find another choice, my development PC didn't have a serial port): not calling MmFreeContiguousMemory in step 3, storing debug data in hypervisor memory, storing a pointer in the driver pointing to the new location of the hypervisor memory, and analyzing the crash dump file. Later I developed a simple ring3 interface to read the whole hypervisor memory (it is contiguous physical memory) and display it under the guest (I was able to see some debug info stored in hypervisor memory at certain positions).
The nightmare for me was hangs like in your case (no crash dump file, no idea what went wrong).
> first instruction in guest before calling "VMLaunch"
is it like this?
int3
vmlaunch

I think the guest is running only after VMLAUNCH has executed successfully;
immediately before VMLAUNCH it is not yet the guest.

What is in zero flag and carry flag after failure of VMLAUNCH?
What is in VM-exit information field after VMLAUNCH failure?
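
For reference, the two flags asked about above encode the outcome of any VMX instruction: CF=1 means VMfailInvalid (no error number is available), and ZF=1 means VMfailValid (the error number is in the VM-instruction error field, VMCS encoding 4400h). A minimal C sketch of that check, assuming the EFLAGS value was captured right after VMLAUNCH fell through; the names `vmx_status` and `vmx_classify` are hypothetical and not from any code in this thread:

```c
#include <assert.h>
#include <stdint.h>

/* Outcome of a VMX instruction as reported in EFLAGS. */
typedef enum {
    VMX_SUCCEED,       /* CF = 0 and ZF = 0 */
    VMX_FAIL_INVALID,  /* CF = 1: no current VMCS, no error number available */
    VMX_FAIL_VALID     /* ZF = 1: error number is in VMCS field 4400h */
} vmx_status;

#define EFLAGS_CF (1u << 0)
#define EFLAGS_ZF (1u << 6)

/* Hypothetical helper: classify an EFLAGS value saved (e.g. with
   pushfd / pop) immediately after the VMX instruction. */
vmx_status vmx_classify(uint32_t eflags)
{
    if (eflags & EFLAGS_CF)
        return VMX_FAIL_INVALID;
    if (eflags & EFLAGS_ZF)
        return VMX_FAIL_VALID;
    return VMX_SUCCEED;
}
```

Note that a successful VMLAUNCH does not fall through to the next instruction at all, so in practice only the two failure values are seen at that point.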
 #11303  by Kamala
 Thu Jan 26, 2012 7:13 pm
Hi,

Thanks for your response!

>I think you must review VMCS settings - something is missing or misconfigured - at first try to check all guest-state fields
I wouldn't be surprised if I need some tweak there as well except I am not able to spot which one just yet. For a bit I was worried about the EFlags.

>I encountered also some strange behaviour with my hypervisor, luckily solved that already.
I have two modes: one shares pages and the other one does not. I encounter this hang in both cases.

> is it like this?
> int3
> vmlaunch

It's the other way around. vmlaunch always succeeds. When the first instruction in guest is int 3 or if I keep a breakpoint in the debugger, all is well.

So, if I can pinpoint the one thing Windows trap 03 handler does that is helping, that would get us closer to solving the problem. Is there anything you can think of that the 03 trap handler might be doing that could help here?

> What is in zero flag and carry flag after failure of VMLAUNCH?

VMLaunch never fails.

> What is in VM-exit information field after VMLAUNCH failure?

It exits because of guest double fault. Sometimes it is the guest eip and sometimes it is the guest stack that's the culprit.
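
A VM exit caused by a guest exception reports the details in the VM-exit interruption-information field (VMCS encoding 4404h): vector in bits 7:0, type in bits 10:8, "error code valid" in bit 11 and "valid" in bit 31. A minimal C sketch of decoding it and recognizing a double fault (vector 8); the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define INTR_TYPE_HW_EXCEPTION 3u  /* type value for hardware exceptions */
#define VECTOR_DOUBLE_FAULT    8u  /* #DF */

typedef struct {
    uint8_t vector;          /* bits 7:0  */
    uint8_t type;            /* bits 10:8 */
    int     has_error_code;  /* bit 11    */
    int     valid;           /* bit 31    */
} exit_intr_info;

/* Split the raw 32-bit field (as returned by VMREAD of 4404h). */
exit_intr_info decode_exit_intr_info(uint32_t raw)
{
    exit_intr_info i;
    i.vector         = (uint8_t)(raw & 0xFFu);
    i.type           = (uint8_t)((raw >> 8) & 0x7u);
    i.has_error_code = (int)((raw >> 11) & 1u);
    i.valid          = (int)((raw >> 31) & 1u);
    return i;
}

int is_double_fault(uint32_t raw)
{
    exit_intr_info i = decode_exit_intr_info(raw);
    return i.valid && i.type == INTR_TYPE_HW_EXCEPTION &&
           i.vector == VECTOR_DOUBLE_FAULT;
}
```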

Do you have any insight into what might be going on, given that the guest double faults after executing some code when vmlaunch is called without int 3 as the first instruction executed in the guest? I appreciate your help. Thanks.

Kamala
 #11305  by Kamala
 Fri Jan 27, 2012 9:58 am
Addendum to my last response -

I had this thought while reading through something relevant:

Triggering that initial guest breakpoint does create a trap frame, and maybe that makes all the difference given where I fail otherwise. The double fault happens around the area where sysexit happens or int 2* is called in the guest, which almost implies that the trap frame created at that time gets corrupted when we fail, but setting a breakpoint fixes that issue.

Does that make sense?

Kamala
 #11309  by feryno
 Fri Jan 27, 2012 2:34 pm
Maybe the int3 handler sanitizes some system thing, maybe some selector... Maybe the sanitizing is not done immediately after int3 but is delayed until the OS reloads it in SwapContext (only guessing). Really very strange behaviour, not easy to explain.
I see that it is a 32 bit intel hypervisor, so I attached some disassembled parts of a 32 bit intel hypervisor skeleton which runs perfectly for me.
Perhaps you will find some ideas there about what you set up differently.
PM me if you want to discuss anything suspicious or unclear that you find in the attachment.
[Attachment, 12.66 KiB]
 #11341  by feryno
 Mon Jan 30, 2012 10:00 am
Hi,
I also remembered another thing which may help you.

In one sentence - you should also review your vm exit handler.

The story behind it:
My primary development OS was win 2003 server x64 (this was about 3 years ago), and a minimalistic hypervisor worked there like a charm. I also tested it under win 2008 server x64, and there I always had hangs after a nondeterministic time, usually less than 10 minutes after launching the hypervisor (running ring3 application -> installing driver -> calling driver -> launching the hypervisor).
Obviously it was caused by a bug in my code (but I didn't know that until I discovered the bug).
The bug was silent in win 2003 server x64 and showed up only in win 2008 server x64 thanks to differences in the SwapContext procedure.
The bug was in the hypervisor procedure handling the vm exit caused by a write to the CR3 register (the hypervisor was designed to observe some things in the guest OS and needed to watch CR3 changes). It was caused by reversing pop rcx \ pop rax in the procedure epilogue, so the hypervisor wrote the right value into CR3 but then exchanged the rcx and rax registers, and the guest was resumed with rax and rcx swapped.
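
To illustrate that kind of bug, here is a small C simulation of a save/restore pair. It assumes a prologue of push rax \ push rcx (the actual prologue is not shown in this thread), and all names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t rax, rcx; } regs;
typedef struct { uint64_t slot[4]; int top; } stack;

void stk_push(stack *s, uint64_t v) { assert(s->top < 4); s->slot[s->top++] = v; }
uint64_t stk_pop(stack *s) { assert(s->top > 0); return s->slot[--s->top]; }

/* Assumed prologue: push rax \ push rcx.
   The correct epilogue pops in reverse order: pop rcx \ pop rax.
   The buggy epilogue pops in the same order as the pushes. */
regs save_restore(regs in, int buggy)
{
    stack s = { {0}, 0 };
    regs out;
    stk_push(&s, in.rax);
    stk_push(&s, in.rcx);
    if (buggy) {
        out.rax = stk_pop(&s);  /* pop rax: receives the saved rcx */
        out.rcx = stk_pop(&s);  /* pop rcx: receives the saved rax */
    } else {
        out.rcx = stk_pop(&s);  /* pop rcx */
        out.rax = stk_pop(&s);  /* pop rax */
    }
    return out;
}
```

With the buggy order the guest resumes with the two values exchanged, which only matters on code paths where both registers are live across the exit. That would fit a bug that stays silent on one OS version and shows up on another.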
So I also suggest you validate that your vm exit handler is working well:
Try to intercept only unconditional vm exits (set as many bits of the VM-execution controls to 0 as you can) to reduce vm exits to a minimum.
Then, as the first instruction in the guest, try an unconditional vm exit.
So in the sample I attached previously, these instructions would appear starting at address 00010812

00010812 xor eax,eax
00010814 cpuid ; unconditional vm exit

In your hypervisor it should be at the address where you place your int3. Using the debugger, place a breakpoint after the CPUID instruction and run that.

I observed that CPUID is executed quite frequently in ms win (I added a counter to the hypervisor for that), and I never encountered INVD from running ms win x64.
Current versions of ms win also don't execute getsec, xsetbv, or VMX instructions (you can't run hyper-V, because then you can't run your hypervisor).
So you will intercept only vm exits caused by CPUID.
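
A CPUID-only exit handler can be very small. The sketch below is a user-mode C model, not real hypervisor code: the guest registers and EIP are plain struct fields here, whereas a real handler would save and restore the registers itself and use VMREAD/VMWRITE for the exit reason (4402h), the VM-exit instruction length (440Ch) and the guest RIP (681Eh). The names `guest_regs`, `handle_vm_exit` and the stand-in `fake_cpuid` are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define EXIT_REASON_CPUID 10u  /* basic exit reason for CPUID */

typedef struct {
    uint32_t eax, ebx, ecx, edx;
    uint32_t eip;
} guest_regs;

typedef void (*cpuid_fn)(uint32_t leaf, uint32_t subleaf, uint32_t out[4]);

/* Stand-in for executing CPUID on the host; returns recognizable values. */
void fake_cpuid(uint32_t leaf, uint32_t subleaf, uint32_t out[4])
{
    out[0] = leaf + 1; out[1] = 0xB; out[2] = subleaf + 0xC; out[3] = 0xD;
}

/* Returns 1 if the exit was handled, 0 if it is not a CPUID exit. */
int handle_vm_exit(uint32_t exit_reason, uint32_t instr_len,
                   guest_regs *g, cpuid_fn do_cpuid)
{
    uint32_t out[4];
    if ((exit_reason & 0xFFFFu) != EXIT_REASON_CPUID)  /* bits 15:0 = basic reason */
        return 0;
    do_cpuid(g->eax, g->ecx, out);
    g->eax = out[0]; g->ebx = out[1]; g->ecx = out[2]; g->edx = out[3];
    g->eip += instr_len;  /* skip the CPUID instruction before VMRESUME */
    return 1;
}
```

Forgetting to advance the guest instruction pointer, or clobbering a guest register that CPUID does not write, are the kinds of mistakes this first test is meant to expose.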

If that passes OK, then your minimalistic vm exit handler is healthy, and you may then try e.g. to intercept writes to control registers.
Set the dedicated bits in the VM-execution controls to 1 and try instructions like
00010812 mov ecx,cr3 ; read CR3
00010815 mov cr3,ecx ; write the same value back, so the guest keeps running

Because you are able to trace the guest with the kernel debugger, you can easily detect possible mistakes in your vm exit handler.
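
For CR3, the dedicated bits are "CR3-load exiting" (bit 15) and "CR3-store exiting" (bit 16) of the primary processor-based VM-execution controls, and the resulting exit has basic reason 28 (control-register access). The handler then has to decode the exit qualification to learn which CR and which general-purpose register are involved. A minimal C sketch of that decoding, with hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    CR_MOV_TO_CR   = 0,
    CR_MOV_FROM_CR = 1,
    CR_CLTS        = 2,
    CR_LMSW        = 3
} cr_access_type;

typedef struct {
    unsigned       cr;    /* bits 3:0:  control register number      */
    cr_access_type type;  /* bits 5:4:  access type                  */
    unsigned       gpr;   /* bits 11:8: 0=EAX 1=ECX 2=EDX 3=EBX ...  */
} cr_access;

/* Decode the exit qualification of a control-register access exit. */
cr_access decode_cr_access(uint32_t qual)
{
    cr_access a;
    a.cr   = qual & 0xFu;
    a.type = (cr_access_type)((qual >> 4) & 0x3u);
    a.gpr  = (qual >> 8) & 0xFu;
    return a;
}
```

Using the wrong register index when emulating the move is another way to resume the guest with a corrupted register.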
 #11352  by Kamala
 Mon Jan 30, 2012 4:31 pm
Hi,

Thanks for the detailed note! I did make some progress in narrowing down the cause but not in solving it.

I tried to focus on the instruction executed within the guest just before the double fault. It ended up being one of two things. For the sake of this discussion I will focus on one of the issues: the int 2b call. Any time the int 2b instruction is executed within the guest from user mode, it results in a double fault. So I am assuming the fault happens when the processor tries to switch to the kernel stack, the value of which it gets from the task state structure. When I look at the task state structure, the kernel stack value looks valid. Does this give any clue as to what else might be going wrong around this time? Thanks.
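
For context on that stack switch: on an interrupt from ring 3 to ring 0, the CPU reads SS0:ESP0 from the 32-bit TSS at the TR base it currently holds, and under VMX that base comes from the guest TR base field of the VMCS rather than from a fresh GDT lookup. A minimal C sketch that extracts the ring-0 stack from a raw TSS image, for example one dumped from the address stored in the guest TR base field; the names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t esp0;  /* 32-bit TSS offset 4 */
    uint16_t ss0;   /* 32-bit TSS offset 8 */
} ring0_stack;

/* Read SS0:ESP0 from the first bytes of a 32-bit TSS image (little endian). */
ring0_stack tss32_ring0_stack(const uint8_t *tss)
{
    ring0_stack r;
    r.esp0 = (uint32_t)tss[4]        | ((uint32_t)tss[5] << 8) |
             ((uint32_t)tss[6] << 16) | ((uint32_t)tss[7] << 24);
    r.ss0  = (uint16_t)(tss[8] | (tss[9] << 8));
    return r;
}
```

Comparing the result for the TSS that the GDT descriptor points to with the result for the TSS at the VMCS guest TR base is one way to check that both refer to the same structure.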

Kamala
 #11385  by feryno
 Wed Feb 01, 2012 2:05 pm
Hi,

Do you have at least 1 vmexit to your hypervisor and then vmresume to the guest and then guest ring3 executes the int 2B ?

During hypervisor startup you only have to write 4 VMCS fields concerning the guest task:
Code:
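str ecx ; ecx = current TR selector
mov eax,80Eh ; guest TR selector VMCS encoding
vmwrite eax,ecx

lsl eax,ecx ; eax = limit of the TSS segment
mov edx,480Eh ; guest TR limit VMCS encoding
vmwrite edx,eax

lar eax, ecx ; eax = access rights of the TSS descriptor, ZF=1 on success
jnz L0
shr eax, 8 ; move access byte and flags down to bits 15:0
and eax, 0F0FFh ; clear bits 11:8 (reserved in the VMCS access rights format)
jmp L1
L0:
mov eax,10000h ; LAR failed: mark the segment unusable (bit 16)
L1:
mov edx,4822h ; guest TR access rights VMCS encoding
vmwrite edx,eax

push edx
push eax ; reserve 8 bytes on the stack
sgdt [esp+2] ; GDT limit goes to [esp+2], GDT base to [esp+4]
pop eax ; high word of eax = GDT limit (not used)
pop edx
; edx=guest_GDT_base
and cl,0F8h ; selector -> byte offset of the descriptor in the GDT
mov eax,[edx+ecx*1+2] ; descriptor bytes 2..5
mov ecx,[edx+ecx*1+4] ; descriptor bytes 4..7
and eax,0FFFFFFh ; base bits 23:0
and ecx,0FF000000h ; base bits 31:24
or eax,ecx
; eax = TR base
mov edx,6814h ; guest TR base VMCS encoding
vmwrite edx,eax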
During hypervisor startup, you don't have to change anything in the guest TSS, the guest task descriptor, or the guest task register; you just have to grab them and vmwrite them into the VMCS.
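
The two computed values (TR base and TR access rights) can be cross-checked offline against a dumped descriptor. The following C sketch mirrors that arithmetic; it assumes you have the 8 raw bytes of the TSS descriptor and the raw 32-bit result of LAR, and the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Base address from a legacy 8-byte segment descriptor:
   bytes 2..4 hold base bits 23:0, byte 7 holds base bits 31:24. */
uint32_t descriptor_base(const uint8_t d[8])
{
    return (uint32_t)d[2] | ((uint32_t)d[3] << 8) |
           ((uint32_t)d[4] << 16) | ((uint32_t)d[7] << 24);
}

/* Convert a 32-bit LAR result into the VMCS access-rights format.
   lar_ok is nonzero if LAR set ZF (the selector was valid). */
uint32_t lar_to_vmcs_access_rights(uint32_t lar_result, int lar_ok)
{
    if (!lar_ok)
        return 0x10000u;                 /* bit 16: segment unusable */
    return (lar_result >> 8) & 0xF0FFu;  /* bits 11:8 are reserved */
}
```

For a busy 32-bit TSS the resulting access rights have type 11 (1011b) in bits 3:0, for example 8Bh.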

There is also no need to change anything in the guest task fields during vmexit/vmresume or when handling vm exits. There is not even a need to access the guest task VMCS fields (unless you are doing some exotic things).

You would only need to change something in the task descriptor if you executed the LTR instruction, but there is no need to execute LTR at Intel hypervisor startup or in the Intel VM exit handler. On Intel, LTR is required only when leaving virtualization after VMXOFF and then resuming the OS. LTR plus clearing the busy bit is also required for the HOST task after every AMD64 SVM vmexit, but not on Intel. Executing LTR changes an available task to a busy task, and then it is necessary to make it available again, in a way like:
Code:
; task is now available
mov eax,task_selector
ltr ax ; load the task, this changes the available task to busy (one bit of the descriptor changes from 0 to 1)
; now making it available again (to allow the LTR instruction to load it again)
push edx
push eax
sgdt [esp+2]
pop eax ; low word = selector, high word = GDT limit written by sgdt
pop edx
; edx=guest_GDT_base
movzx ecx,ax ; take only the selector, drop the GDT limit in the high word
and cl,0F8h ; erase low 3 bits, just to be sure
and byte [edx+ecx*1+5],0FDh	; not 0010b = 0FDh
; now the task is available again and you are allowed to execute LTR AX (executing LTR with a busy task raises #GP)
During vm entry, the CPU checks the type and requires it to be busy, not available.
But again, I never had to change anything in the guest task. I only had to change busy to available when turning virtualization off (a lot of VMREADs, then VMXOFF, and then restoring the OS from the values obtained by the VMREADs before VMXOFF).
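
The busy bit lives in the type field of the descriptor's access byte (byte 5): type 1001b is an available 32-bit TSS and 1011b is a busy one, so the difference is bit 1. A minimal C sketch of testing and clearing it on a raw 8-byte descriptor copy; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define TSS_BUSY_BIT 0x02u  /* bit 1 of the access byte (descriptor byte 5) */

/* Nonzero if the access byte describes a busy 32-bit TSS (type 1011b). */
int tss_is_busy(const uint8_t d[8])
{
    return (d[5] & 0x0Fu) == 0x0Bu;
}

/* Same effect as the "and byte [..+5],0FDh" in the assembly above. */
void tss_make_available(uint8_t d[8])
{
    d[5] &= (uint8_t)~TSS_BUSY_BIT;
}
```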
 #11409  by Kamala
 Thu Feb 02, 2012 5:36 pm
Hi,

Your description with respect to the task is very close to the problem area, so we are getting closer to solving it. Thanks for your response along those lines. After reading your note, I did try a few changes to the task structure/state, but the problem still persists. I have a few other things to try in the same area based on your note, which I will try and then get back to you.

I still feel that I am not asking the right questions, which is why the problem is elusive. Let me try the rest of the task changes and get back with a different set of questions if the problem still persists. Thanks.

Kamala