A forum for reverse engineering, OS internals and malware analysis 

Forum for discussion about kernel-mode development.
 #18569  by zico_guru
 Mon Mar 18, 2013 8:06 am
Hello,
I am trying to develop a branch trace store (BTS) logger for my x64 Windows 7 SP1 machine. I have it working on my machine and it records the recent branches. The problem is that it records globally, i.e. for every process in the system, and I want it for just a single process. Is it possible, by hooking KiSwapContext, to issue "WriteMSR(DEBUGCNTRL,0xc0,0); //Enabling TR and BTS" only for the process that I want? My Windows internals skills are not so good. If this is possible, I hope to be able to log every jmp, call and j* instruction. It will be very helpful for live debugging.
Branch trace store details can be found in Intel SDM Vol. 3, section 17.4.
Code:
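#include <wdm.h>

#pragma pack(1)

// MSR indices (Intel SDM Vol. 3). Only IA32_DS_AREA and DEBUGCNTRL are used below.
#define IA32_MISC_ENABLE 0x1A0
#define IA32_PERF_CAPABILITIES 0x345
#define MSR_LASTBRANCH_0_FROM_IP 0x680  // 16 contiguous MSRs
#define MSR_LASTBRANCH_TOS 0x1c9

#define MSR_LER_FROM_LIP 0x1dd
#define MSR_LER_TO_LIP   0x1de

#define MSR_LASTBRANCH_0_TO_IP 0x6c0 // 16 contiguous MSRs

#define IA32_DS_AREA 0x600

#define DEBUGCNTRL 0x1d9

#define MSR_LASTBRANCH_FROM 0x1db
#define MSR_LASTBRANCH_TO 0x1dc

#define BTS_RECORD_SIZE 24        // from, to, flags: 3 x 8 bytes in 64-bit mode
#define DS_HEADER_SIZE  0x100     // management area; recording starts after it
#define DS_ALLOC_SIZE   0x400000  // 4 MB for the header plus the records

typedef struct _CR_REGS
{
	 ULONG64 CR0;
	 ULONG64 CR2;
	 ULONG64 CR3;
	 ULONG64 CR4;
	 ULONG64 CR8;
}CR_REGS , *PCR_REGS;

// Implemented in a separate assembly file.
extern VOID GetCRSet(PCR_REGS) ;
extern VOID EnablePCE();
extern ULONG64 ReadMSR(ULONG MSRIndex);
extern VOID WriteMSR ( ULONG Index, ULONG LowPart,ULONG HiPart);

static PUCHAR g_DsArea = NULL;   // kept so that DriverUnload can free it

VOID DriverUnload(PDRIVER_OBJECT pDriverObject)
{
	UNREFERENCED_PARAMETER(pDriverObject);
	DbgPrint("Unload Called!!!!\n");

	if (g_DsArea != NULL)
	{
		// Stop tracing on processor 0 before the buffer goes away.
		KAFFINITY oldAffinity = KeSetSystemAffinityThreadEx(1);
		WriteMSR(DEBUGCNTRL, 0, 0);
		WriteMSR(IA32_DS_AREA, 0, 0);
		KeRevertToUserAffinityThreadEx(oldAffinity);

		ExFreePoolWithTag(g_DsArea, 'BTS0');
		g_DsArea = NULL;
	}
}

NTSTATUS DriverEntry(PDRIVER_OBJECT pDriverObject,PUNICODE_STRING RegistryPath)
{
	KAFFINITY oldAffinity;
	PUCHAR bts_buffer_base, bts_absolute_max;
	PULONG64 ds_area;
	ULONG64 record_count;

	UNREFERENCED_PARAMETER(RegistryPath);
	pDriverObject->DriverUnload=DriverUnload;

	g_DsArea=(PUCHAR) ExAllocatePoolWithTag(NonPagedPool ,DS_ALLOC_SIZE ,'BTS0');
	if (g_DsArea == NULL)
		return STATUS_INSUFFICIENT_RESOURCES;

	DbgPrint("Loading DS_AREA %p\n",g_DsArea);
	memset(g_DsArea,0,DS_ALLOC_SIZE);

	// Recording starts after the header. The end is rounded down to a whole
	// number of records so the CPU never writes past the allocation.
	bts_buffer_base=g_DsArea+DS_HEADER_SIZE;
	record_count=(DS_ALLOC_SIZE-DS_HEADER_SIZE)/BTS_RECORD_SIZE;
	bts_absolute_max=bts_buffer_base+record_count*BTS_RECORD_SIZE;

	DbgPrint("Setting up bts_buffer_base\n") ;

	ds_area=(PULONG64)g_DsArea;
	ds_area[0]=(ULONG64)bts_buffer_base;   // +0x00 BTS buffer base
	ds_area[1]=(ULONG64)bts_buffer_base;   // +0x08 BTS index (next record to write)
	ds_area[2]=(ULONG64)bts_absolute_max;  // +0x10 BTS absolute maximum
	ds_area[3]=(ULONG64)bts_absolute_max;  // +0x18 BTS interrupt threshold (unused: BTINT stays clear)

	DbgPrint("Loading DS Save Area in Processor 0\n");
	oldAffinity=KeSetSystemAffinityThreadEx(1);   // run on processor 0 only

	WriteMSR ( IA32_DS_AREA, (ULONG)((ULONG64) g_DsArea & 0xffffffff),(ULONG)((ULONG64)g_DsArea>>32));

	WriteMSR(DEBUGCNTRL,0xc0,0);  //Enabling TR and BTS 

	KeRevertToUserAffinityThreadEx(oldAffinity);
	return STATUS_SUCCESS;
}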
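For reading the log back, here is a minimal sketch in portable C. It assumes the 64-bit DS layout used above (four 8-byte header fields, 24-byte records) and assumes the buffer has not wrapped yet (with BTINT clear the CPU treats the buffer as circular). The type and function names (ds_bts_header, bts_record, bts_capacity, bts_collect) are made up for illustration and are not part of any Windows or Intel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One BTS record in 64-bit mode: three 8-byte fields. */
typedef struct {
    uint64_t from;   /* linear address of the branch instruction */
    uint64_t to;     /* linear address of the branch target */
    uint64_t flags;  /* misc flags, e.g. the branch-predicted bit */
} bts_record;

/* First four fields of the 64-bit DS management area (offsets 0x00..0x18). */
typedef struct {
    uint64_t bts_buffer_base;
    uint64_t bts_index;
    uint64_t bts_absolute_maximum;
    uint64_t bts_interrupt_threshold;
} ds_bts_header;

/* Number of records the buffer can hold in total. */
size_t bts_capacity(const ds_bts_header *h)
{
    assert(h != NULL);
    return (size_t)((h->bts_absolute_maximum - h->bts_buffer_base)
                    / sizeof(bts_record));
}

/* Copy the records written so far (base .. index) into out.
 * Assumption: the buffer has not wrapped, so everything below
 * bts_index is valid and in chronological order. */
size_t bts_collect(const ds_bts_header *h, bts_record *out, size_t max_out)
{
    assert(h != NULL && out != NULL);
    const bts_record *rec = (const bts_record *)(uintptr_t)h->bts_buffer_base;
    const bts_record *end = (const bts_record *)(uintptr_t)h->bts_index;
    size_t n = 0;

    while (rec < end && n < max_out)
        out[n++] = *rec++;
    return n;
}
```

In the driver, the header would be the first 0x20 bytes of the allocated DS area, read with tracing stopped so that the index does not move while the records are copied.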
  
Thanking you...
 #18573  by feryno
 Mon Mar 18, 2013 1:06 pm
Hi zico_guru,
A few years ago I found in Windows Server 2003 SP1 x64 that DebugCtl.LBR and DebugCtl.BTF are shadowed in some rarely used bits of DR7: DR7.LE (bit 8) shadows LBR and DR7.GE (bit 9) shadows BTF.
DR7 is not global but thread specific (saved in / restored from the thread context).
You can tweak DR7 from user mode (GetThreadContext / SetThreadContext).
It was first implemented for AMD processors on x64 versions of MS Windows; some newer versions implement it for Intel as well (I can't remember the exact version off the top of my head, maybe Windows Server 2008 R2 x64).
I doubt it is implemented in the 32-bit versions. I never checked that (I left 32 bits years ago, I'm interested only in 64 bits). Try disassembling something like this to check:
KiSaveDebugRegisterState
(The same may perhaps be observed in more procedures; I like the one above only because that is where I found the shadowing.)
Here are fragments from x64 MS Windows, with comments:
Code:
.text:00000001400723B0 KiSaveDebugRegisterState proc near
...
.text:00000001400723D8                 mov     rdx, dr7
...
.text:00000001400723F6                 test    dx, 300h                 ; check whether bit 9 or bit 8 is set (DR7.GE, DR7.LE)
.text:00000001400723FB                 jz      short loc_140072473                 ; none of them set to 1, nothing to do then
.text:00000001400723FD                 mov     r8d, cs:KiLastBranchTOSMSR                 ; get value of MSR (value is stored here during CPU initialization)
.text:0000000140072404                 or      r8d, r8d
.text:0000000140072407                 jz      short loc_140072411                 ; CPU does not have TOS for branching feature
.text:0000000140072409                 mov     ecx, r8d
.text:000000014007240C                 rdmsr                 ; get TOS
.text:000000014007240E                 mov     r8d, eax                 ; save TOS
.text:0000000140072411
.text:0000000140072411 loc_140072411: 
.text:0000000140072411                 mov     ecx, cs:KiLastBranchFromBaseMSR                 ; value for this MSR may differ among AMD and various versions of Intel CPUs, right value saved into KiLastBranchFromBaseMSR during early CPU detection and initialization
.text:0000000140072417                 add     ecx, r8d                 ; add TOS (implemented only at Intel CPUs)
.text:000000014007241A                 rdmsr
.text:000000014007241C                 mov     [rbp+98h], eax
.text:0000000140072422                 mov     ecx, cs:KiLastBranchToBaseMSR
.text:0000000140072428                 mov     [rbp+9Ch], edx
.text:000000014007242E                 add     ecx, r8d
.text:0000000140072431                 rdmsr
.text:0000000140072433                 mov     [rbp+90h], eax
.text:0000000140072439                 mov     [rbp+94h], edx
.text:000000014007243F                 mov     ecx, cs:KiLastExceptionFromBaseMSR
.text:0000000140072445                 rdmsr
.text:0000000140072447                 mov     [rbp+0A8h], eax
.text:000000014007244D                 mov     [rbp+0ACh], edx
.text:0000000140072453                 mov     ecx, cs:KiLastExceptionToBaseMSR
.text:0000000140072459                 rdmsr
.text:000000014007245B                 mov     [rbp+0A0h], eax
.text:0000000140072461                 mov     [rbp+0A4h], edx
.text:0000000140072467                 mov     ecx, 1D9h
.text:000000014007246C                 rdmsr
.text:000000014007246E                 and     eax, 0FFFFFFFCh                 ; erase DebugCtl.BTF, DebugCtl.LBR
.text:0000000140072471                 wrmsr
.text:0000000140072473                 test    word ptr [r9+208h], 355h                 ; new DR7 to be loaded, 355h is mask for GE, LE, L3, L2, L1, L0 bits
.text:000000014007247D                 jz      short locret_1400724EB                 ; when all above bits will be loaded with zero then no need to load DR0,DR1,DR2,DR3, DebugCtlMSR
.text:000000014007247F                 mov     rax, [r9+1E0h]
.text:0000000140072486                 mov     rdx, [r9+1E8h]
.text:000000014007248D                 mov     dr0, rax
.text:0000000140072490                 mov     dr1, rdx
.text:0000000140072493                 mov     rax, [r9+1F0h]
.text:000000014007249A                 mov     rdx, [r9+1F8h]
.text:00000001400724A1                 mov     dr2, rax
.text:00000001400724A4                 mov     dr3, rdx
.text:00000001400724A7                 mov     rdx, [r9+208h]                 ; value of DR7 to load
.text:00000001400724AE                 xor     eax, eax
.text:00000001400724B0                 mov     dr6, rax
.text:00000001400724B3                 mov     dr7, rdx
.text:00000001400724B6                 test    byte ptr gs:4D4Ah, 2
.text:00000001400724BF                 jz      short locret_1400724EB
.text:00000001400724C1                 test    dx, 200h                 ; is there an attempt to load DR7 with GE bit = 1 (shadow for DebugCtl.BTF)
.text:00000001400724C6                 jz      short loc_1400724CB
.text:00000001400724C8                 or      eax, 2                 ; when DR7.GE=1 then DebugCtlMSR.BTF=1
.text:00000001400724CB                 test    dx, 100h                 ; is new DR7.LE = 1 ?
.text:00000001400724D0                 jz      short loc_1400724D5
.text:00000001400724D2                 or      eax, 1                 ; when DR7.LE=1 then DebugCtlMSR.LBR=1
.text:00000001400724D5                 test    eax, eax
.text:00000001400724D7                 jz      short locret_1400724EB
.text:00000001400724D9                 mov     r8d, eax
.text:00000001400724DC                 mov     ecx, 1D9h
.text:00000001400724E1                 rdmsr
.text:00000001400724E3                 and     eax, 0FFFFFFFCh                 ; erase BTF, LBR bits
.text:00000001400724E6                 or      eax, r8d                 ; set BTF, LBR bits to be equal DR7.GE, DR7.LE
.text:00000001400724E9                 wrmsr                 ; and write MSR
.text:00000001400724EB                 retn    0
Once you set the required bits, how do you clear them again? (Otherwise you get the feeling that the setting is shared globally and is not thread specific.) If the next thread has DR7=0 (and most threads have DR7=0), then for performance reasons MS Windows skips loading the rest of the debugging state, so LBR and BTF stay set to 1.
To reset them back to 0, the new thread must have at least one of DR7.L3, L2, L1, L0 set to 1 and GE, LE set to 0.
That is required by this fragment from the listing above:
Code:
.text:0000000140072473                 test    word ptr [r9+208h], 355h
.text:000000014007247D                 jz      short locret_1400724EB
this performance optimization complicates our lives
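The two-instruction check above can be written out as a sketch in portable C. It models only the 355h mask test from the fragment, nothing else from the routine, and the helper names (dr7_triggers_reload, dr7_reset_value) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DR7_L0 (1u << 0)
#define DR7_L1 (1u << 2)
#define DR7_L2 (1u << 4)
#define DR7_L3 (1u << 6)
#define DR7_LE (1u << 8)
#define DR7_GE (1u << 9)

/* The mask tested by "test word ptr [r9+208h], 355h". */
#define DR7_RELOAD_MASK (DR7_L0 | DR7_L1 | DR7_L2 | DR7_L3 | DR7_LE | DR7_GE)

/* True if a thread with this saved DR7 takes the path that reloads
 * DR0..DR3, DR6 and DR7; false if the kernel takes the shortcut. */
bool dr7_triggers_reload(uint64_t dr7)
{
    return (dr7 & DR7_RELOAD_MASK) != 0;
}

/* A DR7 value that forces the reload path while keeping GE and LE clear.
 * Note that L0 also arms breakpoint 0 at whatever address DR0 holds. */
uint64_t dr7_reset_value(void)
{
    return DR7_L0;
}
```

A thread with DR7=0 never reaches the reload path, which is why a value like the one from dr7_reset_value is needed to get the shadow bits rewritten.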
 #18574  by zico_guru
 Mon Mar 18, 2013 2:48 pm
Basically I am trying to monitor, or trace, the execution of a selected process. That means when the process is selected for execution, the trace logger starts logging, and when the process is swapped out or another process is selected, I want to turn my logger off. I know I have to hook a thread dispatching function, but I am confused about which one.
 #18590  by feryno
 Tue Mar 19, 2013 1:22 pm
The MSRs for the branching features differ among CPUs.
MSR_LASTBRANCH_0_FROM_IP may be 40h on some Intel CPUs, and the number of branch pairs may be less than 16.
The top of stack (TOS) for the branching feature does not even exist on older Intel CPUs and was never implemented on AMD (only 1 pair, LastBranchFrom / LastBranchTo).
But this is not a problem if you only want it to run on your own CPU.

If you set DebugCtl.BTS=1 too early (before the OS loads DR7 and the DebugCtl MSR), it is erased by the OS if DR7.GE=0 (bit 9 of DR7) in the task being loaded (the task you want to record).

The OS enables the branching feature for kernel tasks (e.g. after a BSOD caused by some driver you can obtain LastBranchFrom / To from the crash dump file). If such a task interrupts your recording and you try to continue recording after your task is rescheduled, you will have a different MSR_LASTBRANCH_TOS.
MSR_LASTBRANCH_TOS is a read-only MSR.
The OS stores only the last pair (the one TOS points to) even if your CPU has 16 pairs.
The OS does not use the TR bit of DebugCtl (so there the OS won't interfere with your recording); the OS will interfere with the BTF bit.

You must install the hook after the point where the OS loads the new DR7 and branching MSRs during a task switch.

Try installing the hook somewhere at the end of KiSwapContext.
If that doesn't work correctly (let us know whether the problem is "bits set to 1 in the MSR are erased to 0 by the OS" or something like "bits set to 1 in the MSR stay persistent for all tasks"), then try this approach:

Scan your kernel (ntoskrnl.exe) for occurrences of
mov ecx,000001D9h (hex bytes B9 D9 01 00 00)
followed by:
wrmsr (hex bytes 0F 30)
Then determine whether that code is executed during task switching.
You can install the hook after the WRMSR to MSR_DebugCtl.

It is quite complicated to implement (PatchGuard may detect the hook).
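The scan described above can be sketched in portable C over a byte buffer holding a copy of the kernel image. The function name find_debugctl_wrmsr and the window parameter are assumptions made for illustration. A raw byte scan can also match data or the middle of another instruction, so every hit still has to be checked in a disassembler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

static const uint8_t MOV_ECX_DEBUGCTL[] = { 0xB9, 0xD9, 0x01, 0x00, 0x00 };
static const uint8_t WRMSR_OPCODE[]     = { 0x0F, 0x30 };

/* Search image[start..size) for "mov ecx,1D9h" and return the offset of the
 * first wrmsr that begins within `window` bytes after it, or NOT_FOUND. */
size_t find_debugctl_wrmsr(const uint8_t *image, size_t size,
                           size_t start, size_t window)
{
    assert(image != NULL);
    for (size_t i = start; i + sizeof MOV_ECX_DEBUGCTL <= size; i++) {
        if (memcmp(image + i, MOV_ECX_DEBUGCTL, sizeof MOV_ECX_DEBUGCTL) != 0)
            continue;

        size_t from = i + sizeof MOV_ECX_DEBUGCTL;
        /* Clamp the search window to the end of the image. */
        size_t to = (window > size - from) ? size : from + window;

        for (size_t j = from; j + sizeof WRMSR_OPCODE <= to; j++) {
            if (memcmp(image + j, WRMSR_OPCODE, sizeof WRMSR_OPCODE) == 0)
                return j;
        }
    }
    return NOT_FOUND;
}
```

Calling it again with start set just past the previous hit walks through all candidate sites; the hook point from the post would be the first byte after the returned offset plus 2.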
 #18594  by zico_guru
 Tue Mar 19, 2013 3:26 pm
Hey feryno, thanks for the response. But I am talking about BTS and TR, not about the LBR MSRs. Recheck my code, or see "Intel SDM Vol. 3, 17.4.9 BTS and DS Save Area". It is awesomely safe. The DS area is expandable, so it is not bounded by the number of MSRs. In this case I have to check the CPUID.1:EDX[21] bit. KiSwapContext is the most used function in Windows, so I am thinking about the performance penalties, and PatchGuard is still a dumb ass!! It is complicated but possible. Hardware support is available, which means it is very fast. It can be implemented to trace user-mode or kernel-mode code, and not only to trace. Remember Patchfinder (with great respect to Joanna)? It provides details of the code execution path, so hooks can be detected, and malware drivers that intercept code can be detected too! You just have to put it in KiFastSystemCall. It will give the path from ntdll!ZwWriteFile to msahci!AhciHwStartIo. Joanna showed the way to trace the path through single-stepping, i.e. "idt 1". Nowadays IDT hooking is a very big issue in recent versions of Windows, and it is very slow. I need this to speed up reverse engineering and for an anti-rootkit. Cheers!!! :D

 #18601  by feryno
 Wed Mar 20, 2013 7:18 am
I posted a lot of useless things in my 2 previous posts.
A few things confused me (the name of the thread, which ends with "user mode", plus you put more MSRs in the sample than necessary).
I should have thought for a few days before answering (rather than answering immediately after reading).
The "user mode" in the name of the thread means you want to watch only user-mode threads.
You could clean up the posted sample by removing all unrelated MSRs (e.g. MSR_LASTBRANCHFROM and so on) and leaving only these 2 MSRs: MSR_DEBUGCTL = 1D9h, MSR_DS_AREA = 600h.
I realised after 2 days that you don't want to trace a user-mode program (i.e. generate debug exceptions); you just want to record branch instructions, so the only bits of MSR_DEBUGCTL that matter to you are bits 7 and 6 (BTS, TR). There is no need to care about bits 1 and 0 (BTF, LBR) or the trap flag (bit 8 of RFLAGS).

I checked that your OS plays games only with bits 1 and 0 of MSR_DEBUGCTL and never touches bits 7 and 6 (the bits you need for recording), but that may change in future versions of the OS.

You could try installing the hook at KiSwapContext.
KiSwapContext is called from about 30 procedures, so there is a risk that it is also called for something other than running a new thread (e.g. for synchronization, KeWaitForSingleObject etc.). You must check how it behaves (and you will never be 100% sure that it is not called for something unwanted in some rare situation).

Because you want to record only user-mode addresses, another choice for the hook location is just before the last instructions executed in ring0 before switching to ring3.
These are:
IRETQ (hexadecimal 48 CF)
SYSRETQ (hexadecimal 48 0F 07), where the preceding instruction is always SWAPGS (on the IRETQ path it may be with or without SWAPGS)

You posted an Intel hypervisor sample on this forum some time ago; I assume you have improved it to be SMP capable.
To detect ring0 to ring3 transitions on an Intel CPU, I had success with tracing ring0 from SwapContext until the switch to ring3: rflags.TF=1, DebugCtl.BTF=1, with the hypervisor intercepting the guest-generated #DB (without injecting #DB back into the guest, just handling it silently).
The OS survived that, and even though it added some overhead, the performance was quite good (unnoticeable penalty).

I saw that KiRestoreDebugRegisterState is called before IRETQ / SYSRETQ; that may be a good point to start recording, if it is always called.

At the hook point, your hypervisor must check whether the thread about to run is the one you want to record or an alien thread (PsGetCurrentProcessId, PsGetCurrentThreadId; the IDs must be obtained before the SWAPGS instruction because they are read via the ring0 GS base).

Transitions from ring3 to ring0 (when to stop recording):
- interrupts: timer interrupt and external interrupts (generated by hardware, interprocessor interrupts etc.)
- exceptions (even a bug-free application generates a lot of KiPageFault, so that the OS maps pages into the virtual address space as the application accesses them)
- KiSystemCall64, KiSystemCall32 (you can retrieve the address with RDMSR, e.g. from MSR_LSTAR)

The exception handlers and KiSystemCallxx call KiSaveDebugRegisterState early (maybe the timer and all external interrupts do too; check that), so maybe you can stop recording there (e.g. via the VM exits caused by writing DR7).

So, to summarize the long talk above:
Try a hook in KiSwapContext.
If it doesn't work as expected, then try KiSaveDebugRegisterState/KiRestoreDebugRegisterState to stop/start recording.
Determining the process ID / thread ID: from the ring0 GS base, do it the way PsGetCurrentProcessId and PsGetCurrentThreadId do; from the ring3 GS base, do it the way GetCurrentProcessId and GetCurrentThreadId (exported from kernel32.dll) do. I'm not saying you should call these procedures; you should manually read the values from the same offsets they use. The ring0 and ring3 GS bases are swapped by the SWAPGS instruction (it swaps the values in MSRs C0000101 and C0000102).
Using the hypervisor for the hook would be a good choice (e.g. the same VM exits as writing DR7 in KiSaveDebugRegisterState/KiRestoreDebugRegisterState; if you decide to use KiSwapContext, it always calls SwapContext, which also contains some instructions that cause VM exits).
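Whichever hook point ends up working, the decision made inside it can be sketched as a pure function in portable C. The name debugctl_for_process and the use of a numeric process ID are assumptions made for illustration; a real hook would read the current value with something like the ReadMSR helper from the first post and write the result back with WriteMSR.

```c
#include <assert.h>
#include <stdint.h>

#define DEBUGCTL_TR  (1ull << 6)  /* send branch trace messages */
#define DEBUGCTL_BTS (1ull << 7)  /* store them in the BTS buffer */

/* Compute the DebugCtl value to load when control is about to return to
 * ring3: TR and BTS are set only for the target process and cleared for
 * everyone else. All other bits (e.g. LBR, BTF, which the OS manages
 * itself) are passed through unchanged. */
uint64_t debugctl_for_process(uint64_t debugctl,
                              uint64_t current_pid,
                              uint64_t target_pid)
{
    assert(target_pid != 0);
    debugctl &= ~(DEBUGCTL_TR | DEBUGCTL_BTS);
    if (current_pid == target_pid)
        debugctl |= DEBUGCTL_TR | DEBUGCTL_BTS;
    return debugctl;
}
```

Doing a read-modify-write instead of writing a fixed 0xC0 keeps the recording from clobbering the bits 1 and 0 that the OS toggles on its own. The same function with a non-matching ID gives the value to load on the ring3 to ring0 side to stop recording.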
 #18694  by feryno
 Tue Mar 26, 2013 9:31 am
Hi _Lynn, you repeated the same mistake I made.
zico_guru should have given the thread a better name, e.g. "implementing Branch Trace Store", and the unnecessary MSRs should be removed from the posted sample.
He doesn't want to use bits 1 and 0 of DebugCtlMSR.
He is interested only in bits 7 and 6, and he wants to implement branch trace store.
He doesn't want to set the TF bit of RFLAGS, and he doesn't want to read the LastBranchFrom/To MSRs (or get them from the thread context). He wants to record the branch addresses of a target ring3 process/thread in a buffer in memory, so that the target process has no suspicion that it is being watched.
He is attempting something technically very complicated: he wants to add support for a CPU feature that the MS Windows kernel does not implement.
He needs to know which process/thread is about to execute ring3 code, and he needs to find the points where to start/stop watching (otherwise alien processes/threads mess up the buffer with their records).
The technology zico_guru is trying to achieve would be very useful. He could let some malware run and fully infect the system, and the infection would leave traces in the debug store buffer, which may help the analysis (it is easier to defeat some anti-debug and anti-VM tricks when the address of each branch instruction such as call/ret/jmp/syscall is known). The malware runs natively on bare metal (no virtual machine) and not under a debugger; it just leaves traces (branch instructions) in the debug store buffer while infecting the system.
The debug store buffer may hold millions of branch addresses, while the branching MSRs hold only up to 16 pairs (a CPU limit; some CPUs have even fewer, e.g. only 1 pair on AMD), and the MS Windows kernel saves only the last MSR pair (so you need a debugger to collect them from the thread context, and the target must generate a debug exception on every branch instruction, otherwise the MSRs are overwritten by the next branch instruction = slowdown + debugging flags which will be detected by malware).
The idea posted here by zico_guru is brilliant.
A buffer in memory will definitely be the best and stealthiest technique (no debugger, no VM). Maybe no one has achieved this yet in the MS Windows kernel (although adding this CPU feature to open source kernels like Linux or some hobby OS is a quite trivial task).
 #18696  by _Lynn
 Tue Mar 26, 2013 2:22 pm
You are right hehe, I misread. The author of that article, 'everdox' on this forum, is working on a Windows 7 driver for that purpose, and I think he plans to release the source code. However, some important details I am aware of: he is targeting Nehalem-based Intel CPUs because of their large LBR buffer. He says that with his driver it will be like tracing a program without the trap flag, i.e. very fast. However, as far as I know, AMD CPUs currently do not have LBR buffers anywhere near as large as Intel Nehalem's.

I believe he hooks int1 for when the LBR buffer is full, then something like KiSaveDebugRegisterState/KeSaveDebugRegisterState to set/unset it for each task (kind of how you mention above).

You can PM him; I know he lives on this forum somewhere.