Counting Intel CPU Clock Ticks

Accurately timing sections of a program for performance profiling is a surprisingly error prone task.  Whilst timer circuitry exists on all modern PC motherboards, it is designed for long duration use and can only perform accurate timing in the millisecond range or longer.  Modern CPUs execute at gigahertz frequencies and therefore need timing capabilities in the nanosecond range to be useful for code profiling.

The obvious solution is to read the CPU’s own internal clock tick counter as this is the actual mechanism used by the chip itself.  Fortunately, Intel have exposed this tick count through the rdtsc instruction, which returns the CPU tick count since the chip was powered up.  It is tempting to imagine that dividing this number by the processor’s clock frequency will yield a nanosecond precise time value.  This used to be true for older generations of Intel CPU’s running single process operating systems, but this is no longer the case.

Firstly, there is a conflict of interest between power use and performance.  Chips consume energy for every clock cycle completed, so chip makers have developed sophisticated methods of adjusting the actual clock speed depending upon current system load.  As a result, it is not possible to be certain of the clock frequency in effect at any given moment, even if the exact model of the CPU is known to the program.

Secondly, on modern pre-emptive multitasking operating systems, any given section of code may be paused at any point for small periods by the operating system as it allocates CPU resources to other tasks.  This means that not all of the clock ticks recorded will necessarily have been used by the code being observed.

These limitations have to be understood and the solution from a profiling point of view comes firstly from the observation that the the actual real time value isn’t important for comparing code. The number of clock cycles is, so changes to code that reduce the number of clock cycles required to perform a task are always going to yield a performance improvement, even if that cannot be precisely timed. The impact of the second problem can be reduced by attempting to quantify level of uncertainty in the observations recorded. This situation can be expressed mathematically in the following simple equation:

ObservedTicks = ActualTicks+Error

The actual clock cycles required for a given sequence of instructions are invariant. Therefore if multiple observations are made it follows that:

ObservedTicksmin = ActualTicks+Errormin

This implies that the error can be minimised by increasing the number of observations until there is no further reduction in the minimum observed count.  At this point, the error is either zero or some residual constant which can’t be altered from within our code anyway. This means that whilst a single measurement cannot be relied upon for profiling, an appropriately sized set of observations can give a useful benchmark value.

To be able to profile my own Free Pascal code, I have developed a small unit to make use of the rdtsc instruction. It contains 32 and 64 bit Intel assembler so that it can be used on both the 32 and 64 bit versions of the compiler with no operating system dependencies (It has been tested on Windows, Linux and Mac OS X). An example of its use is shown in the following code snippet:


Procedure DrawImage;
Begin
  CPUTimer.Start;
  Canvas.Draw(0, 0, ImageBuffer);  { The code being profiled. }
  CPUTimer.Stop;
  WriteLn('Image Draw Time: '+CPUTimer.MegaCountAsText+'MTicks');
End;

The timer is created as a singleton global object, so it does not need to be created or freed. There are three helper methods that return the tick count as text. Given that these values can often be very large numbers, two of these output routines perform divisions on the count by 1,000 and 1,000,000 respectively to make display more manageable.

Given that the code is quite short I have pasted it here in full:

Unit CPUTime;

{$IFDEF FPC}
  {$ASMMODE INTEL}
  {$MODE OBJFPC}
  {$LONGSTRINGS ON}
{$ENDIF}

Interface

Uses
  Classes, SysUtils;

Type TTickUnits = (tuTicks, tuKiloTicks, tuMegaTicks);

Type
  TCPUTimer = Packed Object
    StartTick: Int64;
    TickCount: Int64;
    Function FormatTickCount(Count: Int64; Units: TTickUnits = tuTicks; ShowUnits: Boolean = False): String;
    Function KiloCountAsText: String;
    Function MegaCountAsText: String;
    Function TickCountAsText: String;
    Procedure Mark; Assembler;
    Procedure Start; Assembler;
    Procedure Stop; Assembler;
  End;

Var
  CPUTimer: TCPUTimer;

Implementation

Function TCPUTimer.FormatTickCount(Count: Int64; Units: TTickUnits = tuTicks; ShowUnits: Boolean = False): String;
Var
  DisplayTicks: Int64;
  UnitSuffix: String;
Begin
  UnitSuffix := EmptyStr;
  Case Units Of
  tuTicks:
    Begin
      DisplayTicks := Count;
      If ShowUnits Then
        UnitSuffix := 't';
    End;
  tuKiloTicks:
    Begin
      DisplayTicks := Count div 1000;
      If ShowUnits Then
        UnitSuffix := 'Kt';
    End;
  tuMegaTicks:
    Begin
      DisplayTicks := Count div 1000000;
      If ShowUnits Then
        UnitSuffix := 'Mt';
    End;
  End;
  Result := IntToStr(DisplayTicks div 1000)+UnitSuffix;
End;

Function TCPUTimer.KiloCountAsText: String;
Begin
  Result := FormatTickCount(TickCount, tuKiloTicks, True);
End;

Function TCPUTimer.MegaCountAsText: String;
Begin
  Result := FormatTickCount(TickCount, tuMegaTicks, True);
End;

Function TCPUTimer.TickCountAsText: String;
Begin
  Result := IntToStr(TickCount);
End;

Procedure TCPUTimer.Mark; Assembler; NoStackFrame;
Asm
  {$IFDEF CPU64}
  mov rcx, Self
  rdtsc
  shl rdx, 32
  or rax, rdx
  sub rax, rcx[StartTick]  
  mov rcx[TickCount], rax
  {$ELSE CPU32}
  mov ecx, Self
  rdtsc
  sub eax, ecx[StartTick]  { Subtract low 32bits first. }
  sbb edx, ecx[StartTick+4]  { Subtract high 32bits with borrow. }
  mov ecx[TickCount], eax
  mov ecx[TickCount+4], edx
  {$ENDIF CPU64}
End;

Procedure TCPUTimer.Start; Assembler; NoStackFrame;
Asm
  {$IFDEF CPU64}
  mov rcx, Self
  rdtsc
  shl rdx, 32
  or rax, rdx
  mov rcx[StartTick], rax
  {$ELSE CPU32}
  mov ecx, Self
  rdtsc
  mov ecx[StartTick], eax
  mov ecx[StartTick+4], edx
  {$ENDIF CPU64}
End;

Procedure TCPUTimer.Stop; Assembler; NoStackFrame;
Asm
  {$IFDEF CPU64}
  mov rcx, Self
  rdtsc
  shl rdx, 32
  or rax, rdx
  sub rax, rcx[StartTick]
  mov rcx[TickCount], rax
  mov rcx[StartTick], 0
  {$ELSE CPU32}
  mov ecx, Self
  rdtsc
  sub eax, ecx[StartTick]  { Subtract low 32bits first. }
  sbb edx, ecx[StartTick+4]  { Subtract high 32bits with borrow. }
  mov ecx[TickCount], eax
  mov ecx[TickCount+4], edx
  mov ecx[StartTick], 0
  mov ecx[StartTick+4], 0
  {$ENDIF CPU64}
End;

End.