What are the CPU c-states? How to check and monitor the CPU c-state usage in Linux per CPU and core?

What are C-states, cstates, or C-modes?

There are various power modes of the CPU which are determined on the basis of their current usage and are collectively called “C-states” or “C-modes.”

Low-power modes were first introduced with the 486DX4 processor. Since then, more power modes have been introduced, and each mode has been enhanced so that the CPU consumes less power while in it.

  • Each C-state uses a different amount of power and affects application performance differently.
  • Whenever a CPU core is idle, the built-in power-saving logic kicks in and tries to move the core from its current C-state to a higher (deeper) one, turning off various processor components to save power.
  • Every time an application is scheduled on a CPU to do some work, that CPU has to come back from its deeper sleep state to the running state, and the deeper the state, the longer the wake-up takes. The wake-up also has to be done in an atomic context, so that nothing tries to use the core while it is being powered up.
  • The modes between which the CPU transitions are called C-states.
  • They start at C0, the normal operating mode in which the CPU is fully turned on.
  • The higher the C number, the deeper the sleep: more circuits and signals are turned off, and the CPU needs more time to return to C0, i.e., to wake up.
  • Each mode also has a name, and several have sub-modes with different levels of power saving and therefore different wake-up times.

Below table explains all the CPU C-states and their meaning

How can I disable processor sleep states?

Latency-sensitive applications do not want the processor to transition into deeper C-states because of the delay incurred when coming back to C0. These delays can range from hundreds of microseconds to milliseconds.

There are various methods to achieve this.

Method 1
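Booting with the kernel command-line argument processor.max_cstate=0 ensures that the system never enters a C-state other than C0. Note that processor.max_cstate applies to the ACPI processor (acpi_idle) driver; the intel_idle driver has its own intel_idle.max_cstate parameter, which is covered later in this article.

Add this parameter to your GRUB2 defaults file by appending "processor.max_cstate=0" to GRUB_CMDLINE_LINUX, as shown below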

# vim /etc/sysconfig/grub
GRUB_CMDLINE_LINUX="novga console=ttyS0,115200 panic=1 numa=off elevator=cfq rd.md.uuid=f6015b65:f15bf68d:7abf04cc:e53fa9a2 rd.lvm.lv=os/root rd.md.uuid=a66dd4fd:9bf06835:5c2bc8df:f150487f rd.md.uuid=84bfe346:bb18024a:054d652a:d7678fa4 processor.max_cstate=0"

Regenerate the GRUB2 configuration

# grub2-mkconfig -o /boot/grub2/grub.cfg

Reboot the node to activate the changes
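
After the reboot it is worth confirming that the parameter actually reached the running kernel. The following is a minimal sketch that looks a parameter up in /proc/cmdline; the helper name cmdline_value and its optional file argument are illustrative assumptions, not part of any standard tool.

```shell
# Print the value of a kernel boot parameter, or nothing if it is not set.
# Usage: cmdline_value NAME [CMDLINE_FILE]
# CMDLINE_FILE defaults to /proc/cmdline; the argument only exists so the
# helper can be tried against a saved copy of a command line.
cmdline_value() {
    name=$1
    file=${2:-/proc/cmdline}
    # Split the command line into one parameter per line, then keep the
    # value of the last entry that starts with "NAME=".
    tr ' ' '\n' < "$file" |
        awk -v n="$name" '
            index($0, n "=") == 1 { v = substr($0, length(n) + 2) }
            END { if (v != "") print v }'
}
```

Running cmdline_value processor.max_cstate on the rebooted node should print 0 if the GRUB change above took effect, and nothing if the parameter is missing.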

Method 2

  • The second method is to use the Power Management Quality of Service (PM QoS) interface.
  • Opening the file /dev/cpu_dma_latency registers a quality-of-service request for latency with the operating system.
  • A program should open /dev/cpu_dma_latency, write a 32-bit number to it representing the maximum response time in microseconds, and then keep the file descriptor open for as long as low-latency operation is desired. Writing zero requests the fastest possible response time.
  • Several tuned profiles, for example network-latency and latency-performance, do this for you: tuned opens the file, writes the value configured in the profile, and keeps the file open.
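
The open, write, and hold sequence described above can be sketched in shell roughly as follows. The helper names le32 and with_latency and the PM_QOS_DEV override variable are hypothetical; the sketch assumes a little-endian machine such as x86 and has to run as root to open the real device.

```shell
# Device node for the latency request. PM_QOS_DEV is a hypothetical override
# that only exists so the helpers can be tried against a scratch file.
PM_QOS_DEV=${PM_QOS_DEV:-/dev/cpu_dma_latency}

# Emit VALUE as a 4-byte little-endian integer using a single printf call,
# so that all four bytes are handed over together.
le32() {
    v=$1
    fmt=
    for _i in 1 2 3 4; do
        # append one octal escape (\ooo) per byte, lowest byte first
        fmt="$fmt\\$(printf '%03o' $((v & 255)))"
        v=$((v >> 8))
    done
    printf "$fmt"
}

# Run COMMAND while a latency request of USEC microseconds is held.
# Usage: with_latency USEC COMMAND [ARGS...]
with_latency() {
    usec=$1
    shift
    if [ ! -w "$PM_QOS_DEV" ]; then
        echo "cannot write $PM_QOS_DEV (are you root?)" >&2
        return 1
    fi
    exec 3> "$PM_QOS_DEV"    # opening the file registers the request
    le32 "$usec" >&3         # writing sets the latency limit
    "$@"                     # the request stays active while fd 3 is open
    rc=$?
    exec 3>&-                # closing the file drops the request
    return "$rc"
}
```

For example, with_latency 0 ./my_app (where ./my_app stands for any latency-sensitive program) would keep the request in place only for the lifetime of that program.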

Below is a snippet from the latency-performance tuned profile

[cpu]
force_latency=1

As shown below, tuned keeps this file open for as long as it is running

# lsof /dev/cpu_dma_latency
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
tuned   1543 root    8w   CHR  10,61      0t0 1192 /dev/cpu_dma_latency

These profiles set force_latency to 1 so that the CPU does not enter any C-state deeper than C1.
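
If a different limit is needed, one option is a small custom profile that inherits from latency-performance and overrides only force_latency. The sketch below writes such a profile; the helper name write_profile, the profile name my-latency used in the usage note, and the value 10 are all illustrative assumptions.

```shell
# Write a minimal tuned profile that includes latency-performance and
# overrides force_latency.
# Usage: write_profile NAME USEC [TUNED_DIR]
# TUNED_DIR defaults to /etc/tuned, where tuned looks for custom profiles.
write_profile() {
    name=$1
    usec=$2
    dir=${3:-/etc/tuned}/$name
    mkdir -p "$dir" || return 1
    cat > "$dir/tuned.conf" <<EOF
[main]
include=latency-performance

[cpu]
force_latency=$usec
EOF
}
```

After write_profile my-latency 10, the new profile can be activated with tuned-adm profile my-latency and checked with the hexdump command shown later in this article.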

How to read and interpret /dev/cpu_dma_latency?

If we read this file with an ordinary text tool such as cat, the output looks something like this

# cat /dev/cpu_dma_latency
▒5w

Since this value is "raw" (not encoded as text) you can read it with something like hexdump.

# hexdump -C /dev/cpu_dma_latency
00000000  00 94 35 77                                       |..5w|
00000004

The four bytes are stored in little-endian order, so the value is 0x77359400

# echo $(( 0x77359400 ))
2000000000

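This tells us that the current latency limit is 2,000,000,000 microseconds, i.e., 2000 seconds. It is the maximum wake-up latency the system is currently asked to tolerate, and a value this large effectively places no restriction on how deep a C-state the CPU may enter.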

NOTE: By default on Red Hat Enterprise Linux 7 it is set to 2000 seconds.
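
The hexdump and manual byte swap above can be collapsed into one step with od, which decodes the four bytes as a signed 32-bit integer in the host's byte order. This is a small sketch; the function name read_latency and its optional file argument are assumptions made for illustration.

```shell
# Print the 32-bit latency value stored in FILE as a decimal number of
# microseconds. FILE defaults to /dev/cpu_dma_latency.
read_latency() {
    # -An: no offset column, -td4: signed 4-byte decimal, -N4: read 4 bytes
    od -An -td4 -N4 "${1:-/dev/cpu_dma_latency}" | tr -d ' '
}
```

On the system shown above, read_latency would be expected to print 2000000000 for the default value.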

Next, let us see what happens when we activate a tuned profile that sets force_latency=1. For this example I will use the network-latency profile

# tuned-adm profile network-latency

Check the existing active profile

# tuned-adm active
Current active profile: network-latency

Now let's check the latency value

# hexdump -C /dev/cpu_dma_latency
00000000  01 00 00 00                                       |....|
00000004

As you can see, the latency value has changed to 1 microsecond.

What is the maximum C-state allowed for my CPU?

A processor can support several C-states, but the deepest one the kernel is allowed to use depends on the latency limit and on any max_cstate value passed on the kernel command line in GRUB.

The file below shows the value for your node

# cat /sys/module/intel_idle/parameters/max_cstate
9
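
Because the limit can come from more than one driver, a small loop can report every max_cstate parameter that is visible. This is only a sketch: the function name show_max_cstate is made up, and the processor entry is an assumption, since not every kernel exposes that parameter in sysfs (the loop silently skips files it cannot read).

```shell
# Print the max_cstate parameter of each idle-related module that exposes one.
# Usage: show_max_cstate [MODULE_DIR]   (MODULE_DIR defaults to /sys/module)
show_max_cstate() {
    base=${1:-/sys/module}
    for mod in intel_idle processor; do
        f=$base/$mod/parameters/max_cstate
        if [ -r "$f" ]; then
            echo "$mod.max_cstate=$(cat "$f")"
        fi
    done
}
```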

How do I check the existing latency value for different C-states?

The exit latency differs between C-states: the deeper the state, the longer the transition back to C0 takes.

The commands below print the name and exit latency (in microseconds) of each C-state of cpu0

# cd /sys/devices/system/cpu/cpu0/cpuidle

# for state in state{0..4}; do echo c-$state $(cat $state/name) $(cat $state/latency); done
c-state0 POLL 0
c-state1 C1-HSW 2
c-state2 C1E-HSW 10
c-state3 C3-HSW 33
c-state4 C6-HSW 133

The same values can be read for any other CPU by changing the CPU number (cpu0) in the path above.
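
Rather than changing the CPU number by hand, the same files can be walked for every CPU with a glob. A minimal sketch, assuming the standard cpuidle sysfs layout used above; the function name list_cstates and its optional base directory argument are hypothetical.

```shell
# List the name and exit latency (microseconds) of every idle state of
# every CPU, one line per state.
# Usage: list_cstates [CPU_BASE_DIR]   (defaults to /sys/devices/system/cpu)
list_cstates() {
    base=${1:-/sys/devices/system/cpu}
    # Note: the glob sorts lexically, so cpu10 is listed before cpu2.
    for state in "$base"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -r "$state/name" ] || continue   # the glob matched nothing
        cpu=${state#"$base"/}              # e.g. cpu0/cpuidle/state1
        cpu=${cpu%%/*}                     # e.g. cpu0
        printf '%s %s %s %s\n' "$cpu" "${state##*/}" \
            "$(cat "$state/name")" "$(cat "$state/latency")"
    done
}
```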

How to check and monitor the CPU c-state usage in Linux per CPU and core?

The turbostat tool reports the run-time C-state usage of every available CPU and core.

I will use turbostat to monitor the C-states and the stress tool to put load on the CPUs.

To install these packages, use

# yum install kernel-tools
# yum install stress

For example

Case 1: Using throughput-performance tuned profile

To check the currently active profile

# tuned-adm active
Current active profile: throughput-performance

With this profile the latency value is the default, i.e., 2000 seconds

# hexdump -C /dev/cpu_dma_latency
00000000  00 94 35 77                                       |..5w|
00000004

Check the output using turbostat

# turbostat
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_% RAM_%
        -       -       6       0.34    1754    2597    2963    640     1.24    0.07    98.35   0.00    54      61      29.33   6.65    0.00 0.00
        0       0       5       0.30    1817    2597    116     40      0.76    0.06    98.88   0.00    51      61      15.36   2.62    0.00 0.00
        1       8       7       0.39    1722    2597    253     40      1.84    0.08    97.69   0.00    52
        2       1       5       0.28    1786    2597    97      40      1.04    0.04    98.64   0.00    51
        3       9       4       0.22    1811    2597    45      40      0.45    0.00    99.32   0.00    51
        4       2       5       0.29    1883    2597    86      40      0.69    0.06    98.96   0.00    53
        5       10      4       0.22    1830    2597    39      40      0.46    0.00    99.31   0.00    52
        6       3       7       0.39    1682    2597    279     40      1.67    0.07    97.87   0.00    54
        7       11      7       0.39    1762    2597    200     40      1.79    0.08    97.75   0.00    51
        0       4       8       0.43    1837    2597    268     40      1.59    0.07    97.91   0.00    37      49      13.97   4.03    0.00 0.00
        1       12      7       0.39    1734    2597    251     40      1.49    0.10    98.02   0.00    40
        2       5       5       0.27    1727    2597    84      40      0.64    0.06    99.03   0.00    39
        3       13      5       0.27    1837    2597    70      40      0.58    0.03    99.12   0.00    40
        4       6       6       0.32    1775    2597    164     40      1.07    0.04    98.56   0.00    40
        5       14      6       0.37    1675    2597    234     40      1.44    0.07    98.13   0.00    40
        6       7       7       0.43    1735    2597    299     40      1.75    0.15    97.68   0.00    39
        7       15      9       0.56    1634    2597    478     40      2.63    0.16    96.66   0.00    38

As you can see, all CPUs and cores spend almost all of their time in C6 because they are idle. Once I start putting load on the system, the CPUs move from C6 to C0, and the C6 residency drops as more CPUs are running. The successive turbostat samples below show this

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_% RAM_%
        -       -       384     13.84   2782    2594    16172   640     2.14    0.17    83.84   0.00    54      58      42.87   8.42    0.00 0.00
        0       0       419     15.09   2790    2590    896     40      1.19    0.08    83.64   0.00    50      58      21.18   3.16    0.00 0.00
        1       8       255     9.21    2778    2590    1073    40      4.91    0.55    85.34   0.00    51
        2       1       439     15.76   2793    2591    892     40      1.29    0.05    82.90   0.00    54
        3       9       441     15.81   2800    2591    997     40      0.64    0.02    83.53   0.00    53
        4       2       439     15.74   2797    2592    890     40      0.80    0.06    83.39   0.00    54
        5       10      258     9.39    2758    2594    1118    40      5.34    0.41    84.86   0.00    51
        6       3       317     11.43   2780    2594    962     40      3.47    0.32    84.78   0.00    52
        7       11      327     11.86   2764    2594    1236    40      5.00    0.41    82.73   0.00    50
        0       4       39      1.46    2660    2594    485     40      2.31    0.22    96.01   0.00    37      47      21.69   5.26    0.00 0.00
        1       12      461     16.68   2767    2594    1314    40      2.69    0.16    80.47   0.00    46
        2       5       465     16.68   2791    2595    944     40      0.86    0.08    82.38   0.00    41
        3       13      458     16.50   2779    2595    1067    40      1.32    0.14    82.04   0.00    46
        4       6       463     16.63   2788    2596    1243    40      0.99    0.07    82.31   0.00    46
        5       14      452     16.31   2778    2596    1001    40      1.27    0.11    82.31   0.00    46
        6       7       462     16.58   2789    2596    1023    40      0.77    0.05    82.60   0.00    44
        7       15      452     16.29   2776    2597    1031    40      1.45    0.07    82.19   0.00    41

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_% RAM_%
        -       -       2428    86.63   2804    2599    85363   656     6.08    0.96    6.33    0.00    57      60      119.27  17.04   0.00 0.00
        0       0       2377    84.85   2802    2600    5756    41      9.47    1.09    4.59    0.00    55      60      55.56   6.59    0.00 0.00
        1       8       1835    65.48   2801    2602    5742    41      20.04   2.11    12.37   0.00    54
        2       1       2802    99.93   2803    2601    5037    41      0.07    0.00    0.00    0.00    57
        3       9       2802    99.93   2803    2601    5035    41      0.07    0.00    0.00    0.00    56
        4       2       2802    99.94   2803    2600    5044    41      0.06    0.00    0.00    0.00    57
        5       10      1992    71.12   2802    2598    5688    41      16.62   1.77    10.50   0.00    54
        6       3       2799    99.94   2803    2599    5049    41      0.06    0.00    0.00    0.00    57
        7       11      1914    68.39   2801    2598    5720    41      18.45   2.09    11.07   0.00    51
        0       4       2066    73.79   2800    2600    5335    41      9.85    2.19    14.17   0.00    46      53      63.72   10.45   0.00 0.00
        1       12      2803    99.86   2807    2600    5088    41      0.14    0.00    0.00    0.00    52
        2       5       656     23.46   2800    2597    3312    41      21.81   6.10    48.63   0.00    45
        3       13      2799    99.86   2807    2597    5610    41      0.14    0.00    0.00    0.00    53
        4       6       2799    99.86   2807    2597    7143    41      0.14    0.00    0.00    0.00    51
        5       14      2799    99.86   2807    2597    5044    41      0.14    0.00    0.00    0.00    50
        6       7       2799    99.86   2807    2597    5679    41      0.14    0.00    0.00    0.00    50
        7       15      2799    99.86   2807    2597    5081    41      0.14    0.00    0.00    0.00    48

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_% RAM_%
        -       -       2421    86.42   2807    2595    84373   656     6.28    1.07    6.23    0.00    59      62      120.52  17.00   0.00 0.00
        0       0       2798    99.83   2808    2595    5039    41      0.17    0.00    0.00    0.00    57      62      55.92   6.54    0.00 0.00
        1       8       1891    67.58   2803    2595    5151    41      16.92   2.72    12.78   0.00    55
        2       1       2798    99.83   2808    2595    5032    41      0.17    0.00    0.00    0.00    59
        3       9       2798    99.83   2808    2595    6068    41      0.17    0.00    0.00    0.00    58
        4       2       2798    99.83   2808    2595    5041    41      0.17    0.00    0.00    0.00    58
        5       10      1527    54.56   2804    2595    5540    41      24.02   3.73    17.70   0.00    56
        6       3       2793    99.83   2808    2590    5045    41      0.17    0.00    0.00    0.00    58
        7       11      1692    60.57   2804    2590    5556    41      20.66   3.24    15.53   0.00    54
        0       4       1425    50.99   2800    2595    5251    41      19.20   4.24    25.57   0.00    48      57      64.60   10.46   0.00 0.00
        1       12      2799    99.85   2809    2595    5053    41      0.15    0.00    0.00    0.00    54
        2       5       2799    99.84   2809    2595    5054    41      0.16    0.00    0.00    0.00    53
        3       13      1419    50.79   2800    2595    4642    41      17.88   3.22    28.11   0.00    49
        4       6       2799    99.85   2809    2595    5059    41      0.15    0.00    0.00    0.00    55
        5       14      2799    99.84   2809    2595    5047    41      0.16    0.00    0.00    0.00    53
        6       7       2799    99.84   2809    2595    6206    41      0.16    0.00    0.00    0.00    53
        7       15      2801    99.84   2809    2597    5589    41      0.16    0.00    0.00    0.00    50

Towards the end, Busy% increases and the share of time spent in C6 drops, which means the CPUs are now in the running state.

Case 2: Change tuned profile to latency-performance

# tuned-adm profile latency-performance

# tuned-adm active
Current active profile: latency-performance

Next, monitor the CPU C-states while the system is idle

        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_% RAM_%
        -       -       61      2.17    2800    2597    2923    656     97.83   0.00    0.00    0.00    68      74      78.78   6.14    0.00 0.00
        0       0       363     13.00   2800    2597    56      41      87.00   0.00    0.00    0.00    65      74      39.31   2.22    0.00 0.00
        1       8       4       0.14    2800    2597    9       41      99.86   0.00    0.00    0.00    68
        2       1       4       0.14    2800    2597    23      41      99.86   0.00    0.00    0.00    66
        3       9       61      2.17    2800    2597    211     41      97.83   0.00    0.00    0.00    66
        4       2       5       0.18    2800    2597    93      41      99.82   0.00    0.00    0.00    67
        5       10      4       0.14    2800    2597    20      41      99.86   0.00    0.00    0.00    66
        6       3       4       0.15    2800    2597    25      41      99.85   0.00    0.00    0.00    68
        7       11      8       0.28    2800    2597    337     41      99.72   0.00    0.00    0.00    64
        0       4       4       0.16    2800    2597    68      41      99.84   0.00    0.00    0.00    57      66      39.46   3.93    0.00 0.00
        1       12      4       0.14    2800    2597    34      41      99.86   0.00    0.00    0.00    58
        2       5       5       0.18    2800    2597    134     41      99.82   0.00    0.00    0.00    58
        3       13      38      1.36    2800    2597    928     41      98.64   0.00    0.00    0.00    59
        4       6       433     15.50   2800    2597    35      41      84.50   0.00    0.00    0.00    59
        5       14      7       0.24    2800    2597    375     41      99.76   0.00    0.00    0.00    59
        6       7       4       0.14    2800    2597    17      41      99.86   0.00    0.00    0.00    58
        7       15      21      0.74    2800    2597    558     41      99.26   0.00    0.00    0.00    55

As you can see, even when the CPUs and cores are idle they do not transition to deeper C-states, because the profile forces them to stay in C1.
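
Where turbostat is not available, the kernel's own cpuidle counters give a rough cross-check. The sketch below reads the per-state time files (microseconds spent in each state since boot) for one CPU and prints each state's share of the recorded idle time. The function name idle_share is hypothetical, and the numbers are not directly comparable to turbostat's columns: turbostat reports hardware residency as a percentage of the sampling interval, while this reflects what the kernel requested, accumulated since boot.

```shell
# Print each idle state's share of the idle time recorded for one CPU.
# Usage: idle_share [CPUIDLE_DIR]
# CPUIDLE_DIR defaults to /sys/devices/system/cpu/cpu0/cpuidle.
idle_share() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    total=0
    # First pass: add up the time spent in all states.
    for state in "$dir"/state[0-9]*; do
        [ -r "$state/time" ] || continue   # the glob matched nothing
        total=$((total + $(cat "$state/time")))
    done
    [ "$total" -gt 0 ] || return 0         # no idle time recorded
    # Second pass: print an integer percentage (rounded down) per state.
    for state in "$dir"/state[0-9]*; do
        [ -r "$state/time" ] || continue
        t=$(cat "$state/time")
        printf '%s %s%%\n' "$(cat "$state/name")" "$((t * 100 / total))"
    done
}
```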

What is the POLL idle state?

If cpuidle is active, x86 platforms have one special idle state. The POLL idle state is not a real idle state and does not save any power. Instead, the CPU executes a busy loop that does nothing for a short period of time. This state is used when the kernel knows that work will have to be processed very soon and entering any real hardware idle state could cause a slight performance penalty.

There are two different cpuidle drivers on the x86 platform:

"acpi_idle" cpuidle driver
The acpi_idle cpuidle driver retrieves available sleep states (C-states) from the ACPI BIOS tables (from the _CST ACPI function on recent platforms or from the FADT BIOS table on older ones). The C1 state is not retrieved from ACPI tables. If the C1 state is entered, the kernel will call the hlt instruction (or mwait on Intel).

"intel_idle" cpuidle driver
The intel_idle driver was introduced in kernel 2.6.36. It only supports recent Intel CPUs (Nehalem, Westmere, Sandy Bridge, Atom, or newer). On older Intel CPUs the acpi_idle driver is still used (if the BIOS provides C-state ACPI tables). The intel_idle driver knows the sleep state capabilities of the processor and ignores the processor sleep state tables exported by the ACPI BIOS.

Why might the OS ignore BIOS settings?

  • Whether the OS follows the BIOS settings depends on the idle driver in use.
  • With intel_idle (the default on Intel machines), the OS can ignore the ACPI and BIOS settings, i.e., the driver can re-enable C-states that were disabled in the BIOS.
  • If intel_idle is disabled and the older acpi_idle driver is used, the OS should follow the BIOS settings.

The intel_idle driver can be disabled by:

  • passing intel_idle.max_cstate=0 on the kernel boot command line, or
  • passing idle=* (where * can be, for example, poll, i.e., idle=poll)

IMPORTANT NOTE: Make sure your processor is supported by the acpi_idle driver before you switch drivers.

How to check the currently loaded driver?

  • The intel_idle driver is a CPU idle driver that supports modern Intel processors.
  • The intel_idle driver presents the kernel with the duration of the target residency and exit latency for each supported Intel processor.
  • The CPU idle menu governor uses this data to predict how long the CPU will be idle.

The driver currently in use can be read from sysfs

# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

Alternatively, you can use the command below

# dmesg |grep idle
[    1.766866] intel_idle: MWAIT substates: 0x2120
[    1.766868] intel_idle: v0.4.1 model 0x3F
[    1.767023] intel_idle: lapic_timer_reliable_states 0xffffffff
[    1.835938] cpuidle: using governor menu
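
Both pieces of information, the driver and the governor, are also available from sysfs without searching the boot log. A small sketch, assuming the current_driver and current_governor_ro files in the cpuidle directory; the function name cpuidle_info and its optional directory argument are made up for illustration.

```shell
# Print the active cpuidle driver and governor as reported by sysfs.
# Usage: cpuidle_info [CPUIDLE_DIR]
# CPUIDLE_DIR defaults to /sys/devices/system/cpu/cpuidle.
cpuidle_info() {
    dir=${1:-/sys/devices/system/cpu/cpuidle}
    for attr in current_driver current_governor_ro; do
        if [ -r "$dir/$attr" ]; then
            echo "$attr: $(cat "$dir/$attr")"
        fi
    done
}
```

On the machine used in this article, the output would be expected to name intel_idle as the driver and menu as the governor, matching the dmesg lines above.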

I hope the article was useful.