
Queued up into tip/tip.git’s x86/cpu branch ahead of the Linux 6.8 merge window opening in about a month is an optimization that should prove useful in cloud/VM situations.
The change slated to be introduced in Linux 6.8 is to stop serializing model-specific register (MSR) accesses on AMD (and Zen 1 derived Hygon) processors. Intel CPUs must serialize MSR accesses for the Time Stamp Counter (TSC) deadline (IA32_TSC_DEADLINE) and X2APIC MSRs, and thus that has been the default behavior for Linux x86_64 use. That behavior was previously explained by an Intel Linux engineer as:
“The reason the kernel uses a different semantic is that the SDM changed (roughly in late 2017). The SDM changed because folks at Intel were auditing all of the recommended fences in the SDM and realized that the x2apic fences were insufficient.
Why was the plain MFENCE judged insufficient?
WRMSR itself is normally a serializing instruction. No fences are needed because the instruction itself serializes everything.
But, there are explicit exceptions to this serializing behavior written into the WRMSR instruction documentation for two classes of MSRs: IA32_TSC_DEADLINE and the X2APIC MSRs.
Back to x2apic: WRMSR is *not* serializing in this specific case. But why is MFENCE insufficient? MFENCE makes writes visible, but only affects load/store instructions. WRMSR is unfortunately not a load/store instruction and is unaffected by MFENCE. This means that a non-serializing WRMSR could be reordered by the CPU to execute before the writes made visible by the MFENCE have even occurred in the first place.
This means that an x2apic IPI could theoretically be triggered before there is any (visible) data to process.
Does this affect anything in practice? I honestly don't know. It seems quite possible that by the time an interrupt gets to consume the (not yet) MFENCE’d data, it has become visible, mostly by accident.
To be safe, add the SDM-recommended fences for all x2apic WRMSRs.
This also leaves open the question of the _other_ weakly-ordered WRMSR: MSR_IA32_TSC_DEADLINE. While it has the same ordering architecture as the x2APIC MSRs, it seems substantially less likely to be a problem in practice, even though writes to the in-memory Local Vector Table (LVT) might theoretically be reordered with respect to a weakly-ordered WRMSR like TSC_DEADLINE.”
So the Linux x86/x86_64 kernel has defaulted to an MFENCE plus LFENCE without any CPU-specific checks. It turns out AMD CPUs do not need this, and avoiding the serialized MSR access for TSC_DEADLINE/X2APIC can help with performance.
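For those wondering what that fence amounts to, the helper in question is weak_wrmsr_fence(), which the x2APIC code calls before these weakly-ordered WRMSRs. Here is a rough sketch of the pre-6.8 behavior on all x86_64 CPUs (simplified; the exact in-kernel definition may differ slightly):

```c
/*
 * Simplified sketch of the fence issued before weakly-ordered WRMSRs
 * (IA32_TSC_DEADLINE and the x2APIC MSRs), per the Intel SDM guidance:
 * MFENCE makes prior stores globally visible, and LFENCE keeps the
 * subsequent WRMSR from executing until that has happened.
 */
static inline void weak_wrmsr_fence(void)
{
	asm volatile("mfence; lfence" : : : "memory");
}
```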
The patch slated for Linux 6.8 will no longer serialize MSR accesses on AMD processors. The patch outlines the performance benefits from this change:
“AMD does not have the requirement for a synchronization barrier when accessing a certain group of MSRs. Do not incur that unnecessary penalty there.
…
On an AMD Zen4 system with 96 cores, a modified ipi-bench on a VM shows the x2AVIC IPI rate is 3% to 4% lower than the AVIC IPI rate. The ipi-bench is modified so that the IPIs are sent between two vCPUs in the same CCX. This also requires pinning the vCPU to a physical core to prevent any latencies. This simulates the use case of pinning vCPUs to the threads of a single CCX to avoid interrupt IPI latency.
…
With the above configuration:

*) Performance measured using ipi-bench for AVIC:

Average Latency: 1124.98ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 42.6759M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously]

*) Performance measured using ipi-bench for x2AVIC:

Average Latency: 1172.42ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 40.9432M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously]
From the above, x2AVIC latency is ~4% higher than AVIC. However, the expectation is for x2AVIC performance to be better than or equal to AVIC. Upon analyzing the perf captures, it is observed that significant time is spent in weak_wrmsr_fence() invoked by x2apic_send_IPI().
With the fix to skip weak_wrmsr_fence():
*) Performance measured using ipi-bench for x2AVIC:

Average Latency: 1117.44ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 42.9608M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously]
Comparing the performance of x2AVIC with and without the fix, it can be seen that the performance improves by ~4%.
Performance captured using an unmodified ipi-bench with the ‘mesh-ipi’ option, with and without weak_wrmsr_fence(), on a Zen4 system also showed significant performance improvement without weak_wrmsr_fence(). The ‘mesh-ipi’ option ignores CCX or CCD and just picks random vCPUs.
Average throughput (10 iterations) with weak_wrmsr_fence():
Cumulative throughput: 4933374 IPI/s

Average throughput (10 iterations) without weak_wrmsr_fence():
Cumulative throughput: 6355156 IPI/s”
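To illustrate the idea behind the change, here is a hypothetical, simplified version of weak_wrmsr_fence() with the vendor gating written as a plain runtime check; the actual Linux 6.8 patch wires this up through the kernel's CPU-feature/alternatives machinery rather than a branch like this:

```c
#include <asm/processor.h>	/* boot_cpu_data, X86_VENDOR_* */

/*
 * Hypothetical sketch only: skip the barrier where it is not needed.
 * AMD (and Zen 1 derived Hygon) CPUs do not require the fence for
 * IA32_TSC_DEADLINE and x2APIC MSR writes; everyone else keeps the
 * SDM-recommended MFENCE;LFENCE sequence.
 */
static inline void weak_wrmsr_fence(void)
{
	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
	    boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
		return;

	asm volatile("mfence; lfence" : : : "memory");
}
```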
With this MSR access behavior having been the default in the Linux x86_64 kernel for several years now, it is a bit surprising it was not spotted sooner by AMD or its partners as an optimization opportunity.
Barring any issues arising with the patch, now that it is part of a TIP branch it should in turn be part of the Linux 6.8 kernel changes due out in early 2024.