wrx80e_sage_nvme_disable_aer_severity_corrected.md

 2 years ago
source link: https://gist.github.com/zekome/35db528b33206e68f18439ad7fabfcd5

Turn off AER logging for NVMe and event severity corrected

Motherboard: Asus Pro WS WRX80E-SAGE SE WIFI
Card: Asus HYPER M.2 X16 GEN 4 CARD
NVMe: 4x Samsung SSD 980 PRO 1TB
OS: Linux fedora 5.16.12-200.fc35.x86_64

AER (Advanced Error Reporting) floods the kernel log with corrected-error messages:

dmesg

nvme 0000:44:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
nvme 0000:44:00.0:    [ 0] RxErr                  (First)
nvme 0000:44:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
nvme 0000:44:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
nvme 0000:44:00.0:    [ 0] RxErr                  (First)
nvme 0000:44:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID

{2085}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
{2085}[Hardware Error]: It has been corrected by h/w and requires no further action
{2085}[Hardware Error]: event severity: corrected
{2085}[Hardware Error]:  Error 0, type: corrected
{2085}[Hardware Error]:   section_type: PCIe error
{2085}[Hardware Error]:   port_type: 0, PCIe end point
{2085}[Hardware Error]:   version: 0.2
{2085}[Hardware Error]:   command: 0x0406, status: 0x0010
{2085}[Hardware Error]:   device_id: 0000:44:00.0
{2085}[Hardware Error]:   slot: 0
{2085}[Hardware Error]:   secondary_bus: 0x00
{2085}[Hardware Error]:   vendor_id: 0x144d, device_id: 0xa80a
{2085}[Hardware Error]:   class_code: 010802
{2085}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
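Every affected device appears by PCI address on the `AER:` lines above. As a minimal sketch, the distinct addresses can be pulled out of the kernel log with a small filter. The function name `aer_addresses` is hypothetical, and the pattern assumes the `domain:bus:device.function` form shown in these logs.

```shell
# aer_addresses: read kernel log text on stdin and print each distinct
# PCI address found on a line containing "AER:", one per line, sorted.
# Assumption: addresses look like 0000:44:00.0 (domain:bus:device.function).
aer_addresses() {
    grep 'AER:' \
        | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' \
        | sort -u
}

# Usage (reading the kernel log may require root):
#   dmesg | aer_addresses
```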

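Note the device's PCI address in the logs (reported there as device_id). In this case it is 0000:44:00.0. All four NVMe disks on the card produce similar logs; the other three are at 0000:43:00.0, 0000:42:00.0 and 0000:41:00.0. For each address (for example 0000:44:00.0), turn off corrected-error reporting by clearing bit 0 of the 16-bit register at CAP_EXP+0x8, if it is set. That offset is the Device Control register of the PCI Express capability, and bit 0 is its Correctable Error Reporting Enable bit. Read the current value first; when bit 0 is set, XOR with 0x1 clears it. XOR toggles, so applying it to a value whose bit 0 is already clear would turn reporting back on.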

setpci -v -s 0000:44:00.0 CAP_EXP+0x8.w
0000:44:00.0 (cap 10 @70) @78 = 2937

Bit 0 is set (the value is odd), so clear it: 0x2937 XOR 0x1 = 0x2936
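To make the bit arithmetic concrete, here is a small sketch that decodes the four error-reporting enable bits (bits 0 to 3 of Device Control) from a value as printed by setpci. The function name `devctl_error_bits` is hypothetical; the `+`/`-` labels imitate the DevCtl line of `lspci -vv`.

```shell
# devctl_error_bits HEX: decode bits 0-3 of a PCIe Device Control value.
#   bit 0 = Correctable Error Reporting Enable
#   bit 1 = Non-Fatal Error Reporting Enable
#   bit 2 = Fatal Error Reporting Enable
#   bit 3 = Unsupported Request Reporting Enable
# HEX is given without a 0x prefix, as setpci prints it.
devctl_error_bits() {
    local val=$(( 0x$1 )) out="" i=0 name
    for name in CorrErr NonFatalErr FatalErr UnsupReq; do
        if [ $(( (val >> i) & 1 )) -eq 1 ]; then
            out="$out $name+"
        else
            out="$out $name-"
        fi
        i=$(( i + 1 ))
    done
    echo "${out# }"
}

devctl_error_bits 2937   # CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
devctl_error_bits 2936   # CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
```

Only the first flag changes between the two values, which is the intended effect: the other reporting bits stay as they were.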

setpci -v -s 0000:44:00.0 CAP_EXP+0x8.w=0x2936
0000:44:00.0 (cap 10 @70) @78 2936
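The read, clear and write steps can be wrapped up for all four disks. This is a sketch rather than a tested tool: the helper names `clear_bit0` and `disable_corrected_reporting` are hypothetical, and the helper assumes setpci is installed and run as root. It masks the bit off with AND instead of XOR, so running it twice never switches reporting back on.

```shell
# clear_bit0 HEX: print HEX (no 0x prefix) with bit 0 cleared, as 4 hex digits.
# Pure arithmetic; touches no hardware.
clear_bit0() {
    printf '%04x\n' $(( 0x$1 & ~0x1 & 0xffff ))
}

# disable_corrected_reporting ADDR: read Device Control (CAP_EXP+0x8) of the
# device at PCI address ADDR and write it back with bit 0 cleared.
# Requires root and setpci; writes only if the value actually changes.
disable_corrected_reporting() {
    local addr=$1 cur new
    cur=$(setpci -s "$addr" CAP_EXP+0x8.w) || return 1
    new=$(clear_bit0 "$cur")
    if [ "$cur" != "$new" ]; then
        setpci -s "$addr" CAP_EXP+0x8.w="$new" || return 1
    fi
    echo "$addr: $cur -> $new"
}

# Usage (as root), with the four addresses from this write-up:
#   for addr in 0000:41:00.0 0000:42:00.0 0000:43:00.0 0000:44:00.0; do
#       disable_corrected_reporting "$addr"
#   done
```

Registers written with setpci are not saved across a reboot, so a script like this would need to run again at boot. setpci also accepts a `value:mask` form (see its man page), so `CAP_EXP+0x8.w=0:1` should change only bit 0 without a separate read.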

The PCI address and the register value may differ on other systems.