This message indicates that the Solaris Fault Manager has received reports
of single-bit correctable errors from a Memory Module at a rate exceeding
acceptable levels, and a Memory Module fault has been diagnosed. No
data has been lost, and pages of the affected Memory Module are being
retired as errors are encountered.
The recommended service action for this event is to schedule replacement
of the affected Memory Module at the earliest possible convenience. The
errors are correctable in nature so they do not present an immediate threat
to system availability.
Follow the steps below to complete the recommended repair action.
Step 1: Find the 36 character EVENT_ID string associated with the
fault.
The EVENT_ID can be obtained with either the fmdump(1M) or fmadm(1M)
command, as shown in Example 1.1 below, or extracted from the fault
message displayed on the console at the time of the fault.
Note: Be sure to get the correct 36 character EVENT_ID string if more
than one is listed. You can identify the correct string by matching
the fault to the date and time stamp in the TIME column of the fmdump
output in Example 1.1 below. fmadm faulty shows the EVENT_ID on the line
beneath the faulted resource.
Example 1.1 - finding the EVENT_ID (36 character string):
# fmdump
TIME UUID SUNW-MSG-ID
Mar 17 15:16:52.1773 731c7264-0619-e4fd-fdc4-a58b9c1b9ffb AMD-8000-2F
Mar 18 12:26:46.0362 ab688055-6049-60aa-817b-cf25406dac43 AMD-8000-67
Mar 20 10:50:54.7096 1eaa6704-1e0b-4ef9-9337-e14c108dfbec AMD-8000-AV
# fmadm faulty
STATE RESOURCE / UUID
-----------------------------------------------------------------------
degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=0
731c7264-0619-e4fd-fdc4-a58b9c1b9ffb
-----------------------------------------------------------------------
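When the SUNW-MSG-ID of the fault is already known, the lookup in Example 1.1
can be scripted. The following is a minimal sketch, not part of the official
procedure: the helper name uuid_for_msgid is hypothetical, and the field
positions assume the default fmdump layout shown above (month, day, time,
UUID, message ID).

```shell
# Hypothetical helper: print the UUID of every fmdump entry whose
# SUNW-MSG-ID matches the first argument. Reads fmdump output on stdin.
# Assumes the default layout: month, day, time, UUID, message ID.
uuid_for_msgid() {
    awk -v id="$1" '$5 == id { print $4 }'
}

# Usage on a live system (not run here):
#   fmdump | uuid_for_msgid AMD-8000-2F
```

If more than one UUID is printed, the date and time stamp is still needed to
pick the right one. On Solaris releases where /usr/bin/awk does not accept
-v, nawk or /usr/xpg4/bin/awk may be needed instead.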
Step 2: Use the command fmdump -v -u EVENT_ID to locate the faulty
memory module, where EVENT_ID is the 36 character string obtained in
Step 1 above. The faulty module is reported on the line labeled 'FRU:'.
See Example 2.1 below.
Example 2.1 - determining which FRU needs to be replaced:
# fmdump -v -u 731c7264-0619-e4fd-fdc4-a58b9c1b9ffb
TIME UUID SUNW-MSG-ID
Mar 17 15:16:52.1773 731c7264-0619-e4fd-fdc4-a58b9c1b9ffb AMD-8000-2F
100% fault.memory.dimm_sb
Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=0
Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=0
FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=0
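The FRU string can also be pulled out of the verbose output with a short
filter. This is a sketch under the assumption that the FRU is reported on a
line whose first field is 'FRU:', as in Example 2.1. The helper name
fru_from_fmdump is hypothetical.

```shell
# Hypothetical helper: print the FRU string from "fmdump -v -u <UUID>"
# output supplied on stdin. Assumes the line format "FRU: <string>".
fru_from_fmdump() {
    awk '$1 == "FRU:" { print $2 }'
}

# Usage on a live system (not run here):
#   fmdump -v -u 731c7264-0619-e4fd-fdc4-a58b9c1b9ffb | fru_from_fmdump
```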
Step 3: Identify the FRU that needs to be replaced.
In our example, the service action is to replace memory module 0
associated with CPU0 (chip=0), because the string following the 'FRU:'
label in Example 2.1 above is
hc:///motherboard=0/chip=0/memory-controller=0/dimm=0.
The term 'chip' used in the output above can also be used to refer
to a processor or CPU chip, hence 'chip=x' equates to 'CPU chipx'.
Note that the cpuid refers to a logical CPU number within the CPU
chip.
For all Sun AMD based systems:
chip=x maps to the processor chip labeled CPUx within the system.
For example:
chip=0 maps to the physical location labeled CPU0
chip=1 maps to the physical location labeled CPU1
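The chip-to-CPU mapping above can be applied to an FRU string mechanically.
The following sketch assumes the FRU string contains chip=<n> and dimm=<n>
components, as in Example 2.1. The helper name fru_location is hypothetical.

```shell
# Hypothetical helper: print the "CPUx" label and the dimm index taken
# from an FRU string passed as the first argument.
# Assumes the string contains chip=<n> and dimm=<n> components.
fru_location() {
    chip=$(printf '%s\n' "$1" | sed -n 's/.*chip=\([0-9][0-9]*\).*/\1/p')
    dimm=$(printf '%s\n' "$1" | sed -n 's/.*dimm=\([0-9][0-9]*\).*/\1/p')
    printf 'CPU%s dimm %s\n' "$chip" "$dimm"
}

# Usage:
#   fru_location 'hc:///motherboard=0/chip=0/memory-controller=0/dimm=0'
```

The dimm value printed here is the index from the FRU string only. The
physical slot label is platform specific and still has to be looked up in
the memory labeling reference for the platform.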
Once you have identified the correct CPU location, you then need to
identify its corresponding memory module.
Please reference the following SRDB (Symptom Resolution Data Base)
link for the labeling of your platform's Memory Module(s).
Memory Labeling
Step 4: Replace the faulty FRU (and reboot the system)
Refer to the service label or hardware maintenance manual for correct
replacement procedures.
Step 5: Update the Fault Manager's resource cache to indicate
that no problems are present in the resources that had been diagnosed
faulty and subsequently replaced in Step 4, as shown in Example 5.1.
Once the Memory Modules have been physically replaced and the system rebooted,
you must invoke the 'fmadm repair' command using the UUID (Universally
Unique IDentifier) to identify the repaired FRU. The UUID is synonymous
with the EVENT_ID shown in Example 1.1 above.
Example 5.1 - Updating the Fault Manager's resource cache:
# fmadm repair 731c7264-0619-e4fd-fdc4-a58b9c1b9ffb
fmadm: recorded repair to 731c7264-0619-e4fd-fdc4-a58b9c1b9ffb
Step 6: Reset the Faulted Page Counters.
Once fmadm repair <UUID> has completed, reset the Faulted Page Counters
by invoking the 'fmadm reset' command shown in Example 6.1 below.
The required action in this step is temporary. When it becomes automated,
this article will be updated to reflect the change.
Example 6.1 - resetting the Faulted Page Counters:
# fmadm reset eft
Note: Resetting the Faulted Page Counters resets ALL diagnosis state for
the diagnosis engine. All event history will be lost. All already
diagnosed faults will continue to appear in fmadm faulty. Before
resetting, capture the output of fmstat -m eft and fmstat -s -m eft.
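One way to capture the two fmstat outputs before the reset is sketched
below. The helper name save_eft_stats, the output directory and the file
names are illustrative assumptions, not part of the documented procedure.

```shell
# Hypothetical helper: save the eft module statistics to files in the
# directory given as the first argument, before "fmadm reset eft" is run.
# The file names are illustrative choices.
save_eft_stats() {
    dir=$1
    mkdir -p "$dir" || return 1
    fmstat -m eft > "$dir/fmstat-m-eft.txt" || return 1
    fmstat -s -m eft > "$dir/fmstat-s-m-eft.txt" || return 1
}

# Usage on a live system (not run here):
#   save_eft_stats /var/tmp/eft-before-reset
```

The function returns a non-zero status if the directory cannot be created
or either fmstat command fails, so the reset can be made conditional on a
successful capture.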
Step 7: Verify the repaired resource is no longer faulty
Use the Solaris command 'fmadm faulty' to display all faulted resources
in the system. Confirm that the repaired resource is no longer listed
as faulted, as shown in Example 7.1.
Example 7.1 - verifying the repaired resource is no longer faulty:
# fmadm faulty
STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
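The check in Example 7.1 can be scripted by searching the fmadm faulty
output for the repaired UUID. This is a minimal sketch. The helper name
uuid_is_cleared is hypothetical, and it assumes the UUID would appear
verbatim in the output if the resource were still faulted, as in Example 1.1.

```shell
# Hypothetical helper: read "fmadm faulty" output on stdin and return 0
# only if the UUID given as the first argument does not appear in it.
uuid_is_cleared() {
    if grep "$1" > /dev/null; then
        return 1
    fi
    return 0
}

# Usage on a live system (not run here):
#   fmadm faulty | uuid_is_cleared 731c7264-0619-e4fd-fdc4-a58b9c1b9ffb \
#       && echo "resource no longer faulted"
```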