This message indicates that the Solaris Fault Manager has received
reports from a CPU indicating that one or more Level 1 Data
Cache errors have occurred and a CPU fault has been diagnosed. If
an uncorrectable error was reported, it likely resulted in a system
reset followed by a reboot. If the errors were correctable, they have
occurred at a rate exceeding acceptable levels. There should have
been no operational impact on system or application activity from
correctable errors.
The recommended service action for this event is to schedule replacement
of the affected CPU at the earliest possible convenience. The faulted
CPU has been off-lined to prevent further disruption to system availability.
However, it is neither intended nor recommended that the faulted FRU
remain in the system for a prolonged period of time.
Follow the steps below to complete the recommended repair action.
Step 1: Find the 36-character EVENT_ID string associated with the
fault.
The 36-character EVENT_ID string can be located using the fmdump (1M) or
fmadm (1M) commands shown in Example 1.1 below, or it can be extracted
from the fault message displayed in the console output at the time of
the fault.
Note: Be sure to get the correct 36-character EVENT_ID string if more
than one is listed. You can identify the correct string (the UUID
column) by matching the fault to its date and time (the TIME column)
in the fmdump output in Example 1.1 below. The fmadm faulty output in
the example also shows the EVENT_ID, on the line below the faulted
resource.
Example 1.1 - finding the EVENT_ID (36 character string):
# fmdump
TIME UUID SUNW-MSG-ID
Mar 20 10:50:54.7096 1eaa6704-1e0b-4ef9-9337-e14c108dfbec AMD-8000-AV
Mar 24 13:48:00.8161 ffd95aaf-396c-4a5f-d44d-a3cf5f61b80e AMD-8000-5M
Mar 27 15:16:52.1773 731c7264-0619-e4fd-fdc4-a58b9c1b9ffb AMD-8000-2F
# fmadm faulty
STATE RESOURCE / UUID
------------------------------------------------------------------------------
faulted cpu:///cpuid=2
1eaa6704-1e0b-4ef9-9337-e14c108dfbec
------------------------------------------------------------------------------
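When fmdump lists several events, selecting the right EVENT_ID can be scripted. The following is a minimal sketch and not part of the original procedure: event_ids_for_msg is a hypothetical helper name, and the sketch assumes a POSIX shell and awk plus the column layout of Example 1.1, where the timestamp occupies the first three whitespace-separated fields.

```shell
# Hypothetical helper: print the UUID (EVENT_ID) of every fmdump line
# whose SUNW-MSG-ID equals the first argument.
# Assumes the layout of Example 1.1: "Mon DD HH:MM:SS.ffff UUID MSG-ID",
# so data lines have 5 fields and the header line (3 fields) is skipped.
event_ids_for_msg() {
    awk -v id="$1" 'NF == 5 && $5 == id { print $4 }'
}

# Intended use on a live system (not executed here):
#   fmdump | event_ids_for_msg AMD-8000-AV
```

If more than one UUID is printed, the date and time on the matching fmdump lines still have to be compared by hand, as the note above describes.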
Step 2: Use the command 'fmdump -v -u EVENT_ID' to locate the
faulty CPU, where EVENT_ID is the 36-character string
obtained in Step 1 above. The faulty FRU is identified by the string
following the label 'FRU:' in Example 2.1 below.
Example 2.1 - determining which FRU needs to be replaced:
# fmdump -v -u 1eaa6704-1e0b-4ef9-9337-e14c108dfbec
TIME UUID SUNW-MSG-ID
Mar 20 10:50:54.7096 1eaa6704-1e0b-4ef9-9337-e14c108dfbec AMD-8000-AV
100% fault.cpu.amd.dcachedata
Problem in: hc:///motherboard=0/chip=1/cpu=0
Affects: cpu:///cpuid=2
FRU: hc:///motherboard=0/chip=1
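The labelled lines in the verbose output can also be picked out mechanically. This is a rough sketch under the assumption that the output keeps the 'FRU:' and 'Affects:' labels shown in Example 2.1. The function names fru_of_event and affected_of_event are hypothetical, and a POSIX shell and awk are assumed.

```shell
# Hypothetical helper: print the value following the 'FRU:' label in
# 'fmdump -v -u EVENT_ID' output read from standard input.
fru_of_event() {
    awk '$1 == "FRU:" { print $2 }'
}

# Same idea for the 'Affects:' label, which names the logical CPU.
affected_of_event() {
    awk '$1 == "Affects:" { print $2 }'
}

# Intended use on a live system (not executed here):
#   fmdump -v -u EVENT_ID | fru_of_event
```

Both helpers ignore leading indentation, because awk splits fields on any run of whitespace.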
Step 3: Identify the FRU that needs to be replaced.
In our example, the service action would be to replace CPU1, because
the chip identified in the string following the 'FRU:' label in
Example 2.1 above is hc:///motherboard=0/chip=1.
The term 'chip' used in the output above can also be used to refer
to a processor or CPU chip, hence 'chip=x' equates to 'CPU chipx'.
Note that the cpuid refers to a logical CPU number within the CPU
chip.
For all Sun AMD-based systems,
chip=x maps to the processor chip labeled CPUx within the system.
For example:
chip=0 maps to the physical location labeled CPU0
chip=1 maps to the physical location labeled CPU1
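The chip=x to CPUx mapping above is simple enough to express as a small shell function. This is only an illustrative sketch: fru_to_label is a hypothetical name, and the parameter expansions assume a POSIX shell such as ksh or bash rather than a legacy Bourne shell.

```shell
# Hypothetical helper: turn an FRU string such as
# hc:///motherboard=0/chip=1 into the physical label CPU1.
fru_to_label() {
    case "$1" in
        *chip=[0-9]*)
            n=${1##*chip=}     # drop everything up to and including 'chip='
            n=${n%%[!0-9]*}    # keep only the leading digits
            echo "CPU$n"
            ;;
        *)
            echo "no chip= component in: $1" >&2
            return 1
            ;;
    esac
}
```

For instance, 'fru_to_label hc:///motherboard=0/chip=1' prints CPU1, matching the example above. A string without a chip= component, such as cpu:///cpuid=2, is rejected with a non-zero exit status.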
Step 4: Replace the faulty FRU (and reboot the system)
Refer to the service label or hardware maintenance manual for correct
replacement procedures.
Step 5: Update the Fault Manager's resource cache to indicate
that no problems are present in the resources that had been diagnosed
faulty and subsequently replaced in Step 4, as shown in Example 5.1.
Once the CPU has been physically replaced and the system rebooted,
you must invoke the 'fmadm repair' command using the UUID (Universally
Unique IDentifier) to identify the repaired FRU. The UUID is synonymous
with the EVENT_ID found in Example 1.1 above.
Example 5.1 - Updating the Fault Manager's resource cache:
# fmadm repair 1eaa6704-1e0b-4ef9-9337-e14c108dfbec
fmadm: recorded repair to 1eaa6704-1e0b-4ef9-9337-e14c108dfbec
Step 6: Verify the repaired resource is no longer faulty
Use the Solaris command 'fmadm faulty' to display all faulted resources
in the system. Confirm that the repaired resource is no longer listed
as faulted, as shown in Example 6.1 below, where no resources are
faulted.
Example 6.1 - verifying the repaired resource is no longer faulty:
# fmadm faulty
STATE RESOURCE / UUID
------------------------------------------------------------------------------
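The check in Step 6 can be automated by searching the 'fmadm faulty' output for the repaired UUID. The sketch below is an assumption-laden illustration, not part of the original procedure: is_repaired is a hypothetical name, and it treats any occurrence of the UUID in the output as meaning the resource is still faulted.

```shell
# Hypothetical helper: succeed (exit status 0) only if the given UUID
# does NOT appear anywhere in 'fmadm faulty' output on standard input.
is_repaired() {
    awk -v id="$1" '
        index($0, id) { found = 1 }
        END { exit (found ? 1 : 0) }
    '
}

# Intended use on a live system (not executed here):
#   fmadm faulty | is_repaired EVENT_ID && echo "no longer faulted"
```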
Step 7: Place the CPU back into the active configuration
Once you have completed the repair action and told the Fault Manager
that the FRU has been repaired, it may still be necessary to tell Solaris
to use the affected system resource. If the Automated Response
described above in this article was "An attempt will be made to remove
this CPU from service.", you will need to place the CPU back into the
active configuration and verify that the CPU has been returned to service.
To place the CPU back into the active configuration, you will need to
know the logical CPU number that was faulted. The logical CPU number
was identified in the fmdump output following the label 'Affects:'
in Example 2.1 above. You can also identify a CPU that Solaris still
considers faulted by using the psrinfo (1M) command shown in
Example 7.1 below.
Example 7.1 - identifying the logical CPU number that was faulted:
# psrinfo
0 on-line since 03/20/2006 14:04:16
1 on-line since 03/20/2006 14:04:19
2 faulted since 03/20/2006 14:08:52
3 on-line since 03/20/2006 14:04:23
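On systems with many logical CPUs, the faulted entries can be filtered out of the psrinfo listing. This is a minimal sketch assuming the two leading columns shown in Example 7.1 (logical CPU number, then state). The name faulted_cpus is hypothetical.

```shell
# Hypothetical helper: print the logical CPU number of every line in
# psrinfo output (read from standard input) whose state is 'faulted'.
faulted_cpus() {
    awk '$2 == "faulted" { print $1 }'
}

# Intended use on a live system (not executed here):
#   psrinfo | faulted_cpus
```

For the listing in Example 7.1, this would print 2.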
Place the faulted CPU back into the active configuration using the psradm
(1M) command 'psradm -F -n x', where x is the logical cpuid. Because logical
CPU 2 was identified as the faulted CPU, the psradm (1M) command shown in
Example 7.2 below will return it to the active configuration.
Example 7.2 - returning the repaired CPU to the active configuration:
# psradm -F -n 2
Step 8: Verify the CPU is in use by the system.
To verify that the CPU has been returned to service, use the Solaris
command psrinfo (1M). See Example 8.1 below.
Example 8.1 - verifying the CPU has returned to service:
# psrinfo
0 on-line since 03/20/2006 14:04:16
1 on-line since 03/20/2006 14:04:19
2 on-line since 03/20/2006 14:09:57
3 on-line since 03/20/2006 14:04:23
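The final verification can be reduced to a single state lookup. The sketch below assumes the psrinfo layout of Example 8.1. The name cpu_state is hypothetical.

```shell
# Hypothetical helper: print the state column (for example 'on-line'
# or 'faulted') for one logical CPU number, reading psrinfo output
# from standard input.
cpu_state() {
    awk -v id="$1" '$1 == id { print $2 }'
}

# Intended use on a live system (not executed here):
#   [ "$(psrinfo | cpu_state 2)" = "on-line" ] && echo "CPU 2 is back in service"
```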