|
Level 1 Data Cache Fault
Type
- Fault
Severity
- Major
Description
- A level 1 Data Cache on this cpu is faulty.
Automated Response
- The system will attempt to offline this cpu to remove it from service.
Impact
- Performance of this system may be affected.
Suggested Action for System Administrator
- Schedule a repair procedure to replace the affected CPU. Use 'fmadm faulty' to identify the module.
Details
- Message ID: INTEL-8000-LE indicates that the Solaris Fault Manager has received an error report indicating that a processor has experienced an error in the level 1 data cache.
If an uncorrectable or fatal processor error is reported by a machine check exception, Solaris will generate an ereport and then proceed to panic the system, followed by a warm reset. Upon reboot, the Solaris operating system will replay the error telemetry and diagnose a faulty CPU, resulting in that CPU/processor being off-lined. Performance of this system may be affected.
Your system may provide the capability to illuminate a service-required LED on or near a faulty component as an aid in physically locating the faulty component. The Sun products listed below provide this feature.
Sun Fire X2270, Sun Fire X4170, Sun Fire X4270, Sun Fire X4275, Sun Blade X6270, Sun Blade X6275
The service-required LED indicator next to the faulty processor will be illuminated. However, on systems equipped with a "Fault Remind" button on the motherboard, the service-required LED illuminates only while this button is pressed.
The status of the processor's service-required LED is displayed from the ILOM CLI as:
/SYS/MB/P0/SERVICE = On
Refer to the service label located on the top cover for the location of the processor and the "Fault Remind" button.
The chassis-wide service-required LED on the rackmount server or blade server will also be illuminated. The status of the chassis-wide service-required LED is displayed from the ILOM CLI as:
/SYS/SERVICE = On
The recommended service actions for this event are as follows:
Section A - Identify Faulty Component / FRU
Section B - Contact Authorized Service Provider
Section C - Clearing Fault after Replacement
Section D - Enable Off-lined Resource
Section A - Identify Faulty Component / FRU
LEGEND: GREEN = FRU, BLUE = EVENT_ID, RED = SUNW-MSG-ID
Solaris provides commands that can be used to obtain information about faults present in the system.
The "fmadm faulty" command is the preferred method for obtaining fault information.
The "fmdump -av" command is an alternative method for obtaining similar fault information.
- Solaris nomenclature numbering assignment is zero-based, therefore all numbering begins with the number 0.
- The term "chip" is used to describe the physical CPU/Processor.
- The term "cpuid" is used to describe the logical/virtual CPU/Processor.
- A logical CPU is a strand/thread of a core contained within a physical CPU/Processor.
- The field replaceable unit is the physical CPU/Processor.
Note: The Event-ID shown in Step A1 and the UUID shown in Step A2 are one and the same, as are the MSG-ID shown in Step A1 and the SUNW-MSG-ID shown in Step A2.
Step A1. How to identify the faulty processor/chip using "fmadm faulty".
Example: Use fmadm(1M) faulty to list the faulty FRUs, unique Event_ID, and Sun Message_ID.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Sep 26 20:46:15 8a47c7ba-8e66-c19f-f874-8f22a2b70cac INTEL-8000-LE  Major
Fault class : fault.cpu.intel.l1dcache
Affects     : cpu:///cpuid=1
                  degraded but still in service
FRU         : hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000:server-id=wgs40-117/motherboard=0/chip=1
                  faulty
Description : A level 1 data cache on this cpu is faulty. Refer to
              http://sun.com/msg/INTEL-8000-LE for more information.
Response    : The system will attempt to offline this cpu to remove it
              from service.
Impact      : Performance of this system may be affected.
Action      : Schedule a repair procedure to replace the affected CPU. Use
              'fmadm faulty' to identify the module.
If your product does not provide the "fmadm faulty" output as shown above, proceed to Step A2.
The example shown in Step A1:
- Identifies the logical/virtual processor as "cpuid=1" in the line beginning with Affects.
- Identifies the physical processor as "chip=1" in the line beginning with FRU.
- Identifies the location of the physical processor within the server as "/motherboard=0/chip=1".
There is typically one motherboard in a server, and it is commonly referred to as "motherboard=0". The physical processor, "/chip=1", refers to the second physical processor on the motherboard.
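The reading of the FRU line described above can be sketched in a few lines of Python. This is only an illustration, not part of the Solaris tooling: the function name parse_hc_fmri is hypothetical, and the sketch assumes the path portion of the FMRI is a sequence of name=number pairs separated by "/", as in the example output.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into its authority and its name=instance pairs.

    Hypothetical helper: assumes every path component looks like name=<integer>,
    which matches the example output in this article.
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)
    rest = fmri[len(prefix):]
    # Everything before the first "/" is the authority
    # (product-id, chassis-id, server-id); the remainder is the path.
    authority, _, path = rest.partition("/")
    components = {}
    for part in path.split("/"):
        if not part:
            continue
        name, _, instance = part.partition("=")
        components[name] = int(instance)
    return {"authority": authority, "path": components}


fru = ("hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000"
       ":server-id=wgs40-117/motherboard=0/chip=1")
print(parse_hc_fmri(fru)["path"])  # {'motherboard': 0, 'chip': 1}
```

Because numbering is zero-based, a "chip" value of 1 in the result corresponds to the second physical processor on the motherboard.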
Proceed to Step A3.
Step A2. How to identify the faulty processor/chip using "fmdump -av".
Example: Use fmdump(1M) to list the faulty FRUs, unique Event_ID, and Sun Message_ID.
# fmdump -av
TIME                 UUID                                 SUNW-MSG-ID
Sep 26 20:46:15.5823 8a47c7ba-8e66-c19f-f874-8f22a2b70cac INTEL-8000-LE
  100%  fault.cpu.intel.l1dcache
        Problem in: hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000:server-id=wgs40-117/motherboard=0/chip=1/core=0/strand=0
           Affects: cpu:///cpuid=1
               FRU: hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000:server-id=wgs40-117/motherboard=0/chip=1
          Location: -
The example shown in Step A2:
- Identifies the logical/virtual processor as "cpuid=1" in the line beginning with Affects.
- Identifies the physical processor as "chip=1" in the line beginning with FRU.
- Identifies the strand contained within the physical processor as "strand=0" in the line beginning with Problem.
- Identifies the location of the physical processor within the server as "/motherboard=0/chip=1".
There is typically one motherboard in a server, and it is commonly referred to as "motherboard=0". The physical processor, "/chip=1", refers to the second physical processor on the motherboard.
Proceed to Step A3.
Step A3. How to identify the "physical" location of the faulty processor/chip.
Sun Platforms
A service label on the top cover and silk screen on the motherboard can assist with identifying the physical location of the processor/chip. Typically, there is a label in the proximity of the processor/chip.
The label nomenclature for a CPU/Processor is Px, where 'x' represents the processor number.
Sun platforms use zero-based numbering for their labeling scheme, so the faulty processor/chip identified in the example above ("/chip=1") would be interpreted as "P1". For example:
chip=0 maps to the physical location labeled "P0".
chip=1 maps to the physical location labeled "P1".
chip=2 maps to the physical location labeled "P2".
chip=3 maps to the physical location labeled "P3".
Non-Sun Platforms
A service label on the top cover or silk screen on the motherboard may exist to assist with identifying the physical location of the processor/chip. Typically, there is a label in the proximity of the processor/chip.
The label nomenclature for a CPU/Processor can take the form Px, CPUx, CHIPx, SCKTx, etc., where 'x' represents the processor number.
If the product uses zero-based numbering for its labeling scheme, as Sun's products do, the faulty processor identified in the example ("/chip=1") would be identified as P1, CPU1, CHIP1, SCKT1, etc. For example:
chip=0 maps to the physical location labeled P0, CPU0, CHIP0, SCKT0, etc.
chip=1 maps to the physical location labeled P1, CPU1, CHIP1, SCKT1, etc.
chip=2 maps to the physical location labeled P2, CPU2, CHIP2, SCKT2, etc.
chip=3 maps to the physical location labeled P3, CPU3, CHIP3, SCKT3, etc.
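The chip-to-label mapping above is mechanical, so it can be written down as a short Python sketch. The function name candidate_labels and its zero_based parameter are hypothetical; the one-based case is shown only as an assumption about products that do not follow the zero-based scheme, and the actual label must always be confirmed against the service label or silk screen.

```python
def candidate_labels(chip, prefixes=("P", "CPU", "CHIP", "SCKT"), zero_based=True):
    """Return the labels a service technician might look for on the board.

    chip       -- the N from "chip=N" in the FRU line
    prefixes   -- label prefixes named in this article; a product may use others
    zero_based -- True for Sun-style labeling; False is a hypothetical
                  one-based scheme, shown for illustration only
    """
    if chip < 0:
        raise ValueError("chip number must not be negative")
    number = chip if zero_based else chip + 1
    return ["%s%d" % (prefix, number) for prefix in prefixes]


print(candidate_labels(1))                   # ['P1', 'CPU1', 'CHIP1', 'SCKT1']
print(candidate_labels(1, prefixes=("P",)))  # ['P1']
```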
The recommended service action for this event is to replace the faulty processor. The processor is not customer serviceable and requires repair by an authorized service provider.
Section B - Contact Authorized Service Provider
Please contact your service provider in accordance with the terms and conditions of your service agreement to open a service request and to confirm and carry out the required repair actions specified by current service policy. Your service provider may ask for information displayed using the procedures outlined in Section A.
If the product is covered by a current service agreement with Sun Microsystems, Inc., please refer to the following instructions for reporting the problem and opening a service request.
Auto Service Request (ASR) Activated for the Product
If ASR has been activated for the product on which this problem was diagnosed, you have received, or will receive, a notification via e-mail confirming that a service request has been automatically opened, along with instructions for viewing the service request. All of the fault event telemetry required to open a service request has already been transmitted to Sun Services. Unless contacted and instructed otherwise by a Sun Service representative, no further action is required to report this problem to Sun Microsystems. The e-mail notification will provide a pointer back to this same article.
If you are reading this article in response to a fault message or SNMP trap generated on the product, rather than in response to the ASR notification e-mail above, then you can check on the status of the associated service request by logging into the Members Support Center at http://sun.com/support
For more information on Auto Service Request (ASR) and the currently supported products, please refer to http://sun.com/service/asr
Submitting a Service Request via the Members Support Center Portal
In cases where ASR has not been activated:
2. Create a Service Request
3. Copy and paste the information displayed using the instructions provided in Section A to identify the faulty FRU(s), into the Service Request notes.
Section C - Clearing Fault after Replacement
LEGEND: GREEN = FRU, BLUE = EVENT_ID, RED = SUNW-MSG-ID
Solaris Command to Clear the Fault
Once the processor/chip has been physically replaced and the system has been rebooted, a fault management command is required to clear the processor/chip fault from the Solaris fault manager's resource cache, so that the cache accurately reflects that the fault is no longer present.
Invoke the "fmadm repair" command along with the UUID (Universally Unique IDentifier) associated with the faulted processor.
e.g.
Sep 26 20:46:15 8a47c7ba-8e66-c19f-f874-8f22a2b70cac INTEL-8000-LE  Major
Note: The Event-ID shown in Step A1 and the UUID shown in Step A2 are one and the same, as are the MSG-ID shown in Step A1 and the SUNW-MSG-ID shown in Step A2.
Example: Use fmadm(1M) repair to clear the fault using the UUID (Universally Unique Identifier).
# fmadm repair 8a47c7ba-8e66-c19f-f874-8f22a2b70cac
fmadm: recorded repair to 8a47c7ba-8e66-c19f-f874-8f22a2b70cac
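As a small aid when copying the UUID by hand is error-prone, the following Python sketch extracts the UUID from a summary line such as the one shown above and prints the matching "fmadm repair" command. The function name repair_command is hypothetical, and the sketch only builds the command text; it does not run anything on the system.

```python
import re

# Matches the 8-4-4-4-12 hexadecimal UUID layout used in the examples above.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def repair_command(summary_line):
    """Build the 'fmadm repair <UUID>' command text from a fault summary line."""
    match = UUID_RE.search(summary_line)
    if match is None:
        raise ValueError("no UUID found in: %r" % summary_line)
    return "fmadm repair " + match.group(0)


line = "Sep 26 20:46:15 8a47c7ba-8e66-c19f-f874-8f22a2b70cac INTEL-8000-LE Major"
print(repair_command(line))
# fmadm repair 8a47c7ba-8e66-c19f-f874-8f22a2b70cac
```

The same function works on the first data line of the "fmdump -av" output, because the Event-ID and the UUID are the same value.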
ILOM Command to Clear the Fault
On Sun platforms with ILOM 2.0 or later on the service processor, you may also need to clear the processor/chip fault from the ILOM fault manager's resource cache, so that the cache accurately reflects that the fault is no longer present, and to extinguish the service-required LED for the faulty CPU/processor and the chassis-wide service-required LED.
The nomenclature ILOM uses to describe a CPU/Processor is Px, where 'x' represents the processor number.
Sun platforms use zero-based numbering for their labeling scheme, so the faulty processor/chip identified in the example above ("/chip=1") would be interpreted as "P1". For example:
chip=0 maps to the physical location labeled "P0".
chip=1 maps to the physical location labeled "P1".
chip=2 maps to the physical location labeled "P2".
chip=3 maps to the physical location labeled "P3".
Log in to the ILOM command line interface as 'root' and use the following command to clear the fault.
Example:
-> set /SYS/MB/P1 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P1 (y/n)? y
Set 'clear_fault_action' to 'true'
NOTE: The example above specifically clears the fault associated with "P1". You will need to provide the CPU/processor number in the command above as determined by the process defined in Section A of this article.
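To make the substitution in the NOTE concrete, here is a minimal Python sketch that builds the ILOM command text for a given chip number. The function name ilom_clear_fault_command is hypothetical, and the /SYS/MB/Px target is taken from the example above; other platforms may expose the processor under a different ILOM path, so treat the output as a starting point to verify, not a command to paste blindly.

```python
def ilom_clear_fault_command(chip):
    """Return the ILOM CLI command text that clears the fault on processor Px.

    chip -- the N from "chip=N" found in Section A (zero-based on Sun platforms).
    Assumes the /SYS/MB/P<N> target shown in this article's example.
    """
    if chip < 0:
        raise ValueError("chip number must not be negative")
    return "set /SYS/MB/P%d clear_fault_action=true" % chip


print(ilom_clear_fault_command(1))  # set /SYS/MB/P1 clear_fault_action=true
```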
Section D - Enable Off-lined Resource
LEGEND: GREEN = FRU, BLUE = EVENT_ID, RED = SUNW-MSG-ID
The term "chip" is used to describe the physical CPU/Processor.
The term "cpuid" is used to describe the logical/virtual CPU/Processor (e.g. Affects: cpu:///cpuid=1).
A logical CPU is a strand/thread of a core contained within a physical CPU/Processor.
Solaris assigns CPU numbers based on threads/strands, not cores or chips. A chip that consists of four (4) dual-threaded cores would show up as eight (8) logical CPUs, and a chip that consists of two (2) dual-threaded cores would show up as four (4) logical CPUs.
Verify the status of the processor/chip in the active configuration by using the "psrinfo" command to identify the logical/virtual CPU/Processor that was faulted.
Example: Use psrinfo(1M) to obtain the status of logical/virtual processors/chips.
# psrinfo
0       on-line   since 09/26/2008 18:04:57
1       faulted   since 09/26/2008 20:46:15
2       on-line   since 09/26/2008 18:05:19
3       on-line   since 09/26/2008 18:05:21
4       on-line   since 09/26/2008 18:05:23
5       on-line   since 09/26/2008 18:05:25
6       on-line   since 09/26/2008 18:05:27
7       on-line   since 09/26/2008 18:05:29
8       on-line   since 09/26/2008 18:05:31
9       on-line   since 09/26/2008 18:05:33
10      on-line   since 09/26/2008 18:05:35
11      on-line   since 09/26/2008 18:05:37
12      on-line   since 09/26/2008 18:05:39
13      on-line   since 09/26/2008 18:05:41
14      on-line   since 09/26/2008 18:05:43
15      on-line   since 09/26/2008 18:05:45
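On systems with many logical CPUs, scanning this listing by eye is tedious. The following Python sketch picks out the faulted entries from saved "psrinfo" output. The function names are hypothetical, and the parser assumes each line has the shape "<cpuid> <state> since <date> <time>" as in the example above; lines that do not fit are ignored.

```python
def parse_psrinfo(text):
    """Map each logical CPU id to its state from plain 'psrinfo' output."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: <cpuid> <state> since <date> <time>
        if len(fields) >= 3 and fields[0].isdigit() and fields[2] == "since":
            states[int(fields[0])] = fields[1]
    return states


def faulted_cpus(text):
    """Return the sorted list of logical CPU ids reported as 'faulted'."""
    return sorted(cpu for cpu, state in parse_psrinfo(text).items()
                  if state == "faulted")


sample = """\
0       on-line   since 09/26/2008 18:04:57
1       faulted   since 09/26/2008 20:46:15
2       on-line   since 09/26/2008 18:05:19
"""
print(faulted_cpus(sample))  # [1]
```

The ids returned are "cpuid" values, so they correspond to the number in the line beginning with Affects, not to the "chip" number.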
Enable the faulted processor/chip back into the active configuration by using the "psradm" command.
Example: Use psradm(1M) to online a processor.
# psradm -F -n 1
where '1' represents the "cpuid" number in the line beginning with Affects.
Verify the status of the processor/chip in the active configuration to confirm the logical/virtual CPU number is enabled.
Example: Use psrinfo(1M) to obtain the status of processors/chips.
# psrinfo
0       on-line   since 09/26/2008 18:04:57
1       on-line   since 09/26/2008 20:57:15
2       on-line   since 09/26/2008 18:05:19
3       on-line   since 09/26/2008 18:05:21
4       on-line   since 09/26/2008 18:05:23
5       on-line   since 09/26/2008 18:05:25
6       on-line   since 09/26/2008 18:05:27
7       on-line   since 09/26/2008 18:05:29
8       on-line   since 09/26/2008 18:05:31
9       on-line   since 09/26/2008 18:05:33
10      on-line   since 09/26/2008 18:05:35
11      on-line   since 09/26/2008 18:05:37
12      on-line   since 09/26/2008 18:05:39
13      on-line   since 09/26/2008 18:05:41
14      on-line   since 09/26/2008 18:05:43
15      on-line   since 09/26/2008 18:05:45
Optionally, the "psrinfo -vp" command can be used to show the logical/virtual CPU/processor numbering assigned to each physical processor.
e.g. A chip that consists of four (4) dual-threaded cores would show up as eight (8) logical CPUs.
# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (0-3 8-11)
  The core has 2 virtual processors (0 8)
  The core has 2 virtual processors (1 9)
  The core has 2 virtual processors (2 10)
  The core has 2 virtual processors (3 11)
    x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2533 MHz)
      Intel(r) Xeon(r) CPU E5540 @ 2.53GHz
The physical processor has 4 cores and 8 virtual processors (4-7 12-15)
  The core has 2 virtual processors (4 12)
  The core has 2 virtual processors (5 13)
  The core has 2 virtual processors (6 14)
  The core has 2 virtual processors (7 15)
    x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2533 MHz)
      Intel(r) Xeon(r) CPU E5540 @ 2.53GHz
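The grouping shown in this output can be extracted with a short Python sketch, which is useful for checking which logical CPU ids share a physical processor. The function names are hypothetical, and the parser assumes the "The physical processor has ... virtual processors (...)" wording shown above; the order of the groups is simply the order of the listing and is not guaranteed by this sketch to equal the "chip=N" number.

```python
import re

# Captures the id list in parentheses on each "physical processor" line.
PHYS_RE = re.compile(
    r"^The physical processor has .*virtual processors? \(([^)]*)\)"
)


def expand_ids(spec):
    """Expand a spec such as '0-3 8-11' into [0, 1, 2, 3, 8, 9, 10, 11]."""
    ids = []
    for token in spec.replace(",", " ").split():
        if "-" in token:
            low, high = token.split("-")
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(token))
    return ids


def physical_processor_groups(text):
    """Return one list of logical CPU ids per physical processor, in listing order."""
    groups = []
    for line in text.splitlines():
        match = PHYS_RE.match(line.strip())
        if match:
            groups.append(expand_ids(match.group(1)))
    return groups


sample = """\
The physical processor has 4 cores and 8 virtual processors (0-3 8-11)
  The core has 2 virtual processors (0 8)
The physical processor has 4 cores and 8 virtual processors (4-7 12-15)
  The core has 2 virtual processors (4 12)
"""
for group in physical_processor_groups(sample):
    print(len(group), group)
# 8 [0, 1, 2, 3, 8, 9, 10, 11]
# 8 [4, 5, 6, 7, 12, 13, 14, 15]
```

Each group has eight entries, which matches the four dual-threaded cores per chip described above.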
|