Tuesday, June 21, 2005

EMC event memory

==============================================================================
TOPIC: EMC event?

==============================================================================

== 1 of 3 ==
Date: Fri 17 Jun 2005 10:08
From: Scott Howard

Michael Tosch wrote:
> Dragan Cvetkovic wrote:
>> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 574815 kern.info] [AFT0] errID 0x0001d2fa.d7fd18c0 Corrected Mtag Error on J0406 is Persistent
>> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 214705 kern.info] [AFT0] errID 0x0001d2fa.d7fd18c0 MTAG Check Bit 3 was in error and corrected
>>
>> but the phrase "EMC event" returns nothing on both google and sunsolve
>> (otherwise lot of hits related to EMC, but no such HW here).
>
> A correctable bit error occurred.

Correct. The system has ECC (Error Checking and Correcting) memory, so it
both detected and corrected the error with no impact on the running system.

> Please turn to Sun Support, you will certainly get a hardware replacement.

No, you won't. Single, correctable memory errors do not indicate a
problem with the memory. If the errors continue then it may warrant a
replacement, but a single error certainly doesn't.

If you want to confirm, download the "cediag" tool from
http://sunsolve.sun.com/pub-cgi/show.pl?target=cediag which will analyse
your system and determine if any memory/CPUs need to be replaced.
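
The "do the errors continue?" question can be sketched in a few lines of Python. This is only an illustration, not how cediag works: the regular expression is modelled on the single "Corrected Mtag Error on J0406 is Persistent" line quoted in this thread, the function name is made up, and the sketch assumes each log entry sits on one unwrapped line. It counts distinct errIDs per module and deliberately sets no replacement threshold.

```python
import re
from collections import defaultdict

# Pattern modelled on the one message quoted in this thread (an assumption,
# not a complete description of every corrected-error message format).
PATTERN = re.compile(r"errID (\S+) Corrected \w+ Error on (\S+) is (\w+)")

def count_corrected_errors(lines):
    """Return {module: number of distinct errIDs} for corrected-error lines."""
    seen = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            err_id, module, _classification = match.groups()
            seen[module].add(err_id)  # one errID is one event, however many lines it spans
    return {module: len(ids) for module, ids in seen.items()}

sample = [
    "Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 574815 kern.info] [AFT0] "
    "errID 0x0001d2fa.d7fd18c0 Corrected Mtag Error on J0406 is Persistent",
    "Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 214705 kern.info] [AFT0] "
    "errID 0x0001d2fa.d7fd18c0 MTAG Check Bit 3 was in error and corrected",
]
counts = count_corrected_errors(sample)
```

A single event on one module, as here, is the case that does not call for replacement; a count that keeps growing for the same module is the pattern to watch for.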

Scott

== 2 of 3 ==
Date: Fri 17 Jun 2005 12:35
From: Michael Laajanen

Hi,

> Hi,
>
> just noticed the following in a log file of SF280R (2 x 900MHz CPUs,
> Solaris 9 Generic_112233-11):
>
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 284628 kern.info] NOTICE: [AFT0] EMC Event detected by CPU0 at TL=0, errID 0x0001d2fa.d7fd18c0
> Jun 16 06:51:17 sc2 AFSR 0x00010000<EMC>.00080000 AFAR 0x00000000.06112130
> Jun 16 06:51:17 sc2 Fault_PC 0x11788e0 Msynd 0x0008 J0406
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 574815 kern.info] [AFT0] errID 0x0001d2fa.d7fd18c0 Corrected Mtag Error on J0406 is Persistent
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 214705 kern.info] [AFT0] errID 0x0001d2fa.d7fd18c0 MTAG Check Bit 3 was in error and corrected
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 458748 kern.info] [AFT2] errID 0x0001d2fa.d7fd18c0 PA=0x00000000.06112100
> Jun 16 06:51:17 sc2 E$tag 0x00000000.18000002 E$state_4 Invalid
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x000d8633.000d8614 0x000d8600.000d85f6 ECC 0x1ab
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x000d85ea.000d85c7 0x000d8576.000d8571 ECC 0x013
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x000d8552.000d807a 0x000d805c.000d8051 ECC 0x092
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x000d8030.000d801c 0x000d8016.000d8007 ECC 0x080
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
>
> but the phrase "EMC event" returns nothing on both google and sunsolve
> (otherwise lot of hits related to EMC, but no such HW here).
>
> What does above mean?
>
> TIA,
>
> Dragan
>
Just curious, what CPU is the UltraSPARC-III+?

/michael

== 3 of 3 ==
Date: Fri 17 Jun 2005 12:17
From: Gavin Maltby

Hi

Andreas Almroth wrote:
> Dragan Cvetkovic wrote:
>> Michael Tosch writes:
>>
>>
>>> Dragan Cvetkovic wrote:
>>>
>>>> Hi,
>>>> just noticed the following in a log file of SF280R (2 x 900MHz CPUs,
>>>> Solaris 9 Generic_112233-11):
>>>> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 284628 kern.info] NOTICE:
>>>> [AFT0] EMC Event detected by CPU0 at TL=0, errID 0x0001d2fa.d7fd18c0
>>>> Jun 16 06:51:17 sc2 AFSR 0x00010000<EMC>.00080000 AFAR
>>>> 0x00000000.06112130
>>>> Jun 16 06:51:17 sc2 Fault_PC 0x11788e0 Msynd 0x0008 J0406
>>>> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 574815 kern.info]
>>>> [AFT0] errID 0x0001d2fa.d7fd18c0 Corrected Mtag Error on J0406 is
>>>> Persistent
>>>> Jun 16 06:51:17 sc2 SUNW,UltraSPARC-III+: [ID 214705 kern.info]
>>>> [AFT0] errID 0x0001d2fa.d7fd18c0 MTAG Check Bit 3 was in error and
>>>> corrected

An EMC event is a hardware-corrected single-bit error experienced
in the "MTag" portion of the cacheline.

MTags are not actually used on a SF280R - they're only used on bigger
systems that use the SSM mode: SF15K, SF12K, SF25K. Nonetheless
they should always read as zero on systems that don't use them -
they're still part of the checkword (128 bits + some metadata = 144 bits)
stored in memory and the ECC code protecting MTags (separate to that
protecting the data) is still checked.
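
To tie this back to the log, here is a small Python sketch that decodes the AFSR value printed above. The bit positions are inferred from the logged values themselves (the <EMC> annotation sits on the upper word 0x00010000, and "Msynd 0x0008" matches bits 16-19 of the lower word); treat them as assumptions rather than a quotation from the processor documentation. The function names are made up for illustration.

```python
# Assumed layout, inferred from the log line in this thread:
AFSR_EMC = 1 << 48        # bit the log annotates as <EMC>
AFSR_M_SYND_SHIFT = 16    # MTag ECC syndrome assumed to start at bit 16
AFSR_M_SYND_MASK = 0xF    # assumed 4-bit syndrome

def parse_afsr(text):
    """Turn a logged value like '0x00010000<EMC>.00080000' into a 64-bit int."""
    high, low = text.split(".")
    high = high.split("<")[0]  # drop the <EMC> annotation
    return (int(high, 16) << 32) | int(low, 16)

def decode_afsr(value):
    """Report whether EMC is set and which MTag syndrome bits are raised."""
    msynd = (value >> AFSR_M_SYND_SHIFT) & AFSR_M_SYND_MASK
    return {
        "emc": bool(value & AFSR_EMC),
        "msynd": msynd,
        "syndrome_bits": [bit for bit in range(4) if (msynd >> bit) & 1],
    }

afsr = parse_afsr("0x00010000<EMC>.00080000")
decoded = decode_afsr(afsr)
```

With the value from the log, the only syndrome bit raised is bit 3, which lines up with the "MTAG Check Bit 3 was in error and corrected" message.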

If this is an isolated event you have nothing to worry about - some bit flips
are expected from time to time in memory, and this one just happened to hit
an MTag. If this is part of a pattern of behaviour you may need to replace
the memory module. This event will have been counted and since you
report no message bleating about replacement consider the above
informational only. In Solaris 10 it would have gone to the
error log and not appeared in messages, and the diagnosis engine
would do the Right Thing.

>>
>> [snip]
>>
>>
>>> A correctable bit error occurred.
>>> Please turn to Sun Support, you will certainly get a hardware
>>> replacement.

No, you should not - unless this is part of a pattern.

>>
>> Thanks. But whose error is that: CPU, memory, cache, ...?

Memory. But it is an "error" and not necessarily a "fault".

>> Bye, Dragan
>>
>
> Looks to me that memory module in position J0406 have a persistent
> problem, indicating the module is probably faulty/degraded.

No. The "Persistent" classification is a horrible term. It's actually
the good case - it means that after the initial event we had another look
and the error was still there. We then rewrite the memory address and
check again, and if it is fixed we have the final classification of
"Persistent" (meaning fixable!); if we could not clear it then it would
have been labelled "Sticky".
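
The sequence just described can be sketched in Python against a simulated memory cell. This is a toy model, not the kernel code: SimulatedCell and classify are hypothetical names, and the label for the case where the second look finds nothing is not given above, so "Intermittent" here is an assumption.

```python
class SimulatedCell:
    """Hypothetical stand-in for one memory location that may hold a flipped bit."""

    def __init__(self, error_on_reread, clears_on_rewrite):
        self.has_error = error_on_reread
        self.clears_on_rewrite = clears_on_rewrite

    def reread_shows_error(self):
        return self.has_error

    def rewrite(self):
        # Writing corrected data back fixes a soft error but not a stuck bit.
        if self.clears_on_rewrite:
            self.has_error = False

def classify(cell):
    """Classify a corrected error using the look / rewrite / look-again steps."""
    if not cell.reread_shows_error():
        return "Intermittent"  # assumed label; this case is not named in the post
    cell.rewrite()
    if cell.reread_shows_error():
        return "Sticky"        # could not be cleared
    return "Persistent"        # still there on the second look, but fixable

soft_error = classify(SimulatedCell(error_on_reread=True, clears_on_rewrite=True))
stuck_bit = classify(SimulatedCell(error_on_reread=True, clears_on_rewrite=False))
```

In this model the J0406 event from the log corresponds to the soft_error case: the error survived a second look and went away once the address was rewritten.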

In Solaris 10 the terminology is relegated to the error log rather than
messages, where hopefully it will confuse fewer people. In Solaris Express,
OpenSolaris etc. the terminology has been replaced and the classification
algorithm enhanced - that will likely appear in Solaris 10 Update 1.

> If it appear only once, then it is maybe just a glitch, but if you see
> more of these messages, then the memory should be replaced.

That part is true.

Gavin
