Summary: How to find a faulty processor

From: Convey, Simon <simon.convey_at_csfb.com>
Date: Fri Sep 21 2001 - 06:48:34 EDT

Managers,
          Many thanks to those who replied (Jon, Jay, Johan, Hendrik,
Casper, Patrico, Petri, and Roland).
Most suggestions involved using psradm to take one processor offline at a
time to isolate the fault. This was my first thought too, but the program
runs fairly infrequently, and even then only dumps core occasionally. I
calculated that statistically it would take about 6 weeks to find the
processor, 12 weeks at worst. Although the program dumps core
infrequently, the impact on downstream applications is huge: sysadmins get
called out at horrible hours of the morning to replay database logs and
all that stuff we hate doing.
	pbind was suggested, and this is a good approach, though
interestingly it is not possible to bind to the 'current' processor, only
to an explicitly named one. This follows from the fact that even if
Solaris did have a get_cpuid() function, its result would only be valid at
the moment of the call, since the very next clock tick might timeslice the
process off the current cpu, later to be restarted on a (possibly)
different processor. Statistically, it is likely to restart on the same
processor because of affinity rules, but this is not guaranteed, so it
would only serve to 'point me in the right direction' rather than give a
concrete answer about which cpu was doing this bit flip.
	It then struck me that this did not matter: we can start a
process, let the scheduler start it anywhere, and then immediately pbind it
to one of the 6 cpus at random. Any core dumps found can then be analysed
for a variable holding the bound cpu. Here's the code. BTW, cpus in
Solaris are not necessarily numbered contiguously.

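#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

int init_cpus(processorid_t *cpu);	/* not shown in this post; assumed to fill cpu[] with valid ids and return the count */

int main(void)
{
	processorid_t cn = 0;		/* ends up holding the bound cpu id; look for it in the core */
	processorid_t cpu[64];
	int ncpu;

	ncpu = init_cpus(cpu);

	srand(getpid());
	cn = rand() % ncpu;
	if (processor_bind(P_PID, P_MYID, cpu[cn], NULL) == -1)
		perror("processor_bind");

	/* read the binding back so that cn holds the real processor id */
	processor_bind(P_PID, P_MYID, PBIND_QUERY, &cn);
	printf("cpu %d of %d\n", cn, ncpu);
	return 0;
}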
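
The listing calls init_cpus() without showing it. Since processor ids can
have gaps, one plausible shape for it is to probe every candidate id and
keep the ones that exist. What follows is only a sketch under that
assumption, not the code from this post: collect_cpu_ids(), cpu_probe_fn
and demo_probe() are made-up names, and the probe is passed in as a
callback so the loop itself does not depend on Solaris. On Solaris the
probe could probably wrap p_online(id, P_STATUS) != -1.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original post.
 * A probe returns non-zero if a processor with this id exists. */
typedef int (*cpu_probe_fn)(int id);

/* Probe ids 0..max_id and store those that exist in out[].
 * Returns the number of ids stored (never more than cap). */
size_t collect_cpu_ids(int max_id, cpu_probe_fn present,
                       int *out, size_t cap)
{
	size_t n = 0;
	int id;

	for (id = 0; id <= max_id && n < cap; id++) {
		if (present(id))
			out[n++] = id;
	}
	return n;
}

/* Example probe for a made-up 6-cpu box whose ids are 0,1,4,5,8,9.
 * On Solaris a real probe might instead be:
 *     return p_online(id, P_STATUS) != -1;
 */
int demo_probe(int id)
{
	return id == 0 || id == 1 || id == 4 ||
	       id == 5 || id == 8 || id == 9;
}
```

With the demo probe, collect_cpu_ids(15, demo_probe, ids, 64) stores the
six ids 0,1,4,5,8,9 in that order, which is the kind of array the random
index in the listing is meant to pick from.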

So, in a nutshell, the process is started by the scheduler on any available
cpu, then binds itself for the rest of its life to a randomly chosen cpu,
and if it dumps core, the core will contain a variable (cn) identifying the
cpu it was bound to. I have omitted the guts of the test, which uses code
from the same libraries as our crashing process.
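
Once a few cores have been collected, the bound-cpu values pulled out of
them need tallying. A minimal sketch of that step, assuming the ids have
already been extracted from the cores by hand or with a debugger:
most_common_cpu() and MAX_CPU_ID are hypothetical names, and the limit of
63 is just taken from the cpu[64] array in the listing.

```c
#include <stddef.h>

#define MAX_CPU_ID 63	/* assumption: mirrors cpu[64] in the listing */

/* Hypothetical helper, not from the original post.
 * bound[] holds the bound-cpu id recovered from each core dump.
 * Returns the id seen most often (lowest id wins a tie), or -1 if
 * there are no usable ids. If hits is non-NULL it receives the count. */
int most_common_cpu(const int *bound, size_t ndumps, size_t *hits)
{
	size_t count[MAX_CPU_ID + 1] = {0};
	size_t i;
	int id, best = -1;

	/* count dumps per cpu id, ignoring out-of-range values */
	for (i = 0; i < ndumps; i++) {
		if (bound[i] >= 0 && bound[i] <= MAX_CPU_ID)
			count[bound[i]]++;
	}
	/* pick the id with the highest count */
	for (id = 0; id <= MAX_CPU_ID; id++) {
		if (count[id] > 0 && (best < 0 || count[id] > count[best]))
			best = id;
	}
	if (hits != NULL)
		*hits = (best >= 0) ? count[best] : 0;
	return best;
}
```

If nearly all dumps land on one id, that processor is the suspect; if they
spread across the cpus roughly evenly, that would point away from a single
bad processor. With only a handful of dumps either reading is weak.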

Sadly, this effort is all to disprove Sun's suggestion that it's a
hardware fault. We would have expected a kernel panic by now if it really
was a hardware fault. The fault is more likely to lie in process linkage,
in ld.so.1?

Thanks for all your suggestions, much appreciated.

Simon.
