Thursday, August 16, 2012

Running Hardware diags on Solaris and Linux

Solaris -

# prtdiag -v

The default tool that ships with Solaris.
It displays system configuration and diagnostic information.
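
As far as I know, the prtdiag man page documents an exit status of 0 when no failures are detected, 1 when failures or errors are detected, and 2 for an internal prtdiag error. Assuming that holds on your platform, a small sketch like the one below can turn the status into a message for a cron job or health check. The function name is made up for illustration.

```shell
# Map a prtdiag exit status to a short message.
# Assumption: 0 = no failures, 1 = failures detected, 2 = internal error,
# as described in the prtdiag man page; verify on your own release.
describe_prtdiag_status() {
    case "$1" in
        0) echo "no failures detected" ;;
        1) echo "failures or errors detected" ;;
        2) echo "internal prtdiag error" ;;
        *) echo "unexpected exit status: $1" ;;
    esac
}

# Usage on a Solaris host:
#   prtdiag -v > /tmp/prtdiag.out; describe_prtdiag_status $?
```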

------------------------------------------------------

#/opt/FJSVmadm/sbin/hrdconf -l

Fujitsu's native hardware command for SPARC servers. It gives a detailed listing of the hardware configuration and status.
------------------------------------------------------


Interactive

#/opt/FJSVmadm/sbin/madmin

The following menu options show any hardware failures or alerts.

- 2. Hardware Monitoring Information
-  1. Management of Hardware Error Event
- select d 

You will get something like this -

Display abnormalities

Units detected abnormality:

No.  Date                     Unit


Select unit by number.
You can confirm details and select appropriate action.

------------------------------------------------------

'/var/adm/messages' is a very good place to look for hardware-related messages. Solaris logs kernel warnings and notices to this file through syslog.

If there is any doubt, always go through the logs.

Example: The following events indicate a SCSI controller failure that was not visible in prtdiag.


Jan 11 23:24:19 wpsh214 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@5 (glm2):
Jan 11 23:24:19 wpsh214         Unexpected DMA state: WAIT. dstat=a0<DMA-FIFO-empty,bus-fault>
Jan 11 23:24:19 wpsh214 genunix: [ID 408822 kern.info] NOTICE: glm2: fault detected in device; service still available
Jan 11 23:24:19 wpsh214 genunix: [ID 611667 kern.info] NOTICE: glm2: Unexpected DMA state: WAIT. dstat=a0<DMA-FIFO-empty,bus-fault>
Jan 11 23:24:19 wpsh214 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@5 (glm2):
Jan 11 23:24:19 wpsh214         got SCSI bus reset
Jan 11 23:24:19 wpsh214 genunix: [ID 408822 kern.info] NOTICE: glm2: fault detected in device; service still available
Jan 11 23:24:19 wpsh214 genunix: [ID 611667 kern.info] NOTICE: glm2: got SCSI bus reset
Jan 11 23:24:27 wpsh214 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@5/sd@c,0 (sd48):
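
To pull records like these out of a busy log, we can filter on the kern.warning tag and count the warnings per driver instance. This is only a sketch: the function names are made up, it assumes the syslog line format shown above, and it assumes POSIX utilities (on Solaris 10 that usually means putting /usr/xpg4/bin ahead in PATH so that grep -E is available).

```shell
# Read syslog-style lines on stdin and print only kernel warnings.
kern_warnings() {
    grep -E '\[ID [0-9]+ kern\.warning\]'
}

# Count kernel warnings per driver instance, taken from the trailing
# "(name):" on each WARNING line, e.g. "(glm2):" or "(sd48):".
warnings_per_device() {
    kern_warnings |
        sed -n 's/.*(\([a-z]*[0-9]*\)):[[:space:]]*$/\1/p' |
        sort | uniq -c
}

# Usage on a Solaris host:
#   warnings_per_device < /var/adm/messages
```

Run against the excerpt above, this should report two warnings for glm2 and one for sd48.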

------------------------------------------------------

Serial

#/opt/FJSVmadm/sbin/serialid -a

Model

#/opt/FJSVhwr/sbin/getmodelcode

------------------------------------------------------

#/opt/FJSVhwr/sbin/fjprtdiag -v

------------------------------------------------------

'fmadm faulty' 

This is a very useful command that lists the faults recorded on a SPARC server running Solaris.

One drawback of this command is that the faults are not cleared automatically; we have to clear them manually. Otherwise we may end up troubleshooting old messages.

http://docs.oracle.com/cd/E19166-01/E20792/z40031541454621.html
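
When clearing old entries, the value we usually need is the fault's EVENT-ID, which is a UUID that 'fmadm repair' accepts. The sketch below picks UUID-shaped tokens out of whatever text it is given, so it does not depend on the exact column layout, which differs between releases. The function name is made up, and it assumes a grep that supports -E.

```shell
# Print every UUID-shaped token (8-4-4-4-12 lowercase hex digits)
# found on stdin, one per line, without duplicates.
fault_uuids() {
    tr -s ' \t' '\n\n' |
        grep -E '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$' |
        sort -u
}

# Usage on a Solaris host, once the faulty part has really been fixed:
#   fmadm faulty | fault_uuids
#   fmadm repair <uuid-from-the-list>
```

Review the list by hand before running 'fmadm repair'; marking a live fault as repaired only hides the problem.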

------------------------------------------------------

Apart from the above, some Sun consoles give detailed hardware information.

E.g. the E and M series have domain and platform consoles; there are also the ILOM and ALOM consoles.

_________________________________________________________________________


Linux - Most physical Linux servers run on HP or Dell hardware.

HP has its own CLI hardware utility, and so does Dell.

HP - 

1. hpasmcli - HP Management Command Line Interface, a scriptable command-line tool to manage and monitor HP hardware.
2. hpacucli - HP Array Configuration Utility, a command-line tool for HP Smart Array controllers.
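
Both tools can be run non-interactively, for example 'hpasmcli -s "show temp"' or 'hpacucli ctrl all show status'. The sketch below filters such output down to the status lines that are not OK. It assumes the status lines look like "Controller Status: OK", which is what I have seen from hpacucli but may differ between versions; the function name is made up.

```shell
# Print "Status:" lines from stdin whose value is anything other than OK.
# Assumption: status lines have the form "<Component> Status: <value>".
non_ok_status() {
    grep 'Status:' | grep -v 'Status:[[:space:]]*OK[[:space:]]*$'
}

# Usage on an HP server with hpacucli installed:
#   hpacucli ctrl all show status | non_ok_status
```

No output means every reported status line was OK.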

Dell -



