[SGVLUG] Recommendation of an open source hardware diagnostic tool

Mon Mar 3 00:32:39 PST 2014

Hey Claude, thanks for your thoughts.

I have had bad (really bad) luck with OCZ, so currently I only use Intel SSDs.
I have not had one fail yet.

Since I replaced a Seagate HD with the SSD and am still getting the same
errors, I think this is a controller issue.  I had the boot drive running
off the RAID port on the motherboard (a Supermicro X9SAE-V), then moved it
to a non-RAID Mobo port with no change in symptoms.

This is the only drive on the system.  I have moved all compute onto 2
redundant nodes and storage onto a separate set of servers.

I am looking for a tool to first confirm that the server is freezing up,
then to determine what component is failing.
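
A rough sketch of one way to confirm the freezes, assuming a Python 3 interpreter is available somewhere on the affected host (or in a guest running on it): sleep in short steps and flag any wake-up that arrives much later than requested. The names `find_stalls` and `watch` and the default thresholds are illustrative assumptions, not part of any existing tool. A late wake-up only shows that the process was not scheduled in time, not which component caused it.

```python
import time


def find_stalls(timestamps, interval, threshold):
    """Return (start, extra_delay) for every gap that overshoots the
    requested sleep interval by more than threshold seconds."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        extra = (cur - prev) - interval
        if extra > threshold:
            stalls.append((prev, extra))
    return stalls


def watch(interval=0.1, threshold=0.5, duration=60.0):
    """Sleep in short steps for about `duration` seconds and return the
    stalls seen, using the monotonic clock so wall-clock changes are ignored."""
    samples = [time.monotonic()]
    end = samples[0] + duration
    while samples[-1] < end:
        time.sleep(interval)
        samples.append(time.monotonic())
    return find_stalls(samples, interval, threshold)
```

Running `watch(duration=300)` during a bad spell would list each pause longer than half a second. The returned times are on the monotonic clock, so record the wall-clock time at the start of a run if the stalls need to be matched against syslog entries.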

There's nothing I have seen in the hardware logs or OS logs other than the
disk errors.
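
The disk errors quoted further down in this thread each end in an `H:... D:... P:...` triple. As a small sketch, assuming the VMware convention that these are the host, device, and plugin status codes, the device byte can be looked up in the standard SCSI status table. The function name `decode_status` is hypothetical, and the table covers only the SCSI status byte, not the host or plugin codes.

```python
import re

# SCSI status byte values as defined in the SCSI Architecture Model.
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x04: "CONDITION MET",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
    0x30: "ACA ACTIVE",
    0x40: "TASK ABORTED",
}

# Matches the "H:0x0 D:0x8 P:0x0" triple inside a log line.
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def decode_status(line):
    """Extract the H/D/P codes from a log line and name the device status.
    Returns None when the line carries no status triple."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(code, 16) for code in match.groups())
    return {
        "host": host,
        "device": device,
        "plugin": plugin,
        "device_name": SCSI_STATUS.get(device, "unknown"),
    }
```

For the lines in this thread it reports BUSY for the device status, matching the reading of D:0x8 given in the original post.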

Matt

---------
*Matthew Campbell*
Storage Solution Consultant
Storage Design and Engineering

*Kaiser Permanente*
IMG-Systems Integration
99 S. Oakland
Pasadena, CA 91101

626-564-7228 (office)
8-338-7228 (tie-line)
818-314-9897 (mobile phone)
Green Center 3-North, 031W29
---------
*kp.org/thrive <http://kp.org/thrive>*

On Mon, Mar 3, 2014 at 12:17 AM, Claude Felizardo <cafelizardo at gmail.com> wrote:

> Can you try the drives in another computer just to rule them out?  SATA on
> MB or card?
>
> You sure there isn't a bug with the firmware on the drives?  I had a
> problem with drives in a RAID - bug would show up every few weeks or maybe a
> month when it was trying to do calibration while in RAID config and it
> would knock the drive offline.  Workaround was to reboot before that
> period until I stumbled on posts saying the firmware was bad and an update
> fixed it.
>
> Oh wait, all SSD?  What brand?
>
> Claude
>
>
>
> On Mar 2, 2014, at 7:31 PM, Matthew Campbell <dvdmatt at gmail.com> wrote:
>
> Yep.  Tried that with the RAM but the Mobo and CPU are the latest and I
> don't want to blow another grand on duplicates...
>
> Matt
>
>
> On Sun, Mar 2, 2014 at 5:01 PM, Dan Kegel <dank at kegel.com> wrote:
>
>> Swapping out part by part until the problem goes away might be your best
>> bet.
>> On 02.03.2014 at 15:24, "Matthew Campbell" <dvdmatt at gmail.com> wrote:
>>
>>> Does anyone have a hardware diagnostic tool they like, preferably open
>>> source?  I have been fighting a host for two weeks now and after finding
>>> and submitting 2 kernel bugs have begun to suspect that the problems I am
>>> running into are being exposed by a hardware failure.
>>>
>>> The system appears to be running fine, but every 10-15 seconds it zones
>>> out for a couple of seconds.  At first I thought it was a BTRFS bug, and
>>> the errors I was seeing turned out to be just that.
>>>
>>> Once they were fixed the freezing kept on.  Further poking uncovered an
>>> NFS bug in its interaction with the underlying filesystem, but even with
>>> the kernel patched for that, the poor performance continues.
>>>
>>> Now I'm starting to see errors of this sort in my syslog:
>>>
>>> 2014-03-02T22:39:00.262Z cpu6:34527)WARNING: LinScsi:
>>> SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056
>>> Unknown status vmhba0:0:0:0 (driver name: ahci) - Message repeated 4 times
>>> 2014-03-02T22:39:00.262Z cpu2:32791)ScsiDeviceIO: 2324:
>>> Cmd(0x412e8088eac0) 0x4d, CmdSN 0x784 from world 0 to dev
>>> "t10.ATA_____INTEL_SSDSC2BW240A4_____________________CVDA341000752403GN__"
>>> failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
>>> 2014-03-02T22:39:00.275Z cpu2:32784)ScsiDeviceIO: 2324:
>>> Cmd(0x412e80842b00) 0x28, CmdSN 0x51c3 from world 32878 to dev
>>> "t10.ATA_____INTEL_SSDSC2BW240A4_____________________CVDA341000752403GN__"
>>> failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
>>>
>>> Could my SSD be failing?  But I just replaced the previous boot disk as
>>> it looked like it was failing...
>>>
>>> Device status D:0x8 equates to 08h BUSY according to these docs:
>>>
>>> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=289902
>>>
>>> It could be a MOBO issue with the SATA port or even the CPU or RAM.  Ugh.
>>>
>>> I tried memtest86 and all passed...
>>>
>>> Any suggestions on a full-system hardware test suite would be much
>>> appreciated.
>>>
>>> Matt
>>>
>>>
>
>