
Re: [DRBD-user] Problems with oos Sectors after verify

From: Lars Ellenberg <lars.ellenberg_at_nospam>
Date: Wed Mar 17 2010 - 10:35:22 GMT
To: drbd-user@lists.linbit.com

On Wed, Mar 17, 2010 at 07:08:20AM +0000, Henning Bitsch wrote:
> Hi,
>
> I have a problem running DRBD 8.3.7-1 on Debian Lenny (2.6.26-AMD64-Xen).
> I have six DRBD devices with a total of 3 TB. Both nodes are Supermicro AMD
> Opteron boxes (one 12-core, one 4-core) with a dedicated 1 Gbit link for
> DRBD and Adaptec 5800 RAID controllers. One node has an NVIDIA forcedeth NIC,
> the other an Intel e1000. Protocol is C. The dom0 has 2 GB of RAM.
>
> Basically, two symptoms can be observed, but I am not sure whether they are related:
>
> 1. Data integrity errors
> I get occasional data integrity errors (checksummed with crc32c) on both nodes
> in the cluster.
>
> [ 8961.266879] block drbd3: Digest integrity check FAILED.
> [22846.253694] block drbd3: Digest integrity check FAILED.
> [23557.272471] block drbd3: Digest integrity check FAILED.
>
> As recommended here before, I went through the standard procedures (disabling
> offloading, memtest, replacing cables, replacing one of the boxes), but without success.

Then your hardware is broken.
No more to say.
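
For reference, disabling offloading on the replication NIC is typically done
with ethtool; a minimal sketch, assuming eth1 is the dedicated DRBD interface:

    # Turn off checksum and segmentation offloads that can corrupt or
    # mask corruption of data on the wire (eth1 is an assumed name).
    ethtool -K eth1 rx off tx off sg off tso off
    # Show the offload settings currently in effect.
    ethtool -k eth1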

> The errors are only reported for devices for which the respective node is
> secondary.
>
> 2. Out-of-sync (oos) sectors after verify
> I always get a few oos sectors after verifying any device that has been used
> previously. These are not false positives; the sectors are in fact different:
>
> 2,5c2,5
> < 0000010: 0000 0000 0800 0000 0000 00ff 0000 0000 ................
> < 0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> < 0000030: 0000 0000 ffff ffff ffff ffff 0000 0000 ................
> < 0000040: 0000 0400 0000 0000 0000 0000 0000 0000 ................
> ---
> > 0000010: 0000 0000 0800 0000 0000 19ff 0000 0000 ................
> > 0000020: 0000 002b 0000 0000 0000 0000 0000 0000 ...+............
> > 0000030: 0000 002b ffff ffff ffff ffff 0000 0000 ...+............
> > 0000040: 0000 0400 0000 0000 0002 8668 0000 0000 ...........h....
> 8c8
> < 0000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> ---
> > 0000070: 0000 0f03 0000 0000 0000 0001 0000 0000 ................
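
The diff above compares xxd dumps of the same sector taken on both nodes; such
a dump can be produced roughly as follows, where the device path and the sector
offset are assumptions (on the Secondary, the backing device has to be read
instead, since a Secondary DRBD device cannot be opened):

    # Read 4 KiB starting at the suspect sector and hex-dump it;
    # /dev/drbd3 and the offset 123456 are placeholders.
    dd if=/dev/drbd3 bs=512 skip=123456 count=8 2>/dev/null | xxd > a.hex
    # Repeat on the peer (against its backing device), then compare:
    diff a.hex b.hex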
>
> After disconnecting, reconnecting, and resyncing the device, the sectors are
> identical again. This happens with random sectors on basically every verify.
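
Such a disconnect/reconnect cycle can be driven per resource with drbdadm;
blocks that verify marked out-of-sync are resynced on reconnect (r3 as the
resource name is an assumption):

    # Drop and re-establish the replication link for resource r3;
    # the oos blocks found by the verify are resynced afterwards.
    drbdadm disconnect r3
    drbdadm connect r3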
>
> Here is my relevant global configuration for DRBD:
>
> startup {
>     wfc-timeout      60;
>     degr-wfc-timeout 300;
> }
>
> disk {
>     on-io-error detach;
> }
>
> net {
>     cram-hmac-alg      sha1;
>     after-sb-0pri      disconnect;
>     after-sb-1pri      disconnect;
>     after-sb-2pri      disconnect;
>     data-integrity-alg crc32c;
>     max-buffers        3000;
>     max-epoch-size     8000;
> }
>
> syncer {
>     rate       25M;
>     verify-alg crc32c;
>     csums-alg  crc32c;
>     al-extents 257;
> }
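
With verify-alg set as above, an online verify is started per resource, and its
progress and oos count show up in /proc/drbd; a sketch, again assuming a
resource named r3:

    # Start an online verify of resource r3.
    drbdadm verify r3
    # Watch progress and the resulting oos (out-of-sync) count.
    watch -n1 cat /proc/drbd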
>
> I tweaked the TCP settings using sysctl:
>
> net.ipv4.tcp_rmem = 131072 131072 16777216
> net.ipv4.tcp_wmem = 131072 131072 16777216
> net.core.rmem_max = 10485760
> net.core.wmem_max = 10485760
> net.ipv4.tcp_mem = 96000 128000 256000
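
To make such settings persistent across reboots, they usually go into
/etc/sysctl.conf; a sketch:

    # Apply everything from /etc/sysctl.conf immediately.
    sysctl -p
    # Or set a single value on the fly.
    sysctl -w net.core.rmem_max=10485760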
>
>
> I am not sure in which direction to search next and would appreciate any
> suggestions.
>
> Thanks.
>
> Regards,
> Henning
> COM+ IT Consulting

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

__
please don't Cc me, but send to list -- I'm subscribed

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user