drbd-user March 2013 archive

Re: [DRBD-user] Primary fully unavailable with "time expired" errors

From: AZ 9901 <az9901_at_nospam>
Date: Sun Mar 10 2013 - 15:58:05 GMT
To: David Coulson <david@davidcoulson.net>

Thanks.
However, DRBD seems to have been stuck for 2 days with these "time expired" messages until I split the nodes (it then started working again flawlessly).
It looks as if it would have stayed in that state indefinitely, without ever recovering.

I already encountered this issue a few months ago; an online verification was also running at the time.

Is there anything I can do?
Some parameters to tune?
A "retry patch" so that DRBD stops and retries when it encounters this issue?
...

Thank you very much !

Best regards,

Ben
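
Note on the "ko = ..." values in the log lines quoted below: 4294967295 is 2^32 - 1, and the next message shows one less. A minimal Python sketch, assuming the logged value is an unsigned 32-bit counter that is decremented once per expired send timeout (the helper names are hypothetical, made up for illustration), shows that a countdown starting from 0 gives exactly these numbers. If that model holds, the counter started at 0, which would be consistent with the ko-count option in the net section being left at 0 (documented in the DRBD 8.3 drbd.conf man page as disabling the feature), so the primary would never give up on the unresponsive peer by itself. The configuration behind the pastebin link is not reproduced in this thread, so this is a guess to check, not a diagnosis.

```python
import re

# The two log lines quoted in this thread (srv2-1, Feb 17).
LOG = """\
Feb 17 20:31:11 srv2-1 kernel: block drbd1: [drbd1_worker/3083] sock_sendmsg time expired, ko = 4294967295
Feb 17 20:31:17 srv2-1 kernel: block drbd1: [drbd1_worker/3083] sock_sendmsg time expired, ko = 4294967294
"""

KO_RE = re.compile(r"sock_sendmsg time expired, ko = (\d+)")
U32 = 2**32


def ko_values(log_text):
    """Return every 'ko = N' value found in the log text, in order."""
    return [int(m.group(1)) for m in KO_RE.finditer(log_text)]


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit one."""
    return value - U32 if value >= 2**31 else value


def countdown_from(start, steps):
    """Assumed model: an unsigned 32-bit counter decremented `steps` times."""
    values = []
    current = start
    for _ in range(steps):
        current = (current - 1) % U32  # wraps from 0 to 2**32 - 1
        values.append(current)
    return values


observed = ko_values(LOG)
print(observed)                            # values as logged
print([as_signed32(v) for v in observed])  # same values read as signed
print(countdown_from(0, 2) == observed)    # does a countdown from 0 match?
```

Under this reading the counter can only reach zero again after roughly four billion more timeouts, which would match the observation that the node stayed stuck until the link was cut by hand.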

On 10 March 2013, at 16:33, David Coulson wrote:

> Sorry - I picked out the wrong line(s).
>
> Feb 17 20:31:11 srv2-1 kernel: block drbd1: [drbd1_worker/3083] sock_sendmsg time expired, ko = 4294967295
> Feb 17 20:31:17 srv2-1 kernel: block drbd1: [drbd1_worker/3083] sock_sendmsg time expired, ko = 4294967294
>
> That means your network is unreliable. Not much DRBD can do about it - I would investigate the cause of that problem.
>
> David
>
> On 3/10/13 11:21 AM, AZ 9901 wrote:
>> David,
>>
>> Thank you for your answer !
>>
>> This log entry appeared just after I cut network communication between srv2-1 and srv2-2, and is certainly a consequence of it:
>> I connected to the secondary server and used iptables to block traffic between the two servers.
>> Just after that, the primary server was reachable again!
>> But according to the logs, the issue had started 2 days earlier.
>>
>> However, to answer your question, the network between the 2 servers is the private dedicated network that OVH runs between its 2 data centres, RBX & SGB:
>> http://www.ovh.co.uk/dedicated_servers/data_centre_selection.xml
>> I have a 100 Mbps connection between the 2 servers.
>>
>> Best regards,
>>
>> Ben
>>
>> On 10 March 2013, at 16:01, David Coulson wrote:
>>
>>> What is your network between the two systems?
>>>
>>> Feb 19 19:20:56 srv2-2 kernel: block drbd1: PingAck did not arrive in time.
>>>
>>> That means DRBD couldn't communicate between the nodes.
>>>
>>> David
>>>
>>> On 3/10/13 10:59 AM, AZ 9901 wrote:
>>>> On 5 March 2013, at 07:21, AZ 9901 wrote:
>>>>
>>>>> // I made some errors in my previous mail; here is the corrected version.
>>>>>
>>>>> Hello,
>>>>>
>>>>> I faced a big issue with DRBD.
>>>>>
>>>>> OS : Linux Debian 6
>>>>> Kernel : 2.6.32-46
>>>>> DRBD : 8.3.14
>>>>>
>>>>> My primary server (srv2-2) was totally unreachable; it only replied to ping.
>>>>> Apache, SSH, etc. were no longer responding.
>>>>>
>>>>> So I connected to my secondary server (srv2-1) and cut network communication between the two.
>>>>> This made srv2-2 available again!
>>>>> However, I decided to promote srv2-1 from Secondary to Primary and to reboot srv2-2.
>>>>>
>>>>> Below are the logs from srv2-2 and srv2-1, with some comments.
>>>>> srv2-2 : http://pastebin.com/raw.php?i=zkHV5Tr9
>>>>> srv2-1 : http://pastebin.com/raw.php?i=WX4vNR6d
>>>>>
>>>>> On srv2-2, sar shows that some of my CPU cores were 100% busy (100% iowait) for the entire period during which I had "time expired" errors.
>>>>>
>>>>> Could you help me, please?
>>>>>
>>>>> Thank you very much,
>>>>>
>>>>> Ben
>>>>>
>>>>
>>>> Hello,
>>>>
>>>> Can anyone help with this problem?
>>>>
>>>> To help further, here is my configuration: http://pastebin.com/raw.php?i=UJ7npfBD
>>>>
>>>> Thank you very much,
>>>>
>>>> Best regards,
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> drbd-user mailing list
>>>> drbd-user@lists.linbit.com
>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>
>>
>
