GNU bug report logs - #30365
Offloading sometimes hangs

Previous Next

Package: guix;

Reported by: ludo <at> gnu.org (Ludovic Courtès)

Date: Tue, 6 Feb 2018 10:05:01 UTC

Severity: important

Done: Ricardo Wurmus <rekado <at> elephly.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 30365 in the body.
You can then email your comments to 30365 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#30365; Package guix. (Tue, 06 Feb 2018 10:05:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to ludo <at> gnu.org (Ludovic Courtès):
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Tue, 06 Feb 2018 10:05:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: bug-guix <at> gnu.org
Subject: Offloading sometimes hangs
Date: Tue, 06 Feb 2018 11:04:10 +0100
Hi,

On berlin.guixsd.org, offloading would sometimes hang in the middle of
an offloaded build: no more build log output showing up, nothing
happening (this is with guix-0.14.0-6.0dcf675).

On the build machine side, the guile process that forwards data between
the sshd and guix-daemon¹ is stuck on:

  read(0, …)

with this stack trace:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007f09d6068aed in read () from /gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libpthread.so.0
#1  0x00007f09d653fc47 in fport_read ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#2  0x00007f09d656cd77 in scm_i_read_bytes ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#3  0x00007f09d65705fe in scm_fill_input ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#4  0x00007f09d6577897 in scm_get_bytevector_some ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#5  0x00007f09d65abc4d in vm_regular_engine ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#6  0x00007f09d65af2aa in scm_call_n ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#7  0x00007f09d65338d7 in scm_primitive_eval ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
--8<---------------cut here---------------end--------------->8---

In theory this “cannot happen” because it reads from stdin iff ‘select’
said stdin is ready.

On the server side (on berlin itself), the corresponding ‘guix offload’
process is stuck here:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007ff49b3590bd in poll () from target:/gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libc.so.6
#1  0x00007ff48f4db377 in ssh_poll_ctx_dopoll ()
   from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#2  0x00007ff48f4dc319 in ssh_handle_packets ()
   from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#3  0x00007ff48f4dc3ed in ssh_handle_packets_termination ()
   from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#4  0x00007ff48f4c8eff in ssh_channel_read_timeout ()
   from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#5  0x00007ff48f930803 in read_from_channel_port ()
   from target:/gnu/store/xfaqdvk060yz7ddc9isk3wkybqmcfj3w-guile-ssh-0.11.2/lib/libguile-ssh.so.11
#6  0x00007ff49cea7d77 in scm_i_read_bytes ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#7  0x00007ff49ceac3fc in scm_c_read_bytes ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#8  0x00007ff49ceb2838 in scm_get_bytevector_n ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#9  0x00007ff49cee6c4d in vm_regular_engine ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#10 0x00007ff49ceea2aa in scm_call_n ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#11 0x00007ff49ce6e8d7 in scm_primitive_eval ()
--8<---------------cut here---------------end--------------->8---

Presumably the ‘scm_get_bytevector_n’ call comes from (guix
serialization) or ‘process-stderr’.

IOW we have a deadlock where both sides are waiting for input data.

Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102




Severity set to 'important' from 'normal' Request was from ludo <at> gnu.org (Ludovic Courtès) to control <at> debbugs.gnu.org. (Wed, 07 Feb 2018 13:35:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#30365; Package guix. (Wed, 07 Feb 2018 13:43:02 GMT) Full text and rfc822 format available.

Message #10 received at 30365 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: 30365 <at> debbugs.gnu.org
Subject: Re: bug#30365: Offloading sometimes hangs
Date: Wed, 07 Feb 2018 14:42:22 +0100
ludo <at> gnu.org (Ludovic Courtès) skribis:

> On the build machine side, the guile process that forwards data between
> the sshd and guix-daemon¹ is stuck on:
>
>   read(0, …)

> ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102

I’ve been able to strace this Guile process.  When “things work
normally”, its loop does things like this:

--8<---------------cut here---------------start------------->8---
20456 10:52:17 read(9, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 65536) = 152
20456 10:52:17 write(1, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 152) = 152
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 65536) = 168
20456 10:52:17 write(1, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 168) = 168
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 65536) = 216
20456 10:52:17 write(1, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 216) = 216
--8<---------------cut here---------------end--------------->8---

Here FD 9 is the guix-daemon socket and 1 is stdout, which is connected
to the remote ‘guix offload’ process.

We then see a slightly different sequence leading to the read(0, …) call
that hangs:

--8<---------------cut here---------------start------------->8---
20456 10:52:18 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:18 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:18 read(9, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 65536) = 72
20456 10:52:18 write(1, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 72) = 72
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=261868144}) = 0
20456 10:52:18 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
20456 10:52:18 madvise(0xffffb0025000, 4096, MADV_DONTNEED <unfinished ...>
20456 10:52:18 <... madvise resumed> )  = 0
20456 10:52:18 write(6, "\0", 1 <unfinished ...>
20456 10:52:18 <... write resumed> )    = 1
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID,  <unfinished ...>
20456 10:52:18 <... clock_gettime resumed> {tv_sec=0, tv_nsec=268384968}) = 0
20456 10:52:18 read(0,  <unfinished ...>
--8<---------------cut here---------------end--------------->8---

Notice that there’s no pselect showing up here!

The two clock_gettime correspond to ‘get_internal_run_time’ in libguile,
and ‘madvise’ corresponds to ‘return_unused_stack_to_os’; IOW, this part
corresponds to garbage collection.

So what we have here is that the Scheme procedure ‘select’ returned
stdin as “ready for reading”.  How did that happen?  I believe this is
due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
returns 1, so ‘select’ returns EINTR but it does so without clearing the
FD sets.

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30365; Package guix. (Wed, 07 Feb 2018 20:55:02 GMT) Full text and rfc822 format available.

Message #13 received at 30365 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: 30365 <at> debbugs.gnu.org
Subject: Re: bug#30365: Offloading sometimes hangs
Date: Wed, 07 Feb 2018 21:54:08 +0100
[Message part 1 (text/plain, inline)]
ludo <at> gnu.org (Ludovic Courtès) skribis:

> ludo <at> gnu.org (Ludovic Courtès) skribis:
>
>> On the build machine side, the guile process that forwards data between
>> the sshd and guix-daemon¹ is stuck on:
>>
>>   read(0, …)
>
>> ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102
>
> I’ve been able to strace this Guile process.

For reference, one way to test this without going through offloading is by:

  1. storing the ‘redirect’ code in a file, say redirect.scm;
  2. talk to the daemon through a pipe running redirect.scm as with the
     attached patch.

Ludo’.

[Message part 2 (text/x-patch, inline)]
diff --git a/guix/store.scm b/guix/store.scm
index 6742611c6..b38e5bc23 100644
--- a/guix/store.scm
+++ b/guix/store.scm
@@ -473,6 +473,7 @@ name."
 
   (connect uri))
 
+(use-modules (ice-9 popen))
 (define* (open-connection #:optional (uri (%daemon-socket-uri))
                           #:key port (reserve-space? #t) cpu-affinity)
   "Connect to the daemon at URI (a string), or, if PORT is not #f, use it as
@@ -491,7 +492,14 @@ for this connection will be pinned.  Return a server object."
                                             (errno EPROTO))
                      (&message (message "build daemon handshake failed"))))))
     (let*-values (((port)
-                   (or port (connect-to-daemon uri)))
+                   (or port
+                       (with-fluids ((%default-port-encoding "ISO-8859-1"))
+                        (open-pipe* OPEN_BOTH;;  "strace" "-t" "-o"
+                                    ;; "/home/ludo/redirect.log"
+                                    ;; "-s" "1024"
+                                    "guile" "/home/ludo/src/guix/redirect.scm"))
+                       ;; (connect-to-daemon uri)
+                       ))
                   ((output flush)
                    (buffering-output-port port
                                           (make-bytevector 8192))))

Reply sent to ludo <at> gnu.org (Ludovic Courtès):
You have taken responsibility. (Fri, 09 Feb 2018 23:17:01 GMT) Full text and rfc822 format available.

Notification sent to ludo <at> gnu.org (Ludovic Courtès):
bug acknowledged by developer. (Fri, 09 Feb 2018 23:17:02 GMT) Full text and rfc822 format available.

Message #18 received at 30365-done <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: 30365-done <at> debbugs.gnu.org
Subject: Re: bug#30365: Offloading sometimes hangs
Date: Sat, 10 Feb 2018 00:16:46 +0100
Hello!

ludo <at> gnu.org (Ludovic Courtès) skribis:

> So what we have here is that the Scheme procedure ‘select’ returned
> stdin as “ready for reading”.  How did that happen?  I believe this is
> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
> returns 1, so ‘select’ returns EINTR but it does so without clearing the
> FD sets.

I’ve pushed a workaround here:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905

and guix-0.14.0-8.bc880f9 includes that fix.

It’s been running for several hours on berlin, building a bunch of
things notably on aarch64, and it seems to work well!

On the Guix side this can be considered fixed, I think.

On the Guile side, we’ll have to push the fix for
<https://bugs.gnu.org/30368>.

Ludo’.




Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 10 Feb 2018 11:06:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#30365; Package guix. (Sat, 10 Feb 2018 11:08:02 GMT) Full text and rfc822 format available.

Message #23 received at 30365 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: 30365 <at> debbugs.gnu.org
Subject: Re: bug#30365: Offloading sometimes hangs
Date: Sat, 10 Feb 2018 12:07:13 +0100
Hello,

Ricardo Wurmus <rekado <at> elephly.net> skribis:

>> ludo <at> gnu.org (Ludovic Courtès) skribis:
>>
>>> So what we have here is that the Scheme procedure ‘select’ returned
>>> stdin as “ready for reading”.  How did that happen?  I believe this is
>>> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
>>> returns 1, so ‘select’ returns EINTR but it does so without clearing the
>>> FD sets.
>>
>> I’ve pushed a workaround here:
>>
>>   https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905
>>
>> and guix-0.14.0-8.bc880f9 includes that fix.
>>
>> It’s been running for several hours on berlin, building a bunch of
>> things notably on aarch64, and it seems to work well!
>
> Congratulations on figuring this out!

Well it did show up again during the night.  :-/  So the problem appears
less frequently it seems, but it still appears.

A related issue is that ‘guix offload’ doesn’t time out in this case
but it probably should.

Ludo’.




Reply sent to Ricardo Wurmus <rekado <at> elephly.net>:
You have taken responsibility. (Mon, 12 Feb 2018 06:06:03 GMT) Full text and rfc822 format available.

Notification sent to ludo <at> gnu.org (Ludovic Courtès):
bug acknowledged by developer. (Mon, 12 Feb 2018 06:06:03 GMT) Full text and rfc822 format available.

Message #28 received at 30365-done <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 30365-done <at> debbugs.gnu.org
Subject: Re: bug#30365: Offloading sometimes hangs
Date: Sat, 10 Feb 2018 11:17:13 +0100
Hi Ludo,

> ludo <at> gnu.org (Ludovic Courtès) skribis:
>
>> So what we have here is that the Scheme procedure ‘select’ returned
>> stdin as “ready for reading”.  How did that happen?  I believe this is
>> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
>> returns 1, so ‘select’ returns EINTR but it does so without clearing the
>> FD sets.
>
> I’ve pushed a workaround here:
>
>   https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905
>
> and guix-0.14.0-8.bc880f9 includes that fix.
>
> It’s been running for several hours on berlin, building a bunch of
> things notably on aarch64, and it seems to work well!

Congratulations on figuring this out!

-- 
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
https://elephly.net






Information forwarded to bug-guix <at> gnu.org:
bug#30365; Package guix. (Thu, 15 Feb 2018 00:07:01 GMT) Full text and rfc822 format available.

Message #31 received at 30365-done <at> debbugs.gnu.org (full text, mbox):

From: Danny Milosavljevic <dannym <at> scratchpost.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 30365-done <at> debbugs.gnu.org
Subject: Re: bug#30365: Offloading sometimes hangs
Date: Thu, 15 Feb 2018 01:06:04 +0100
Hi Ludo,

> On the Guile side, we’ll have to push the fix for
> <https://bugs.gnu.org/30368>.

Can't think of a way to test select.  But can think of a way to test pselect:

Excuse the C:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/select.h>
#include <signal.h>

static void dummy(int signum) {
}

int main() {
        int fds[2];
        struct sigaction sa = {
                .sa_handler = dummy,
                .sa_flags = 0,
        };
        sigemptyset(&sa.sa_mask);
        pipe(fds);
	/* Make sure we don't miss it */
        sigsetmask(sigmask(SIGCHLD));
	/* Make sure we handle it once we DO get it (after unblocking it :P) */
        sigaction(SIGCHLD, &sa, NULL);
        if (fork() == 0) {
                _exit(0); /* Cause SIGCHLD (to be at least pending) in the parent. */
        } else {
                close(fds[0]);
                fd_set read_set;
                fd_set write_set;
                fd_set except_set;
                sigset_t sigmask;
                int rc;
                FD_ZERO(&read_set);
                FD_ZERO(&write_set);
                FD_SET(fds[1], &write_set);
                FD_ZERO(&except_set);
                sigemptyset(&sigmask); /* unblock SIGCHLD */
                rc = pselect(fds[0] + 1, &read_set, &write_set, &except_set, NULL, &sigmask);
                if (rc == -1) {
                        perror("select");
                } else if (rc == 0) {
                        printf("ZERO\n");
                } else {
                        printf("N\n");
                }
        }
        return 0;
}




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 15 Mar 2018 11:24:08 GMT) Full text and rfc822 format available.

This bug report was last modified 6 years and 38 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.