GNU bug report logs - #17561
24.4.50; Emacs can forget processes

Previous Next

Package: emacs;

Reported by: Jorgen Schaefer <contact <at> jorgenschaefer.de>

Date: Fri, 23 May 2014 15:54:02 UTC

Severity: normal

Tags: moreinfo

Merged with 17337, 17628

Found in version 24.4.50

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17561 in the body.
You can then email your comments to 17561 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Fri, 23 May 2014 15:54:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jorgen Schaefer <contact <at> jorgenschaefer.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 23 May 2014 15:54:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <contact <at> jorgenschaefer.de>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.4.50; Emacs can forget processes
Date: Fri, 23 May 2014 17:52:35 +0200

From time to time, my Emacs tends to become very slow. I have tracked
this down to a weird interaction with certain processes.

Sometimes, when a process dies (e.g. after its buffer is killed), Emacs
apparently does not clean up correctly. The sentinel is called, the
process itself receives a termination signal, but the process object
still shows up in M-x list-processes, and there is a zombie process in
ps.

When this happens, Emacs is in a tight loop trying to read from the dead
process(es):

21:24:00.948071 pselect6(12, [3 4 9 11], [], NULL, {0, 1376013},
                {NULL, 8}) = 2 (in [9 11], left {0, 1373467})
                <0.000011>
21:24:00.948529 read(9, "", 4096)       = 0 <0.000009>
21:24:00.948564 read(11, "", 4096)      = 0 <0.000008>

Killing the processes in M-x list-processes using
`process-menu-delete-process' returns Emacs to a usable state, albeit
the zombie processes stay around.

I have been able to reliably reproduce the behavior using this snippet:

(with-temp-buffer
  (let ((process-connection-type t))
    (set-process-query-on-exit-flag
     (start-process "cat" (current-buffer) "cat")
     nil)))

I.e. create a buffer, associate a process with it, remove the query on
exit flag, and kill the buffer. The `process-connection-type' nil is the
relevant part. With t, it works correctly.

The snippet does *not* work on a freshly-started Emacs. A random time
after the bug shows up, Emacs also cleans up the zombie processes and
the bug can not be triggered anymore. Until a whole while later, it
shows up again.

I have seen this happen in 24.3.50, 24.3.91 (pretest 2), and current
trunk.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Fri, 23 May 2014 16:29:01 GMT) Full text and rfc822 format available.

Message #8 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Fri, 23 May 2014 09:28:44 -0700

[Message part 1 (text/plain, inline)]

How was Emacs configured and built on your platform?  What's your platform?

Can you run the shell command 'strace -p' on an Emacs with the problem, 
and trace the system calls near the offending area?

Are you using anything involving SIGUSR1 or SIGUSR2?  I see an unlikely 
race condition there.

That's part of the problem with this code: it's so racy that it seems 
that every time I look at it I find another unlikely race condition. 
I'm attaching a patch to emacs-24 for the unlikely races I found in a 
quick look at the area, not that I think it'll fix your problem.  I 
haven't installed this.

[racy.patch (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Fri, 23 May 2014 16:45:02 GMT) Full text and rfc822 format available.

Message #11 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Fri, 23 May 2014 18:44:19 +0200

On Fri, 23 May 2014 09:28:44 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> How was Emacs configured and built on your platform? What's your
> platform?

I have seen this bug happen on a Debian 7.5 Wheezy x86_64 with an
Emacs configured with --without-x and an openSUSE 12.3 amd64 with
a GUI version (can't currently check which toolkit it uses by default
there).

> Can you run the shell command 'strace -p' on an Emacs with the
> problem, and trace the system calls near the offending area?

The bug report includes an strace output. Emacs does the pselect6 call
and then read calls for all "broken" sockets in a tight loop, and
pretty much nothing else.

> Are you using anything involving SIGUSR1 or SIGUSR2?  I see an
> unlikely race condition there.

Not to my knowledge, unless Emacs sends those signals in some situation
without it being obvious; I do not send any signals myself, all I do is
use `kill-buffer' and `delete-process'.

> That's part of the problem with this code: it's so racy that it seems 
> that every time I look at it I find another unlikely race condition. 
> I'm attaching a patch to emacs-24 for the unlikely races I found in a 
> quick look at the area, not that I think it'll fix your problem.  I 
> haven't installed this.

read(2) returning 0 indicates an EOF (which pselect6 indicates as
waiting input), I suspect there could at least be a sanity check here
to close the file descriptors instead of continuously polling.

Regards,
Jorgen

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Sat, 24 May 2014 23:02:02 GMT) Full text and rfc822 format available.

Message #14 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Sat, 24 May 2014 16:01:20 -0700

Jorgen Schaefer wrote:

>> Can you run the shell command 'strace -p' on an Emacs with the
>> problem, and trace the system calls near the offending area?
>
> The bug report includes an strace output. Emacs does the pselect6 call

Sorry, I should have been more specific.  It'd be nice to see the strace 
output for when you run start-process in a temp buffer, not for later 
when Emacs is in a tight loop.

> read(2) returning 0 indicates an EOF (which pselect6 indicates as
> waiting input), I suspect there could at least be a sanity check here
> to close the file descriptors instead of continuously polling.

Maybe, though I worry that might mask the bug and might cause other, 
more serious failures down the road.  It'd be nicer to fix the actual 
bug, if we can figure out what it is.

For what it's worth I tried reproducing the bug with emacs-24 'emacs -Q' 
on Fedora, using your recipe, but this didn't have a problem.  I ran it 
1000 times rapidly in sequence and it was OK.  I ran it 10000 times and 
ran out of file descriptors but that's to be expected.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Sun, 25 May 2014 07:58:02 GMT) Full text and rfc822 format available.

Message #17 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Sun, 25 May 2014 09:57:35 +0200

On Sat, 24 May 2014 16:01:20 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Jorgen Schaefer wrote:
> 
> >> Can you run the shell command 'strace -p' on an Emacs with the
> >> problem, and trace the system calls near the offending area?
> >
> > The bug report includes an strace output. Emacs does the pselect6
> > call
> 
> Sorry, I should have been more specific.  It'd be nice to see the
> strace output for when you run start-process in a temp buffer, not
> for later when Emacs is in a tight loop.

Oh, my bad. Luckily, the bug just occurred again (it's really non-
deterministic):

strace -fttT output

The trace starts with a number of these triplets, which seem to be
"Emacs behaving normally". After terminating the process and returning
to "normal" later, Emacs does not issue these rt_sigprocmask calls
anymore, though.

16300 09:41:27.072717 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 176410474}, {NULL, 8}) = 0 (Timeout) <0.176635>
16300 09:41:27.249649 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000011>
16300 09:41:27.249881 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000010>

Then there's a large bunch of syscalls related to my command to restart
the process, with the "\r" now being me sending the M-x command:

16300 09:41:28.298391 read(3, "\r", 1)  = 1 <0.000012>
16300 09:41:28.298438 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
16300 09:41:28.298480 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
16300 09:41:28.312476 write(3, "\r", 1) = 1 <0.000021>
16300 09:41:28.317392 kill(4294953129, SIGHUP) = 0 <0.002235>
16300 09:41:28.321642 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000017>
16300 09:41:28.321841 write(3, "\33[K\33[H\n\n", 8) = 8 <0.000018>
16300 09:41:28.321909 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000012>
16300 09:41:28.321975 --- SIGIO (I/O possible) @ 0 (0) ---
16300 09:41:28.322004 rt_sigreturn(0x1d) = 0 <0.000013>
16300 09:41:28.322056 ioctl(3, FIONREAD, [0]) = 0 <0.000014>
16300 09:41:28.322183 ioctl(3, FIONREAD, [0]) = 0 <0.000012>
16300 09:41:28.322253 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 499801124}, {NULL, 8}) = 1 (in [15], left {0, 499788244}) <0.000024>
16300 09:41:28.322337 read(15, "", 4096) = 0 <0.000012>
16300 09:41:28.322475 ioctl(3, FIONREAD, [0]) = 0 <0.000014>
16300 09:41:28.322533 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 499519197}, {NULL, 8}) = 1 (in [15], left {0, 499513475}) <0.000017>
16300 09:41:28.322603 read(15, "", 4096) = 0 <0.000010>

Killing the process via the process list (hence the "d"):

16300 09:49:00.151571 read(3, "d", 1)   = 1 <0.000009>
16300 09:49:00.151604 ioctl(3, FIONREAD, [0]) = 0 <0.000007>
16300 09:49:00.151636 ioctl(3, FIONREAD, [0]) = 0 <0.000006>
16300 09:49:00.151790 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000020>
16300 09:49:00.151849 kill(4294953129, SIGKILL) = 0 <0.000013>
16300 09:49:00.151896 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000007>
16300 09:49:00.151964 close(12)         = 0 <0.000015>
16300 09:49:00.152003 close(15)         = 0 <0.000010>
16300 09:49:00.153389 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000010>
16300 09:49:00.153531 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000008>

> > read(2) returning 0 indicates an EOF (which pselect6 indicates as
> > waiting input), I suspect there could at least be a sanity check
> > here to close the file descriptors instead of continuously polling.
> 
> Maybe, though I worry that might mask the bug and might cause other, 
> more serious failures down the road.  It'd be nicer to fix the actual 
> bug, if we can figure out what it is.
> 
> For what it's worth I tried reproducing the bug with emacs-24 'emacs
> -Q' on Fedora, using your recipe, but this didn't have a problem.  I
> ran it 1000 times rapidly in sequence and it was OK.  I ran it 10000
> times and ran out of file descriptors but that's to be expected.

Yes, the bug is really random. I have not been able to intentionally
reproduce it. It just happens once in a while. Until this bug report,
I wasn't even sure what was causing my Emacs to stop responding in the
first place. It just happened, once every few days, that Emacs suddenly
becomes very sluggish.

My Emacs sessions usually have a number of processes open. When the bug
showed up just now, it was five processes and three network connections,
for example. I'm not sure if that's related.

Regards,
Jorgen

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Mon, 26 May 2014 17:10:02 GMT) Full text and rfc822 format available.

Message #20 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Mon, 26 May 2014 10:08:38 -0700

Jorgen Schaefer wrote:
> The trace starts with a number of these triplets, which seem to be
> "Emacs behaving normally".
>
> 16300 09:41:27.072717 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 176410474}, {NULL, 8}) = 0 (Timeout) <0.176635>
> 16300 09:41:27.249649 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000011>
> 16300 09:41:27.249881 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000010>

I don't observe this behavior when running Emacs on Fedora 20 (this is 
the latest emacs-24 version, running your little test).  I ran:

strace -o /tmp/tr -fttT src/bootstrap-emacs -nw -Q

I did see this:

29589 09:44:10.979952 ioctl(4, FIONREAD, [0]) = 0 <0.000023>
29589 09:44:10.980031 ioctl(4, FIONREAD, [0]) = 0 <0.000024>
29589 09:44:10.980143 pselect6(6, [4 5], [], NULL, {0, 499810466}, 
{NULL, 8}) = 0 (Timeout) <0.500499>
29589 09:44:11.480745 poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout) 
<0.000030>
29589 09:44:11.480861 read(5, 0x7fff777b3820, 16) = -1 EAGAIN (Resource 
temporarily unavailable) <0.000029>
29589 09:44:11.481214 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 
<0.000030>
29589 09:44:11.481505 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 
0 <0.000027>
29589 09:44:11.481594 ioctl(4, FIONREAD, [0]) = 0 <0.000030>

but that's not really the same.  Could you please try running emacs -nw 
-Q with your test case, and see whether it behaves like the pattern on 
my platform?  If so, we might try to investigate why Emacs changes from 
this pattern to the pattern that you observe.

> Then there's a large bunch of syscalls related to my command to restart
> the process, with the "\r" now being me sending the M-x command:

Which M-x command was that?  M-x list-processes?

> 16300 09:41:28.298391 read(3, "\r", 1)  = 1 <0.000012>
> 16300 09:41:28.298438 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
> 16300 09:41:28.298480 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
> 16300 09:41:28.312476 write(3, "\r", 1) = 1 <0.000021>
> 16300 09:41:28.317392 kill(4294953129, SIGHUP) = 0 <0.002235>
> 16300 09:41:28.321642 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000017>
> 16300 09:41:28.321841 write(3, "\33[K\33[H\n\n", 8) = 8 <0.000018>
> 16300 09:41:28.321909 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000012>

When I try your test, the child process is running in parallel with the 
parent, the 'kill' terminates the child, and the parent is signaled.  In 
contrast your trace shows no child, leading me to guess that the child 
has already exited (so the parent is killing a zombie), which means it's 
not the test case you sent but some other process (since the test case 
you sent waits on a pty so the child shouldn't exit).  Here's the trace 
I see around the kill:

29592 09:44:24.494089 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No 
such file or directory) <0.000028>
29589 09:44:24.494167 kill(4294937704, SIGHUP <unfinished ...>
29592 09:44:24.494199 
open("/usr/lib64/alliance/lib/tls/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC 
<unfinished ...>
29589 09:44:24.494218 <... kill resumed> ) = 0 <0.000040>
29592 09:44:24.494236 <... open resumed> ) = -1 ENOENT (No such file or 
directory) <0.000025>
29592 09:44:24.494267 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, 
si_pid=29589, si_uid=1000} ---
29592 09:44:24.494375 +++ killed by SIGHUP +++
29589 09:44:24.494388 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, 
si_pid=29592, si_status=SIGHUP, si_utime=0, si_stime=0} ---
29589 09:44:24.494435 wait4(29592, [{WIFSIGNALED(s) && WTERMSIG(s) == 
SIGHUP}], WNOHANG|WSTOPPED|WCONTINUED, NULL) = 29592 <0.000037>
29589 09:44:24.494507 rt_sigreturn()    = 12159536 <0.000021>
29589 09:44:24.494603 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0 
<0.000019>
29589 09:44:24.494653 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0 
<0.000017>
29589 09:44:24.495590 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 
<0.000025>
29589 09:44:24.496152 write(4, 
"\r\n\33[?25lnil\33[48;34H\33[7m11\33[0m\33[3"..., 70) = 70 <0.000036>
29589 09:44:24.496236 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 
0 <0.000015>
29589 09:44:24.496282 --- SIGIO {si_signo=SIGIO, si_code=SI_KERNEL} ---
29589 09:44:24.496306 rt_sigreturn()    = 0 <0.000015>
29589 09:44:24.496350 ioctl(4, FIONREAD, [0]) = 0 <0.000016>

Perhaps you could run your test on a fresh emacs -Q -nw and see whether 
it matches the behavior I'm seeing.

> Killing the process via the process list (hence the "d"):

You typed "d" in the window generated by M-x list-processes?  When I try 
that, it says "d is undefined".

> My Emacs sessions usually have a number of processes open. When the bug
> showed up just now, it was five processes and three network connections,
> for example. I'm not sure if that's related.

I expect that it is, unfortunately.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Mon, 26 May 2014 18:50:03 GMT) Full text and rfc822 format available.

Message #23 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Mon, 26 May 2014 20:49:52 +0200

[Message part 1 (text/plain, inline)]

On Mon, 26 May 2014 10:08:38 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Jorgen Schaefer wrote:
> > The trace starts with a number of these triplets, which seem to be
> > "Emacs behaving normally".
> >
> > 16300 09:41:27.072717 pselect6(20, [3 4 5 6 8 10 14 15 19], [],
> > NULL, {0, 176410474}, {NULL, 8}) = 0 (Timeout) <0.176635> 16300
> > 09:41:27.249649 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0
> > <0.000011> 16300 09:41:27.249881 rt_sigprocmask(SIG_UNBLOCK, [WINCH
> > IO], NULL, 8) = 0 <0.000010>
> 
> I don't observe this behavior when running Emacs on Fedora 20 (this
> is the latest emacs-24 version, running your little test).  I ran:
> 
> strace -o /tmp/tr -fttT src/bootstrap-emacs -nw -Q
> 
> I did see this:
> 
> 29589 09:44:10.979952 ioctl(4, FIONREAD, [0]) = 0 <0.000023>
> 29589 09:44:10.980031 ioctl(4, FIONREAD, [0]) = 0 <0.000024>
> 29589 09:44:10.980143 pselect6(6, [4 5], [], NULL, {0, 499810466}, 
> {NULL, 8}) = 0 (Timeout) <0.500499>
> 29589 09:44:11.480745 poll([{fd=5, events=POLLIN}], 1, 0) = 0
> (Timeout) <0.000030>
> 29589 09:44:11.480861 read(5, 0x7fff777b3820, 16) = -1 EAGAIN
> (Resource temporarily unavailable) <0.000029>
> 29589 09:44:11.481214 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8)
> = 0 <0.000030>
> 29589 09:44:11.481505 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL,
> 8) = 0 <0.000027>
> 29589 09:44:11.481594 ioctl(4, FIONREAD, [0]) = 0 <0.000030>
> 
> but that's not really the same. Could you please try running emacs
> -nw -Q with your test case, and see whether it behaves like the
> pattern on my platform?  If so, we might try to investigate why Emacs
> changes from this pattern to the pattern that you observe.

In "normal mode" (with no processes running), Emacs simply hangs in a
pselect6 with a "pretty long timeout". I have attached the full
traceback of the emacs -Q session.

> > Then there's a large bunch of syscalls related to my command to
> > restart the process, with the "\r" now being me sending the M-x
> > command:
> 
> Which M-x command was that?  M-x list-processes?

M-x elpy-rpc-restart

https://github.com/jorgenschaefer/elpy/blob/master/elpy.el#L1654-1662

The call to `kill-process' was added only recently as part of
debugging this problem, the bug showed up with only the `kill-buffer'
call, too.

The Elpy RPC processes are where I notice this bug the most, as there
are often quite a few of them, and I kill them regularly during development,
but I have seen this with e.g. openssl processes, too.

> > 16300 09:41:28.298391 read(3, "\r", 1)  = 1 <0.000012>
> > 16300 09:41:28.298438 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
> > 16300 09:41:28.298480 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
> > 16300 09:41:28.312476 write(3, "\r", 1) = 1 <0.000021>
> > 16300 09:41:28.317392 kill(4294953129, SIGHUP) = 0 <0.002235>
> > 16300 09:41:28.321642 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL,
> > 8) = 0 <0.000017> 16300 09:41:28.321841 write(3, "\33[K\33[H\n\n",
> > 8) = 8 <0.000018> 16300 09:41:28.321909 rt_sigprocmask(SIG_UNBLOCK,
> > [WINCH IO], NULL, 8) = 0 <0.000012>
> 
> When I try your test, the child process is running in parallel with
> the parent, the 'kill' terminates the child, and the parent is
> signaled.  In contrast your trace shows no child, leading me to guess
> that the child has already exited (so the parent is killing a
> zombie), which means it's not the test case you sent but some other
> process (since the test case you sent waits on a pty so the child
> shouldn't exit).

That was for the already-existing process that exhibited the problem. I
wasn't able to reproduce it with the exact same code I sent this time.
Sorry, I should have been more clear about this.

> > Killing the process via the process list (hence the "d"):
> 
> You typed "d" in the window generated by M-x list-processes?  When I
> try that, it says "d is undefined".

"d runs the command process-menu-delete-process"

This is on a recent trunk version. It seems this was added after the
24.4 branch. (As I can't reliably reproduce this, I have to use
whichever Emacs I happen to be running at that point.)

I have also attached a file the-bug.txt.gz, which shows the strace of a
running Emacs session which starts to exhibit the bug, so we now have a
trace that includes the time when the bug actually happens. It shows me
starting Python subprocesses (elpy RPC processes) and killing them, in
an attempt to reproduce the problem. Eventually, this succeeds, and
Emacs goes into the loop reading from fd 15. I snipped a few megabytes
of logs there (marked "[...]"). Interestingly, I did not kill anything
there, I ran my "reproduction case", and Emacs cleaned up the broken
process by itself.

If I read this correctly, the "broken" process in question is 27012
(16300 is emacs). It gets killed at some point, right after a RET is
read (the elpy-rpc-restart), and goes straight into the loop:

16300 20:01:25.347174 kill(4294940284, SIGKILL <unfinished ...>
27012 20:01:25.349483 +++ killed by SIGKILL +++
16300 20:01:25.349515 <... kill resumed> ) = 0 <0.002304>
16300 20:01:25.349649 kill(4294940284, SIGHUP) = 0 <0.000019>
16300 20:01:25.354299 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) =
0 <0.0000 16300 20:01:25.354585 write(3, "\33[K\33[16;1H", 10) = 10
<0.000025> 16300 20:01:25.354687 rt_sigprocmask(SIG_UNBLOCK, [WINCH
IO], NULL, 8) = 0 <0.00 16300 20:01:25.354753 --- SIGIO (I/O possible)
@ 0 (0) --- 16300 20:01:25.354789 rt_sigreturn(0x1d) = 0 <0.000016>
16300 20:01:25.354862 ioctl(3, FIONREAD, [0]) = 0 <0.000017>
16300 20:01:25.355015 ioctl(3, FIONREAD, [0]) = 0 <0.000017>
16300 20:01:25.355098 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 670925
16300 20:01:25.355222 read(15, "", 4096) = 0 <0.000014>
16300 20:01:25.355455 ioctl(3, FIONREAD, [0]) = 0 <0.000017>
16300 20:01:25.355541 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 626503 
16300 20:01:25.355642 read(15, "", 4096) = 0 <0.000015>
16300 20:01:25.355710 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 609628
16300 20:01:25.355804 read(15, "", 4096) = 0 <0.000015>
16300 20:01:25.355869 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 593577
16300 20:01:25.355962 read(15, "", 4096) = 0 <0.000014>
16300 20:01:25.356027 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 577782
16300 20:01:25.356156 read(15, "", 4096) = 0 <0.000015>

Comparing this with the "normal" code, I do not see a SIGCHLD. Emacs does a
SIGKILL (delete-process) followed by a SIGHUP (from killing the buffer?), but
there's no SIGCHLD. The last rt_sigprocmask with SIG_SETMASK before
that sets QUIT ALRM PROF, so that does not seem to be the culprit.

Strange stuff.

(And I see you noticed the discussion emacs-devel about epa :-))

Regards,
Jorgen

[emacs-Q-trace.txt (text/plain, attachment)]

[the-bug.txt.gz (application/gzip, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Tue, 27 May 2014 00:00:07 GMT) Full text and rfc822 format available.

Message #26 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Mon, 26 May 2014 16:58:45 -0700

Jorgen Schaefer wrote:
> the-bug.txt.gz ... shows the strace of a
> running Emacs session which starts to exhibit the bug, so we now have a
> trace that includes the time when the bug actually happens. It shows me
> starting Python subprocesses (elpy RPC processes) and killing them, in
> an attempt to reproduce the problem. Eventually, this succeeds, and
> Emacs goes into the loop reading from fd 15.

Thanks for that trace; it's helpful.  A couple of things.  First, what 
platform are you running on, exactly?  I'm running on an AMD Phenom II 
X4 910e (according to /proc/cpuinfo), and using Fedora 20 x86-64 (Linux 
kernel 3.14.4-200.fc20.x86_64, glibc 2.18), and compiling with GCC 4.9.0.

Second, looking at the-bug.txt I'm suspicious about the interaction 
between vfork, pthread_sigmask (aka rt_sigprocmask) and rt_sigreturn. 
Could you please try building Emacs with fork rather than vfork and see 
whether that fixes the problem?  If it fixes things, great; if not I'd 
like to see another strace of the bug.  To use fork instead of vfork, 
you can run "./configure ac_cv_func_fork_works=no", or (faster but 
flimsier) edit src/config.h to add "#define fork vfork".

Here are some details as to why I think vfork is related to the problem.

You see a bug on line 7717 of the trace in the-bug.txt, where read(15, 
...) returns 0 and Emacs just keeps trying.  I see bugs before that, 
including:

* On trace line 1043, file descriptor 15 is created, as part of the 
preparation for vforking and execing subprocess 27012.  So far so good.

* On trace line 1048 vfork succeeds.  Still good.

* On trace line 1049 the parent process continues, and executes another 
system call!  The child process has neither execed nor terminated.  This 
is incorrect; if vfork is called, the Linux man page says "the calling 
thread is suspended until the child terminates (either normally, by 
calling _exit(2), or abnormally, after delivery of a fatal signal), or 
it makes a call to execve(2)".  I'm reasonably sure that Emacs relies on 
this and will fail badly if the parent and child can stomp on the same 
memory in parallel.  (Another possibility, alas, is that strace is buggy 
and its output unreliable....)

* On trace line 1049, the signal mask is cleared except for QUIT, ALRM, 
PROF.

* On trace line 5598, an INT is delivered just as the INT bit is removed 
from the signal mask.  This strongly suggests that the INT bit was set 
in the signal mask -- but that bit was cleared on line 1049, so what 
gives?  This suggests that the signal mask is messed up somehow.  One 
possibility is that the buggy vfork is doing it somehow -- perhaps the 
signal mask is part of the shared memory?

* On trace line 7706, the subprocess exits but no CHLD signal is 
delivered to Emacs.  Again, this suggests that the signal mask is messed 
up.  This last problem most likely leads to the symptoms you're 
observing: Emacs isn't told that a child died, so it doesn't check for it.

FYI when I try your test, I get something like the following strace 
output near the vfork.  This is the correct vfork behavior, i.e., the 
parent is suspended until the child execs:

6690  15:51:23.093780 rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0 
<0.000025>
6690  15:51:23.093873 vfork( <unfinished ...>
6692  15:51:23.094078 setsid()          = 6692 <0.000070>
6692  15:51:23.094202 open("/dev/tty", O_RDWR|O_CLOEXEC) = -1 ENXIO (No 
such device or address) <0.000048>
6692  15:51:23.094310 open("/dev/pts/9", O_RDWR|O_CLOEXEC) = 9 <0.000041>
6692  15:51:23.094409 rt_sigaction(SIGINT, {SIG_DFL, [INT], 
SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, {0x4ea440, [INT QUIT ALRM CHLD 
PROF WINCH IO], SA_RESTORER, 0x7f9de73d7750}, 8) = 0 <0.000026>
6692  15:51:23.094507 rt_sigaction(SIGQUIT, {SIG_DFL, [QUIT], 
SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, {0x4ea440, [INT QUIT ALRM CHLD 
PROF WINCH IO], SA_RESTORER, 0x7f9de73d7750}, 8) = 0 <0.000026>
6692  15:51:23.094602 rt_sigaction(SIGPROF, {SIG_DFL, [PROF], 
SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, {SIG_IGN, [PROF], 
SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, 8) = 0 <0.000025>
6692  15:51:23.094694 rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE], 
SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, {SIG_IGN, [PIPE], 
SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, 8) = 0 <0.000025>
6692  15:51:23.094793 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 
<0.000026>
6692  15:51:23.094877 ioctl(9, SNDCTL_TMR_TIMEBASE or 
SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0 <0.000029>
6692  15:51:23.094963 ioctl(9, SNDCTL_TMR_STOP or 
SNDRV_TIMER_IOCTL_GINFO or TCSETSW, {B38400 opost isig icanon -echo 
...}) = 0 <0.000030>
6692  15:51:23.095080 ioctl(9, SNDCTL_TMR_TIMEBASE or 
SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon -echo 
...}) = 0 <0.000029>
6692  15:51:23.095170 getpid()          = 6692 <0.000025>
6692  15:51:23.095258 chdir("/home/eggert/src/gnu/emacs/emacs-24-sc") = 
0 <0.000038>
6692  15:51:23.095466 dup2(9, 0)        = 0 <0.000028>
6692  15:51:23.095545 dup2(9, 1)        = 1 <0.000027>
6692  15:51:23.095621 dup2(9, 2)        = 2 <0.000025>
6692  15:51:23.095696 setpgid(0, 0)     = -1 EPERM (Operation not 
permitted) <0.000026>
6692  15:51:23.095776 ioctl(0, SNDRV_TIMER_IOCTL_SELECT or TIOCSPGRP, 
[6692]) = 0 <0.000027>
6692  15:51:23.095872 
execve("/home/eggert/opt/Linux-x86_64/merged/bin/cat", 
["/home/eggert/opt/Linux-x86_64/me"...], [/* 100 vars */] <unfinished ...>
6690  15:51:23.096361 <... vfork resumed> ) = 6692 <0.002462>
6690  15:51:23.096440 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 
<0.000028>

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Tue, 27 May 2014 04:07:01 GMT) Full text and rfc822 format available.

Message #29 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Mon, 26 May 2014 21:05:56 -0700

I see another bug in the-bug.txt line 399:

16300 20:00:21.012145 kill(4294935048, SIGHUP) = -1 ESRCH (No such 
process) <0.000009>

Here, the subprocess has already been reaped, but Emacs is trying to 
kill it with SIGHUP.  This is a no-no, as Emacs might kill some other 
random victim process.  There's a similar bug in line 408.

I doubt whether this bug is causing your problem, but we should fix it 
anyway.  I doubt whether this bug is a regression, so I've installed a 
patch for this as trunk bzr 117161.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Tue, 27 May 2014 18:29:02 GMT) Full text and rfc822 format available.

Message #32 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Tue, 27 May 2014 20:27:56 +0200

On Mon, 26 May 2014 16:58:45 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Jorgen Schaefer wrote:
> > the-bug.txt.gz ... shows the strace of a
> > running Emacs session which starts to exhibit the bug, so we now
> > have a trace that includes the time when the bug actually happens.
> > It shows me starting Python subprocesses (elpy RPC processes) and
> > killing them, in an attempt to reproduce the problem. Eventually,
> > this succeeds, and Emacs goes into the loop reading from fd 15.
> 
> Thanks for that trace; it's helpful.  A couple of things.  First,
> what platform are you running on, exactly?  I'm running on an AMD
> Phenom II X4 910e (according to /proc/cpuinfo), and using Fedora 20
> x86-64 (Linux kernel 3.14.4-200.fc20.x86_64, glibc 2.18), and
> compiling with GCC 4.9.0.

Thanks for making me check that. There was a bit of a wtf there for
me :-D The strace is from an OpenVZ virtual host:

Debian GNU/Linux 7.5 (wheezy)
Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Linux kernel 3.2.41-042stab085.20
eglibc 2.13

Note the embedded glibc. 

I have seen a similar behavior (Emacs suddenly responding badly,
improving once processes are killed) on this type of system:

openSUSE 12.2 (x86_64)
Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz
Linux kernel 3.4.63-2.44-desktop
glibc-2.15
gcc 4.7.1

But I can't know for sure if it's the same bug.

> Second, looking at the-bug.txt I'm suspicious about the interaction 
> between vfork, pthread_sigmask (aka rt_sigprocmask) and rt_sigreturn. 
> Could you please try building Emacs with fork rather than vfork and
> see whether that fixes the problem?  If it fixes things, great; if
> not I'd like to see another strace of the bug.  To use fork instead
> of vfork, you can run "./configure ac_cv_func_fork_works=no", or
> (faster but flimsier) edit src/config.h to add "#define fork vfork".

Should that be ac_cv_func_vfork_works=no? I have no recompiled Emacs
with the latter change, and it's using clone(2) to execute processes,
which sounds right. I haven't been able to reproduce the bug so far,
sadly that doesn't say much, I have gone for over a week before without
noticing the bug until it reappeared.

Regards,
Jorgen

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Tue, 27 May 2014 21:43:02 GMT) Full text and rfc822 format available.

Message #35 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Tue, 27 May 2014 14:42:03 -0700

[Message part 1 (text/plain, inline)]

On 05/27/2014 11:27 AM, Jorgen Schaefer wrote:
> Should that be ac_cv_func_vfork_works=no?

Yes, thanks for fixing the typo in my earlier message.

> The strace is from an OpenVZ virtual host

Perhaps OpenVZ is contributing to the problem?  There are or were 
relevant bugs in its implementation of vfork; e.g., see 
<http://lkml.org/lkml/2012/5/31/364>.

> I have no[w] recompiled Emacs
> with the latter change, and it's using clone(2) to execute processes,
> which sounds right. I haven't been able to reproduce the bug so far

So far, so good anyway.  That's the good news.  However, some bad news: 
on some platforms strace is reportedly buggy in this area, and can cause 
vfork to misbehave even if vfork works correctly when it's not being 
straced.  Please see <http://www.openwall.com/lists/musl/2013/02/03/3>.

If we have found the problem, I'd like to modify Emacs to avoid the 
kernel bug.  Let's start by trying to build a dynamic test for it. Can 
you please build and run the attached program, and see whether you can 
get it to output the message "vfork bug detected" on your platform?  
Please compile it with the same compiler and flags that you use to 
compile Emacs.  You may need to invoke the test program with arguments, 
e.g., "./a.out 100 10000" to try the test 100 times with a delay of 
10,000 nanoseconds.  The goal is to detect the vfork bug as quickly as 
possible, so if you can detect it it'd be nice to see how small we can 
make the delay.  You might also try to run the test program under strace 
to see whether that changes things. Thanks.

[vfork-test.c (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Tue, 27 May 2014 22:17:01 GMT) Full text and rfc822 format available.

Message #38 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Wed, 28 May 2014 00:16:41 +0200

On Tue, 27 May 2014 14:42:03 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Let's start by trying to build a dynamic test for it.
> Can you please build and run the attached program, and see whether
> you can get it to output the message "vfork bug detected" on your
> platform? Please compile it with the same compiler and flags that you
> use to compile Emacs.  You may need to invoke the test program with
> arguments, e.g., "./a.out 100 10000" to try the test 100 times with a
> delay of 10,000 nanoseconds.  The goal is to detect the vfork bug as
> quickly as possible, so if you can detect it it'd be nice to see how
> small we can make the delay.  You might also try to run the test
> program under strace to see whether that changes things. Thanks.

Compiled using:

gcc -std=gnu99 -g3 -O2 -MMD -Wl,-znocombreloc -o vfork-test vfork-test.c

Ran as:

./vfork-test 1000 1000000
strace -ttf -o /dev/null ./vfork-test 10000 1000000

No output.

I also changed the delay to 1s just to test that, but no dice, either.

:-(

Jorgen

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Wed, 28 May 2014 00:49:02 GMT) Full text and rfc822 format available.

Message #41 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Tue, 27 May 2014 17:47:41 -0700

Jorgen Schaefer wrote:
> No output.

Ouch.  It sounds like we may be barking up the wrong tree with the 
assumption that there is a kernel bug with vfork, then.  It could be 
that it's just an strace bug.

I'm still suspicious of the interaction between pthread_sigmask and 
vfork, though.  What happens if you do the following shell session, 
where "$" is the shell prompt and "(gdb)" is the GDB prompt?

$ grep -i pthread src/config.h
$ ldd src/bootstrap-emacs | grep -i pthread
$ gdb src/bootstrap-emacs
(gdb) disas rpl_pthread_sigmask
(gdb) disas pthread_sigmask
(gdb) disas rpl_vfork
(gdb) disas vfork
(gdb) disas rpl_fork
(gdb) disas fork

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Wed, 28 May 2014 20:54:01 GMT) Full text and rfc822 format available.

Message #44 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Wed, 28 May 2014 22:53:46 +0200

On Tue, 27 May 2014 17:47:41 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> I'm still suspicious of the interaction between pthread_sigmask and 
> vfork, though.  What happens if you do the following shell session, 
> where "$" is the shell prompt and "(gdb)" is the GDB prompt?

Assuming you meant in an Emacs with vfork enabled (i.e. the default):

> $ grep -i pthread src/config.h

/* Define to 1 if you have pthread (-lpthread). */
#define HAVE_PTHREAD 1
/* Define to 1 if you have the <pthread.h> header file. */
#define HAVE_PTHREAD_H 1
/* Define to 1 if the pthread_sigmask function can be used (despite bugs). */
/* #undef HAVE_PTHREAD_SIGMASK */
/* Define to 1 if pthread_sigmask(), when it fails, returns -1 and sets errno.
/* #undef PTHREAD_SIGMASK_FAILS_WITH_ERRNO */
/* Define to 1 if pthread_sigmask() may returns 0 and have no effect. */
/* #undef PTHREAD_SIGMASK_INEFFECTIVE */
/* Define to 1 if pthread_sigmask() unblocks signals incorrectly. */
/* #undef PTHREAD_SIGMASK_UNBLOCK_BUG */
#define PTY_TTY_NAME_SPRINTF { char *ptyname = 0; sigset_t blocked; sigemptyset (&blocked); sigaddset (&blocked, SIGCHLD); pthread_sigmask (SIG_BLOCK, &blocked, 0); if (grantpt (fd) != -1 && unlockpt (fd) != -1) ptyname = ptsname(fd); pthread_sigmask (SIG_UNBLOCK, &blocked, 0); if (!ptyname) { emacs_close (fd); return -1; } snprintf (pty_name, PTY_NAME_SIZE, "%s", ptyname); }
#ifndef _POSIX_PTHREAD_SEMANTICS
# define _POSIX_PTHREAD_SEMANTICS 1

> $ ldd src/bootstrap-emacs | grep -i pthread

        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
        (0x00007f0c1cf79000)

> $ gdb src/bootstrap-emacs
> (gdb) disas rpl_pthread_sigmask

No symbol "rpl_pthread_sigmask" in current context.

> (gdb) disas pthread_sigmask

Dump of assembler code for function pthread_sigmask:
   0x00000000005642c0 <+0>:     sub    $0x8,%rsp
   0x00000000005642c4 <+4>:     callq  0x405790 <sigprocmask <at> plt>
   0x00000000005642c9 <+9>:     xor    %edx,%edx
   0x00000000005642cb <+11>:    test   %eax,%eax
   0x00000000005642cd <+13>:    jns    0x5642d6 <pthread_sigmask+22>
   0x00000000005642cf <+15>:    callq  0x405a10 <__errno_location <at> plt>
   0x00000000005642d4 <+20>:    mov    (%rax),%edx
   0x00000000005642d6 <+22>:    mov    %edx,%eax
   0x00000000005642d8 <+24>:    add    $0x8,%rsp
   0x00000000005642dc <+28>:    retq   
End of assembler dump.

> (gdb) disas rpl_vfork

No symbol "rpl_vfork" in current context.

> (gdb) disas vfork

Dump of assembler code for function vfork <at> plt:
   0x0000000000405ed0 <+0>:     jmpq   *0x39e662(%rip)        # 0x7a4538 <vfork <at> got.plt>
   0x0000000000405ed6 <+6>:     pushq  $0xfc
   0x0000000000405edb <+11>:    jmpq   0x404f00
End of assembler dump.

> (gdb) disas rpl_fork

No symbol "rpl_fork" in current context.

> (gdb) disas fork

Dump of assembler code for function fork <at> plt:
   0x0000000000405cb0 <+0>:     jmpq   *0x39e772(%rip)        # 0x7a4428 <fork <at> got.plt>
   0x0000000000405cb6 <+6>:     pushq  $0xda
   0x0000000000405cbb <+11>:    jmpq   0x404f00
End of assembler dump.

Regards,
Jorgen

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Wed, 28 May 2014 23:01:02 GMT) Full text and rfc822 format available.

Message #47 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Wed, 28 May 2014 16:00:39 -0700

On 05/28/2014 01:53 PM, Jorgen Schaefer wrote:
> Dump of assembler code for function pthread_sigmask:
>     0x00000000005642c0 <+0>:     sub    $0x8,%rsp
>     0x00000000005642c4 <+4>:     callq  0x405790 <sigprocmask <at> plt>
>     0x00000000005642c9 <+9>:     xor    %edx,%edx
>     0x00000000005642cb <+11>:    test   %eax,%eax
>     0x00000000005642cd <+13>:    jns    0x5642d6 <pthread_sigmask+22>
>     0x00000000005642cf <+15>:    callq  0x405a10 <__errno_location <at> plt>
>     0x00000000005642d4 <+20>:    mov    (%rax),%edx
>     0x00000000005642d6 <+22>:    mov    %edx,%eax
>     0x00000000005642d8 <+24>:    add    $0x8,%rsp
>     0x00000000005642dc <+28>:    retq

Bingo!  We're on the right track.  sigprocmask is *definitely* the wrong 
thing to call here, and would explain your symptoms.  Can you please 
send the preprocessed output of pthread_sigmask.c?  That is, if 'make' 
uses the following command to build pthread_sigmask.o:

gcc -std=gnu99 -DHAVE_CONFIG_H  -I../src -g3 -O2 -c pthread_sigmask.c

then please send the output of:

gcc -std=gnu99 -DHAVE_CONFIG_H  -I../src -g3 -O2 -E pthread_sigmask.c

It may be large; OK to send it as a compressed attachment. Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Wed, 28 May 2014 23:36:01 GMT) Full text and rfc822 format available.

Message #50 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Thu, 29 May 2014 01:35:05 +0200

[Message part 1 (text/plain, inline)]

On Wed, 28 May 2014 16:00:39 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 05/28/2014 01:53 PM, Jorgen Schaefer wrote:
> > Dump of assembler code for function pthread_sigmask:
> >     0x00000000005642c0 <+0>:     sub    $0x8,%rsp
> >     0x00000000005642c4 <+4>:     callq  0x405790 <sigprocmask <at> plt>
> >     0x00000000005642c9 <+9>:     xor    %edx,%edx
> >     0x00000000005642cb <+11>:    test   %eax,%eax
> >     0x00000000005642cd <+13>:    jns    0x5642d6
> > <pthread_sigmask+22> 0x00000000005642cf <+15>:    callq  0x405a10
> > <__errno_location <at> plt> 0x00000000005642d4 <+20>:    mov
> > (%rax),%edx 0x00000000005642d6 <+22>:    mov    %edx,%eax
> >     0x00000000005642d8 <+24>:    add    $0x8,%rsp
> >     0x00000000005642dc <+28>:    retq
> 
> Bingo!  We're on the right track.  sigprocmask is *definitely* the
> wrong thing to call here, and would explain your symptoms.  Can you
> please send the preprocessed output of pthread_sigmask.c?  That is,
> if 'make' uses the following command to build pthread_sigmask.o:
> 
> gcc -std=gnu99 -DHAVE_CONFIG_H  -I../src -g3 -O2 -c pthread_sigmask.c
> 
> then please send the output of:
> 
> gcc -std=gnu99 -DHAVE_CONFIG_H  -I../src -g3 -O2 -E pthread_sigmask.c
> 
> It may be large; OK to send it as a compressed attachment. Thanks.

I bow before your debugging skills!

gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I../lib -I../src -I../src       -g3
-O2 -MT pthread_sigmask.o -MD -MP -MF .deps/pthread_sigmask.Tpo -E -o
pthread_sigmask.E pthread_sigmask.c

Output attached.

Regards,
Jorgen

[pthread_sigmask.E.gz (application/gzip, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 01:24:01 GMT) Full text and rfc822 format available.

Message #53 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Wed, 28 May 2014 18:22:48 -0700

Jorgen Schaefer wrote:
> Output attached.

OK, thanks, I think we need to fix things by arranging for Emacs to call 
pthread_sigmask rather than sigprocmask.  So, I'm curious as to why 
HAVE_PTHREAD_SIGMASK is not defined in config.h.  Can you please send 
the contents of config.log?  gzipped and attached is fine.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 04:18:02 GMT) Full text and rfc822 format available.

Message #56 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Wed, 28 May 2014 21:17:40 -0700

[Message part 1 (text/plain, inline)]

Can you please apply the attached patch to the emacs-24 branch, run 
./autogen.sh && configure && make clean && make, and then try it out on 
your platform?  Thanks.

[sigmask.patch (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 10:09:02 GMT) Full text and rfc822 format available.

Message #59 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Thu, 29 May 2014 12:08:23 +0200

[Message part 1 (text/plain, inline)]

On Wed, 28 May 2014 18:22:48 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Jorgen Schaefer wrote:
> > Output attached.
> 
> OK, thanks, I think we need to fix things by arranging for Emacs to
> call pthread_sigmask rather than sigprocmask.  So, I'm curious as to
> why HAVE_PTHREAD_SIGMASK is not defined in config.h.  Can you please
> send the contents of config.log?  gzipped and attached is fine.

See attachment

[config.log.gz (application/gzip, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 11:40:02 GMT) Full text and rfc822 format available.

Message #62 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Thu, 29 May 2014 13:39:27 +0200

On Wed, 28 May 2014 21:17:40 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Can you please apply the attached patch to the emacs-24 branch, run 
> ./autogen.sh && configure && make clean && make, and then try it out
> on your platform?  Thanks.

I can not currently build the emacs-24 branch (and haven't found a
commit in it that I can on a quick try), see bug#17629.

Using trunk (git 40f5ec0 * alloc.c (Fgarbage_collect): Fix compilation
with GC_MARK_STACK == GC_USE_GCPROS_AS_BEFORE.) now, built with your
patch and this command:

$ ./autogen.sh && ./configure --without-x && make clean && make

When restarting Emacs, I had the bug happen again right away. Sadly
without a running strace.

When my Emacs starts, I connect to a few IRC servers, and my IRC client
then starts flyspell in the channel buffers. I got this error right
away:

Error in post-command-hook (flyspell-post-command-hook): (error
"process ispellno longer connected to pipe; closed it")


After this, killing Python RPC processes reproduced it again as well.

strace:

13:28:46 --- SIGIO (I/O possible) @ 0 (0) ---
13:28:46 rt_sigreturn(0x1d)             = 1 <0.000020>
13:28:46 ioctl(3, FIONREAD, [1])        = 0 <0.000019>
13:28:46 read(3, "\r", 1)               = 1 <0.000020>
13:28:46 ioctl(3, FIONREAD, [0])        = 0 <0.000016>
13:28:46 ioctl(3, FIONREAD, [0])        = 0 <0.000016>
13:28:46 write(3, "\r", 1)              = 1 <0.000032>
13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000020>
13:28:46 kill(4294962889, SIGKILL)      = 0 <0.002883>
13:28:46 --- SIGIO (I/O possible) @ 0 (0) ---
13:28:46 rt_sigreturn(0x1d)             = 0 <0.000021>
13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000016>
13:28:46 ioctl(3, FIONREAD, [0])        = 0 <0.000018>
13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017>
13:28:46 kill(4294962889, SIGHUP)       = 0 <0.000017>
13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000016>
13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017>
13:28:46 kill(4294962821, SIGKILL)      = 0 <0.004785>
13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000017>
13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017>
13:28:46 kill(4294962821, SIGHUP)       = 0 <0.000019>
13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000015>
13:28:46 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000022>
13:28:46 write(3, "\33[K\33[46;1H", 10) = 10 <0.000028>
13:28:46 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000016>
13:28:46 ioctl(3, FIONREAD, [0])        = 0 <0.000021>
13:28:46 pselect6(18, [3 4 5 7 8 11 13 17], [], NULL, {0, 262002641}, {NULL, 8}) = 2 (in [13 17], left {0, 261989573}) <0.000029>
13:28:46 read(13, "", 4096)             = 0 <0.000017>
13:28:46 read(17, "", 4096)             = 0 <0.000016>
13:28:46 ioctl(3, FIONREAD, [0])        = 0 <0.000019>
13:28:46 pselect6(18, [3 4 5 7 8 11 13 17], [], NULL, {0, 261495320}, {NULL, 8}) = 2 (in [13 17], left {0, 261485907}) <0.000030>
13:28:46 read(13, "", 4096)             = 0 <0.000017> 
13:28:46 read(17, "", 4096)             = 0 <0.000015>

[...]

13:289:02 --- SIGIO (I/O possible) @ 0 (0) --- 
13:289:02 rt_sigreturn(0x1d)             = 3 <0.000010> 
13:289:02 ioctl(3, FIONREAD, [1])        = 0 <0.000008> 
13:289:02 read(3, "d", 1)                = 1 <0.000009> 
13:289:02 ioctl(3, FIONREAD, [0])        = 0 <0.000007> 
13:289:02 ioctl(3, FIONREAD, [0])        = 0 <0.000007> 
13:289:02 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000014> 
13:289:02 kill(4294962821, SIGKILL)      = 0 <0.000010> 
13:289:02 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000008> 
13:289:02 close(12)                      = 0 <0.000015> 
13:289:02 close(13)                      = 0 <0.000012>
13:289:02 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000012>
13:289:02 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000010>
13:289:02 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.012141>
13:289:02 write(3, "\33[1;8r\33[3;1H\33[1M\33[1;50r", 23) = 23 <0.000018>
13:289:02 write(3, "\33[8;1H\33[K\33[H\n", 13) = 13 <0.000015>
13:289:02 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000010>
13:289:02 --- SIGIO (I/O possible) @ 0 (0) ---
13:289:02 rt_sigreturn(0x1d)             = 0 <0.000011>
13:289:02 ioctl(3, FIONREAD, [0])        = 0 <0.000011>
13:289:02 ioctl(3, FIONREAD, [0])        = 0 <0.000010>
13:289:02 pselect6(18, [3 4 5 7 8 11 17], [], NULL, {0, 206805058}, {NULL, 8}) = 1 (in [17], left {0, 206797159}) <0.000017> 
13:289:02 read(17, "", 4096)             = 0 <0.000009> 
13:289:02 pselect6(18, [3 4 5 7 8 11 17], [], NULL, {0, 206700816}, {NULL, 8}) = 1 (in [17], left {0, 206697015}) <0.000013> 
13:289:02 read(17, "", 4096)             = 0 <0.000008>

[...]

13:289:03 --- SIGIO (I/O possible) @ 0 (0) ---
13:289:03 rt_sigreturn(0x1d)             = 2 <0.000009>
13:289:03 ioctl(3, FIONREAD, [1])        = 0 <0.000008>
13:289:03 read(3, "d", 1)                = 1 <0.000008>
13:289:03 ioctl(3, FIONREAD, [0])        = 0 <0.000006>
13:289:03 ioctl(3, FIONREAD, [0])        = 0 <0.000006>
13:289:03 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000007> 
13:289:03 kill(4294962889, SIGKILL)      = 0 <0.000011>
13:289:03 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000007> 
13:289:03 close(16)                      = 0 <0.000013>
13:289:03 close(17)                      = 0 <0.000012>
13:289:03 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000010>
13:289:03 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000009>
13:289:03 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000010>
13:289:03 write(3, "\33[1;7r\33[2;1H\33[1M\33[1;50r", 23) = 23 <0.000012>
13:289:03 write(3, "\33[7;1H\33[K\33[H\n", 13) = 13 <0.000012>
13:289:03 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000007>

Back to normal.

Regards,
Jorgen

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 15:10:01 GMT) Full text and rfc822 format available.

Message #65 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Thu, 29 May 2014 08:09:19 -0700

Jorgen Schaefer wrote:

> I can not currently build the emacs-24 branch (and haven't found a
> commit in it that I can on a quick try), see bug#17629.

That's an indication of a reasonably-serious problem with Emacs and your 
platform, which is worrisome and should get fixed.  I've responded at 
Bug#17629.

> Using trunk (git 40f5ec0 * alloc.c (Fgarbage_collect): Fix compilation
> with GC_MARK_STACK == GC_USE_GCPROS_AS_BEFORE.) now, built with your
> patch and this command:
>
> $ ./autogen.sh && ./configure --without-x && make clean && make
>
> When restarting Emacs, I had the bug happen again right away. Sadly
> without a running strace.

If the bug is easily reproducible, that's progress.  The patch does fix 
a bug (even if it's not your bug), so I've pushed it into the trunk. 
Can you get an strace -f of when Emacs stops getting SIGCHLD?  That's 
the crucial part.  That is, early on in all the straces I've seen, child 
processes die, Emacs gets a SIGCHLD, and then it calls waitpid (wait4) 
to reap them.  At some point this stops working, and Emacs no longer 
gets a SIGCHLD when a child terminates.  We need to find out what this 
point is, and what's causing it.

> 13:28:46 kill(4294962889, SIGKILL)      = 0 <0.002883>
> 13:28:46 --- SIGIO (I/O possible) @ 0 (0) ---
> 13:28:46 rt_sigreturn(0x1d)             = 0 <0.000021>
> 13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000016>
> 13:28:46 ioctl(3, FIONREAD, [0])        = 0 <0.000018>
> 13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017>
> 13:28:46 kill(4294962889, SIGHUP)       = 0 <0.000017>

I'm surprised by the above trace.  The first kill(-4407,SIGKILL) 
succeeds, indicating that Emacs still has a subprocess.  But Emacs 
doesn't get a SIGCHLD, even after it unblocks SIGCHLD.  It would be 
helpful to see an 'strace -f' to see what's going on in the children here.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 15:24:02 GMT) Full text and rfc822 format available.

Message #68 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org, Jorgen Schaefer <forcer <at> forcix.cx>
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 17:22:49 +0200

Paul Eggert <eggert <at> cs.ucla.edu> writes:

>> 13:28:46 kill(4294962889, SIGKILL)      = 0 <0.002883>
>> 13:28:46 --- SIGIO (I/O possible) @ 0 (0) ---
>> 13:28:46 rt_sigreturn(0x1d)             = 0 <0.000021>
>> 13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000016>
>> 13:28:46 ioctl(3, FIONREAD, [0])        = 0 <0.000018>
>> 13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017>
>> 13:28:46 kill(4294962889, SIGHUP)       = 0 <0.000017>
>
> I'm surprised by the above trace.  The first kill(-4407,SIGKILL) succeeds,
> indicating that Emacs still has a subprocess.  But Emacs doesn't get a
> SIGCHLD, even after it unblocks SIGCHLD.

Nothing in that sequence unblocks SIGCHLD.

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 15:27:01 GMT) Full text and rfc822 format available.

Message #71 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: 17561 <at> debbugs.gnu.org, Jorgen Schaefer <forcer <at> forcix.cx>
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 08:26:31 -0700

Andreas Schwab wrote:
> Nothing in that sequence unblocks SIGCHLD.

Ouch; thanks, you're right, I misread that sequence.

So I'd like to see an strace of a session to see where Emacs first 
blocked SIGCHLD, so that we can figure out why Emacs didn't soon unblock it.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 17:04:01 GMT) Full text and rfc822 format available.

Message #74 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 19:03:26 +0200

[Message part 1 (text/plain, inline)]

On Thu, 29 May 2014 08:26:31 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Andreas Schwab wrote:
> > Nothing in that sequence unblocks SIGCHLD.
> 
> Ouch; thanks, you're right, I misread that sequence.
> 
> So I'd like to see an strace of a session to see where Emacs first 
> blocked SIGCHLD, so that we can figure out why Emacs didn't soon
> unblock it.

Ok, let's see.

--- SIGCHLD (Child exited) @ 0 (0) ---

It works. A bit later:

rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM PROF], 8) = 0
rt_sigprocmask(SIG_SETMASK, [QUIT ALRM PROF], NULL, 8) = 0

That's fine, CHLD is not masked. A while later, then, this:

--- SIGIO (I/O possible) @ 0 (0) ---
rt_sigreturn(0x1d)                      = 11684722
ioctl(3, FIONREAD, [0])                 = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0
kill(4294962810, SIGKILL)               = 0
rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0
kill(4294962810, SIGHUP)                = 0
rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0
kill(4294942593, SIGKILL)               = 0
rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0
kill(4294942593, SIGHUP)                = 0
rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0
write(3, "\33[K\33[13;12H", 11)         = 11
rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0
--- SIGIO (I/O possible) @ 0 (0) ---
rt_sigreturn(0x1d)                      = 101
ioctl(3, FIONREAD, [0])                 = 0
rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0
write(3, "\33[9;1H\33[?25l\33[46m(\33[39;49m\33[13;1"..., 48) = 48
write(3, "\33[34h\33[?25h\33[34l", 16)  = 16
rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0
--- SIGIO (I/O possible) @ 0 (0) ---

If I read this right, this leaves CHLD blocked (SETMASK for CHLD, then
BLOCK WINCH IO, UNBLOCK WINCH IO, etc.)

Full trace attached. Uncompressed, that's 8.3M, mostly because of the
tight select/read loop.

What I did there was to start a process using my "test code" (temp
buffer cat), then start a few Python RPC processes (which cause some
noise), and then kill them all using the elpy-rpc-restart function I
posted before.

Regards,
Jorgen

[emacs-bug.txt.xz (application/x-xz, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 17:56:01 GMT) Full text and rfc822 format available.

Message #77 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 19:55:20 +0200

Jorgen Schaefer <forcer <at> forcix.cx> writes:

> It works. A bit later:
>
> rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM PROF], 8) = 0
> rt_sigprocmask(SIG_SETMASK, [QUIT ALRM PROF], NULL, 8) = 0
>
> That's fine, CHLD is not masked. A while later, then, this:
>
> --- SIGIO (I/O possible) @ 0 (0) ---
> rt_sigreturn(0x1d)                      = 11684722
> ioctl(3, FIONREAD, [0])                 = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0

So the damage happend somewhere between these blocks.  Please try this
patch to better see when CHLD became blocked:

diff --git a/src/sysdep.c b/src/sysdep.c
index e5b2920..35c2856 100644
--- a/src/sysdep.c
+++ b/src/sysdep.c
@@ -618,6 +618,7 @@ request_sigio (void)
 {
 #ifdef USABLE_SIGIO
   sigset_t unblocked;
+  sigset_t oldmask;
 
   if (noninteractive)
     return;
@@ -627,7 +628,7 @@ request_sigio (void)
   sigaddset (&unblocked, SIGWINCH);
 # endif
   sigaddset (&unblocked, SIGIO);
-  pthread_sigmask (SIG_UNBLOCK, &unblocked, 0);
+  pthread_sigmask (SIG_UNBLOCK, &unblocked, &oldmask);
 
   interrupts_deferred = 0;
 #endif
@@ -638,6 +639,7 @@ unrequest_sigio (void)
 {
 #ifdef USABLE_SIGIO
   sigset_t blocked;
+  sigset_t oldmask;
 
   if (noninteractive)
     return;
@@ -647,7 +649,7 @@ unrequest_sigio (void)
   sigaddset (&blocked, SIGWINCH);
 # endif
   sigaddset (&blocked, SIGIO);
-  pthread_sigmask (SIG_BLOCK, &blocked, 0);
+  pthread_sigmask (SIG_BLOCK, &blocked, &oldmask);
   interrupts_deferred = 1;
 #endif
 }


Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 18:24:02 GMT) Full text and rfc822 format available.

Message #80 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: 17561 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 20:23:00 +0200

[Message part 1 (text/plain, inline)]

On Thu, 29 May 2014 19:55:20 +0200
Andreas Schwab <schwab <at> linux-m68k.org> wrote:

> Jorgen Schaefer <forcer <at> forcix.cx> writes:
> 
> > It works. A bit later:
> >
> > rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM PROF], 8) = 0
> > rt_sigprocmask(SIG_SETMASK, [QUIT ALRM PROF], NULL, 8) = 0
> >
> > That's fine, CHLD is not masked. A while later, then, this:
> >
> > --- SIGIO (I/O possible) @ 0 (0) ---
> > rt_sigreturn(0x1d)                      = 11684722
> > ioctl(3, FIONREAD, [0])                 = 0
> > rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0
> 
> So the damage happend somewhere between these blocks.  Please try this
> patch to better see when CHLD became blocked:

See the attached trace. I spared you the few megs of read/select loop
this time.

I don't understand it, though.

Line 4685 has a SIGCHLD. The next rt_sigprocmask that mentions CHLD is
in line 4726, where SIGCHLD is in the oldset. It wasn't at 4714/4715.
Huh? How did it get there?

[with-andreas-patch.txt.xz (application/x-xz, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 19:08:02 GMT) Full text and rfc822 format available.

Message #83 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: 17561 <at> debbugs.gnu.org
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, Andreas Schwab <schwab <at> linux-m68k.org>
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 21:06:59 +0200

On Thu, 29 May 2014 20:23:00 +0200
Jorgen Schaefer <forcer <at> forcix.cx> wrote:

> See the attached trace. I spared you the few megs of read/select loop
> this time.
> 
> I don't understand it, though.
> 
> Line 4685 has a SIGCHLD. The next rt_sigprocmask that mentions CHLD is
> in line 4726, where SIGCHLD is in the oldset. It wasn't at 4714/4715.
> Huh? How did it get there?

I do not see the rt_sigreturn corresponding to the SIGINT in line 4724.
Which would mean that the signals blocked in a signal handler (which I
guess includes SIGCHLD, according to the SIG_BLOCK in line 4726) won't
get restored.

I see the same sequence a few times in the full (21M) strace output:

SIGINT, then SIG_UNBLOCK for [INT], SIG_BLOCK for [WINCH IO] with
oldmask [QUIT ALRM CHILD PROF WINCH IO], then two calls to write(),
and then SIG_UNBLOCK with [WINCH IO], then a SIGIO, but no rt_sigreturn
for the SIGINT.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 19:16:01 GMT) Full text and rfc822 format available.

Message #86 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 21:15:35 +0200

Jorgen Schaefer <forcer <at> forcix.cx> writes:

> Line 4685 has a SIGCHLD. The next rt_sigprocmask that mentions CHLD is
> in line 4726, where SIGCHLD is in the oldset. It wasn't at 4714/4715.
> Huh? How did it get there?

31528 20:04:49.041423 --- SIGINT (Interrupt) @ 0 (0) ---
31528 20:04:49.041578 rt_sigprocmask(SIG_UNBLOCK, [INT], NULL, 8) = 0 <0.000016>
31528 20:04:49.044035 rt_sigprocmask(SIG_BLOCK, [WINCH IO], [QUIT ALRM CHLD PROF WINCH IO], 8) = 0 <0.000019>

It's the first time a SIGINT is received, and the first time the default
mask became non-empty.  That's because quit_throw_to_read_char doesn't
restore the signal mask.

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 20:28:01 GMT) Full text and rfc822 format available.

Message #89 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 22:27:54 +0200

Jorgen Schaefer <forcer <at> forcix.cx> writes:

> I do not see the rt_sigreturn corresponding to the SIGINT in line 4724.

There is none, it longjmps out of the signal handler.

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Thu, 29 May 2014 23:16:01 GMT) Full text and rfc822 format available.

Message #92 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: Emacs can forget processes
Date: Thu, 29 May 2014 16:15:16 -0700

Jorgen Schaefer wrote:
> See attachment

Thanks, there's clearly a bug there, as 'configure' decides that 
pthread_sigmask doesn't exist even though it does, in -lpthread.  I've 
installed a patch in trunk bzr 117200.  I don't think this fixes your 
main bug, nor does it fix a bug in emacs-24, but one step at a time. 
I'll look at the main bug soon.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Fri, 30 May 2014 06:08:02 GMT) Full text and rfc822 format available.

Message #95 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Andreas Schwab <schwab <at> linux-m68k.org>, Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561 <at> debbugs.gnu.org
Subject: Re: bug#17561: Emacs can forget processes
Date: Thu, 29 May 2014 23:07:06 -0700

Andreas Schwab wrote:
> quit_throw_to_read_char doesn't restore the signal mask.

Thanks for the diagnosis.  I installed trunk bzr 117201 to try to fix 
this problem.  I don't see how this would explain the bug in emacs-24, 
though, as in emacs-24 the signal handler invokes pthread_sigmask 
(SIG_SETMASK, &empty_mask, 0) before calling quit_throw_to_read_char, 
and the pthread_sigmask call should restore the signal mask.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17561; Package emacs. (Fri, 30 May 2014 20:43:02 GMT) Full text and rfc822 format available.

Message #98 received at 17561 <at> debbugs.gnu.org (full text, mbox):

From: Jorgen Schaefer <forcer <at> forcix.cx>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17561 <at> debbugs.gnu.org, Andreas Schwab <schwab <at> linux-m68k.org>
Subject: Re: bug#17561: Emacs can forget processes
Date: Fri, 30 May 2014 22:41:56 +0200

On Thu, 29 May 2014 23:07:06 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Andreas Schwab wrote:
> > quit_throw_to_read_char doesn't restore the signal mask.
> 
> Thanks for the diagnosis.  I installed trunk bzr 117201 to try to fix 
> this problem.  I don't see how this would explain the bug in
> emacs-24, though, as in emacs-24 the signal handler invokes
> pthread_sigmask (SIG_SETMASK, &empty_mask, 0) before calling
> quit_throw_to_read_char, and the pthread_sigmask call should restore
> the signal mask.

I haven't been able to reproduce the bug since I am running the
patched version. Does not say it's gone, but it's a good indication.

It is entirely possible that the hangs I have witnessed in the past
were caused by some other problem. As I said initially, it's only been
the last few days that I actually had the luck to track down the
behavior in the first place (being at work is a bad time to try and
find weird behavior of your editor, sadly :-)).

So I would assume that this bug is fixed, and if I come across weird
behavior again, I'll create a new report or re-open this one. Thank you
for the excellent debugging!

Possibly related, while trying to reproduce the bug, I have noticed
some other strange behavior. I managed to reduce it to the following
reproduction code:

(with-temp-buffer
  (let ((proc (start-process "test" (current-buffer) "bash" "-c"
                             "echo foo ; sleep 5")))
    (set-process-query-on-exit-flag proc nil)
    (accept-process-output nil 2)))

This hangs my normal Emacs for 2 seconds. A newly-started Emacs returns
immediately. I *suspect* that this is a timing issue, and strace
*seems* to concur - Emacs reads the "foo" from the subprocess before
running accept-process-output, and then just hangs for the two second
timeout, instead of noticing that there already was some output. Should
I create a new bug report for this?

Regards,
Jorgen

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Fri, 30 May 2014 21:30:04 GMT) Full text and rfc822 format available.

Notification sent to Jorgen Schaefer <contact <at> jorgenschaefer.de>:
bug acknowledged by developer. (Fri, 30 May 2014 21:30:06 GMT) Full text and rfc822 format available.

Message #103 received at 17561-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jorgen Schaefer <forcer <at> forcix.cx>
Cc: 17561-done <at> debbugs.gnu.org
Subject: Re: bug#17561: Emacs can forget processes
Date: Fri, 30 May 2014 14:29:36 -0700

On 05/30/2014 01:41 PM, Jorgen Schaefer wrote:
> So I would assume that this bug is fixed, and if I come across weird
> behavior again, I'll create a new report or re-open this one.

Thanks, closing the bug.

> Possibly related, while trying to reproduce the bug, I have noticed
> some other strange behavior. I managed to reduce it to the following
> reproduction code:
>
> (with-temp-buffer
>    (let ((proc (start-process "test" (current-buffer) "bash" "-c"
>                               "echo foo ; sleep 5")))
>      (set-process-query-on-exit-flag proc nil)
>      (accept-process-output nil 2)))
>
> This hangs my normal Emacs for 2 seconds. A newly-started Emacs returns
> immediately. I*suspect*  that this is a timing issue, and strace
> *seems*  to concur - Emacs reads the "foo" from the subprocess before
> running accept-process-output, and then just hangs for the two second
> timeout, instead of noticing that there already was some output. Should
> I create a new bug report for this?

Yes, please.  It'd be nice to get an strace of Emacs, particularly where 
it goes wrong (which may be well before the symptoms start being visible).

Forcibly Merged 17337 17561 17628. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Tue, 03 Jun 2014 15:41:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 02 Jul 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 300 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #17561 24.4.50; Emacs can forget processes

GNU bug report logs - #17561
24.4.50; Emacs can forget processes