GNU bug report logs - #12980
24.3.50; Zombie process left after call-process

Previous Next

Package: emacs;

Reported by: Harald Hanche-Olsen <hanche <at> math.ntnu.no>

Date: Sat, 24 Nov 2012 09:37:02 UTC

Severity: important

Tags: patch

Merged with 14515

Found in versions 24.3, 24.3.50

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 12980 in the body.
You can then email your comments to 12980 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#12980; Package emacs. (Sat, 24 Nov 2012 09:37:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Harald Hanche-Olsen <hanche <at> math.ntnu.no>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 24 Nov 2012 09:37:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Harald Hanche-Olsen <hanche <at> math.ntnu.no>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.3.50; Zombie process left after call-process
Date: Sat, 24 Nov 2012 10:34:37 +0100 (CET)
If I run

(call-process "/bin/sleep" nil 0 nil "5")

this returns immediately (which is expected), but after 5 seconds, the
subprocess becomes a zombie and hangs around until emacs is quit.

In GNU Emacs 24.3.50.1 (x86_64-apple-darwin12.2.0, NS apple-appkit-1187.34)
 of 2012-11-09 on airy
Bzr revision: 110853 jan.h.d <at> swipnet.se-20121109063651-ewuwszrpta821oha
Windowing system distributor `Apple', version 10.3.1187
Configured using:
 `configure '--with-ns''

I asked about this on the emacs-devel list (see the thread "Zombie
subprocesses" starting 2012-11-23), and others report being able to
reproduce the bug on Cygwin (emacs-24 branch) and GNU/Linux
(x86_64-unknown-linux-gnu, latest emacs-24 and trunk) as well.

From the discussion there, this appears related to bug#8855 somehow.

- Harald




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12980; Package emacs. (Mon, 26 Nov 2012 22:41:02 GMT) Full text and rfc822 format available.

Message #8 received at 12980 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Harald Hanche-Olsen <hanche <at> math.ntnu.no>
Cc: 9488 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, 12980 <at> debbugs.gnu.org
Subject: Re: 24.3.50; Zombie process left after call-process
Date: Mon, 26 Nov 2012 14:38:41 -0800
[Message part 1 (text/plain, inline)]
Harald, thanks for the bug report.  This part of Emacs is a bit
messy, but I hacked away at it and came up with
the attached proposed patch.

This patch touches some code for Microsoft platforms
so I'm CC'ing this to Eli.  I'm also CC'ing Bug#9488
as I think it fixes the rest of that bug too.
[g_spawn_sync.txt (text/plain, attachment)]

Added tag(s) patch. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 26 Nov 2012 23:44:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12980; Package emacs. (Tue, 27 Nov 2012 16:03:02 GMT) Full text and rfc822 format available.

Message #13 received at 12980 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 9488 <at> debbugs.gnu.org, hanche <at> math.ntnu.no, 12980 <at> debbugs.gnu.org
Subject: Re: 24.3.50; Zombie process left after call-process
Date: Tue, 27 Nov 2012 18:00:56 +0200
> Date: Mon, 26 Nov 2012 14:38:41 -0800
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> CC: 12980 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, 
>  9488 <at> debbugs.gnu.org
> 
> Harald, thanks for the bug report.  This part of Emacs is a bit
> messy, but I hacked away at it and came up with
> the attached proposed patch.
> 
> This patch touches some code for Microsoft platforms
> so I'm CC'ing this to Eli.

Thanks.

> +/* Return true if CHILD is reapable.  CHILD must be the process ID of
> +   a child process.  As a side effect this function may reap CHILD,
> +   discarding its status, and therefore report that CHILD is not
> +   reapable.
> +
> +   This function is intended for use after the caller was interrupted
> +   while trying to reap CHILD, and where the caller later tests
> +   whether CHILD was in fact reaped.  The caller should not create
> +   another child process (via vfork, say) between the earlier attempt
> +   to reap CHILD and now, as that would cause a race if the newly
> +   created child process happened to have CHILD as its process-ID.  */

I'm confused by this commentary, perhaps because it doesn't explain
what is the definition of "reapable".  The original naive
interpretation (i.e. the child exited and its status is ready to be
accessed) seems to contradict the latter part of the commentary, and
also the fact that the function will return false both if waitpid
returns a positive and a negative value.  I would suggest to clarify
this.

>  #ifdef MSDOS /* MW, July 1993 */
>      /* Note that on MSDOS `child_setup' actually returns the child process
> -       exit status, not its PID, so we assign it to `synch_process_retcode'
> -       below.  */
> +       exit status, not its PID, so assign it to status below.  */
>      pid = child_setup (filefd, outfilefd, fd_error, (char **) new_argv,
>  		       0, current_dir);
> -
> -    /* Record that the synchronous process exited and note its
> -       termination status.  */
> -    synch_process_alive = 0;
> -    synch_process_retcode = pid;
> -    if (synch_process_retcode < 0)  /* means it couldn't be exec'ed */
> -      {
> -	synchronize_system_messages_locale ();
> -	synch_process_death = strerror (errno);
> -      }
> +    status = pid;

The feature, whereby this function (call-process) could return a
description of errno, is hereby lost.  That's quite nasty in the MSDOS
case, where errno is the only way to indicate to the user what
happened, because the value of status in case of failure is always -1,
which is not quite descriptive.  Can we please not drop this feature
on the floor?

> +    if (0 < pid)
> +      {
> +	if (INTEGERP (buffer))
> +	  record_deleted_pid (pid);
> +	else
> +	  {
> +#ifdef MSDOS
> +	    /* MSDOS needs different cleanup information.  */
> +	    cleanup_info_tail = build_string (tempfile ? tempfile : "");
> +#else
> +	    cleanup_info_tail = INTEGER_TO_CONS (pid);
> +#endif
> +	    record_unwind_protect (call_process_cleanup,
> +				   Fcons (Fcurrent_buffer (),
> +					  Fcons (INTEGER_TO_CONS (fd[0]),
> +						 cleanup_info_tail)));
> +	  }
> +      }
> +
> +    unblock_child_signal ();
>      unblock_input ();
>  
>  #endif /* not WINDOWSNT */
> @@ -658,17 +705,6 @@
>    /* Enable sending signal if user quits below.  */
>    call_process_exited = 0;
>  
> -#if defined (MSDOS)
> -  /* MSDOS needs different cleanup information.  */
> -  cleanup_info_tail = build_string (tempfile ? tempfile : "");
> -#else
> -  cleanup_info_tail = INTEGER_TO_CONS (pid);
> -#endif /* not MSDOS */
> -  record_unwind_protect (call_process_cleanup,
> -			 Fcons (Fcurrent_buffer (),
> -				Fcons (INTEGER_TO_CONS (fd[0]),
> -				       cleanup_info_tail)));
> -

This is wrong for MSDOS, because the temporary file is created before
attempting to launch the process, and so it needs to be cleaned up
even if we failed to launch the child process.

> -  if (synch_process_termsig)
> +  if (WIFSIGNALED (status))
>      {
>        const char *signame;
>  
>        synchronize_system_messages_locale ();
> -      signame = strsignal (synch_process_termsig);
> +      signame = strsignal (WTERMSIG (status));
>  
>        if (signame == 0)
>  	signame = "unknown";
>  
> -      synch_process_death = signame;
> +      return code_convert_string_norecord (build_string (signame),
> +					   Vlocale_coding_system, 0);
>      }
>  
> -  if (synch_process_death)
> -    return code_convert_string_norecord (build_string (synch_process_death),
> -					 Vlocale_coding_system, 0);
> -  return make_number (synch_process_retcode);
> +  eassert (WIFEXITED (status));
> +  return make_number (WEXITSTATUS (status));

This seem to assume that the only trouble in call-process could be
from some signal.  Why is that true?

>  static bool
> -process_status_retrieved (pid_t desired, pid_t have, int *status)
> +process_status_retrieved (pid_t desired, int *status, int options)
>  {
> -  if (have < 0)
> +  /* Invoke waitpid only with a known process ID; do not invoke
> +     waitpid with a nonpositive argument.  Otherwise, Emacs might
> +     reap an unwanted process by mistake.  For example, invoking
> +     waitpid (-1, ...) can mess up glib by reaping glib's subprocesses,
> +     so that another thread running glib won't find them.  */
> +  eassert (0 < desired);
> +
> +  while (1)
>      {
> -      /* Invoke waitpid only with a known process ID; do not invoke
> -	 waitpid with a nonpositive argument.  Otherwise, Emacs might
> -	 reap an unwanted process by mistake.  For example, invoking
> -	 waitpid (-1, ...) can mess up glib by reaping glib's subprocesses,
> -	 so that another thread running glib won't find them.  */
> -      do
> -	have = waitpid (desired, status, WNOHANG | WUNTRACED);
> -      while (have < 0 && errno == EINTR);
> +      pid_t pid = waitpid (desired, status, WNOHANG | options);
> +      if (0 <= pid)
> +	return 0 < pid;
> +      if (errno == ECHILD)
> +	return 0;
> +      eassert (errno == EINTR);

Why is this eassert a good idea?  We don't need to enforce the Posix
spec at a price of crashing Emacs on users, do we?  What bad things
can happen if you see some other errno here?

> --- src/w32proc.c	2012-11-24 01:57:09 +0000
> +++ src/w32proc.c	2012-11-26 22:06:24 +0000
> @@ -1273,34 +1273,7 @@
>    DebPrint (("Wait signaled with process pid %d\n", cp->pid));
>  #endif
>  
> -  if (status)
> -    {
> -      *status = retval;
> -    }
> -  else if (synch_process_alive)
> -    {
> -      synch_process_alive = 0;
> -
> -      /* Report the status of the synchronous process.  */
> -      if (WIFEXITED (retval))
> -	synch_process_retcode = WEXITSTATUS (retval);
> -      else if (WIFSIGNALED (retval))
> -	{
> -	  int code = WTERMSIG (retval);
> -	  const char *signame;
> -
> -	  synchronize_system_messages_locale ();
> -	  signame = strsignal (code);
> -
> -	  if (signame == 0)
> -	    signame = "unknown";
> -
> -	  synch_process_death = signame;
> -	}
> -
> -      reap_subprocess (cp);
> -    }
> -
> +  *status = retval;
>    reap_subprocess (cp);

The condition testing that status is non-NULL must stay, because that
argument can be NULL, even if Emacs currently never uses that option.

I didn't see any other obvious problems.  However, given the
complexity of handling this, and the extent of changes in this
changeset, it would be nice to have somewhere a commentary with an
overview of how termination of child processes is handled.  Reverse
engineering that from the code is not really trivial.  Would you
please consider writing such an overview?

TIA




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12980; Package emacs. (Thu, 29 Nov 2012 02:33:01 GMT) Full text and rfc822 format available.

Message #16 received at 12980 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 9488 <at> debbugs.gnu.org, hanche <at> math.ntnu.no, 12980 <at> debbugs.gnu.org
Subject: Re: 24.3.50; Zombie process left after call-process
Date: Wed, 28 Nov 2012 18:30:08 -0800
[Message part 1 (text/plain, inline)]
Eli, thanks for your review.  I'm attaching a
heavily-revised patch, which tries to take your comments
into account, along with several other issues I noticed in
my own analysis and testing.  In response to your comments:

On 11/27/2012 08:00 AM, Eli Zaretskii wrote:

> I'm confused by this commentary

The new patch removes that function and its associated
commentary.

> The feature, whereby this function (call-process) could return a
> description of errno, is hereby lost.  That's quite nasty in the MSDOS
> case

That feature is supported only by MS-DOS, right?  On all
other platforms an error is thrown.  Anyway, the new patch
attempts to preserve the MS-DOS behavior.

> This is wrong for MSDOS, because the temporary file is created before
> attempting to launch the process, and so it needs to be cleaned up
> even if we failed to launch the child process.

The new patch attempts to undo that change for MSDOS.

> This seem to assume that the only trouble in call-process could be
> from some signal.  Why is that true?

By the time that part of the code is reached, any other
failures (e.g., fork failure) have already been diagnosed
and an error would have been thrown or call-process would
have returned before getting to that location.

> Why is this eassert a good idea?  We don't need to enforce the Posix
> spec at a price of crashing Emacs on users, do we?  What bad things
> can happen if you see some other errno here?

It's not a question of enforcing Posix semantics.  It's a
question of catching potentially serious internal
programming errors in Emacs.  That part of the code has been
completely rewritten, so the details of the comment is moot.
However, the new code has an "eassert (errno == EINTR)" in
the new get_child_status function.  I preceded that with a
comment to try to help explain why the eassert is there.

> The condition testing that status is non-NULL must stay, because that
> argument can be NULL, even if Emacs currently never uses that option.

OK, thanks, I did that.

> it would be nice to have somewhere a commentary with an
> overview of how termination of child processes is handled.

I added some, before handle_child_signal.

[g_spawn_sync.txt (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12980; Package emacs. (Thu, 29 Nov 2012 18:07:01 GMT) Full text and rfc822 format available.

Message #19 received at 12980 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 9488 <at> debbugs.gnu.org, hanche <at> math.ntnu.no, 12980 <at> debbugs.gnu.org
Subject: Re: 24.3.50; Zombie process left after call-process
Date: Thu, 29 Nov 2012 20:04:55 +0200
> Date: Wed, 28 Nov 2012 18:30:08 -0800
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> CC: hanche <at> math.ntnu.no, 12980 <at> debbugs.gnu.org, 9488 <at> debbugs.gnu.org
> 
> > The feature, whereby this function (call-process) could return a
> > description of errno, is hereby lost.  That's quite nasty in the MSDOS
> > case
> 
> That feature is supported only by MS-DOS, right?

Yes, I think so.

> On all other platforms an error is thrown.  Anyway, the new patch
> attempts to preserve the MS-DOS behavior.

Thanks.

> > Why is this eassert a good idea?  We don't need to enforce the Posix
> > spec at a price of crashing Emacs on users, do we?  What bad things
> > can happen if you see some other errno here?
> 
> It's not a question of enforcing Posix semantics.  It's a
> question of catching potentially serious internal
> programming errors in Emacs.  That part of the code has been
> completely rewritten, so the details of the comment is moot.
> However, the new code has an "eassert (errno == EINTR)" in
> the new get_child_status function.  I preceded that with a
> comment to try to help explain why the eassert is there.

Here's the comment:

> +      /* CHILD must be a child process that has not been reaped, and
> +	 STATUS and OPTIONS must be valid.  */
> +      eassert (errno == EINTR);

Are we sure that either CHILD will have exited at this point, or else
OPTIONS won't include WNOHANG?  Can this function be ever called if
neither of these conditions is true?

> > it would be nice to have somewhere a commentary with an
> > overview of how termination of child processes is handled.
> 
> I added some, before handle_child_signal.

Thanks.

> -  if (synch_process_termsig)
> +  if (WIFSIGNALED (status))
>      {
>        const char *signame;
>  
>        synchronize_system_messages_locale ();
> -      signame = strsignal (synch_process_termsig);
> +      signame = strsignal (WTERMSIG (status));
>  
>        if (signame == 0)
>  	signame = "unknown";
>  
> -      synch_process_death = signame;
> +      return code_convert_string_norecord (build_string (signame),
> +					   Vlocale_coding_system, 0);
>      }
>  
> -  if (synch_process_death)
> -    return code_convert_string_norecord (build_string (synch_process_death),
> -					 Vlocale_coding_system, 0);
> -  return make_number (synch_process_retcode);
> +  eassert (WIFEXITED (status));
> +  return make_number (WEXITSTATUS (status));

The use of WIF* macros here, in particular WIFSIGNALED and WTERMSIG,
will probably require more accurate definitions of these for Windows
(currently, they are quite naive and therefore wrong).  But that can
wait.

Otherwise, I see no problems related to Windows.  Thanks!




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12980; Package emacs. (Thu, 29 Nov 2012 20:31:02 GMT) Full text and rfc822 format available.

Message #22 received at 12980 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 9488 <at> debbugs.gnu.org, hanche <at> math.ntnu.no, 12980 <at> debbugs.gnu.org
Subject: Re: 24.3.50; Zombie process left after call-process
Date: Thu, 29 Nov 2012 12:28:24 -0800
On 11/29/12 10:04, Eli Zaretskii wrote:

>> +      /* CHILD must be a child process that has not been reaped, and
>> +         STATUS and OPTIONS must be valid.  */
>> +      eassert (errno == EINTR);
> 
> Are we sure that either CHILD will have exited at this point, or else
> OPTIONS won't include WNOHANG?

It's unlikely that CHILD will have exited at this point.  That can
happen only if CHILD exited after the immediately-preceding waitpid
call and before this eassert.  In such a case, CHILD has exited but
has not been reaped.

It's common for OPTIONS to not include WNOHANG (wait_for_termination
does this), and when that happens it's almost always the case that
CHILD has not exited at this point.

> Can this function be ever called if
> neither of these conditions is true?

By "this function" I assume you mean get_child_status.  Yes, it's
quite common.  For example, Fcall_process invokes wait_for_termination,
and it's common for wait_for_termination to invoke get_child_status
before CHILD has exited, and without WNOHANG in OPTIONS.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 03 Dec 2012 21:51:02 GMT) Full text and rfc822 format available.

Notification sent to Harald Hanche-Olsen <hanche <at> math.ntnu.no>:
bug acknowledged by developer. (Mon, 03 Dec 2012 21:51:02 GMT) Full text and rfc822 format available.

Message #27 received at 12980-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 12980-done <at> debbugs.gnu.org, 9488-done <at> debbugs.gnu.org
Subject: Re: 24.3.50; Zombie process left after call-process
Date: Mon, 03 Dec 2012 13:47:59 -0800
No further comment, so I installed the patch as Emacs trunk bzr 111081
and am marking bugs 9488 and 12980 as done.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 01 Jan 2013 12:24:03 GMT) Full text and rfc822 format available.

bug unarchived. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Sat, 15 Jun 2013 18:26:02 GMT) Full text and rfc822 format available.

Forcibly Merged 12980 14515. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Sat, 15 Jun 2013 18:26:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 14 Jul 2013 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 10 years and 282 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.