GNU bug report logs - #32786
(system* ...) call somehow interferes with atomic-box on armv7l

Previous Next

Package: guile;

Reported by: Михаил Бахтерев <mob <at> k.imm.uran.ru>

Date: Thu, 20 Sep 2018 18:08:01 UTC

Severity: normal

Done: Mark H Weaver <mhw <at> netris.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 32786 in the body.
You can then email your comments to 32786 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guile <at> gnu.org:
bug#32786; Package guile. (Thu, 20 Sep 2018 18:08:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Михаил Бахтерев <mob <at> k.imm.uran.ru>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Thu, 20 Sep 2018 18:08:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Михаил Бахтерев <mob <at> k.imm.uran.ru>
To: bug-guile <at> gnu.org
Subject: (system* ...) call somehow interferes with atomic-box on armv7l
Date: Thu, 20 Sep 2018 22:13:06 +0500
[Message part 1 (text/plain, inline)]
Greetings.

I attach the code sample, which expected behaviour is:

  1. to periodically restart thread, which executes «sleep 1s» with
  system* call;

  2. and then to check if it should repeat the execution.
  
The flag, which controls that re-execution is managed by another thread
through atomic-box primitive.

The code does not run as expected: the sleep executing sleep starts
once, and change seems to remain invisible to main thread.

The code works as expected, when:

  1. (sleep 1) is used instead (system* "sleep" "1s");
  2. running the code on x86_64 cpus.

The use of affinity mask

  taskset 0x01 guile test.scm

does not help.

- MB. Respectfully

P.S. Additional information:

$ guile --version  
guile (GNU Guile) 2.2.4
Copyright (C) 2018 Free Software Foundation, Inc.

License LGPLv3+: GNU LGPL 3 or later <http://gnu.org/licenses/lgpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ sh config.guess
armv7l-unknown-linux-gnueabihf

$ pacman -Qi guile 
Name            : guile
Version         : 2.2.4-1
Description     : Portable, embeddable Scheme implementation written in C
Architecture    : armv7h
URL             : https://www.gnu.org/software/guile/
Licenses        : GPL
Groups          : None
Provides        : None
Depends On      : gmp  libltdl  ncurses  texinfo  libunistring  gc  libffi
Optional Deps   : None
Required By     : make
Optional For    : gnutls
Conflicts With  : None
Replaces        : None
Installed Size  : 40.83 MiB
Packager        : Arch Linux ARM Build System <builder+xu9 <at> archlinuxarm.org>
Build Date      : Tue 03 Jul 2018 04:47:53 PM +05
Install Date    : Mon 09 Jul 2018 01:02:48 PM +05
Install Reason  : Installed as a dependency for another package
Install Script  : No
Validated By    : Signature
[test.scm (application/vnd.lotus-screencam, attachment)]
[cpuinfo (text/plain, attachment)]

Information forwarded to bug-guile <at> gnu.org:
bug#32786; Package guile. (Fri, 21 Sep 2018 21:27:01 GMT) Full text and rfc822 format available.

Message #8 received at 32786 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Михаил Бахтерев <mob <at> k.imm.uran.ru>
Cc: 32786 <at> debbugs.gnu.org
Subject: Re: bug#32786: (system* ...) call somehow interferes with atomic-box
 on armv7l
Date: Fri, 21 Sep 2018 17:26:27 -0400
Hi,

Михаил Бахтерев <mob <at> k.imm.uran.ru> writes:

> I attach the code sample, which expected behaviour is:
>
>   1. to periodically restart thread, which executes «sleep 1s» with
>   system* call;
>
>   2. and then to check if it should repeat the execution.
>   
> The flag, which controls that re-execution is managed by another thread
> through atomic-box primitive.
>
> The code does not run as expected: the sleep executing sleep starts
> once, and change seems to remain invisible to main thread.

Can you describe in more detail the behavior you are seeing?  What
messages do you see on stderr from your 'dump' calls when it gets stuck?
What state is the atomic box left in?

I see a couple of problems with the code below:

* Two threads write concurrently to the same port, namely
  (current-error-port).  You should use thread synchronization to ensure
  that operations to a given port are serialized.

* There's a race condition when the atomic box is in the #:accepted
  state.  If the main thread finds the box in that state, it changes the
  state to #:need-to-sleep, which will apparently cause the other thread
  to sleep again.

I'm not sure if these problems could explain what you're seeing.

     Regards,
       Mark


> The code works as expected, when:
>
>   1. (sleep 1) is used instead (system* "sleep" "1s");
>   2. running the code on x86_64 cpus.
>
> The use of affinity mask
>
>   taskset 0x01 guile test.scm
>
> does not help.
>
> - MB. Respectfully
>
> P.S. Additional information:
>
> $ guile --version  
> guile (GNU Guile) 2.2.4
> Copyright (C) 2018 Free Software Foundation, Inc.
>
> License LGPLv3+: GNU LGPL 3 or later <http://gnu.org/licenses/lgpl.html>.
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
>
> $ sh config.guess
> armv7l-unknown-linux-gnueabihf
>
> $ pacman -Qi guile 
> Name            : guile
> Version         : 2.2.4-1
> Description     : Portable, embeddable Scheme implementation written in C
> Architecture    : armv7h
> URL             : https://www.gnu.org/software/guile/
> Licenses        : GPL
> Groups          : None
> Provides        : None
> Depends On      : gmp  libltdl  ncurses  texinfo  libunistring  gc  libffi
> Optional Deps   : None
> Required By     : make
> Optional For    : gnutls
> Conflicts With  : None
> Replaces        : None
> Installed Size  : 40.83 MiB
> Packager        : Arch Linux ARM Build System <builder+xu9 <at> archlinuxarm.org>
> Build Date      : Tue 03 Jul 2018 04:47:53 PM +05
> Install Date    : Mon 09 Jul 2018 01:02:48 PM +05
> Install Reason  : Installed as a dependency for another package
> Install Script  : No
> Validated By    : Signature
>
> (use-modules (ice-9 atomic)
>              (ice-9 threads))
>
> (define dump
>   (let ((p (current-error-port)))
>     (lambda (fmt . args) (apply format p fmt args))))
>
> (define state (make-atomic-box #:nothing-to-do))
>
> (define (expect e v)
>   (when (not (eq? e v))
>     (error "Protocol violation; (expecting . got):" (cons e v)))) 
>
> (define (sleep-loop)
>   (expect #:need-to-sleep (atomic-box-ref state))
>
>   (dump "R: Going to (sleep 1)~%")
>   (atomic-box-set! state #:accepted)
>   
>   ; SOMETHING WRONG IS HERE
>   (system* "sleep" "1s")
>   ; (sleep 1)
>
>   (let ((v (atomic-box-compare-and-swap! state #:accepted #:nothing-to-do)))
>     (when (not (eq? #:accepted v))
>       (dump "S: Repeating sleep~%")
>       (sleep-loop)))
>
>   (dump "R: sleep-loop finished~%"))
>
> (define (main-loop)
>   (define (run-thread)
>     (dump "M: new sleep thread~%")
>     (atomic-box-set! state #:need-to-sleep)
>     (call-with-new-thread sleep-loop)) 
>
>   (dump "M: (sleep 3)~%")
>   (sleep 3)
>
>   (dump "M: protocol exchange~%")
>   (let ((st (atomic-box-ref state)))
>     (case st 
>       ((#:nothing-to-do) (run-thread))
>       ((#:accepted) (let ((v (atomic-box-compare-and-swap! state #:accepted #:need-to-sleep)))
>                       (when (not (eq? #:accepted v))
>                         (expect #:accepted v)
>                         (run-thread))))
>       (else (expect #:need-to-sleep st))))
>
>   (main-loop))
>
> (main-loop)




Information forwarded to bug-guile <at> gnu.org:
bug#32786; Package guile. (Sat, 22 Sep 2018 05:06:01 GMT) Full text and rfc822 format available.

Message #11 received at 32786 <at> debbugs.gnu.org (full text, mbox):

From: Михаил Бахтерев <mob <at> k.imm.uran.ru>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 32786 <at> debbugs.gnu.org
Subject: Re: bug#32786: (system* ...) call somehow interferes with atomic-box
 on armv7l
Date: Sat, 22 Sep 2018 09:39:38 +0500
On Fri, Sep 21, 2018 at 05:26:27PM -0400, Mark H Weaver wrote:
> Hi,
> 
> Can you describe in more detail the behavior you are seeing?  What
> messages do you see on stderr from your 'dump' calls when it gets stuck?
> What state is the atomic box left in?
> 
> I see a couple of problems with the code below:
> 
> * Two threads write concurrently to the same port, namely
>   (current-error-port).  You should use thread synchronization to ensure
>   that operations to a given port are serialized.
> 
> * There's a race condition when the atomic box is in the #:accepted
>   state.  If the main thread finds the box in that state, it changes the
>   state to #:need-to-sleep, which will apparently cause the other thread
>   to sleep again.
> 
> I'm not sure if these problems could explain what you're seeing.
> 
>      Regards,
>        Mark
> 

The code is supposed to model the following behaviour.

The main thread tracks some events (in the example: the end of (sleep
3)). If one or multiple events occurred some long-lasting external
command should be executed. Long-lasting means that multiple events may
occur during execution.

To perform that the main thread should:

  1. either to run new work thread (sleep-loop procedure), which will
  call (system* ...);

  2. or to communicate the fact of events occurrences to the work
  thread, which is supposed to detect this and just repeat the command.

If multiple events have occurred during command execution the command
should be re-executed only once.

The event occurrence is communicated through simple protocol with three
states: #:nothing-to-do < #:accepted < #:need-to-sleep. The main thread
pulls state up, and work thread pulls it down. When the main thread
detects event, and the state is #:accepted, it means, that work thread
accepted the previous job, and executing it now, so the main thread
pulls the state up to need-to-sleep, and the work thread re-executes
command (sleeps again in the test case), when the previous external
command run has finished. This is intended behaviour.

I expect the program to print some garbage, showing that work thread is
restarted periodically. Something like that (it is result of using
(sleep 1) instead of (system* "sleep" "1s")):

$ guile test.scm
M: (sleep 3)
M: protocol exchange
M: new sleep thread
M: (sleep 3)

R: Going to (sleep 1)
R: sleep-loop finished
M: protocol exchange
M: new sleep thread
M: (sleep 3)
: (sleep 3)R: Going to (sleep 1)

R: sleep-loop finished
M: protocol exchange
M: new sleep thread
M: (sleep 3)

R: Going to (sleep 1)
R: sleep-loop finished
M: protocol exchange
M: new sleep thread
M: (sleep 3)

: (sleep 3)R: Going to (sleep 1)
R: sleep-loop finished

But what i get with (system* "sleep" "1s") is this:

$ guile test.scm
M: (sleep 3)
M: protocol exchange
M: new sleep thread
M: (sleep 3)
R: Going to (sleep 1)
R: sleep-loop finished
M: protocol exchange
M: (sleep 3)
M: protocol exchange
M: (sleep 3)
M: protocol exchange
M: (sleep 3)

So, the state of the atomic-box is stuck in #:need-to-sleep somehow.

- MB. Respectfully




Information forwarded to bug-guile <at> gnu.org:
bug#32786; Package guile. (Sat, 22 Sep 2018 05:06:02 GMT) Full text and rfc822 format available.

Message #14 received at 32786 <at> debbugs.gnu.org (full text, mbox):

From: Михаил Бахтерев <mob <at> k.imm.uran.ru>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 32786 <at> debbugs.gnu.org
Subject: Re: bug#32786: (system* ...) call somehow interferes with atomic-box
 on armv7l
Date: Sat, 22 Sep 2018 09:50:09 +0500
By the way, if i change (system* "sleep" "1s") to the (system* "true"),
nothing changes. The log of execution is the same:

$ guile test.scm
M: (sleep 3)
M: protocol exchange
M: new sleep thread
M: (sleep 3)
R: Going to (sleep 1)
R: sleep-loop finished
M: protocol exchange
M: (sleep 3)
M: protocol exchange
M: (sleep 3)

But 3 seconds are more than enough to finish the work thread and pull
the state down to the #:nothing-to-do, which should cause the main
thread re-run work thread.

- MB. Respectfully

On Sat, Sep 22, 2018 at 09:39:38AM +0500, Михаил Бахтерев wrote:
> On Fri, Sep 21, 2018 at 05:26:27PM -0400, Mark H Weaver wrote:
> > Hi,
> > 
> > Can you describe in more detail the behavior you are seeing?  What
> > messages do you see on stderr from your 'dump' calls when it gets stuck?
> > What state is the atomic box left in?
> > 
> > I see a couple of problems with the code below:
> > 
> > * Two threads write concurrently to the same port, namely
> >   (current-error-port).  You should use thread synchronization to ensure
> >   that operations to a given port are serialized.
> > 
> > * There's a race condition when the atomic box is in the #:accepted
> >   state.  If the main thread finds the box in that state, it changes the
> >   state to #:need-to-sleep, which will apparently cause the other thread
> >   to sleep again.
> > 
> > I'm not sure if these problems could explain what you're seeing.
> > 
> >      Regards,
> >        Mark
> > 
> 
> The code is supposed to model the following behaviour.
> 
> The main thread tracks some events (in the example: the end of (sleep
> 3)). If one or multiple events occurred some long-lasting external
> command should be executed. Long-lasting means that multiple events may
> occur during execution.
> 
> To perform that the main thread should:
> 
>   1. either to run new work thread (sleep-loop procedure), which will
>   call (system* ...);
> 
>   2. or to communicate the fact of events occurrences to the work
>   thread, which is supposed to detect this and just repeat the command.
> 
> If multiple events have occurred during command execution the command
> should be re-executed only once.
> 
> The event occurrence is communicated through simple protocol with three
> states: #:nothing-to-do < #:accepted < #:need-to-sleep. The main thread
> pulls state up, and work thread pulls it down. When the main thread
> detects event, and the state is #:accepted, it means, that work thread
> accepted the previous job, and executing it now, so the main thread
> pulls the state up to need-to-sleep, and the work thread re-executes
> command (sleeps again in the test case), when the previous external
> command run has finished. This is intended behaviour.
> 
> I expect the program to print some garbage, showing that work thread is
> restarted periodically. Something like that (it is result of using
> (sleep 1) instead of (system* "sleep" "1s")):
> 
> $ guile test.scm
> M: (sleep 3)
> M: protocol exchange
> M: new sleep thread
> M: (sleep 3)
> 
> R: Going to (sleep 1)
> R: sleep-loop finished
> M: protocol exchange
> M: new sleep thread
> M: (sleep 3)
> : (sleep 3)R: Going to (sleep 1)
> 
> R: sleep-loop finished
> M: protocol exchange
> M: new sleep thread
> M: (sleep 3)
> 
> R: Going to (sleep 1)
> R: sleep-loop finished
> M: protocol exchange
> M: new sleep thread
> M: (sleep 3)
> 
> : (sleep 3)R: Going to (sleep 1)
> R: sleep-loop finished
> 
> But what i get with (system* "sleep" "1s") is this:
> 
> $ guile test.scm
> M: (sleep 3)
> M: protocol exchange
> M: new sleep thread
> M: (sleep 3)
> R: Going to (sleep 1)
> R: sleep-loop finished
> M: protocol exchange
> M: (sleep 3)
> M: protocol exchange
> M: (sleep 3)
> M: protocol exchange
> M: (sleep 3)
> 
> So, the state of the atomic-box is stuck in #:need-to-sleep somehow.
> 
> - MB. Respectfully




Information forwarded to bug-guile <at> gnu.org:
bug#32786; Package guile. (Thu, 27 Sep 2018 05:50:02 GMT) Full text and rfc822 format available.

Message #17 received at 32786 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Михаил Бахтерев <mob <at> k.imm.uran.ru>
Cc: 32786 <at> debbugs.gnu.org
Subject: Re: bug#32786: (system* ...) call somehow interferes with atomic-box
 on armv7l
Date: Thu, 27 Sep 2018 01:49:23 -0400
[Message part 1 (text/plain, inline)]
Hi,

Thanks for the additional details.  I was able to reproduce the bug, and
I believe I now see the problem.

'atomic-box-compare-and-swap!' is implemented using
'atomic_compare_exchange_weak' (if available), but neglects to handle
the case where 'atomic_compare_exchange_weak' may spuriously fail.  In
that case, the box is left unchanged, although the observed value was
equal to the expected value.

What's happening here is that the 'atomic-box-compare-and-swap!' in
'sleep-loop' fails spuriously, leaving the box in state #:accepted
although it returns #:accepted as its result.  When the main loop
discovers this, it changes the state to #:need-to-sleep, although the
thread has already ended.

To confirm this hypothesis, I added a print statement to the main loop
showing the state of the box that it observed during the protocol
exchange, and indeed it sees #:accepted the first time it checks, and
#:need-to-sleep in all later iterations.

I've attached a proposed patch that I hope will fix this problem.  If
you'd be willing to test it, I'd be grateful, but otherwise I'll try to
test it in the next week or so.  My access to armv7l boxes is somewhat
limited.

Thanks for this report.

      Mark


[0001-UNTESTED-Fix-atomic-box-compare-and-swap.patch (text/x-patch, inline)]
From 958d37686a9ac65f48cb2ca32a341cf182c24b5a Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw <at> netris.org>
Date: Thu, 27 Sep 2018 01:00:11 -0400
Subject: [PATCH] UNTESTED: Fix 'atomic-box-compare-and-swap!'.

Fixes <https://bugs.gnu.org/32786>.

'scm_atomic_compare_and_swap_scm' uses 'atomic_compare_exchange_weak' if
it's available on the host platform.  'atomic_compare_exchange_weak' may
fail spuriously, leaving the atomic object unchanged even when it
contained the expected value.  'atomic-box-compare-and-swap!' uses
'scm_atomic_compare_and_swap_scm' without checking its return value, and
therefore may return the expected value while leaving the box unchanged.
This contradicts the documentation, which explicitly states that "you
can know if the swap worked by checking if the return value is 'eq?' to
EXPECTED.".

* libguile/atomic.c (scm_atomic_box_compare_and_swap_x): If
'scm_atomic_compare_and_swap_scm' returns false and the observed value
is equal to 'expected', then try again.
* libguile/vm-engine.c (atomic-box-compare-and-swap!): Ditto.
---
 libguile/atomic.c    | 13 +++++++++----
 libguile/vm-engine.c | 13 ++++++++-----
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/libguile/atomic.c b/libguile/atomic.c
index 950874030..504643912 100644
--- a/libguile/atomic.c
+++ b/libguile/atomic.c
@@ -1,4 +1,4 @@
-/* Copyright (C) 2016 Free Software Foundation, Inc.
+/* Copyright (C) 2016, 2018 Free Software Foundation, Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
@@ -95,10 +95,15 @@ SCM_DEFINE (scm_atomic_box_compare_and_swap_x,
             "if the return value is @code{eq?} to @var{expected}.")
 #define FUNC_NAME s_scm_atomic_box_compare_and_swap_x
 {
+  SCM result = expected;
+
   SCM_VALIDATE_ATOMIC_BOX (1, box);
-  scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (box),
-                                   &expected, desired);
-  return expected;
+  while (!scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (box),
+                                           &result, desired)
+         && scm_is_eq (result, expected))
+    {
+    }
+  return result;
 }
 #undef FUNC_NAME
 
diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
index 19ff3e498..9650e33eb 100644
--- a/libguile/vm-engine.c
+++ b/libguile/vm-engine.c
@@ -3868,16 +3868,19 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
     {
       scm_t_uint16 dst, box;
       scm_t_uint32 expected, desired;
-      SCM scm_box, scm_expected;
+      SCM scm_box, scm_expected, scm_result;
       UNPACK_12_12 (op, dst, box);
       UNPACK_24 (ip[1], expected);
       UNPACK_24 (ip[2], desired);
       scm_box = SP_REF (box);
       VM_VALIDATE_ATOMIC_BOX (scm_box, "atomic-box-compare-and-swap!");
-      scm_expected = SP_REF (expected);
-      scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (scm_box),
-                                       &scm_expected, SP_REF (desired));
-      SP_SET (dst, scm_expected);
+      scm_result = scm_expected = SP_REF (expected);
+      while (!scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (scm_box),
+                                               &scm_result, SP_REF (desired))
+             && scm_is_eq (scm_result, scm_expected))
+        {
+        }
+      SP_SET (dst, scm_result);
       NEXT (3);
     }
 
-- 
2.19.0


Information forwarded to bug-guile <at> gnu.org:
bug#32786; Package guile. (Thu, 27 Sep 2018 06:19:02 GMT) Full text and rfc822 format available.

Message #20 received at 32786 <at> debbugs.gnu.org (full text, mbox):

From: Михаил Бахтерев <mob <at> k.imm.uran.ru>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 32786 <at> debbugs.gnu.org
Subject: Re: bug#32786: (system* ...) call somehow interferes with atomic-box
 on armv7l
Date: Thu, 27 Sep 2018 11:18:16 +0500
Thanks for clarification! I'll be able to test the patch in a couple of
days.

- MB. Respectfully

On Thu, Sep 27, 2018 at 01:49:23AM -0400, Mark H Weaver wrote:
> Hi,
> 
> Thanks for the additional details.  I was able to reproduce the bug, and
> I believe I now see the problem.
> 
> 'atomic-box-compare-and-swap!' is implemented using
> 'atomic_compare_exchange_weak' (if available), but neglects to handle
> the case where 'atomic_compare_exchange_weak' may spuriously fail.  In
> that case, the box is left unchanged, although the observed value was
> equal to the expected value.
> 
> What's happening here is that the 'atomic-box-compare-and-swap!' in
> 'sleep-loop' fails spuriously, leaving the box in state #:accepted
> although it returns #:accepted as its result.  When the main loop
> discovers this, it changes the state to #:need-to-sleep, although the
> thread has already ended.
> 
> To confirm this hypothesis, I added a print statement to the main loop
> showing the state of the box that it observed during the protocol
> exchange, and indeed it sees #:accepted the first time it checks, and
> #:need-to-sleep in all later iterations.
> 
> I've attached a proposed patch that I hope will fix this problem.  If
> you'd be willing to test it, I'd be grateful, but otherwise I'll try to
> test it in the next week or so.  My access to armv7l boxes is somewhat
> limited.
> 
> Thanks for this report.
> 
>       Mark
> 
> 





Information forwarded to bug-guile <at> gnu.org:
bug#32786; Package guile. (Fri, 05 Oct 2018 19:45:02 GMT) Full text and rfc822 format available.

Message #23 received at 32786 <at> debbugs.gnu.org (full text, mbox):

From: Михаил Бахтерев <mob <at> k.imm.uran.ru>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 32786 <at> debbugs.gnu.org
Subject: Re: bug#32786: (system* ...) call somehow interferes with atomic-box
 on armv7l
Date: Sat, 6 Oct 2018 00:43:57 +0500
Hi. I am sorry for delay.

I've tested the patch (i've applied it to the v2.2.4 source code). The
test sample works now as expected with both test «payloads»: (sleep
N-seconds) and (system* "any" "command" "here") calls.

Thanks for fixing it.

- MB. Respectfully

On Thu, Sep 27, 2018 at 01:49:23AM -0400, Mark H Weaver wrote:
> Hi,
> 
> Thanks for the additional details.  I was able to reproduce the bug, and
> I believe I now see the problem.
> 
> 'atomic-box-compare-and-swap!' is implemented using
> 'atomic_compare_exchange_weak' (if available), but neglects to handle
> the case where 'atomic_compare_exchange_weak' may spuriously fail.  In
> that case, the box is left unchanged, although the observed value was
> equal to the expected value.
> 
> What's happening here is that the 'atomic-box-compare-and-swap!' in
> 'sleep-loop' fails spuriously, leaving the box in state #:accepted
> although it returns #:accepted as its result.  When the main loop
> discovers this, it changes the state to #:need-to-sleep, although the
> thread has already ended.
> 
> To confirm this hypothesis, I added a print statement to the main loop
> showing the state of the box that it observed during the protocol
> exchange, and indeed it sees #:accepted the first time it checks, and
> #:need-to-sleep in all later iterations.
> 
> I've attached a proposed patch that I hope will fix this problem.  If
> you'd be willing to test it, I'd be grateful, but otherwise I'll try to
> test it in the next week or so.  My access to armv7l boxes is somewhat
> limited.
> 
> Thanks for this report.
> 
>       Mark
> 
> 

> From 958d37686a9ac65f48cb2ca32a341cf182c24b5a Mon Sep 17 00:00:00 2001
> From: Mark H Weaver <mhw <at> netris.org>
> Date: Thu, 27 Sep 2018 01:00:11 -0400
> Subject: [PATCH] UNTESTED: Fix 'atomic-box-compare-and-swap!'.
> 
> Fixes <https://bugs.gnu.org/32786>.
> 
> 'scm_atomic_compare_and_swap_scm' uses 'atomic_compare_exchange_weak' if
> it's available on the host platform.  'atomic_compare_exchange_weak' may
> fail spuriously, leaving the atomic object unchanged even when it
> contained the expected value.  'atomic-box-compare-and-swap!' uses
> 'scm_atomic_compare_and_swap_scm' without checking its return value, and
> therefore may return the expected value while leaving the box unchanged.
> This contradicts the documentation, which explicitly states that "you
> can know if the swap worked by checking if the return value is 'eq?' to
> EXPECTED.".
> 
> * libguile/atomic.c (scm_atomic_box_compare_and_swap_x): If
> 'scm_atomic_compare_and_swap_scm' returns false and the observed value
> is equal to 'expected', then try again.
> * libguile/vm-engine.c (atomic-box-compare-and-swap!): Ditto.
> ---
>  libguile/atomic.c    | 13 +++++++++----
>  libguile/vm-engine.c | 13 ++++++++-----
>  2 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/libguile/atomic.c b/libguile/atomic.c
> index 950874030..504643912 100644
> --- a/libguile/atomic.c
> +++ b/libguile/atomic.c
> @@ -1,4 +1,4 @@
> -/* Copyright (C) 2016 Free Software Foundation, Inc.
> +/* Copyright (C) 2016, 2018 Free Software Foundation, Inc.
>   *
>   * This library is free software; you can redistribute it and/or
>   * modify it under the terms of the GNU Lesser General Public License
> @@ -95,10 +95,15 @@ SCM_DEFINE (scm_atomic_box_compare_and_swap_x,
>              "if the return value is @code{eq?} to @var{expected}.")
>  #define FUNC_NAME s_scm_atomic_box_compare_and_swap_x
>  {
> +  SCM result = expected;
> +
>    SCM_VALIDATE_ATOMIC_BOX (1, box);
> -  scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (box),
> -                                   &expected, desired);
> -  return expected;
> +  while (!scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (box),
> +                                           &result, desired)
> +         && scm_is_eq (result, expected))
> +    {
> +    }
> +  return result;
>  }
>  #undef FUNC_NAME
>  
> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
> index 19ff3e498..9650e33eb 100644
> --- a/libguile/vm-engine.c
> +++ b/libguile/vm-engine.c
> @@ -3868,16 +3868,19 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
>      {
>        scm_t_uint16 dst, box;
>        scm_t_uint32 expected, desired;
> -      SCM scm_box, scm_expected;
> +      SCM scm_box, scm_expected, scm_result;
>        UNPACK_12_12 (op, dst, box);
>        UNPACK_24 (ip[1], expected);
>        UNPACK_24 (ip[2], desired);
>        scm_box = SP_REF (box);
>        VM_VALIDATE_ATOMIC_BOX (scm_box, "atomic-box-compare-and-swap!");
> -      scm_expected = SP_REF (expected);
> -      scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (scm_box),
> -                                       &scm_expected, SP_REF (desired));
> -      SP_SET (dst, scm_expected);
> +      scm_result = scm_expected = SP_REF (expected);
> +      while (!scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (scm_box),
> +                                               &scm_result, SP_REF (desired))
> +             && scm_is_eq (scm_result, scm_expected))
> +        {
> +        }
> +      SP_SET (dst, scm_result);
>        NEXT (3);
>      }
>  
> -- 
> 2.19.0
> 





Reply sent to Mark H Weaver <mhw <at> netris.org>:
You have taken responsibility. (Fri, 05 Oct 2018 23:23:01 GMT) Full text and rfc822 format available.

Notification sent to Михаил Бахтерев <mob <at> k.imm.uran.ru>:
bug acknowledged by developer. (Fri, 05 Oct 2018 23:23:01 GMT) Full text and rfc822 format available.

Message #28 received at 32786-done <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Михаил Бахтерев <mob <at> k.imm.uran.ru>
Cc: 32786-done <at> debbugs.gnu.org
Subject: Re: bug#32786: (system* ...) call somehow interferes with atomic-box
 on armv7l
Date: Fri, 05 Oct 2018 19:22:13 -0400
Hi,

Михаил Бахтерев <mob <at> k.imm.uran.ru> writes:

> I've tested the patch (i've applied it to the v2.2.4 source code). The
> test sample works now as expected with both test «payloads»: (sleep
> N-seconds) and (system* "any" "command" "here") calls.

Thanks for letting us know the results of your testing, this is very
helpful.  I just pushed a similar fix (same code, improved comments) to
the stable-2.2 branch, commit 2d09e0513fc11e2305077ba3653f6e4c2f266ddb,
which will be in the upcoming guile-2.2.5 release.

    Thanks,
      Mark


>
> Thanks for fixing it.
>
> - MB. Respectfully
>
> On Thu, Sep 27, 2018 at 01:49:23AM -0400, Mark H Weaver wrote:
>> Hi,
>> 
>> Thanks for the additional details.  I was able to reproduce the bug, and
>> I believe I now see the problem.
>> 
>> 'atomic-box-compare-and-swap!' is implemented using
>> 'atomic_compare_exchange_weak' (if available), but neglects to handle
>> the case where 'atomic_compare_exchange_weak' may spuriously fail.  In
>> that case, the box is left unchanged, although the observed value was
>> equal to the expected value.
>> 
>> What's happening here is that the 'atomic-box-compare-and-swap!' in
>> 'sleep-loop' fails spuriously, leaving the box in state #:accepted
>> although it returns #:accepted as its result.  When the main loop
>> discovers this, it changes the state to #:need-to-sleep, although the
>> thread has already ended.
>> 
>> To confirm this hypothesis, I added a print statement to the main loop
>> showing the state of the box that it observed during the protocol
>> exchange, and indeed it sees #:accepted the first time it checks, and
>> #:need-to-sleep in all later iterations.
>> 
>> I've attached a proposed patch that I hope will fix this problem.  If
>> you'd be willing to test it, I'd be grateful, but otherwise I'll try to
>> test it in the next week or so.  My access to armv7l boxes is somewhat
>> limited.
>> 
>> Thanks for this report.
>> 
>>       Mark
>> 
>> 
>
>> From 958d37686a9ac65f48cb2ca32a341cf182c24b5a Mon Sep 17 00:00:00 2001
>> From: Mark H Weaver <mhw <at> netris.org>
>> Date: Thu, 27 Sep 2018 01:00:11 -0400
>> Subject: [PATCH] UNTESTED: Fix 'atomic-box-compare-and-swap!'.
>> 
>> Fixes <https://bugs.gnu.org/32786>.
>> 
>> 'scm_atomic_compare_and_swap_scm' uses 'atomic_compare_exchange_weak' if
>> it's available on the host platform.  'atomic_compare_exchange_weak' may
>> fail spuriously, leaving the atomic object unchanged even when it
>> contained the expected value.  'atomic-box-compare-and-swap!' uses
>> 'scm_atomic_compare_and_swap_scm' without checking its return value, and
>> therefore may return the expected value while leaving the box unchanged.
>> This contradicts the documentation, which explicitly states that "you
>> can know if the swap worked by checking if the return value is 'eq?' to
>> EXPECTED.".
>> 
>> * libguile/atomic.c (scm_atomic_box_compare_and_swap_x): If
>> 'scm_atomic_compare_and_swap_scm' returns false and the observed value
>> is equal to 'expected', then try again.
>> * libguile/vm-engine.c (atomic-box-compare-and-swap!): Ditto.
>> ---
>>  libguile/atomic.c    | 13 +++++++++----
>>  libguile/vm-engine.c | 13 ++++++++-----
>>  2 files changed, 17 insertions(+), 9 deletions(-)
>> 
>> diff --git a/libguile/atomic.c b/libguile/atomic.c
>> index 950874030..504643912 100644
>> --- a/libguile/atomic.c
>> +++ b/libguile/atomic.c
>> @@ -1,4 +1,4 @@
>> -/* Copyright (C) 2016 Free Software Foundation, Inc.
>> +/* Copyright (C) 2016, 2018 Free Software Foundation, Inc.
>>   *
>>   * This library is free software; you can redistribute it and/or
>>   * modify it under the terms of the GNU Lesser General Public License
>> @@ -95,10 +95,15 @@ SCM_DEFINE (scm_atomic_box_compare_and_swap_x,
>>              "if the return value is @code{eq?} to @var{expected}.")
>>  #define FUNC_NAME s_scm_atomic_box_compare_and_swap_x
>>  {
>> +  SCM result = expected;
>> +
>>    SCM_VALIDATE_ATOMIC_BOX (1, box);
>> -  scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (box),
>> -                                   &expected, desired);
>> -  return expected;
>> +  while (!scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (box),
>> +                                           &result, desired)
>> +         && scm_is_eq (result, expected))
>> +    {
>> +    }
>> +  return result;
>>  }
>>  #undef FUNC_NAME
>>  
>> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
>> index 19ff3e498..9650e33eb 100644
>> --- a/libguile/vm-engine.c
>> +++ b/libguile/vm-engine.c
>> @@ -3868,16 +3868,19 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,
>>      {
>>        scm_t_uint16 dst, box;
>>        scm_t_uint32 expected, desired;
>> -      SCM scm_box, scm_expected;
>> +      SCM scm_box, scm_expected, scm_result;
>>        UNPACK_12_12 (op, dst, box);
>>        UNPACK_24 (ip[1], expected);
>>        UNPACK_24 (ip[2], desired);
>>        scm_box = SP_REF (box);
>>        VM_VALIDATE_ATOMIC_BOX (scm_box, "atomic-box-compare-and-swap!");
>> -      scm_expected = SP_REF (expected);
>> -      scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (scm_box),
>> -                                       &scm_expected, SP_REF (desired));
>> -      SP_SET (dst, scm_expected);
>> +      scm_result = scm_expected = SP_REF (expected);
>> +      while (!scm_atomic_compare_and_swap_scm (scm_atomic_box_loc (scm_box),
>> +                                               &scm_result, SP_REF (desired))
>> +             && scm_is_eq (scm_result, scm_expected))
>> +        {
>> +        }
>> +      SP_SET (dst, scm_result);
>>        NEXT (3);
>>      }
>>  
>> -- 
>> 2.19.0
>> 




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 03 Nov 2018 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 169 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.