GNU bug report logs - #22152
fat_mutex owner corruption (fmoc) inside fat_mutex_unlock (guile-v2.0.11)

Previous Next

Package: guile;

Reported by: Iwan Aucamp <aucampia <at> gmail.com>

Date: Sat, 12 Dec 2015 19:04:03 UTC

Severity: normal

Done: Mark H Weaver <mhw <at> netris.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 22152 in the body.
You can then email your comments to 22152 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guile <at> gnu.org:
bug#22152; Package guile. (Sat, 12 Dec 2015 19:04:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Iwan Aucamp <aucampia <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sat, 12 Dec 2015 19:04:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Iwan Aucamp <aucampia <at> gmail.com>
To: bug-guile <at> gnu.org
Cc: Iwan Armand Aucamp <Iwan.Aucamp <at> concurrent.co.za>,
 Donovan Hutcheon <donovan.hutcheon <at> concurrent.co.za>,
 Martin Cooper <Martin.Cooper <at> concurrent.co.za>,
 Waqar Ali <waqar.ali <at> concurrent.co.za>
Subject: fat_mutex owner corruption (fmoc) inside fat_mutex_unlock
 (guile-v2.0.11)
Date: Sat, 12 Dec 2015 19:28:14 +0200
[Message part 1 (text/plain, inline)]
Hi

We sporadically get "mutex not locked" and "mutex not locked by current thread"
exceptions on Solaris 10u10 with guile-2.0.11.

This problem can be reproduced with following scheme scripts:



  guile-fmoc-test-mnl.scm (for "mutex not locked")
      Two threads, one of them (reader-000) waits on a condition variable that
      nothing will trigger and the other thread (writer-000) locks and unlocks
      the mutex used with the condition variable. This code causes "mutex not
      locked" exception with some consistency.

      Output for this is in guile-fmoc-test-mnl-problem_output.txt (referenced
      as mnl-problem_output.txt)


      * owner id for reader-000 is 12593872

      * owner id for writer-000 is 12595040



  guile-fmoc-test-mnlbct.scm (for "mutex not locked by current thread")
      Same as guile-fmoc-test-mnl.scm except here writer-000 signals condition
      variable. This code causes "mutex not locked" and "mutex not locked by
      current thread" errors.

      Output for this, showing "mutex not locked by current thread", is in
      guile-fmoc-test-mnlbct-problem_output.txt (referenced as mnlbct-
      problem_output.txt)


      * owner id for reader-000 is 14535648

      * owner id for writer-000 is 14536400



To track down this issue we have added some debug printfs (see guile-2.0.11-
with-debug.patch). Given that this changes the line numbers I have referenced
original line numbers as o:file:line and line numbers with patch as d[:file]:
line. Also, these printfs have resulted in some irrelevant output (for internal
and verbose logging mutexes) which has been filtered out.

There is various scenarios leading to these errors that we have found but all
caused by same problem. The a detailed analysis for "mutex not locked by
current thread" scenario that can be seen in mnlbct-problem_output.txt is
included below and detailed analysis for other scenarios will be shared if
required.

Scenario from mnlbct-problem_output.txt:


  1. [writer-000:14536400] unlocks fat_mutex[14512880] and queues reader-000:
     14535648

     at o:threads.c:1664 - d:1681

     before mnlbct-problem_output.txt:4079

  2. [reader-000:14535648] locks fat_mutex[14512880] with fat_mutex_lock

     at o:threads.c:1394 - d:1401

     before mnlbct-problem_output.txt:4080

  3. [reader-000:14535648] enters wait-condition-variable and changes fat_mutex
     [14512880].owner to writer-000:14536400

     at o:threads.c:1616 - d:1631

     before mnlbct-problem_output.txt:4083

  4. [reader-000:14535648] goes into block_self and starts waiting on ptheead
     condition variable inside fat_mutex_unlock &gt; block_self. This unlocks
     fat_mutex[14512880].mutex, allowing some other thread to lock fat_mutex
     [14512880]

     at o:threads.c:452 - d:456

     before mnlbct-problem_output.txt:4083

  5. [writer-000:14536400] locks fat_mutex[14512880] with fat_mutex_lock

     at o:threads.c:1394 - d:1401

     before mnlbct-problem_output.txt:4082

  6. [reader-000:14535648] spurious wake-up occurs for condition variable which
     causes block_self to return EINTR to fat_mutex_unlock

     at o:threads.c:1621 - d:1636

     before mnlbct-problem_output.txt:4084

  7. [reader-000:14535648] loops and sets fat_mutex[14512880].owner=4 (i.e. not
     locked) while writer-000:14536400 should still be owner of fat_mutex
     [14512880]. Since it was spurious wake-up reader-000:14535648 continues to
     wait for condition to be notified again.

     at o:threads.c:1616 - d:1631

     before mnlbct-problem_output.txt:4086

  8. [writer-000:14536400] completes signal-condition-variable

     before mnlbct-problem_output.txt:4088

  9. [reader-000:14535648] now gets actual notification and block_self returns
     0. This causes fat_mutex:14512880 to be locked again - which works cos
     fat_mutex:14512880.owner is 4. This changes fat_mutex:14512880.owner from
     4 to 14535648.

     at o:threads.c:1643 - d:1660

     before mnlbct-problem_output.txt:4086

 10. [writer-000:14536400] tries to unlock the mutex, this fails though as
     reader-000:14535648 now owns the mutex - resulting in "mutex not locked by
     current thread" exception.

     at o:threads.c:1599 - d:1614

     before mnlbct-problem_output.txt:4089


Briefly, for mnl-problem_output.txt:


  1. reader-000 locks fat_mutex and unlocks it again as it starts waiting for
     condition to be notified.

  2. writer-000 locks mutex

  3. Spurious wake-up occurs for reader-000 which causes reader to change
     fat_mutex.owner from writer-000 id to 4 and then resumes waiting on
     condition variable

  4. Writer tries to unlock fat_mutex but now owner is 4 and this results in
     "mutex not locked" exception


The cause of these problems seems to be related to fat_mutex_unlock changing
fat_mutex.owner inside the while loop that is intended for checking condition
variable predicate which is problematic if spurious wake-ups from
pthread_cond_wait occur. Spurious wake-ups from pthread_cond_wait seems less
common on Linux, which is why we have only been observing the issue on Solaris.
It does however look like this problem will occur on any platform when a
spurious wake-up does occur.

As far as we can tell there is no reason for the fat_mutex.owner assignment to
happen inside the loop. It seems more appropriate that this happens only once
before the loop and not again. To this extent we moved owner reassignment out
of the loop and this seems to have resolved our issues. The patch for this is
in guile-2.0.11-with-fmoc_fix.patch.

We have ran the test suite with this on Linux and everything passes. There is
however other issues with test suite on Solaris that prevents it from
completing (both with and without the patch) which needs further investigation.


* All files related to this can be found at
https://gitlab.com/concurrent-systems/osp-issues-1512/tree/master/guile-fmoc

* Source with guile-2.0.11-with-debug.patch can be found at
https://gitlab.com/concurrent-systems/guile/tree/v2.0.11-with-debug

* Source with guile-2.0.11-with-fmoc_fix.patch can be found at
https://gitlab.com/concurrent-systems/guile/tree/v2.0.11-with-fmoc_fix


Regards
-- 
Iwan Aucamp
[guile-2.0.11-with-debug.patch (text/x-patch, attachment)]
[guile-2.0.11-with-fmoc_fix.patch (text/x-patch, attachment)]
[guile-fmoc-test-mnl.scm (text/x-scheme, attachment)]
[guile-fmoc-test-mnlbct.scm (text/x-scheme, attachment)]
[guile-fmoc-test-mnlbct-problem_output.txt.bz2 (application/x-bzip2, attachment)]
[guile-fmoc-test-mnl-problem_output.txt.bz2 (application/x-bzip2, attachment)]

Information forwarded to bug-guile <at> gnu.org:
bug#22152; Package guile. (Thu, 07 Jan 2016 00:15:02 GMT) Full text and rfc822 format available.

Message #8 received at 22152 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Iwan Aucamp <aucampia <at> gmail.com>
Cc: Iwan Armand Aucamp <Iwan.Aucamp <at> concurrent.co.za>, 22152 <at> debbugs.gnu.org,
 Donovan Hutcheon <donovan.hutcheon <at> concurrent.co.za>,
 Martin Cooper <Martin.Cooper <at> concurrent.co.za>,
 Waqar Ali <waqar.ali <at> concurrent.co.za>
Subject: Re: bug#22152: fat_mutex owner corruption (fmoc) inside
 fat_mutex_unlock (guile-v2.0.11)
Date: Wed, 06 Jan 2016 19:13:57 -0500
[Message part 1 (text/plain, inline)]
Iwan Aucamp <aucampia <at> gmail.com> writes:
> We sporadically get "mutex not locked" and "mutex not locked by current thread"
> exceptions on Solaris 10u10 with guile-2.0.11.

Thanks very much for your detailed analysis and proposed fix.

I've attached a patch that hopefully fixes this bug and also refactors
the code to hopefully be somewhat more clear.  Can you please test it on
Solaris and verify that it works for your use cases?

    Thanks,
      Mark

[PRELIM-REFACTOR-fat_mutex_unlock.patch (text/x-patch, inline)]
From 1d945f638033e7649ea61fa1c4050b30da891d6b Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw <at> netris.org>
Date: Fri, 18 Dec 2015 02:29:48 -0500
Subject: [PATCH] PRELIMINARY: Fix bug in fat_mutex_unlock.

---
 libguile/threads.c | 87 ++++++++++++++++++++++--------------------------------
 1 file changed, 36 insertions(+), 51 deletions(-)

diff --git a/libguile/threads.c b/libguile/threads.c
index 6ae6818..dcbd24d 100644
--- a/libguile/threads.c
+++ b/libguile/threads.c
@@ -1607,70 +1607,55 @@ fat_mutex_unlock (SCM mutex, SCM cond,
 	}
     }
 
+  if (m->level > 0)
+    m->level--;
+  if (m->level == 0)
+    {
+      /* Change the owner of MUTEX.  */
+      t->mutexes = scm_delq_x (mutex, t->mutexes);
+      m->owner = unblock_from_queue (m->waiting);
+    }
+
   if (! (SCM_UNBNDP (cond)))
     {
       c = SCM_CONDVAR_DATA (cond);
-      while (1)
-	{
-	  int brk = 0;
-
-	  if (m->level > 0)
-	    m->level--;
-	  if (m->level == 0)
-	    {
-	      /* Change the owner of MUTEX.  */
-	      t->mutexes = scm_delq_x (mutex, t->mutexes);
-	      m->owner = unblock_from_queue (m->waiting);
-	    }
-
-	  t->block_asyncs++;
+      t->block_asyncs++;
 
+      do
+	{
 	  err = block_self (c->waiting, cond, &m->lock, waittime);
 	  scm_i_pthread_mutex_unlock (&m->lock);
 
-	  if (err == 0)
-	    {
-	      ret = 1;
-	      brk = 1;
-	    }
-	  else if (err == ETIMEDOUT)
-	    {
-	      ret = 0;
-	      brk = 1;
-	    }
-	  else if (err != EINTR)
-	    {
-	      errno = err;
-	      scm_syserror (NULL);
-	    }
-
-	  if (brk)
-	    {
-	      if (relock)
-		scm_lock_mutex_timed (mutex, SCM_UNDEFINED, owner);
-	      t->block_asyncs--;
-	      break;
-	    }
-
-	  t->block_asyncs--;
-	  scm_async_click ();
+          if (err == EINTR)
+            {
+              t->block_asyncs--;
+              scm_async_click ();
 
-	  scm_remember_upto_here_2 (cond, mutex);
+              scm_remember_upto_here_2 (cond, mutex);
 
-	  scm_i_scm_pthread_mutex_lock (&m->lock);
+              scm_i_scm_pthread_mutex_lock (&m->lock);
+              t->block_asyncs++;
+            }
 	}
+      while (err == EINTR);
+
+      if (err == 0)
+        ret = 1;
+      else if (err == ETIMEDOUT)
+        ret = 0;
+      else
+        {
+          t->block_asyncs--;
+          errno = err;
+          scm_syserror (NULL);
+        }
+
+      if (relock)
+        scm_lock_mutex_timed (mutex, SCM_UNDEFINED, owner);
+      t->block_asyncs--;
     }
   else
     {
-      if (m->level > 0)
-	m->level--;
-      if (m->level == 0)
-	{
-	  /* Change the owner of MUTEX.  */
-	  t->mutexes = scm_delq_x (mutex, t->mutexes);
-	  m->owner = unblock_from_queue (m->waiting);
-	}
-
       scm_i_pthread_mutex_unlock (&m->lock);
       ret = 1;
     }
-- 
2.6.3


Reply sent to Mark H Weaver <mhw <at> netris.org>:
You have taken responsibility. (Mon, 20 Jun 2016 20:06:02 GMT) Full text and rfc822 format available.

Notification sent to Iwan Aucamp <aucampia <at> gmail.com>:
bug acknowledged by developer. (Mon, 20 Jun 2016 20:06:02 GMT) Full text and rfc822 format available.

Message #13 received at 22152-done <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Iwan Aucamp <aucampia <at> gmail.com>
Cc: Iwan Armand Aucamp <Iwan.Aucamp <at> concurrent.co.za>,
 Donovan Hutcheon <donovan.hutcheon <at> concurrent.co.za>,
 22152-done <at> debbugs.gnu.org, Waqar Ali <waqar.ali <at> concurrent.co.za>,
 Martin Cooper <Martin.Cooper <at> concurrent.co.za>
Subject: Re: bug#22152: fat_mutex owner corruption (fmoc) inside
 fat_mutex_unlock (guile-v2.0.11)
Date: Mon, 20 Jun 2016 16:05:01 -0400
Mark H Weaver <mhw <at> netris.org> writes:

> Iwan Aucamp <aucampia <at> gmail.com> writes:
>> We sporadically get "mutex not locked" and "mutex not locked by current thread"
>> exceptions on Solaris 10u10 with guile-2.0.11.
>
> Thanks very much for your detailed analysis and proposed fix.
>
> I've attached a patch that hopefully fixes this bug and also refactors
> the code to hopefully be somewhat more clear.  Can you please test it on
> Solaris and verify that it works for your use cases?

I went ahead and pushed commit 1e86dc32a42af549fc9e4721ad48cdd7d296c042
to stable-2.0, which will soon become guile-2.0.12.  I hope it fixes the
issue, although unfortunately my patch was never tested on Solaris.  I'm
going to close this bug, but feel free to reopen it if there are still
issues.

     Thanks,
       Mark




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 19 Jul 2016 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 7 years and 272 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.