GNU bug report logs - #19883
Smob's mark_smob has become unreliable in Guile 2.x

Previous Next

Package: guile;

Reported by: David Kastrup <dak <at> gnu.org>

Date: Mon, 16 Feb 2015 17:16:02 UTC

Severity: normal

Done: Andy Wingo <wingo <at> pobox.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 19883 in the body.
You can then email your comments to 19883 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Mon, 16 Feb 2015 17:16:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to David Kastrup <dak <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Mon, 16 Feb 2015 17:16:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: bug-guile <at> gnu.org
Subject: Smob's mark_smob has become unreliable in Guile 2.x
Date: Mon, 16 Feb 2015 17:41:36 +0100
[Message part 1 (text/plain, inline)]
This precludes porting LilyPond to Guile 2 since LilyPond relies
extensively on the mark_smob functionality.

The symptom likely occurs when an object is collected and consequently
its free_smob function gets called.  This free_smob function essentially
is responsible for freeing all resources claimed by the smob, rendering
the underlying data moot.

If mark_smob continues to get called, its interpretation of the now dead
data is not likely to lead to happy results.

Debugging is tricky since the gc phases tend to occur asynchronously.
If I start the accompanying program with
./test 2 20000 40
I tend to get a core dump perhaps every fourth call.  Running repeatedly
in a debugger does not produce core dumps (perhaps one needs to reload
the debugger for each try?), so one needs to do

ulimit -c unlimited
./test 2 20000 40

and, after the problem triggers

gdb ./test core

for post-mortem debugging.  A traceback looks like
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x0804aee0 in ?? ()
#2  0xb761b2da in ?? () from /usr/lib/libguile-2.0.so.22
#3  0xb72751f8 in GC_mark_from () from /usr/lib/i386-linux-gnu/libgc.so.1
#4  0xb72766ca in GC_mark_some () from /usr/lib/i386-linux-gnu/libgc.so.1
#5  0xb726cd16 in GC_stopped_mark () from /usr/lib/i386-linux-gnu/libgc.so.1
#6  0xb726d730 in GC_try_to_collect_inner ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#7  0xb726e12a in GC_collect_or_expand ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#8  0xb726e2b1 in GC_allocobj () from /usr/lib/i386-linux-gnu/libgc.so.1
#9  0xb72731a4 in GC_generic_malloc_inner ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#10 0xb72732be in GC_generic_malloc () from /usr/lib/i386-linux-gnu/libgc.so.1
#11 0xb761b81e in scm_i_new_smob () from /usr/lib/libguile-2.0.so.22
#12 0x0804985a in std::vector<void (*)(), std::allocator<void (*)()> >::_M_insert_aux(__gnu_cxx::__normal_iterator<void (**)(), std::vector<void (*)(), std::allocator<void (*)()> > >, void (* const&)()) ()
#13 0x0804a6fc in __gnu_cxx::new_allocator<void (*)()>::allocate(unsigned int, void const*) ()
#14 0x0804a03d in void std::_Destroy<void (**)(), void (*)()>(void (**)(), void (**)(), std::allocator<void (*)()>&) ()
#15 0x08049a73 in void std::_Destroy<Family**, Family*>(Family**, Family**, std::allocator<Family*>&) ()
#16 0x0804934d in scm_puts ()
#17 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#18 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#19 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#20 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#21 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#22 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#23 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#24 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#25 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#26 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#27 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#28 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#29 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#30 0x080495e7 in std::vector<Family*, std::allocator<Family*> >::operator[](unsigned int) ()
#31 0xb75b7dfd in ?? () from /usr/lib/libguile-2.0.so.22
#32 0xb76418e7 in ?? () from /usr/lib/libguile-2.0.so.22
#33 0xb761afb9 in ?? () from /usr/lib/libguile-2.0.so.22
 #34 0xb7659f20 in ?? () from /usr/lib/libguile-2.0.so.22
#35 0xb765a539 in ?? () from /usr/lib/libguile-2.0.so.22
#36 0xb75c24f3 in scm_call_4 () from /usr/lib/libguile-2.0.so.22
#37 0xb7641acf in scm_catch_with_pre_unwind_handler ()
   from /usr/lib/libguile-2.0.so.22
#38 0xb7641bd4 in scm_c_catch () from /usr/lib/libguile-2.0.so.22
#39 0xb75b85d1 in ?? () from /usr/lib/libguile-2.0.so.22
#40 0xb75b86d3 in scm_c_with_continuation_barrier ()
   from /usr/lib/libguile-2.0.so.22
#41 0xb763ef7e in ?? () from /usr/lib/libguile-2.0.so.22
#42 0xb72782c1 in GC_call_with_stack_base ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#43 0xb763f3e6 in scm_with_guile () from /usr/lib/libguile-2.0.so.22
#44 0x080496c5 in void __gnu_cxx::__alloc_traits<std::allocator<void (*)()> >::construct<void (*)()>(std::allocator<void (*)()>&, void (**)(), void (* const&)()) ()
#45 0xb72cfa83 in __libc_start_main (main=0x4, argc=134517216, argv=0x0, 
    init=0x8049201 <main+16>, 
    fini=0x804967e <std::_Vector_base<void (*)(), std::allocator<void (*)()> >::~_Vector_base()+40>, rtld_fini=0x4, stack_end=0xbfca7aa4) at libc-start.c:287
#46 0xb7750000 in ?? () from /lib/ld-linux.so.2
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

The frames listed as

#17 0x0804938f in Scm_init::Scm_init(void (*)()) ()
#18 ...

appear to be a fluke in address/symbol correlation and do not seem to
have an actual relation to the listed function when looking at the code.

I append all the requisite source files producing the test binary when
running "make".  I can send my binary (when desired) which was compiled
using 2.0.11 on a i586 platform.  However, I would expect the problem to
reproduce in some kind of manner on other systems.

The compiler was gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6), Target:
i686-linux-gnu.

The headers are somewhat cooked-down variants (to remove dependencies on
other parts of LilyPond) of the Smob organizing classes used in
LilyPond.  If one wants to compile for Guile v1 for comparison, one
probably needs

typedef void * scm_t_func;

or so somewhere.

[guile-bug.tgz (application/x-gtar-compressed, attachment)]
[Message part 3 (text/plain, inline)]
-- 
David Kastrup

Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Mon, 16 Feb 2015 18:14:02 GMT) Full text and rfc822 format available.

Message #8 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: 19883 <at> debbugs.gnu.org
Subject: Correction for backtrace
Date: Mon, 16 Feb 2015 19:13:12 +0100
This is embarrassing: I used the wrong executable in connection with the
core dump.  With the matching executable, the coredump makes a lot more
sense:

#0  0x00000000 in ?? ()
#1  0x0804aee0 in Smob_base<Family>::mark_trampoline (arg=0x9fbb000)
    at smobs.tcc:34
#2  0xb761b2da in ?? () from /usr/lib/libguile-2.0.so.22
#3  0xb72751f8 in GC_mark_from () from /usr/lib/i386-linux-gnu/libgc.so.1
#4  0xb72766ca in GC_mark_some () from /usr/lib/i386-linux-gnu/libgc.so.1
#5  0xb726cd16 in GC_stopped_mark () from /usr/lib/i386-linux-gnu/libgc.so.1
#6  0xb726d730 in GC_try_to_collect_inner ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#7  0xb726e12a in GC_collect_or_expand ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#8  0xb726e2b1 in GC_allocobj () from /usr/lib/i386-linux-gnu/libgc.so.1
#9  0xb72731a4 in GC_generic_malloc_inner ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#10 0xb72732be in GC_generic_malloc () from /usr/lib/i386-linux-gnu/libgc.so.1
#11 0xb761b81e in scm_i_new_smob () from /usr/lib/libguile-2.0.so.22
#12 0x0804985a in scm_new_smob (tc=7039, data=158987224)
    at /usr/include/guile/2.0/libguile/smob.h:91
#13 0x0804a6fc in Smob_base<Family>::register_ptr (p=0x979f3d8)
    at smobs.tcc:53
#14 0x0804a03d in Smob<Family>::unprotected_smobify_self (this=0x979f3dc)
    at smobs.hh:273
#15 0x08049a73 in Smob<Family>::smobify_self (this=0x979f3dc) at smobs.hh:286
#16 0x0804934d in Family::Family (this=0x979f3d8, totals=0, branch=2)
    at test.cc:30
#17 0x0804938f in Family::Family (this=0x98b7c98, totals=3, branch=2)
    at test.cc:35
#18 0x0804938f in Family::Family (this=0x979e598, totals=8, branch=2)
    at test.cc:35
#19 0x0804938f in Family::Family (this=0x98b65c8, totals=17, branch=2)
    at test.cc:35
#20 0x0804938f in Family::Family (this=0x9934d78, totals=37, branch=2)
    at test.cc:35
#21 0x0804938f in Family::Family (this=0x99377c8, totals=76, branch=2)
    at test.cc:35
#22 0x0804938f in Family::Family (this=0x96921e0, totals=154, branch=2)
    at test.cc:35
#23 0x0804938f in Family::Family (this=0xa0c3998, totals=311, branch=2)
    at test.cc:35
#24 0x0804938f in Family::Family (this=0x9d0cc88, totals=623, branch=2)
    at test.cc:35
#25 0x0804938f in Family::Family (this=0x9b8b0a8, totals=1248, branch=2)
    at test.cc:35
#26 0x0804938f in Family::Family (this=0x9e84808, totals=2498, branch=2)
    at test.cc:35
#27 0x0804938f in Family::Family (this=0x9bf3d48, totals=4998, branch=2)
    at test.cc:35
#28 0x0804938f in Family::Family (this=0x9d05de8, totals=9998, branch=2)
    at test.cc:35
#29 0x0804938f in Family::Family (this=0x9b8dcb8, totals=19999, branch=2)
    at test.cc:35
#30 0x080495e7 in workload (avv=0xbfca7aa4) at test.cc:71
#31 0xb75b7dfd in ?? () from /usr/lib/libguile-2.0.so.22
#32 0xb76418e7 in ?? () from /usr/lib/libguile-2.0.so.22
#33 0xb761afb9 in ?? () from /usr/lib/libguile-2.0.so.22
#34 0xb7659f20 in ?? () from /usr/lib/libguile-2.0.so.22
#35 0xb765a539 in ?? () from /usr/lib/libguile-2.0.so.22
#36 0xb75c24f3 in scm_call_4 () from /usr/lib/libguile-2.0.so.22
#37 0xb7641acf in scm_catch_with_pre_unwind_handler ()
   from /usr/lib/libguile-2.0.so.22
#38 0xb7641bd4 in scm_c_catch () from /usr/lib/libguile-2.0.so.22
#39 0xb75b85d1 in ?? () from /usr/lib/libguile-2.0.so.22
#40 0xb75b86d3 in scm_c_with_continuation_barrier ()
   from /usr/lib/libguile-2.0.so.22
#41 0xb763ef7e in ?? () from /usr/lib/libguile-2.0.so.22
#42 0xb72782c1 in GC_call_with_stack_base ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#43 0xb763f3e6 in scm_with_guile () from /usr/lib/libguile-2.0.so.22
#44 0x080496c5 in main (ac=4, av=0xbfca7aa4) at test.cc:85
(gdb) 

So the core indeed occurs when trying to call scm_gc_mark on a smob no
longer (or not yet?) associated with a valid structure in memory, in a
garbage collection apparently triggered during normal allocation of smob
data.

Sorry for the nonsensical core dump previously.

-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Wed, 25 Feb 2015 22:17:02 GMT) Full text and rfc822 format available.

Message #11 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: David Kastrup <dak <at> gnu.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Wed, 25 Feb 2015 23:16:15 +0100
David Kastrup <dak <at> gnu.org> skribis:

> This is embarrassing: I used the wrong executable in connection with the
> core dump.  With the matching executable, the coredump makes a lot more
> sense:
>
> #0  0x00000000 in ?? ()
> #1  0x0804aee0 in Smob_base<Family>::mark_trampoline (arg=0x9fbb000)
>     at smobs.tcc:34
> #2  0xb761b2da in ?? () from /usr/lib/libguile-2.0.so.22
> #3  0xb72751f8 in GC_mark_from () from /usr/lib/i386-linux-gnu/libgc.so.1

Could you try commenting out all the SMOB mark functions in LilyPond?

This doesn’t fix the bug, of course, but it’s probably a good
workaround: user-provided mark functions are not needed in Guile 2.0
since libgc scans the whole heap for live pointers.

Thanks,
Ludo’.




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Wed, 25 Feb 2015 23:18:02 GMT) Full text and rfc822 format available.

Message #14 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 26 Feb 2015 00:17:27 +0100
ludo <at> gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak <at> gnu.org> skribis:
>
>> This is embarrassing: I used the wrong executable in connection with the
>> core dump.  With the matching executable, the coredump makes a lot more
>> sense:
>>
>> #0  0x00000000 in ?? ()
>> #1  0x0804aee0 in Smob_base<Family>::mark_trampoline (arg=0x9fbb000)
>>     at smobs.tcc:34
>> #2  0xb761b2da in ?? () from /usr/lib/libguile-2.0.so.22
>> #3  0xb72751f8 in GC_mark_from () from /usr/lib/i386-linux-gnu/libgc.so.1
>
> Could you try commenting out all the SMOB mark functions in LilyPond?
>
> This doesn’t fix the bug, of course, but it’s probably a good
> workaround: user-provided mark functions are not needed in Guile 2.0
> since libgc scans the whole heap for live pointers.

Even the test program crashes at the end (when `count' is called in
order to traverse the created hierarchy) when you disable the setting of
the mark function in the init method in smobs.tcc.

A pointer to a C++ structure does not appear to protect the
corresponding SMOB data and free_smob calls the delete operator which
calls destructors and clobbers the memory area.

Program received signal SIGSEGV, Segmentation fault.
0x08049b0a in std::vector<Family*, std::allocator<Family*> >::size (
    this=0x1b8b) at /usr/include/c++/4.9/bits/stl_vector.h:655
655	      { return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }
(gdb) bt
#0  0x08049b0a in std::vector<Family*, std::allocator<Family*> >::size (
    this=0x1b8b) at /usr/include/c++/4.9/bits/stl_vector.h:655
#1  0x08049498 in Family::count (this=0x1b7f) at test.cc:53
#2  0x0804947c in Family::count (this=0x834f350) at test.cc:54
#3  0x0804947c in Family::count (this=0x8297d40) at test.cc:54
#4  0x0804947c in Family::count (this=0x828a9f8) at test.cc:54
#5  0x0804947c in Family::count (this=0x817d768) at test.cc:54
#6  0x0804947c in Family::count (this=0x828d588) at test.cc:54
#7  0x0804947c in Family::count (this=0x83298b8) at test.cc:54
#8  0x0804947c in Family::count (this=0x817fe58) at test.cc:54
#9  0x080495df in workload (avv=0xbffff074) at test.cc:73
#10 0xb7e66dfd in ?? () from /usr/lib/libguile-2.0.so.22
#11 0xb7ef08e7 in ?? () from /usr/lib/libguile-2.0.so.22
#12 0xb7ec9fb9 in ?? () from /usr/lib/libguile-2.0.so.22
#13 0xb7f08f20 in ?? () from /usr/lib/libguile-2.0.so.22
#14 0xb7f09539 in ?? () from /usr/lib/libguile-2.0.so.22
#15 0xb7e714f3 in scm_call_4 () from /usr/lib/libguile-2.0.so.22
#16 0xb7ef0acf in scm_catch_with_pre_unwind_handler ()
   from /usr/lib/libguile-2.0.so.22
#17 0xb7ef0bd4 in scm_c_catch () from /usr/lib/libguile-2.0.so.22
#18 0xb7e675d1 in ?? () from /usr/lib/libguile-2.0.so.22
#19 0xb7e676d3 in scm_c_with_continuation_barrier ()
   from /usr/lib/libguile-2.0.so.22
#20 0xb7eedf7e in ?? () from /usr/lib/libguile-2.0.so.22
#21 0xb7b272c1 in GC_call_with_stack_base ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#22 0xb7eee3e6 in scm_with_guile () from /usr/lib/libguile-2.0.so.22
#23 0x08049685 in main (ac=4, av=0xbffff074) at test.cc:85


-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 26 Feb 2015 11:36:01 GMT) Full text and rfc822 format available.

Message #17 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: David Kastrup <dak <at> gnu.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 26 Feb 2015 12:35:32 +0100
David Kastrup <dak <at> gnu.org> skribis:

> ludo <at> gnu.org (Ludovic Courtès) writes:
>
>> David Kastrup <dak <at> gnu.org> skribis:
>>
>>> This is embarrassing: I used the wrong executable in connection with the
>>> core dump.  With the matching executable, the coredump makes a lot more
>>> sense:
>>>
>>> #0  0x00000000 in ?? ()
>>> #1  0x0804aee0 in Smob_base<Family>::mark_trampoline (arg=0x9fbb000)
>>>     at smobs.tcc:34
>>> #2  0xb761b2da in ?? () from /usr/lib/libguile-2.0.so.22
>>> #3  0xb72751f8 in GC_mark_from () from /usr/lib/i386-linux-gnu/libgc.so.1
>>
>> Could you try commenting out all the SMOB mark functions in LilyPond?
>>
>> This doesn’t fix the bug, of course, but it’s probably a good
>> workaround: user-provided mark functions are not needed in Guile 2.0
>> since libgc scans the whole heap for live pointers.
>
> Even the test program crashes at the end (when `count' is called in
> order to traverse the created hierarchy) when you disable the setting of
> the mark function in the init method in smobs.tcc.

Could you add debugging symbols for libguile?  I don’t understand how
‘count’ gets called.

Do you know if this is a use-after-free error?  Perhaps setting
MALLOC_CHECK_=1 would give a hint.

If this is the case, Andy had the idea of turning on topological
finalization in the GC.  This may help for this particular case, but I
vaguely recall that this breaks other finalizer-related things.

(I would check by myself, but ISTR that building LilyPond “on one’s own”
is not recommended.  What would you suggest?  A Guix recipe would be
sweet.)

> A pointer to a C++ structure does not appear to protect the
> corresponding SMOB data and free_smob calls the delete operator which
> calls destructors and clobbers the memory area.

Oh, I was mistaken in my previous message.  GC scans the stack and the
GC-managed heap (stuff allocated with GC_MALLOC/scm_gc_malloc et al.),
but it does *not* scan the malloc/new heap.

So indeed, C++ objects that hold references to ‘SCM’ objects, such as
instances of ‘Smob<X>’, must either have a mark function, or they must
be allocated with scm_gc_malloc.

Would it be possible to add a ‘new’ operator to ‘Smob’ that uses
‘scm_gc_malloc’, and a ‘delete’ operator that uses ‘scm_gc_free’?

Thanks,
Ludo’.




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 26 Feb 2015 12:33:01 GMT) Full text and rfc822 format available.

Message #20 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 26 Feb 2015 13:32:00 +0100
ludo <at> gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak <at> gnu.org> skribis:
>
>> ludo <at> gnu.org (Ludovic Courtès) writes:
>>
>>> David Kastrup <dak <at> gnu.org> skribis:
>>>
>>>> This is embarrassing: I used the wrong executable in connection with the
>>>> core dump.  With the matching executable, the coredump makes a lot more
>>>> sense:
>>>>
>>>> #0  0x00000000 in ?? ()
>>>> #1  0x0804aee0 in Smob_base<Family>::mark_trampoline (arg=0x9fbb000)
>>>>     at smobs.tcc:34
>>>> #2  0xb761b2da in ?? () from /usr/lib/libguile-2.0.so.22
>>>> #3  0xb72751f8 in GC_mark_from () from /usr/lib/i386-linux-gnu/libgc.so.1
>>>
>>> Could you try commenting out all the SMOB mark functions in LilyPond?
>>>
>>> This doesn’t fix the bug, of course, but it’s probably a good
>>> workaround: user-provided mark functions are not needed in Guile 2.0
>>> since libgc scans the whole heap for live pointers.
>>
>> Even the test program crashes at the end (when `count' is called in
>> order to traverse the created hierarchy) when you disable the setting of
>> the mark function in the init method in smobs.tcc.
>
> Could you add debugging symbols for libguile?  I don’t understand how
> ‘count’ gets called.

Figure me surprised.  Here is the recursive walk:

int
Family::count ()
{
  int sum = 1;
  for (int i = 0; i < kids.size (); i++)
    sum += kids[i]->count ();
  return sum;
}

and here is the starting call in workload():

  cout << "last has " << Family::unsmob (k)->count () << endl;

> Do you know if this is a use-after-free error?

Sure.  Nothing else would clobber the kids[] array to contain bad
pointers.

> If this is the case, Andy had the idea of turning on topological
> finalization in the GC.  This may help for this particular case, but I
> vaguely recall that this breaks other finalizer-related things.

I don't see why.  Topological finalization might help with
mark-after-free.  But why would it help if there is not even any mark
call involved?  This is clearly use-after-free.

> (I would check by myself, but ISTR that building LilyPond “on one’s
> own” is not recommended.  What would you suggest?  A Guix recipe would
> be sweet.)

Is there a reason you are not using the test program provided with this
bug report?  There is no real point in experimenting with LilyPond's
complexity when a simple test program using its memory management
classes already crashes.

LilyPond's GUILEv2 branch is currently out of order again since 2.0.11
changed encoding mechanisms _again_ in an incompatible manner (what
GUILE calls "stable" is anything but).  It is becoming harder and harder
to work around GUILE's attempts of wresting encoding control from the
application, while GUILE has no byte-transparent decoding of UTF-8, does
not support strings encoded in UTF-8, and (as of 2.0.11 or 2.0.10)
supports _only_ string ports redecoded to UTF-8.

So dealing with memory-mapped UTF-8 encoded files which are multiplexed
between reading by GUILE and reading by an UTF-8 decoding parser has
again been thwarted.  While I try figuring out how to repair the damage
this time, testing with LilyPond itself is hard to interpret since a
number of problems are not related to the memory management.

As long as this simple test program can show the memory management
related crashes, I don't see the point in throwing people at LilyPond:
that has not delivered any results the last several times I tried it.

>> A pointer to a C++ structure does not appear to protect the
>> corresponding SMOB data and free_smob calls the delete operator which
>> calls destructors and clobbers the memory area.
>
> Oh, I was mistaken in my previous message.  GC scans the stack and the
> GC-managed heap (stuff allocated with GC_MALLOC/scm_gc_malloc et al.),
> but it does *not* scan the malloc/new heap.
>
> So indeed, C++ objects that hold references to ‘SCM’ objects, such as
> instances of ‘Smob<X>’, must either have a mark function, or they must
> be allocated with scm_gc_malloc.
>
> Would it be possible to add a ‘new’ operator to ‘Smob’ that uses
> ‘scm_gc_malloc’, and a ‘delete’ operator that uses ‘scm_gc_free’?

It would not help since many of the references are stored in STL
containers (like std::vector <Grob *>) which have their data
allocated/deallocated separately from the memory area of the structure
itself.

Frankly, I don't get the current strategy of GUILE: basically any use of
scm_set_smob_mark will result in a function that can be called with
garbage from a smob that has already been deallocated via the function
registered with scm_set_smob_free.

GUILEv2 developers have resisted fixing this bug for years by trying to
stop people from using scm_set_smob_mark and instead telling people to
have their entire heap scanned by a conservative garbage collector.

For an application like LilyPond which can easily have the heap cover
more than half of the available address space and run for half an hour
(when generating docs) processing independent files with large
individual memory requirements, this strategy will have both
considerable performance impacts as well as bleed enough randomly
retained memory to run the application into the ground eventually.

In my current work on fixing the encoding stuff again I have patched my
code to deal with the mark-after-free errors in the free and mark
trampolines myself.  I need to find a solution for the encoding mess
before I can actually indulge in more testing of this workaround.

However, due to the intransparency of GUILE's implementation and the
multithreaded collector, I have no guarantees that my work on the
respective trampolines will reliably prevent all mark-after-free errors.

This is something that needs to get fixed in GUILE.  It does not make
sense to provide a mark callback mechanism that can be called with
garbage in GUILE's free store.  When GUILE releases/collects memory, it
does not make sense to leave the SMOB cells in a state indistinguishable
from from valid data.  Apart from causing crashes in mark functions,
this makes work much harder for the conservative garbage collector.

-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 26 Feb 2015 14:04:01 GMT) Full text and rfc822 format available.

Message #23 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: David Kastrup <dak <at> gnu.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 26 Feb 2015 15:03:22 +0100
[Message part 1 (text/plain, inline)]
David Kastrup <dak <at> gnu.org> skribis:

> Is there a reason you are not using the test program provided with this
> bug report?

Sorry, I had overlooked it.  Please assume good faith.

With this patch:

[Message part 2 (text/x-patch, inline)]
diff --git a/smobs.tcc b/smobs.tcc
index c8b9dfe..b1d61b7 100644
--- a/smobs.tcc
+++ b/smobs.tcc
@@ -132,9 +132,6 @@ void Smob_base<Super>::init ()
   // types in order to be able to compare them.  The other comparisons
   // are for static member functions and thus are ordinary function
   // pointers which work without those contortions.
-  if (static_cast<SCM (Super::*)()>(&Super::mark_smob) !=
-      static_cast<SCM (Super::*)()>(&Smob_base<Super>::mark_smob))
-    scm_set_smob_mark (smob_tag_, Super::mark_trampoline);
   scm_set_smob_print (smob_tag_, Super::print_trampoline);
   if (&Super::equal_p != &Smob_base<Super>::equal_p)
     scm_set_smob_equalp (smob_tag_, Super::equal_p);
[Message part 3 (text/plain, inline)]
“./test 1000 1000 10000” works fine.

However, it’s not representative since the C++ objects in this test do
not hold ‘SCM’ references, AFAICS.

> LilyPond's GUILEv2 branch is currently out of order again since 2.0.11
> changed encoding mechanisms _again_

Please submit a different bug report.

>>> A pointer to a C++ structure does not appear to protect the
>>> corresponding SMOB data and free_smob calls the delete operator which
>>> calls destructors and clobbers the memory area.
>>
>> Oh, I was mistaken in my previous message.  GC scans the stack and the
>> GC-managed heap (stuff allocated with GC_MALLOC/scm_gc_malloc et al.),
>> but it does *not* scan the malloc/new heap.
>>
>> So indeed, C++ objects that hold references to ‘SCM’ objects, such as
>> instances of ‘Smob<X>’, must either have a mark function, or they must
>> be allocated with scm_gc_malloc.
>>
>> Would it be possible to add a ‘new’ operator to ‘Smob’ that uses
>> ‘scm_gc_malloc’, and a ‘delete’ operator that uses ‘scm_gc_free’?
>
> It would not help since many of the references are stored in STL
> containers (like std::vector <Grob *>) which have their data
> allocated/deallocated separately from the memory area of the structure
> itself.

Oh, OK.  Still, I don’t think this is a problem because each C++ object
has a corresponding SMOB, and said SMOB is GC-protected; thus the C++
object itself is also GC-protected until the SMOB is unprotected.

Here’s the patch I’ve ended up with:

[Message part 4 (text/x-patch, inline)]
diff --git a/smobs.hh b/smobs.hh
index 3701280..a41a645 100644
--- a/smobs.hh
+++ b/smobs.hh
@@ -263,6 +263,20 @@ private:
 protected:
   Smob () : self_scm_ (SCM_UNDEFINED), protection_cons_ (SCM_EOL) { };
 public:
+  static void *operator new (std::size_t size)
+  {
+    /* This C++ object is referred to by the corresponding SMOB, which is
+       itself GC-protected.  Thus, no need to protect the C++ object.  */
+    return scm_gc_malloc (size, "lily-smob");
+  }
+
+  static void operator delete (void *thing)
+  {
+    /* Nothing to do: the GC will reclaim memory for THING when it deems
+       appropriate.  */
+    // printf ("delete %p\n", thing);
+  }
+
   static size_t free_smob (SCM obj)
   {
     delete Smob_base<Super>::unregister_ptr (obj);
diff --git a/smobs.tcc b/smobs.tcc
index c8b9dfe..0eb0c30 100644
--- a/smobs.tcc
+++ b/smobs.tcc
@@ -31,7 +31,7 @@ template <class Super>
 SCM
 Smob_base<Super>::mark_trampoline (SCM arg)
 {
-  return (Super::unsmob (arg))->mark_smob ();
+  abort ();
 }
 
 template <class Super>
@@ -60,14 +60,14 @@ template <class Super>
 SCM
 Smob_base<Super>::mark_smob ()
 {
-  return SCM_UNSPECIFIED;
+  abort ();
 }
 
 template <class Super>
 size_t
 Smob_base<Super>::free_smob (SCM)
 {
-  return 0;
+  abort ();
 }
 
 template <class Super>
@@ -125,16 +125,11 @@ void Smob_base<Super>::init ()
   // While that's not a consideration for type_p_name_, it's easier
   // doing it like the rest.
 
-  if (&Super::free_smob != &Smob_base<Super>::free_smob)
-    scm_set_smob_free (smob_tag_, Super::free_smob);
   // Old GCC versions get their type lattice for pointers-to-members
   // tangled up to a degree where we need to typecast _both_ covariant
   // types in order to be able to compare them.  The other comparisons
   // are for static member functions and thus are ordinary function
   // pointers which work without those contortions.
-  if (static_cast<SCM (Super::*)()>(&Super::mark_smob) !=
-      static_cast<SCM (Super::*)()>(&Smob_base<Super>::mark_smob))
-    scm_set_smob_mark (smob_tag_, Super::mark_trampoline);
   scm_set_smob_print (smob_tag_, Super::print_trampoline);
   if (&Super::equal_p != &Smob_base<Super>::equal_p)
     scm_set_smob_equalp (smob_tag_, Super::equal_p);
diff --git a/test.cc b/test.cc
index b5ab476..6bbd435 100644
--- a/test.cc
+++ b/test.cc
@@ -41,9 +41,7 @@ Family::Family (int totals, int branch)
 SCM
 Family::mark_smob ()
 {
-  for (int i = 0; i < kids.size (); i++)
-    scm_gc_mark (kids[i]->self_scm ());
-  return SCM_EOL;
+  abort ();
 }
 
 int
@@ -51,7 +49,11 @@ Family::count ()
 {
   int sum = 1;
   for (int i = 0; i < kids.size (); i++)
-    sum += kids[i]->count ();
+    {
+      sum += kids[i]->count ();
+      if ((i % 100) == 0)
+	scm_gc ();
+    }
   return sum;
 }
 
[Message part 5 (text/plain, inline)]
It works on this test case, and, assuming I correctly understand
LilyPond’s use case, it may be a viable solution for LilyPond.  Could
you give it a try and report back?

> Frankly, I don't get the current strategy of GUILE: basically any use of
> scm_set_smob_mark will result in a function that can be called with
> garbage from a smob that has already been deallocated via the function
> registered with scm_set_smob_free.

That’s not true.  For example, the GnuTLS bindings use the exact same
code for 1.8 and 2.0 without any problems; it’s surely a simpler example
than LilyPond, of course, but still.

~~~~~~~

In the interest of the Guile and LilyPond projects, I would like to
improve our communication.

Let me explain what my personal expectations are.  First, like many
others, I’m a volunteer with limited bandwidth.  Thus, I really prefer
bug reports that are as concise as possible and to-the-point; I can’t
afford to read so much.

Second, I don’t want to mix issues: this bug report is about a specific
GC issue, not about an encoding issue.  It’s better for me (and everyone
I guess) when a bug report remains focused.

Third, we must fix these LilyPond vs. Guile 2 issues.  You and others
have spent a lot of time on this already.  I think it would help to get
everyone involved on both sides.  Thus, could you Cc: this bug report to
the LilyPond developer list, or the corresponding LilyPond bug report?
This is really important to me.

Please don’t take it personally or anything.  I’m really trying hard to
see how we can improve the collaboration between the two projects.

Thanks,
Ludo’.

Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 26 Feb 2015 15:31:02 GMT) Full text and rfc822 format available.

Message #26 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 26 Feb 2015 16:30:28 +0100
ludo <at> gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak <at> gnu.org> skribis:
>
>> Is there a reason you are not using the test program provided with this
>> bug report?
>
> Sorry, I had overlooked it.  Please assume good faith.
>
> With this patch:
>
> diff --git a/smobs.tcc b/smobs.tcc
> index c8b9dfe..b1d61b7 100644
> --- a/smobs.tcc
> +++ b/smobs.tcc
> @@ -132,9 +132,6 @@ void Smob_base<Super>::init ()
>    // types in order to be able to compare them.  The other comparisons
>    // are for static member functions and thus are ordinary function
>    // pointers which work without those contortions.
> -  if (static_cast<SCM (Super::*)()>(&Super::mark_smob) !=
> -      static_cast<SCM (Super::*)()>(&Smob_base<Super>::mark_smob))
> -    scm_set_smob_mark (smob_tag_, Super::mark_trampoline);
>    scm_set_smob_print (smob_tag_, Super::print_trampoline);
>    if (&Super::equal_p != &Smob_base<Super>::equal_p)
>      scm_set_smob_equalp (smob_tag_, Super::equal_p);
>
>
> “./test 1000 1000 10000” works fine.

That's not really surprising: this specifies a "fork factor" of 1000 and
1000 points of data which means that the data structures are nested only
one level deep.

Try ./test 2 2000 200

which will cause a nesting of about 10 levels, making it much more
likely for collection/allocation problems to strike.

> However, it’s not representative since the C++ objects in this test do
> not hold ‘SCM’ references, AFAICS.

Why isn't it representative?  That's _very_ much the situation in
LilyPond's code base.

>> LilyPond's GUILEv2 branch is currently out of order again since
>> 2.0.11 changed encoding mechanisms _again_
>
> Please submit a different bug report.

There were separate bug reports.  This is considered a feature.

>> It would not help since many of the references are stored in STL
>> containers (like std::vector <Grob *>) which have their data
>> allocated/deallocated separately from the memory area of the
>> structure itself.
>
> Oh, OK.  Still, I don’t think this is a problem because each C++
> object has a corresponding SMOB, and said SMOB is GC-protected; thus
> the C++ object itself is also GC-protected until the SMOB is
> unprotected.

The code given in test.cc is representative for LilyPond: most of the
C++ objects refer to other C++ objects via pointers, and the protection
of SMOB and C++ objects is managed through the mark callbacks.  Complex
C++ objects contain their own SCM value as part of the Smob base class.
Simple C++ objects (derived from Simple_smob) don't and are only
optionally managed by GUILE.

The test program currently does not demonstrate use of Simple_smob: at
the current point of time getting the crashes of Smob under control is
the first target.

> Here’s the patch I’ve ended up with:
>
> diff --git a/smobs.hh b/smobs.hh
> index 3701280..a41a645 100644
> --- a/smobs.hh
> +++ b/smobs.hh
> @@ -263,6 +263,20 @@ private:
>  protected:
>    Smob () : self_scm_ (SCM_UNDEFINED), protection_cons_ (SCM_EOL) { };
>  public:
> +  static void *operator new (std::size_t size)
> +  {
> +    /* This C++ object is referred to by the corresponding SMOB, which is
> +       itself GC-protected.  Thus, no need to protect the C++ object.  */
> +    return scm_gc_malloc (size, "lily-smob");
> +  }
> +
> +  static void operator delete (void *thing)
> +  {
> +    /* Nothing to do: the GC will reclaim memory for THING when it deems
> +       appropriate.  */
> +    // printf ("delete %p\n", thing);
> +  }
> +

As I stated: this will not help with STL containers which are
extensively used in pretty much every data structure of LilyPond.

>> Frankly, I don't get the current strategy of GUILE: basically any use
>> of scm_set_smob_mark will result in a function that can be called
>> with garbage from a smob that has already been deallocated via the
>> function registered with scm_set_smob_free.
>
> That’s not true.

Shrug.  I posted the backtrace and the example program.

> Let me explain what my personal expectations are.  First, like many
> others, I’m a volunteer with limited bandwidth.  Thus, I really prefer
> bug reports that are as concise as possible and to-the-point; I can’t
> afford to read so much.

That's what I consider the primary reason that every offer to help with
LilyPond so far has gone dead the moment I provided a branch and recipe
to work on.  LilyPond is a large and complex application.  That's why
I went to the pain of providing standalone code demonstrating the memory
allocation and management problems so that this could be worked on
independently from other problems.

> Second, I don’t want to mix issues: this bug report is about a
> specific GC issue, not about an encoding issue.  It’s better for me
> (and everyone I guess) when a bug report remains focused.

Which is why I provided a separate test program that does not break
whenever the "stable" Guile version changes its encoding APIs again.
Several weeks ago I promised RMS to update the GUILEv2 branch to the
current version of LilyPond as he got a renewed promise of cooperation.
It turned out that in the mean time Ubuntu had updated Guilev2 from
2.0.9 (or something) to 2.0.11, and the encoding issues I got under
control for 2.0.9 were back with a vengeance where a speedy solution was
not in sight.

So I spent the time to create test code _only_ involving the memory
management issues.  Now the complaint appears to be that the code is not
representative for LilyPond.  It most certainly is.

> Third, we must fix these LilyPond vs. Guile 2 issues.  You and others
> have spent a lot of time on this already.

In the last two years, nobody except myself spent any time on
GUILEv2/LilyPond.  Previous to that, Ian Hulin spent time on
reorganizing our code base and startup code in order not to get the byte
compilation, module system and other interpreter/compiler differences of
GUILEv2 mess up the basic operation, mostly by disabling a lot of stuff.
There will be a lot of followup work to get code management and
performance into reasonable shape once it works at all.

When it is not possible to run the test suites without random crashes,
there is no reasonably effective way for people to work on those aspects
however.

> I think it would help to get everyone involved on both sides.  Thus,
> could you Cc: this bug report to the LilyPond developer list, or the
> corresponding LilyPond bug report?  This is really important to me.

Shrug.  I'll put a link to this bug report to a suitable LilyPond issue.
It will not likely trigger a response or other feedback, however.
Nobody but myself is working on GUILEv2 migration.  The current C++
template class implementation/abstraction of the GUILE memory management
has been written by me.  No other currently active developer has had
either significant involvement with the current code, or the
functionally similar previous code that used C preprocessor macros
instead of language features to do essentially the same.

> Please don’t take it personally or anything.  I’m really trying hard
> to see how we can improve the collaboration between the two projects.

For better or worse, LilyPond has no other developer to offer than
myself for that task.  There is nobody else available who even has had
exposure to GUILEv2 or even GUILE1 internals.

-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 26 Feb 2015 18:16:02 GMT) Full text and rfc822 format available.

Message #29 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: David Kastrup <dak <at> gnu.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 26 Feb 2015 19:15:30 +0100
David Kastrup <dak <at> gnu.org> skribis:

> ludo <at> gnu.org (Ludovic Courtès) writes:
>
>> David Kastrup <dak <at> gnu.org> skribis:

[...]

>>> It would not help since many of the references are stored in STL
>>> containers (like std::vector <Grob *>) which have their data
>>> allocated/deallocated separately from the memory area of the
>>> structure itself.
>>
>> Oh, OK.  Still, I don’t think this is a problem because each C++
>> object has a corresponding SMOB, and said SMOB is GC-protected; thus
>> the C++ object itself is also GC-protected until the SMOB is
>> unprotected.
>
> The code given in test.cc is representative for LilyPond: most of the
> C++ objects refer to other C++ objects via pointers, and the protection
> of SMOB and C++ objects is managed through the mark callbacks.  Complex
> C++ objects contain their own SCM value as part of the Smob base class.
> Simple C++ objects (derived from Simple_smob) don't and are only
> optionally managed by GUILE.

I don’t think this contradicts what I wrote above, does it?

>> Here’s the patch I’ve ended up with:
>>
>> diff --git a/smobs.hh b/smobs.hh
>> index 3701280..a41a645 100644
>> --- a/smobs.hh
>> +++ b/smobs.hh
>> @@ -263,6 +263,20 @@ private:
>>  protected:
>>    Smob () : self_scm_ (SCM_UNDEFINED), protection_cons_ (SCM_EOL) { };
>>  public:
>> +  static void *operator new (std::size_t size)
>> +  {
>> +    /* This C++ object is referred to by the corresponding SMOB, which is
>> +       itself GC-protected.  Thus, no need to protect the C++ object.  */
>> +    return scm_gc_malloc (size, "lily-smob");
>> +  }
>> +
>> +  static void operator delete (void *thing)
>> +  {
>> +    /* Nothing to do: the GC will reclaim memory for THING when it deems
>> +       appropriate.  */
>> +    // printf ("delete %p\n", thing);
>> +  }
>> +
>
> As I stated: this will not help with STL containers which are
> extensively used in pretty much every data structure of LilyPond.

Sorry, I don’t understand how it doesn’t help.

It would be a problem is ‘Smob’ objects could be copied, thus ending up
in non-GC-scanned storage, but IIUC they cannot be copied because their
copy constructor is private.

What am I missing?

At any rate, I don’t see any failure with the test program.

>> I think it would help to get everyone involved on both sides.  Thus,
>> could you Cc: this bug report to the LilyPond developer list, or the
>> corresponding LilyPond bug report?  This is really important to me.
>
> Shrug.  I'll put a link to this bug report to a suitable LilyPond issue.

Thank you.  Though I want other LilyPond developers to get involved, and
I’m afraid it would be easy for them to just ignore a side bug report.

It’s a vital task for LilyPond, it cannot be a one-person side-project
on the LilyPond side.

Thanks in advance,
Ludo’.




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 26 Feb 2015 19:26:01 GMT) Full text and rfc822 format available.

Message #32 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 26 Feb 2015 20:25:22 +0100
ludo <at> gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak <at> gnu.org> skribis:
>
>> Shrug.  I'll put a link to this bug report to a suitable LilyPond issue.
>
> Thank you.  Though I want other LilyPond developers to get involved, and
> I’m afraid it would be easy for them to just ignore a side bug report.
>
> It’s a vital task for LilyPond, it cannot be a one-person side-project
> on the LilyPond side.

Three years ago in August there was a meeting of developers at my house,
a lot of information was passed around and in a concerted effort
LilyPond-2.16.0 was released.

Since then all developments would properly be labelled a one-person
project or side-project.

Now if you take a look at GUILE 2.1 development, in particular all
commits that are _not_ merges from the 2.0 branch, you'll find that
quite more than 90% have come from the same person.

LilyPond development is quite more diverse, but the various changes in
the last years have invariably been one-person projects.

Of the three large GUILE-based applications Gnucash, TeXmacs, and
LilyPond, the only successful migration to GUILEv2 so far has been
Gnucash, and its integration with GUILE is much smaller than that of
LilyPond.

The one-person projects concerning finding and/or fixing language and
compiler problems or restructuring the related code areas have been
exclusively mine.  That is not necessarily a good match since my
productivity drops to near zero when I lose interest with a problem, and
without anybody else keeping the ball rolling, it can easily stay there.

But I have no resources I could call upon to change that.

-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Fri, 27 Feb 2015 09:56:02 GMT) Full text and rfc822 format available.

Message #35 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: David Kastrup <dak <at> gnu.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Fri, 27 Feb 2015 10:55:04 +0100
David Kastrup <dak <at> gnu.org> skribis:

> Three years ago in August there was a meeting of developers at my house,
> a lot of information was passed around and in a concerted effort
> LilyPond-2.16.0 was released.
>
> Since then all developments would properly be labelled a one-person
> project or side-project.

OK.  I viewed LilyPond as a project with more people involved.

> Now if you take a look at GUILE 2.1 development, in particular all
> commits that are _not_ merges from the 2.0 branch, you'll find that
> quite more than 90% have come from the same person.
>
> LilyPond development is quite more diverse, but the various changes in
> the last years have invariably been one-person projects.
>
> Of the three large GUILE-based applications Gnucash, TeXmacs, and
> LilyPond, the only successful migration to GUILEv2 so far has been
> Gnucash, and its integration with GUILE is much smaller than that of
> LilyPond.
>
> The one-person projects concerning finding and/or fixing language and
> compiler problems or restructuring the related code areas have been
> exclusively mine.  That is not necessarily a good match since my
> productivity drops to near zero when I lose interest with a problem, and
> without anybody else keeping the ball rolling, it can easily stay there.

Sure.  There’s quite a number of names showing up in the lilypond-devel
archive though.  We need them to feel concerned about this.

Ludo’.




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Fri, 27 Feb 2015 10:19:02 GMT) Full text and rfc822 format available.

Message #38 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Fri, 27 Feb 2015 11:18:15 +0100
ludo <at> gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak <at> gnu.org> skribis:
>
>> Three years ago in August there was a meeting of developers at my house,
>> a lot of information was passed around and in a concerted effort
>> LilyPond-2.16.0 was released.
>>
>> Since then all developments would properly be labelled a one-person
>> project or side-project.
>
> OK.  I viewed LilyPond as a project with more people involved.

There are a number of them involved, but with regard to coding,
everybody is working on his own projects.  The largest amount of
"teamwork" is probably done in the translation team (which translates
documentation and web pages to several different languages) and with
documentation writers.

There are also people compiling the releases and running the test
suites.  And there is feedback on the bug trackers.

> Sure.  There’s quite a number of names showing up in the
> lilypond-devel archive though.  We need them to feel concerned about
> this.

The only concern you'll get is people piping up and saying they cannot
believe that not more progress has been made.  Which is not all that
dissimilar to what I see with GUILE and other projects even when sending
ready-made patches.

In this particular case however, the respective people do not have the
low-level skills to actually contribute significantly.  Which is the
reason I try to condense the ugly stuff into well-encapsulated and
controlled places instead of keeping them distributed across the code
base.

The memory management stuff is of that "nobody else treads here" kind.

Now the encoding stuff is wildly enfuriating in its pointless erection
and shifting around of road blocks for people needing to actually work
with strings and buffers in their original encoding, but would possibly
be suitable as material to fight with to more C programmers.

-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Sun, 01 Mar 2015 06:52:02 GMT) Full text and rfc822 format available.

Message #41 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: David Kastrup <dak <at> gnu.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Smob's mark_smob has become unreliable in Guile 2.x
Date: Sun, 01 Mar 2015 01:51:28 -0500
Hi David,

Thank you for the minimal test case.  This is enormously helpful.

David Kastrup <dak <at> gnu.org> writes:

> The symptom likely occurs when an object is collected and consequently
> its free_smob function gets called.

After some investigation, I believe I've confirmed that this is exactly
what's happening.

I added a field to the Smob class that is set to 12345678 when the
object is created (and 21345678 at a later point), and then 87654321
when free_smob is called.  I then observed that the segfault happens
when the smob_mark function is called on a smob whose tag field is
valid, but whose corresponding C++ object has 87654321 in the special
field, indicating that its smob_free method was already called.  This
happened with GC_MARKERS=1.

I believe this is a bug in BDWGC.  gc_mark.h warns:

--8<---------------cut here---------------start------------->8---
/* WARNING: Such a mark procedure may be invoked on an unused object    */
/* residing on a free list.  Such objects are cleared, except for a     */
/* free list link field in the first word.  Thus mark procedures may    */
/* not count on the presence of a type descriptor, and must handle this */
/* case correctly somehow.                                              */
--8<---------------cut here---------------end--------------->8---

and protect Guile users from this issue by checking that the tag on the
smob cell (first word) is correct before calling the user-defined
smob_mark function.  However, in this case, the mark procedure was
called on a smob cell whose first word was the correct tag and whose
second word was non-zero and clearly a pointer to an already deleted
instance of the C++ Smob class.

I think the next step is to remove libguile from this test program, and
turn it into a minimal demonstration of this apparent BDWGC bug, for
submission to the BDWGC developers.

FYI, here are the changes I made to the test program and the results I
observed.

First, some changes that I think might be needed for GC safety:

--8<---------------cut here---------------start------------->8---
--- smobs.hh.ORIG	2015-02-15 13:35:03.000000000 -0500
+++ smobs.hh	2015-02-28 23:42:06.346803668 -0500
@@ -270,8 +271,9 @@
   }
   SCM unprotected_smobify_self ()
   {
-    self_scm_ = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
-    return self_scm_;
+    SCM s = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
+    self_scm_ = s;
+    return s;
   }
   void protect ()
   {
@@ -279,12 +281,12 @@
   }
   SCM unprotect ()
   {
-    unprotect_smob (self_scm_, &protection_cons_);
-    return self_scm_;
+    SCM s = self_scm_;
+    unprotect_smob (s, &protection_cons_);
+    return s;
   }
   void smobify_self () {
-    self_scm_ = unprotected_smobify_self ();
-    protect ();
+    protect_smob (unprotected_smobify_self (), &protection_cons_);
   }
   SCM self_scm () const { return self_scm_; }
 };
--8<---------------cut here---------------end--------------->8---

The changes to 'unprotected_smobify_self' and 'smobify_self" are
probably only needed for a multithreaded program.  The idea is to try to
ensure that the smob is protected from GC by being on the stack (and not
merely in the malloc/free heap) until it is protected.

The change to 'unprotect' is along the same lines, and might be needed
even for single-threaded programs.  I don't know that
'scm_gc_unprotect_object' (called from 'unprotect_smob') will guarantee
not to free the object passed to it if there are no other visible
references to it.

However, none of these changes fixed the problem here.

BTW, one other thing to keep in mind: it will never be enough if the
only GC-visible reference to an object is the C++ Smob object.  There
must always be a GC-visible reference to the SCM value.

--8<---------------cut here---------------start------------->8---
--- smobs.hh.ORIG	2015-02-15 13:35:03.000000000 -0500
+++ smobs.hh	2015-02-28 23:42:06.346803668 -0500
@@ -263,6 +263,7 @@
 protected:
   Smob () : self_scm_ (SCM_UNDEFINED), protection_cons_ (SCM_EOL) { };
 public:
+  volatile int mhw_valid;
   static size_t free_smob (SCM obj)
   {
     delete Smob_base<Super>::unregister_ptr (obj);
--- smobs.tcc.ORIG	2015-02-16 08:29:02.000000000 -0500
+++ smobs.tcc	2015-02-28 23:30:19.563298917 -0500
@@ -50,6 +50,7 @@
   // announce the memory as being GC-controlled before even
   // allocating the controlling smob.
   SCM s = SCM_UNDEFINED;
+  p->mhw_valid = 12345678;
   SCM_NEWSMOB (s, smob_tag (), p);
   scm_gc_register_collectable_memory (p, sizeof (*p), smob_name_.c_str ());
   return s;
@@ -94,6 +95,9 @@
 Smob_base<Super>::unregister_ptr (SCM obj)
 {
   Super *p = Super::unchecked_unsmob (obj);
+  if (p->mhw_valid != 12345678 && p->mhw_valid != 21345678)
+    abort ();
+  p->mhw_valid = 87654321;
   scm_gc_unregister_collectable_memory (p, sizeof (*p), smob_name_.c_str ());
   return p;
 }
--- test.cc.ORIG	2015-02-16 10:33:24.000000000 -0500
+++ test.cc	2015-02-28 23:28:34.690778883 -0500
@@ -28,6 +28,7 @@
 Family::Family (int totals, int branch)
 {
   smobify_self ();
+  mhw_valid = 21345678;
   for (int l=0; l<branch; l++)
     {
       if (l >= totals)
--8<---------------cut here---------------end--------------->8---

Here's what I observed:

--8<---------------cut here---------------start------------->8---
mhw <at> jojen:~/guile-bug$ GC_MARKERS=1 ./test 2 20000 40
Iteration 0
Iteration 1
Iteration 2
Iteration 3
Segmentation fault (core dumped)
mhw <at> jojen:~/guile-bug$ gdb ./test core
GNU gdb (GDB) 7.8.2
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...done.
[New LWP 30663]

warning: Could not load shared library symbols for linux-gate.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/gnu/store/3g20rdmnavpblsmgppyl8jhg67nidhjk-glibc-2.20/lib/libthread_db.so.1".
Core was generated by `./test 2 20000 40'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0a2abde3 in ?? ()
warning: File "/gnu/store/7a5b7fqnxzbwhx7k2bhbc4i1nliq53rl-gcc-4.9.2-lib/lib/libstdc++.so.6.0.20-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/gnu/store/3lhr8q28q6f59774di9av7ncy09jd55d-guile-2.0.11/lib/libguile-2.0.so.22.7.2-gdb.scm".
To enable execution of this file add
	add-auto-load-safe-path /gnu/store/7a5b7fqnxzbwhx7k2bhbc4i1nliq53rl-gcc-4.9.2-lib/lib/libstdc++.so.6.0.20-gdb.py
line to your configuration file "/home/mhw/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/mhw/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
(gdb) bt
#0  0x0a2abde3 in ?? ()
#1  0xb766dc4a in smob_mark (addr=0xa15f6e0, mark_stack_ptr=0x9e5a320, mark_stack_limit=0x9e62000, env=0) at smob.c:335
#2  0xb758664a in GC_mark_from (mark_stack_top=0x9e5a320, mark_stack=0x9e5a000, mark_stack_limit=0x9e62000) at mark.c:737
#3  0xb7587c07 in GC_mark_some (cold_gc_frame=0xbf874ed8 "pVZ\267M\332W\267\330N\207\277") at mark.c:386
#4  0xb757da4d in GC_stopped_mark (stop_func=stop_func <at> entry=0xb757d5b0 <GC_never_stop_func>) at alloc.c:637
#5  0xb757e4a9 in GC_try_to_collect_inner (stop_func=0xb757d5b0 <GC_never_stop_func>) at alloc.c:456
#6  0xb757ee00 in GC_collect_or_expand (needed_blocks=needed_blocks <at> entry=1, ignore_off_page=ignore_off_page <at> entry=0, 
    retry=retry <at> entry=0) at alloc.c:1266
#7  0xb757effc in GC_allocobj (gran=gran <at> entry=3, kind=kind <at> entry=1) at alloc.c:1355
#8  0xb758440b in GC_generic_malloc_inner (lb=lb <at> entry=24, k=k <at> entry=1) at malloc.c:133
#9  0xb7581e8a in GC_register_finalizer_inner (obj=obj <at> entry=0xa005720, fn=fn <at> entry=0xb766d970 <finalize_smob>, 
    cd=cd <at> entry=0x0, ofn=ofn <at> entry=0xbf875098, ocd=ocd <at> entry=0xbf87509c, 
    mp=mp <at> entry=0xb75811c0 <GC_null_finalize_mark_proc>) at finalize.c:524
#10 0xb75821e5 in GC_register_finalizer_no_order (obj=obj <at> entry=0xa005720, fn=fn <at> entry=0xb766d970 <finalize_smob>, 
    cd=cd <at> entry=0x0, ofn=ofn <at> entry=0xbf875098, ocd=ocd <at> entry=0xbf87509c) at finalize.c:575
#11 0xb761bfbb in scm_i_set_finalizer (obj=obj <at> entry=0xa005720, proc=proc <at> entry=0xb766d970 <finalize_smob>, 
    data=data <at> entry=0x0) at finalizers.c:44
#12 0xb766e1f8 in scm_i_new_smob (tc=7039, data=171214120) at smob.c:416
#13 0x08049b12 in scm_new_smob (tc=7039, data=171214120)
    at /gnu/store/3lhr8q28q6f59774di9av7ncy09jd55d-guile-2.0.11/include/guile/2.0/libguile/smob.h:91
#14 0x0804a8dc in Smob_base<Family>::register_ptr (p=0xa348528) at smobs.tcc:54
#15 0x0804a281 in Smob<Family>::unprotected_smobify_self (this=0xa34852c) at smobs.hh:274
#16 0x08049d0a in Smob<Family>::smobify_self (this=0xa34852c) at smobs.hh:289
#17 0x0804960c in Family::Family (this=0xa348528, totals=8, branch=2) at test.cc:30
#18 0x08049658 in Family::Family (this=0xa2a1638, totals=17, branch=2) at test.cc:36
#19 0x08049658 in Family::Family (this=0xa2a1758, totals=37, branch=2) at test.cc:36
#20 0x08049658 in Family::Family (this=0xa349478, totals=76, branch=2) at test.cc:36
#21 0x08049658 in Family::Family (this=0xa349008, totals=154, branch=2) at test.cc:36
#22 0x08049658 in Family::Family (this=0xa347318, totals=310, branch=2) at test.cc:36
#23 0x08049658 in Family::Family (this=0xa29d938, totals=623, branch=2) at test.cc:36
#24 0x08049658 in Family::Family (this=0xa2998b8, totals=1248, branch=2) at test.cc:36
#25 0x08049658 in Family::Family (this=0xa27f818, totals=2498, branch=2) at test.cc:36
#26 0x08049658 in Family::Family (this=0xa343608, totals=4998, branch=2) at test.cc:36
#27 0x08049658 in Family::Family (this=0xa1e2a88, totals=9998, branch=2) at test.cc:36
#28 0x08049658 in Family::Family (this=0x9f9a3b8, totals=19999, branch=2) at test.cc:36
#29 0x0804989f in workload (avv=0xbf8759f4) at test.cc:72
#30 0xb7608480 in c_body (d=0xbf875884) at continuations.c:517
#31 0xb76955de in apply_catch_closure (clo=#<smob catch-closure 9ffab20>, args=()) at throw.c:140
#32 0xb766d8cf in apply_1 (smob=#<smob catch-closure 9ffab20>, a=()) at smob.c:142
#33 0xb76ac662 in vm_regular_engine (vm=#<vm 9ee2060>, program=#<program 9ee04c0>, argv=<optimized out>, 
    nargs=<optimized out>) at vm-i-system.c:858
#34 0xb76ae621 in scm_c_vm_run (vm=#<vm 9ee2060>, program=program <at> entry=#<program 9f5ddc8>, argv=argv <at> entry=0xbf875790, 
    nargs=nargs <at> entry=4) at vm.c:768
#35 0xb76135b4 in scm_call_4 (proc=#<program 9f5ddc8>, arg1=arg1 <at> entry=#t, arg2=arg2 <at> entry=#<smob catch-closure 9ffab20>, 
    arg3=arg3 <at> entry=<error reading variable: ERROR: Cannot access memory at address 0x0>, 
    arg4=arg4 <at> entry=<error reading variable: ERROR: Cannot access memory at address 0x0>) at eval.c:507
#36 0xb76957ee in scm_catch_with_pre_unwind_handler (key=key <at> entry=#t, thunk=thunk <at> entry=#<smob catch-closure 9ffab20>, 
    handler=handler <at> entry=<error reading variable: ERROR: Cannot access memory at address 0x0>, 
    pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x0>) at throw.c:73
#37 0xb7695920 in scm_c_catch (tag=tag <at> entry=#t, body=body <at> entry=0xb7608470 <c_body>, body_data=body_data <at> entry=0xbf875884, 
    handler=handler <at> entry=0xb7608930 <c_handler>, handler_data=handler_data <at> entry=0xbf875884, 
    pre_unwind_handler=pre_unwind_handler <at> entry=0xb7608630 <pre_unwind_handler>, 
    pre_unwind_handler_data=pre_unwind_handler_data <at> entry=0x9ee2020) at throw.c:207
#38 0xb7608cd7 in scm_i_with_continuation_barrier (body=body <at> entry=0xb7608470 <c_body>, 
    body_data=body_data <at> entry=0xbf875884, handler=handler <at> entry=0xb7608930 <c_handler>, 
    handler_data=handler_data <at> entry=0xbf875884, 
    pre_unwind_handler=pre_unwind_handler <at> entry=0xb7608630 <pre_unwind_handler>, pre_unwind_handler_data=0x9ee2020)
    at continuations.c:455
#39 0xb7608da6 in scm_c_with_continuation_barrier (func=0x80497a4 <workload(void*)>, data=0xbf8759f4) at continuations.c:551
#40 0xb7692b32 in with_guile_and_parent (base=0xbf8758ec, data=0xbf875914) at threads.c:906
#41 0xb7589ac8 in GC_call_with_stack_base (fn=fn <at> entry=0xb7692ae0 <with_guile_and_parent>, arg=arg <at> entry=0xbf875914)
    at misc.c:1840
#42 0xb7692fcf in scm_i_with_guile_and_parent (parent=<optimized out>, data=0xbf8759f4, func=0x80497a4 <workload(void*)>)
    at threads.c:949
#43 scm_with_guile (func=0x80497a4 <workload(void*)>, data=0xbf8759f4) at threads.c:955
#44 0x0804997d in main (ac=4, av=0xbf8759f4) at test.cc:86
(gdb) dir /home/mhw/gc-7.4.2
Source directories searched: /home/mhw/gc-7.4.2:$cdir:$cwd
(gdb) dir /home/mhw/guile-2.0.11/libguile
Source directories searched: /home/mhw/guile-2.0.11/libguile:/home/mhw/gc-7.4.2:$cdir:$cwd
(gdb) up
#1  0xb766dc4a in smob_mark (addr=0xa15f6e0, mark_stack_ptr=0x9e5a320, mark_stack_limit=0x9e62000, env=0) at smob.c:335
warning: Source file is more recent than executable.
335	      obj = scm_smobs[smobnum].mark (cell);
(gdb) list
330	      scm_i_pthread_setspecific (current_mark_stack_pointer, mark_stack_ptr);
331	      scm_i_pthread_setspecific (current_mark_stack_limit, mark_stack_limit);
332	
333	      /* Invoke the SMOB's mark procedure, which will in turn invoke
334		 `scm_gc_mark', which may modify `current_mark_stack_pointer'.  */
335	      obj = scm_smobs[smobnum].mark (cell);
336	
337	      mark_stack_ptr = scm_i_pthread_getspecific (current_mark_stack_pointer);
338	
339	      if (SCM_NIMP (obj))
(gdb) p *(Family *)0x9f9a3b8
$1 = {<Smob<Family>> = {<Smob_base<Family>> = {static smob_tag_ = 7039, static scm_init_ = {<No data fields>}, 
      static smob_name_ = {static npos = <optimized out>, 
        _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
          _M_p = 0x9ea1fd4 "Family"}}, static type_p_name_ = <optimized out>, static smob_proc = <optimized out>, 
      static smob_proc_signature_ = <optimized out>}, self_scm_ = #<smob Family a3b28a0>, protection_cons_ = (), 
    mhw_valid = 21345678}, _vptr.Family = 0x804b668 <vtable for Family+8>, 
  kids = {<std::_Vector_base<Family*, std::allocator<Family*> >> = {
      _M_impl = {<std::allocator<Family*>> = {<__gnu_cxx::new_allocator<Family*>> = {<No data fields>}, <No data fields>}, 
        _M_start = 0x9f9a788, _M_finish = 0x9f9a78c, _M_end_of_storage = 0x9f9a78c}}, <No data fields>}}
(gdb) p ((Family *)0x9f9a3b8)->mhw_valid
$2 = 21345678
(gdb) p ((Family *)0xa348528)->mhw_valid
$3 = 12345678
(gdb) p ((Family *)0xa2a1638)->mhw_valid
$4 = 21345678
(gdb) x/2xw addr
0xa15f6e0:	0x00001b7f	0x0a352c58
(gdb) p *(Family *)0x0a352c58
$5 = {<Smob<Family>> = {<Smob_base<Family>> = {static smob_tag_ = 7039, static scm_init_ = {<No data fields>}, 
      static smob_name_ = {static npos = <optimized out>, 
        _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
          _M_p = 0x9ea1fd4 "Family"}}, static type_p_name_ = <optimized out>, static smob_proc = <optimized out>, 
      static smob_proc_signature_ = <optimized out>}, self_scm_ = #<smob Family a15f6e0>, protection_cons_ = (), 
    mhw_valid = 87654321}, _vptr.Family = 0xa202ca8, kids = {<std::_Vector_base<Family*, std::allocator<Family*> >> = {
      _M_impl = {<std::allocator<Family*>> = {<__gnu_cxx::new_allocator<Family*>> = {<No data fields>}, <No data fields>}, 
        _M_start = 0xa352cf8, _M_finish = 0xa352d00, _M_end_of_storage = 0xa352d00}}, <No data fields>}}
(gdb) p/x 7039
$6 = 0x1b7f
(gdb)
--8<---------------cut here---------------end--------------->8---

     Regards,
       Mark




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Sun, 01 Mar 2015 10:11:02 GMT) Full text and rfc822 format available.

Message #44 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Smob's mark_smob has become unreliable in Guile 2.x
Date: Sun, 01 Mar 2015 11:09:46 +0100
Mark H Weaver <mhw <at> netris.org> writes:

> Hi David,
>
> Thank you for the minimal test case.  This is enormously helpful.
>
> David Kastrup <dak <at> gnu.org> writes:
>
>> The symptom likely occurs when an object is collected and consequently
>> its free_smob function gets called.
>
> After some investigation, I believe I've confirmed that this is exactly
> what's happening.
>
> I added a field to the Smob class that is set to 12345678 when the
> object is created (and 21345678 at a later point), and then 87654321
> when free_smob is called.  I then observed that the segfault happens
> when the smob_mark function is called on a smob whose tag field is
> valid, but whose corresponding C++ object has 87654321 in the special
> field, indicating that its smob_free method was already called.  This
> happened with GC_MARKERS=1.

There are a few remedies for the test case that do not necessarily apply
to the full-blown code.

I was able to avoid the crashes here by declaring the mark_smob function
non-virtual.  Declaring it virtual was actually more an oversight than
anything else.  Not routing the scm_mark call through a pointer fetched
from the no longer existent virtual function table will stop the crashes
in the example code.

It's not an option for LilyPond proper, however, since the type lattice
is complex enough that routing some of the marking process through a
virtual member function is not really avoidable.

Another more likely effective patch (which already made it into
LilyPond's code base) _clears_ the SCM data (where the pointer to the
C++ structure is stored) in unregister_ptr.  Since all calls to the
member functions mark_smob (and print_smob accordingly) are routed
through the same mark_trampoline, letting this trampoline check for a
non-null pointer avoids the mark-after-free situation in the general
case.

However, this is still a probabilistic fix.  For one thing, there might
be race conditions inside of GUILE.  For another, the basic premise is
that calling the mark function with garbage must not result in crashes.

Since LilyPond relies on C++ data structures and containers for its
operation and those are also involved in the mark walks, "can be called
on arbitrary garbage" (of which already freed smobs are just one special
case, so catering for them is not sufficient) is not an option that can
be implemented with reasonable amount of effort in the mark_smob
routines of LilyPond.

However, from the GUILE side the mark_smob calls originate from objects
detected to be SCM cells, and GUILE should be able to manage areas
filled with SCM cells such that it can _reliably_ distinguish allocated
from unallocated cells and there is no such thing as random garbage.

When that is the case, it may be sufficient to clear out the tag field
in the course of calling the free_smob hook, avoiding any subsequent
mark calls to occur on a cell that has lost its type consistency anyway.
Again, this might involve race conditions: no idea.  Because of
potential race conditions, a reliable fix needs to be in GUILE or BDWGC
and cannot be done at the application end.  The best we will be able to
achieve there is "crashes almost never randomly".

Which is what we should get in LilyPond with
<URL:https://codereview.appspot.com/197690043> (already committed to the
code base as
<URL:http://git.savannah.gnu.org/cgit/lilypond.git/commit/?id=60eeb4a4ff217e3abbfcb6b4de94f0a6ef6d437d>.

> I believe this is a bug in BDWGC.  gc_mark.h warns:
>
> /* WARNING: Such a mark procedure may be invoked on an unused object    */
> /* residing on a free list.  Such objects are cleared, except for a     */
> /* free list link field in the first word.  Thus mark procedures may    */
> /* not count on the presence of a type descriptor, and must handle this */
> /* case correctly somehow.                                              */
>
> and protect Guile users from this issue by checking that the tag on the
> smob cell (first word) is correct before calling the user-defined
> smob_mark function.  However, in this case, the mark procedure was
> called on a smob cell whose first word was the correct tag and whose
> second word was non-zero and clearly a pointer to an already deleted
> instance of the C++ Smob class.
>
> I think the next step is to remove libguile from this test program, and
> turn it into a minimal demonstration of this apparent BDWGC bug, for
> submission to the BDWGC developers.

I don't know about the interplay of GUILE and BDWGC.  It may be that the
general C freestore operation that BDWGC provides cannot prevent the
"can be called on garbage" effect and functions need to be robust
against that.

However, the specific "garbage" here resulted directly from marking an
already freed smob, and with a frequency that is far above random.  When
that happens, it is extremely unlikely that memory is being either
successfully retained or freed.

And that sounds like a bad idea in BDWGC itself.

> FYI, here are the changes I made to the test program and the results I
> observed.
>
> First, some changes that I think might be needed for GC safety:
>
> --- smobs.hh.ORIG	2015-02-15 13:35:03.000000000 -0500
> +++ smobs.hh	2015-02-28 23:42:06.346803668 -0500
> @@ -270,8 +271,9 @@
>    }
>    SCM unprotected_smobify_self ()
>    {
> -    self_scm_ = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
> -    return self_scm_;
> +    SCM s = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
> +    self_scm_ = s;
> +    return s;

Don't see the intended difference here.  It's conceivable that with
several "volatile" declarations the compiler may generate different code
that would protect against self_scm_ being modified before the function
returns.  Is that supposed to help against non-atomic changes of
self_scm_?

> @@ -279,12 +281,12 @@
>    }
>    SCM unprotect ()
>    {
> -    unprotect_smob (self_scm_, &protection_cons_);
> -    return self_scm_;
> +    SCM s = self_scm_;
> +    unprotect_smob (s, &protection_cons_);
> +    return s;
>    }

I see how this one would be advisable in multi-threaded applications.

>    void smobify_self () {
> -    self_scm_ = unprotected_smobify_self ();
> -    protect ();
> +    protect_smob (unprotected_smobify_self (), &protection_cons_);
>    }
>    SCM self_scm () const { return self_scm_; }
>  };
>
>
> The changes to 'unprotected_smobify_self' and 'smobify_self" are
> probably only needed for a multithreaded program.  The idea is to try to
> ensure that the smob is protected from GC by being on the stack (and not
> merely in the malloc/free heap) until it is protected.
>
> The change to 'unprotect' is along the same lines, and might be needed
> even for single-threaded programs.  I don't know that
> 'scm_gc_unprotect_object' (called from 'unprotect_smob') will guarantee
> not to free the object passed to it if there are no other visible
> references to it.
>
> However, none of these changes fixed the problem here.

I'll put them in as there is no sane reason not to.  However, they
appear to mostly concern low-probability races whereas we still have an
elephant in the room.

> BTW, one other thing to keep in mind: it will never be enough if the
> only GC-visible reference to an object is the C++ Smob object.  There
> must always be a GC-visible reference to the SCM value.

That's why we need the mark functions in LilyPond.  It would be a really
large PITA to have to route everything through SCM since there is a
really large corpus of code and non-SCM data structures involved.

> (gdb) bt
> #0  0x0a2abde3 in ?? ()
> #1  0xb766dc4a in smob_mark (addr=0xa15f6e0, mark_stack_ptr=0x9e5a320, mark_stack_limit=0x9e62000, env=0) at smob.c:335
> #2  0xb758664a in GC_mark_from (mark_stack_top=0x9e5a320, mark_stack=0x9e5a000, mark_stack_limit=0x9e62000) at mark.c:737
> #3  0xb7587c07 in GC_mark_some (cold_gc_frame=0xbf874ed8 "pVZ\267M\332W\267\330N\207\277") at mark.c:386
> #4  0xb757da4d in GC_stopped_mark (stop_func=stop_func <at> entry=0xb757d5b0 <GC_never_stop_func>) at alloc.c:637
> #5  0xb757e4a9 in GC_try_to_collect_inner (stop_func=0xb757d5b0 <GC_never_stop_func>) at alloc.c:456
> #6  0xb757ee00 in GC_collect_or_expand (needed_blocks=needed_blocks <at> entry=1, ignore_off_page=ignore_off_page <at> entry=0, 
>     retry=retry <at> entry=0) at alloc.c:1266
> #7  0xb757effc in GC_allocobj (gran=gran <at> entry=3, kind=kind <at> entry=1) at alloc.c:1355
> #8  0xb758440b in GC_generic_malloc_inner (lb=lb <at> entry=24, k=k <at> entry=1) at malloc.c:133
> #9  0xb7581e8a in GC_register_finalizer_inner (obj=obj <at> entry=0xa005720, fn=fn <at> entry=0xb766d970 <finalize_smob>, 
>     cd=cd <at> entry=0x0, ofn=ofn <at> entry=0xbf875098, ocd=ocd <at> entry=0xbf87509c, 
>     mp=mp <at> entry=0xb75811c0 <GC_null_finalize_mark_proc>) at finalize.c:524
> #10 0xb75821e5 in GC_register_finalizer_no_order (obj=obj <at> entry=0xa005720, fn=fn <at> entry=0xb766d970 <finalize_smob>, 
>     cd=cd <at> entry=0x0, ofn=ofn <at> entry=0xbf875098, ocd=ocd <at> entry=0xbf87509c) at finalize.c:575
> #11 0xb761bfbb in scm_i_set_finalizer (obj=obj <at> entry=0xa005720, proc=proc <at> entry=0xb766d970 <finalize_smob>, 
>     data=data <at> entry=0x0) at finalizers.c:44
> #12 0xb766e1f8 in scm_i_new_smob (tc=7039, data=171214120) at smob.c:416
> #13 0x08049b12 in scm_new_smob (tc=7039, data=171214120)
>     at /gnu/store/3lhr8q28q6f59774di9av7ncy09jd55d-guile-2.0.11/include/guile/2.0/libguile/smob.h:91
> #14 0x0804a8dc in Smob_base<Family>::register_ptr (p=0xa348528) at smobs.tcc:54
> #15 0x0804a281 in Smob<Family>::unprotected_smobify_self (this=0xa34852c) at smobs.hh:274
> #16 0x08049d0a in Smob<Family>::smobify_self (this=0xa34852c) at smobs.hh:289

That's when smobify_self tries to create a smob pointing to the
C++-allocated data structure.  This calls scm_new_smob, and in the
course of that, garbage is apparently collected.  And smob_mark is
called on an already freed C++ structure with a bad virtual table, so
when mark_trampoline tries routing the virtual table call to mark_smob,
it crashes.

With the already checked-in changes and your suggested changes, we'll
likely get the mark-after-free crashes under control.  They still seem
to warrant fixing at the BDWGC level as they, crashes aside, appear to
cause memory to be in some half-released state good for neither
retaining nor reallocation.

But the mark-on-general-garbage problem seems sort of system-immanent to
conservative garbage collection while making the management of C++ data
structures (often relying on some internal consistency outside of user
control) extremely awkward.  Fixing that at the SCM level seems both
desirable and doable and will likely require GUILE to not let garbage
(rather than recognizably unallocated cells) into its managed cell
store.

-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Sun, 01 Mar 2015 15:01:01 GMT) Full text and rfc822 format available.

Message #47 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Smob's mark_smob has become unreliable in Guile 2.x
Date: Sun, 01 Mar 2015 16:00:24 +0100
Mark H Weaver <mhw <at> netris.org> writes:

[...]

The race-condition related patch suggestions have been submitted to
LilyPond's tracker.

Tracker issue: 4304 (http://code.google.com/p/lilypond/issues/detail?id=4304)
Rietveld issue: 210810043 (https://codereview.appspot.com/210810043)
Issue description:
  Smob code changes mostly relevant for multithreading  These changes
  have been suggested by Mark H Weaver in
  <URL:http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19883#41>

-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Sun, 01 Mar 2015 17:03:02 GMT) Full text and rfc822 format available.

Message #50 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: David Kastrup <dak <at> gnu.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Smob's mark_smob has become unreliable in Guile 2.x
Date: Sun, 01 Mar 2015 12:02:41 -0500
David Kastrup <dak <at> gnu.org> writes:

> Since LilyPond relies on C++ data structures and containers for its
> operation and those are also involved in the mark walks, "can be called
> on arbitrary garbage"

Who are you quoting here?

> But the mark-on-general-garbage problem seems sort of system-immanent to
> conservative garbage collection [...]

Marking arbitrary garbage is certainly not inherent (is that the word
you meant to use?) to conservative garbage collection.  There may be
bugs, but BDWGC is designed to prevent this.  See the "Allocation" and
"Mark phase" sections of <http://www.hboehm.info/gc/gcdescr.html>.  In
particular, from the "Allocation" section:

  The collector uses a two level allocator. A large block is defined to
  be one larger than half of HBLKSIZE, which is a power of 2, typically
  on the order of the page size.

  Large block sizes are rounded up to the next multiple of HBLKSIZE and
  then allocated by GC_allochblk. Recent versions of the collector use
  an approximate best fit algorithm by keeping free lists for several
  large block sizes. The actual implementation of GC_allochblk is
  significantly complicated by black-listing issues (see below).

  Small blocks are allocated in chunks of size HBLKSIZE. Each chunk is
  dedicated to only one object size and kind.

From the "Mark phase" section:

  We determine whether a candidate pointer is actually the address of a
  heap block. This is done in the following steps: The candidate pointer
  is checked against rough heap bounds. These heap bounds are maintained
  such that all actual heap objects fall between them. In order to
  facilitate black-listing (see below) we also include address regions
  that the heap is likely to expand into. Most non-pointers fail this
  initial test.

  The candidate pointer is divided into two pieces; the most significant
  bits identify a HBLKSIZE-sized page in the address space, and the
  least significant bits specify an offset within that page. (A hardware
  page may actually consist of multiple such pages. HBLKSIZE is usually
  the page size divided by a small power of two.)
  
  The page address part of the candidate pointer is looked up in a
  table. Each table entry contains either 0, indicating that the page is
  not part of the garbage collected heap, a small integer n, indicating
  that the page is part of large object, starting at least n pages back,
  or a pointer to a descriptor for the page. In the first case, the
  candidate pointer i not a true pointer and can be safely ignored. In
  the last two cases, we can obtain a descriptor for the page containing
  the beginning of the object.
  
  The starting address of the referenced object is computed. The page
  descriptor contains the size of the object(s) in that page, the object
  kind, and the necessary mark bits for those objects. The size
  information can be used to map the candidate pointer to the object
  starting address. To accelerate this process, the page header also
  contains a pointer to a precomputed map of page offsets to
  displacements from the beginning of an object. The use of this map
  avoids a potentially slow integer remainder operation in computing the
  object start address.
  
  The mark bit for the target object is checked and set. If the object
  was previously unmarked, the object is pushed on the mark stack. The
  descriptor is read from the page descriptor. (This is computed from
  information GC_obj_kinds when the page is first allocated.)

In Guile, there is a dedicated "kind" for SMOBs that have user-defined
mark functions.  This means that they are segregrated into their own
HBLKSIZE-sized blocks, and so BDWGC is able to reliably determine
whether our general SMOB marker ('smob_mark' in smob.c) should be
called, without relying on some unreliable tag inside the block itself.

So, while it is true that a conservative GC cannot avoid possibly
retaining some garbage that should be freed, it is possible to reliably
avoid calling mark functions on arbitrary garbage, and BDWGC is designed
to do this.

BTW, a few more points to mention: I vaguely recall that you may have
wondered in the past if Boehm GC "parallel marking" means that it is
marking while the mutator (user code) is running.  That is not the case.

BDWGC first stops all user threads (on POSIX I believe this is done by
sending signals to all threads and keeping them in the signal handlers
during GC) and then runs multiple of its own marker threads in parallel.
I guess this means that user-specified mark functions may be called in
multiple threads.

However, this can be avoided by setting the GC_MARKERS environment
variable to "1", and I set this while running your test code to
eliminate this possible complication for now.

Another document worth looking at is:

  http://www.hboehm.info/gc/finalization.html

> However, from the GUILE side the mark_smob calls originate from objects
> detected to be SCM cells, and GUILE should be able to manage areas
> filled with SCM cells such that it can _reliably_ distinguish allocated
> from unallocated cells and there is no such thing as random garbage.

As described above, BDWGC already handles this, and so far I haven't
seen evidence of 'smob_mark' being called on random garbage.  However,
it seems that 'smob_mark' is being called on objects that have already
been finalized.

> When that is the case, it may be sufficient to clear out the tag field
> in the course of calling the free_smob hook, avoiding any subsequent
> mark calls to occur on a cell that has lost its type consistency anyway.
> Again, this might involve race conditions: no idea.

Yes, this sounds like a work-around worth considering.  I'll think about
whether it can be done reliably in Guile and/or LilyPond.

It occurs to me that during times of heavy allocation, it might be
possible for a GC to be triggered while a finalizer is being run.  Since
the object being finalized would be on the stack, it would be considered
live by the GC, even if most or all of the finalizer had already been
run.  It warrants investigation.

Also, while reading the GC design documentation, a particular warning
raised some alarms.  From "Finalization" in gcdescr.html:

  Any objects remaining unmarked at the end of this process are added to
  a queue of objects whose finalizers can be run. Depending on collector
  configuration, finalizers are dequeued and run either implicitly
  during allocation calls, or explicitly in response to a user
  request. (Note that the former is unfortunately both the default and
  not generally safe. If finalizers perform synchronization, it may
  result in deadlocks. Nontrivial finalizers generally need to perform
  synchronization, and thus require a different collector
  configuration.)

and from finalization.html:

  Finalization code may be run anyplace an allocation or other call to
  the collector takes place. In multithreaded programs, finalizers have
  to obey the normal locking conventions to ensure safety. Code run
  directly from finalizers should not acquire locks that may be held
  during allocation. This restriction can be easily circumvented by
  registering a finalizer which enqueues the real action for execution
  in a separate thread.

  In single-threaded code, it is also often easiest to have finalizers
  queue actions, which are then explicitly run during an explicit call
  by the user's program.

So it may be that we need to be careful of this.

Here's more recommended reading on the difficulties of implementing
finalizers properly:

  http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html

Finalizers are a huge can of worms.  I suspect this is why Ludovic was
encouraging you to avoid them entirely, but I agree that we need to find
a way to make them work for when they are needed.

> Because of potential race conditions, a reliable fix needs to be in
> GUILE or BDWGC and cannot be done at the application end.  The best we
> will be able to achieve there is "crashes almost never randomly".

I'm not sure this is true.  It will require more thought.

>> First, some changes that I think might be needed for GC safety:
>>
>> --- smobs.hh.ORIG	2015-02-15 13:35:03.000000000 -0500
>> +++ smobs.hh	2015-02-28 23:42:06.346803668 -0500
>> @@ -270,8 +271,9 @@
>>    }
>>    SCM unprotected_smobify_self ()
>>    {
>> -    self_scm_ = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
>> -    return self_scm_;
>> +    SCM s = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
>> +    self_scm_ = s;
>> +    return s;
>
> Don't see the intended difference here.  It's conceivable that with
> several "volatile" declarations the compiler may generate different code
> that would protect against self_scm_ being modified before the function
> returns.  Is that supposed to help against non-atomic changes of
> self_scm_?

No, the idea is to ensure the existence of a GC-visible copy of the SCM
value until 'scm_gc_protect_object' is called.  For objects allocated in
the malloc/free heap, self_scm_ is not visible to the GC.

>> BTW, one other thing to keep in mind: it will never be enough if the
>> only GC-visible reference to an object is the C++ Smob object.  There
>> must always be a GC-visible reference to the SCM value.
>
> That's why we need the mark functions in LilyPond.

The mark functions can ensure that in your Family class, the children
will be marked even though the parent->child pointers point directly to
the C++ object.  However, whenever the root of the tree is not protected
by 'scm_gc_protect_object', the SCM value of that root must be visible
to the GC, in order to cause the user-specified 'mark' function to be
called in the first place.

It should be noted that for any particular C++ class, another
alternative is to arrange for instances of that class to be allocated
using 'scm_gc_malloc', as Ludovic suggested.  If you do that, then it
would suffice to have a GC-visible pointer to instances of that class,
which might make your life easier.

    Regards,
      Mark




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Sun, 01 Mar 2015 18:39:02 GMT) Full text and rfc822 format available.

Message #53 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: David Kastrup <dak <at> gnu.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Smob's mark_smob has become unreliable in Guile 2.x
Date: Sun, 01 Mar 2015 19:38:16 +0100
Mark H Weaver <mhw <at> netris.org> writes:

> Finalizers are a huge can of worms.  I suspect this is why Ludovic was
> encouraging you to avoid them entirely, but I agree that we need to
> find a way to make them work for when they are needed.

Again, LilyPond does not really have an option to avoid finalizers
(free_smob action): the C++ class instances associated with smobs
contain STL containers, and the storage for those is allocated
separately on the heap.  Running the destructor of the C++ class
instance (which is the basic finalization action done when calling the
delete operator on the C++ class instance) will then also destruct the
STL container, freeing all memory associated with _that_.

We don't have control over the storage layout of the STL templates.
While one could allocate them using a separate allocation template
parameter, that may be a C++11 feature.  And it would really complicate
the code.

Moving all of the smobification stuff into a few template classes has
actually locked up our release process for almost a year: 2.19.15 was
released in September, 2.19.16 is currently being uploaded.  The
release-building crosscompiling environment had to be updated to a newer
GCC version to accept the templated code, and that GCC reason had new
dependencies and needed to be bootstrapped with a C++ rather than a C
compiler.  And the person who has run the release process the last few
years was not actually a programmer.

> The mark functions can ensure that in your Family class, the children
> will be marked even though the parent->child pointers point directly
> to the C++ object.  However, whenever the root of the tree is not
> protected by 'scm_gc_protect_object', the SCM value of that root must
> be visible to the GC, in order to cause the user-specified 'mark'
> function to be called in the first place.
>
> It should be noted that for any particular C++ class, another
> alternative is to arrange for instances of that class to be allocated
> using 'scm_gc_malloc', as Ludovic suggested.  If you do that, then it
> would suffice to have a GC-visible pointer to instances of that class,
> which might make your life easier.

Again: a lot of smobified C++ structures contain STL containers that
have their separate allocation that can easily have pointers to C++
structures associated with smobs again that need to be marked.

At any rate: since LilyPond can eat a significant ratio of the entire
memory space on 32bit machines, I consider it prudent not to have
basically all of that scanned by the conservative garbage collector
since that would significantly increase the number of false positives
when marking.

And we are talking about an advertised GUILE feature after all.  With
regard to "making our life easier": it's a sunk cost since all the work
has obviously been done for GUILEv1 already.

-- 
David Kastrup




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 23 Jun 2016 09:51:01 GMT) Full text and rfc822 format available.

Message #56 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> pobox.com>
To: David Kastrup <dak <at> gnu.org>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 19883 <at> debbugs.gnu.org
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 23 Jun 2016 11:50:48 +0200
On Thu 26 Feb 2015 16:30, David Kastrup <dak <at> gnu.org> writes:

> Try ./test 2 2000 200

I can reproduce the crash with your test case, thanks :) The patch below
fixes the bug for me.  WDYT Ludovic?

Andy

commit db30120fc3a1727d8f221cbb014314f2babf841e
Author: Andy Wingo <wingo <at> pobox.com>
Date:   Thu Jun 23 11:47:42 2016 +0200

    Fix race between SMOB marking and finalization
    
    * libguile/smob.c (clear_smobnum): New helper.
      (finalize_smob): Re-set the smobnum to the "finalized smob" type
      before finalizing.  Fixes #19883.
      (scm_smob_prehistory): Pre-register a "finalized smob" type, which has
      no mark procedure.

diff --git a/libguile/smob.c b/libguile/smob.c
index 6a97caa..43ea613 100644
--- a/libguile/smob.c
+++ b/libguile/smob.c
@@ -372,20 +372,43 @@ scm_gc_mark (SCM o)
 }
 
 
+static void*
+clear_smobnum (void *ptr)
+{
+  SCM smob;
+  scm_t_bits smobnum;
+
+  smob = SCM_PACK_POINTER (ptr);
+
+  smobnum = SCM_SMOBNUM (smob);
+  /* Frob the object's type in place, re-setting it to be the "finalized
+     smob" type.  This will prevent other routines from accessing its
+     internals in a way that assumes that the smob data is valid.  This
+     is notably the case for SMOB's own "mark" procedure, if any; as the
+     finalizer runs without the alloc lock, it's possible for a GC to
+     occur while it's running, in which case the object is alive and yet
+     its data is invalid.  */
+  SCM_SET_SMOB_DATA_0 (smob, SCM_SMOB_DATA_0 (smob) & ~(scm_t_bits) 0xff00);
+
+  return (void *) smobnum;
+}
+
 /* Finalize SMOB by calling its SMOB type's free function, if any.  */
 static void
 finalize_smob (void *ptr, void *data)
 {
   SCM smob;
+  scm_t_bits smobnum;
   size_t (* free_smob) (SCM);
 
   smob = SCM_PACK_POINTER (ptr);
+  smobnum = (scm_t_bits) GC_call_with_alloc_lock (clear_smobnum, ptr);
+
 #if 0
-  printf ("finalizing SMOB %p (smobnum: %u)\n",
-	  ptr, SCM_SMOBNUM (smob));
+  printf ("finalizing SMOB %p (smobnum: %u)\n", ptr, smobnum);
 #endif
 
-  free_smob = scm_smobs[SCM_SMOBNUM (smob)].free;
+  free_smob = scm_smobs[smobnum].free;
   if (free_smob)
     free_smob (smob);
 }
@@ -460,6 +483,7 @@ void
 scm_smob_prehistory ()
 {
   long i;
+  scm_t_bits finalized_smob_tc16;
 
   scm_i_pthread_key_create (&current_mark_stack_pointer, NULL);
   scm_i_pthread_key_create (&current_mark_stack_limit, NULL);
@@ -483,6 +507,9 @@ scm_smob_prehistory ()
       scm_smobs[i].apply      = 0;
       scm_smobs[i].apply_trampoline = SCM_BOOL_F;
     }
+
+  finalized_smob_tc16 = scm_make_smob_type ("finalized smob", 0);
+  if (SCM_TC2SMOBNUM (finalized_smob_tc16) != 0) abort ();
 }
 
 /*




Reply sent to Andy Wingo <wingo <at> pobox.com>:
You have taken responsibility. (Thu, 23 Jun 2016 10:37:01 GMT) Full text and rfc822 format available.

Notification sent to David Kastrup <dak <at> gnu.org>:
bug acknowledged by developer. (Thu, 23 Jun 2016 10:37:01 GMT) Full text and rfc822 format available.

Message #61 received at 19883-done <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> pobox.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 19883-done <at> debbugs.gnu.org, David Kastrup <dak <at> gnu.org>
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 23 Jun 2016 12:36:34 +0200
[Message part 1 (text/plain, inline)]
On Thu 23 Jun 2016 11:50, Andy Wingo <wingo <at> pobox.com> writes:

> On Thu 26 Feb 2015 16:30, David Kastrup <dak <at> gnu.org> writes:
>
>> Try ./test 2 2000 200
>
> I can reproduce the crash with your test case, thanks :) The patch below
> fixes the bug for me.  WDYT Ludovic?

Here's a patch with a test case.  I'm going to apply as it seems to be
obviously the right thing and the test case does reproduce what I think
is the bug (racing mark and finalize procedures, even if it's only
happening in one thread, finalizers and mark procedures do introduce
concurrency).  We trigger the concurrency in a simple way, via
allocation in the finalizer.  The patch does fix the original test.  GC
could happen due to another thread of course.  I'm actually not sure
where the concurrency is coming from in David's test though :/

I'm very interested in any feedback you might have!

Andy

[0001-Fix-race-between-SMOB-marking-and-finalization.patch (text/plain, inline)]
From 8dff3af087c6eaa83ae0d72aa8b22aef5c65d65d Mon Sep 17 00:00:00 2001
From: Andy Wingo <wingo <at> pobox.com>
Date: Thu, 23 Jun 2016 11:47:42 +0200
Subject: [PATCH] Fix race between SMOB marking and finalization

* libguile/smob.c (clear_smobnum): New helper.
  (finalize_smob): Re-set the smobnum to the "finalized smob" type
  before finalizing.  Fixes #19883.
  (scm_smob_prehistory): Pre-register a "finalized smob" type, which has
  no mark procedure.
* test-suite/standalone/test-smob-mark-race.c: New file.
* test-suite/standalone/Makefile.am: Arrange to build and run the new
  test.
---
 libguile/smob.c                             | 33 +++++++++++++--
 test-suite/standalone/Makefile.am           |  6 +++
 test-suite/standalone/test-smob-mark-race.c | 65 +++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+), 3 deletions(-)
 create mode 100644 test-suite/standalone/test-smob-mark-race.c

diff --git a/libguile/smob.c b/libguile/smob.c
index 90849a8..ed9d91a 100644
--- a/libguile/smob.c
+++ b/libguile/smob.c
@@ -374,20 +374,43 @@ scm_gc_mark (SCM o)
 }
 
 
+static void*
+clear_smobnum (void *ptr)
+{
+  SCM smob;
+  scm_t_bits smobnum;
+
+  smob = PTR2SCM (ptr);
+
+  smobnum = SCM_SMOBNUM (smob);
+  /* Frob the object's type in place, re-setting it to be the "finalized
+     smob" type.  This will prevent other routines from accessing its
+     internals in a way that assumes that the smob data is valid.  This
+     is notably the case for SMOB's own "mark" procedure, if any; as the
+     finalizer runs without the alloc lock, it's possible for a GC to
+     occur while it's running, in which case the object is alive and yet
+     its data is invalid.  */
+  SCM_SET_SMOB_DATA_0 (smob, SCM_SMOB_DATA_0 (smob) & ~(scm_t_bits) 0xff00);
+
+  return (void *) smobnum;
+}
+
 /* Finalize SMOB by calling its SMOB type's free function, if any.  */
 static void
 finalize_smob (void *ptr, void *data)
 {
   SCM smob;
+  scm_t_bits smobnum;
   size_t (* free_smob) (SCM);
 
   smob = PTR2SCM (ptr);
+  smobnum = (scm_t_bits) GC_call_with_alloc_lock (clear_smobnum, ptr);
+
 #if 0
-  printf ("finalizing SMOB %p (smobnum: %u)\n",
-	  ptr, SCM_SMOBNUM (smob));
+  printf ("finalizing SMOB %p (smobnum: %u)\n", ptr, smobnum);
 #endif
 
-  free_smob = scm_smobs[SCM_SMOBNUM (smob)].free;
+  free_smob = scm_smobs[smobnum].free;
   if (free_smob)
     free_smob (smob);
 }
@@ -470,6 +493,7 @@ void
 scm_smob_prehistory ()
 {
   long i;
+  scm_t_bits finalized_smob_tc16;
 
   scm_i_pthread_key_create (&current_mark_stack_pointer, NULL);
   scm_i_pthread_key_create (&current_mark_stack_limit, NULL);
@@ -493,6 +517,9 @@ scm_smob_prehistory ()
       scm_smobs[i].apply      = 0;
       scm_smobs[i].apply_trampoline_objcode = SCM_BOOL_F;
     }
+
+  finalized_smob_tc16 = scm_make_smob_type ("finalized smob", 0);
+  if (SCM_TC2SMOBNUM (finalized_smob_tc16) != 0) abort ();
 }
 
 /*
diff --git a/test-suite/standalone/Makefile.am b/test-suite/standalone/Makefile.am
index 712418a..aec7418 100644
--- a/test-suite/standalone/Makefile.am
+++ b/test-suite/standalone/Makefile.am
@@ -275,4 +275,10 @@ test_smob_mark_LDADD = $(LIBGUILE_LDADD)
 check_PROGRAMS += test-smob-mark
 TESTS += test-smob-mark
 
+test_smob_mark_race_SOURCES = test-smob-mark-race.c
+test_smob_mark_race_CFLAGS = ${test_cflags}
+test_smob_mark_race_LDADD = $(LIBGUILE_LDADD)
+check_PROGRAMS += test-smob-mark-race
+TESTS += test-smob-mark-race
+
 EXTRA_DIST += ${check_SCRIPTS}
diff --git a/test-suite/standalone/test-smob-mark-race.c b/test-suite/standalone/test-smob-mark-race.c
new file mode 100644
index 0000000..eca0325
--- /dev/null
+++ b/test-suite/standalone/test-smob-mark-race.c
@@ -0,0 +1,65 @@
+/* Copyright (C) 2016 Free Software Foundation, Inc.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 3 of
+ * the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301 USA
+ */
+
+#if HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#undef NDEBUG
+
+#include <libguile.h>
+#include <assert.h>
+
+static SCM
+mark_smob (SCM smob)
+{
+  assert (SCM_SMOB_DATA (smob) == 1);
+  return SCM_BOOL_F;
+}
+
+static size_t
+finalize_smob (SCM smob)
+{
+  assert (SCM_SMOB_DATA (smob) == 1);
+  SCM_SET_SMOB_DATA (smob, 0);
+  /* Allocate a bit in the hopes of triggering a new GC, making the
+     marker race with the finalizer.  */
+  scm_cons (SCM_BOOL_F, SCM_BOOL_F);
+  return 0;
+}
+
+static void
+tests (void *data, int argc, char **argv)
+{
+  scm_t_bits tc16;
+  int i;
+
+  tc16 = scm_make_smob_type ("smob with finalizer", 0);
+  scm_set_smob_mark (tc16, mark_smob);
+  scm_set_smob_free (tc16, finalize_smob);
+
+  for (i = 0; i < 1000 * 1000; i++)
+    scm_new_smob (tc16, 1);
+}
+
+int
+main (int argc, char *argv[])
+{
+  scm_boot_guile (argc, argv, tests, NULL);
+  return 0;
+}
-- 
2.8.3


Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 23 Jun 2016 13:13:02 GMT) Full text and rfc822 format available.

Message #64 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Andy Wingo <wingo <at> pobox.com>
Cc: 19883 <at> debbugs.gnu.org, David Kastrup <dak <at> gnu.org>
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 23 Jun 2016 15:12:34 +0200
Andy Wingo <wingo <at> pobox.com> skribis:

> On Thu 26 Feb 2015 16:30, David Kastrup <dak <at> gnu.org> writes:
>
>> Try ./test 2 2000 200
>
> I can reproduce the crash with your test case, thanks :) The patch below
> fixes the bug for me.  WDYT Ludovic?
>
> Andy
>
> commit db30120fc3a1727d8f221cbb014314f2babf841e
> Author: Andy Wingo <wingo <at> pobox.com>
> Date:   Thu Jun 23 11:47:42 2016 +0200
>
>     Fix race between SMOB marking and finalization
>     
>     * libguile/smob.c (clear_smobnum): New helper.
>       (finalize_smob): Re-set the smobnum to the "finalized smob" type
>       before finalizing.  Fixes #19883.
>       (scm_smob_prehistory): Pre-register a "finalized smob" type, which has
>       no mark procedure.

This LGTM, nice hack!

Do you think the test case could be added to the test suite somehow?

Thank you,
Ludo’.




Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Thu, 23 Jun 2016 13:14:01 GMT) Full text and rfc822 format available.

Message #67 received at 19883-done <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Andy Wingo <wingo <at> pobox.com>
Cc: 19883-done <at> debbugs.gnu.org, David Kastrup <dak <at> gnu.org>
Subject: Re: bug#19883: Correction for backtrace
Date: Thu, 23 Jun 2016 15:13:07 +0200
Andy Wingo <wingo <at> pobox.com> skribis:

> On Thu 23 Jun 2016 11:50, Andy Wingo <wingo <at> pobox.com> writes:
>
>> On Thu 26 Feb 2015 16:30, David Kastrup <dak <at> gnu.org> writes:
>>
>>> Try ./test 2 2000 200
>>
>> I can reproduce the crash with your test case, thanks :) The patch below
>> fixes the bug for me.  WDYT Ludovic?
>
> Here's a patch with a test case.

Perfect, you can ignore my previous message.  :-)

Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 22 Jul 2016 11:24:05 GMT) Full text and rfc822 format available.

bug unarchived. Request was from ludovic.courtes <at> inria.fr (Ludovic Courtès) to control <at> debbugs.gnu.org. (Sun, 09 Oct 2016 13:42:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Sun, 09 Oct 2016 18:09:03 GMT) Full text and rfc822 format available.

Message #74 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: Artyom Poptsov <poptsov.artyom <at> gmail.com>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 19883 <at> debbugs.gnu.org, guile-devel <guile-devel <at> gnu.org>
Subject: Re: The “finalized” SMOB type
Date: Sun, 09 Oct 2016 10:51:25 +0300
[Message part 1 (text/plain, inline)]
Hello Ludovic,

thanks for pointing that out.  IIRC, I saw this kind of errors during
testing but I thought I fixed them.  What Guile-SSH version do you use?

Here's 'free_session' procedure from Guile-SSH version 0.10.0:

--8<---------------cut here---------------start------------->8---
size_t
free_session (SCM session_smob)
{
  if (! SCM_SMOB_PREDICATE (session_tag, session_smob))
    {
      _ssh_log (SSH_LOG_FUNCTIONS, "free_session", "%s", "already freed");
      return 0;
    }
  struct session_data *data = _scm_to_session_data (session_smob);

  [...]

  return 0;
}
--8<---------------cut here---------------end--------------->8---

As you can see, there's additional smob check before accessing the
smob's data.

Please let me know if the problem manifest itself in the latest
Guile-SSH version.

Thanks!

- Artyom

-- 
Artyom V. Poptsov <poptsov.artyom <at> gmail.com>;  GPG Key: 0898A02F
Home page: http://poptsov-artyom.narod.ru/
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guile <at> gnu.org:
bug#19883; Package guile. (Sun, 09 Oct 2016 22:30:02 GMT) Full text and rfc822 format available.

Message #77 received at 19883 <at> debbugs.gnu.org (full text, mbox):

From: Artyom Poptsov <poptsov.artyom <at> gmail.com>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 19883 <at> debbugs.gnu.org, David Kastrup <dak <at> gnu.org>,
 guile-devel <guile-devel <at> gnu.org>
Subject: Re: The “finalized” SMOB type
Date: Sun, 09 Oct 2016 22:00:36 +0300
[Message part 1 (text/plain, inline)]
Hello Ludovic and David,

thanks for all the comments on the smob freeing callbacks.  Apparently I
misread the documentation so the bug was introduced.  I've changed the
callback procedures as was suggested; the patches included in Guile-SSH
version 0.10.1 released a few moments ago.

- Artyom

-- 
Artyom V. Poptsov <poptsov.artyom <at> gmail.com>;  GPG Key: 0898A02F
Home page: http://poptsov-artyom.narod.ru/
[signature.asc (application/pgp-signature, inline)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 07 Nov 2016 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 7 years and 164 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.