GNU bug report logs - #37237
mcron randomly stops running jobs

Previous Next

Package: guix;

Reported by: Robert Vollmert <rob <at> vllmrt.net>

Date: Fri, 30 Aug 2019 17:33:02 UTC

Severity: important

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 37237 in the body.
You can then email your comments to 37237 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#37237; Package guix. (Fri, 30 Aug 2019 17:33:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Robert Vollmert <rob <at> vllmrt.net>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Fri, 30 Aug 2019 17:33:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Robert Vollmert <rob <at> vllmrt.net>
To: bug-guix <at> gnu.org
Subject: mcron randomly stops running jobs
Date: Fri, 30 Aug 2019 19:31:44 +0200
This is the third time I’ve seen this, and this time I’m
sure that nothing else happened.

I have numerous mcron jobs configured, many of which run every 15 minutes,
and one which runs once a minute. Just now, at 19:22, I noticed the minutely
cron job didn’t seem to run, and saw that one every-15-minute job which always
logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
running again. There is no timezone confusion either: That every-15-minute job
logged its first run at 19:30.





Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Fri, 21 Feb 2020 20:20:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#37237; Package guix. (Sun, 23 Feb 2020 18:31:01 GMT) Full text and rfc822 format available.

Message #10 received at 37237 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Robert Vollmert <rob <at> vllmrt.net>
Cc: 37237 <at> debbugs.gnu.org
Subject: Re: bug#37237: mcron randomly stops running jobs
Date: Sun, 23 Feb 2020 19:30:26 +0100
Hi,

Robert Vollmert <rob <at> vllmrt.net> skribis:

> I have numerous mcron jobs configured, many of which run every 15 minutes,
> and one which runs once a minute. Just now, at 19:22, I noticed the minutely
> cron job didn’t seem to run, and saw that one every-15-minute job which always
> logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
> running again. There is no timezone confusion either: That every-15-minute job
> logged its first run at 19:30.

I’ve been bitten by this problem several times already.  I suspect the
patch at
<https://lists.gnu.org/archive/html/bug-mcron/2020-02/msg00004.html>
helps with zombies that mcron tends to leave behind, but I’m not sure it
helps with this “stops running jobs” problem (I’d need to grab a C stack
trace of mcron when that happens.)

Anyway, let’s get this fixed!

Ludo’.




Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Wed, 18 Mar 2020 11:11:02 GMT) Full text and rfc822 format available.

Notification sent to Robert Vollmert <rob <at> vllmrt.net>:
bug acknowledged by developer. (Wed, 18 Mar 2020 11:11:02 GMT) Full text and rfc822 format available.

Message #15 received at 37237-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Robert Vollmert <rob <at> vllmrt.net>
Cc: 37237-done <at> debbugs.gnu.org
Subject: Re: bug#37237: mcron randomly stops running jobs
Date: Wed, 18 Mar 2020 12:10:27 +0100
Hi,

Robert Vollmert <rob <at> vllmrt.net> skribis:

> I have numerous mcron jobs configured, many of which run every 15 minutes,
> and one which runs once a minute. Just now, at 19:22, I noticed the minutely
> cron job didn’t seem to run, and saw that one every-15-minute job which always
> logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
> running again. There is no timezone confusion either: That every-15-minute job
> logged its first run at 19:30.

When mcron is stuck, it’s apparently stuck in ‘pthread_join’, waiting
for the finalization thread to terminate:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007f7f730d8478 in __GI___pthread_timedjoin_ex (threadid=140185229293312, thread_return=thread_return <at> entry=0x0, abstime=abstime <at> entry=0x0, 
    block=block <at> entry=true) at pthread_join_common.c:84
#1  0x00007f7f730d82dc in __pthread_join (threadid=<optimized out>, thread_return=thread_return <at> entry=0x0) at pthread_join.c:24
#2  0x00007f7f731d7508 in stop_finalization_thread () at finalizers.c:265
#3  0x00007f7f731d7729 in scm_i_finalizer_pre_fork () at finalizers.c:290
#4  0x00007f7f73250426 in scm_fork () at posix.c:1220
#5  0x00007f7f7324979f in vm_regular_engine (thread=0x7f7f6acb79d0, vp=0x7f7f6b299f30, registers=0xafaf, resume=1930265720) at vm-engine.c:786
#6  0x00007f7f7324bfd7 in scm_call_n (proc=#<program 7f7f6aeaf5d0>, argv=argv <at> entry=0x7ffd9b10bed8, nargs=nargs <at> entry=1) at vm.c:1260
#7  0x00007f7f731cdf58 in scm_call_1 (proc=<optimized out>, arg1=<optimized out>) at eval.c:485
#8  0x000000000040130c in inner_main ()
#9  0x00007f7f731e5cfd in invoke_main_func (body_data=0x7ffd9b10c3b0) at init.c:341
#10 0x00007f7f731c835a in c_body (d=0x7ffd9b10c2f0) at continuations.c:422
#11 0x00007f7f7324979f in vm_regular_engine (thread=0x7f7f6acb79d0, vp=0x7f7f6b299f30, registers=0xafaf, resume=1930265720) at vm-engine.c:786
#12 0x00007f7f7324bfd7 in scm_call_n (proc=proc <at> entry=#<smob catch-closure 7f7f6b2a1460>, argv=argv <at> entry=0x0, nargs=nargs <at> entry=0) at vm.c:1260
#13 0x00007f7f731cdf39 in scm_call_0 (proc=proc <at> entry=#<smob catch-closure 7f7f6b2a1460>) at eval.c:479
#14 0x00007f7f7323ac12 in catch (tag=tag <at> entry=#t, thunk=#<smob catch-closure 7f7f6b2a1460>, 
    handler=<error reading variable: ERROR: Cannot access memory at address 0x73776b3469667269>0x7f7f6b2a1420, 
    pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x73776b3469667269>0x7f7f6b2a1400) at throw.c:137
#15 0x00007f7f7323aef5 in scm_catch_with_pre_unwind_handler (key=key <at> entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>)
    at throw.c:254
#16 0x00007f7f7323b0bf in scm_c_catch (tag=tag <at> entry=#t, body=body <at> entry=0x7f7f731c8350 <c_body>, body_data=body_data <at> entry=0x7ffd9b10c2f0, 
    handler=handler <at> entry=0x7f7f731c85e0 <c_handler>, handler_data=handler_data <at> entry=0x7ffd9b10c2f0, 
    pre_unwind_handler=pre_unwind_handler <at> entry=0x7f7f731c8440 <pre_unwind_handler>, pre_unwind_handler_data=0x7f7f6b2939a0) at throw.c:377
#17 0x00007f7f731c8940 in scm_i_with_continuation_barrier (body=body <at> entry=0x7f7f731c8350 <c_body>, body_data=body_data <at> entry=0x7ffd9b10c2f0, 
    handler=handler <at> entry=0x7f7f731c85e0 <c_handler>, handler_data=handler_data <at> entry=0x7ffd9b10c2f0, 
    pre_unwind_handler=pre_unwind_handler <at> entry=0x7f7f731c8440 <pre_unwind_handler>, pre_unwind_handler_data=0x7f7f6b2939a0) at continuations.c:360
#18 0x00007f7f731c89d5 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#19 0x00007f7f7323984c in with_guile (base=base <at> entry=0x7ffd9b10c358, data=data <at> entry=0x7ffd9b10c380) at threads.c:661
#20 0x00007f7f73127a68 in GC_call_with_stack_base (fn=fn <at> entry=0x7f7f73239800 <with_guile>, arg=arg <at> entry=0x7ffd9b10c380) at misc.c:1941
#21 0x00007f7f73239b68 in scm_i_with_guile (dynamic_state=<optimized out>, data=data <at> entry=0x7ffd9b10c380, func=func <at> entry=0x7f7f731e5ce0 <invoke_main_func>)
    at threads.c:704
#22 scm_with_guile (func=func <at> entry=0x7f7f731e5ce0 <invoke_main_func>, data=data <at> entry=0x7ffd9b10c3b0) at threads.c:710
#23 0x00007f7f731e5e92 in scm_boot_guile (argc=12, argv=0x7ffd9b10c4f8, main_func=0x4012a0 <inner_main>, closure=0x0) at init.c:324
#24 0x000000000040118b in main ()
--8<---------------cut here---------------end--------------->8---

The finalization thread lives its life as if it hadn’t received the
“stop” message in its pipe:

--8<---------------cut here---------------start------------->8---
(gdb) thread 17
[Switching to thread 17 (Thread 0x7f7f6acb7700 (LWP 44975))]
#0  0x00007f7f730e0344 in __libc_read (fd=5, buf=buf <at> entry=0x7f7f6acb6a40, nbytes=nbytes <at> entry=1) at ../sysdeps/unix/sysv/linux/read.c:26
26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0  0x00007f7f730e0344 in __libc_read (fd=5, buf=buf <at> entry=0x7f7f6acb6a40, nbytes=nbytes <at> entry=1) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007f7f731d7497 in read_finalization_pipe_data (data=0x7f7f6acb6a40) at finalizers.c:199
#2  0x00007f7f7312d503 in GC_do_blocking_inner (data=0x7f7f6acb6a00 "\200t\035s\177\177", context=<optimized out>) at pthread_support.c:1362
#3  0x00007f7f73121d62 in GC_with_callee_saves_pushed (fn=0x7f7f7312d4c0 <GC_do_blocking_inner>, arg=arg <at> entry=0x7f7f6acb6a00 "\200t\035s\177\177") at mach_dep.c:328
#4  0x00007f7f73127a9c in GC_do_blocking (fn=fn <at> entry=0x7f7f731d7480 <read_finalization_pipe_data>, client_data=client_data <at> entry=0x7f7f6acb6a40) at misc.c:2053
#5  0x00007f7f73239bba in scm_without_guile (func=0x7f7f731d7480 <read_finalization_pipe_data>, data=0x7f7f6acb6a40) at threads.c:722
#6  0x00007f7f731d784b in finalization_thread_proc (unused=<optimized out>) at finalizers.c:212
#7  0x00007f7f731c835a in c_body (d=0x7f7f6acb6e50) at continuations.c:422
#8  0x00007f7f7324979f in vm_regular_engine (thread=0x5, vp=0x7f7f6b299ea0, registers=0x1, resume=1930298180) at vm-engine.c:786
#9  0x00007f7f7324bfd7 in scm_call_n (proc=proc <at> entry=#<smob catch-closure 7f7f6b22fb80>, argv=argv <at> entry=0x0, nargs=nargs <at> entry=0) at vm.c:1260
#10 0x00007f7f731cdf39 in scm_call_0 (proc=proc <at> entry=#<smob catch-closure 7f7f6b22fb80>) at eval.c:479
#11 0x00007f7f7323ac12 in catch (tag=tag <at> entry=#t, thunk=#<smob catch-closure 7f7f6b22fb80>, 
    handler=<error reading variable: ERROR: Cannot access memory at address 0x73776b3469667269>0x7f7f6b22f740, 
    pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x73776b3469667269>0x7f7f6b22f720) at throw.c:137
#12 0x00007f7f7323aef5 in scm_catch_with_pre_unwind_handler (key=key <at> entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>)
    at throw.c:254
#13 0x00007f7f7323b0bf in scm_c_catch (tag=tag <at> entry=#t, body=body <at> entry=0x7f7f731c8350 <c_body>, body_data=body_data <at> entry=0x7f7f6acb6e50, 
    handler=handler <at> entry=0x7f7f731c85e0 <c_handler>, handler_data=handler_data <at> entry=0x7f7f6acb6e50, 
    pre_unwind_handler=pre_unwind_handler <at> entry=0x7f7f731c8440 <pre_unwind_handler>, pre_unwind_handler_data=0x7f7f6b2939a0) at throw.c:377
#14 0x00007f7f731c8940 in scm_i_with_continuation_barrier (body=body <at> entry=0x7f7f731c8350 <c_body>, body_data=body_data <at> entry=0x7f7f6acb6e50, 
    handler=handler <at> entry=0x7f7f731c85e0 <c_handler>, handler_data=handler_data <at> entry=0x7f7f6acb6e50, 
    pre_unwind_handler=pre_unwind_handler <at> entry=0x7f7f731c8440 <pre_unwind_handler>, pre_unwind_handler_data=0x7f7f6b2939a0) at continuations.c:360
#15 0x00007f7f731c89d5 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#16 0x00007f7f7323984c in with_guile (base=base <at> entry=0x7f7f6acb6eb8, data=data <at> entry=0x7f7f6acb6ee0) at threads.c:661
#17 0x00007f7f73127a68 in GC_call_with_stack_base (fn=fn <at> entry=0x7f7f73239800 <with_guile>, arg=arg <at> entry=0x7f7f6acb6ee0) at misc.c:1941
#18 0x00007f7f73239b68 in scm_i_with_guile (dynamic_state=<optimized out>, data=<optimized out>, func=<optimized out>) at threads.c:704
#19 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:710
#20 0x00007f7f730d7015 in start_thread (arg=0x7f7f6acb7700) at pthread_create.c:486
#21 0x00007f7f7300891f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
--8<---------------cut here---------------end--------------->8---

This is on 2.2.6 and very likely the same issue as
<https://bugs.gnu.org/37757>: ‘finalization_thread_proc’ gets EINTR, by
chance ‘data.byte’ is zero, and so it keeps going instead of exiting.

Commit e002332b8a136638c2a4fc1a2f2a26541ce0211d moves mcron to 2.2.7,
which should fix this problem.  \o/

Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 15 Apr 2020 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 348 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.