sched: improve infinite loop detection
The "infinite loop in scheduling" fatal error was observed on a system running out of memory. Presumably, memory thrashing slowed the process down so much that the dispatching loop could not break, even with a single server polled at a 16-second interval.

To allow recovery in such a case, raise the error only when more than 20 timeouts have been handled and their rate is higher than 100 per second.

Reported-by: Jamie Gruener <jamie.gruener@biospatial.io>
parent fb7475bf59
commit 59e8b79034
1 changed file with 16 additions and 9 deletions
--- a/sched.c
+++ b/sched.c
@@ -498,10 +498,13 @@ SCH_RemoveTimeout(SCH_TimeoutID id)
 
 static void
 dispatch_timeouts(struct timespec *now) {
+  unsigned long n_done, n_entries_on_start;
   TimerQueueEntry *ptr;
   SCH_TimeoutHandler handler;
   SCH_ArbitraryArgument arg;
-  int n_done = 0, n_entries_on_start = n_timer_queue_entries;
+
+  n_entries_on_start = n_timer_queue_entries;
+  n_done = 0;
 
   while (1) {
     LCL_ReadRawTime(now);
@@ -526,16 +529,20 @@ dispatch_timeouts(struct timespec *now) {
     /* Increment count of timeouts handled */
     ++n_done;
 
-    /* If more timeouts were handled than there were in the timer queue on
-       start and there are now, assume some code is scheduling timeouts with
-       negative delays and abort. Make the actual limit higher in case the
-       machine is temporarily overloaded and dispatching the handlers takes
-       more time than was delay of a scheduled timeout. */
-    if (n_done > n_timer_queue_entries * 4 &&
-        n_done > n_entries_on_start * 4) {
+    /* If the number of dispatched timeouts is significantly larger than the
+       length of the queue on start and now, assume there is a bug causing
+       an infinite loop by constantly adding a timeout with a zero or negative
+       delay. Check the actual rate of timeouts to avoid false positives in
+       case the execution slowed down so much (e.g. due to memory thrashing)
+       that it repeatedly takes more time to handle the timeout than is its
+       delay. This is a safety mechanism intended to stop a full-speed flood
+       of NTP requests due to a bug in the NTP polling. */
+
+    if (n_done > 20 &&
+        n_done > 4 * MAX(n_timer_queue_entries, n_entries_on_start) &&
+        fabs(UTI_DiffTimespecsToDouble(now, &last_select_ts_raw)) / n_done < 0.01)
       LOG_FATAL("Possible infinite loop in scheduling");
-    }
   }
 }
 
 /* ================================================== */
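The effect of the new condition can be checked in isolation. The following is a minimal sketch, not code from chrony: `old_check` and `new_check` are hypothetical helper names, the queue lengths and the elapsed time are passed in as plain parameters rather than read from the scheduler state, and the absolute value is computed by hand instead of with `fabs` so the sketch needs no math library.

```c
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical helper mirroring the removed condition: fatal as soon as the
   number of handled timeouts exceeds 4x both queue lengths. */
static int
old_check(unsigned long n_done, unsigned long n_now, unsigned long n_start)
{
  return n_done > n_now * 4 && n_done > n_start * 4;
}

/* Hypothetical helper mirroring the added condition. elapsed is the time in
   seconds since the last select() returned; it stands in for
   UTI_DiffTimespecsToDouble(now, &last_select_ts_raw) in the patch. */
static int
new_check(unsigned long n_done, unsigned long n_now, unsigned long n_start,
          double elapsed)
{
  /* Manual absolute value, used here in place of fabs() */
  double dt = elapsed < 0.0 ? -elapsed : elapsed;

  return n_done > 20 &&
         n_done > 4 * MAX(n_now, n_start) &&
         dt / n_done < 0.01;  /* under 10 ms per timeout, i.e. over 100/s */
}
```

With a single queue entry, 25 timeouts handled over 400 seconds (one every 16 seconds, as in the reported thrashing case) satisfy `old_check` but not `new_check`, because the average interval is far above 10 ms. The same 25 timeouts handled within 1 ms satisfy both, which is the full-speed loop the check is meant to stop.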