Don't leave dangling pointers to timer queue entries when they are
freed in the scheduler finalization in case some code tried to remove
a timer later.
Fixes: 6ea1082a72 ("sched: free timer blocks on exit")
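A minimal sketch of the idea, not the actual chrony code. The names
TimerEntry, queue_head, add_timer(), remove_timer() and finalise_timers()
are hypothetical. Finalization frees the entries and then resets the
pointer that referred to them, so a later removal walks an empty queue
instead of freed memory.

```c
#include <stdlib.h>

/* Hypothetical timer queue entry; the real structure differs */
typedef struct TimerEntry {
  unsigned int id;
  struct TimerEntry *next;
} TimerEntry;

static TimerEntry *queue_head = NULL;

/* Add a timer to the front of the queue; returns 0 if allocation fails */
static int add_timer(unsigned int id)
{
  TimerEntry *e = malloc(sizeof *e);

  if (!e)
    return 0;
  e->id = id;
  e->next = queue_head;
  queue_head = e;
  return 1;
}

/* Remove a timer by ID; returns 1 if it was found */
static int remove_timer(unsigned int id)
{
  TimerEntry **p, *e;

  for (p = &queue_head; *p; p = &(*p)->next) {
    if ((*p)->id == id) {
      e = *p;
      *p = e->next;
      free(e);
      return 1;
    }
  }
  return 0;
}

/* Free all entries and leave no pointer to the freed memory behind */
static void finalise_timers(void)
{
  TimerEntry *e, *next;

  for (e = queue_head; e; e = next) {
    next = e->next;
    free(e);
  }
  queue_head = NULL;
}
```

With the head reset, a remove_timer() call made after finalise_timers()
simply reports that the timer was not found.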
The "infinite loop in scheduling" fatal error was observed on a system
running out of memory. Presumably, the execution of the process slowed
down due to memory thrashing so much that the dispatching loop wasn't
able to break with a single server polled at a 16-second interval.
To allow recovery in such a case, raise the error only when more than
20 timeouts have been handled at a rate higher than 100 per second.
Reported-by: Jamie Gruener <jamie.gruener@biospatial.io>
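The combined condition could look roughly like this. It is only a sketch:
the function name and the way the elapsed time is obtained are
assumptions, only the two thresholds come from the description above.

```c
#include <assert.h>

#define MIN_HANDLED 20     /* more than this many timeouts handled */
#define MAX_RATE    100.0  /* and more than this many per second */

/* Returns 1 if the dispatching looks like an infinite loop.
   n_handled: timeouts handled in the current dispatching
   elapsed:   seconds spent in the dispatching so far */
static int looks_like_infinite_loop(int n_handled, double elapsed)
{
  assert(elapsed >= 0.0);

  /* Comparing against elapsed * rate avoids dividing by zero */
  return n_handled > MIN_HANDLED && n_handled > elapsed * MAX_RATE;
}
```

A slow process that needs 16 seconds to handle 1000 timeouts stays below
100 per second and is allowed to recover.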
Update the monotonic time before the timestamps are corrected for
unexpected jumps, e.g. due to the computer being suspended and resumed,
and switch to the raw timestamps. This should allow the NTS refresh
interval to better follow real time, but it will not be corrected for
a frequency offset if the clock is not synchronized (e.g. with -x).
Measure the interval since the start in order to provide a monotonic
time for periodic tasks that don't use timers, such as driftfile updates
and key refresh. Return the interval as a double, but keep an integer
remainder that limits the precision to 0.01 second to avoid issues
with very small increments in a long-running process.
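One way to keep such a remainder is sketched below. The struct and
function names are hypothetical and the real code may track the
remainder differently. Only whole hundredths of a second are added to
the double, the rest is carried over as an integer.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_STEP 10000000  /* 0.01 second in nanoseconds */

struct MonoClock {
  double seconds;        /* accumulated time in 0.01 s steps */
  int64_t remainder_ns;  /* part not yet added, always < 0.01 s */
};

/* Advance the clock by a non-negative interval in nanoseconds */
static void mono_advance(struct MonoClock *c, int64_t delta_ns)
{
  int64_t ns, steps;

  assert(delta_ns >= 0);

  ns = c->remainder_ns + delta_ns;
  steps = ns / NS_PER_STEP;
  c->remainder_ns = ns % NS_PER_STEP;

  /* The double only ever grows by multiples of 0.01 */
  c->seconds += steps * 0.01;
}
```

Three updates of 4 ms each add a single 0.01 s step and leave 2 ms in
the remainder, so small increments are not lost when the accumulated
value is large.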
Before dispatching a handler, check if it is still valid. This allows a
handler to remove itself when a descriptor has two different events at
the same time.
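A sketch of the check, assuming a handler table indexed by descriptor.
All names here (FileEntry, add_handler(), remove_handler(),
dispatch_fd(), one_shot()) and the event constants are made up for the
example and are not the scheduler's API.

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_INPUT  1
#define EVENT_OUTPUT 2
#define MAX_FDS 16

typedef void (*FileHandler)(int fd, int event, void *arg);

struct FileEntry {
  FileHandler handler;  /* NULL when no handler is registered */
  void *arg;
};

static struct FileEntry entries[MAX_FDS];

static void add_handler(int fd, FileHandler handler, void *arg)
{
  assert(fd >= 0 && fd < MAX_FDS);
  entries[fd].handler = handler;
  entries[fd].arg = arg;
}

static void remove_handler(int fd)
{
  assert(fd >= 0 && fd < MAX_FDS);
  entries[fd].handler = NULL;
  entries[fd].arg = NULL;
}

/* Dispatch all ready events of one descriptor; returns number of calls */
static int dispatch_fd(int fd, int ready_events)
{
  static const int order[] = { EVENT_INPUT, EVENT_OUTPUT };
  int calls = 0;
  size_t i;

  for (i = 0; i < sizeof order / sizeof order[0]; i++) {
    if (!(ready_events & order[i]))
      continue;
    /* A previous call may have removed the handler */
    if (!entries[fd].handler)
      break;
    entries[fd].handler(fd, order[i], entries[fd].arg);
    calls++;
  }
  return calls;
}

/* Example handler which counts its calls and removes itself */
static void one_shot(int fd, int event, void *arg)
{
  (void)event;
  ++*(int *)arg;
  remove_handler(fd);
}
```

When both events are ready at once, one_shot() is called for the first
one only, because the second lookup finds no handler.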
It was never used for anything and messages in debug output already
include filenames, which can be easily grepped if there is a need
to see log messages only from a particular file.
Extend the random value which is included in the calculation of the
delay from 16 to 32 bits. This makes scheduling of NTP transmissions
random to one microsecond for polling intervals up to 17.
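The effect of the width can be checked with a short calculation. This is
a sketch only: the function names are hypothetical and the randomness
factor of 0.02 used in the note below is an assumed value, not one stated
in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a delay by a random factor in [1, 1 + randomness).
   rnd is a uniformly distributed 32-bit value. */
static double randomise_delay(double delay, double randomness, uint32_t rnd)
{
  assert(delay >= 0.0 && randomness >= 0.0);
  return delay * (1.0 + randomness * (rnd / 4294967296.0));
}

/* Smallest step (in seconds) of the random part of the delay when the
   random value has the given number of bits */
static double random_step(double delay, double randomness, int bits)
{
  double range = 1.0;
  int i;

  for (i = 0; i < bits; i++)
    range *= 2.0;
  return delay * randomness / range;
}
```

For a polling interval of 17 (2^17 = 131072 seconds) and an assumed
randomness of 0.02, a 32-bit value gives steps of about 0.6 microsecond,
while a 16-bit value gives steps of about 40 milliseconds.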
Replace struct timeval with struct timespec as the main data type for
timestamps. This will allow the NTP code to work with timestamps in
nanosecond resolution.
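For code that still has to deal with struct timeval at its boundaries,
the conversion is straightforward. The helper names below are
hypothetical and shown only to illustrate the difference in resolution.

```c
#include <assert.h>
#include <sys/time.h>
#include <time.h>

/* Microseconds to nanoseconds; no information is lost */
static struct timespec timeval_to_timespec(const struct timeval *tv)
{
  struct timespec ts;

  assert(tv->tv_usec >= 0 && tv->tv_usec < 1000000);
  ts.tv_sec = tv->tv_sec;
  ts.tv_nsec = tv->tv_usec * 1000L;
  return ts;
}

/* Nanoseconds to microseconds; the sub-microsecond part is dropped */
static struct timeval timespec_to_timeval(const struct timespec *ts)
{
  struct timeval tv;

  assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);
  tv.tv_sec = ts->tv_sec;
  tv.tv_usec = ts->tv_nsec / 1000;
  return tv;
}
```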
Instead of copying a prepared fd_set to the fd_set used by select(),
fill it from scratch according to the array of file handlers before each
select() call. This should make the code simpler and save some memory
when other events are supported.
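Filling the sets from scratch could look like this. It is a sketch under
the assumption that the handler array is indexed by descriptor and
stores an event mask. FileEntry, fill_fd_sets() and the EVENT_*
constants are illustrative names, and the write set is included only to
show how other events would fit in.

```c
#include <assert.h>
#include <sys/select.h>

#define EVENT_INPUT  1
#define EVENT_OUTPUT 2

struct FileEntry {
  int events;  /* 0 means the slot is unused */
};

/* Fill the sets according to the array of handlers and return the
   nfds argument for select() */
static int fill_fd_sets(const struct FileEntry *entries, int n,
                        fd_set *rd, fd_set *wr)
{
  int fd, nfds = 0;

  assert(entries && rd && wr);

  FD_ZERO(rd);
  FD_ZERO(wr);

  for (fd = 0; fd < n && fd < FD_SETSIZE; fd++) {
    if (!entries[fd].events)
      continue;
    if (entries[fd].events & EVENT_INPUT)
      FD_SET(fd, rd);
    if (entries[fd].events & EVENT_OUTPUT)
      FD_SET(fd, wr);
    nfds = fd + 1;
  }
  return nfds;
}
```

Only the array has to be kept between calls. A separate prepared copy of
each fd_set is not needed.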
Replace the SCH_*InputFileHandler() functions with the more general
SCH_*FileHandler() functions, which take the events as a new parameter
and will later support other file events, e.g. file ready for output
and exception.
The file handlers have two new parameters: file descriptor and event.
Don't require the scheduler to be initialized in SCH_QuitProgram().
This fixes a crash when a signal is received between scheduler
finalization and chronyd exit.
Use UTI_GetRandomBytes() instead of random() to calculate the random
part of the timeout. This was the only remaining use of random() in the
code and the srandom() call can be removed.
To avoid problems in the very unlikely case where a timeout is so long
and new IDs are allocated so frequently that they would have a chance
to overflow and catch up with it, make sure before returning a new ID
that it's not currently in use.
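A sketch of such an allocator, assuming a small fixed table of active
IDs in which zero marks a free slot. The table, its size and the
function names are hypothetical. The real queue is searched differently.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TIMERS 8

static uint32_t active_ids[MAX_TIMERS];  /* 0 = free slot */
static uint32_t next_id = 0;

static int id_in_use(uint32_t id)
{
  int i;

  for (i = 0; i < MAX_TIMERS; i++) {
    if (active_ids[i] == id)
      return 1;
  }
  return 0;
}

/* Return a new non-zero ID which is not used by any active timer.
   The counter is allowed to wrap around. */
static uint32_t get_new_id(void)
{
  do {
    next_id++;
  } while (next_id == 0 || id_in_use(next_id));

  return next_id;
}
```

With at most MAX_TIMERS IDs active the loop always terminates, and a
long-lived timer keeps its ID to itself even after the counter wraps.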
A timeout ID of zero can now be safely used to indicate that the timer is
not running. Remove the extra timer_running variables that were
necessary to track that.
Abort when the system time gets so close to the end of 32-bit time_t
that timeouts added by delay start to overflow. This is an addition to
the loop detector in dispatch_timeouts().
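The overflow condition can be expressed without overflowing itself. This
is an illustrative sketch with a hypothetical function name. The actual
check and its margin may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if now + delay would not fit in a signed 32-bit time_t.
   Both values are in seconds and passed as 64-bit integers, so the
   check itself cannot overflow. */
static int timeout_overflows_32bit(int64_t now_sec, int64_t delay_sec)
{
  assert(now_sec >= 0 && now_sec <= INT32_MAX);
  assert(delay_sec >= 0);

  return delay_sec > INT32_MAX - now_sec;
}
```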
To detect forward time jumps, use a timestamp made before calling
select() instead of the first timeout in the queue. Also, if the timeout
value is modified by select() (e.g. on Linux) use it to get a more
accurate estimate of the elapsed time.
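A simplified sketch of the estimate follows. The function name, the
use of doubles and the tolerance parameter are assumptions made for the
example. The real check works on the scheduler's own timestamp types.

```c
#include <assert.h>

/* Estimate an unexpected forward jump over one select() call.
   before, after: timestamps (in seconds) made around select()
   requested:     timeout passed to select()
   modified:      non-zero if select() updated the timeout
   remaining:     timeout left by select(), used only if modified
   tolerance:     assumed allowance for scheduling latency
   Returns the size of the jump, or 0 if none is detected. */
static double forward_jump(double before, double after, double requested,
                           int modified, double remaining, double tolerance)
{
  double max_elapsed, excess;

  assert(requested >= 0.0 && tolerance >= 0.0);

  /* Without the updated timeout, assume the whole timeout elapsed */
  max_elapsed = modified ? requested - remaining : requested;

  excess = (after - before) - max_elapsed;
  return excess > tolerance ? excess : 0.0;
}
```

If select() returns after its 1-second timeout but the clock shows that
an hour has passed, the function reports a jump of 3599 seconds.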
With cmdport 0 and port 0, it's now possible that there is no descriptor
watched or timer running, i.e. chronyd doing nothing and only waiting to
be terminated. Replace the assertion with LOG_FATAL to exit properly.
With special reference update modes, the timeout handlers may add or
remove file descriptors from the read fd set, so it needs to be copied
for the select() call after they are dispatched. Also, they can now request
quit, so the exit flag needs to be checked before select() to avoid
hanging.
It could be triggered by delayed name resolving, as it adds multiple new
timeouts which can be called in the same dispatching if the DNS responses
are slower than the initial delay and sampling separation.
Also compare the number of dispatched events with the current number of
timeouts.