Note that the only DNS plugin that actually emitted the FAILED signal was
NMDnsDnsmasq. Let's not handle restart, retry and rate-limiting in
NMDnsManager but in NMDnsDnsmasq itself.
There are three goals here:
(1) when dnsmasq (infrequently) crashes, we want to always keep
retrying. A random crash should be resolved automatically and
eventually dnsmasq should be working again.
Note that we cannot fully detect whether something is wrong anyway.
OK, we detect crashes, but if dnsmasq just becomes unresponsive, it's just
as broken. The point being: our ability to detect a non-working dnsmasq is limited.
(2) when dnsmasq keeps crashing all the time, rate-limit the retries.
Of course, at that point something is already seriously wrong,
but we shouldn't kill the system by respawning the process without rate
limiting.
(3) previously, when NMDnsManager noticed that the plugin was broken
(and rate-limiting kicked in), it would temporarily disable the plugin.
Basically, that meant writing the real name servers to /etc/resolv.conf
directly, instead of setting localhost. This partly conflicts with
(1), because we want to retry and recover automatically. What good
is it to notice a problem, resort to plain /etc/resolv.conf for a
short time, and then run into the same issues again? If something is really
broken, there is no way around involving the user to investigate and
fix the issue. Hence, we don't need to concern NMDnsManager with this either.
The only thing the manager notices is when the dnsmasq binary is not
available. In that case, update() fails right away, and the manager falls back
to configuring the name servers in /etc/resolv.conf directly.
Also, change the backoff time from 5 minutes to 1 minute (twice the
burst interval). There is no particularly strong reason for either
choice; I think that if the rate limit kicks in, something is
already so wrong that it doesn't matter either way. In any case, 60
seconds is still long enough not to overload the machine.
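A minimal sketch of that retry-with-backoff behavior, with made-up names
and a plain 60 second timer (the actual NMDnsDnsmasq code is structured
differently; this only illustrates goals (1) and (2)):

  #include <glib.h>

  /* Illustrative constants: retry a few times right away, then back off. */
  #define RATELIMIT_BURST    5
  #define RATELIMIT_INTERVAL 60 /* seconds, twice the burst interval */

  static guint restart_count = 0;
  static guint restart_timeout_id = 0;

  static void start_dnsmasq (void);

  static gboolean
  restart_after_backoff_cb (gpointer user_data)
  {
      restart_timeout_id = 0;
      restart_count = 0;
      start_dnsmasq ();
      return G_SOURCE_REMOVE;
  }

  static void
  on_dnsmasq_died (void)
  {
      /* goal (1): an infrequent, random crash is retried right away.
       * A complete implementation would also reset restart_count once
       * the process stays up for a while. */
      if (++restart_count <= RATELIMIT_BURST) {
          start_dnsmasq ();
          return;
      }
      /* goal (2): crashing all the time; delay the next attempt. */
      if (restart_timeout_id == 0)
          restart_timeout_id = g_timeout_add_seconds (RATELIMIT_INTERVAL,
                                                      restart_after_backoff_cb,
                                                      NULL);
  }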
Several points.
- We spawn the dnsmasq process directly. That has several downsides:
- The lifetime of the process is tied to NetworkManager's. When
stopping NetworkManager, we usually also stop dnsmasq. Or we keep
the process running, but later the process is no longer a child process
of NetworkManager and we can only kill it using the pidfile.
- We don't do special sandboxing of the dnsmasq process.
- Note that we want to ensure that only one dnsmasq process is running
at any time. We should track that in a single, global place. Note that NMDnsDnsmasq
is not a singleton: while there is only one instance active at any time,
the DNS plugin can be swapped (e.g. during SIGHUP). Hence, don't track the
process per NMDnsDnsmasq instance, but in a global variable "gl_pid".
- Usually, when NetworkManager quits, it also stops the dnsmasq process.
Previously, we would always try to terminate the process based on the
pidfile. That is wrong. Most of the time, NetworkManager spawned the
process itself, as a child process. Hence, the PID is known and NetworkManager
will get a signal when dnsmasq exits. The only moment when NetworkManager should
use the pidfile is the very first time, when checking whether to kill a previous
instance. That is: only once at the beginning, to kill instances that were
intentionally or unintentionally (crash) left running earlier.
This is now done by _gl_pid_kill_external().
- Previously, before starting a new dnsmasq instance we would kill a
possibly already running one, and block while waiting for the process to
disappear. We should never block. Especially since we afterwards start
the process in a non-blocking way, there is no reason to kill the
existing process in a blocking way. Starting dnsmasq is already
asynchronous for the most part, and so should be the killing of the
dnsmasq process.
- Drop GDBusProxy and only use GDBusConnection. It fully suffices.
- When we kill a dnsmasq instance, we actually don't have to wait at
all. That can happen fully in the background. The only peculiarity is that
when we start a new instance before the previous one is killed,
we must wait for the previous process to terminate first. Also, if
we are about to exit while killing the dnsmasq instance, we must register
nm_shutdown_wait_obj_*() to wait until the process is fully gone.
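The following rough sketch illustrates the points above; the helper names
are made up (the pidfile kill corresponds to _gl_pid_kill_external()), and
the actual NMDnsDnsmasq code has more bookkeeping:

  #include <glib.h>
  #include <signal.h>

  /* Only one dnsmasq may run at a time, so the PID is tracked globally,
   * not per NMDnsDnsmasq instance (the plugin object can be swapped,
   * e.g. on SIGHUP). */
  static GPid gl_pid = 0;

  /* Done only once at the beginning: kill an instance that an earlier
   * NetworkManager run left behind, based on the pidfile. Afterwards the
   * pidfile is not consulted again, because the child we spawn ourselves
   * is tracked via gl_pid and a child watch. */
  static void
  kill_stale_instance_from_pidfile (const char *pidfile)
  {
      char   *contents = NULL;
      gint64  pid;

      if (!g_file_get_contents (pidfile, &contents, NULL, NULL))
          return;
      pid = g_ascii_strtoll (contents, NULL, 10);
      g_free (contents);
      if (pid > 1)
          kill ((pid_t) pid, SIGTERM);
  }

  static void
  child_gone_cb (GPid pid, int wait_status, gpointer user_data)
  {
      /* The process is fully gone. This is also where a shutdown-wait
       * registration (nm_shutdown_wait_obj_*()) would be released if
       * NetworkManager is about to exit. */
      g_spawn_close_pid (pid);
      if (gl_pid == pid)
          gl_pid = 0;
  }

  /* Killing the tracked child never blocks: send SIGTERM and let the
   * child watch reap the process in the background (the child must have
   * been spawned with G_SPAWN_DO_NOT_REAP_CHILD). */
  static void
  kill_child_async (void)
  {
      if (gl_pid <= 0)
          return;
      kill (gl_pid, SIGTERM);
      g_child_watch_add (gl_pid, child_gone_cb, NULL);
  }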
We only have two real DNS plugins: "dnsmasq" and "systemd-resolved" (the "unbound"
plugin is very incomplete and should eventually be dropped).
Of these two, only "dnsmasq" spawns a child process. A lot of the logic
for that lives in the parent class NMDnsPlugin, with the intention that
the logic be reusable.
However:
- We are unlikely to add more DNS plugins, especially because
systemd-resolved seems to be the way forward.
- If we happen to add more plugins, then probably NetworkManager
should not spawn the process itself. That causes problems with
restarting the service. Rather, we should let the service manager
handle the lifetime of such "child" processes. Aside from separating
the lifetime of the DNS plugin process from NetworkManager's,
this would also allow sandboxing NetworkManager and the DNS plugin
differently. Currently, NetworkManager itself might need
capabilities only to pass them on to the DNS plugin, or (more likely)
NetworkManager would want to drop additional capabilities for the
DNS plugin (which we would rather not implement ourselves, since that
already seems to be the job of the service manager).
- The current implementation is far from beautiful. For example,
it does synchronous (blocking) killing of the running process
from the PID file, and it uses PID files. This is not something
we would want to reuse for other plugins. Also, note that
dnsmasq is already spawned asynchronously (of course).
Hence, we should also kill it asynchronously, but that is complicated
by having the logic split between two different classes while
providing an abstract API between the two.
Move the code to NMDnsDnsmasq. This is the only place that cares about
it. This also makes it clearer what is happening, because the lifetime
handling of the child process is all in one place.
For logging, if the plugin's update fails, it should return a reason
that we can log.
Note that both the dnsmasq and systemd-resolved plugins do the update
asynchronously (of course). Hence, they usually never fail right away,
and there isn't really a possibility to handle the failure later. Still,
we should print something sensible, and for that we need information
about what went wrong.
The plugin name and whether a plugin is caching only depend on the type;
they do not require a virtual function where the type would decide based
on other factors.
Convert the virtual functions into fields of the class.
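As a sketch of the idea (field names approximate, not the exact
NMDnsPluginClass declaration):

  #include <glib-object.h>

  /* The per-type constants become plain fields of the class struct
   * instead of virtual functions. */
  typedef struct {
      GObjectClass parent;

      const char *plugin_name; /* e.g. "dnsmasq" or "systemd-resolved" */
      gboolean    is_caching;  /* whether the plugin provides a local cache */
  } NMDnsPluginClass;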
We no longer add these. If you use Emacs, configure it yourself.
Also, due to our "smart-tab" usage the editor does a subpar job
handling our tabs anyway. On the upside, every user can choose
whatever tab width they prefer. If "smart tabs" are used properly
(as we do), every tab width will work.
No manual changes, just ran commands:
F=($(git grep -l -e '-\*-'))
sed '1 { /\/\* *-\*- *[mM]ode.*\*\/$/d }' -i "${F[@]}"
sed '1,4 { /^\(#\|--\|dnl\) *-\*- [mM]ode/d }' -i "${F[@]}"
Check remaining lines with:
git grep -e '-\*-'
The ultimate purpose of this is to clean up our files and eventually use
SPDX license identifiers. For that, first get rid of the boilerplate lines.
I also like this because it's non-obvious that subscription IDs from
GDBusConnection are "guint" (contrary to signal handler IDs, which are
"gulong"). So, by using this API you get a compiler error when using the
wrong type.
In the past, switching to nm_clear_g_signal_handler() uncovered
multiple bugs where the wrong type was used to hold the ID.
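For illustration, a minimal sketch of such a type-safe "clear" helper
(the helper in our tree may be named and structured differently):

  #include <gio/gio.h>

  /* Unsubscribe and clear a D-Bus signal subscription ID. Taking a
   * "guint *" (and not a "gulong *") is what lets the compiler catch
   * mix-ups with signal handler IDs. */
  static inline gboolean
  example_clear_dbus_signal_subscription (GDBusConnection *connection,
                                          guint           *subscription_id)
  {
      if (subscription_id && *subscription_id != 0) {
          g_dbus_connection_signal_unsubscribe (connection, *subscription_id);
          *subscription_id = 0;
          return TRUE;
      }
      return FALSE;
  }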
We will use the D-Bus connection of our NMDBusManager singleton more.
Use a macro.
- it's shorter to type and it's one distinct word.
- the name indicates what this is: the main D-Bus connection singleton.
By searching for this name we can find all users that care about using
this singleton.
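As a sketch (the macro and getter names here are assumptions, not
necessarily what the code base ends up using):

  /* One distinct, grep-able name for the main D-Bus connection singleton;
   * the underlying getter is assumed for illustration. */
  #define NM_MAIN_DBUS_CONNECTION \
      (nm_dbus_manager_get_dbus_connection (nm_dbus_manager_get ()))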
From the files under "shared/nm-utils" we build an internal library
that provides glib-based helper utilities.
Move the files of that basic library to a new subdirectory
"shared/nm-glib-aux" and rename the helper library "libnm-core-base.la"
to "libnm-glib-aux.la".
Reasons:
- the name "utils" is overused in our code-base. Everything's an
"utils". Give this thing a more distinct name.
- there were additional files under "shared/nm-utils" which are not
part of this internal library "libnm-utils-base.la". All the files
that are part of this library should be together in the same
directory, and files that are not should not be there.
- the new name should better convey what this library is and what it isn't:
it's a set of utilities and helper functions that extend glib with
functionality that we commonly need.
There are still some files left under "shared/nm-utils". They have less
of a unifying purpose to be in their own directory, so I leave them there
for now. But at least they are separate from "shared/nm-glib-aux",
which has a very clear purpose.
(cherry picked from commit 80db06f768)
The proxy does nothing for us except add overhead.
We can directly subscribe to "NameOwnerChanged" signals on the
GDBusConnection. Also, instead of asynchronously creating the
GDBusProxy, asynchronously call "GetNameOwner". That's what the
proxy does anyway.
GDBusConnection is actually a decent API. We don't need another layer on
top of it for functionality that we don't use.
Also, don't use G_BUS_TYPE_SYSTEM, but use the GDBusConnection that
the bus manager also uses. For all practical purposes, that is the
connection we want to use in NMDnsSystemdResolved as well.
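A sketch of what replaces the proxy; the callback names are made up and
the surrounding NMDnsSystemdResolved code is more involved:

  #include <gio/gio.h>

  static void name_owner_changed_cb (GDBusConnection *connection,
                                     const char      *sender_name,
                                     const char      *object_path,
                                     const char      *interface_name,
                                     const char      *signal_name,
                                     GVariant        *parameters,
                                     gpointer         user_data);

  /* Subscribe to "NameOwnerChanged" (filtered on arg0 == name) and ask
   * for the current owner with "GetNameOwner", which is what GDBusProxy
   * would have done for us. Returns the subscription ID. */
  static guint
  watch_name_owner (GDBusConnection     *connection,
                    const char          *name,
                    GAsyncReadyCallback  got_owner_cb,
                    gpointer             user_data)
  {
      guint id;

      id = g_dbus_connection_signal_subscribe (connection,
                                               "org.freedesktop.DBus",
                                               "org.freedesktop.DBus",
                                               "NameOwnerChanged",
                                               "/org/freedesktop/DBus",
                                               name, /* arg0 filter */
                                               G_DBUS_SIGNAL_FLAGS_NONE,
                                               name_owner_changed_cb,
                                               user_data,
                                               NULL);

      g_dbus_connection_call (connection,
                              "org.freedesktop.DBus",
                              "/org/freedesktop/DBus",
                              "org.freedesktop.DBus",
                              "GetNameOwner",
                              g_variant_new ("(s)", name),
                              G_VARIANT_TYPE ("(s)"),
                              G_DBUS_CALL_FLAGS_NONE,
                              -1,
                              NULL,
                              got_owner_cb,
                              user_data);
      return id;
  }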
Every (failed) attempt to D-Bus activate a service results in log messages
from dbus-daemon. We must avoid spamming the logs that way.
Let the connectivity check not only ask whether systemd-resolved is enabled
(that is, whether NetworkManager would like to push information there), but also
whether the service looks actually available. That is,
it either has a name owner or is not blocked from starting.
The previous workaround was to configure main.systemd-resolved=no
in NetworkManager.conf, but that requires explicit configuration.
Previously, we would create the D-Bus proxy without the
%G_DBUS_PROXY_FLAGS_DO_NOT_AUTO_START_AT_CONSTRUCTION
flag.
That means that when systemd-resolved was not available or masked, the creation
of the D-Bus proxy would fail with
dns-sd-resolved[0x561905dc92d0]: failure to create D-Bus proxy for systemd-resolved: Error calling StartServiceByName for org.freedesktop.resolve1: GDBus.Error:org.freedesktop.systemd1.NoSuchUnit: Unit dbus-org.freedesktop.resolve1.service not found.
and was never retried.
Now, don't autostart the service when creating the D-Bus proxy.
Instead, each D-Bus call will try to poke and start the service.
There is a problem, however: if systemd-resolved is not available,
we must not constantly try to start it, because that results in a flood
of syslog messages from dbus-daemon:
dbus-daemon[991]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.23' (uid=0 pid=1012 comm="/usr/bin/NetworkManager --no-daemon ")
dbus-daemon[991]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
dbus-daemon[991]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.23' (uid=0 pid=1012 comm="/usr/bin/NetworkManager --no-daemon ")
Avoid that by watching the name owner.
But, since systemd-resolved is D-Bus activated, watching the name owner
alone is not enough to know whether we should try to autostart the service.
Instead:
- if we have a name owner, assume the service is running and send the update
- if we have no name owner, and we did not recently try to start
the service by name, poke it via "StartServiceByName". The idea
is that in total we only try this once and remember a previous
attempt in priv->try_start_blocked.
- if we get a name owner, priv->try_start_blocked gets reset.
Either it was us who started the service, or somebody else.
Either way, we are good to send updates again.
The nice thing is that we only try once to start resolved and only
generate one log message from dbus-daemon about the failure to do so.
But still, after blocking the start on failure, when somebody else starts
resolved, we notice it and start using it again.
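For illustration, the "poke" itself is a plain call to the D-Bus daemon;
the function name is made up here, and the priv->try_start_blocked
bookkeeping around it is omitted:

  #include <gio/gio.h>

  /* Ask dbus-daemon to (D-Bus) activate systemd-resolved. */
  static void
  poke_resolved (GDBusConnection     *connection,
                 GAsyncReadyCallback  callback,
                 gpointer             user_data)
  {
      g_dbus_connection_call (connection,
                              "org.freedesktop.DBus",
                              "/org/freedesktop/DBus",
                              "org.freedesktop.DBus",
                              "StartServiceByName",
                              g_variant_new ("(su)",
                                             "org.freedesktop.resolve1",
                                             (guint) 0),
                              G_VARIANT_TYPE ("(u)"),
                              G_DBUS_CALL_FLAGS_NONE,
                              -1,
                              NULL,
                              callback,
                              user_data);
  }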
As we frequently send updates to systemd-resolved and for each update
send multiple messages, it can happen that we log a large number of
warnings if they all fail.
Rate limit the warnings to only warn once (until the failure is
recovered).
Currently, if systemd-resolved is not installed (or disabled), we already
fail once to create the D-Bus proxy (and never retry). That should be
fixed by creating the proxy with G_DBUS_PROXY_FLAGS_DO_NOT_AUTO_START_AT_CONSTRUCTION.
Once we allow creating the proxy, we would repeatedly try to send messages
and they would all fail. This is one example where we need to rate-limit
the warning.
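A minimal sketch of the "warn once until recovered" idea; the field and
function names are made up, not the actual NMDnsSystemdResolved members:

  #include <glib.h>

  typedef struct {
      gboolean update_warning_logged;
  } ExamplePrivate;

  static void
  update_request_failed (ExamplePrivate *priv, const GError *error)
  {
      if (priv->update_warning_logged)
          return;
      priv->update_warning_logged = TRUE;
      g_warning ("sending updates to systemd-resolved failed: %s",
                 error->message);
  }

  static void
  update_request_succeeded (ExamplePrivate *priv)
  {
      /* the first success after a failure re-arms the warning */
      priv->update_warning_logged = FALSE;
  }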
If the child is respawning too fast, consider the plugin failed so
that upstream servers are written to resolv.conf until the plugin gets
restarted after the delay.
When the dnsmasq process dies, two events are generated:
(1) a NM_DNS_PLUGIN_FAILED signal from nm-dns-dnsmasq.c:name_owner_changed()
(2) a NM_DNS_PLUGIN_CHILD_QUIT signal from nm-dns-plugin.c:watch_cb()
Event (1) is handled by updating resolv.conf with upstream servers,
(2) by restarting the child process.
The order in which the two signals are received is not deterministic,
so when (1) comes after (2) the manager leaves upstream servers in
resolv.conf even if a dnsmasq instance is running.
When dnsmasq disappears from D-Bus and we know that the process is not
running, we should not emit a FAILED signal, because the disappearance
is caused by the process termination, and that event is already
handled by the manager.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/issues/105
If the user disabled systemd-resolved, two things seem apparent:
- the user does not want us to use systemd-resolved
- NetworkManager is not pushing the DNS configuration to
systemd-resolved.
It seems to me that we should not consult systemd-resolved in that case.
Each time "systemd-resolved" was enabled/disabled in combination with another
plugin (which stayed unchanged), another pair of signal handlers was
connected. That's wrong.
Fixes: d4eb4cb45f
Even when the system resolver is configured to something other than
systemd-resolved, it is still a good idea to keep systemd-resolved up to
date. If nothing else, it does a good job at per-interface
resolving for connectivity checks.
If for whatever reason you don't want NetworkManager to push the DNS data
it discovers to systemd-resolved, the functionality can be disabled
with:
[main]
systemd-resolved=false
g_string_new_len() allocates the buffer with length
bytes. Maybe it should be obvious (it wasn't to me), but
if an init argument is given, it is taken as containing
length bytes.
So,
str = g_string_new_len (init, len);
is more like
str = g_string_new_len (NULL, len);
g_string_append_len (str, init, len);
and not (how I wrongly thought)
str = g_string_new_len (NULL, len);
g_string_append (str, init);
Fixes: 95b006c244
Previously, if "main.rc-manager" was set to "unmanaged"
and "/etc/resolv.conf" was symlink to our internal file
"/var/run/NetworkManager/resolv.conf", NM would not rewrite
the file, in an attempt to honor the requirement of NetworkManager
not changing resolv.conf.
No longer special case this. I think it was wrong and inconsistent.
If the user specifies rc-manager unmanaged, he also should manage
/etc/resolv.conf accordingly. And if the user decided to symlink
it to our internal file, that is fine. It should not stop NM from
updating that file.
Also, this was the only cases, where NM would not write our internal
resolv.conf (errors aside). It was inconsitent, and also not documented
behavior. Instead, it is documented that `man NetworkManager.conf`:
Regardless of this setting, NetworkManager will always write
resolv.conf to its runtime state directory
/var/run/NetworkManager/resolv.conf.
When a DNS plugin is enabled (like "main.dns=dnsmasq" or "main.dns=systemd-resolved"),
the name servers announced to the rc-manager are coerced to be 127.0.0.1
or 127.0.0.53.
Depending on the "main.rc-manager" setting, "/etc/resolv.conf" also
contains only this coerced name server pointing to the local caching service.
The same is true for the "/var/run/NetworkManager/resolv.conf" file, which
contains what we would write to "/etc/resolv.conf" (depending on
the "main.rc-manager" configuration).
Write a new file "/var/run/NetworkManager/no-stub-resolv.conf", which contains
the original name servers, uncoerced. Like "/var/run/NetworkManager/resolv.conf",
this file is always written.
The effect is that when one enables "main.dns=systemd-resolved", there
is still a file "no-stub-resolv.conf" with the same content as with
"main.dns=default".
The no-stub-resolv.conf file may be a solution when a user wants
NetworkManager to update systemd-resolved, but still have a regular
/etc/resolv.conf [1]. For that, the user could configure
[main]
dns=systemd-resolved
rc-manager=unmanaged
and symlink "/etc/resolv.conf" to "/var/run/NetworkManager/no-stub-resolv.conf".
This is not necessarily the only solution for the problem and does not preclude
options for updating systemd-resolved in combination with other DNS plugins.
[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/issues/20
Make them just ask for connections from GDBus, as other D-Bus clients
do. GDBus reuses the connection anyway if it already has one, but this
allows us to deal with errors in a more civilized manner.
The DNS manager reacts to NM_DEVICE_IP4_CONFIG_CHANGED events, which
are generated when there is a relevant change to an IP4 configuration.
Until now, changes to the mdns and llmnr properties were not
considered relevant (nor even minor, which was already a bug).
Promote them to relevant so that the DNS manager is notified and will
rewrite the DNS configuration when one of these properties changes.
Note that the DNS priority should be considered relevant and added
to the checksum as well, but that is a problem right now because the
DNS manager relies on the fact that an empty (i.e. just created)
configuration has a zero checksum. This is needed to avoid rewriting
resolv.conf when there is no change. The initial value of the DNS
priority depends on the connection type (VPN or not), so it's a bit
difficult to add it to the checksum while keeping the assumption of
checksum(empty) = 0.
This should be improved in the future.
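A rough sketch of what folding these properties into the checksum looks
like (function and parameter names are assumptions, not the actual
NMIP4Config code):

  #include <glib.h>

  /* Hash the DNS-relevant mdns/llmnr settings into the per-configuration
   * checksum that the DNS manager compares to decide whether resolv.conf
   * needs to be rewritten. */
  static void
  example_hash_dns_settings (GChecksum *sum, int mdns, int llmnr)
  {
      g_checksum_update (sum, (const guchar *) &mdns, sizeof (mdns));
      g_checksum_update (sum, (const guchar *) &llmnr, sizeof (llmnr));

      /* The DNS priority is not hashed here: its initial value depends on
       * the connection type (VPN or not), which conflicts with the
       * assumption that a just-created configuration has a zero checksum. */
  }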