Add a new `main.rc-manager=auto` setting, that favours to use
systemd-resolved (and not touch "/etc/resolv.conf" but configure
it via D-Bus), or falls back to `resolvconf`/`netconfig` binaries
if they are installed and enabled at compile time.
As final fallback use "symlink", like before.
Note that on Fedora there is no "openresolv" package ([1]). Instead, "systemd"
package provides "/usr/sbin/resolvconf" as a wrapper for systemd-resolved's
"resolvectl". On such a system the fallback to resolvconf is always
wrong, because NetworkManager should either talk to systemd-resolved
directly or not but never call "/usr/sbin/resolvconf". So, the special handling
for resolvconf and netconfig is only done if NetworkManager was build with these
applications explicitly enabled.
Note that SUSE builds NetworkManager with
--with-netconfig=yes
--with-config-dns-rc-manager-default=netconfig
and the new option won't be used there either. But of course, netconfig
already does all the right things on SUSE.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=668153
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Arguably, a fixme comment isn't useful. It would be better to fix it.
On the other hand, nowadays these modes are not very popular and usually
not used. If somebody cares, please provide a patch.
NM_IS_IP_CONFIG() is a standard name for GObject related macros. Next,
we will add NMIPConfig object, so this macro (and name) will have a use.
Rename, and adjust the existing macro to avoid the name conflict.
g_clear_pointer() would always cast the destroy notify function
pointer to GDestroyNotify. That means, it lost some type safety, like
GPtrArray *ptr_arr = ...
g_clear_pointer (&ptr_arr, g_array_unref);
Since glib 2.58 ([1]), g_clear_pointer() is also more type safe. But
this is not used by NetworkManager, because we don't set
GLIB_VERSION_MIN_REQUIRED to 2.58.
[1] f9a9902aac
We have nm_clear_pointer() to avoid this issue for a long time (pre
1.12.0). Possibly we should redefine in our source tree g_clear_pointer()
as nm_clear_pointer(). However, I don't like to patch glib functions
with our own variant. Arguably, we do patch g_clear_error() in
such a manner. But there the point is to make the function inlinable.
Also, nm_clear_pointer() returns a boolean that indicates whether
anything was cleared. That is sometimes useful. I think we should
just consistently use nm_clear_pointer() instead, which does always
the preferable thing.
Replace:
sed 's/\<g_clear_pointer *(\([^;]*\), *\([a-z_A-Z0-9]\+\) *)/nm_clear_pointer (\1, \2)/g' $(git grep -l g_clear_pointer) -i
I think it's preferable to use nm_clear_g_free() instead of
g_clear_pointer(, g_free). The reasons are not very strong,
but I think it is overall preferable to have a shorthand for this
frequently used functionality.
sed 's/\<g_clear_pointer *(\([^;]*\), *\(g_free\) *)/nm_clear_g_free (\1)/g' $(git grep -l g_clear_pointer) -i
It was not very clear whether update_dns() would always set the error
output variable if (and only if) it would return false.
Rework the code to make it clearer.
Not using cleanup attribute is error prone.
In theory, a function should only return a GError if (and only if) it
signals a failure. However, for example in commit 324f67956a ('dns:
ensure to log a warning when writing /etc/resolv.conf fails') due to
a bug we was violated. In that case, it resulted in a leak.
Avoid explicit frees and use the gs_free_error cleanup attribute
instead. That would also work correctly in face of such a bug and in
general it seems preferable to explicitly assign ownership to auto
variables on the stack.
When setting "main.rc-manager=symlink" (the default) and /etc/resolv.conf
is a file, NetworkManager tries to write the file directly. When that fails,
we need to make sure to propagate the error so that we log a warning about that.
With this change:
<debug> [1583320004.3122] dns-mgr: update-dns: updating plugin systemd-resolved
<trace> [1583320004.3123] dns-sd-resolved[f9e3febb7424575d]: send-updates: start 8 requests
<trace> [1583320004.3129] dns-mgr: update-resolv-no-stub: '/var/run/NetworkManager/no-stub-resolv.conf' successfully written
<trace> [1583320004.3130] dns-mgr: update-resolv-conf: write to /etc/resolv.conf failed (rc-manager=symlink, $ERROR_REASON)
<trace> [1583320004.3132] dns-mgr: update-resolv-conf: write internal file /var/run/NetworkManager/resolv.conf succeeded
<trace> [1583320004.3133] dns-mgr: current configuration: [{ [...] }]
<warn> [1583320004.3133] dns-mgr: could not commit DNS changes: $ERROR_REASON
<info> [1583320004.3134] device (eth0): Activation: successful, device activated.
https://bugzilla.redhat.com/show_bug.cgi?id=1809181
Several macros are used to define function. They had a "_STATIC" variant,
to define the function as static.
I think those macros should not try to abstract entirely what they do.
They should not accept the function scope as argument (or have two
variants per scope). This also because it might make sense to add
additional __attribute__(()) to the function. That only works, if
the macro does not pretend to *not* define a plain function.
Instead, embrace what the function does and let the users place the
function scope as they see fit.
This also follows what is already done with
static NM_CACHED_QUARK_FCN ("autoconnect-root", autoconnect_root_quark)
Most callers would pass FALSE to nm_utils_error_is_cancelled(). That's
not very useful. Split the two functions and have nm_utils_error_is_cancelled()
and nm_utils_error_is_cancelled_is_disposing().
We should use the same "is-valid" function everywhere.
Since nm_utils_ipaddr_valid() is part of libnm, it does not qualify.
Use nm_utils_ipaddr_is_valid() instead.
and _nm_utils_inet6_ntop() instead of nm_utils_inet6_ntop().
nm_utils_inet4_ntop()/nm_utils_inet6_ntop() are public API of libnm.
For one, that means they are only available in code that links with
libnm/libnm-core. But such basic helpers should be available everywhere.
Also, they accept NULL as destination buffers. We keep that behavior
for potential libnm users, but internally we never want to use the
static buffers. This patch needs to take care that there are no callers
of _nm_utils_inet[46]_ntop() that pass NULL buffers.
Also, _nm_utils_inet[46]_ntop() are inline functions and the compiler
can get rid of them.
We should consistently use the same variant of the helper. The only
downside is that the "good" name is already taken. The leading
underscore is rather ugly and inconsistent.
Also, with our internal variants we can use "static array indices in
function parameter declarations" next. Thereby the compiler helps
to ensure that the provided buffers are of the right size.
The abbreviations "ns" and "ms" seem not very clear to me. Spell them
out to nsec/msec. Also, in parts we already used the longer abbreviations,
so it wasn't consistent.
Note that the only DNS plugin that actually emits the FAILED signal was
NMDnsDnsmasq. Let's not handle restart, retry and rate-limiting by
NMDnsManager but by NMDnsDnsmasq itself.
There are three goals here:
(1) we want that when dnsmasq (infrequently) crashes, that we always keep
retrying. A random crash should be automatically resolved and
eventually dnsmasq should be working again.
Note that we anyway cannot fully detect whether something is wrong.
OK, we detect crashes, but if dnsmasq just gets catatonic, it's just
as broken. Point being: our ability to detect non-working dnsmasq is limited.
(2) when dnsmasq keeps crashing all the time, then rate limit the retry.
Of course, at this point there is already something seriously wrong,
but we shouldn't kill the system by respawning the process without rate
limiting.
(3) previously, when NMDnsManager noticed that the pluging was broken
(and rate-limiting kicked in), it would temporarily disable the plugin.
Basically, that meant to write the real name servers to /etc/resolv.conf
directly, instead of setting localhost. This partly conflicts with
(1), because we want to retry and recover automatically. So what good
is it to notice a problem, resort to plain /etc/resolv.conf for a
short time, and then run into the issues again? If something is really
broken, there is no way but to involve the user to investigate and
fix the issue. Hence, we don't need to concern NMDnsManager with this either.
The only thing that the manager notices is when the dnsmasq binary is not
available. In that case, update() fails right away, and the manager falls back
to configure the name servers in /etc/resolv.conf directly.
Also, change the backoff time from 5 minutes to 1 minute (twice the
burst interval). There is not particularly strong reason for either
choice, I think that if the ratelimit kicks in, then something is
already so wrong that it doesn't matter either way. Anyway, also 60
seconds is long enough to not kill the machine otherwise.
Several points.
- We spawn the dnsmasq process directly. That has several downsides:
- The lifetime of the process is tied to NetworkManager's. When
stopping NetworkManager, we usually also stop dnsmasq. Or we keep
the process running, but later the process is no longer a child process
of NetworkManager and we can only kill it using the pidfile.
- We don't do special sandboxing of the dnsmasq process.
- Note that we want to ensure that only one dnsmasq process is running
at any time. We should track that in a singletone. Note that NMDnsDnsmasq
is not a singleton. While there is only one instance active at any time,
the DNS plugin can be swapped (e.g. during SIGHUP). Hence, don't track the
process per-NMDnsDnsmasq instance, but in a global variable "gl_pid".
- Usually, when NetworkManager quits, it also stops the dnsmasq process.
Previously, we would always try to terminate the process based on the
pidfile. That is wrong. Most of the time, NetworkManager spawned the
process itself, as a child process. Hence, the PID is known and NetworkManager
will get a signal when dnsmasq exits. The only moment when NetworkManager should
use the pidfile, is the first time when checking to kill the previous instance.
That is: only once at the beginning, to kill instances that were
intentionally or unintentionally (crash) left running earlier.
This is now done by _gl_pid_kill_external().
- Previously, before starting a new dnsmasq instance we would kill a
possibly already running one, and block while waiting for the process to
disappear. We should never block. Especially, since we afterwards start
the process also in non-blocking way, there is no reason to kill the
existing process in a blocking way. For the most part, starting dnsmasq
is already asynchronous and so should be the killing of the dnsmasq
process.
- Drop GDBusProxy and only use GDBusConnection. It fully suffices.
- When we kill a dnsmasq instance, we actually don't have to wait at
all. That can happen fully in background. The only pecularity is that
when we restart a new instance before the previous instance is killed,
then we must wait for the previous process to terminate first. Also, if
we are about to exit while killing the dnsmasq instance, we must register
nm_shutdown_wait_obj_*() to wait until the process is fully gone.
We only have two real DNS plugins: "dnsmasq" and "systemd-resolved" (the "unbound"
plugin is very incomplete and should eventually be dropped).
Of these two, only "dnsmasq" spawns a child process. A lot of the logic
for that is in the parent class NMDnsPlugin, with the purpose for that
logic to be reusable.
However:
- We are unlikely to add more DNS plugins. Especially because
systemd-resolved seems the way forward.
- If we happen to add more plugins, then probably NetworkManager
should not spawn the process itself. That causes problems with
restarting the service. Rather, we should let the service manager
handle the lifetime of such "child" processes. Aside separating
the lifetime of the DNS plugin process from NetworkManager's,
this also would allow to sandbox NetworkManager and the DNS plugin
differently. Currently, NetworkManager itself may might need
capabilities only to pass them on to the DNS plugin, or (more likely)
NetworkManager would want to drop additional capabilities for the
DNS plugin (which we would rather not implement ourself, since that
seems job of the service management already).
- The current implementation is far from beautiful. For example,
it does synchronous (blocking) killing of the running process
from the PID file, and it uses PID fils. This is not something
we would want to reuse for other plugins. Also, note that
dnsmasq already spawns the service asynchronosly (of course).
Hence, we should also kill it asynchronously, but that is complicated
by having the logic separated in two different classes while
providing an abstract API between the two.
Move the code to NMDnsDnsmasq. This is the only place that cares about
this. Also, that makes it actually clearer what is happening, by seeing
the lifetime handling of the child proceess all in one place.
For logging, if the plugin fails with update, it should return a reason
that we can log.
Note that both dnsmasq and system-resolved plugins do the update asynchronously
(of course). Hence, usually they never fail right away, and there isn't really
possibility to handle the failure later. Still, we should print something sensible
for that we need information what went wrong.
The plugin name and whether a plugin is caching only depends on the type,
it does not require a virtual function where types would decided depending
on other reasons.
Convert the virtual functions into fields of the class.
We no longer add these. If you use Emacs, configure it yourself.
Also, due to our "smart-tab" usage the editor anyway does a subpar
job handling our tabs. However, on the upside every user can choose
whatever tab-width he/she prefers. If "smart-tabs" are used properly
(like we do), every tab-width will work.
No manual changes, just ran commands:
F=($(git grep -l -e '-\*-'))
sed '1 { /\/\* *-\*- *[mM]ode.*\*\/$/d }' -i "${F[@]}"
sed '1,4 { /^\(#\|--\|dnl\) *-\*- [mM]ode/d }' -i "${F[@]}"
Check remaining lines with:
git grep -e '-\*-'
The ultimate purpose of this is to cleanup our files and eventually use
SPDX license identifiers. For that, first get rid of the boilerplate lines.
I also like this because it's non-obvious that subscription IDs from
GDBusConnection are "guint" (contrary to signal handler IDs which are
"gulong"). So, by using this API you get a compiler error when using the
wrong type.
In the past, when switching to nm_clear_g_signal_handler() this uncovered
multiple bugs where the wrong type was used to hold the ID.
We will use the D-Bus connection of our NMDBusManager singleton more.
Use a macro.
- it's shorter to type and it's one distinct word.
- the name indicates what this is: the main D-Bus connection singleton.
By searching for this name we can find all users that care about using
this singleton.
From the files under "shared/nm-utils" we build an internal library
that provides glib-based helper utilities.
Move the files of that basic library to a new subdirectory
"shared/nm-glib-aux" and rename the helper library "libnm-core-base.la"
to "libnm-glib-aux.la".
Reasons:
- the name "utils" is overused in our code-base. Everything's an
"utils". Give this thing a more distinct name.
- there were additional files under "shared/nm-utils", which are not
part of this internal library "libnm-utils-base.la". All the files
that are part of this library should be together in the same
directory, but files that are not, should not be there.
- the new name should better convey what this library is and what is isn't:
it's a set of utilities and helper functions that extend glib with
funcitonality that we commonly need.
There are still some files left under "shared/nm-utils". They have less
a unifying propose to be in their own directory, so I leave them there
for now. But at least they are separate from "shared/nm-glib-aux",
which has a very clear purpose.
(cherry picked from commit 80db06f768)
The proxy does nothing for us, except overhead.
We can directly subscribe to "NameOwnerChanged" signals on the
GDBusConnection. Also, instead of asynchronously creating the
GDBusProxy, asynchronously call "GetNameOwner". That's what the
proxy does anyway.
GDBusConnection is actually a decent API. We don't need another layer on
top of that, for functionality that we don't use.
Also, don't use G_BUS_TYPE_SYSTEM, but use the GDBusConnection that
also the bus-manager uses. For all practical purposes, that is the
connection was want to use also in NMDnsSystemdResolved.
Every (failed) attempt to D-Bus activate a service results in log-messages
from dbus-daemon. It must be avoided to spam the logs that way.
Let connectivity check not only ask whether systemd-resolved is enabled
(and NetworkManager would like to push information there), but also
whether it looks like the service is actually available. That is,
either it has a name-owner or it's not blocked from starting.
The previous workaround was to configure main.systemd-resolved=no
in NetworkManager.conf. But that requires explict configuration.
Previously, we would create the D-Bus proxy without
%G_DBUS_PROXY_FLAGS_DO_NOT_AUTO_START_AT_CONSTRUCTION
flag.
That means, when systemd-resolved was not available or masked, the creation
of the D-Bus proxy would fail with
dns-sd-resolved[0x561905dc92d0]: failure to create D-Bus proxy for systemd-resolved: Error calling StartServiceByName for org.freedesktop.resolve1: GDBus.Error:org.freedesktop.systemd1.NoSuchUnit: Unit dbus-org.freedesktop.resolve1.service not found.
and never retried.
Now, when creating the D-Bus proxy don't autostart the instance.
Instead, each D-Bus call will try to poke and start the service.
There is a problem however: if systemd-resolved is not available, then
we must not constantly trying to start it, because it results in a slur
or syslog messages from dbus-daemon:
dbus-daemon[991]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.23' (uid=0 pid=1012 comm="/usr/bin/NetworkManager --no-daemon ")
dbus-daemon[991]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
dbus-daemon[991]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.23' (uid=0 pid=1012 comm="/usr/bin/NetworkManager --no-daemon ")
Avoid that by watching the name owner.
But, since systemd-resolved is D-Bus activated, watching the name owner
alone is not enough to know whether we should try to autostart the service.
Instead:
- if we have a name owner, assume the service runs and we send the update
- if we have no name owner, and we did not recently try to start
the service by name, poke it via "StartServiceByName". The idea
is, that in total we only try this once and remember a previous
attempt in priv->try_start_blocked.
- if we get a name-owner, priv->try_start_blocked gets reset.
Either it was us who started the service, or somebody else.
Either way, we are good to send updates again.
The nice thing is that we only try once to start resolved and only
generate one logging message from dbus-daemon about failure to do so.
But still, after blocking start on failure, when somebody else starts
resolved, we notice it and start using it again.
As we frequently send updates to systemd-resolved and for each update
send multiple messages, it can happen that we log a large number of
warnings if they all fail.
Rate limit the warnings to only warn once (until the failure is
recovered).
Currently, if systemd-resolved is not installed (or disabled) we already
fail once to create the D-Bus proxy (and never retry). That should be
fixed, to create the proxy with G_DBUS_PROXY_FLAGS_DO_NOT_AUTO_START_AT_CONSTRUCTION.
If we allow creating the proxy we would repeatedly try to send messages
and they would all fail. This is one example, where we need to ratelimit
the warning.
If the child is respawning too fast, consider the plugin failed so
that upstream servers are written to resolv.conf until the plugin gets
restarted after the delay.
When the dnsmasq process dies, two events are generated:
(1) a NM_DNS_PLUGIN_FAILED signal in nm-dns-dnsmasq.c:name_owner_changed()
(2) a NM_DNS_PLUGIN_CHILD_QUIT signal in nm-dns-plugin.c:from watch_cb()
Event (1) is handled by updating resolv.conf with upstream servers,
(2) by restarting the child process.
The order in which the two signals are received is not deterministic,
so when (1) comes after (2) the manager leaves upstream servers in
resolv.conf even if a dnsmasq instance is running.
When dnsmasq disappears from D-Bus and we know that the process is not
running, we should not emit a FAILED signal because the disappearing
is caused by the process termination, and that event is already
handled by the manager.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/issues/105
If the user disabled systemd-resolved, two things seem apparent:
- the user does not want us to use systemd-resolved
- NetworkManager is not pushing the DNS configuration to
systemd-resoved.
It seems to me, we should not consult systemd-resolved in that case.