If the interface is the DNS default route but has no domain, its name
servers were not sent to systemd-resolved. Fix this.
Fixes: ee9fab0361 ('dns: fix handling default routing domains with systemd-resolved')
Let's try to rework and improve the implementation.
The code is already rather complicated from the start. Let's first try
to use consistent names for things. Naming is very important, so that
they follow a pattern that tells you what something does.
We will rework preparing the DNS data, so that we do things that
currently "nm-dns-systemd-resolved.c" does (see InterfaceConfig).
We thus will need to access the NMDnsConfigData, which is per-ifindex.
As such, it is useful to have NMDnsConfigData in a stable order, sorted
by ifindex. As we track NMDnsConfigData in a hash table, we need to
do the sorting outside of that.
There are many ways to achive that. The solution here is to let
NMDnsConfigData also be tracked by a CList. We only need to resort
the list, when we add a new ifindex -- which should happen only seldom.
The advantage is that the sorting happens in-place and can expose
it to other uses (by providing them access to the CList head).
Let's add a nm_utils_get_reverse_dns_domains_ip() function, which can
operate on both address families. We frequently do that, but then our
address family specific functions tend to have an underscore in the name.
Rename.
Revert this change. One problem is that none of the current GUIs
(nm-connection-editor, gnome-control-center, plasma-nm) expose the
dns-priority option. So, users tend to have their profile value set to
0. Changing the default means for them not only a change in behavior,
but its hard to fix via the GUI.
Also, what other call DNS leaks, is Split DNS to some. Both uses make
sense, but have conflicting goals. The default cannot accommodate both
at the same time.
Also, with split DNS enabled (dnsmasq, systemd-resolved), the concern
for DNS leaks is smaller. Imagine:
Wi-Fi profile with ipv4.dns-priority (effectively) 100, domain "example.com".
VPN profile with ipv4.dns-priority (effectively) 50 and a default route.
That is a common setup that one gets by default (and what probably many
users have today). In such a case with split DNS enabled, the Wi-Fi's DNS
server only sees requests for "*.example.com". So, it does not leak
everything.
Hence, revert this change before 1.28.0 release to the earlier behavior.
This reverts commit af13081bec.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/688
We used to set "~." domains for all devices that should be used for
resolving unknown domains.
Systemd-resolved also supports setting "SetLinkDefaultRoute()".
We should only set the wildcard domain if we want that this
interface is used exclusively. Otherwise, we should only set
DefaultRoute. See ([1], [2], [3], [4]).
Otherwise the bad effect is if other components (wg-quick) want
to set exclusive DNS lookups on their link. That is achieved by
explicitly adding "~." and that is also what resolved's
`/usr/sbin/resolvconf -x` does. If NetworkManager sets "~." for
interfaces that are not important and should not be used exclusively,
then this steals the DNS requests from those external components.
In NetworkManager we know whether a link should get exclusive lookups
based on the "ipv[46].dns-priority" setting.
[1] https://www.freedesktop.org/software/systemd/man/org.freedesktop.resolve1.html
[2] https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html
[3] https://github.com/systemd/systemd/issues/17529#issuecomment-730522444
[4] https://github.com/systemd/systemd/pull/17678
domain_is_shadowed() only works, because we pre-sort all items. When
we call domain_is_shadowed(), then "priority" must be not smaller than
any priority already in the dictionary.
Let's add an nm_assert() for that.
While at it, I also found it ugly to rely on
GPOINTER_TO_INT(g_hash_table_lookup(ht, domain))
returning zero to know whether the domain is tracked. While more
cumbersome, we should check whether the value is in the hash (and not).
Not whether the value does not translate to zero.
Add domain_ht_get_priority() for that.
Change the default DNS priority of VPNs to -50, to avoid leaking
queries out of full-tunnel VPNs.
This is a change in behavior. In particular:
- when using dns=default (i.e. no split-dns) before this patch both
VPN and the local name server were added (in this order) to
resolv.conf; the result was that depending on resolv.conf options
and resolver implementation, the name servers were tried in a
certain manner which does not prevent DNS leaks.
With this change, only the VPN name server is added to resolv.conf.
- When using a split-dns plugin (systemd-resolved or dnsmasq), before
this patch the full-tunnel VPN would get all queries except those
ending in a local domain, that would instead be directed to the
local server.
After this patch, the VPN gets all queries.
To revert to the old behavior, set the DNS priority to 50 in the
connection profile.
If a VPN has never-default=no but doesn't get a default route (this
can happen for example when the server pushes routes with
openconnect), and there are no search domains, then the name servers
pushed by the server would be unused. It is preferable in this case to
use the VPN DNS server for all queries.
https://bugzilla.redhat.com/show_bug.cgi?id=1863041
Run:
./contrib/scripts/nm-code-format.sh -i
./contrib/scripts/nm-code-format.sh -i
Yes, it needs to run twice because the first run doesn't yet produce the
final result.
Signed-off-by: Antonio Cardace <acardace@redhat.com>
Add a new `main.rc-manager=auto` setting, that favours to use
systemd-resolved (and not touch "/etc/resolv.conf" but configure
it via D-Bus), or falls back to `resolvconf`/`netconfig` binaries
if they are installed and enabled at compile time.
As final fallback use "symlink", like before.
Note that on Fedora there is no "openresolv" package ([1]). Instead, "systemd"
package provides "/usr/sbin/resolvconf" as a wrapper for systemd-resolved's
"resolvectl". On such a system the fallback to resolvconf is always
wrong, because NetworkManager should either talk to systemd-resolved
directly or not but never call "/usr/sbin/resolvconf". So, the special handling
for resolvconf and netconfig is only done if NetworkManager was build with these
applications explicitly enabled.
Note that SUSE builds NetworkManager with
--with-netconfig=yes
--with-config-dns-rc-manager-default=netconfig
and the new option won't be used there either. But of course, netconfig
already does all the right things on SUSE.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=668153
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Arguably, a fixme comment isn't useful. It would be better to fix it.
On the other hand, nowadays these modes are not very popular and usually
not used. If somebody cares, please provide a patch.
NM_IS_IP_CONFIG() is a standard name for GObject related macros. Next,
we will add NMIPConfig object, so this macro (and name) will have a use.
Rename, and adjust the existing macro to avoid the name conflict.
g_clear_pointer() would always cast the destroy notify function
pointer to GDestroyNotify. That means, it lost some type safety, like
GPtrArray *ptr_arr = ...
g_clear_pointer (&ptr_arr, g_array_unref);
Since glib 2.58 ([1]), g_clear_pointer() is also more type safe. But
this is not used by NetworkManager, because we don't set
GLIB_VERSION_MIN_REQUIRED to 2.58.
[1] f9a9902aac
We have nm_clear_pointer() to avoid this issue for a long time (pre
1.12.0). Possibly we should redefine in our source tree g_clear_pointer()
as nm_clear_pointer(). However, I don't like to patch glib functions
with our own variant. Arguably, we do patch g_clear_error() in
such a manner. But there the point is to make the function inlinable.
Also, nm_clear_pointer() returns a boolean that indicates whether
anything was cleared. That is sometimes useful. I think we should
just consistently use nm_clear_pointer() instead, which does always
the preferable thing.
Replace:
sed 's/\<g_clear_pointer *(\([^;]*\), *\([a-z_A-Z0-9]\+\) *)/nm_clear_pointer (\1, \2)/g' $(git grep -l g_clear_pointer) -i
I think it's preferable to use nm_clear_g_free() instead of
g_clear_pointer(, g_free). The reasons are not very strong,
but I think it is overall preferable to have a shorthand for this
frequently used functionality.
sed 's/\<g_clear_pointer *(\([^;]*\), *\(g_free\) *)/nm_clear_g_free (\1)/g' $(git grep -l g_clear_pointer) -i
It was not very clear whether update_dns() would always set the error
output variable if (and only if) it would return false.
Rework the code to make it clearer.
Not using cleanup attribute is error prone.
In theory, a function should only return a GError if (and only if) it
signals a failure. However, for example in commit 324f67956a ('dns:
ensure to log a warning when writing /etc/resolv.conf fails') due to
a bug we was violated. In that case, it resulted in a leak.
Avoid explicit frees and use the gs_free_error cleanup attribute
instead. That would also work correctly in face of such a bug and in
general it seems preferable to explicitly assign ownership to auto
variables on the stack.
When setting "main.rc-manager=symlink" (the default) and /etc/resolv.conf
is a file, NetworkManager tries to write the file directly. When that fails,
we need to make sure to propagate the error so that we log a warning about that.
With this change:
<debug> [1583320004.3122] dns-mgr: update-dns: updating plugin systemd-resolved
<trace> [1583320004.3123] dns-sd-resolved[f9e3febb7424575d]: send-updates: start 8 requests
<trace> [1583320004.3129] dns-mgr: update-resolv-no-stub: '/var/run/NetworkManager/no-stub-resolv.conf' successfully written
<trace> [1583320004.3130] dns-mgr: update-resolv-conf: write to /etc/resolv.conf failed (rc-manager=symlink, $ERROR_REASON)
<trace> [1583320004.3132] dns-mgr: update-resolv-conf: write internal file /var/run/NetworkManager/resolv.conf succeeded
<trace> [1583320004.3133] dns-mgr: current configuration: [{ [...] }]
<warn> [1583320004.3133] dns-mgr: could not commit DNS changes: $ERROR_REASON
<info> [1583320004.3134] device (eth0): Activation: successful, device activated.
https://bugzilla.redhat.com/show_bug.cgi?id=1809181
Several macros are used to define function. They had a "_STATIC" variant,
to define the function as static.
I think those macros should not try to abstract entirely what they do.
They should not accept the function scope as argument (or have two
variants per scope). This also because it might make sense to add
additional __attribute__(()) to the function. That only works, if
the macro does not pretend to *not* define a plain function.
Instead, embrace what the function does and let the users place the
function scope as they see fit.
This also follows what is already done with
static NM_CACHED_QUARK_FCN ("autoconnect-root", autoconnect_root_quark)
Most callers would pass FALSE to nm_utils_error_is_cancelled(). That's
not very useful. Split the two functions and have nm_utils_error_is_cancelled()
and nm_utils_error_is_cancelled_is_disposing().
We should use the same "is-valid" function everywhere.
Since nm_utils_ipaddr_valid() is part of libnm, it does not qualify.
Use nm_utils_ipaddr_is_valid() instead.
and _nm_utils_inet6_ntop() instead of nm_utils_inet6_ntop().
nm_utils_inet4_ntop()/nm_utils_inet6_ntop() are public API of libnm.
For one, that means they are only available in code that links with
libnm/libnm-core. But such basic helpers should be available everywhere.
Also, they accept NULL as destination buffers. We keep that behavior
for potential libnm users, but internally we never want to use the
static buffers. This patch needs to take care that there are no callers
of _nm_utils_inet[46]_ntop() that pass NULL buffers.
Also, _nm_utils_inet[46]_ntop() are inline functions and the compiler
can get rid of them.
We should consistently use the same variant of the helper. The only
downside is that the "good" name is already taken. The leading
underscore is rather ugly and inconsistent.
Also, with our internal variants we can use "static array indices in
function parameter declarations" next. Thereby the compiler helps
to ensure that the provided buffers are of the right size.
The abbreviations "ns" and "ms" seem not very clear to me. Spell them
out to nsec/msec. Also, in parts we already used the longer abbreviations,
so it wasn't consistent.
Note that the only DNS plugin that actually emits the FAILED signal was
NMDnsDnsmasq. Let's not handle restart, retry and rate-limiting by
NMDnsManager but by NMDnsDnsmasq itself.
There are three goals here:
(1) we want that when dnsmasq (infrequently) crashes, that we always keep
retrying. A random crash should be automatically resolved and
eventually dnsmasq should be working again.
Note that we anyway cannot fully detect whether something is wrong.
OK, we detect crashes, but if dnsmasq just gets catatonic, it's just
as broken. Point being: our ability to detect non-working dnsmasq is limited.
(2) when dnsmasq keeps crashing all the time, then rate limit the retry.
Of course, at this point there is already something seriously wrong,
but we shouldn't kill the system by respawning the process without rate
limiting.
(3) previously, when NMDnsManager noticed that the pluging was broken
(and rate-limiting kicked in), it would temporarily disable the plugin.
Basically, that meant to write the real name servers to /etc/resolv.conf
directly, instead of setting localhost. This partly conflicts with
(1), because we want to retry and recover automatically. So what good
is it to notice a problem, resort to plain /etc/resolv.conf for a
short time, and then run into the issues again? If something is really
broken, there is no way but to involve the user to investigate and
fix the issue. Hence, we don't need to concern NMDnsManager with this either.
The only thing that the manager notices is when the dnsmasq binary is not
available. In that case, update() fails right away, and the manager falls back
to configure the name servers in /etc/resolv.conf directly.
Also, change the backoff time from 5 minutes to 1 minute (twice the
burst interval). There is not particularly strong reason for either
choice, I think that if the ratelimit kicks in, then something is
already so wrong that it doesn't matter either way. Anyway, also 60
seconds is long enough to not kill the machine otherwise.
Several points.
- We spawn the dnsmasq process directly. That has several downsides:
- The lifetime of the process is tied to NetworkManager's. When
stopping NetworkManager, we usually also stop dnsmasq. Or we keep
the process running, but later the process is no longer a child process
of NetworkManager and we can only kill it using the pidfile.
- We don't do special sandboxing of the dnsmasq process.
- Note that we want to ensure that only one dnsmasq process is running
at any time. We should track that in a singletone. Note that NMDnsDnsmasq
is not a singleton. While there is only one instance active at any time,
the DNS plugin can be swapped (e.g. during SIGHUP). Hence, don't track the
process per-NMDnsDnsmasq instance, but in a global variable "gl_pid".
- Usually, when NetworkManager quits, it also stops the dnsmasq process.
Previously, we would always try to terminate the process based on the
pidfile. That is wrong. Most of the time, NetworkManager spawned the
process itself, as a child process. Hence, the PID is known and NetworkManager
will get a signal when dnsmasq exits. The only moment when NetworkManager should
use the pidfile, is the first time when checking to kill the previous instance.
That is: only once at the beginning, to kill instances that were
intentionally or unintentionally (crash) left running earlier.
This is now done by _gl_pid_kill_external().
- Previously, before starting a new dnsmasq instance we would kill a
possibly already running one, and block while waiting for the process to
disappear. We should never block. Especially, since we afterwards start
the process also in non-blocking way, there is no reason to kill the
existing process in a blocking way. For the most part, starting dnsmasq
is already asynchronous and so should be the killing of the dnsmasq
process.
- Drop GDBusProxy and only use GDBusConnection. It fully suffices.
- When we kill a dnsmasq instance, we actually don't have to wait at
all. That can happen fully in background. The only pecularity is that
when we restart a new instance before the previous instance is killed,
then we must wait for the previous process to terminate first. Also, if
we are about to exit while killing the dnsmasq instance, we must register
nm_shutdown_wait_obj_*() to wait until the process is fully gone.