Since commit ecc73eb239 ('ovs-port: always remove the OVSDB entry on
slave release'), ovs port were removing the ovsdb entry upon being
un-enslaved, no matter what the reason for un-enslavement was. The idea
was to remove the stale ovsdb entry upon forcible device removal.
This cleanup is specific to OpenVSwitch, since for other device types,
the device master is the property of the slave and thus goes away along
with the device.
Turns out we're now removing the ovsdb entry even when the device
actually doesn't go away, but we're pretending it does because the
daemon is shutting down.
To add insult to injury, we generally end up removing one entry,
because the other ovsdb calls end up in a queue and don't get serviced
before the daemon shuts down. The result is a mess. (This patch
doesn't solve that -- if someone terminates the daemon with in-flight
ovsdb calls they're still out of luck).
Let's do the cleanup now only if the device was actually physically
removed.
Fixes-test: @NM_reboot_openvswitch_vlan_configuration
Fixes: ecc73eb239 ('ovs-port: always remove the OVSDB entry on slave release')
https://bugzilla.redhat.com/show_bug.cgi?id=2055665https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1117
gcc-12.0.1-0.8.fc36 is annoying with false positives.
It's related to g_error() and its `for(;;) ;`.
For example:
../src/libnm-glib-aux/nm-shared-utils.c: In function 'nm_utils_parse_inaddr_bin_full':
../src/libnm-glib-aux/nm-shared-utils.c:1145:26: error: dangling pointer to 'error' may be used [-Werror=dangling-pointer=]
1145 | error->message);
| ^~
/usr/include/glib-2.0/glib/gmessages.h:343:32: note: in definition of macro 'g_error'
343 | __VA_ARGS__); \
| ^~~~~~~~~~~
../src/libnm-glib-aux/nm-shared-utils.c:1133:31: note: 'error' declared here
1133 | gs_free_error GError *error = NULL;
| ^~~~~
/usr/include/glib-2.0/glib/gmessages.h:341:25: error: dangling pointer to 'addrbin' may be used [-Werror=dangling-pointer=]
341 | g_log (G_LOG_DOMAIN, \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
342 | G_LOG_LEVEL_ERROR, \
| ~~~~~~~~~~~~~~~~~~~~~~~
343 | __VA_ARGS__); \
| ~~~~~~~~~~~~
../src/libnm-glib-aux/nm-shared-utils.c:1141:13: note: in expansion of macro 'g_error'
1141 | g_error("unexpected assertion failure: could parse \"%s\" as %s, but not accepted by "
| ^~~~~~~
../src/libnm-glib-aux/nm-shared-utils.c:1112:14: note: 'addrbin' declared here
1112 | NMIPAddr addrbin;
| ^~~~~~~
I think the warning could potentially be useful and prevent real bugs.
So don't disable it altogether, but go through the effort to suppress it
at the places where it currently happens.
Note that NM_PRAGMA_WARNING_DISABLE_DANGLING_POINTER macro only expands
to suppressing the warning with __GNUC__ equal to 12. The purpose is to
only suppress the warning where we know we want to. Hopefully other gcc
versions don't have this problem.
I guess, we could also write a NM_COMPILER_WARNING() check in
"m4/compiler_options.m4", to disable the warning if we detect it. But
that seems too cumbersome.
pppd is a delicate flower. On orderly shutdown, it likes to tell the
other side. This seems to take at least a second even when no real
network latency is at play, on busy systems 1.5 seconds easily ends up
being inadequate.
A violent shutdown is generally okay apart from that it can leave
garbage (port lock) behind and the other side potentially confused for a
while.
As it happens, this interacts badly with modemu.pl which is used for
testing: the pseudo terminal in PPP line discipline mode has no idea
that the remote disconnected and while ModemManager is learning that
something wrong the hard way (AT command timing out, because the remote
still expects to talk PPP), the test times out.
Let's increase the timeout to something more reasonable.
https://bugzilla.redhat.com/show_bug.cgi?id=2049596https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1103
Don't progress to the IP ready state until all objects are committed
to platform. Note that l3cfg has a 20 seconds timeout after which
unavailable objects are considered "definitely unavailable" and are
removed from the list.
Fixes-test: @ipv6_routes_with_src
https://bugzilla.redhat.com/show_bug.cgi?id=2043133
l3cfg has a "temp_not_available" list of objects that couldn't be
added to platform, but can be added once some preconditions become
true (for example, a IPv6 route with a "src" attribute requires a
non-tentative src address to be present).
Retry to commit those objects once all addresses have completed
ACD/DAD.
nm_l3_config_data_get_nameservers() returns a pointer to "struct in6_addr". Not
a pointer to pointers.
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:389
#1 0x00007f8060dd9109 in memcpy (__len=<optimized out>, __src=0xfd, __dest=<optimized out>) at /usr/include/bits/string_fortified.h:29
#2 g_array_append_vals (len=1, data=0xfd, farray=0x55dd69332130) at ../glib/garray.c:522
#3 g_array_append_vals (farray=0x55dd69332130, data=0xfd, len=1) at ../glib/garray.c:509
#4 0x000055dd68d2a27d in _garray_inaddr_add (p_arr=<optimized out>, addr_family=<optimized out>, addr=0xfd) at src/core/nm-l3-config-data.c:295
#5 0x000055dd68ef6510 in nm_l3_config_data_add_nameserver (nameserver=<optimized out>, addr_family=10, self=0x55dd6949f900) at src/core/nm-l3-config-data.c:1442
#6 nm_device_copy_ip6_dns_config (self=0x55dd693c4420, from_device=<optimized out>) at src/core/devices/nm-device.c:10468
#7 0x00007f8060f28aba in _g_closure_invoke_va (param_types=0x0, n_params=<optimized out>, args=0x7fffed43d610, instance=0x55dd693c4420, return_value=0x0, closure=0x55dd693cdb10)
at ../gobject/gclosure.c:893
#8 g_signal_emit_valist (instance=0x55dd693c4420, signal_id=<optimized out>, detail=0, var_args=var_args@entry=0x7fffed43d610) at ../gobject/gsignal.c:3406
#9 0x00007f8060f28c03 in g_signal_emit (instance=<optimized out>, signal_id=<optimized out>, detail=<optimized out>) at ../gobject/gsignal.c:3553
#10 0x000055dd68efd1fb in _dev_ipac6_start (self=0x55dd693c4420) at src/core/devices/nm-device.c:11348
#11 0x000055dd68efd698 in _dev_ipac6_start_continue (self=0x55dd693c4420) at src/core/devices/nm-device.c:11373
#12 _dev_ipll6_set_llstate (self=0x55dd693c4420, llstate=<optimized out>, lladdr=<optimized out>) at src/core/devices/nm-device.c:10576
#13 0x000055dd68e7915e in _emit_changed_on_idle_cb (user_data=user_data@entry=0x55dd6941ca50) at src/core/nm-l3-ipv6ll.c:221
#14 0x00007f8060e0639b in g_idle_dispatch (source=0x55dd693eea30, callback=0x55dd68e78fd0 <_emit_changed_on_idle_cb>, user_data=0x55dd6941ca50) at ../glib/gmain.c:5897
#15 0x00007f8060e0a05f in g_main_dispatch (context=0x55dd6922c800) at ../glib/gmain.c:3381
#16 g_main_context_dispatch (context=0x55dd6922c800) at ../glib/gmain.c:4099
#17 0x00007f8060e5f2a8 in g_main_context_iterate.constprop.0 (context=0x55dd6922c800, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4175
#18 0x00007f8060e09773 in g_main_loop_run (loop=0x55dd69211010) at ../glib/gmain.c:4373
#19 0x000055dd68d09c7b in main (argc=<optimized out>, argv=<optimized out>) at src/core/main.c:509
Fixes: 58287cbcc0 ('core: rework IP configuration in NetworkManager using layer 3 configuration')
test_machine_id_read() is a flawed unit test, as it reads the machine-id
of the machine where it's running. That means the test depends on the
test machine, which is obviously a problem.
If you had no /etc/machine-id but a /var/lib/dbus/machine-id, then previously
the test would fail. If the file exists, assume we are able to read a
valid machine-id.
On test systems that have a bogus /etc/machine-id or /var/lib/dbus/machine-id,
the test would still fail. Just don't do that.
Certain route types (blackhole, unreachable, prohibit) are not tied to
an interface. They are thus global and we need to track them system wide
(or better: per network namespace). That is done by NMPRouteManager.
For the routing rules, it's NMDevice itself to track/untrack the rules.
That is done for historical reasons, at the time, NML3Cfg did not exit.
Now with NML3Cfg, it seems that also NML3Cfg should be the part that
handles nodev routes. One reason is that we want to move IP
functionality out of NMDevice. So callers (NMDevice) would just add
blackhole routes to the NML3ConfigData and let NML3Cfg handle them.
Still, to handle these routes is rather different from regular routes.
Normally, NML3Cfg tracks an object state (ObjStateData) for each address/route,
and it hooks into platform signals to update the os_plobj field. Those signals
are dispatched by NMNetns and are only per-ifindex. Hence, NML3Cfg
wouldn't be notified about those nodev routes. Consequently, there
os_plobj could not be (efficiently) maintained and there is no
ObjStateData for such routes.
Instead, all that NML3Cfg does is have the routes in the NML3ConfigData and
tell NMPRouteManager about them. Seems simple enough. The only question
is when should NMPRouteManager sync? For now, we sync when the
track/untracking brings any changes and during reapply. Which is
probably fine.
Specifically, in nm_utils_ip_route_attribute_to_platform() and in
_l3_config_data_add_obj() handle such new route type. For the moment,
they cannot be stored in a valid NMSettingIPConfig, but later this will
be necessary.
The general idea is that when we have entries tracked by the
route-manager, that we can mark them all as dirty. Then, calling the
"track" function will reset the dirty flag. Finally, there is a method
to delete all dirty entries.
As we can lookup an entry with O(1) (using dictionaries), we can
sync the list of tracked objects with O(n). We just need to track
all the ones we care about, and then delete those that were not touched
(that is, are still dirty).
Previously, we had to explicitly mark all entries as dirty. We can do
better. Just let nmp_route_manager_untrack_all() mark the survivors as
dirty right away. This way, we can save iterating the list once.
It also makes sense because the only purpose of the dirty flag is to
aid this prune mechanism with track/untrack-all. So, untrack-all can
just help out, and leave the remaining entries dirty, so that the next
track does the right thing.
Routes of type blackhole, unreachable, prohibit don't have an
ifindex/device. They are thus in many ways similar to routing rules,
as they are global. We need a mediator to keep track which routes
to configure.
This will be very similar to what NMPRulesManager already does for
routing rules. Rename the API, so that it also can be used for routes.
Renaming the file will be done next, so that git's rename detection
doesn't get too confused.
So far, certain NMObject types could not have an ifindex of zero. Hence,
nmp_lookup_init_object() took such an ifindex to mean lookup all objects
of that type.
Soon, we will support blackhole/unreachable/prohibit route types, which
have their ifindex set to zero. It is still useful to lookup those routes
types via nmp_lookup_init_object().
Change behaviour how to interpret the ifindex. Note that this also
affects various callers of nmp_lookup_init_object(). If somebody was
relying on the previous behavior, it would need fixing.
The signal parameters are G_TYPE_UINT. We should not assume that our
enums are a compatible integer type.
In practice of course they always were. Let's just be clear about when
we have an integer and when we have an enum.
- use slice allocator
- use designated initializers
- first determine all parameters in killswitch_new() before
setting ks.
- drop unnecessary memset().
Naming is important. Especially when we have a 8k LOC monster, that
manages everything.
Rename things related to Rfkill to give them a common prefix.
Also, move code around so it's beside each other.
RadioState contained both the mutable user-data and meta-data about the
radio state. The latter is immutable. The parts that cannot change, should
be separate from those that can change. Move them to a separate array.
Of course, we only have on NMManager instance, so having these fields embedded
in NMManagerPrivate did not waste memory. This change is done, because immutable
fields should be const. In this case, they are now const global data, which
is even protected by the linker from accidental mutation.
Names in header files should have an "nm" prefix. We do that pretty
consistently. Fix the offenders RfKillState and RfKillType.
Also, rename the RfKillState enums to follow the type name. For example,
NM_RFKILL_STATE_SOFT_BLOCKED instead of RFKILL_SOFT_BLOCKED.
Also, when we camel-case a typedef (NMRfKillState) we would want that
the lower-case names use underscore between the words. So it should be
`nm_rf_kill_state_to_string()`. But that looks awkward. So the right solution
here is to also rename "RfKill" to "Rfkill". That make is consistent
with the spelling of the existing `NMRfkillManager` type and the
`nm-rfkill-manager.h` file.
- use "const RadioState" where possible
- use "bool" bitfields in RadioState (boolean flags in structs
should be `bool` types for consistency) and order fields by
alignment.
- break lines for variable declaration in manager_rfkill_update_one_type().
- return (and nm_assert()) from update_rstate_from_rfkill(). By
not adding a default case, compiler would warn if we forget to
handle an enum value. We can easily do that, by just returning,
and let the "default" case be handled by nm_assert_not_reached()
-- which unlike g_warn_if_reached() compiles to nothing in
release build.
- add nm_assert() that `priv->radio_states[rtype]` is not out of range.
- use designated initializers for priv->radio_states[].
GObject Properties are flexible and powerful. In practice, NMDevicePrivate.rfkill_type
was only set once via the construct-only property NM_DEVICE_RFKILL_TYPE.
Which in turn was always set to a well-known value, only depending on the device
type.
We don't need this flexibility. The rfkill-type only depends on the
device type and doesn't change. Replace the property by a field in
NMDeviceClass.
For one, construct properties have an overhead, that the property setter is
called whenever we construct a NMDevice. But the real reason for this
change, is that a property give a notion as this could change during the
lifetime of a NMDevice (which it in fact did not, being construct-only).
Or that the type depends on something more complex, when instead it only
depends on the device type. A non-mutated class property is simpler,
because it's clear that it does not depend on the device instance,
only on the type/class.
Also, `git grep -w rfkill_type` now nicely shows the (few) references to
this variable and its easier to understand.
We don't run glib-mkenums for certain sources like "core" and
"libnm-glib-aux".
These annotations have no effect. Drop them.
They also mess with the automated formatting.
We made the choice, that NMPlatformIPRoute does not contain the actual
route table, instead it contains a "remapped" number: table_coerced.
That remapping done, so that the default (which we want semantically to
be 254, RT_TABLE_MAIN) is numerical zero so that struct initialization
doesn't you require to explicitly set the default.
Hence, we must always distinguish whether we have the real table number
or the "table_coerced", and you must convert back and forth between the
two.
Now, the parameter of nm_l3_config_data_merge() are real table numbers
(as also indicated by their name not having the term "coerced"). So
usually they are set to actually 254.
When we set the field of NMPlatformIPRoute, we must coerce it. This was
wrong, and we would see wrong table numbers in the log:
l3cfg[17b98e59a477b0f4,ifindex=2]: obj-state: track: [2a32eca99405767e, ip4-route, type unicast table 0 0.0.0.0/0 via ...
Fixes: b4aa35e72d ('l3cfg: extend nm_l3cfg_add_config() to accept default route table and metric')
The metered status can depend on the DHCP lease, as we accept the
ANDROID_METERED vendor option. That means, on a DHCP update we need
to re-evaluate the metered flag.
This fixes a potential race, where IPv6 might succeed first and
activation completes (with GUESS_NO metered flag). A subsequent
DHCPv4 update requires to re-evaluate that decision.
Fixes-test: @connection_metered_guess_yes
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1080
When the NM_UNMANAGED_PLATFORM_INIT flag is cleared last in
device_link_changed(), a recheck-assume is scheduled and then the
device goes immediately to UNAVAILABLE. During the state transition,
addresses and routes are removed from the interface. Then,
recheck-assume finds that the device can be assumed but it's too late
since the device was already deconfigured.
This is a problem as the whole point of assuming a device is to
activate a connection while leaving the device untouched.
In the NMCI "dracut_NM_vlan_over_bridge and dracut_NM_vlan_over_bond"
test, NM in real root tries to assume a vlan device that was activated
in initrd. When the interface gets deconfigured in UNAVAILABLE, the
connection to the NFS server breaks and the rootfs becomes
inaccessible.
The fix to this problem is to delay state transitions in
device_link_changed() to a idle handler, so that recheck-assume can
run before.
Fixes-test: @dracut_NM_vlan_over_bridge
Fixes-test: @dracut_NM_vlan_over_bond
https://bugzilla.redhat.com/show_bug.cgi?id=2047302
nm_device_set_unmanaged_by_user_settings() does nothing when the
device is unmanaged by platform-init. Remove the if branch to make
this more explicit.
The ACD state handling is unfortunately very complicated. That is, because
we obviously need to track state about how ACD is going (the acd_data, and
in particular NML3AcdAddrState). Then there are various things that can
happen, which are the AcdStateChangeMode enums. All these state-changes
come together in one function: _l3_acd_data_state_change(), which is
therefore complicated (I don't think that it would become simpler by
spreading this code out to different functions, on the contrary).
Anyway.
So, what happens when we need to reset the n-acd instance? For example,
because the MAC address of the link changed or some error. I guess, we
need to restart probing.
Previously, I think this was not handled properly. We already tried to
fix this several times, the last was commit b331606386 ('l3cfg: on
n-acd instance-reset clear also ready ACD state'). There is still an
issue ([1]).
The bug [1] is, that we are in state NM_L3_ACD_ADDR_STATE_READY, during
ACD_STATE_CHANGE_MODE_TIMEOUT event. That leads to an assertion
failure.
#5 0x00007f23be74698f in g_assertion_message_expr (domain=0x5629aca70359 "nm", file=0x5629aca62aab "src/core/nm-l3cfg.c", line=2395, func=0x5629acb26b30 <__func__.72.lto_priv.4> "_l3_acd_data_state_change", expr=<optimized out>) at ../glib/gtestutils.c:3091
#6 0x00005629ac937e46 in _l3_acd_data_state_change (self=0x5629add69790, acd_data=0x5629add8d520, state_change_mode=ACD_STATE_CHANGE_MODE_TIMEOUT, sender_addr=0x0, p_now_msec=0x7ffded506460) at src/core/nm-l3cfg.c:2395
#7 0x00005629ac939f4d in _l3_acd_data_timeout_cb (user_data=user_data@entry=0x5629add8d520) at src/core/nm-l3cfg.c:1933
#8 0x00007f23be71c5a1 in g_timeout_dispatch (source=0x5629addd7a80, callback=0x5629ac939ee0 <_l3_acd_data_timeout_cb>, user_data=0x5629add8d520) at ../glib/gmain.c:4889
#9 0x00007f23be71bd4f in g_main_dispatch (context=0x5629adc6da00) at ../glib/gmain.c:3337
#10 g_main_context_dispatch (context=0x5629adc6da00) at ../glib/gmain.c:4055
That can only happen, (I think) when we scheduled the timeout
during an earlier ACD_STATE_CHANGE_MODE_INSTANCE_RESET event. Meaning,
we need to handle instance-reset better.
Instead, during instance-reset, switch always back to state PROBING, and
let the timeout figure it out.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2047788
The first point is that ACD timeout is strongly tied to the current state. That
is (somewhat) obvious, because _l3_acd_data_state_set_full() will clear any pending
timeout. So you can only schedule a timeout *after* setting the state,
and setting the next state, will clear the timeout.
Likewise, note that l3_acd_data_state_change() for the event
ACD_STATE_CHANGE_MODE_TIMEOUT asserts that it is only called in the few
states where that is expected. See rhbz#2047788 where that assertion
gets hit.
The first point means that we must only schedule a timer when we are
also in a state that supports that. Add an assertion for that at the
point when scheduling the timeout. The assert at this point is useful,
because it catches the moment when we do the wrong thing (instead of
getting the assertion later during the timeout, when we no longer know
where the error happened).
See-also: https://bugzilla.redhat.com/show_bug.cgi?id=2047788
Now that higher priorities numbers really mean more important,
it seems that the VPN configuration should be rather important.
Bump the number, also in relation to NMDevice's L3ConfigDataType.
It might not matter too much, because usually the VPN tunnel device does
not have NDevice to add other l3cds and those from VPN might be alone.
Except, maybe with routing VPN (libreswan) that is different. Dunno.