iproute2 and the kernel accept 0 as valid rto_min value:
# ip route add 172.16.0.1 dev enp1s0 rto_min 0ms
# ip route show
172.16.0.1 dev enp1s0 scope link rto_min lock 0ms
Even if a value of 0ms would not be useful in practice, it is better
to exactly track what kernel reports, instead of assuming that when
the value is zero it is "not set".
The function introduced queries the FDB table via netlink socket. It
accepts a list of ifindexes to filter out the FDB content not related to
it. It returns an array of MAC addresses.
To cltarify this function is unusually exposed directly on
nm-linux-platform.h as we don't want this be part of the whole
NMPlatform object or cache. This, is an exception to the rule to
simplify the integration of this functionality on NetworkManager.
In addition, it also doesn't use the async mechanism that is widely used
on netlink communication across nm-linux-platform. Again, the reason is
to simplify its use, as async communication won't provide a benefit to
the use cases we have planned for this, i.e balance-slb RARP announcing.
By default, on reapply we were only syncing the main routes table. This
causes that routes added by NM to other tables are not removed on
reapply. This was done to preserve routes added externally, but routes
added by NM itself should be removed.
Add a new route table syncing mode "main + NM routes". This mode
maintains the normal behaviour of syncing completely the main table,
and for other tables removes only routes that were added by us, leaving
the rest untouched. Use this mode by default, as this is what a user
would expect on reapply.
Note: this might not work if NM is restarted between the profile being
modified and the reapply, because NM forgets what routes were added by
itself because of the restart. This is a rare corner case, though.
Use the D-Bus property "VersionInfo" to expose a capability flag
indicating that this bug is fixed. It is the first capability that we
expose in this way. However, it is convenient to do it this way as it's
something that clients like nmstate needs to know, so they can decide
whether a conn down is needed or not. It is not enough to decide that by
version number because it might be fixed via a downstream patch in distros
like RHEL.
https://issues.redhat.com/browse/RHEL-67324https://issues.redhat.com/browse/RHEL-66262
Fixes: e9c17fcc9b ('l3cfg: default to 'main' route table sync mode')
The difference between FULL and ALL was not obvious without reading the
documentation. Moreover, a new mode is going to be introduced so the
confusion could grow. Rename to a more explicit name.
Introducing support of ethtool FEC mode:
D-BUS API: `fec-mode: uint32_t`.
Keyfile:
```
[ethtool]
fec-mode=<uint32_t>
```
nmcli: `ethtool.fec-mode` allowing values are any combination of:
* auto
* off
* rs
* baser
* llrs
Unit test cases included.
Resolves: https://issues.redhat.com/browse/RHEL-24055
Signed-off-by: Gris Ge <fge@redhat.com>
This patch add support to IPVLAN interface. IPVLAN is a driver for a
virtual network device that can be used in container environment to
access the host network. IPVLAN exposes a single MAC address to the
external network regardless the number of IPVLAN device created inside
the host network. This means that a user can have multiple IPVLAN
devices in multiple containers and the corresponding switch reads a
single MAC address. IPVLAN driver is useful when the local switch
imposes constraints on the total number of MAC addresses that it can
manage.
Move the static _ip4_address_is_link_local() check to a new global
nm_platform_ip4_address_is_link_local() helper so we can check if
an IPv4 is link local in other files
If the socket's RX buffer is full it's probably because other
process is doing lot of changes very quickly, faster than we
can process them. Let's give the writer a small time to finish:
1. Avoid contending the kernel's RTNL lock, so we don't make
the whole situation even worse and it can finish earlier.
2. Avoid having to resync again and again due to trying to
resync while the writer is still doing quick changes, so
we are unable to catch up yet.
This won't help if this situation takes a long time or is
continuous, but that's unlikely to happen, and if it does,
it's the writer's fault for starving the whole system.
There is no need to progresively increase the backoff time
for the same reason: if this situation takes lot of time,
it's the writer's fault. It's neither a good idea because the whole NM
process will end being sleeping long times, not doing anything at all,
without being able to react when the Netlink messages burst stops.
Add a function to compare two arrays of NMPlatformBridgeVlan. It will
be used in the next commit to compare the VLANs from platform to the
ones we want to set.
To compare in a performant way, the vlans need to be normalized (no
duplicated VLANS, ranges into their minimal expression...). Add the
function nmp_utils_bridge_vlan_normalize.
Co-authored-by: Íñigo Huguet <ihuguet@redhat.com>
Currently, nm_platform_link_set_bridge_vlans() accepts an array of
pointers to vlan objects; to avoid multiple allocations,
setting_vlans_to_platform() creates the array by piggybacking the
actual data after the pointers array.
In the next commits, the array will need to be manipulated and
extended, which is difficult with the current structure. Instead, pass
separately an array of objects and its size.
In case the platform fails dumping a specific route protocol, retry
multiple times. If all attempts fail, emit a warning and proceed as
there is nothing more to do.
When doing a dump of routes, we want to exclude routes having
protocols we do not care about. Since the netlink socket has
STRICT_CHK enabled, we can request multiple dumps for the protocols we
need.
While doing 6 dumps is less efficient than doing 1, it normally
doesn't matter. However, the new implementation is more efficient when
there are e.g. millions of BGP routes that can be excluded from the
results.
Introduce an array of tracked route protocols that will be used in the
next commit. To have the list of protocols defined in a single place,
define a macro.
In the future we might want to specify filters when requesting netlink
dumps; this requires that strict check is enabled on the socket.
When enabling strict check, we need to pass a full struct in the
netlink message, otherwise kernel ignores it.
This commit doesn't change behavior.
When we recibe a Netlink message with a "route change" event, normally
we just ignore it if it's a route that we don't track (i.e. because of
the route protocol).
However, it's not that easy if it has the NLM_F_REPLACE flag because
that means that it might be replacing another route. If the kernel has
similar routes which are candidates for the replacement, it's hard for
NM to guess which one of those is being replaced (as the kernel doesn't
have a "route ID" or similar field to indicate it). Moreover, the kernel
might choose to replace a route that we don't have on cache, so we know
nothing about it.
It is important to note that we cannot just discard Netlink messages of
routes that we don't track if they has the NLM_F_REPLACE. For example,
if we are tracking a route with proto=static, we might receive a replace
message, changing that route to proto=other_proto_that_we_dont_track. We
need to process that message and remove the route from our cache.
As NM doesn't know what route is being replaced, trying to guess will
lead to errors that will leave the cache in an inconsistent state.
Because of that, it just do a cache resync for the routes.
For IPv4 there was an optimization to this: if we don't have in the
cache any route candidate for the replacement there are only 2 possible
options: either add the new route to the cache or discard it if we are
not interested on it. We don't need a resync for that.
This commit is extending that optimization to IPv6 routes. There is no
reason why it shouldn't work in the same way than with IPv4. This
optimization will only work well as long as we find potential candidate
routes in the same way than the kernel (comparing the same fields). NM
calls to this "comparing by WEAK_ID". But this can also happen with IPv4
routes.
It is worth it to enable this optimization because there are routing
daemons using custom routing protocols that makes tens or hundreds of
updates per second. If they use NLM_F_REPLACE, this caused NM to do a
resync hundreds of times per second leading to a 100% CPU usage:
https://issues.redhat.com/browse/RHEL-26195
An additional but smaller optimization is done in this commit: if we
receive a route message for routes that we don't track AND doesn't have
the NLM_F_REPLACE flag, we can ignore the entire message, thus avoiding
the memory allocation of the nmp_object. That nmp_object was going to be
ignored later, anyway, so better to avoid these allocations that, with
the routing daemon of the above's example, can happen hundreds of times
per second.
With this changes, the CPU usage doing `ip route replace` 300 times/s
drops from 100% to 1%. Doing `ip route replace` as fast as possible,
without any rate limitting, still keeps NM with a 3% CPU usage in the
system that I have used to test.
Command NL80211_CMD_GET_WIPHY without any flag only returns channels
in the 2 GHz and 5 GHz bands, for backwards compatibility with old
userspace tools. To get the full list we need to pass attribute
NL80211_ATTR_SPLIT_WIPHY_DUMP (added in Linux 3.9 released in 2013),
and allow the handler to be called multiple times.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1500
Probably not all drivers and devices return all parameters. Set them to
"unknown" if they are missing and let the caller to decide what to do.
In our case, if the sriov setting has a value different to "preserve" it
will try to set it (and will probably fail). But if the missing
parameter is set to "preserve" in the sriov setting we can continue,
just ignoring it.
(cherry picked from commit 7346c5b556)
If sriov_totalvfs file doesn't exist we don't need to consider it a
fatal failure. Try to create the required number of VFs as we were doing
before.
Note: at least netdevsim doesn't have sriov_totalvfs file, I don't know
if there are real drivers that neither has it.
(cherry picked from commit 27eaf34fcf)
Set these parameters according to the values set in the new properties
sriov.eswitch-inline-mode and sriov.eswitch-encap-mode.
The number of parameters related to SR-IOV was becoming too big.
Refactor to group them in a NMPlatformSriovParams struct and pass it
around.
(cherry picked from commit 4669f01eb0)
It is not safe to change the eswitch mode when there are VFs already
created: it often fails, or even worse, doesn't fail immediatelly but
there are later problems with the VFs.
What is supposed to be well tested in all drivers is to change the
eswitch mode with no VFs created, and then create the VFs, so let's set
num_vfs=0 before changing the eswitch mode.
As we want to change num_vfs asynchronously in a separate thread, we
need to do a multi-step process with callbacks each time that a step
finish (before it was just set num_vfs asynchronously and invoke the
callback when it's done).
This makes link_set_sriov_params_async to become even larger and more
complex than it already was. Refactor it to make it cleaner and easier
to follow, and hopefully less error prone, and implement that multi-step
process.
(cherry picked from commit 770340627b)
Add support for Devlink, which is just another family of Generic Netlink
like nl80211. Implement get_eswitch_mode and set_eswitch_mode to allow
changing between legacy SRIOV and switchdev modes.
Devlink's purpose is to allow querying and configuring stuff related to
a piece of hardware but not to any of the usual Linux device classes.
For example, nowadays the Smart NICs normally allow to change the
eswitch mode per PF, because their hardware implements one eswitch per
PF, but future models might have a single eswitch for all the physical
and virtual ports of the NIC allowing more advanced bridge offloads.
Regarding the above example, for the moment we only support PCI network
devices with the "one eswitch per PF" model. The reason is that currently
NM only knows about netdevs so dealing with "devlink devices" that
doesn't map 1-1 with a netdev would require new mechanisms to understand
what they are and their relation with the netdevs that NM manage. We
will deal with that use cases when they arise and we have more
information about the right way to support them.
(cherry picked from commit f31d29bbb7)
When doing reapply on linux bridge interface, NetworkManager will reset
the VLAN filtering and default PVID which cause PVID been readded to all
bridge ports regardless they are managed by NetworkManager.
This is because Linux kernel will re-add PVID to bridge port upon the
changes of bridge default-pvid value.
To fix the issue, this patch introduce netlink parsing code for
`vlan_filtering` and `default_pvid` of NMPlatformLnkBridge, and use that
to compare desired VLAN filtering settings, skip the reset of VLAN
filter if `default_pvid` and `vlan_filtering` are unchanged.
Signed-off-by: Gris Ge <fge@redhat.com>
(cherry picked from commit 02c34d538c)
The supervision address is read-only. It is constructed by kernel and
only the last byte can be modified by setting the multicast-spec as
documented indeed.
As 1.46 was not released yet, we still can drop the whole API for this
setting property. We are keeping the NMDeviceHsr property as it is a
nice to have for reading it.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1823
Fixes: 5426bdf4a1 ('HSR: add support to HSR/PRP interface')
This option was only introduced only to allow keeping the old behavior
in RHEL7, while the default order was changed from 'ifindex' to 'name'
in RHEL8. The usefulness of this option is questionable, as 'name'
together with predictable interface names should give predictable order.
When not using predictable interface names, the name is unpredictable
but so is the ifindex.
https://issues.redhat.com/browse/NMT-926https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1814
This patch add support to HSR/PRP interface. Please notice that PRP
driver is represented as HSR too. They are different drivers but on
kernel they are integrated together.
HSR/PRP is a network protocol standard for Ethernet that provides
seamless failover against failure of any network component. It intends
to be transparent to the application. These protocols are useful for
applications that request high availability and short switchover time
e.g electrical substation or high power inverters.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1791
The hash/cmp functions were wrong with respect to IPv4 routes. Fix that.
- the weight was cast to a guint8, which is too small to hold all
values.
- NM_PLATFORM_IP_ROUTE_CMP_TYPE_ID comparison would normalize a zero
weight to one
NM_CMP_DIRECT(NM_MAX(a->weight, 1u), NM_MAX(b->weight, 1u));
That was very wrong.
- the handling of the weight depends on the n_nexthops parameter.
See _ip4_route_weight_normalize().
The remarkable thing is that upper layers find it useful to track IPv4
single-hop routes with a non-zero weight. Consequently, this is honored
by NM_PLATFORM_IP_ROUTE_CMP_TYPE_ID to treat single-hop routes
different, when their weight differs. However, adding such a route in
kernel will not work. To kernel, single-hop routes have no weight. While
the route exists as distinct in our hash tables, according to the
implemented identity, it never exists in kernel (or NMPlatform cache).
Well, you can call nm_platform_ip_route_add() with such a route, but the
result will have a weight of zero (making it a different route). See also
nm_platform_ip_route_normalize().
This works all mostly fine. The only thing to take care is that you
cannot look into the NMPlatform cache and ever find a IPv4 single-hop
route with a non-zero weight. If you preform such a lookup, realize that
such routes don't exist in platform. You can however normalize the
weight to zero first (nm_platform_ip_route_normalize()) and look for a
similar route with a zero weight.
Fixes: 1bbdecf5e1 ('platform: manage ECMP routes')