Heavily inspired by systemd ([1]).
We now also have nm_random_get_bytes{,_full}() and
nm_random_get_crypto_bytes(), like systemd's random_bytes()
and crypto_random_bytes(), respectively.
Differences:
- unlike systemd's random_bytes(), our nm_random_get_bytes_full()
also estimates whether the output is of high quality. The caller
may find that interesting. For that reason, we first try
getrandom(GRND_NONBLOCK) before getrandom(GRND_INSECURE). That is
the reverse of systemd's random_bytes(), because we want to find
out whether we can get good random numbers. In most cases, the kernel
should already have entropy, and it makes no difference.
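The ordering described in this item can be sketched as follows. This is a minimal illustration, not the actual nm_random_get_bytes_full() code: the name sketch_get_bytes() and its signature are made up for the example, and the GRND_INSECURE define is only a fallback for older libc headers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

#ifndef GRND_INSECURE
#define GRND_INSECURE 0x0004 /* Linux 5.6+, may be missing from older headers */
#endif

/* Hypothetical helper. Returns the number of bytes written to buf
 * (possibly fewer than len). *high_quality is true only if every
 * byte came from getrandom(GRND_NONBLOCK). */
static size_t
sketch_get_bytes(void *buf, size_t len, bool *high_quality)
{
    unsigned char *p     = buf;
    size_t         done  = 0;
    unsigned int   flags = GRND_NONBLOCK;

    *high_quality = true;
    while (done < len) {
        ssize_t r = getrandom(p + done, len - done, flags);

        if (r > 0) {
            done += (size_t) r;
            continue;
        }
        if (r < 0 && errno == EINTR)
            continue;
        if (flags == GRND_NONBLOCK) {
            /* EAGAIN (pool not yet initialized) or another failure:
             * retry with GRND_INSECURE and remember that we can no
             * longer vouch for the quality. */
            flags         = GRND_INSECURE;
            *high_quality = false;
            continue;
        }
        break; /* GRND_INSECURE failed too; the caller needs a fallback. */
    }
    if (done < len)
        *high_quality = false;
    return done;
}
```

A caller would check the returned count and fill any remaining bytes from a fallback generator, treating the result as low quality in that case.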
Beyond that, the code is heavily reworked. It should now be correct
and easy to understand.
There is also a major bugfix here. Previously, if getrandom() failed
with ENOSYS and we fell back to /dev/urandom, we would assume that we
have high quality random numbers. That assumption is not warranted.
Instead, we now poll on /dev/random to find out.
[1] a268e7f402/src/basic/random-util.c (L81)
nm_utils_random_bytes() is supposed to give us good random numbers from
the kernel. It guarantees to always provide some bytes, and it has a
boolean return value that estimates whether the bytes are good
randomness. In practice, most callers ignore that return value, because
what would they do about it anyway?
Of course, we want to primarily use getrandom() (or "/dev/urandom"). But
if we fail to get random bytes from them, we have a fallback path that
tries to generate "random" bytes.
It does so by initializing a global seed from various sources and then
repeatedly sha256 hashing the buffer in a loop. That's certainly neither
efficient nor elegant, but we are already in a fallback path.
Still, we can do slightly better. Instead of just using the global state
and updating it (entirely deterministically), every time also mix in
the results from getrandom() and a current timestamp. The idea is that if
a virtual machine gets cloned, we don't want our global state to keep
giving the same random numbers in both clones. Mixing in getrandom()
helps in particular, because it might handle that case even if it
doesn't have good entropy.
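As a rough illustration of mixing fresh inputs into the fallback state on every call, consider the sketch below. It is not the NetworkManager code: sketch_state and sketch_next() are hypothetical names, and the mixer is 64-bit FNV-1a, a non-cryptographic stand-in used only to keep the example self-contained where the real code uses sha256.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in mixer: 64-bit FNV-1a. NOT cryptographic; the code described
 * in the text hashes with sha256 instead. */
static uint64_t
sketch_fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t               i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/* Hypothetical global fallback state. */
struct sketch_state {
    uint64_t h;
};

/* Advance the state. Besides the previous state, absorb whatever bytes
 * getrandom() gave us this time (possibly none) and a timestamp, so that
 * two copies of the same state diverge as soon as either input differs. */
static uint64_t
sketch_next(struct sketch_state *st,
            const void          *fresh,
            size_t               fresh_len,
            uint64_t             now_nsec)
{
    st->h = sketch_fnv1a(st->h, fresh, fresh_len);
    st->h = sketch_fnv1a(st->h, &now_nsec, sizeof(now_nsec));
    return st->h;
}
```

With a purely deterministic update, two clones of one state would emit identical sequences; here they diverge once the timestamp or the kernel-provided bytes differ.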
We use clang-format for automatic formatting of our source files.
Since clang-format is actively maintained software, the actual
formatting depends on the version of clang-format in use. That is
unfortunate and painful, but really unavoidable unless clang-format
were strictly bug-compatible across versions.
So the version that we must use is the one from the current Fedora release,
which is also what our gitlab-ci tests. Previously, we were using Fedora 34
with clang-tools-extra-12.0.1-1.fc34.x86_64.
Now that Fedora 35 comes along with version "13.0.0~rc1-1.fc35", we need to
update our formatting.
An alternative would be to freeze on version 12, but that has different
problems (for example, it's cumbersome to rebuild clang 12 on Fedora 35, and
it would be cumbersome for our developers who are on Fedora 35 to use a
clang that they cannot easily install).
The (differently painful) solution is to reformat from time to time, as we
switch to a new Fedora (and thus clang) version.
Usually we would expect that such a reformatting brings minor changes.
But this time, the changes are huge. That is mentioned in the release
notes [1] as
Makes PointerAligment: Right working with AlignConsecutiveDeclarations. (Fixes https://llvm.org/PR27353)
[1] https://releases.llvm.org/13.0.0/tools/clang/docs/ReleaseNotes.html#clang-format
src/libnm-glib-aux/nm-random-utils.c:112:12: error: ignoring return value of 'getrandom' declared with attribute 'warn_unused_result' [-Werror=unused-result]
Fixes: 18597e33cb ('glib-aux: also use getrandom() for seeding pseudo random generator')
- the return value of getrandom() is ssize_t.
- handle EAGAIN to indicate low entropy.
- treat a return value of zero the same as any other
low "n", by falling back to bad random bytes.
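The three points above can be sketched as a small classification of a getrandom() result. The enum and the function sketch_classify() are hypothetical and only illustrate the decision; the actual code handles this inline.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical outcome of one getrandom() call. */
enum sketch_result {
    SKETCH_GOOD,        /* got all requested bytes */
    SKETCH_SHORT,       /* got fewer bytes (including zero): fill the rest via fallback */
    SKETCH_LOW_ENTROPY, /* EAGAIN: the kernel has no good entropy yet */
    SKETCH_FAILED,      /* any other error, for example ENOSYS */
};

/* r is the ssize_t return value of getrandom(), errsv the errno value
 * saved right after the call, requested the number of bytes asked for. */
static enum sketch_result
sketch_classify(ssize_t r, int errsv, size_t requested)
{
    if (r < 0)
        return errsv == EAGAIN ? SKETCH_LOW_ENTROPY : SKETCH_FAILED;
    if ((size_t) r >= requested)
        return SKETCH_GOOD;
    /* A return value of zero is not special: it is just the shortest
     * possible short read. */
    return SKETCH_SHORT;
}
```

Keeping the return value in an ssize_t (rather than an int or an unsigned type) is what makes the r < 0 check reliable.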
We make an effort to get a better fallback case with
_bad_random_bytes().
Also make an effort to get good randomness in the first place, even if
we compile against libc headers that don't provide getrandom(). This
isn't really ugly, because for a long time glibc was reluctant to
add a getrandom() wrapper, and using syscall() was the way to go.
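Calling getrandom() through syscall() when libc provides no wrapper might look like the sketch below. The names sketch_getrandom() and SKETCH_GRND_NONBLOCK are made up for the example; the flag value 0x0001 is the one from the Linux UAPI headers, defined locally on the assumption that <sys/random.h> is unavailable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for the syscall() declaration in <unistd.h> */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Same value as GRND_NONBLOCK in the Linux UAPI headers. */
#define SKETCH_GRND_NONBLOCK 0x0001

/* Hypothetical wrapper: invoke the getrandom system call directly.
 * If the syscall number is unknown at build time, fail with ENOSYS so
 * that callers take the same path as on a kernel without getrandom(). */
static ssize_t
sketch_getrandom(void *buf, size_t len, unsigned int flags)
{
#ifdef SYS_getrandom
    return (ssize_t) syscall(SYS_getrandom, buf, len, flags);
#else
    (void) buf;
    (void) len;
    (void) flags;
    errno = ENOSYS;
    return -1;
#endif
}
```

The wrapper keeps the return type and errno conventions of the libc function, so the calling code does not need to know which variant it got.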
nm_utils_random_bytes() tries to get good randomness. If it fails, we still
try our own approach, but also signal that the returned numbers are bad.
In practice, none of the callers cares about the return value, because they
wouldn't know what to do in case of bad randomness (abort() is not an
option, retrying is not expected to help, and sending an email to the
admin isn't going to help either). So the fallback case really should try
its best.
The fallback case depends on a good random seed and a good pseudorandom
number generator.
Getting a good seed is in reality impossible after the kernel let us down.
That is part of the problem, but we try our best.
The other part is to use a cryptographic pseudorandom number generator.
GRand uses a Mersenne Twister, so that is not good enough. In this low
level code we also cannot call gnutls/nss, because usually we don't have
that dependency. Maybe we could copy&paste the chacha20 implementation;
it's simple enough and has a compatible license. That might be good, but
instead we cook up our own by adding some sha256 into the mix. This is
fallback code after all, and we want to try hard, but not so hard as to
add chacha20 to NetworkManager.
So what we do is use a well-seeded GRand instance and XOR its
output with a sha256 digest of the state. It's probably slow, but
performance is not the issue in this code path.
I missed that we already have a gettid() wrapper. Drop the duplicate
again and use nm_utils_gettid().
Fixes: e874c5bf6b ('random: Provide missing gettid() declaration')
g_rand_new() reads /dev/urandom and falls back to timestamp and pid.
At this point, we have already tried getrandom() and /dev/urandom
unsuccessfully, so trying that again doesn't seem promising.
Try harder to get good random seeds for our GRand instance.
Have one global instance, that gets seeded with various things that come
to mind. The random sequence of that instance is then used to initialize
the thread-local GRand instances.
Maybe this is all snake oil. If we fail to get good randomness by using
kernel API, what can we do? But really, callers also don't know how they
should handle a failure to get random data (short of abort() or
logging), so there is value in nm_utils_random_bytes() really trying
the best it can, and callers pretending that it doesn't fail.
This aims to improve the fallback case.