The CI_JOB_TIMEOUT variable is the GitLab-defined job timeout in
seconds.
Use this variable in LAVA instead of the separate JOB_TIMEOUT,
which was intended to represent the test phase timeout (job timeout
minus 5 minutes), but was often overlooked.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32609>
Sets the default exit code to 1 to ensure the GitLab job fails when the
LAVA job fails or is interrupted. Adds tests to verify the exit code is
correctly set based on the logs or the lack of them (unexpected
finishing: timeouts and canceling).
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32163>
python-fire auto-converts `item1,item2` into a tuple, but if there is a
dash `-` inside the argument, it treats it as a string.
Let's validate the data, when it comes as a `str` or `tuple`.
For more details, here are the tested scenarios:
| --lava-tags= | Type | Value |
|--------------|-------|---------------------|
| None | bool | True |
| '' | str | '' |
| tag1 | str | "tag1" |
| tag1, | tuple | ("tag1",) |
| tag-1,tag-2 | str | 'tag-1,tag-2' |
| tag1,tag2 | tuple | ("tag1", "tag2") |
| ',' | str | ',' |
| ',,' | str | ',,' |
| 'tag1,,' | str | 'tag1,,' |
See also:
https://google.github.io/python-fire/guide/#argument-parsing
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31882>
Instead of providing a hardcoded set of arguments, allow overlays to be
added to the submitter script. Passing Python dicts as a string
representation and relying on coercion from strings is far from great,
but fire doesn't give anything else, so.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Co-authored-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31882>
We compose the rootfs from a mixture of the base rootfs (exported from
the container build stage, currently lava_build.sh, which can be reused
as long as the container isn't rebuilt), the Mesa build overlay
(exported from the debian-* build job, which can be reused for every job
in that pipeline), and the per-job rootfs (containing job-specific
variables which cannot be reused).
Instead of having LAVA pull the base rootfs and separately downloading
the build/per-job parts on the DUT, get LAVA to compose the whole thing
by using overlays.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Co-authored-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31882>
All the foreground colours pass 1 to ANSI SGR, which sets bold. The
other arguments are either a colour from 30-37 (passed directly), or
38;5;nnn, where nnn is an extended RGB colour. It looks like most of the
definitions were cargo-culted from FG_RED, which correctly sets an
extended colour, because the arguments there were being parsed as
setting blinking, followed by 197 which was ignored as unknown.
Fix them to just set the original definition.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31602>
Set exit_code to 1 in case of an exception; otherwise,
the job exits with 0, and GitLab shows the job as successful.
Fixes: b9cee06f9e ("ci/lava: handle non-zero exit codes")
Signed-off-by: Vignesh Raman <vignesh.raman@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31556>
The LAVA job submitter always exits with code 1, regardless
of the HWCI_TEST_SCRIPT's exit code. This commit fixes the
LAVA job submitter to exit with the actual code returned by
the test.
Signed-off-by: Vignesh Raman <vignesh.raman@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31189>
Sporadically a6xx gpu will fail to recover causing the lava job
a660_vk_full to loop on error messages for three hours before timing
out.
A few sporadic error messages may still be recoverable, but when multiple
errors occur over a short period, successful recovery is unlikely. Parse
the logs to look for repeated error messages within a short time period.
If found, cancel the lava job and rerun it.
Also add unit tests for this behaviour.
cc: mesa-stable
Reported-by: Valentine Burley <valentine.burley@gmail.com>
Acked-by: Daniel Stone <daniel.stone@collabora.com>
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: Deborah Brouwer <deborah.brouwer@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30032>
Fastboot devices need an indirection for creating a boot image via
`mkbootimg`, so we need to propagate the cmdline from LAVA and our extra
arguments to it properly.
This commit fixes it by retrieving the default cmdline from LAVA and
sending it, together with the `extra_nfsroot_args` to the `mkbootimg`
command.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29611>
This commit refactors the exception hierarchy to differentiate between
retriable and fatal errors in the CI pipeline, specifically within the
LAVA job submission process.
A new base class, `MesaCIRetriableException`, is introduced for
exceptions that should trigger a retry of the CI job, while
`MesaCIFatalException` is added for non-recoverable errors that halt the
process immediately.
Additionally, the logic for deciding whether a job should be retried or
not is updated to check for instances of `MesaCIRetriableException`,
improving the robustness and reliability of the CI job execution
strategy.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28778>
This workaround is termporary and will be dropped once the
farm is moved to its permament location.
Signed-off-by: Martin Krastev <martin.krastev@broadcom.com>
Reviewed-by: David Heidelberg <david.heidelberg@collabora.com>
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28261>
The r8152 error detection is now considering any order of the known
patterns to detect variations of the r8152 issues during the test phase.
This includes a small refactoring for eventual new issues.
Additionally, adjusted the timing for setting the `start_time` in
`test_lava_job_submitter.py` to ensure consistency and reliability in
test execution, aligning the start time closer to the job submission
process.
With this fix, the bad state shown in the following job will be
detected:
https://gitlab.freedesktop.org/drm/msm/-/jobs/55033953
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27688>
In our process of monitoring LAVA logs, we typically skip numerous
messages to enhance log clarity. We already exclude `feedback` messages
from display. These messages were just used as a heartbeat signal,
indicating that if we are receiving them, the Device Under Test (DUT) is
active.
Practically, if we continuously receive feedback messages without any
other message level (either `debug` or `target`) for several minutes,
this could be a cause for concern, as it likely indicates the device is
in a kind of livelock state.
Therefore, it is more prudent to ignore feedback messages, as they tend
to occur frequently in unstable scenarios. However, it's important to
note that any other message level will still be considered as a
heartbeat signal.
Real case where several minutes of feedback messages indicate a hang:
https://gitlab.freedesktop.org/mesa/mesa/-/jobs/53546660
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26996>
This week we found that the r8152 issue can happen during the boot
phase, make the necessary adjustments to detect it.
https://gitlab.freedesktop.org/vigneshraman/linux/-/jobs/53651940
Notes:
- The kernel messages during the boot phase is being redirected to the
feedback messages due to the namespaces from the SSH job.
- Update the unit tests:
- Add boot phase detection
- Correctly set the boot phase when mocking LogFollower
Reported-by: Vignesh Raman <vignesh.raman@collabora.com>
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
We were just detecting if a log like
[ 143.080663] r8152 2-1.3:1.0 eth0: Tx status -71
happened once before
[ 316.389695] nfs: server 192.168.201.1 not responding, still trying
But we can use a counter to be more assured that the device is
struggling to recover and we can add let this detection happen during
the boot phase.
This mimics how other freedreno devices deal with this problem, see
`cros_servo_run.py:64` for example.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
We need update kernel often. We need test kernel changes often.
Introduced `KERNEL_EXTERNAL_TAG` to differ between `KERNEL_TAG` which is
also used to rebuild the containers. We don't need rebuild containers
for the external kernel, so this way we don't have to.
Updating kernel goes wruuuuuum.
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23563>
Add two unit tests related to the LAVA job definition.
test_generate_lava_job_definition_sanity checks for the most important
fields, deploy actions, namespaces etc.
test_lava_job_definition compares the generated definition with static
skeleton YAML files committed inside tests/data folder.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
Simplify both UART and SSH job definitions module to share common
building blocks themselves.
- generate_lava_yaml_payload is now a LAVAJobDefinition method, so
dropped the Strategy pattern between both modules
- if SSH is supported and UART is not enforced, default to SSH
- when SSH is enabled, wrap the last deploy action to run the SSH server
and rewrite the test actions, which should not change due to the boot
method
- create a constants module to load environment variables
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
To absorb complexity from the building blocks to generate job
definitions for each mode:
- fastboot-uart
- uboot-uart
- uboot-ssh
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
Break it to smaller pieces with variable size (fastboot has 3 deploy
actions and uboot only one) to build the base definition nicely in the
end.
Extract kernel/dtb attachment and init_stage1 extraction into functions
to be later reused by SSH job definition.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
The LAVA job submitter is being used by other fd.o projects, such as
`drm/ci`, so let's make it generate more generic job definitions and
test cases.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
At least these two can be easily used in bare-metal or Labgrid setups.
Currently I already have MR for implementing these for Labgrid.
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24665>
Use a lightweight container for ping, ssh, curl and bash support.
Also use an image located at fd.o infrastructure, since we are having
some issues with Collabora's gitlab one lately.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23534>
Our LAVA farm is currently experiencing issues with running and pulling
docker. LAVA has been detecting (with a low rate) timeouts during these
commands, causing some jobs to fail with infrastructure errors.
Increasing the failure_retry will make the job retry run the container
when LAVA detects the failure without losing its place in the job queue.
We are currently investigating why docker times out. But, when LAVA
fails to detect it, we cancel the job on our side and resubmit it to the
job queue. For more information, please refer to following dashboard:
https://ci-stats-grafana.freedesktop.org/goto/VjZvaA_4z?orgId=1
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23534>
To ensure proper SSH functioning, the device IP should be added to the
LAVA device dictionary by setting device_ip. LAVA will then map the
value to lava-target-ip.
meson-g12b-a311d-khadas-vim3-cbg-4 has an IP in the dictionary, while
sun50i-h6-pine-h64-cbg-1 and meson-g12b-a311d-khadas-vim3-cbg-2 do not.
Since some devices are not yet properly configured, and device tag
fixing is not an option here, let's temporarily switch to a job
definition based on UART, until it gets fixed.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22870>
Make hide_sensitive_data work in a block fashion, not only hiding the
JWT line, since these tokens are huge, it may break the line when it
extrapolates the YAML dump width.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22870>
Some LAVA signals have similar log outputs and the regex associated with
the log section may conflict. Use the policy of the first regex as the
chosen one, otherwise one line may produce two Gitlab sections in a row.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22870>