mesa/.gitlab-ci/lava
Deborah Brouwer 816c835c84 ci/lava: Detect a6xx gpu recovery failures
Sporadically, an a6xx GPU fails to recover, causing the LAVA job
a660_vk_full to loop on error messages for three hours before timing
out.

The GPU may still recover after a few sporadic error messages, but when
multiple errors occur within a short period, successful recovery is
unlikely. Parse the logs for repeated error messages within a short time
window. If they are found, cancel the LAVA job and rerun it.
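
The detection described above can be sketched as a sliding-window counter.
This is a minimal illustration, not the code from this commit: the class
name RepeatedErrorDetector, the marker string "gpu recovery failed", and
the thresholds (3 errors within 3 minutes) are all assumptions chosen for
the example.

```python
from collections import deque
from datetime import datetime, timedelta


class RepeatedErrorDetector:
    """Hypothetical helper: flags repeated error lines within a time window."""

    def __init__(self, marker, max_errors=3, window=timedelta(minutes=3)):
        self.marker = marker          # substring identifying an error line (assumed)
        self.max_errors = max_errors  # errors needed to declare failure (assumed)
        self.window = window          # how far back errors are counted (assumed)
        self.hits = deque()           # timestamps of recent matching lines

    def feed(self, timestamp, line):
        """Record one log line; return True once the threshold is reached."""
        if self.marker not in line:
            return False
        self.hits.append(timestamp)
        # Drop matches that are older than the watch window.
        while timestamp - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) >= self.max_errors


if __name__ == "__main__":
    start = datetime(2024, 7, 23, 12, 0, 0)
    detector = RepeatedErrorDetector("gpu recovery failed")
    for offset in range(3):
        failed = detector.feed(start + timedelta(minutes=offset),
                               "gpu recovery failed")
    # A caller would cancel and resubmit the job when this is True.
    print("cancel and rerun" if failed else "keep going")
```

Evicting old timestamps on every match means that isolated errors spread
over a long run never reach the threshold, while a burst of errors does.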

Also add unit tests for this behaviour.

cc: mesa-stable

Reported-by: Valentine Burley <valentine.burley@gmail.com>
Acked-by: Daniel Stone <daniel.stone@collabora.com>
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: Deborah Brouwer <deborah.brouwer@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30032>
(cherry picked from commit 72c182f873)
2024-07-23 22:28:07 +02:00
utils ci/lava: Detect a6xx gpu recovery failures 2024-07-23 22:28:07 +02:00
__init__.py ci/lava: Create LogFollower and move logging methods 2022-07-07 00:28:53 +00:00
exceptions.py ci/lava: Introduce unretriable exception handling 2024-04-22 21:20:07 +00:00
lava-gitlab-ci.yml ci: kernel stored in a different s3 bucket 2024-05-15 15:37:05 +02:00
lava-pytest.sh ci: enable shellcheck on whole .gitlab-ci 2023-05-25 16:06:53 +02:00
lava-submit.sh ci: Use id_tokens for JWT auth 2024-05-15 15:37:05 +02:00
lava_job_submitter.py ci/lava: Fix how exception entry in structured log 2024-04-22 21:20:07 +00:00
requirements-test.txt ci/lava: Add LavaFarm class to find LAVA farm from runner tag 2023-02-16 13:08:41 +00:00
requirements.txt ci/lava: Use python-fire in job submitter 2023-04-19 14:36:37 +00:00