source link: https://bugzilla.mozilla.org/show_bug.cgi?id=1760074
Bug 1760074 (closed): opened 11 months ago, closed 6 days ago

Frequent browser/components/migration/tests/unit/test_MigrationUtils_timedRetry.js | single tracking bug

Categories

(Firefox :: Migration, defect, P3)

Tracking

(bug RESOLVED as FIXED)

RESOLVED FIXED
Tracking Status
firefox111 --- fixed

People

(Reporter: jmaher, Assigned: jjalkanen)

References

Details

(Keywords: intermittent-failure, intermittent-testcase, Whiteboard: [stockwell unknown])

No description provided.
Comment hidden (Intermittent Failures Robot)

Hi Jari, could you take a look at this?

The failing line in the error log says, "testgetRowsFromDBWithoutLocksRetries - [testgetRowsFromDBWithoutLocksRetries : 271] A promise chain failed to handle a rejection: Error(s) encountered during statement execution: disk I/O error - stack: handleCompletion@resource://gre/modules/Sqlite.jsm:963:25" and I see you most recently made some changes to Sqlite.jsm.

Flags: needinfo?(jjalkanen)

Hi Jared, thank you for reporting.

Do you happen to have any estimate of how likely it is that a user will encounter this issue, and how severe the impact is?

At first glance, the underlying disk I/O issue might be pronounced on our CI, but the promise chain failure seems like something that shouldn't happen.

Flags: needinfo?(jjalkanen) → needinfo?(jaws)

It has failed 5 times out of 3304 pushes on CI, and all 5 are in debug builds. If the bug does happen in production, it would likely surface on much slower machines with performance akin to our debug builds.

5/3304 is not even close to 1% of runs, so I don't think it is likely a user will experience this; rather, if they do, they will probably have other issues happening at the same time too.

If you don't see an obvious error here, I would de-prioritize this until we see the failure rate increase significantly.

Flags: needinfo?(jaws)

Keep in mind that 3304 pushes includes all of try, which we do not annotate. In general we have ~100 pushes/day on integration/release branches, and a test only runs on a subset of those, probably more like 20-30%, which makes this more like 6 failures in 140-180 pushes.

Hey Jari,

Is this revealing a real issue with the SQLite.jsm changes that users might actually hit in the wild, or do you think this is purely a test-only issue?

Flags: needinfo?(jjalkanen)

in the last week I queried some data and found that this test was run on macosx/debug:

branch     pass  fail  failure rate
try        112   14    11.1%
autoland   156   16    9.3%
central    48    6     11.1%
total      316   36    10.2%

So I went through the code paths which are relevant for this test and out of the two places where an uncaught exception can be thrown, it seems that one is currently untriggerable, and the more likely explanation is the expected error when the retry limit is exceeded.

Additionally, in the code there is no obvious explanation for why this issue would take place specifically on mac.

The most likely cause is high storage contention on our mac test jobs in CI. It is a common issue, and some ways to mitigate it are to optimize and clean up the drive, to have a larger pool of drives available for the jobs and, if possible, to use better hardware or a better storage driver.

Returning an error under these circumstances seems like a reasonable thing for the code to do; the only thing I would like to change is the mechanism that reports only the first encountered error which causes a retry. In this case, we had both the expected "table doesn't exist" error, which is part of the test, and the disk I/O error, which gets dropped even though it is the actual cause of the test failure.
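The reporting change described above can be sketched in plain JavaScript. This is not the actual Sqlite.jsm code; `readWithRetries` and its parameters are hypothetical names, and the shape is only a minimal illustration of keeping every swallowed error so the surfaced failure names the real cause (the disk I/O error) rather than only the first, expected one:

```javascript
// Hypothetical sketch, not the real Sqlite.jsm retry path: collect every
// error encountered across retries, and include them all in the final
// failure instead of reporting only the first one.
async function readWithRetries(readRows, { retries = 10, intervalMs = 100 } = {}) {
  const errors = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await readRows(); // success: return immediately
    } catch (ex) {
      errors.push(ex); // keep all causes, not just the first
      await new Promise(resolve => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(
    `Retry limit exceeded after ${errors.length} attempts: ` +
      errors.map(e => e.message).join("; ")
  );
}
```

With this shape, a run that hits both "no such table" and "disk I/O error" would mention both in the thrown message, instead of dropping the I/O error.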

Flags: needinfo?(jjalkanen)
Assignee: nobody → jjalkanen
Status: NEW → ASSIGNED

(In reply to Jari Jalkanen from comment #9)

So I went through the code paths which are relevant for this test and out of the two places where an uncaught exception can be thrown, it seems that one is currently untriggerable, and the more likely explanation is the expected error when the retry limit is exceeded.

Additionally, in the code there is no obvious explanation for why this issue would take place specifically on mac.

The most likely cause is high storage contention on our mac test jobs in CI. It is a common issue, and some ways to mitigate it are to optimize and clean up the drive, to have a larger pool of drives available for the jobs and, if possible, to use better hardware or a better storage driver.

Returning an error under these circumstances seems like a reasonable thing for the code to do; the only thing I would like to change is the mechanism that reports only the first encountered error which causes a retry. In this case, we had both the expected "table doesn't exist" error, which is part of the test, and the disk I/O error, which gets dropped even though it is the actual cause of the test failure.

So AIUI the attached patch won't address the intermittent. Do we know why the issue started with the sqlite update when it wasn't an issue previously? The same frontend code will have been running on the same hardware so I'm not sure why the same issue wouldn't have happened before? Has someone done a sanity check to see that the code in general still works? (In particular: can we still import from Chrome while it is running and has these DBs open?)

Joel, what options do we have for the hardware/CI resolutions Jari has suggested (penultimate paragraph) ?

Flags: needinfo?(jmaher)
Flags: needinfo?(jjalkanen)
Keywords: leave-open

This is failing on OS X 10.15, which is Intel-based OS X running on Mac Minis from a couple of years ago (revision 8 Mac Minis). So these are fairly modern; here are the specs:
https://wiki.mozilla.org/TestEngineering/Performance/Platforms#Apple_Mac_Mini_R8

Possibly some cleanup of temp files on the disk could take place, or something else within Firefox could be interfering with the table?

Flags: needinfo?(jmaher)
Summary: Intermittent browser/components/migration/tests/unit/test_MigrationUtils_timedRetry.js | single tracking bug → Frequent browser/components/migration/tests/unit/test_MigrationUtils_timedRetry.js | single tracking bug

There have been 30 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • windows10-64-2004-qr
  • windows10-64-2004-shippable-qr
[task 2022-12-21T02:51:10.620Z] 02:51:10     INFO -  TEST-START | browser/components/migration/tests/unit/test_MigrationUtils_timedRetry.js
[task 2022-12-21T02:51:14.055Z] 02:51:14  WARNING -  TEST-UNEXPECTED-FAIL | browser/components/migration/tests/unit/test_MigrationUtils_timedRetry.js | xpcshell return code: 0
[task 2022-12-21T02:51:14.058Z] 02:51:14     INFO -  TEST-INFO took 3435ms
Whiteboard: [stockwell needswork:owner]

Update:

There have been 45 failures within the last 7 days:

  • 23 failures on Windows 10 x64 2004 WebRender opt
  • 22 failures on Windows 10 x64 2004 WebRender Shippable opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=401735434&repo=autoland&lineNumber=1875

[task 2023-01-08T22:51:40.380Z] 22:51:40     INFO -  TEST-START | browser/components/migration/tests/unit/test_MigrationUtils_timedRetry.js
[task 2023-01-08T22:51:43.374Z] 22:51:43  WARNING -  TEST-UNEXPECTED-FAIL | browser/components/migration/tests/unit/test_MigrationUtils_timedRetry.js | xpcshell return code: 0
[task 2023-01-08T22:51:43.377Z] 22:51:43     INFO -  TEST-INFO took 2995ms
Whiteboard: [stockwell unknown] → [stockwell unknown][stockwell needswork:owner]

Gijs, could you have a look over or reassign this windows opt only frequent fail?
This test gets skipped with pattern: toolkit == 'android' OR !debug && os == "mac" and has failed 135 times in the last 30 days: https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-12-15&endday=2023-01-14&tree=trunk&failurehash=all&bug=1760074

Flags: needinfo?(gijskruitbosch+bugs)

Honestly I'm pretty lost. The test deliberately tries to read a table that doesn't exist yet. In the more recent failing instances on Windows, we just run 10 times (at 100ms intervals, if things are working...) and never succeed in reading the table. This is odd because the timeout in the test for creating this table is 50ms, and the 10 retries to read the table should be happening 100ms apart. Even accounting for some fuzz I wouldn't expect it to fail so often, given the factor 20 difference between the two timeouts.

... however, 100% of these Windows failures are on "condprof" xpcshell runs.

What are those? And is it possible they started existing / running as tier-1 on autoland about 2 months ago? If not, perhaps we can try to find what regressed this...

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(csabou)

Joel, could you provide more insight into condprof tasks for Gijs? Afaik, these were enabled sometime around half a year ago. Thank you.

Flags: needinfo?(csabou) → needinfo?(jmaher)

condprof is a conditioned profile. On each m-c build we run a firefox-full task, which launches Firefox and browses some random pages. This adds stuff to the cache and keeps it recent. Then we close Firefox and bundle up the profile; this is done on every m-c push, so it builds upon itself. In setting this up, we found tests which failed because they expected empty fields in tables or data structures, or a specific number, when in fact browsing had added some or many entries.

you can see how we skip many tests on condprof:
https://searchfox.org/mozilla-central/search?q=condprof&path=*.ini&case=false&regexp=false

here is the script we run to generate the full profile:
https://searchfox.org/mozilla-central/source/testing/condprofile/condprof/scenarii/full.py

I did some retriggers:
https://treeherder.mozilla.org/jobs?repo=mozilla-central&searchStr=condprof%2Cxpcshell%2Cwindows&fromchange=2781784fdc50fa3de9c864f2907f0ed89eadb41a&tochange=e4ac0dccafdddfa13e91dec0e852d001ca7c528e

and found the offending set of changes on m-c:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=909fc842471c16d10aa12a4be1c90c5efe219924

This range on autoland looks like:
https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=win%2Cxpcshell%2Ccondprof&tochange=909fc842471c16d10aa12a4be1c90c5efe219924&fromchange=e46a721b2af4ee095d4f554a9e894efc1050d6e1

I have done some retriggers and backfills on autoland and narrowed it down to here:
https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=win%2Cxpcshell%2Ccondprof&tochange=73b3372a88c5894eec2125711abcaa11f0e87552&fromchange=7889809e96a075d41c17df790f444049203d6bc7

Unfortunately the test is moving from chunk 8 -> 1 -> 6, and the filter by test path isn't working, as the chunk 1 tests seem to have a crash somewhere and the tests are not showing up in the filter.

But...narrowing it down to that range might help

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #64)

I have done some retriggers and backfills on autoland and narrowed it down to here:
https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=win%2Cxpcshell%2Ccondprof&tochange=73b3372a88c5894eec2125711abcaa11f0e87552&fromchange=7889809e96a075d41c17df790f444049203d6bc7

Unfortunately the test is moving from chunk 8 -> 1 -> 6, and the filter by test path isn't working, as the chunk 1 tests seem to have a crash somewhere and the tests are not showing up in the filter.

But...narrowing it down to that range might help

It does! That points pretty squarely to https://phabricator.services.mozilla.com/D162492 and bug 1801495, which is changing exactly the sqlite path that this test uses, cf. https://searchfox.org/mozilla-central/rev/5bcbe6ae54ee5b152f6cef29dc5627ec7c0e1f1e/browser/components/migration/MigrationUtils.sys.mjs#215 .

Marco, can you take a look?

Flags: needinfo?(mak)

This is puzzling.

A few initial thoughts:

  1. the test is not awaiting IOUtils.remove(tmpFile.path), while it probably should. Note: I suspect in practice this won't make a difference.
  2. it's possible the change is somehow stuck in the journal; ignoreLockingMode would then likely not see it, because it opens the db main file in isolation. dbConn.close()-ing should force any pending changes to be merged into the main file, so that's something one may try here to see if it helps. My only concern about this theory is that by default we should be using a DELETE journal, which would merge changes on every transaction, and the CREATE TABLE here is a transaction. Though it's possible that the presence of an active reader delays the merge to the next transaction, and since no other transaction happens, that delays the merge to close().
  3. I don't yet understand why this would only happen in condprof. Maybe the app is slower with the conditioned profile and that timing difference causes the merge delay more often.

I'm trying a couple things on Try and will continue the investigation from there. leaving the ni? open.

A few interesting findings:

  1. the TmpD folder used by the test is apparently just the profile folder, the same folder that was copied over by condprof; testDB is at the same level as xulStore, just to say. Does this matter? Not sure, probably not.
  2. I added a "done" log to the execute, and it actually prints after "exiting test"... So maybe the table really is being created later, and the CREATE TABLE just needs to be awaited...
  3. I noticed the previousException handling is a bit strange: there is an if (previousException) check, but that variable is always initialized to an object; did it intend to check previousException.message? FWIW I would just break; after the rows assignment and remove this check.

Overall, my opinion is simply that the CREATE TABLE in these cases happens after all the reads, for how strange that may sound. The test could probably either increase RETRYINTERVAL or just await the CREATE TABLE; after all, awaiting would change nothing for the test: the first SELECT would still fail thanks to the do_timeout call, and the next one would succeed.
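The suggested fix can be illustrated with a plain-JavaScript sketch, with timers standing in for the real Sqlite calls (`createTable` and `selectRows` are hypothetical stand-ins, not the test's actual API). Awaiting the write before reading removes the race in which every read happens to be scheduled ahead of the CREATE TABLE:

```javascript
// Illustrative sketch, not the actual xpcshell test: a simulated write
// that completes after 50 ms (like the test's do_timeout-driven
// CREATE TABLE) and a simulated read that fails until the write lands.
async function runScenario() {
  let tableExists = false;

  // Simulated CREATE TABLE, completing after 50 ms.
  const createTable = () =>
    new Promise(resolve =>
      setTimeout(() => {
        tableExists = true;
        resolve();
      }, 50)
    );

  // Simulated SELECT that throws until the table exists.
  const selectRows = async () => {
    if (!tableExists) {
      throw new Error("no such table");
    }
    return ["row"];
  };

  await createTable(); // the suggested change: await instead of fire-and-forget
  return selectRows(); // now deterministically ordered after the write
}
```

Without the `await`, the read loop and the write race against each other and the outcome depends on scheduler timing; with it, the ordering is explicit.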

For how strange it may sound, sometimes all the reads just happen before the write, so they won't see the table.
Also fixes a wrong check on previousException.

Status: ASSIGNED → RESOLVED
Closed: 6 days ago
Flags: needinfo?(mak)
Flags: needinfo?(jjalkanen)
Keywords: leave-open
Resolution: --- → FIXED