source link: https://bugzilla.mozilla.org/show_bug.cgi?id=1514392
Closed Bug 1514392 Opened 2 years ago Closed 2 years ago

Need better sanity-checking of values computed based on timestamps generated from monotonicNow

Categories

(Toolkit :: Telemetry, enhancement, P1)

Tracking

(bug RESOLVED as FIXED for Firefox 67)

RESOLVED FIXED

mozilla67

Tracking Status firefox67 --- fixed

People

(Reporter: wlach, Assigned: janerik)

References

Details

This week while investigating nightly usage, I found that a significant number of clients were reporting huge usage hours across both nightly and release:

https://sql.telemetry.mozilla.org/queries/60531/source
https://sql.telemetry.mozilla.org/queries/60533/source

These bogus results are enough to seriously distort any calculations based on usage hours.

Investigating deeper, I was a little concerned with what I found:

* We have two implementations of "monotonic now", one of which returns -1 in the case of an error, the other of which returns Date.now() (which would not be comparable with another monotonicNow call...)

https://searchfox.org/mozilla-central/rev/49e78df13e7a505827a3a86daae9efdf827133c6/toolkit/components/telemetry/app/TelemetryUtils.jsm#219
https://searchfox.org/mozilla-central/rev/49e78df13e7a505827a3a86daae9efdf827133c6/toolkit/components/telemetry/pings/TelemetrySession.jsm#105

* For the one that returns -1, we don't check whether that sentinel was what actually came back: we shouldn't try to do any calculations based on a number like that!
* For the other one, I don't think using Date.now() as a fallback is a good idea, as it's not comparable with the return value of monotonicNow, so if for example it failed once but worked later, we'd get a weird result trying to compare the two (see the sketch below).
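
To make the mismatch concrete, here is a minimal Python sketch of the two fallback behaviours (illustrative only: the real implementations are in the JSM files linked above, and the failure paths are simulated via a flag):

import time

def monotonic_now_utils(fail=False):
    # Sketch of the first variant: returns -1 if the monotonic clock
    # cannot be sampled ("fail" simulates that error path).
    return -1 if fail else time.monotonic() * 1000

def monotonic_now_session(fail=False):
    # Sketch of the second variant: falls back to wall-clock time,
    # which lives on a completely different timeline.
    return time.time() * 1000 if fail else time.monotonic() * 1000

# If one sample fails and a later one succeeds, the computed "elapsed" time is garbage:
start = monotonic_now_utils(fail=True)   # -1
end = monotonic_now_session()            # arbitrary monotonic counter value
bogus_elapsed_ms = end - start           # end + 1, potentially enormous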

Thoughts:

* I think we should only have one monotonicNow function
* I think we ought to return an undefined result if it's not accurate, and always check for that before computing telemetry values based on it. Depending on the case, we might choose to just not record a probe with that value, or record a null value (sketched below).
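
A rough sketch of that direction, assuming a single helper that reports failure explicitly and callers that check before recording (names and structure are illustrative, not the actual patch):

import time

def monotonic_now():
    # Single source of monotonic time in ms; None stands in for the
    # "undefined" result proposed above and signals "not accurate".
    try:
        return time.monotonic() * 1000
    except OSError:  # simulated failure path
        return None

def subsession_length_ms(start_ms):
    now_ms = monotonic_now()
    if start_ms is None or now_ms is None:
        # Don't compute anything from an invalid timestamp: either skip
        # the probe or record an explicit null, depending on the case.
        return None
    return now_ms - start_ms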

I have a strawman patch which tries to correct for this. It's probably wrong and needs more testing, but I figured it might make a good opening bid at solving this problem.
Attachment #9031597 - Attachment description: Bug 1514392 - Don't trust 'monotonic now' for telemetry timing r\?gfritzsche → Bug 1514392 - Don't trust 'monotonic now' for telemetry timing r?gfritzsche
It probably also deserves some thought on whether the above problems would actually cause the weird values we're seeing. I can think of cases where this monotonic now function failing could lead to some weird/invalid results, but it's still hard to explain results in the millions (aside from a client's machine going bananas). Perhaps we need to further check these kinds of values on the ingestion/etl side...
(In reply to William Lachance (:wlach) (use needinfo!) from comment #2)
> It probably also deserves some thought on whether the above problems would
> actually cause the weird values we're seeing. I can think of cases where
> this monotonic now function failing could lead to some weird/invalid
> results, but it's still hard to explain results in the millions (aside from
> a client's machine going bananas). Perhaps we need further check these kinds
> of values on the ingestion/etl side...

Yes, we do.
I think this needs to be broken out into
- a specific issue with Firefox Telemetry you noticed (this bug?)
- a general investigation into a class of errors? a specific dataset issue to fix?

We can aim to fix individual clients sending bogus data (for "most" clients, not all - some clients will have issues outside of our control).
We can't control other issues that contribute to this (like profile cloning etc.).


There were some prior investigations into this; it would probably be good to connect this back to those and to data science?
(In reply to William Lachance (:wlach) (use needinfo!) from comment #0)
> * I think we should only have one montonicNow function
> * I think we ought to return an undefined result if it's not accurate, and
> always check for that before computing telemetry values based on it.
> Depending on the case, we might choose to just not record a probe with that
> value, or record a null value.

I think there are two separate themes here:
- One about unifying how we handle monotonic timestamps/timespans in Telemetry. This is probably a good thing, but has some possible downstream implications (for pipeline, analysis, topline metrics) we should discuss.
- One about proposing a different approach to how we handle the case where we fail to record a valid timestamp. The current approach is to send `-1`, see [0] et al.

0: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/main-ping.html#sessionlength
I think my main take-away here is that this has potentially far-reaching implications and we should be very deliberate about any changes made.
I think this is a good fit for making one or more targeted proposals to share with the larger team.
I think I found the root cause of the weird values we're seeing, filed a new bug 1515396 for that.

That said, I believe the issues pointed out in this bug are still valid. I'll work on solving them in a series of bite-sized patches.
Ok, after further digging, I determined that bug 1515396 is basically invalid. Here's what I think is really happening:

To get subsession lengths, we use this function:

https://searchfox.org/mozilla-central/rev/9528360768d81b1fc84258b5fb3601b5d4f40076/toolkit/components/telemetry/core/TelemetryCommon.cpp#85

This function calls two methods: one to get the current monotonic timestamp (more or less reliable, I think!) and another to get the estimated monotonic time since process start (TimeStamp::ProcessCreation). If you look at it closely, you can see it tries to use a system call to get the process uptime, then subtracts that from a call to get the current monotonic time:

https://searchfox.org/mozilla-central/rev/9528360768d81b1fc84258b5fb3601b5d4f40076/mozglue/misc/TimeStamp.cpp#91

If you look closely at this function, you see that it actually uses the current local time to calculate the uptime:

https://searchfox.org/mozilla-central/rev/9528360768d81b1fc84258b5fb3601b5d4f40076/mozglue/misc/TimeStamp_windows.cpp#530

If the date shifted in an unexpected way between the time the process was started and when we called this function, we could end up with an unexpected result (i.e. one where the process uptime is less than that of the statically initialized monotonic time stamp). If you look at the timestamp function above, you see that we account for this, and fall back to the statically initialized time in this case. This is good. However, in that case, we set a flag to indicate we found an error and return the statically initialized time. Telemetry would then take this as a signal to return -1 for the currently sampled monotonic time:

https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/pings/TelemetrySession.jsm#96

... which it will then happily use as the monotonic time signalling the start of the session, as I indicate in comment 0! Since the monotonic time is an arbitrary counter, it is possible it would have a very large value. VERY_LARGE_VALUE - (-1) = VERY_LARGE_VALUE + 1.
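
A condensed Python sketch of that chain, with the inconsistency simulated and the later counter value made up purely for illustration:

import time

STATIC_INIT_S = time.monotonic()  # stand-in for the statically initialized timestamp

def ms_since_process_start(clock_shift_detected=False):
    # TimeStamp::ProcessCreation() estimates the process start as
    # (monotonic now - system uptime); the uptime comes from the local clock
    # on Windows, so a clock shift can make the estimate inconsistent.
    # The real code then falls back to the static-initializer timestamp but
    # also flags an error, which Telemetry turns into a -1 return value.
    if clock_shift_detected:
        return -1
    return (time.monotonic() - STATIC_INIT_S) * 1000

session_start_ms = ms_since_process_start(clock_shift_detected=True)  # -1
later_sample_ms = 7.5e12  # arbitrary large counter value, illustrative only
subsession_hours = (later_sample_ms - session_start_ms) / 1000 / 3600
# roughly 2 million "hours", the kind of extreme value seen in the queries above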

As I noted in bug 1515396, subsequent calls to get the monotonic time of the subsession don't seem to fail (we don't see any "extreme negative" results we would expect were that the case). This is because after the first call which returns an inconsistent result, subsequent results will use the cached (statically initialized) version which we fall back to. 

--

After chatting with :mreid, I would propose two things:

1. Instead of returning -1 in the case of this inconsistency error, let's just return the best estimate that we have (the version of monotonic time that we got from the static initializer). This should be more than accurate enough for telemetry's purposes, since we got it early in the startup process.
2. Keep a running count of these inconsistency errors, and then send them along with every telemetry ping.

I believe this will give us much more sane and consistent behaviour (which is more resilient to programmer error), as well as a way of verifying that my hypothesis is true: we can monitor the nightly data after this lands and verify that things are more or less as we'd expect.
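
A sketch of proposals 1 and 2, under the assumption that the inconsistency is detectable at the point where we currently return -1 (the helper name and the counter are illustrative):

import time

STATIC_INIT_S = time.monotonic()  # stand-in for the static-initializer timestamp
inconsistency_count = 0           # proposal 2: reported along with every ping

def ms_since_process_start():
    global inconsistency_count
    estimate_s, inconsistent = process_creation_estimate()  # hypothetical helper
    if inconsistent:
        # Proposal 1: keep the static-initializer fallback value instead of
        # turning it into -1; it was taken early in startup, so it is close enough.
        estimate_s = STATIC_INIT_S
        # Proposal 2: count the event so it stays visible in the data.
        inconsistency_count += 1
    return (time.monotonic() - estimate_s) * 1000

def process_creation_estimate():
    # Placeholder for TimeStamp::ProcessCreation(); always consistent here.
    return STATIC_INIT_S, False
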
Cheers for the deep-dive!

(In reply to William Lachance (:wlach) (use needinfo!) from comment #7)
> If the date shifted in an unexpected way between the time the process was
> started and when we called this function, we could end up with an unexpected
> result (i.e. one where the process uptime is less than that of the
> statically initialized monotonic time stamp).

Have we tested that yet? I'd like to confirm this.

> If you look at the timestamp
> function above, you see that we account for this, and fall back to the
> statically initialized time in this case. This is good. However, in that
> case, we set a flag to indicate we found an error and return the statically
> initialized time. Telemetry would then take this as a signal to return -1
> for the currently sampled monotonic time:
> 
> https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/
> pings/TelemetrySession.jsm#96
> 
> ... which it will then happily use as the monotonic time signalling the
> start of the session, as I indicate in comment 0! Since the monotonic time
> is an arbitrary counter, it is possible it would have a very large value.
> VERY_LARGE_VALUE - (-1) = VERY_LARGE_VALUE + 1.

What very large values could we realistically see here?
Would that be values "from system start time"? Or something else?

> 1. Instead of returning -1 in the case of this inconsistency error, let's
> just return the best estimate that we have (the version of monotonic time
> that we got from the static initializer). This should be more than accurate
> enough for telemetry's purposes, we got it early in the startup process.
> 2. Keep a running count of these inconsistency errors, and then send them
> along with every telemetry ping.
> 
> I believe this will give us much more sane and consistent behaviour (which
> is more resilient to programmer error), as well as a way of verifying that
> my hypothesis is true: we can monitor the nightly data after this lands and
> verify that things are more or less as we'd expect.

We could record these times and offsets in new, separate metrics initially and test them on Nightly first, potentially also Beta/Release.
That would be an easy thing to test out without planning through all the possible impact right away.
Our Firefox population can be pretty weird and the pre-release populations are biased.

Before anything else, I'd like someone from the Telemetry team to review your analysis.
Flags: needinfo?(wlachance)
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> Cheers for the deep-dive!
> 
> (In reply to William Lachance (:wlach) (use needinfo!) from comment #7)
> > If the date shifted in an unexpected way between the time the process was
> > started and when we called this function, we could end up with an unexpected
> > result (i.e. one where the process uptime is less than that of the
> > statically initialized monotonic time stamp).
> 
> Have we tested that yet? I'd like to confirm this.
> 
> > If you look at the timestamp
> > function above, you see that we account for this, and fall back to the
> > statically initialized time in this case. This is good. However, in that
> > case, we set a flag to indicate we found an error and return the statically
> > initialized time. Telemetry would then take this as a signal to return -1
> > for the currently sampled monotonic time:
> > 
> > https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/
> > pings/TelemetrySession.jsm#96
> > 
> > ... which it will then happily use as the monotonic time signalling the
> > start of the session, as I indicate in comment 0! Since the monotonic time
> > is an arbitrary counter, it is possible it would have a very large value.
> > VERY_LARGE_VALUE - (-1) = VERY_LARGE_VALUE + 1.
> 
> What very large values could we realistically see here?
> Would that be values "from system start time"? Or something else?

Can we also contrast that to the value distribution you see in analysis?
Can you give a summary of the value ranges that we see across the whole population?
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> Cheers for the deep-dive!
> 
> (In reply to William Lachance (:wlach) (use needinfo!) from comment #7)
> > If the date shifted in an unexpected way between the time the process was
> > started and when we called this function, we could end up with an unexpected
> > result (i.e. one where the process uptime is less than that of the
> > statically initialized monotonic time stamp).
> 
> Have we tested that yet? I'd like to confirm this.

I don't know how to test this condition; it sounds super rare and hard to reproduce. All I know is that we try to account for the condition in the timestamp code:

https://searchfox.org/mozilla-central/rev/9528360768d81b1fc84258b5fb3601b5d4f40076/mozglue/misc/TimeStamp.cpp#91

> > If you look at the timestamp
> > function above, you see that we account for this, and fall back to the
> > statically initialized time in this case. This is good. However, in that
> > case, we set a flag to indicate we found an error and return the statically
> > initialized time. Telemetry would then take this as a signal to return -1
> > for the currently sampled monotonic time:
> > 
> > https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/
> > pings/TelemetrySession.jsm#96
> > 
> > ... which it will then happily use as the monotonic time signalling the
> > start of the session, as I indicate in comment 0! Since the monotonic time
> > is an arbitrary counter, it is possible it would have a very large value.
> > VERY_LARGE_VALUE - (-1) = VERY_LARGE_VALUE + 1.
> 
> What very large values could we realistically see here?
> Would that be values "from system start time"? Or something else?

So it appears that the timestamp code on Windows uses either QueryPerformanceCounter (ideally) or GetTickCount64 (as a fallback in case the hardware does not support it).

GetTickCount64 is supposed to return the number of milliseconds elapsed since system start: https://docs.microsoft.com/en-us/windows/desktop/api/sysinfoapi/nf-sysinfoapi-gettickcount64
QueryPerformanceCounter is not tied to any wall-clock time scale: "QPC is independent of and isn't synchronized to any external time reference" -> https://docs.microsoft.com/en-us/windows/desktop/SysInfo/acquiring-high-resolution-time-stamps

In my testing of the latter on a Windows 7 VM, I got a fairly small value for a random sample (5607 seconds):

>>> # setup, following the Stack Overflow reference linked below
>>> import ctypes
>>> kernel32 = ctypes.windll.kernel32
>>> sample = ctypes.c_int64()
>>> frequency = ctypes.c_int64()
>>> kernel32.QueryPerformanceCounter(ctypes.byref(sample))
1
>>> kernel32.QueryPerformanceFrequency(ctypes.byref(frequency))
1
>>> sample.value / frequency.value
5607.575726196575

... which would not cause problems. But then, who knows what value this would give on actual hardware. Given what I've read, a very large value does not sound impossible and is the only thing I can think of that would account for the behaviour we're seeing.

(if you're curious to reproduce, here's the reference I used: https://stackoverflow.com/questions/38461335/python-2-x-queryperformancecounter-on-windows)

> > 1. Instead of returning -1 in the case of this inconsistency error, let's
> > just return the best estimate that we have (the version of monotonic time
> > that we got from the static initializer). This should be more than accurate
> > enough for telemetry's purposes, we got it early in the startup process.
> > 2. Keep a running count of these inconsistency errors, and then send them
> > along with every telemetry ping.
> > 
> > I believe this will give us much more sane and consistent behaviour (which
> > is more resilient to programmer error), as well as a way of verifying that
> > my hypothesis is true: we can monitor the nightly data after this lands and
> > verify that things are more or less as we'd expect.
> 
> We could record this times and offsets in new, separate metrics initially
> and test them on Nightly first, potentially also Beta/Release.
> That would be an easy thing to test out without planning through all the
> possible impact right away.
> Our Firefox population can be pretty weird and the pre-release populations
> are biased.

I don't know about this, there's *definitely* a bug here in that we're not actually checking the results of our monotonic time function. I don't know for sure that my explanation of what's happening is correct, but I can't imagine how we'd be worse off by just fixing our code to not blindly use a value of -1 if it detects a startup time inconsistency between a static initializer and the process time system call.

> Before anything else, i'd like someone from the Telemetry team to review
> your analysis.

Sure, I was thinking that would happen as part of a review.
Flags: needinfo?(wlachance)
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)

> > What very large values could we realistically see here?
> > Would that be values "from system start time"? Or something else?
> 
> Can we also contrast that to the value distribution you see in analysis?
> Can you give a summary of the value ranges that we see across the whole
> population?

Here's what I got for Dec 11 on Windows Nightly (note these are summarized counts, but almost certainly the very large numbers come from individual samples):

0 < subsession_hours < 1: 9729
1 < subsession_hours < 10: 17365
10 < subsession_hours < 100: 14843
100 < subsession_hours < 1000: 160
1000 < subsession_hours < 10000: 4
10000 < subsession_hours < 2104114.992777: 170

Source: https://extremely-alpha.iodide.io/notebooks/198/?viewMode=report

An analysis across the full population would take longer, so I haven't done it yet. But you can see in comment 0 that we see the same strange behavior there.

Based on what I measured in my VM, it's also possible we're overcounting even in the case where we don't have an absurd number of subsession hours. Hard to tell.
(In reply to William Lachance (:wlach) (use needinfo!) from comment #10)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #8)

> > Before anything else, i'd like someone from the Telemetry team to review
> > your analysis.
> 
> Sure, I was thinking that would happen as part of a review.

Setting needinfo on this so we don't forget about it over the holidays. I feel like I've done about as much analysis as I can reasonably do at this point.
Flags: needinfo?(gfritzsche)

ni? myself, we'll have to figure out the way forward.

Flags: needinfo?(jrediger)
Flags: needinfo?(gfritzsche)

I took a look at "event" pings' processStartTimestamp, one of the consumers of msSinceProcessStart.

Here's the notebook: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/75037/command/75126

The tl;dr is that 3% of sessions described by the event pings sent from Nightly this January (from builds new enough to have "event" pings) had more than 1 processStartTimestamp.

Long document with notes: https://docs.google.com/document/d/1DAwLvF5F3cT1dG85VA2zC7ANPx2EZxPb9vL0RwaWIV0

I did another query last night on the number of pings with sketchy subsession lengths (between 24 and 48 hours and more than 48 hours):
https://sql.telemetry.mozilla.org/queries/61420/source
It hovers around 1% for most (up to 6% of pings, on release, on a 1% sample of all pings).

I also took :wlach's query and adjusted it for actual client counts:
https://sql.telemetry.mozilla.org/queries/61419/source
It shows similar small numbers.

While we certainly do have clients with invalid session lengths, IMO they are much less significant than initially thought.
We might still need to be careful when basing analysis on session length/subsession hours sum (active ticks/active hours seems to be more accurate anyway)

I went ahead and submitted the most minimal patch fixing the monotonicNow/Date.now/-1 bug.
I left out any instrumentation, as I think if we want that we need to implement it much lower (in TimeStamp directly), as Telemetry is most likely not the first/only caller of this.

:wlach, does this make sense? do you agree with my conclusions?
(Edit: Oh and if you do, you could abandon your patch :))

Flags: needinfo?(wlachance)

(In reply to Jan-Erik Rediger [:janerik] from comment #16)

> While we certainly do have clients with invalid session lengths, IMO they are much less significant than initially thought.
> We might still need to be careful when basing analysis on session length/subsession hours sum (active ticks/active hours seems to be more accurate anyway)

Yeah, I never really claimed that the volume of pings with incorrect subsession_hours was high, just that the small number of pings which did have this problem would have a disproportionate impact. :)

Aside from that minor quibble, I agree with the analysis. Let's get this landed and see what the results look like on Nightly!

Flags: needinfo?(wlachance)
Comment on attachment 9043637 [details]
data-review-request-1514392.txt

DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions file ([Scalars.yaml](https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/telemetry/Scalars.yaml)), the [Probe Dictionary](https://telemetry.mozilla.org/probe-dictionary/), and on telemetry.mozilla.org's [Measurement Dashboards](https://telemetry.mozilla.org/new-pipeline/dist.html).

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

    If the request is for permanent data collection, is there someone who will monitor the data over time?

N/A, the collection expires in Firefox 77.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

    Is the data collection request for default-on or default-off?

Default-on for all channels.

    Does the instrumentation include the addition of any new identifiers?

No.

    Is the data collection covered by the existing Firefox privacy notice?

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :janerik is responsible for renewing or removing this collection before it expires in Firefox 77.

---
Result: datareview+
Attachment #9043637 - Flags: review?(chutten) → review+
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67