# Statistical inference considered harmful

Today we'll talk about a very exciting paper:

Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing

This paper appeared at Usenix Security 2014, where it won the best paper award! Holy crap!

The paper looks at the difficulty of dosing Warfarin, an anticoagulant with a somewhat challenging dosing protocol. It can be hard to figure out how to get patients on the right track with their dosage, and anything gained by looking harder at the patient (in this case, at their genetic markers) could help. In particular, an accurate statistical model of the correlation between certain genetic markers and the appropriate Warfarin dose could help tremendously.

But, maybe if you look too hard, you accidentally reveal more information about the patients than you intended. This sounds like it could be a privacy problem, and differential privacy claims it can help here. The authors want to know!

The paper claims two main contributions:

First, they describe an attack, named model inversion, which "given the model and some demographic information about a patient, can predict the patient’s genetic markers."

"Our results indicate that understanding the implications of differential privacy for pharmacogenomic dosing is a difficult matter—even small values of ε might lead to unwanted disclosure in many cases"

Second, they find that "for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality."

"We conclude that current DP mechanisms do not simultaneously improve genomic privacy while retaining desirable clinical efficacy,"

Oh geez. This doesn't sound like a "best" paper for differential privacy at all! Actually this sounds pretty bad.

Fortunately, it is mostly wrong. Both of these reported conclusions have serious issues.

If you can't bear the suspense, the corresponding issues are:

  1. They define a privacy attack as performing statistical inference using (i) private personal data you disclose to the attacker and (ii) statistics about Warfarin dosing in other people, laying the blame on (ii) rather than (i). Unfortunately, (ii) is called "science", and (i) is you telling someone else something you shouldn't have. Their conclusion, roughly translated, is "science is hard to suppress, even with small epsilon". You are welcome.

  2. They didn't actually use statistical inference when applying differential privacy to their target domain, so they take patients off of the baseline treatment even when the evidence that they should do so is weak. When epsilon is small, you should leave patients on the baseline treatment because you lack strong evidence to do anything else; instead, they essentially dose patients at random in this case. Mortality ensues.

I should say, the paper has a LOT of cool info on Warfarin, and you get a good amount of context on how these things actually need to work. I don't really dig their scientific methods, but I would totally read the paper anyhow for the incidental learnings.

## Issue 1

To start things off well, the authors write in their abstract that

"an attacker, given the model and some demographic information about a patient, can predict the patient’s genetic markers."

This would be seriously bad news, because other than the demographic information, the only way the information about genetic markers could leak out is through the model, which we may have learned using differential privacy!!!

But, what they meant to say, once you've read the paper, is actually

"an attacker, given the model, some demographic information about a patient, and the patient's stable Warfarin dosage, can predict the patient’s genetic markers.",

Oops, they forgot to tell you that their attacker also gets to look at the patient's stable dose of the drug! You know, that thing that is presumed to be correlated with the genetic markers. Maybe the information isn't actually leaking through the model after all.

If there is a correlation between the genetic markers and the stable dosage, which is more or less the premise of dosing based on a model learned from the two, then maybe the patient's dosage is where the secret is coming from, rather than the model itself. Of course, the second version of the text sounds a little less dramatic than the first, a little less like a privacy violation, which is probably why they used the first version in their abstract.
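To make the mechanics concrete, here is a minimal sketch of what such an inversion looks like, assuming a toy linear dosing model; the coefficients, the candidate variants, and the patient below are invented for illustration and are not the paper's actual model.

```python
# Hypothetical sketch of "model inversion": given a published dosing model, a
# patient's demographics, and their disclosed stable dose, guess the genetic
# marker whose predicted dose best matches the dose they told you.

def dosing_model(age, weight_kg, marker):
    # Stand-in for a learned linear dosing model; coefficients are made up.
    marker_effect = {"variant A": 0.0, "variant B": -1.5, "variant C": -3.0}
    return 8.0 - 0.05 * age + 0.02 * weight_kg + marker_effect[marker]

def invert_marker(age, weight_kg, stable_dose, candidates):
    # The "attack": try every candidate marker and keep the best match.
    return min(candidates,
               key=lambda m: abs(dosing_model(age, weight_kg, m) - stable_dose))

# The disclosed stable dose is doing all the work here; without it there is
# nothing to match the model's predictions against.
print(invert_marker(age=65, weight_kg=80, stable_dose=5.2,
                    candidates=["variant A", "variant B", "variant C"]))
```

Everything beyond the published model comes from the patient's own disclosure.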

The attack they describe isn't actually violating differential privacy, or compromising the secrecy of data you contributed to train the model, it just sounds creepy because people don't like being machine-learned about.

If you switch Warfarin doctors and your new doctor asks what dosage you are currently taking, and once told says "you must have marker XYZ, huh?", has the new doctor just violated your privacy? No: you told them your dose, and they just know the connection between tests and doses generally. While it may feel like a "privacy violation" to you, it isn't one that results from mishandling your data; it comes from modern science being good at knowing things about humans, and from you having said something (your dose) that spilled the beans.

It's kinda like how putting a cast on someone's foot may disclose information about the brokenness of their foot. Except in this case, the cast is invisible and you would have to tell people about it for them to know. You could desperately wish people didn't know that casts are used for breaks, and that maybe differential privacy could have suppressed that information back when casts were invented, but that train has left the station.

The flip side is that since the accuracy of this "attack" derives from how well understood the connection between markers and stable Warfarin dosing is, the accuracy of the attack largely corresponds to the accuracy of dosing predictions based on markers. As scientists get better at predicting the connection, getting right to the stable dose from your markers, the "attack" becomes more accurate, and more people get the dosages they need.

The authors develop this with the observation that the disclosure is smaller with small-epsilon differential privacy (which uses the data less) and larger with large-epsilon differential privacy. This boils down to your model being bad and then getting better, and has nothing to do with differential privacy. You would get the same results if your training population simply went from small to large; would you write a paper headlined

"Increasing the size and accuracy of clinical trials may pose privacy risks"?

Here is a different one:

"Studying Warfarin at all may pose privacy risks"

Let's distill it down to its essence:

"Statistical inference considered harmful"

Unfortunately, the deadline for Usenix Security 2016 has already passed.

## Differential privacy?

Importantly, all of the anxiety about learning models has nothing to do with disclosing the private data of patients used to train the dosing model. The attack applies even to people not involved in the training process. Apropos genomic data leaking through the model, the authors do observe that

"... there is no discernible difference between the model inverter’s performance on the training and validation sets."

The discernibility of the difference between these sets stays pretty much the same as epsilon increases, per their Figure 4 (the gaps between corresponding dotted and solid lines).

*(Figure 4 from the paper: model inversion performance on the training and validation sets, across values of epsilon.)*

In fact, their attack is often much less effective than differential privacy would guarantee, so it's not entirely clear what all the fuss is about. Certainly not differential privacy's guarantees about maintaining the secrecy of training data.

## Conclusions

The authors seem to be making the argument that the existence of accurate models is enough to launch a privacy attack, where their privacy attack involves you first revealing secret personal information to the attacker. I think they are barking up the wrong tree here. Their proposed remedy is "suppress science".

The paper does report that there is not much privacy risk in contributing your data to train such a dosing model. I would stress that just because they don't find much privacy risk in participating doesn't mean there isn't any. Their results indicate much less risk than differential privacy guarantees, especially for large epsilon, which may just mean that they don't have a very powerful attack. There could easily be a paper next year indicating that the risk is actually much closer to the differential privacy bound.

## Issue 2

What about the "increased risks of stroke, bleeding events, and mortality"? These correspond to a hypothetical treatment plan that takes several differentially private measurements as absolute truth, ignoring how strong or weak the signal is for each patient. A really strong signal is a great reason to change someone's dosage; a weak signal means it is probably best to keep the dosage at (or closer to) the baseline. They treat every signal as "very strong", change dosages all over the place, and some hypothetical people die.
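For reference, a differentially private measurement here is something like a Laplace-noised statistic, with noise scale inversely proportional to epsilon. The sketch below is illustrative only; the doses, bounds, and epsilon values are made up.

```python
import random

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism (illustrative only).

    Values are clamped to [lower, upper], so one person changes the mean by at
    most (upper - lower) / n; the Laplace noise scale is that sensitivity / epsilon.
    """
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    scale = (upper - lower) / (n * epsilon)
    # The difference of two exponentials with rate 1/scale is Laplace(0, scale).
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return sum(clamped) / n + noise

doses = [4.5, 5.0, 6.5, 7.0, 5.5] * 20        # 100 hypothetical stable doses (mg/day)
for eps in [5.0, 1.0, 0.1, 0.01]:
    print(eps, round(dp_mean(doses, lower=0.0, upper=15.0, epsilon=eps), 2))
```

At epsilon = 5 the output is close to the true mean (5.7mg); at epsilon = 0.01 it is mostly noise. Treating both as exact truth is where the trouble starts.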

Let's look at their Figure 8, which evaluates varying privacy parameters epsilon (the x-axes) on four metrics of success in Warfarin dosing, compared against always starting with 10mg (Fixed 10mg) and against non-private linear regression (LR).

*(Figure 8 from the paper: dosing outcomes as a function of epsilon, compared against Fixed 10mg and LR.)*

This figure reveals that small epsilon pretty much kills people. And if you use their approach, it probably does. In fact, although they don't plot it, their approach probably kills the most people with epsilon = 0 measurements. The measurements would be entirely noise, no signal, and so they would basically be randomly dosing people.

Imagine you are going in for your Warfarin dosing, and are told "often we use 10mg when we don't know any better, but we've just gotten this pile of literally worthless measurements and we are going to dose you based on them."

I'm guessing you'd say "10mg please".

The important feature here is that we know how valuable the measurements are. You know that with epsilon = 0 there is no value, and so you should not adjust your prior strategy of 10mg dosing. A similar correction should apply throughout Figure 8: rather than just using the results of the noisy measurements, integrate them with your prior belief that 10mg might be fine, weighted by your confidence in the measurements.
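Here is a minimal sketch of that correction, under a crude Gaussian approximation to the Laplace noise; the prior, sensitivity, and epsilon values below are placeholders, not anything taken from the paper.

```python
def shrunk_dose(noisy_estimate, sensitivity, epsilon, prior_mean=10.0, prior_sd=2.0):
    """Blend a Laplace-noised dose estimate with a 10mg prior, weighted by precision.

    The Laplace noise has variance 2 * (sensitivity / epsilon)^2; treating it as
    Gaussian is a rough approximation, but it captures the right behavior.
    """
    noise_var = 2.0 * (sensitivity / epsilon) ** 2
    prior_var = prior_sd ** 2
    weight = prior_var / (prior_var + noise_var)   # how much to trust the measurement
    return weight * noisy_estimate + (1.0 - weight) * prior_mean

for eps in [10.0, 1.0, 0.1, 0.01]:
    print(eps, round(shrunk_dose(noisy_estimate=6.0, sensitivity=0.15, epsilon=eps), 2))
```

As epsilon shrinks, the weight on the measurement goes to zero and the recommendation collapses back to 10mg, so at epsilon = 0 you simply stay on the baseline rather than dosing at random.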

If you do this properly (which might be hard; it is well defined mathematically, but often challenging computationally), the differential privacy curves in Figure 8a should start at the Fixed 10mg line and, as you increase epsilon, approach the LR line (the non-private linear regression). There is no reason to believe the curves should approach LR quickly, though it probably wouldn't be any slower than they do now. The important difference is that everywhere in Figure 8a where DP is currently worse than Fixed 10mg, it should be no worse (perhaps with some variation due to randomness).

Figures 8b, 8c, and 8d are a bit of a mystery to me. They seem to show that LR doesn't reduce the risk over Fixed 10mg, and if you corrected the differential privacy approaches they shouldn't either. Now you have several techniques that don't improve over Fixed 10mg. I'm not sure what that shows; it's not my figure.

## Conclusions

I think the main conclusion is that if you apply a technique to statistical data without using statistical techniques, you are likely to see bad results (like, actually harmful) especially when the noise in the data is relatively large. You can correct for this using elementary probability and statistics in principle, but doing so may be non-trivial.

Ollie Williams and I had a paper in NIPS 2010, Probabilistic Inference and Differential Privacy, about "how to do this", where we mostly gave the short mathematical formulas and showed a few cases where it works. It is all relatively elementary mathematics starting from first principles, but you do end up with integrals. Solving them may be quite hard, depending on your case, but it sure beats killing hypothetical people by misusing statistical information.
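For flavor, here is the shape of that computation, not the paper's actual formulas: a prior over the quantity of interest, a Laplace likelihood for the noisy release, and an integral, approximated here on a grid with made-up numbers.

```python
import math

def laplace_pdf(x, mu, scale):
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

def posterior_mean(observed, noise_scale, prior_mean=10.0, prior_sd=2.0,
                   lo=0.0, hi=20.0, steps=2000):
    """Posterior mean of the true value given one Laplace-noised observation.

    Gaussian prior, Laplace likelihood; the integrals are approximated on a grid.
    """
    step = (hi - lo) / steps
    numerator = normalizer = 0.0
    for i in range(steps + 1):
        theta = lo + i * step
        prior = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        weight = prior * laplace_pdf(observed, theta, noise_scale)
        numerator += theta * weight
        normalizer += weight
    return numerator / normalizer

print(round(posterior_mean(observed=6.0, noise_scale=15.0), 2))   # very noisy: stays near 10
print(round(posterior_mean(observed=6.0, noise_scale=0.15), 2))   # precise: moves to ~6
```

A very noisy release barely moves the answer away from the prior; a precise one dominates it, which is exactly the behavior the dosing experiments needed.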

Technically, the authors do not say that differential privacy can't work. They say (emphasis theirs)

"We conclude that current DP mechanisms do not simultaneously improve genomic privacy while retaining desirable clinical efficacy, ..."

I think this is the only concession they make that there may be better approaches than what they chose to perform. In their conclusions they make the stronger statement:

"We show that differential privacy substantially interferes with the main purpose of these models in personalized medicine: for ε values that protect genomic privacy, which is the central privacy concern in our application, the risk of negative patient outcomes increases beyond acceptable levels."

You can blame differential privacy for interfering if you like (!@#$ing math), but I think it is more accurate to say that

"We show that failing to account for the statistical nature of the information differential privacy provides substantially interferes with the main purpose of these models in personalized medicine: ..."

Analogously, if you had a small sample size and then mis-treated a whole bunch of patients, should you blame the sample size or your over-eager mis-use of statistics? I have an opinion, maybe the authors have the other opinion.

Actually, the other opinion isn't totally deranged. When presenting information from differentially private measurements, we do need to think about how to be clear about what the measurements actually communicate. They are not reports of exact values, but rather partial statistical information. Can we make life easier for analysts who may not be experts in applying probabilistic inference? For example, medical professionals who may eventually treat me based on differentially private measurements. How's that for motivation?

With Davide Proserpio and Sharon Goldberg we showed that yes in principle you can do this. We built a system that both (i) takes differentially private measurements and (ii) automatically performs the probabilistic inference for you. It works kinda ok. It is the sort of thing that could be greatly improved by someone with practical expertise in probabilistic inference; if that is your bag, let me know.

## Take-aways

This paper's definition of genomic privacy seems to be "preventing model inversion", which I can't imagine is a standard definition, because they just made up model inversion in this paper. The attack involves you sharing your stable Warfarin dose, a quantity known to be correlated with your genomic information, with an attacker who may know something about the correlation.

Importantly, the "privacy violation" doesn't come from applying genomic models to your Warfarin dosing; it comes from the models having been determined at all. The models' existence is what enables the inversion: it applies even to people whose dose was never set using the model.

The remedy studied in this paper is literally "can we suppress the public knowledge of the correlation between genetic markers and stable Warfarin doses?"

You went and told your neighbor your Warfarin dosing, or let them snoop around in your medicine cabinet (sucks!), and the response offered in the interest of your genomic privacy is "can we shut down science, please?"

Good luck with that. It isn't your knowledge to withhold.

This confusion is exactly the issue that differential privacy was introduced to address: when there is a perceived privacy violation, how do we diagnose which information was rightfully yours to withhold, and which information is not yours to suppress?

Anything that can be learned without you is not your secret to withhold.

These authors have concluded the opposite: we should reasonably discuss the suppression of information about large populations of other people as a way to protect conclusions about ourselves. You can live in that world if you like, but you'll probably be pretty sad when Europe goes and publishes some excellent model for Warfarin dosing. Damn Europe, always messing up our genomic privacy.

## Other thoughts

This paper occasionally gets trotted out with this grim image from the talk:

*(Image from the talk.)*

The quote is

"We did not observe a budget that significantly prevented model inversion, without introducing risk over fixed dosing."

This is totally true, they didn't observe such a budget. But it's fine for so many reasons.

The first reason is that model inversion misdiagnoses the source of the privacy violation: sharing your Warfarin dosage, or having it snooped from you, is what discloses information about your genetic markers. Their correlation as observed among large populations of people who are not you is not the source of your privacy woes. Model inversion is a non-attack; no one should care whether it is prevented or not.

The second reason is that the risk introduced over fixed dosing was primarily due to ignoring the statistical nature of the differentially private measurements. The confidence associated with a measurement is (or should be) an important part of determining how far you depart from the baseline treatment; that didn't happen in these experiments. The observed increase in risk over fixed dosing is there because statistical data were used without statistical techniques.

Differential privacy doesn't kill (hypothetical) people, these authors do. ;)

## An analogy for the Usenix Security crowd

Imagine you had to review this paper, where

  • stable Warfarin dosage is replaced with a hash of my password
  • genomic dosing model is replaced with a list of popular passwords

Model inversion is replaced with a "brute-force password attack", where the attacker is of course given the hash of a password, and may learn about commonly used passwords only to the extent the authors let them. The paper then studies how to suppress the accurate release of common passwords by adding noise somehow. Finally, it concludes

"We did not observe a noise level that significantly prevents brute force password attacks, without seriously undermining password strength-checking dictionaries."

What does your review of that paper look like?

