IBM homomorphic encryption: A DASHing solution for healthcare data privacy

Blog Post

Learn about a winning solution for data privacy in healthcare using the IBM homomorphic encryption HElayers library

In 2021, our team won third place in the second track of the iDASH workshop challenge on healthcare data privacy. Our solution classified 2000 viruses in less than 1 second with more than 99% accuracy by using the IBM homomorphic encryption HElayers library.

In this blog, we describe the iDASH competition, our solution, and what makes it so effective. As a motivating scenario, think of a hospital that, after much research, has collected a large number of virus DNA sequences that are labeled as one of four possible strains. The hospital wants to provide local clinics with a service that classifies the DNA sequences taken from their patients. However, the hospital does not want to disclose the classification algorithm to the clinics for obvious business reasons.

A simple solution would consist of a client/server system in which the local clinics serve as the client, and the hospital is the server. In such a solution, the client would send the DNA sequence to the server. The server would classify the sequence and send the label back to the client. The problem is that both the client and the server in this relationship want to avoid disclosing patient information, including the DNA sequences of the viruses that patients have contracted because doing so will require them to comply with extensive and exhausting regulations.

Specifically, we want the server to be able to classify a virus without knowing what its DNA sequence is. Until recently, this seemed impossible. Today, this can be done by using homomorphic encryption (HE) technology. This encryption technology is the focus of the iDASH competition.

What is the iDASH competition?

iDASH is an annual workshop on data privacy in healthcare. As part of the workshop, the organizers set up a three-tiered secure genome analysis competition that is open for competing teams from around the world. Each year the challenge is different, and teams have three months to plan and implement a solution for the challenge.

What was the iDASH challenge this year?

This year, the second track of the competition challenged participants to classify virus DNA sequences as one of four strains. Each competing team received a training set consisting of 8,000 DNA sequences with their strain labels (2,000 from each strain). The DNA sequences were provided as a FASTA file (roughly 30,000 DNA letters for each virus). Each team trained their own model using the training set.

The solutions were evaluated using a new set of 2,000 DNA sequences to be classified, which were unknown ahead of the competition’s deadline. The DNA sequences were encoded and encrypted using the client’s key, with each team providing software that simulated the client. The DNA sequences were then classified (while still encrypted) by another software that each team provided to simulate the server. The output of this classification was also encrypted with the client’s keys, and, therefore, could not be read by the server. Eventually, the encrypted classification was decrypted by the software simulating the client to reveal the output.

The rules of the competition included the following items.

Privacy-related

The server must not learn anything about the DNA sequences that it classifies.
The client must not learn anything except the label that is classified by the server (for example, it cannot be given the model). Specifically, this also means that the client must not perform any type of feature selection because the selected features reveal something about the server’s model.

Performance-related

The client can use only 1 CPU with 1 GB RAM.
The server can use 4 CPUs with 4 GB RAM.

What is Homomorphic encryption?

Homomorphic encryption (HE) is a new encryption technology that allows encrypted numbers to be added together or multiplied. With these two primitive operations, any algorithm can be computed on encrypted numbers, including the classification of an encrypted DNA sequence.

HE is known for being notoriously slow and memory consuming. This is primarily because it only supports addition and multiplication operations, which can be used to evaluate any polynomial over encrypted ciphertexts. This computation model is different from traditional algorithms that can look at the values of variables and make decisions based on their values. Theoretically, every algorithm can be expressed as a polynomial. For example, locations where the algorithm takes a branch in its flow based on a condition C is replaced by computing both branches and multiplying the output of one branch by C and the output of the other by (1-C), and then calculating their sum. Here, we assume that C is either 0 or 1, in which case one branch is multiplied by 1 (that is, taken as is), and the other branch is multiplied by 0 (that is, is nullified). The value of C determines which branch will be taken. This idea extends to cases where C takes on other values as well. But, the size of the polynomial could grow exponentially with the number of branches. Thus, the most important thing is to redesign the algorithm so that it has as few branches as possible. A similar efficiency problem arises when we need to compare values.

To improve the running time, HE schemes use the underlying mathematics to gain a technical advantage that allows packing many values (usually a few thousand) into one ciphertext. A ciphertext then holds not just one message, but an array of messages, where additions or multiplications are executed in a SIMD (Single Instruction Multiple Data) manner. This significantly boosts the running time. However, packing introduces its own challenge (more on this later). That said, this improvement does not address the inherent problem of not being able to branch, as discussed in the previous paragraph.

What does our model do?

In a nutshell, in our approach we encoded DNA sequences as k-mer sets. We used the training set to find a representative for each strain. For the classification, we computed a similarity score with each representative set, and then chose the one with the highest score.

We used the notion of k-mers, which are sequences of k letters that appear consecutively in a DNA sequence. For example, for k=7 there are 47 = 8192 possible 7-mers because each DNA letter has one of four options. Given a DNA sequence, we encoded it as the set of all k-mers appearing in its DNA sequence. A representative set for a strain is a set of k-mers that have high correlation with the training DNA sequences belonging to the strain. Computing these representatives was done offline as part of the training process.

To classify a DNA sequence, the client computes its k-mer set, encrypts this set, and sends it to the server. The server computes similarity scores that signify how much the given DNA sequence is similar to each of the strain representatives. We used a standard similarity score that measures similarity between sets: 1 meaning that it is the same set, and 0 meaning that they have nothing in common. As stated previously, because the DNA sequence that needs to be classified is encrypted, the resulting similarity scores are also encrypted. As a technical last step, we normalized the similarity scores by dividing each of them by their average.

The encrypted normalized similarity scores were then returned to the client, which decrypted them as defined in the goals of the competition.

Challenges and solutions

During our work on the iDASH competition, we encountered several challenging situations that we addressed using a variety of approaches.

Making our model HE-friendly

As mentioned previously, reducing the number of branches that an algorithm makes in HE is critical. In our case, the naïve algorithm branches when it computes the intersection of two k-mer sets (as part of the computation of the similarity score). To avoid all of these branches, we encoded (before encryption) each k-mer set as a 4k-long binary vector. We created a global public map that assigned an index i to each 6-mer. The i-th coordinate in the vector was set to 1 or 0 depending on whether the i-th k-mer appears in the set. This encoding is HE-friendly.

Nontrivial encoding

To take advantage of the SIMD feature of HE, we used a nontrivial encoding in which a single ciphertext contained four k-mers of each of the 2000 DNA sequences to be classified (the usual encoding would pack all k-mers of a DNA sequence in a single ciphertext). This nontrivial encoding helped us reduce the running time and the memory requirement. The encoding was done using the IBM HElayers library, which easily supports such (and similar nontrivial) encodings over HE schemes.

The competition’s memory limitation was so tight that it did not allow us to simultaneously keep all of the encrypted DNA sequences in memory. To meet these limitations, we chose a “pipeline” architecture in which a ciphertext of the input was read, processed, and discarded before reading the next ciphertext of the input.

No division

As mentioned previously, HE schemes support only additions and multiplications. They do not support divisions (or computing the inverse 1/x). To compute and normalize the similarity scores, our algorithm needed to compute two inverses. We used a standard low-degree polynomial approximation to the inverse function.

What were our results?

The results of our system were:

Client encryption: 10 seconds using 280 MB of RAM and 1 CPU
Classification on the server: 0.6 seconds using 400 MB of RAM and 1 CPU
Decryption on the client: 1 second

Note that although the competition allowed the server to use 4 CPUs, our solution was so fast on 1 CPU that we did not need to parallelize it.

Conclusion

The world of HE is evolving and improving. Classification tasks that took days to complete 10 years ago and took hours then minutes to compute several years ago, now take a second. This is a result of advancements in the encryption schemes, in the algorithms, and in hardware. Our results show that privacy preserving classification can be done in a client/server model in real time.

We believe HE will become a common method for providing privacy-preserving solutions. Our HElayers library makes HE accessible and easy to use by noncryptographers.

You can download HElayers. Our code will be published soon as part of the demos that come with HElayers.

IBM homomorphic encryption: A DASHing solution for healthcare data privacy

Blog Post