
Security issues with sending ChatGPT sensitive data

source link: https://andrewpwheeler.com/2023/08/18/security-issues-with-sending-chatgpt-sensitive-data/


Part of my job as a data scientist is to be a bridge for laypeople interested in applying artificial intelligence and machine learning to their particular applications. Most quant people with a legit background will snicker at the term “artificial intelligence” – it is a buzzword for sure, but that doesn’t really matter. People have potential applications they need help with, and various statistical and optimization techniques can help.

Given the popularity of ChatGPT and other intelligent chatbots, I figured it would be worthwhile articulating the potential security issues with these technologies in criminal justice and healthcare domains. In particular, you should not send sensitive information in internet chatbot prompts. Examples of this include:

  • a crime analyst inputting incident narratives (that include names) and asking a chatbot to summarize them
  • a clinical coder inputting hospital notes and asking for the relevant billing codes
  • a business analyst inputting text from a set of slides, and asking ChatGPT to edit for grammar

The first two examples should be pretty clear why they are sensitive – they contain obviously sensitive and personally identifiable data. The last example is related to intellectual property leakage, which is fuzzier, but as a general piece of advice: if it is not OK to post something publicly for everyone on the internet to see, you should not put it into a prompt. (So crime analysts talking about crime trends is probably OK, since that is already public info, but a business analyst putting in your pitch deck for internal business applications is probably not.)

Why can’t I send ChatGPT sensitive information?

So the way many online APIs work (including ChatGPT) is this:

  1. You go to a website and input information into a webform
  2. That data gets posted to a web endpoint (someone else's computer)
  3. Someone else's computer takes that input and does something with that data
  4. That other computer sends information back to your computer

Here is a diagram of that flow:

[Diagram: request/response flow between your computer and OpenAI's computer]
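
To make that flow concrete, here is a rough sketch of what steps 2 through 4 look like in code when calling OpenAI's chat completions API directly with Python's requests library. The incident narrative and the API key are made up for illustration, but the point stands: everything you type ends up in a request body sent to someone else's computer.

```python
import requests

API_KEY = "sk-..."  # your secret key, sent along with every request

# Step 1: whatever you type into the prompt -- including any names,
# addresses, or case details -- ends up in this string.
prompt = (
    "Summarize this incident narrative: On 8/1 Officer Smith responded "
    "to a burglary at 123 Main St reported by Jane Doe..."
)

# Step 2: the prompt is POSTed to OpenAI's web endpoint.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
    },
    timeout=60,
)

# Steps 3 and 4: OpenAI's computer processes the prompt and sends a
# response back -- but it has already seen (and may retain or log)
# everything that was in `prompt`.
print(resp.json()["choices"][0]["message"]["content"])
```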

So there are two potential attack vectors in this diagram. The first is the arrows sending data to/from OpenAI's computer – someone could potentially intercept that data. This is not really a huge issue as stated, as the data is likely encrypted in transit. The second, and more important, issue is that the red OpenAI computer now has your sensitive data cached in some capacity.

If the red computer becomes compromised, it can cause issues. This is not hypothetical – OpenAI has already had issues with leaking sensitive information to other users. That was a software glitch – bad, but fixable. It is still a risk you should be aware of.

A more important issue, though, is that under the licensing I am aware of, they can use your conversations to improve the product. To my current understanding this is very bad: if they are updating models with your conversations downstream, the contents of your prompts can leak to third parties.

This is even worse than, say, Microsoft being able to read your emails – it would be as if a third, non-Microsoft party could become privy to some of your emails. For example, say a crime analyst in Raccoon City input crime incident narratives like in my prior example. Then I ask ChatGPT “Give me an example crime incident narrative”, and it outputs narratives very similar to the ones the Raccoon City crime analyst previously put into ChatGPT. This is a feature under the current licensing, not a bug.

Let me know in the comments if they are offering paid tiers along the lines of “we don’t use your data for training, it is always encrypted, and we can’t see it” (I don’t know why they do not offer that). They would also need to meet particular HIPAA standards for medical data, and CJIS standards for criminal justice data, to be in security compliance for these example applications.

Now it is important to discuss other chatbots, which are often just calling OpenAI under the hood. The data flow diagram then looks like this:

[Diagram: data flow with a wrapper service (blue computer) sitting between your computer and OpenAI's computer]

It is essentially the same attack vectors, just doubled; now we have two computers instead of one that are potential vulnerabilities.

Again, the issue here is that now two different parties have your data cached in some capacity (the blue computer and the red computer). We have people making new services all the time now (the blue computers) that are just wrappers on OpenAI. So you could have your data leaked by the blue computer, in addition to the problems with leaking at OpenAI.
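
To give a sense of what these blue computer wrapper services often amount to, here is a hypothetical sketch using Flask. The /summarize endpoint and the logging line are made up, but the overall pattern – accept your text, keep a copy in its own logs or database, then forward the same text on to OpenAI – is typical of these apps.

```python
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)
OPENAI_KEY = "sk-..."  # the wrapper service's key, not yours

@app.route("/summarize", methods=["POST"])
def summarize():
    user_text = request.json["text"]

    # The wrapper (blue computer) now has your text. Whether it logs,
    # caches, or stores it is entirely up to whoever runs this server.
    app.logger.info("received prompt: %s", user_text)

    # It then forwards the same text to OpenAI (the red computer),
    # so a second party has a copy as well.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "user", "content": f"Summarize: {user_text}"}],
        },
        timeout=60,
    )
    return jsonify({"summary": resp.json()["choices"][0]["message"]["content"]})

if __name__ == "__main__":
    app.run()
```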

The solution is local hosting, but local hosting is hard

OpenAI is to be commended for making a quality product – its easy-to-use APIs are what make building wrapper services on top of it so easy (hence these many chatbot APIs). From a security standpoint though, you now need to do your due diligence with two (or more) services when using these secondary tools, not just one. There will be malicious apps (so the blue computer is intentionally a bad actor), and there will be cases where the blue computer is compromised (not intended to be malicious, but the people running the blue computer messed up).

Given that OpenAI, as far as I am aware, doesn't have the necessary licensing terms to prevent info leakage, as well as the more specific security certifications, the solution like I said is to self host a model. Self hosting here means that instead of sending data to the red OpenAI computer, the flow stays entirely within the single black computer you own, or you have your own server (so a second black computer that speaks to the first black computer).

There are open source and freemium models that are reasonable competitors. But it is painful to self host these models. For neophytes, the way these language models work is that they take your text input and turn that text into a set of thousands of numbers. They then feed those thousands of numbers into a model with billions of parameters to get the final output. You can think of it as doing several billion mathematical operations that you could individually do on your hand-held calculator.
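
As a rough sketch of what self hosting looks like, here is an example using the open source Hugging Face transformers library. The model (gpt2) is chosen only because it is tiny and publicly available; a model you would actually want to chat with is orders of magnitude larger, which is where the hardware pain comes in. The point is that nothing leaves your machine – the weights are downloaded once and inference runs locally.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download (once) and load the model weights locally
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Summarize this incident narrative: ..."

# The tokenizer turns your text into a tensor of integer token ids
# (the numbers mentioned above)
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])

# Those numbers are pushed through the model (billions of parameters
# for a useful chat model) to produce output tokens, all on your machine
output_ids = model.generate(inputs["input_ids"], max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```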

This takes a computer with a large amount of memory and a GPU to do anything that doesn't take hours. So self hosting a smaller batch process is maybe doable for a normal person or business, but running a live chatbot for even one person is hard (let alone a chatbot for multiple people to use at the same time).
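
For a back-of-envelope sense of why memory is the bottleneck, just holding the weights of these models is substantial. This sketch assumes 16-bit parameters, and the model sizes are common open source sizes rather than any particular product; activations and other overhead add more on top.

```python
# Rough memory needed just to hold model weights at 16-bit (2 byte)
# precision; activations, KV-cache, and overhead add more on top.
def weight_memory_gb(n_params_billion, bytes_per_param=2):
    return n_params_billion * 1e9 * bytes_per_param / 1e9

for n in [7, 13, 70]:  # common open source model sizes
    print(f"{n}B parameters -> ~{weight_memory_gb(n):.0f} GB just for weights")
# 7B parameters already need ~14 GB, more than most consumer GPUs have
```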

Several large companies (including OpenAI) are currently using up the majority of the cloud infrastructure that has machines capable of hosting and running these models, so even if you have the money to pay AWS for one of their large GPU computers (it is expensive, think five digit costs per month), you may not even be able to get a slot for one of those cloud resources. And it is questionable how many people can even use that single machine.

I think eventually OpenAI will solve some of these security issues, and offer special paid tiers to accommodate use cases in healthcare and criminal justice. But until that happens, please do not post sensitive data into ChatGPT.

