
Security issues with sending ChatGPT sensitive data

source link: https://andrewpwheeler.com/2023/08/18/security-issues-with-sending-chatgpt-sensitive-data/


Part of my job as a data scientist is to be a bridge for laypeople interested in applying artificial intelligence and machine learning to their particular applications. Most quant people with a legit background will snicker at the term “artificial intelligence” – it is a buzzword for sure, but that doesn’t really matter. People have potential applications they need help with, and various statistical and optimization techniques can help.

Given the popularity of ChatGPT and other intelligent chatbots, I figured it would be worthwhile articulating the potential security issues with these technologies in criminal justice and healthcare domains. In particular, you should not send sensitive information in internet chatbot prompts. Examples of this include:

  • a crime analyst inputting incident narratives (that include names) and asking a chatbot to summarize them
  • a clinical coder inputting hospital notes and asking for the relevant billing codes
  • a business analyst inputting text from a set of slides, and asking ChatGPT to edit for grammar

The first two examples should be pretty clear why they are sensitive – they contain obviously sensitive and personally identifiable data. The last example is related to intellectual property leakage, which is fuzzier, but as a general piece of advice: if it is not OK to post something publicly for everyone on the internet to see, you should not put it into a prompt. (So crime analysts talking about crime trends is probably OK, since that is already public info, but a business analyst putting in your pitch deck for internal business applications is probably not.)

Why can’t I send ChatGPT sensitive information?

So the way many online APIs work (including ChatGPT) is this:

  1. You go to a website and input information into a webform
  2. That data gets posted to a web endpoint (someone else's computer)
  3. Someone else's computer takes that input and does something with that data
  4. That other computer sends information back to your computer

Here is a diagram of that flow:

[Diagram: request/response flow between your computer and OpenAI's computer]
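
To make that flow concrete, here is a rough sketch of what steps 2 through 4 look like in code when calling OpenAI's chat completions API directly with Python's requests library. The incident narrative and the API key are made up for illustration, but the point stands: everything you type ends up in a request body sent to someone else's computer.

```python
import requests

API_KEY = "sk-..."  # your secret key, sent along with every request

# Step 1: whatever you type into the prompt -- including any names,
# addresses, or case details -- ends up in this string.
prompt = (
    "Summarize this incident narrative: On 8/1 Officer Smith responded "
    "to a burglary at 123 Main St reported by Jane Doe..."
)

# Step 2: the prompt is POSTed to OpenAI's web endpoint.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
    },
    timeout=60,
)

# Steps 3 and 4: OpenAI's computer processes the prompt and sends a
# response back -- but it has already seen (and may retain or log)
# everything that was in `prompt`.
print(resp.json()["choices"][0]["message"]["content"])
```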

So there are two potential attack vectors in this diagram. The first is the arrows sending data to/from OpenAI's computer – someone could potentially intercept that data. This is not really a huge issue as stated, as the data is likely encrypted in transit. The second, and more important, issue is that the red OpenAI computer now has your sensitive data cached in some capacity.

If the red computer becomes compromised, it can cause issues. This is not hypothetical – OpenAI has already had issues with leaking sensitive information to other users. That was a software glitch – bad, but fixable. It is still a risk you should be aware of.

A more important issue, though, is that under the licensing I am aware of, they can use your conversations to improve the product. To my current understanding this is very bad: if they are updating models with your conversations downstream, the contents of your prompts can leak to third parties.

This is even worse than, say, Microsoft being able to read your emails – it would be as if a third, non-Microsoft party could become privy to some of your emails. For example, say a crime analyst in Raccoon City input crime incident narratives like in my prior example. Then I ask ChatGPT “Give me an example crime incident narrative”, and it outputs narratives very similar to the ones the Raccoon City crime analyst previously put into ChatGPT. This is a feature under the current licensing, not a bug.

Let me know in the comments if they are offering paid tiers along the lines of “we don’t use your data for training, it is always encrypted, and we can’t see it” (I don’t know why they do not offer that). They would also need to meet particular HIPAA standards for medical data, and CJIS standards for criminal justice data, to be in security compliance for these example applications.

Now it is important to discuss other chatbots, which are often just calling OpenAI under the hood. The data flow diagram then looks like this:

[Diagram: data flow with a wrapper service (blue computer) sitting between your computer and OpenAI's computer]

It is essentially the same attack vectors, just doubled; now we have two computers instead of one that are potential vulnerabilities.

Again, the issue here is that now two different parties have your data cached in some capacity (the blue computer and the red computer). We have people making new services all the time now (the blue computers) that are just wrappers on OpenAI. So you could have your data leaked by the blue computer, in addition to the problems with leaking at OpenAI.
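
To give a sense of what these blue computer wrapper services often amount to, here is a hypothetical sketch using Flask. The /summarize endpoint and the logging line are made up, but the overall pattern – accept your text, keep a copy in its own logs or database, then forward the same text on to OpenAI – is typical of these apps.

```python
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)
OPENAI_KEY = "sk-..."  # the wrapper service's key, not yours

@app.route("/summarize", methods=["POST"])
def summarize():
    user_text = request.json["text"]

    # The wrapper (blue computer) now has your text. Whether it logs,
    # caches, or stores it is entirely up to whoever runs this server.
    app.logger.info("received prompt: %s", user_text)

    # It then forwards the same text to OpenAI (the red computer),
    # so a second party has a copy as well.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "user", "content": f"Summarize: {user_text}"}],
        },
        timeout=60,
    )
    return jsonify({"summary": resp.json()["choices"][0]["message"]["content"]})

if __name__ == "__main__":
    app.run()
```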

The solution is local hosting, but local hosting is hard

OpenAI is to be commended for making a quality product – its easy-to-use APIs are what make building wrapper services on top of it so easy (hence these many chatbot APIs). From a security standpoint though, you now need to do your due diligence with two (or more) services when using these secondary tools, not just one. There will be malicious apps (so the blue computer is intentionally a bad actor), and there will be cases where the blue computer is compromised (not intended to be malicious, but the people running the blue computer messed up).

Given that OpenAI, as far as I am aware, doesn't have the necessary licensing terms to prevent info leakage, as well as the more specific security certifications, the solution like I said is to self host a model. Self hosting here means that instead of sending data to the red OpenAI computer, the flow stays entirely within the single black computer you own, or you have your own server (so a second black computer that speaks to the first black computer).

There are open source and freemium models that are reasonable competitors. But it is painful to self host these models. For neophytes, the way these language models work is that they take your text input and turn that text into a set of thousands of numbers. They then feed those thousands of numbers into a model with billions of parameters to get the final output. You can think of it as doing several billion mathematical operations that you could individually do on your hand-held calculator.
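
As a rough sketch of what self hosting looks like, here is an example using the open source Hugging Face transformers library. The model (gpt2) is chosen only because it is tiny and publicly available; a model you would actually want to chat with is orders of magnitude larger, which is where the hardware pain comes in. The point is that nothing leaves your machine – the weights are downloaded once and inference runs locally.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download (once) and load the model weights locally
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Summarize this incident narrative: ..."

# The tokenizer turns your text into a tensor of integer token ids
# (the numbers mentioned above)
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])

# Those numbers are pushed through the model (billions of parameters
# for a useful chat model) to produce output tokens, all on your machine
output_ids = model.generate(inputs["input_ids"], max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```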

This takes a computer with a large amount of memory and a GPU to do anything that doesn't take hours. So self hosting a smaller batch process is maybe doable for a normal person or business, but running a live chatbot for even one person is hard (let alone a chatbot for multiple people to use at the same time).
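
For a back-of-envelope sense of why memory is the bottleneck, just holding the weights of these models is substantial. This sketch assumes 16-bit parameters, and the model sizes are common open source sizes rather than any particular product; activations and other overhead add more on top.

```python
# Rough memory needed just to hold model weights at 16-bit (2 byte)
# precision; activations, KV-cache, and overhead add more on top.
def weight_memory_gb(n_params_billion, bytes_per_param=2):
    return n_params_billion * 1e9 * bytes_per_param / 1e9

for n in [7, 13, 70]:  # common open source model sizes
    print(f"{n}B parameters -> ~{weight_memory_gb(n):.0f} GB just for weights")
# 7B parameters already need ~14 GB, more than most consumer GPUs have
```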

Several large companies (including OpenAI) are currently using up the majority of the cloud infrastructure that has machines capable of hosting and running these models, so even if you have the money to pay AWS for one of their large GPU computers (it is expensive, think five digit costs per month), you may not even be able to get a slot for one of those cloud resources. And it is questionable how many people can even use that single machine.

I think eventually OpenAI will solve some of these security issues, and offer special paid tiers to accommodate use cases in healthcare and criminal justice. But until that happens, please do not post sensitive data into ChatGPT.

