If you are especially concerned about keeping sensitive data secure while interacting with a large language model (LLM), there are a few things you’ll want to think about:
a) which generative AI model to use
b) if open-source, how to host it
c) how to anonymize your data
Model selection
The best approach to ensure strong data privacy is to use an open-source model that you can fully control. Open-source models give you greater transparency and the ability to restrict where and how data flows. Proprietary providers like OpenAI and Anthropic have been improving their data privacy practices — e.g., offering configuration options to prevent submitted data from being used for training. Some of these providers also offer enterprise-tier services with stronger data isolation, encryption, and compliance standards (e.g., HIPAA, SOC 2). However, you are still placing trust in a third-party service.
Model hosting
If you move forward with an open-source model, the next question is how to host it. Hosting on a secure cloud platform, like Amazon Bedrock, can be reasonably safe for most organizations. For those with higher privacy or compliance requirements, an on-premises solution—running the model on your own hardware—offers the highest level of security, though it requires more in-house technical expertise.
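One practical benefit of self-hosting is that prompts never leave your network. As a rough sketch (assuming a self-hosted, OpenAI-compatible inference server such as vLLM or Ollama; the endpoint URL and model id below are placeholders you would replace with your own deployment's values):

```python
import json
import urllib.request

# Hypothetical values — adjust host, port, and model id to match
# your own self-hosted deployment.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "meta-llama/Llama-3-8B-Instruct"  # placeholder model id

def build_request(prompt: str) -> urllib.request.Request:
    """Build an HTTP request that keeps the prompt inside your network."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send the request once a server is running:
# response = urllib.request.urlopen(build_request("Summarize this report."))
```

Because the request goes to an address you control, you can enforce network-level restrictions (firewalls, VPNs, audit logging) that are impossible with a third-party API.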
Data anonymization
Aside from model selection and hosting, a critical step is data anonymization. Before sending any information to an LLM, strip out personally identifiable information (PII) like names, addresses, or Social Security numbers. These details aren’t usually necessary to get a useful model output. In some cases, however, sensitive information might still be relevant (e.g., a health app may need to retain gender to provide accurate results). In these cases, consider minimizing identifiability as much as possible. For example, an individual’s date of birth could be converted to an age range (e.g., 45–50) instead of keeping the exact date. These kinds of transformations help preserve data utility while reducing the risk of re-identification.
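The transformations above can be sketched in a few lines of Python. This is an illustrative example only — the regex patterns and the five-year bucket width are assumptions, and a production system should use a dedicated PII-detection library rather than hand-written patterns:

```python
import re
from datetime import date

def redact_pii(text: str) -> str:
    """Replace common U.S.-format PII patterns with placeholder tags.

    Illustrative only; real deployments should use a purpose-built
    PII-detection tool with broader coverage.
    """
    # Social Security numbers, e.g. 123-45-6789
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)
    # Email addresses
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", text)
    # Phone numbers, e.g. (555) 123-4567
    text = re.sub(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b", "[PHONE]", text)
    return text

def age_range(birth: date, today: date, width: int = 5) -> str:
    """Convert an exact date of birth to a coarse age range, e.g. '45-50'."""
    age = today.year - birth.year - (
        (today.month, today.day) < (birth.month, birth.day)
    )
    low = (age // width) * width
    return f"{low}-{low + width}"
```

For example, `redact_pii("Reach me at jane@example.com")` yields `"Reach me at [EMAIL]"`, and `age_range` maps a 46-year-old's birth date to the bucket `"45-50"` — preserving enough signal for the model while dropping the exact identifier.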
This response has been generated by an LLM based on notes from PJMF technical consultations. All responses go through human review by our PJMF Products & Services team and are anonymized to protect our consultation participants.