Should we fine-tune our LLM with our own internal data?

Fine-tuning a large language model (LLM) with your organization’s internal data can seem like a powerful way to boost performance, but it’s not always necessary. For many use cases, out-of-the-box models, which are pre-trained on huge amounts of data, already perform remarkably well — especially when paired with thoughtful prompt engineering.

Before pursuing any fine-tuning, start by clearly defining your success metrics (e.g., accuracy, tone, completeness) and test whether the model’s default performance meets them. Try to improve performance first through well-crafted prompts, including more advanced techniques such as chain-of-thought prompting, which guides the model to reason step by step. These techniques can often significantly improve output quality without any further training.
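As a rough illustration of measuring baseline performance against a defined metric, the sketch below scores two prompt variants on a small test set. Everything in it is an assumption made for illustration: the test cases, the "completeness" metric (the fraction of required phrases that appear in the answer), the prompt templates, and `stub_model`, a placeholder for whatever LLM client you actually use. The stub's answers are contrived so the example runs offline, and they say nothing about how a real model would behave.

```python
from typing import Callable, Dict, List

# Hypothetical test set: each case pairs a question with phrases a complete
# answer is expected to mention.
TEST_CASES: List[dict] = [
    {"question": "What is our refund window?", "must_include": ["30 days"]},
    {"question": "Who approves travel expenses?", "must_include": ["manager", "finance"]},
]

# Two prompt variants to compare: a plain instruction and a chain-of-thought style one.
PROMPT_VARIANTS: Dict[str, str] = {
    "plain": "Answer the question.\n\nQuestion: {question}",
    "chain_of_thought": (
        "Answer the question. Think through it step by step, "
        "then give the final answer.\n\nQuestion: {question}"
    ),
}


def completeness(answer: str, must_include: List[str]) -> float:
    """Return the fraction of required phrases found in the answer (case-insensitive)."""
    if not must_include:
        return 1.0
    text = answer.lower()
    hits = sum(1 for phrase in must_include if phrase.lower() in text)
    return hits / len(must_include)


def evaluate(model: Callable[[str], str], template: str, cases: List[dict]) -> float:
    """Return the mean completeness score of one prompt template over all cases."""
    scores = [
        completeness(model(template.format(question=case["question"])), case["must_include"])
        for case in cases
    ]
    return sum(scores) / len(scores)


def stub_model(prompt: str) -> str:
    """Placeholder for a real LLM call; returns canned answers so the sketch runs offline."""
    if "refund" in prompt:
        return "Refunds are accepted within 30 days."
    if "step by step" in prompt:
        return "First the manager reviews it, then finance signs off."
    return "Your manager approves it."


def compare_prompts(model: Callable[[str], str]) -> Dict[str, float]:
    """Score every prompt variant against the shared test set."""
    return {name: evaluate(model, tpl, TEST_CASES) for name, tpl in PROMPT_VARIANTS.items()}


if __name__ == "__main__":
    for name, score in compare_prompts(stub_model).items():
        print(f"{name}: {score:.2f}")
```

In practice you would replace `stub_model` with a function that calls your model, and likely swap the phrase-matching metric for one that fits your use case, such as human review or a rubric for tone.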

If, after testing and optimizing prompts, the model still falls short, it may be worth exploring more advanced approaches. Fine-tuning is one option, but it’s often costly, complex, and harder to maintain over time. For many internal use cases, a simpler and more flexible alternative is retrieval-augmented generation (RAG), which allows the model to reference your data dynamically without retraining. Or, if you only need to supplement the model with a small amount of internal data, you may be able to include that data directly in the prompt context.
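To make the RAG idea concrete, here is a minimal sketch of the retrieve-then-prompt pattern. It assumes a tiny in-memory list of hypothetical internal documents and ranks them by simple word overlap with the question. Production systems typically use embedding-based search over a vector index instead, so treat the scoring here as a stand-in rather than a recommended retrieval method. The names `DOCUMENTS`, `retrieve`, and `build_prompt` are illustrative.

```python
import re
from typing import List, Set

# Hypothetical internal documents; in practice these would come from your own data store.
DOCUMENTS: List[str] = [
    "Employees may work remotely up to three days per week.",
    "Expense reports are due by the fifth business day of each month.",
    "The VPN must be used when accessing internal systems off-site.",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return up to k documents sharing at least one word with the query, best match first."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked if score > 0][:k]


def build_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Assemble a prompt that places the retrieved documents ahead of the question."""
    context = retrieve(query, documents, k)
    context_block = "\n".join(f"- {doc}" for doc in context) or "(no relevant documents found)"
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("When are expense reports due?", DOCUMENTS, k=1))
```

The same `build_prompt` shape covers the simpler case mentioned above: if your internal data is small enough, you can skip retrieval and place all of it in the context block.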

In general, start small and iterate: refine prompts first, then explore ways to incorporate internal data only if needed.

Additional Resources:

Methods for adapting large language models

This response has been generated by an LLM based on notes from PJMF technical consultations. All responses go through human review by our PJMF Products & Services team and are anonymized to protect our consultation participants.