To reduce costs associated with generative AI models, organizations should strategically balance quality, speed, and infrastructure choices, while also recognizing their level of internal expertise.

Proprietary models like OpenAI’s GPT-4o may offer high performance, and importantly they do not require you to set up your own infrastructure to host the model. But this convenience can come at a premium: processing, say, 1M pages of PDF documents could cost between $750$^1$ and $1,125$^2$, depending on the model. Because costs vary across proprietary models, if you go this route we recommend testing multiple models on a small set of sample data to identify the best trade-off between performance and cost for your use case.

In contrast, open-source alternatives (for example, Llama hosted on AWS Bedrock) can be significantly cheaper: for the same 1M pages, the cost may be closer to $250$^3$. Self-hosting an open model does require additional technical expertise to build and maintain the serving infrastructure, but managed model hosting services (e.g., AWS Bedrock, Google Vertex AI, Azure AI services) minimize this overhead.
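As a rough illustration, figures like those above follow from a simple tokens-times-price calculation. The per-page token count and per-1M-token prices in this sketch are assumptions chosen to reproduce the ballpark numbers cited, not quoted rates; always check each provider's current pricing page before budgeting:

```python
# Back-of-the-envelope LLM processing cost sketch.
# All token counts and prices below are illustrative assumptions,
# NOT current list prices.

PAGES = 1_000_000
TOKENS_PER_PAGE = 500  # assumed average for a text-heavy PDF page

price_per_mtok = {  # assumed USD per 1M input tokens, for illustration
    "proprietary model (higher-priced tier)": 2.25,
    "proprietary model (lower-priced tier)": 1.50,
    "open model on managed hosting": 0.50,
}

for model, price in price_per_mtok.items():
    cost = PAGES * TOKENS_PER_PAGE / 1_000_000 * price
    print(f"{model}: ${cost:,.0f}")
```

Note that this only counts input tokens; output-token pricing (typically several times higher per token) and your actual tokens-per-page average can shift these estimates substantially.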

Beyond model selection, cost efficiency can be improved through techniques such as:

Combining smart model selection with the strategies above can help minimize costs without sacrificing critical quality or functionality.

Additional Resources:

Cost calculations:

$^1$ GPT-4.1 from OpenAI

$^2$ Claude Sonnet 4 from Anthropic

$^3$ Llama 4 Scout 17B from Meta hosted on Bedrock

This response has been generated by an LLM based on notes from PJMF technical consultations. All responses go through human review by our PJMF Products & Services team and are anonymized to protect our consultation participants.