Surfing the LLM Waves: Continuously Benefitting from the Ecosystem Advances

Divyam.AI
February 24, 2025


Generated with Imagen 3, with inspiration from "The Great Wave off Kanagawa".

On September 17, 2024, OpenAI announced o1-preview, heralding the era of reasoning Large Language Models – models that do not merely generate output tokens auto-regressively, but ponder over them at inference time (via intermediate thinking tokens) to ensure quality. The model enjoys a good performance rating (Artificial Analysis Performance Index: 86), but comes at a high cost (input: $15.00/mt; output: $60.00/mt – where the output tokens also include thinking tokens, and mt abbreviates million tokens). On January 20, 2025, DeepSeek R1 was announced. It delivers even more impressive performance (Artificial Analysis Performance Index: 89) at a ~20-25x lower price (input: $0.75/mt; output: $2.40/mt on DeepInfra). Shortly thereafter, on January 31, 2025, OpenAI followed suit with o3-mini, which matches DeepSeek R1's quality (Artificial Analysis Quality Index: 89), but at an intermediate price point (input: $1.10/mt; output: $4.40/mt).

If you are an application developer who benefits from the reasoning capability, should you migrate from o1-preview to DeepSeek R1, and then again to o3-mini? In an intensely competitive field such as frontier LLMs, such potentially disruptive events occur frequently – e.g., when a new frontier LLM arrives, or when a provider such as Groq slashes the cost of access. In the future, as fine-tuning becomes commoditised, we surmise such events will occur even more frequently. Irrespective of these events that cause a step-jump, the quality and price of every provider or LLM change with time (a phenomenon christened LLMflation by a16z: for the same performance, LLM inference cost gets 10x cheaper every year). This begs the question: must we migrate continuously?

The question of migration is even more nuanced. Two frontier LLMs with equivalent overall performance may perform differently on different tasks: e.g., while both o3-mini and DeepSeek R1 share an Artificial Analysis Quality Index of 89, on quantitative reasoning (the MATH-500 benchmark) DeepSeek R1 fetches 97%, whereas o3-mini fetches 92%. This makes the migration decision further contingent on the nature of the application.

An application developer, thus, needs a mechanism for continuous migration – which enables her to decouple the choice of the provider/LLM from the application logic.

As an aside, in the world of finance, a trader would need to re-allocate her portfolio in response to (predicted) movements in the asset prices. A quantitative trader offloads this task to an algorithm.

At Divyam, we believe that an algorithmic, continuous, fractional migration is feasible – where the migration decision is offloaded to a router at a per prompt granularity.

To study the efficacy of routers, we conducted an experiment at Divyam. Specifically, we took the MT-Bench dataset, which contains 80 two-turn conversations between a human expert and an LLM. With Divyam’s evaluation harness, we replayed these conversations to both o1-preview and DeepSeek-R1-Distill-Llama-70B (input cost: $0.23/mt; output cost: $0.69/mt on DeepInfra; almost equal performance to o1-preview on MATH-500 and HumanEval), and used o1-preview to judge the resulting responses. The prompt template for the judge follows the best practices listed in the landmark Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena paper (note that we allow the application developer to plug in her own eval instead). The result shows that 48 out of the 80 conversations (60%) elicit an equivalent or better response if we choose the cheaper alternative in place of o1-preview – which amounts to slashing a $100 bill to $42.4 (~2.4x reduction) – at the expense of a slight reduction in quality on half the conversations (note that this trade-off is a function of willingness-to-pay, a knob the application developer can tune to suit her appetite). We present a visual summary below:

Model comparison across traffic, cost, and quality.
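For readers who want to reproduce the flavour of this experiment, the sketch below shows a pairwise LLM-as-a-judge comparison in the spirit of the MT-Bench paper. The judge prompt wording, the verdict format, and the `judge_preference` helper are illustrative assumptions, not Divyam’s actual evaluation harness.

```python
# Minimal sketch of pairwise LLM-as-a-judge scoring, loosely following the MT-Bench
# methodology. The judge prompt wording and verdict parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two AI responses to the
user question below on helpfulness, relevance, accuracy, depth, and level of detail.
Do not let the order of the responses influence your decision.
Output exactly one verdict: [[A]], [[B]], or [[C]] for a tie.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}"""

def judge_preference(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'C' (tie) according to the judge model."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    reply = client.chat.completions.create(
        model="o1-preview",  # judge model used in the experiment described above
        messages=[{"role": "user", "content": prompt}],
    )
    text = reply.choices[0].message.content
    for verdict in ("[[A]]", "[[B]]", "[[C]]"):
        if verdict in text:
            return verdict.strip("[]")
    return "C"  # fall back to a tie if the verdict cannot be parsed
```

In practice one would run each comparison twice with the response order swapped, as the MT-Bench paper recommends, to counter position bias.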

This, however, is only an upper bound. To operationalise this insight, one needs to actually build a router. While a detailed discussion of routing algorithms is deferred to a later blogpost, we illustrate the intuition behind a conceptually simple routing algorithm: k-Nearest Neighbour (kNN). kNN builds an “atlas” of all the prompts in the application log, and remembers the quality that each LLM yielded on them. Presented with a new prompt, kNN simply places it on that atlas, looks up the k nearest neighbours around it, and routes to the LLM that yielded the highest average quality in this neighbourhood. The following figure (left panel) visualises the atlas of MT-Bench. The atlas was obtained by first embedding each prompt into a 384-dimensional space with the “all-MiniLM-L12-v2” Sentence Transformer, then projecting the embeddings onto the plane with t-SNE – a dimension-reduction algorithm – and, lastly, colouring each conversation according to the most performant LLM for it. The right panel segments the atlas according to the routing decision: if a prompt maps to a red region, the kNN router (with k=3) routes it to DeepSeek; if it falls into a green region, it goes to o1-preview. A minimal sketch of this kNN router appears after the figure.

Model performance distribution on MT-Bench.
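The following is a minimal sketch of such a kNN router, assuming sentence-transformers and scikit-learn are available; the logged prompts and quality scores below are placeholders rather than actual MT-Bench results.

```python
# Minimal sketch of a kNN router: embed the logged prompts, remember the per-LLM
# quality on each, and route a new prompt to the LLM with the best average quality
# among its k nearest neighbours. Prompts and quality scores are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

encoder = SentenceTransformer("all-MiniLM-L12-v2")  # the 384-dimensional space used in the figure

logged_prompts = [
    "Write a haiku about the monsoon.",
    "Summarise this meeting transcript in three bullet points.",
    "Prove that the square root of 2 is irrational.",
    "Find the number of integer solutions of x^2 + y^2 = 2025.",
]
quality = {  # judged quality of each LLM on each logged prompt (placeholder numbers)
    "deepseek-r1-distill-llama-70b": np.array([0.90, 0.85, 0.95, 0.92]),
    "o1-preview":                    np.array([0.88, 0.90, 0.96, 0.97]),
}

log_embeddings = encoder.encode(logged_prompts, normalize_embeddings=True)
index = NearestNeighbors(n_neighbors=3, metric="cosine").fit(log_embeddings)

def route(prompt: str) -> str:
    """Route to the LLM with the highest average quality in the prompt's neighbourhood."""
    emb = encoder.encode([prompt], normalize_embeddings=True)
    _, neighbours = index.kneighbors(emb)
    return max(quality, key=lambda llm: quality[llm][neighbours[0]].mean())

print(route("Compute the area enclosed by y = x^2 and y = 2x."))
```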

At Divyam, we built an agentic workflow, where agents specialising in evaluation, routing, etc. collaborate to facilitate continuous and fractional (i.e., per-prompt) migration – allowing the application developer to focus 100% on application development, free of any distraction posed by migration. This workflow requires only a low-touch integration with the application, and can be deployed on the client’s infrastructure.



Divyam.AI's Performance vis a vis Microsoft and Nvidia Routers


Today, you can choose from a crowd of models that flex intelligence and capability. You have intelligence on tap, but which tap should you turn? Getting the right balance of power and proportionality when choosing your AI toolset plays a huge role in the success of your AI deployments. In the context of LLMs, this challenge is crucial, as LLM inference often accounts for the bulk of your AI expenditure. Divyam.AI addresses exactly this challenge, helping you optimize the cost-performance balance of your GenAI deployments.

In this article, we present a comparative study of Divyam’s Router (the DAI Router in the diagram) vis-a-vis two industry titans – the Microsoft Model Router and the NVIDIA LLM Router.

To understand the comparison, let us dig into the principle on which Divyam’s Router works. 

Suppose you want to assess the mental abilities and knowledge-based skills of thousands of students. You would design a test with a questionnaire, make the students take the test, and rank them on their performance. Institutions have been doing this for decades using a psychometric framework called Item Response Theory (IRT), which has been around since 1968! IRT is a family of psychometrically grounded statistical models that takes the response matrix (i.e., how each student answered every question in the questionnaire) as input, and estimates the “skill” possessed by each student, the “skill” needed to solve each question, and each question’s “difficulty”.

To draw a parallel, now consider each LLM a student and the evaluation benchmarks the test. Divyam extends the IRT family to estimate the skill required by, and the difficulty of, a hitherto-unseen prompt, and combines this with the estimated skill of each LLM to produce an ex-ante performance estimate.
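To illustrate the underlying idea (not Divyam’s proprietary algorithm), here is a sketch using the simplest member of the IRT family, the one-parameter Rasch model, in which the probability of a correct answer is a logistic function of skill minus difficulty. The skill, difficulty, and cost numbers are made up for illustration.

```python
# Sketch of IRT-style routing with a Rasch (one-parameter logistic) model:
# P(correct) = sigmoid(skill - difficulty). All numbers below are illustrative.
import math

LLM_SKILL = {            # skills estimated from a response matrix on benchmark questions
    "o4-mini":      2.1,
    "gpt-4.1":      1.8,
    "gpt-4.1-mini": 1.2,
    "gpt-4.1-nano": 0.5,
}
LLM_COST = {             # blended $/million tokens, placeholder values
    "o4-mini": 2.2, "gpt-4.1": 4.0, "gpt-4.1-mini": 0.8, "gpt-4.1-nano": 0.2,
}

def p_correct(skill: float, difficulty: float) -> float:
    """Rasch model: probability that a model of given skill answers a prompt of given difficulty."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))

def route(prompt_difficulty: float, min_p: float = 0.7) -> str:
    """Cheapest LLM whose ex-ante probability of being correct clears the bar;
    fall back to the most skilled LLM if none does."""
    ok = [m for m, s in LLM_SKILL.items() if p_correct(s, prompt_difficulty) >= min_p]
    return min(ok, key=LLM_COST.get) if ok else max(LLM_SKILL, key=LLM_SKILL.get)

print(route(prompt_difficulty=-1.0))  # easy prompt -> gpt-4.1-nano
print(route(prompt_difficulty=2.0))   # hard prompt -> o4-mini
```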

Comparison of Routers

Routers are models trained to select the best large language model (LLM) to respond to a given prompt in real time. A router draws on a combination of pre-existing models to deliver high performance while saving on compute costs where possible, all packaged as a single model deployment.

The Divyam LLM Router employs a proprietary algorithm that assesses the skill required for (and the difficulty of) each prompt, and, based on that, routes the prompt to the most suitable of the available models.

Dataset

For our comparative study, the benchmark we have chosen is MMLU-Pro, which tests the reasoning and knowledge capabilities of an LLM via a set of multiple-choice questions spanning 14 subjects, such as Math, Physics, Computer Science, and Philosophy. Each question is presented with 10 possible answers, and an LLM, upon receiving a 5-shot prompt, must choose the sole correct answer. A randomly chosen 20% sample (2,406 out of 12,032 questions) serves as the test dataset, on which we report performance. A sketch of this setup appears below.
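For reference, here is a sketch of how such a held-out split and a 5-shot multiple-choice prompt can be constructed; the field names and prompt wording are our assumptions, not necessarily the exact setup used in this study.

```python
# Sketch of the evaluation setup: a fixed 20% held-out split of MMLU-Pro (2,406 of 12,032
# questions) and a 5-shot multiple-choice prompt with 10 options (A-J). The field names
# ("question", "options", "answer") and prompt wording are assumptions for illustration.
import random

def split_test_set(questions: list[dict], frac: float = 0.2, seed: int = 42) -> tuple[list, list]:
    """Randomly hold out `frac` of the questions; returns (few-shot/dev pool, test set)."""
    rng = random.Random(seed)
    shuffled = questions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[cut:], shuffled[:cut]

def five_shot_prompt(shots: list[dict], target: dict) -> str:
    """Five solved examples followed by the target question, whose answer is left blank."""
    letters = "ABCDEFGHIJ"

    def block(q: dict, answer: str) -> str:
        options = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(q["options"]))
        return f"Question: {q['question']}\n{options}\nAnswer: {answer}"

    return "\n\n".join([block(q, q["answer"]) for q in shots[:5]] + [block(target, "")]).rstrip()
```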

LLM Performance

In the table below, we present the performance of a set of contemporary LLMs on MMLU-Pro. From the table, we can see that o4-mini has the best accuracy on this benchmark. Our subsequent tests therefore take o4-mini as the basis for relative comparisons.

Results with Microsoft Model Router

The Microsoft Model Router (MS Router) is packaged as a single Azure AI Foundry model that you deploy. Notably, Model Router cannot be fine-tuned on your data.

The LLMs are chosen from a pre-existing set of models, namely gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, and o4-mini. Notably, one cannot add to or remove from this list.

Unlike the Microsoft Model Router, Divyam routes your queries to the right LLM based on your preference for (a) cost optimization or (b) performance optimization.

The graph below, of Cost Savings vs. Accuracy, presents router performance when the selection is limited to the MS Router’s set of LLMs. Divyam’s Quality Optimization parameters have been tuned to match the MS Router’s accuracy, making the comparison like-for-like. This tuning is unique to Divyam and is not possible with the MS Router.

You can see that, for the same relative accuracy, Divyam’s cost savings (59.92%) are nearly double those of the MS Router (35.52%).

Whereas the MS Router is stuck with its choice of LLMs, nothing restricts Divyam from adding the right set of LLMs for our customer. After the 3 Gemini models presented in the above graph are added to the Divyam Router (alongside the ones the Microsoft Model Router was already routing to), we notice a clear uptick in the cost-performance Pareto frontier.

You can see from the graph above that Divyam does even better in terms of cost savings and accuracy compared to the MS Router. For the same relative accuracy, Divyam’s cost savings (84.46%) are nearly 3 times those of the MS Router (35.52%).

Results with NVIDIA Router


The NVIDIA LLM Router can be configured with one of 3 router models – (1) task-router, (2) complexity-router, (3) intent-router – each of which, in turn, is powered by a (pre-LLM era) language model, Microsoft/DeBERTa-v3-base, which contains 86M parameters.

We consider the task-router and the intent-router unsuitable for our purpose and focus only on the complexity-router. The complexity-router classifies each prompt into one of 6 pre-defined classes (e.g., “Domain”), and routes all prompts in a class to a single, configurable LLM. In our specific setup, all queries belonging to “Domain” are routed to the large LLM, whereas everything else is routed to the small language model (SLM); the sketch below makes this mapping concrete.
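For concreteness, the class-based behaviour described above boils down to a static class-to-model table, as in the sketch below; the classifier stub and model names are placeholders, not NVIDIA’s actual implementation.

```python
# Sketch of the class-based routing described above: every prompt in a class goes
# to one configured model. The classifier stub and model names are placeholders;
# the real complexity-router uses a fine-tuned DeBERTa-v3-base classifier.
CLASS_TO_MODEL = {
    "Domain": "large-llm",    # e.g. a frontier model for domain-knowledge prompts
}
DEFAULT_MODEL = "small-slm"   # everything else goes to the small language model

def classify_complexity(prompt: str) -> str:
    """Stand-in for the 86M-parameter classifier; here a crude keyword heuristic."""
    domain_markers = ("theorem", "diagnosis", "statute", "derivative")
    return "Domain" if any(w in prompt.lower() for w in domain_markers) else "other"

def route_by_class(prompt: str) -> str:
    return CLASS_TO_MODEL.get(classify_complexity(prompt), DEFAULT_MODEL)

print(route_by_class("State and prove the central limit theorem."))  # -> large-llm
print(route_by_class("Write a friendly reminder email."))            # -> small-slm
```

The contrast with the IRT-based approach is that the routing decision here is only as fine-grained as the 6 classes, whereas a skill/difficulty estimate is computed per prompt.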

We have tuned Divyam’s Quality Optimizer to two settings: “Priority Cost Saving” and “Priority Accuracy”.

From the above graph, you can see that, for the same range of cost saving, Divyam’s relative accuracy (-0.16%) surpasses NVIDIA’s (-18.1%) by roughly 18 percentage points when tuned for cost saving. Also, Divyam’s relative accuracy (1.31) surpasses that of GPT-4.1 when tuned for accuracy.

The table below drills a level deeper into the results above. It shows how Divyam uses multiple dimensions of LLM ability to select, for each prompt, the LLM with the highest estimated probability of answering correctly. It also lists the distribution of chosen LLMs as a percentage of prompts in the test set.

Divyam’s MMLU-Pro Router Performance

For a similar test, the NVIDIA LLM Router’s results are depicted in the graphs below.


In conclusion, we see that Divyam’s Router yields a better cost-performance Pareto frontier than both the Microsoft Router and the NVIDIA Router, even though the two comparisons embody different philosophies of LLM choice. Divyam’s ability to prioritize cost and accuracy, as separate or combined objectives, is unique and yields better results in both cases. Moreover, Divyam is not confined to any single vendor’s ecosystem and can easily incorporate LLMs from all segments.

Stay tuned for more experimental results on cost-performance trade-offs and deeper tests confirming Divyam’s low running costs.


AI Strategy focused on maximizing returns on your GenAI investments

June 24, 2025

As industries across the spectrum continue to be reshaped by GenAI, embracing it with strategic, ethical, and operational foresight will determine the extent to which businesses can truly harness its transformative potential and craft a future of success, sustainability, and societal contribution. However, GenAI faces its fair share of adoption hurdles. Organizations committed to leveraging generative AI must navigate myriad challenges to ensure both solution efficacy and ethical application.

Let us consider two of these challenges:

Ensuring adaptability and scalability – given the scores of GenAI products in today’s ever-evolving market, an organization is continually wondering whether it has chosen the right product. The problem of vendor lock-in looms large, as the costs of adapting and scaling are formidable. Your choice of LLM for your applications hinges on a crucial cost-versus-benefit balance. But this balance is not static – it needs continuous evaluation and evolution given the fluidity of GenAI advancements.

Accuracy and hallucinations – your organization has GenAI-based solutions, but you are continually concerned about the quality of their output. There are techniques to mitigate the tendency of AI models to hallucinate, such as cross-checking the results of one AI model against another (sketched below), which can bring the hallucination rate to under 1%. How far one goes to mitigate hallucinations depends largely on the actual use case, but it is something AI developers need to work on continually.
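A minimal sketch of the cross-checking idea mentioned above follows; the model names and the YES/NO verdict format are illustrative, and a production setup would use a stricter validator, ideally a second model from a different provider.

```python
# Sketch of cross-checking one model's answer with a second model to flag possible
# hallucinations. Model names and the YES/NO verdict format are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(model: str, prompt: str) -> str:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip()

def cross_checked_answer(question: str) -> tuple[str, bool]:
    """Return (answer, suspicious): suspicious is True when the verifier disagrees."""
    primary = ask("gpt-4.1-mini", question)  # primary, cheaper model
    verdict = ask(
        "gpt-4.1",                           # second model acting as a verifier
        f"Question: {question}\nProposed answer: {primary}\n"
        "Is the proposed answer factually correct? Reply with YES or NO only.",
    )
    return primary, not verdict.upper().startswith("YES")
```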

The above two challenges are compounded by the fact that your organization’s AI skill set not only needs continual upgrading, but your skilled workforce must also experiment with, and train your application against, new entrants in the GenAI market to ensure you are not missing out on the latest benefits they offer. This affects your time to market and cost of production.

What if there were a solution that chose the best LLMs for you, fine-tuned them for your application, continuously evaluated the quality of your output, and provided excellent observability across all your GenAI use cases – all while always ensuring the best-optimized cost-benefit ratio?


Divyam.AI is the solution you are looking for.

If you are an organization that has established its inference pipelines using GenAI, Divyam can help you upgrade to the model best suited to your application. It is a solution in which the best model for your use case is made accessible to you through a universal API that follows the OpenAI API standards. Divyam’s model selector takes each prompt of your application through the API, works its magic, and directs it to the best LLM for that prompt.
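Because the universal API follows the OpenAI standard, integration typically amounts to pointing an existing OpenAI client at a different base URL. The endpoint URL and the model alias below are hypothetical placeholders, not Divyam’s actual values.

```python
# Minimal sketch of calling an OpenAI-compatible routing endpoint. The base_url
# and the "auto" model alias are hypothetical placeholders, not Divyam's actual values.
from openai import OpenAI

client = OpenAI(
    base_url="https://divyam-gateway.example.com/v1",  # hypothetical gateway inside your VPC
    api_key="YOUR_DIVYAM_API_KEY",
)

response = client.chat.completions.create(
    model="auto",  # hypothetical alias: let the router pick the best LLM for this prompt
    messages=[{"role": "user", "content": "Summarise Q2 revenue drivers in three bullets."}],
)
print(response.choices[0].message.content)
```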

It is unique because you have a leaderboard of models to choose from at a per-prompt granularity. In the absence of Divyam, you would have needed to employ data scientists to experiment and choose the best model for your application. Moreover, choosing the best model at a per-prompt granularity is a hard problem to solve; you would rather have it solved for you by a plug-and-play solution like Divyam. The LLM chosen by Divyam’s fabric could be an API-based model like ChatGPT or Gemini, or a model like Llama that is self-hosted on your inference server.

If you have been running your application through Divyam, you also need not worry about fine-tuning your inference-server model. Divyam’s fine-tuner takes that headache off you. The fine-tuner has intelligence built in that chooses the right parameter values to tune, suited to your application’s patterns, and uploads the fine-tuned model back to your inference server. This continuously gives your users an evolving experience and the best performance from your inference-server model.

In cases where Divyam has chosen API-based LLMs for your application, and you are wondering whether you are still at your peak cost-benefit ratio, Divyam’s evaluation engine has you covered. The evaluator runs in the background and continuously A/B-tests against earlier, cheaper versions of the LLMs, so that your application always maintains an almost equal or better performance index.

The cost-quality trade-offs of LLM usage are application-specific. Different applications have different tolerances, and often an organization wants to strike a balance between reach and quality. Divyam.AI provides a per-application slider that you can configure to achieve the desired balance. You can also observe the cost benefits and quality-metric improvements on our rich dashboards and compare performance. This can make all the difference between a positive and a negative RoI for the application.
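Conceptually, such a slider can be thought of as a single weight that trades expected quality against cost when scoring candidate LLMs for a prompt. The scoring rule and numbers below are an illustrative simplification, not Divyam’s actual objective.

```python
# Illustrative cost-quality slider: score = expected_quality - weight * normalised_cost,
# where weight in [0, 1] is the per-application knob. All numbers are placeholders.
def pick_model(candidates: dict[str, dict], cost_weight: float) -> str:
    """candidates: {model: {"quality": expected quality in [0, 1], "cost": $/1M tokens}}."""
    max_cost = max(c["cost"] for c in candidates.values())
    return max(
        candidates,
        key=lambda m: candidates[m]["quality"] - cost_weight * candidates[m]["cost"] / max_cost,
    )

candidates = {
    "o4-mini":      {"quality": 0.83, "cost": 2.2},
    "gpt-4.1-mini": {"quality": 0.78, "cost": 0.8},
    "gpt-4.1-nano": {"quality": 0.65, "cost": 0.2},
}
print(pick_model(candidates, cost_weight=0.05))  # quality-leaning slider -> o4-mini
print(pick_model(candidates, cost_weight=0.9))   # cost-leaning slider    -> gpt-4.1-nano
```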


Let us come to the inevitable question of data privacy. Your organization’s data remains safe within your VPCs. Divyam is deployed within your enterprise boundaries, so your data stays put. Divyam only has control plane access to push in the latest and greatest intelligence and monitor quality so that you are always at peak performance.

Divyam.AI is also available as a hosted solution in case you want to get started with a single line of code change. 

In conclusion, Divyam.AI learns from both its global expertise and your own historical data to build a private, use-case-specific knowledge base that trims away hundreds of irrelevant model choices, automatically selects the best one, and continuously monitors live quality. If performance ever dips, it intervenes instantly to protect production, and it reruns the whole process on schedule or the moment a new model emerges. All of this happens without manual effort, so your team can stay focused on delivering core value instead of chasing model upgrades or cost savings.

Cut costs. Boost accuracy. Stay secure.

Smarter enterprise workflows start with Divyam.ai.
