07.2024

Mayfield CXO Insight Call – LLM Application Observability and Evaluation with LangChain

We recently hosted our latest CXO Insight Call on the topic of LLM Application Observability and Evaluation. We were joined by Ankush Gola, Co-founder of LangChain, and Jorge Torres, CEO and Co-founder of MindsDB.

The promise of generative AI is immense, but when it comes to building and scaling successful LLM applications, it can be difficult to move from curiosity to a solid proof of concept with repeatable results. The future is going to require new plumbing to make this all a bit less painful. On this call, we spent some time discussing how to use frameworks like LangChain and dev tools like MindsDB.

For those who aren’t in the know, LangChain is an open-source application development framework used to build, collaborate on, and test LLM applications through evaluations. In terms of open source statistics, they have over 85,000 stars on GitHub, over 2,500 contributors, 80 model integrations, and 600 API and tool integrations. More than 35,000 applications are built with LangChain, and their Python framework alone has over 14 million downloads per month, which is fantastic. More recently, they have added a couple of new platforms as well: LangSmith, which helps with the problem of evaluations and observability, and LangGraph, designed to help developers build agentic applications with large language models.

What Does an End to End Use Case Look Like in a Large Organization?

There are a few steps developers are going to have to take when it comes to building AI applications. The first step comes down to: what should we build? There’s so much promise in generative AI today, and there are so many interesting applications out there, so where do we get started? The process usually flows something like what we have below:

What should we build? Lots of options – crawl/walk/run approach
How do we go from curiosity to proof of concept? Need a new toolchain and shift in building strategy
How do we meet our quality standards? Repeatability of results at scale requires new methods of testing and evaluation
How do we build trust in a non-deterministic application? Minimize the reputational risk of delivering poor experiences
How do we keep customer data safe? There is a generalized uneasiness today around how customer data is being stored and used

Steps three and four can be the hardest. Generative AI applications are very easy to demo, and they make for great demos, but getting repeatable results at scale and then building trust in these non-deterministic applications is actually quite challenging. This is where we see the differentiation between companies that are building real applications that provide high ROI, and companies that are struggling to get out of the demo phase.

The Challenge of Step Three: Evaluations

Evaluations are a way to measure the performance of your LLM application, and this is really the differentiator between companies that are building demos and companies that are building real applications that are serving production workloads reliably, efficiently, and predictably. Just as developers write unit tests for traditional software, the analogue in the LLM world is evaluations. You’ll need a dataset to test your application against: a set of inputs and expected outputs. For example, if you’re building a chatbot, the inputs might be a set of user questions and the outputs might be the expected answers to those questions. Developers will often start with manually curated examples: they write down questions they think a user will ask, and then answer the questions themselves by looking at the docs.
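To make the unit-test analogy concrete, here is a minimal sketch in plain Python of a hand-curated dataset and an exact-match check. The `Example` class, `toy_chatbot`, and the sample questions are hypothetical stand-ins invented for illustration; they are not part of LangChain or LangSmith.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str         # input a user might send
    expected_answer: str  # reference answer written by a developer


def toy_chatbot(question: str) -> str:
    """Hypothetical stand-in for the real LLM application."""
    canned = {
        "What is the refund window?": "30 days",
        "Do you ship internationally?": "Yes",
    }
    return canned.get(question, "I don't know")


def exact_match(predicted: str, expected: str) -> bool:
    """Simplest possible evaluator: case-insensitive string equality."""
    return predicted.strip().lower() == expected.strip().lower()


def run_evaluation(app, dataset) -> float:
    """Run the app on every example and return the fraction answered correctly."""
    results = [exact_match(app(ex.question), ex.expected_answer) for ex in dataset]
    return sum(results) / len(results)


# Manually curated examples, as described above.
dataset = [
    Example("What is the refund window?", "30 days"),
    Example("Do you ship internationally?", "yes"),
    Example("Where is my order?", "Check the tracking page"),
]

accuracy = run_evaluation(toy_chatbot, dataset)
print(f"accuracy: {accuracy:.2f}")
```

Exact match is only a starting point. Free-form answers usually need a softer comparison, which is where the evaluator choices discussed next come in.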

You can build a dataset from user-provided examples, so once you have your application in production, you can actually take some of the production logs and add them back to your dataset to increase your benchmarking set over time, and even generate synthetic examples.

The next thing to focus on when it comes to evaluations is the evaluator itself, which is really just a piece of code. It can be an LLM, an automatic evaluator, or it can even be a human that grades the output of your LLM app. Not all evaluation modes require a ground truth. If you’re evaluating an output based on vagueness, or helpfulness, or criteria along those lines, it’s not necessary to have an expected output. But if you’re doing any kind of accuracy measure, this is where having that output expectation is key.

The last step is applying evaluations. You can write them as unit tests, or use them to judge your application’s performance both online and offline. You can also do A/B testing, where you deploy different versions of your application into production for different groups of users. This will help you better understand the results of a prompt iteration, a model change, or things of that nature.
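One common way to run the A/B test described above is to assign each user to a variant deterministically, then compare feedback per variant. The sketch below assumes a hash-based split and a binary feedback score; the variant names and the `summarize` helper are hypothetical.

```python
import hashlib
from collections import defaultdict

# Hypothetical variant names, e.g. two versions of a prompt.
VARIANTS = ["prompt_v1", "prompt_v2"]


def assign_variant(user_id: str) -> str:
    """Map a user to a variant; the same user always gets the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def summarize(feedback):
    """Average score per variant.

    feedback: iterable of (user_id, score), where score is 1 (good) or 0 (bad).
    """
    scores = defaultdict(list)
    for user_id, score in feedback:
        scores[assign_variant(user_id)].append(score)
    return {variant: sum(s) / len(s) for variant, s in scores.items()}


feedback = [("alice", 1), ("alice", 0), ("bob", 1), ("carol", 0)]
print(summarize(feedback))
```

Hashing the user ID, rather than picking a variant at random on every request, keeps each user's experience consistent across sessions.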

Best practices:

Datasets: Add interesting production logs to datasets
Evaluators: Use LLM-as-a-judge evaluators
Evaluators: Evaluate on intermediate steps (especially important for RAG use cases)
Applying Evaluators: Online evaluations

Building a Data Flywheel

In terms of best practices, building a good dataset is a great place to start. When you first build a dataset for evaluations, you might be manually curating examples or building up synthetic examples using an LLM, but it’s often a really good idea to take the production logs that you’re getting when you deploy your application to an initial set of users. You can bring those production logs back into your dataset. You log in production, find interesting log data, and start to build a data flywheel.

This is where understanding your data becomes important. One way we see people do this is they actually log a thumbs up or thumbs down response to their LLM generation, and then they log that feedback along with their production log. This way they’re able to filter on the logs that get the best user feedback or the worst user feedback and add them to datasets as interesting data points. Once refined, they can use the best results for benchmarking, and the worst results to ensure that their application doesn’t trip up on these inputs again when they iterate.

In LangSmith you can add filters to your logs and then set up automations that add the inputs and outputs of the most interesting traces to your dataset as examples. Companies have been able to achieve a 30% improvement in accuracy with no prompt engineering, just by building off their data flywheel.
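The flywheel step can be sketched in plain Python as a filter over logged records. This is not LangSmith's automation API; `LogRecord`, the feedback encoding, and the `benchmark`/`regression` tags are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    question: str
    answer: str
    feedback: Optional[int] = None  # 1 = thumbs up, 0 = thumbs down, None = no feedback


def select_examples(logs):
    """Turn production logs with user feedback into dataset examples."""
    examples = []
    seen = set()
    for record in logs:
        # Skip logs without feedback and questions already in the dataset.
        if record.feedback is None or record.question in seen:
            continue
        seen.add(record.question)
        if record.feedback == 1:
            # Well-received answers can serve as reference outputs for benchmarking.
            examples.append({"input": record.question,
                             "reference": record.answer,
                             "tag": "benchmark"})
        else:
            # Poorly received answers need a corrected reference from a human,
            # so only the input is kept for now.
            examples.append({"input": record.question,
                             "reference": None,
                             "tag": "regression"})
    return examples


logs = [
    LogRecord("What is the refund window?", "30 days", feedback=1),
    LogRecord("Can I pay by invoice?", "I don't know", feedback=0),
    LogRecord("Do you ship to Canada?", "Yes"),
]
dataset = select_examples(logs)
print(dataset)
```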

Use LLM-as-a-Judge Evaluators

Surprisingly, LLMs are quite good at evaluating application results, particularly if they have a known output to reference. What this involves is prompting an LLM to score your application’s outputs. It’s recommended to use the most advanced model you can afford, because there’s a tradeoff between speed and reasoning capability for evaluations. Usually when you’re dealing with production workloads, latency is more of a factor, but for offline evaluations like this, you can accept higher latency in exchange for more reasoning capability.

Another issue can be noise. LLM evaluations have the tendency to be noisy because you’re using a non-deterministic system to grade a non-deterministic system. There are two ways to reduce this noise:

Pairwise Evaluations – These rely on the theory that LLMs are better at ranking outputs than giving absolute scores. Pairwise evaluations are already part of LangSmith today.
Human Auditor – You can iterate on the prompt you’re using for evaluation by having a human audit the results of your evaluator. By measuring the alignment between the LLM and the human judge, you can feed the human’s corrections back into the prompt for your LLM evaluator and get it to be more aligned with the human response over time.
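Both noise-reduction ideas above boil down to simple bookkeeping once a judge exists. The sketch below uses a deterministic word-overlap function as a hypothetical stand-in for an LLM judge, purely so the example runs without a model; it is not how LangSmith implements pairwise evaluation.

```python
def stub_pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical stand-in for an LLM judge; returns "a" or "b".

    Prefers the answer that shares more words with the question.
    """
    question_words = set(question.lower().split())

    def overlap(answer: str) -> int:
        return len(question_words & set(answer.lower().split()))

    return "a" if overlap(answer_a) >= overlap(answer_b) else "b"


def win_rate(judge, rows) -> float:
    """Share of comparisons won by the second version.

    rows: list of (question, answer_from_v1, answer_from_v2).
    """
    wins = sum(1 for q, a, b in rows if judge(q, a, b) == "b")
    return wins / len(rows)


def agreement(judge_labels, human_labels) -> float:
    """Fraction of cases where the LLM judge and the human auditor agree."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


rows = [
    ("how do I reset my password",
     "Contact support.",
     "To reset your password open settings."),
    ("what is the refund window",
     "The refund window is 30 days.",
     "Please check the website."),
]
judge_labels = [stub_pairwise_judge(q, a, b) for q, a, b in rows]
human_labels = ["b", "b"]  # hypothetical audit results

print("v2 win rate:", win_rate(stub_pairwise_judge, rows))
print("judge/human agreement:", agreement(judge_labels, human_labels))
```

Tracking the agreement number over successive versions of the judge prompt is one way to tell whether the evaluator is actually getting closer to the human auditor.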

There will always be a tradeoff here between performance, latency and cost, with observability as an important fourth piece. As you start to experiment, you’ll figure out how much you’re willing to pay for different levels of performance and latency.

What’s the right insertion point for a large company to consider evaluation? At the prompt level? At the output summary level? What are the key insertion points for evaluation to prioritize and where does LangSmith help out?

The most basic form of evaluation is to treat your LLM application as a black box and only evaluate its output. Your application might be doing a number of different things. It could be retrieving documents from a vector database. It could be rephrasing the question. It could be doing a generation based on the results that were returned from the vector database. At some point, you will want to evaluate all of those steps individually. But to start off, you should just treat your application as a black box and evaluate the final output.

LangSmith can help a lot with evaluating intermediate steps. For RAG applications and agent applications especially, it’s super important to look at your steps individually. For example, you might want to score your application on document relevance. This basically means that you’re evaluating the retrieved documents with respect to the query that came in from the user. You might also want to evaluate on answer hallucination, which means evaluating the final generation with respect to the documents that were retrieved. LangSmith allows you to take in the application trace and iterate over that trace to get the relevant steps and evaluate those steps in isolation.
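As a rough illustration of scoring steps in isolation, the sketch below walks a hand-written trace and computes two crude, word-overlap scores. The trace layout, step names, and scoring rules are assumptions made for this example; they do not reflect LangSmith's trace schema, and a real setup would more likely use an LLM judge for both scores.

```python
import re

STOPWORDS = {"what", "is", "the", "a", "of"}

# Hypothetical trace: one retrieval step followed by one generation step.
trace = {
    "question": "What is the refund window?",
    "steps": [
        {"name": "retrieve",
         "output": ["The refund window is 30 days from purchase.",
                    "Shipping takes five business days."]},
        {"name": "generate",
         "output": "The refund window is 30 days."},
    ],
}


def tokens(text: str) -> set:
    """Lowercased content words, with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def find_step(trace: dict, name: str) -> dict:
    """Return the first step in the trace with the given name."""
    return next(step for step in trace["steps"] if step["name"] == name)


def document_relevance(trace: dict) -> float:
    """Fraction of retrieved documents that share a content word with the question."""
    question_words = tokens(trace["question"])
    docs = find_step(trace, "retrieve")["output"]
    relevant = [doc for doc in docs if question_words & tokens(doc)]
    return len(relevant) / len(docs)


def groundedness(trace: dict) -> float:
    """Fraction of answer content words that appear in the retrieved documents."""
    docs = find_step(trace, "retrieve")["output"]
    context_words = set().union(*(tokens(doc) for doc in docs))
    answer_words = tokens(find_step(trace, "generate")["output"])
    return len(answer_words & context_words) / len(answer_words)


print("document relevance:", document_relevance(trace))
print("groundedness:", groundedness(trace))
```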

In RAG applications, we want to put in some evaluation or safety checks. Does the retrieved content (before you build your context) even come close to matching the query, so that we’re not injecting the possibility for the LLM to generate an off-target response? And then also for groundedness, does the response that comes back look like the context at all? Just how easy is it to plug in these kinds of checks in your application workflow?

It’s easy to start building evaluations, but over time you have to tune these evaluations to be more use-case specific for all the reasons I mentioned before. LLM-based evaluations can be noisy, and your evaluation scores might not be perfect right away. It’s easy to plug in evals, but it’s going to take some work to evolve them over time to be meaningful.

Going from Generic AI to Custom Enterprise AI

Generic AI is where you ask a question and it can give you an answer, so long as it doesn’t have to access specific data inside your organization. If you go into ChatGPT and ask it a question about your own data, it won’t be able to answer, and this is where things leave the realm of generic AI. Our portfolio company, MindsDB, makes it easier for enterprises to build systems that require not just the AI component, but also the company’s data.

When you’re building out LLM applications, your focus can’t just be on an agent, or an LLM, or a RAG pipeline: the whole system requires you to glue different pieces together to accomplish your needs. MindsDB is a system which orchestrates the workflows between data sources and AI systems, with integrations to relational databases, data warehouses, CRMs, etc. The key is being able to query all these different data sources through a single SQL dialect, which is Postgres. MindsDB then translates those queries into the way each source is natively queried. So, you simply write SQL, and MindsDB plans and executes those queries in the dialect that each system requires. Once you plug data sources in, it all looks like part of one single database. This capability comes in very handy when you’re building AI systems.
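The "one single database" idea can be illustrated by analogy using only Python's built-in `sqlite3` module. This is not MindsDB and does not use its API: SQLite's `ATTACH DATABASE` simply lets one connection join tables that live in separate databases, which mimics what a federated SQL layer looks like from the caller's side. The `crm` and `warehouse` names and all table contents are made up.

```python
import sqlite3

# One connection acts as the single entry point.
conn = sqlite3.connect(":memory:")

# Two independent in-memory databases stand in for separate data sources.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

conn.execute("CREATE TABLE crm.accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE warehouse.orders (account_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO crm.accounts VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO warehouse.orders VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 50.0)])

# A single SQL statement joins across both sources.
rows = conn.execute(
    """
    SELECT a.name, SUM(o.amount) AS total
    FROM crm.accounts AS a
    JOIN warehouse.orders AS o ON o.account_id = a.id
    GROUP BY a.name
    ORDER BY a.name
    """
).fetchall()
print(rows)
```

From the point of view of an LLM generating queries, only one dialect and one namespace scheme need to be learned, which is the property the paragraph above describes.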

For example, when you’re using agents and LLMs, your LLM only has to be really proficient at Postgres, while answering questions over all kinds of different data sources. This is really powerful because it reduces the scope of what the LLM must do well, and it makes the variability a little bit more manageable. Additionally, you may start with GPT-4 and move to other, more proficient LLMs in the future, but the contract between your AI system and the application that consumes it, be it Slack, Teams, or a custom application, will remain the same. So you can keep on developing these applications and increasing the complexity.

What’s special about the integration points, either on the AI side or the application side that would differentiate it from other data federation solutions like GraphQL?

MindsDB is a data federation solution. It makes it easy for people to write those abstractions, and once an abstraction is written, anyone can use it. MindsDB has hundreds of existing integrations to many different data sources, which it federates through a single dialect. With GraphQL, by contrast, you’re the one who has to build how each query gets resolved.

When someone writes an abstraction in MindsDB, from then on, anyone can consume this through standard SQL. And not only that, you can combine that with data you may have in other databases without having to write all the piping. So you can think of MindsDB’s orchestration as a plumbing system where the pipes and all the tooling that you need have pretty much already been prebuilt for you. You just have to know how you’re going to plug it in.

When you’re thinking of GraphQL, on the other hand, you have to build these abstractions on your own. Then, you’re going to be defining how you plan to expose this graph on the other side.

A lot of the major vendors are trying to be the center of the solar system. Salesforce wants to be in the middle. Microsoft wants to be in the middle. How would you think of MindsDB vs. something like Microsoft Fabric?

MindsDB partners with all of them. People don’t have their data in just one unified platform, so by definition, if you want to build a system that can combine several different data sources, you’re going to have to get out of Microsoft. MindsDB has thrived on building partnerships with each one of these players and writing integrations.

Vertical integration is great if you have one set of systems, but most companies can’t escape having different providers, because different providers provide different kinds of tooling. That diversity is where MindsDB thrives.

There’s also diversity in the way that AI is happening: you have OpenAI, Anthropic, open source models, etc. MindsDB tries to abstract those components, with the objective that customers can free themselves from the vertical integration that the hyperscalers and other providers are pushing for.

MindsDB wants you to be able to swap between different providers. So within one single interface, you can build your application with LangChain and the OpenAI connector, and keep using your application without changing any lines of code. You just change the endpoint and then you can start consuming Llama 3 models, or Mistral, or Gemini. MindsDB has built a pre-packaged system so that you can connect to any of these through a single API.
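The design idea behind swapping providers is that the application depends on one narrow interface and the provider is chosen by configuration. The sketch below shows that pattern with stub classes; `ChatModel`, the provider names, and `build_model` are hypothetical and do not correspond to MindsDB's or LangChain's actual classes.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only contract the application code depends on."""

    def complete(self, prompt: str) -> str: ...


class StubProviderA:
    """Hypothetical stand-in for one hosted model provider."""

    def complete(self, prompt: str) -> str:
        return f"[provider-a] {prompt}"


class StubProviderB:
    """Hypothetical stand-in for a different provider or an open source model."""

    def complete(self, prompt: str) -> str:
        return f"[provider-b] {prompt}"


# Switching providers means changing this configuration value, nothing else.
PROVIDERS = {"provider-a": StubProviderA, "provider-b": StubProviderB}


def build_model(name: str) -> ChatModel:
    return PROVIDERS[name]()


def answer_question(model: ChatModel, question: str) -> str:
    """Application logic; unchanged no matter which provider is configured."""
    return model.complete(f"Answer briefly: {question}")


for provider_name in PROVIDERS:
    print(answer_question(build_model(provider_name), "What is our refund window?"))
```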

Tell us some of the learnings you’re seeing. You’ve been involved in a number of both enterprise and early adoption deployments over the last couple of years. What are you learning? Where are people running into issues and what’s really accelerating things, I guess?

The first thing we learned is that software developers are the ones that end up shipping things that the customer actually sees. At some point, data scientists got locked into this kind of prototyping state where a lot of this stuff was just happening in Jupyter Notebooks, but never really making it into something that the customer would use. So organizations have started making things more actionable. Companies have already identified use cases that are not mission critical, and are making compelling arguments for introducing AI into those non-mission-critical systems. In those cases, the current process without AI is still no better than an AI that sometimes makes mistakes.

So people are starting to ask:

What’s my reference architecture?
How does my roadmap look in terms of incorporating these tools like LangChain? Which ones are compatible with the systems I have today?

If a team has already built something that can answer questions from your database, it’s important for that block to be abstracted for the rest of your developers so that they don’t have to think too much about how to make that happen. Instead, they can focus on other low-hanging fruit.

Large organizations have all kinds of legacy teams supporting infrastructure for years. How does that unlock really occur?

Legacy systems aren’t going anywhere. Those migrations will take a long time, and if the systems are working, there’s no reason to change them. So it’s very important to guarantee that the data inside those systems can still be unlocked. Extracting that data is fundamentally where MindsDB provides value. Existing IT and tech teams have spent a long time building a great deal of domain expertise on those systems, and they know the ins and outs of how they work.

How can you take those skills and turn them into actionable microsystems that allow them to incorporate new AI capabilities? So if you think of traditional ERPs or traditional CRMs or even legacy data warehouses, these systems already contain a great deal of information. So the question is: How do you guarantee that someone who works well with those systems can write a few lines of code to expose their capabilities to agents in a way that’s production-ready? Use your existing workforce and knowledge and just incorporate simple units of AI capabilities that unlock new possibilities for old systems.