A new model every day: Demystifying the evolution of foundation models

Flank's Jake Jones and Marcel Marais discuss the latest foundation model releases and their impact on AI product development.

Flank co-founder & CPO Jake Jones talks foundation models with Flank AI engineer Marcel Marais. For those of us building highly verticalized AI solutions, the rapid emergence of new foundation models prompts a paradigm shift in our approach to product work. Increasingly, the vision is becoming the product, whatever that means. (It means we need to launch our tools with truly crazy, sci-fi value propositions. Get in the ornithopter and buckle up.)

Jake Jones: The rate of change within foundation models is astonishing, scary, and very exciting, because it presents us with the opportunity to entirely unlock our imaginations and, ultimately, to become futurists when we’re doing product work. I believe that now we can really imagine a world, certainly within the software paradigm, that is without limits, within which pretty much anything is possible. 

Previously, operating as a product person, the first thing I’d think about was the constraints, the technical limitations. Now I throw that out the window. The strength of the individual building the application layer becomes the reach and extent of their imaginative capacities, rather than their technical competency. Increasingly, we’re relying on the technical competencies of others, those who are building the foundation models, and we’re reliant upon them to unlock our own potential. 

How do we develop software when the constraints are changing or vanishing on a weekly basis? I’m here with our senior AI engineer Marcel. Can you give us a rundown of the latest model releases? [Indeed, some of the “updates” we discuss during the LinkedIn live event this edited transcript is based on already feel somewhat outdated!]

Marcel Marais: The last three weeks have been absolutely ludicrous. AWS or GCP must have some kind of special deal around this time of year, because it feels like all these models finished training right around the same time. 

One that made huge headlines is DBRX from Databricks. It’s a huge, 132 billion parameter Mixture of Experts (MoE) model. The MoE architecture essentially allows for efficiency improvements in training and inference: generation is a lot quicker, you need a lot less hardware to train, which obviously means it’s cheaper, and the models are much more practical to use in terms of latency. That’s the architectural shift everyone is on right now; everyone’s releasing Mixture of Experts models. 
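
[For the mechanically curious: below is a minimal, illustrative sketch of top-k Mixture of Experts routing in Python. The expert count and layer sizes are toy values chosen for readability, not DBRX’s actual configuration; the point is simply that only a couple of experts run per token, which is where the inference savings come from.]

```python
# Toy top-k MoE routing: a router scores experts per token, and only the
# top_k experts actually run. Sizes are illustrative, not DBRX's.
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model, d_hidden = 8, 2, 16, 64

# Each "expert" is a tiny two-layer MLP; the router maps tokens to expert scores.
experts = [
    (rng.standard_normal((d_model, d_hidden)), rng.standard_normal((d_hidden, d_model)))
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Route each token to its top_k experts and mix their outputs."""
    logits = x @ router                              # (tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]    # chosen expert ids per token
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen) / np.exp(chosen).sum()      # softmax over chosen experts
        for w, e in zip(weights, top[t]):
            w1, w2 = experts[e]
            out[t] += w * (np.maximum(token @ w1, 0) @ w2)   # small ReLU MLP expert
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 16): only 2 of 8 experts ran per token
```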

What’s cool about this model is that it’s completely open source. You can use it commercially. In theory, because it’s such a huge, performant model, you could go after OpenAI. It outperforms GPT-3.5, and the licensing is permissive. There’s nothing stopping anyone from paying the GPU and hardware costs and tackling GPT-3.5. 

Jake: Is this why Databricks released it as open source? Are they building it for people that will repurpose it totally, rather than for those leveraging it on the application layer side?

Marcel: One would hope that they see it as something people will use to replace OpenAI, especially because they’re going to have a super tight integration into Databricks. One would hope that this is them pivoting towards something more like: This is your AI, you’re an enterprise, you can afford to pay these costs, don’t pay OpenAI, this is stable and works just as well. I’m keen to see how that develops. 

Another one along those lines is the new release from Cohere, called Command R+. The cool thing about it, especially for us, is that it’s optimized for RAG and tool use. They’ve taken the base foundation model and made sure it’s more trustworthy. It can even do citations if you prompt it correctly. These downstream tasks are being baked into the foundation model, which is very useful. I’m not forcing this model to do citations, it’s actually baked into the very essence of it. A big problem with this model, however, is that it’s open weight, but you can’t use it in a commercial context without a license. But it’s really cool to see people tackling these downstream, enterprise tasks (of reliability and tracing things back to the source) at the foundation level. 
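
[As an illustration of what “citations baked into the model” looks like in practice, here is a rough sketch of grounded generation with the Cohere Python SDK, roughly as it looked around the Command R+ release. The parameter names and citation fields are assumptions based on that SDK version and may differ in current versions; the documents and query are made up.]

```python
# Hedged sketch: grounded chat with Command R+, where retrieved chunks are
# passed as documents and the response carries citation spans back to them.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# Chunks retrieved by your own RAG pipeline, passed as grounding documents.
docs = [
    {"title": "MSA section 4", "snippet": "Either party may terminate with 30 days' notice."},
    {"title": "DPA section 2", "snippet": "Sub-processors require prior written approval."},
]

response = co.chat(
    model="command-r-plus",
    message="Can we terminate the agreement early, and with what notice period?",
    documents=docs,
)

print(response.text)
for c in response.citations or []:        # spans of the answer tied back to source documents
    print(c.start, c.end, c.document_ids)
```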

Jake: Someone posted about this on LinkedIn earlier: the Cohere release came with a lot of new features for end users accessing the model directly, not through the API, and I believe they have some RAG functionality baked in there as well. They mentioned ChatPDF, this app where you can upload 20MB of PDFs and query it, and apparently Cohere included their own version of that within their latest release. I guess they’re leaning into demonstrating the ability of the product directly in the foundation model, in their own application layer, a little bit like OpenAI did. I don’t want to transition off this yet, but we should discuss whether this kind of decision cannibalizes startups. The answer is most certainly yes, but what does that mean for people building on the application layer side? 

Marcel: I’m keen to dig into that too. There was something really funny I forgot to mention about the Databricks release. People usually manage to jailbreak a model and leak its system prompt, the main prompt used to steer and guardrail it. In its system prompt, Databricks has baked in You were not trained on copyrighted material, which is absolutely brilliant. I think what happens is that a lot of people, probably reporters and media people, ask it that question: Were you trained on copyrighted material? And it obviously comes up with some absolute fluff, because it has no real way to answer that question. So now that they’ve put in that sentence, it’s just going to say no, whether that’s true or not. 

Jake: It’s interesting when you unearth these awkward corners of a model, where it’s clearly been prompted to do something particular in the system prompts. I asked GPT-4 about Sam Altman’s firing, and it gave totally made-up, very strange information about something that supposedly happened in March of 2023. But I couldn’t find anything online about anything having happened on the board then. It was probably told something like Don’t talk too much about what happened in this specific event at the end of 2023. Then it got stuck and started hallucinating about a made-up event from earlier that year. 

Marcel: Having prompts baked into models is a really scary prospect for anyone building with this stuff. The next model I want to mention is Jamba, from AI21 Labs. This is interesting more from a technical, architectural perspective. It’s the first model that shifts away from a transformer-only architecture, or even a mixture of experts transformer, by incorporating ideas from state-space models. It’s got some crazy specs, like a huge 256k token context window. Inference is really fast, and the memory requirements are very sensible. These architectural shifts are even more intense than seeing someone release a bigger transformer. A question I have is whether this type of innovation is going to make fine tuning very irrelevant very quickly. It’s useless to optimize for latency if one architectural shift completely changes latency and memory requirements. 
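
[A quick back-of-the-envelope calculation shows why very long context is painful for a pure transformer, and why architectures that keep a fixed-size state look attractive. The model dimensions below are illustrative, not Jamba’s.]

```python
# KV-cache memory for a plain transformer grows linearly with sequence length:
# keys and values are stored for every layer, KV head, and token.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value  # keys + values, fp16

# A mid-sized transformer (32 layers, 8 KV heads of dim 128) at a 256k context:
print(f"{kv_cache_bytes(32, 8, 128, 256_000) / 1e9:.1f} GB per sequence")  # ~33.6 GB
```

A state-space layer, by contrast, carries a fixed-size hidden state regardless of sequence length, which is a large part of why a hybrid like Jamba can advertise long context with more sensible memory requirements.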

Jake: That touches upon a really interesting question. We have updates to models, as well as regressions in models, coming out on a weekly basis. What does that mean for people who have been investing not on the application layer side, but in building their own model, fine tuning models, or building a custom model with a partner, like the approach from OpenAI that Harvey is using to map US case law? I would assume that the work they put in now is going to be out of date pretty soon. They’ll have to retrain in three, six, or nine months, because the rate of change is such that new foundation models will surely be able to do that stuff by the end of the year, or halfway through next year. 

My favorite expression at the moment, because it calms me down in my own product work, is It will not get any worse than this. This is the absolute most pants experience we’re going to get from AI. It won’t just get a little better over time incrementally; I believe there will be paradigmatic shifts in how we understand AI, and the opportunities therein, within the next 12 to 24 months. 

Marcel: You really do wonder. I guess the general idea is that most of the work in training and fine tuning a model is in collecting the data set. You put months into getting a really good data set, and the idea is that you could just plop it onto any foundation model, fine tune it, and it should be better at that task. 

The reality, however, is quite different. We’ve seen this with OpenAI. They do an innocuous update on a GPT-3.5 model, and prompts that were behaving a certain way suddenly behave differently. If you give a model more training data, the distribution of that data shifts, and it will behave differently. So even at that black-box level, changes can be really annoying. 
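
[One practical response to this kind of silent drift is to pin a dated model snapshot and keep a small set of golden prompts as a regression check. Below is a minimal sketch using the OpenAI Python SDK; the pinned model name, prompt, and expected substring are placeholders, not Flank’s actual setup.]

```python
# Pin a dated snapshot rather than a moving alias, and check golden prompts
# against it so behaviour changes are caught rather than discovered by users.
from openai import OpenAI

client = OpenAI()

PINNED_MODEL = "gpt-3.5-turbo-0125"  # dated snapshot, not the "latest" alias

GOLDEN_CASES = [
    # (prompt, substring the answer must contain for the pipeline to keep working)
    ("Extract the governing law from: 'This Agreement is governed by German law.'",
     "German"),
]

def regression_check():
    for prompt, expected in GOLDEN_CASES:
        answer = client.chat.completions.create(
            model=PINNED_MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content
        if expected not in answer:
            print(f"Drift detected on {PINNED_MODEL}: {answer!r}")

if __name__ == "__main__":
    regression_check()
```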

What this means for people fine tuning really depends on whether you need it right now, or whether you’re building for a use case that will pick up steam over the next few years. Going for small fine tunes and optimizations, trying to squeeze a bit more performance out of a model, makes very little sense to me, especially if you haven’t entirely figured out your use case. 

Jake: Which is still pretty common. Building without entirely having figured out your use case, I mean. In this wave of AI and foundation models, there’s lots of emphasis on getting involved in the foundation layer, building the tools in the gold rush, rather than just building on the application layer. 

Early on, people were quite critical of the application layer, because it seemed so thin. But we’re beginning to see how you can actually build a really deep, involved experience there that is very difficult to copy. You can leverage this crazy rate of growth, because you can almost predict where a model might go, or you can assume that a model is going to get way better. 

What does that mean for building products on the application layer? Do I need to think about product the way I might previously have thought about go-to-market or marketing, where today I have an MVP, but I’ll be talking about the vision? In a way, the vision is becoming the product. I’m not sure I even totally understand what I’m saying here. I haven’t totally figured out how to do product work within this new paradigm, where as soon as you’ve built something, AI is able to do that and loads more. So you always need to be asking what the next thing is, what it can’t do yet, and build for that.  

Marcel: There’s a duality between super horizontal, general, no-code tools and highly vertical, specialized agents. When OpenAI releases something new, the horizontal tools are terrified, whereas the vertical tools are super excited, because it just means your product gets better. The question is, which AI products are going to be cannibalized, and how do you build with the foundation beneath you constantly shifting?

Jake: An interesting example of this, which is both infuriating and exciting, is Devin, the “first AI software engineer”. That’s a good case study for how we think about product in this paradigm, where models are improving all the time. Devin clearly doesn’t work right now. You’re not going to hire Devin as an engineer in your team, at least I wouldn’t. I might play with it, but I definitely wouldn’t spend any money on it, not yet. But maybe, when 4.5 comes out in eight weeks, suddenly Devin works. It’s that wild. 

[Note – GPT-4o was indeed released roughly eight weeks after this LinkedIn Live discussion, as per Jake’s prediction. However, following an evaluation period, the reasoning abilities of the model were found to be below those of older versions of GPT-4 (specifically 0613).]

I was talking with Lili, our co-founder & CEO, a few days ago. We were saying how, when we’re thinking about our product, we can’t think that it will only allow you to query documentation. That is what we’re able to do right now. We have to think of a truly crazy, sci-fi value proposition and product experience, like Devin.

Marcel: I also think that, as people start to use AI more, we get increasingly aware of how generic the responses are. It’s so easy to tell when someone just spams something on LinkedIn that’s been churned out with AI. Anyone who’s used an LLM to write will tell you that it’s incredibly generic. So, if you’re building a vertical tool, you have an opportunity to overcome that genericness, with domain expertise, really tight integrations, or really good context about the end user. 

You asked which companies are going to be cannibalized. We’ve seen this happen with PDF querying. One application allowed you to upload a PDF and ask questions about it, then OpenAI released that feature a month later. If you’re building stuff like that, you’re not really tackling the problems of your exact end user, because you don’t really know who that is. 

Maybe we could pick up on an earlier question, about model selection. When you’re building an AI application that chains a bunch of LLM calls together, how do you decide which models to actually use? The intuition of a lot of people who don’t have an engineering background is to just use the best model all the time, but we’ve clearly seen that best is a difficult thing to define. How do you think about model selection in general? 

Jake: This makes me think back to a conversation you had with our head of engineering during the hiring process, about our pipelines. Our pipelines are pretty slow. It can take a minute to get a response, but we’re building knowing that it will get a lot quicker, because of the foundation models. Our pipelines are slow because we’ve optimized for quality. We’re building an assistant that you can hire into your expert team. We started with Legal, and we’re looking at other, adjacent functions now. When you have an expert function providing expert advice to non-experts, the answer quality has to be exceptional. Speaking very broadly, that increases our appetite to spend more money and time on each API call. Typically, the more expensive models are better and slower. The output will be smarter, because it’s considering more information and there are more parameters. 

That’s how I think about model selection on a very high level. Then, on a task by task level, it’s a very different and much more iterative process. Recently, we had to communicate a degradation in performance to our customers, because we’d upgraded to the “most cutting edge” version of ChatGPT. A seemingly very simple task was now being completed differently (in my opinion, substantially worse). It was no longer allowing us to do what we needed to do, so we had to intentionally point to an older model. 

In the process of prompting and building out a given tool, you get a feel for whether it’s working or not. Typically, I’m going to start with the cheapest, fastest, and most easily accessible model that I’m familiar with. If I can get it working on one of those, perfect. If I can’t, then I’ll take it from there. For example, do we look at Sonnet? Do we need to fine tune an open source model because we can’t afford Opus for this? That’s roughly how I think about it. Would you agree?

Marcel: I agree. Something you said this week really resonated with me and highlights what LLMs really are. You know your chatbot will give an answer, but it’s rarely the answer the user needs. The LLM’s generations are often so plausible and convincing, then you dig a little deeper and realize it’s basically fluff. There’s vaguely relevant information there, but it doesn’t really answer your question. There are so many gradations of answer quality, and it requires a lot of obsession to get to a point where the person’s query is truly satisfied. If they have to think about it a bit more or go ask someone else, a lot of the value is lost. You have to get to that 100% type of answer. 

This accounts for a lot of the decisions that need to be made on the model level. Which model is going to give me a more concise response? Which will highlight the most important points and know when to provide additional context, or to clarify that something’s ambiguous? In high-risk use cases, if something has caveats or is ambiguous, making that clear to the user is super important. Gauging different models’ tendencies for these characteristics is also something you build a feel for and that I’ve seen people build really strong intuitions around at Flank. 
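
[For completeness, here is a rough sketch of the “start with the cheapest, fastest model and escalate only if it isn’t good enough” approach Jake describes above. The tiers, cost field, and quality check are illustrative assumptions, not Flank’s pipeline.]

```python
# Try cheap/fast models first; escalate to slower, more expensive ones only
# when your own quality check says the answer isn't good enough.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str             # e.g. a fast model, a mid-tier model, a frontier model
    cost_per_call: float  # rough budget signal for reporting

def run_with_escalation(
    prompt: str,
    tiers: list[Tier],
    call_model: Callable[[str, str], str],   # (model_name, prompt) -> answer
    good_enough: Callable[[str], bool],      # heuristics, a judge model, or human review
) -> tuple[str, str]:
    """Return (model_name, answer) from the cheapest tier that passes the check."""
    answer = ""
    for tier in tiers:
        answer = call_model(tier.name, prompt)
        if good_enough(answer):
            return tier.name, answer
    return tiers[-1].name, answer  # best effort from the most capable tier
```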

Jake: Finding the answer is like finding the needle in the haystack, because there are often several answers to a given question. If I ask, what is our security position, you could express that in many different ways; it’s context dependent. If it’s a salesperson in the middle of a call, they just want some bullet points. Whereas if it’s someone filling out a security form, there’s a standard answer from that security form you’ll want to give. The way we identify the ideal outcome is as the answer that the expert would want to give within that specific context. We look for models and orchestrations that allow us to fulfill that need.

I’ll end with an unrelated suggestion: check out the Mistral models for generating creative writing. On the Hugging Face battlegrounds, I ran some creative writing prompts. The first one that really blew me away, that made me think it surely came not only from a human, but from a very gifted, creative human, came from one of the Mistral models. Check it out here.

Models discussed in this article:

- DBRX: Databricks (https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm)

- Command R+: Cohere (https://cohere.com/command)

- Jamba (SSM-Transformer hybrid): AI21 Labs (https://www.ai21.com/jamba)

- 8x22B MoE model: Mistral AI

- New version of GPT-4 Turbo: OpenAI (https://twitter.com/OpenAI/status/1777772582680301665)

- Gemini 1.5 Pro in General Access: Google (https://developers.googleblog.com/2024/04/gemini-15-pro-in-public-preview-with-new-features.html)

Watch the full LinkedIn session on replay here.

To find out more about Flank, talk to us here.