OpenAI GPT Function Calling - Badly Named, Madly Powerful
14 Jul 2023 by Luke Puplett - Founder
Bye bye REST. Hello Chat API
I’m taking a pot-shot at their naming but I don’t know what I’d have called it. You see, it doesn’t actually call any functions. I shall explain.
It’s pretty simple but really powerful for reasons I’ll come to. Let me just quickly explain what it is and does.
- OpenAI have trained GPT to produce a very specific JSON structure when you send it a free-form chat message along with a different, specific JSON structure.
- To be clear, it takes JSON and produces JSON.
- The JSON it takes, which your app sends to GPT, defines a set of actions that can be taken and their parameters, rather like an interface definition language (IDL).
- It also takes a free-form message, like a chat phrase.
- GPT then uses the chat message plus the available actions (functions), picks the most appropriate action, and uses the chat message to populate arguments for the parameters.
- It returns specific JSON which describes the action taken, i.e. “the function it has called”, and parameter-value pairs for the arguments.
- Your app then uses this JSON response to do whatever function calling it needs to do, whatever that may mean to your app.
The response from GPT can also contain some chatty message which is designed to be sent back to the user of your app.
Let’s see it then
Ta dah!
{
  "model": "gpt-3.5-turbo-0613",
  "messages": [
    {
      "role": "user",
      "content": "Add Joseph Biden, 1600 Pennsylvania Avenue and remind me to wish him happy birthday"
    }
  ],
  "functions": [
    {
      "name": "create_crm_contact",
      "description": "Creates a new contact in the CRM and allows an optional todo item task to be added.",
      "parameters": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "The name of the new contact."
          },
          "streetAddress": {
            "type": "string",
            "description": "The street address of the new contact."
          },
          "city": {
            "type": "string",
            "description": "The city of their address."
          },
          "region": {
            "type": "string",
            "description": "The region or county of their address."
          },
          "postalCode": {
            "type": "string",
            "description": "The post code of their address."
          },
          "country": {
            "type": "string",
            "enum": [
              "UK",
              "US"
            ]
          },
          "todoItem": {
            "type": "string",
            "description": "A description of an outstanding task or action to take, such as a follow-up."
          }
        },
        "required": [
          "name"
        ]
      }
    }
  ]
}
Gorgeous, isn't it? This is what you send to GPT advertising the available functions in your app.
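If you're wiring this up in Python, a minimal sketch using the openai package as it stood at the time of writing (the pre-1.0 ChatCompletion interface) might look like the following. The schema is trimmed to a single property here to keep it short; in practice you'd send the full functions array above.

import openai

# The pre-1.0 SDK picks up OPENAI_API_KEY from the environment.
functions = [
    {
        "name": "create_crm_contact",
        "description": "Creates a new contact in the CRM and allows an optional todo item task to be added.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "The name of the new contact."}
                # ...streetAddress, city, region, postalCode, country, todoItem as in the JSON above...
            },
            "required": ["name"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        {
            "role": "user",
            "content": "Add Joseph Biden, 1600 Pennsylvania Avenue and remind me to wish him happy birthday",
        }
    ],
    functions=functions,
    function_call="auto",  # let the model decide whether one of the functions fits
)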
And this is what you get sent back
{ "id": "chatcmpl-123", "etc": "other JSON properties cut for brevity", "choices": [ { "index": 0, "message": { "role": "assistant", "content": null, "function_call": { "name": "create_crm_contact", "arguments": "{ \"name\": \"Jospeh Biden\"}" "etc": "etc" } }, "finish_reason": "function_call" } ] }
Breathtaking.
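And to labour the naming gripe: the “calling” is entirely your app's job. Carrying on from the Python sketch above, the dispatch might be no more than this, where create_crm_contact stands in for whatever your app actually does:

import json

def create_crm_contact(**fields):
    # Your own application code; GPT never runs this, it only names it.
    print("Creating contact:", fields)

message = response["choices"][0]["message"]

if message.get("function_call"):
    call = message["function_call"]
    args = json.loads(call["arguments"])  # arguments arrives as a JSON string, not an object

    if call["name"] == "create_crm_contact":
        create_crm_contact(**args)
else:
    # No function was chosen; the content is an ordinary chat reply to show the user.
    print(message["content"])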
Hypermedia
If you’re familiar with HATEOAS web APIs then you’ll find this next bit simple to grok. However, the vast majority of techies in the world think they know what a REST API is, but they do not, because they’ve never seen one and no one has ever built one.
Except every website.
The term REST came from a guy named Roy Fielding. The paper he wrote was obscure and he wasn’t the kind of personality to really go after explaining it better. It was open to misinterpretation and so over time REST merely came to mean JSON-sent-to-and-fro-over-HTTP-with-user-friendly-URLs.
What Roy was describing was just the web, really, in that websites pass HTML over HTTP and a tool called a browser knows how to deal with HTML. The “page” or resource contained content but also the options for what can be done next, i.e. links and forms. The browser, because it knows HTML, knows how to present the links and forms, and it knows how to create the correct HTTP request when a form control is used (submitted).
The crucial part is the content-type, HTML, and this is where the focus of attention and specification and harmonisation takes place; so long as we all know HTML and the rules of processing it, we can hold a conversation (or order a pizza).
Real REST is simply a restating of this in terms of its utility beyond user-app interactions and into machine-machine interactions. Critical to this idea is that the text of the response must contain the possible next actions, somehow. This might be links to the root of the app, controls for deleting something, or updating something or ordering pizza. So basically a website.
I’d argue that if popular programming languages had included a way to turn HTML into data objects, instead of JSON into objects, i.e. deserialisation, then we might well all be doing proper REST. The problem was that JSON lacked links and forms and a standard way to deal with them. And we got confused by what a content-type is and never really agreed a way to represent affordances in JSON.
Affordances for robots
Anyway, my point is, this new function calling "specification" looks rather like controls, or affordances, in JSON.
I want to explore the ramifications of building with hypermedia-like controls and Large Language Models because it raises some really interesting questions. For example, will your app need to keep track of what’s going on and where the user might be in a journey, assess the available things that could be done, and present those options as functions to GPT?
If we consider a URL and how it relates to the web application, aside from the protocol, it identifies the server code to run and optionally the data to run against. That’s all it does. It's why it's called a locator: it's pointing to code on a server and identifying any stuff the code should bring into context.
But given that there is no URL or state being batted back and forth in the GPT function calling request-response model, your app will need to maintain state before making its call to GPT and then match it back up on the response. In most programming models, especially those with async-await, this is trivial, but worth pointing out.
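As a sketch of what I mean, using the async variant of the same pre-1.0 openai package; session, handlers and dispatching by name are all invented here, just stand-ins for whatever your app keeps in scope:

import json
import openai

async def handle_user_message(session: dict, user_text: str) -> None:
    # session is our own state: the tenant, the private record IDs in play, and the
    # functions we chose to advertise. None of it goes to GPT unless we put it there.
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": user_text}],
        functions=session["functions"],
    )

    message = response["choices"][0]["message"]
    if message.get("function_call"):
        name = message["function_call"]["name"]
        args = json.loads(message["function_call"]["arguments"])
        # The await has kept our state in scope, so matching the chosen action
        # back up with the session is just a matter of using the variable.
        session["handlers"][name](session, **args)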
OpenAI's documentation is scant and there might indeed be a state object that can be passed back and forth.
The examples I’ve seen have all been weather-based, which is sort of object-less. Sure, the location of Sunnyvale is a database record, but location names are already known by the client, in the brain of the user, whereas the private database record ID of a customer is not. So how do you get it into context?
The Future of APIs is English, or Spanish, or AI Lingua Franca
Machine-machine interaction could quite rapidly move to using English as its main form of communication, and perhaps the AIs could develop their own efficient language.
As we’ve discussed, an API is just a text-based way to get information from another system and potentially make it do things. The text is important because it’s programmers who have to write code to interact with the other system, and having readable requests and responses is a big deal, which is partly why XML didn’t last long.
Being able to explore the API and discover what it can do is another important feature, and perhaps the lack of it contributed to Roy’s vision of REST not leading anywhere. Consider how websites don’t normally come with a user manual because they’re self-documenting, and if they’re not able to explain themselves, then that’s a UX bug.
APIs aren’t so visual; there’s no browser rendering buttons and forms, so while a hypermedia API using JSON might be self-documenting, a programmer would still need to “move around” the API to see what’s in the JSON to learn how to integrate with it. So it’d either need documentation anyway, or be done in HTML so developers can easily explore it, and then they’d need that HTML-to-objects deserialiser I was talking about.
So, if I’m designing a new text API today, should I build a JSON API and its attendant documentation portal, or should I just accept English and run it all through an LLM? The LLM option kills two birds because you end up also building a chat app, for free.
Chat API
Such a Chat API would need documentation, because as we’ve seen, it’s hard or time-consuming to explore text-based APIs. The problem is very similar to not knowing what can be done with a Google Assistant or Alexa, consigning them to smart kitchen timers. But it could have a root resource, which would be “Hey, what’s up?” And the reply would announce the things that can be done.
Programmers would write code like “Delete customer {cust_id}” and send that to the API. They can identify data to work on from their own store of the foreign reference.
Human end-users using the same API can do the same, gaining knowledge of the customer ID from information presented in a previous question like, “List all my customers in Brighton”.
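An integration against such a Chat API might be little more than the following; the endpoint, payload shape and IDs are all invented, because no such API exists yet:

import requests

# Hypothetical Chat API: the URL and payload shape are made up for illustration.
def chat_api(instruction: str, session_id: str) -> str:
    reply = requests.post(
        "https://api.example.com/chat",
        json={"session": session_id, "message": instruction},
        timeout=30,
    )
    reply.raise_for_status()
    return reply.json()["message"]

# The caller is programmed code: the customer ID comes from our own store of the foreign reference.
cust_id = "cus_0042"
print(chat_api(f"Delete customer {cust_id}", session_id="demo-session"))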
And there are other perplexing questions.
Should all possible actions or code endpoints be announced, or should they be announced according to the context?
Answering this depends on the intended audience, programmed code or end-user. If we’re expecting end users to use the Chat API, then we should behave as a guide and not present options that aren’t possible.
If the caller is code (issuing English instructions), then it’s less likely to be dynamically assessing the response anyway, but will have been programmed to carry out a series of steps to get something done in the remote application.
This is maybe another reason hypermedia APIs are superfluous. Hypermedia allows URLs to change without breaking the client because the client dynamically reads the URL to operate against at runtime when forming its request. But it’s unlikely that the client code assesses the various options of what actions can and cannot be done next. Its operations are hard-coded and an action will either succeed or fail; the code won’t feel bad and not come back if it sees an error.
However, if the caller is an AI issuing English instructions and consuming English (or English plus structured data) responses, then it might be better to guide it as if it were human. In this model, we can imagine the programmer has given an AI a general task to complete on its own, rather than coding instructions in English. Rather than “delete customer {cust_id}”, it’s been given “Clean up old data for GDPR”.
It becomes quite mind-boggling to think about, in the abstract.
Should I assume the client has use of an LLM?
Again, that depends on whether the Chat API is being used by programmed code or by an end-user, or is designed to be used by an end-user alone. If it’s programmed code, then perhaps the API can present the actions in a function calling specification and a local LLM can comprehend and choose an appropriate action and format the text instruction to issue to the API. But this effectively does the remote API’s job for it.
If it’s an end-user, then they will be presented options in text, along with any other information, perhaps even a chart or photograph, like “Would you like to delete it or archive it?”
“Delete it”
But what communicates “it” in a stateless dialog? In a traditional API, we’d either need to accept an object ID, or rely on some state coming along for the ride to help the server code. In REST, the URL often contains critical data. It feels like there needs to be a bag of hints, identifiers passed around somehow. Cookies? Or should we always pass around the chat history as a kind of shared headspace where each system can leave itself notes, like the ID of “the thing”? Would that be accurate enough?
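With today’s OpenAI API, the nearest thing to that shared headspace is the messages array itself: your app can append its own function result, private IDs included, and hope the model resolves “it” from there on the next turn. A sketch, carrying on the earlier Python, where find_customer and delete_customer are invented functions and the IDs are example data:

# Continuing a conversation: our app answered the model's find_customer call and
# left itself a note (the private customer_id) in the transcript.
messages.append({
    "role": "assistant",
    "content": None,
    "function_call": {"name": "find_customer", "arguments": "{\"city\": \"Brighton\"}"},
})
messages.append({
    "role": "function",
    "name": "find_customer",
    "content": "{\"customer_id\": \"cus_0042\", \"name\": \"Acme Ltd\"}",
})
messages.append({"role": "user", "content": "Delete it"})

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=messages,
    functions=functions,  # now advertising a delete_customer(customer_id) function too
)
# If it works as hoped, the model picks delete_customer with customer_id "cus_0042".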
“Erm... neither. What was I doing? Going back to that other thing.”
Changing the subject
Are the contents of the chat message, plus the knowledge that you can always reset and return to the beginning, all that’s needed to make accurate progress? Or does the client need to maintain a kind of location history somehow in order to go “back”? Or should we assume we can only move forward?
Again, it depends on the audience. Human end-users are likely to hop around while hard-coded integrations are likely to reset and start doing something else. Drop state. Start over.
Perhaps we need special chat browsers that keep clear, visually, just what it is we’re currently talking about.
I don’t have the answers, but they’re fun to think about.
That's lovely and everything but what is Zipwire?
Zipwire Collect simplifies document collection for a variety of needs, including KYC, KYB, and AML compliance, plus RTW and RTR. It's versatile, serving recruiters, agencies, people ops, landlords, letting agencies, accountants, solicitors, and anyone needing to efficiently gather, verify, and retain documented evidence and ID.
Zipwire Approve is tailored for recruiters, agencies, and people ops. It manages contractors' timesheets and ensures everyone gets paid. With features like WhatsApp time tracking, approval workflows, data warehousing and reporting, it cuts paperwork, not corners.
For contractors & temps, Zipwire Approve handles time journalling via WhatsApp, and techies can even use the command line. It pings your boss for approval, reducing friction and speeding up payday. Imagine just speaking what you worked on into your phone or car, and a few days later, money arrives. We've done the first part and now we're working on instant pay.
Both solutions aim to streamline workflows and ensure compliance, making work life easier for all parties involved. It's free for small teams, and you pay only for what you use.