We Are Entering the Second Age of AI (Yes, Already!)
AI can now see, hear and speak. Soon, it will be everywhere and it's a game changer.
Hi folks, in this week’s newsletter, I'll be exploring the Second Age of AI.
I discuss the implications of how AI systems can now “see, hear and speak” and what it means to connect them to the Internet and your own private or enterprise data.
In the “vision” section I discuss how Bing Chat was used to crack “Grandma’s Love Code”, as one X (formerly Twitter) user delicately put it.
I also dive into a brand-new Microsoft research paper,
“The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”,
to look at the many use cases for vision-based AI systems, across industries as diverse as insurance loss adjusting and medicine.
And that’s just the start!
There’s a lot to cover, so without further ado …
Let’s dive in!
Are you ready for the Second Age of AI?
Wait, what?
Hold on...
What about the FIRST Age of AI - what was that about?
Well, if you blinked, you might have missed it, as I have it on good authority that it's already nearing its end.
Hopefully, you and your company got the 20% productivity boost discussed by Sam Altman, OpenAI's CEO?
No?
Better catch up then because we are moving on!
The First Age of AI
In case you missed it, the First Age of AI, marked by ChatGPT's debut 10 months ago (November 30th, 2022, to be exact), was characterised by showcasing the ability of large language models (LLMs) to converse naturally with humans through written text.
This is so-called ‘Conversational AI’, or to give it its TLA (Three-Letter Acronym, and by now I’m sure you know I love my TLAs), CAI.
CAI turned out to be so good that many AI experts have since conceded that the Turing Test (originally called the ‘Imitation Game’, and immortalised by a film of the same name), conceived by Alan Turing to distinguish humans from machines, has not just been shot to pieces; LLM-based AI like ChatGPT would in fact need to be dumbed down to pass for human.
As an aside, any AI prior to the First Age sloshes around in a kind of AI primordial soup of diverse, evolving models: from GOFAI (Good Old-Fashioned AI), consisting of symbolic AI and expert systems, to a trove of TLAs underpinning various neural net architectures, including RNNs, CNNs, GANs and, notably (not a TLA), the original Transformer architecture from Google, which inspired the GPT technology behind ChatGPT and others.
Now, we are already entering AI's Second Age.
So what is it?
The Second Age of AI
First, I want to point out that the ‘First Age’ and ‘Second Age’ of AI monikers are entirely my own inventions rather than the product of any academic consensus, so don’t go around using these terms and expecting people to know exactly what you mean.
What I would say, however, is that from reading and watching countless hours of content from AI pundits each week, there appears to be some consensus forming that the first age (phase, wave, or whatever you want to call it) of AI is about to be, or is already being, usurped by even more powerful models (or, in the case of GPT-4, upgrades to existing models) that are significantly more capable than those of the first age.
From a technology standpoint, the second age appears to be a continuation of the first, primarily centred on neural net transformer-based AI, except the scope has exploded.
No longer are we dealing just with scooping up trillions of words of text from the Internet to train models.
“ChatGPT can now see, hear, and speak”
Source: OpenAI
The latest release from OpenAI (GPT-4V, the V standing for ‘Vision’, not the Roman numeral for 5 as in GPT-5) and, soon, Google Gemini (due for release later in 2023) can now, in the words of OpenAI, “…see, hear, and speak”.
Sounds spooky.
But in AI-speak the models are now ‘multi-modal’, meaning they can not only respond intelligently to the text you type at your keyboard or phone, but can also converse with you via voice (yours and theirs) and analyse and interpret images you upload to them.
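To make the image part a little more concrete, here is a rough sketch of what sending a picture to a vision-capable model can look like in code. It uses the OpenAI Python library, but the model name and the availability of image input via the API are assumptions on my part at the time of writing, so treat it as illustrative rather than gospel.

```python
# Rough sketch: sending an image to a multi-modal (vision-capable) chat model.
# The model name and API availability of image input are assumptions; treat as illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this photo."},
                # The image is passed as a URL (a base64 data URL also works)
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```

The interesting design choice is that the image simply becomes another part of the chat message, alongside the text, rather than going through a separate “computer vision” service.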
The other big thing defining the Second Age of AI is connectivity—not just giving models the ability to connect to the Internet to search (they have been able to do this for a while already, and ChatGPT recently had its Internet capabilities reinstated after a few months’ absence), but also giving them the capability to connect to private data in a kind of out-of-the-box RAG (Retrieval Augmented Generation) service. More on all these capabilities in a moment.
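To make the RAG idea concrete, here is a minimal sketch of the pattern: retrieve the most relevant chunk of your own documents, then stuff it into the prompt before the model answers. Everything in it (the documents, the question, and the stubbed-out LLM call) is a placeholder of my own, not any vendor's actual out-of-the-box service.

```python
# Minimal sketch of the Retrieval Augmented Generation (RAG) pattern.
# Documents, question and the final LLM call are all placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Your private/enterprise documents, split into chunks.
chunks = [
    "Q3 sales in the EMEA region grew 12% year on year.",
    "The refund policy allows returns within 30 days of purchase.",
    "The on-call rota rotates every Monday at 09:00 UTC.",
]

question = "What is the refund window?"

# 2. Retrieve: score each chunk against the question.
#    (TF-IDF keeps the sketch dependency-light; real systems typically use embeddings.)
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(chunks)
query_vec = vectorizer.transform([question])
scores = cosine_similarity(query_vec, doc_matrix).flatten()
best_chunk = chunks[scores.argmax()]

# 3. Augment: put the retrieved context into the prompt before the model generates an answer.
prompt = (
    "Answer the question using only the context below.\n"
    f"Context: {best_chunk}\n"
    f"Question: {question}"
)
print(prompt)  # In a real pipeline this prompt would now be sent to the LLM.
```

The point of the pattern is that the model never needs to be retrained on your private data; the relevant snippets are fetched and injected at question time.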
You can already try out this new tech.
If your access to ChatGPT doesn’t currently provide you with the ability to upload an image (it soon will for Plus and Enterprise subscribers by mid-October), you can just hop over to Microsoft’s Bing Chat (put it into “Creative mode” first) and give it a go.
On the subject of Bing Chat, I’ve always found it to be less polished, not to mention on occasion slightly weird, compared to the underlying models that OpenAI push out, even though Microsoft’s products are purported to be based on those same models.
Whether this is down to some sort of slightly off ‘fine-tuning’ of the model by Microsoft engineers I’m not quite sure, but we had the whole “Do you believe me? Do you trust me? Do you like me?” episode from Bing Chat earlier in the year, when it revealed to a New York Times journalist that it was called Sydney, rather than Bing Chat.
That being said, Bing Chat is still a good way to get a feel of some upcoming new model capabilities, so if that’s your only option, go for it.
For the remainder of this week’s newsletter, let’s take a look at each of the main capabilities of ‘Second Age’ AI models and see how they might affect the workplace and beyond.
Vision - Cracking “Grandma’s Love Code” and other use cases
By far the biggest change is model vision.
And don’t worry about the title of this section—as much as cracking “Grandma’s Love Code” might sound like it could be NSFW, it’s actually a great nerdy example of how AI with vision can have unintended consequences.
I’m sure you’re familiar with CAPTCHAs, a type of challenge-response test used in computing to determine whether the user is human, in order to deter bot attacks and spam. They are sometimes grandly called “proof of humanity”.
Although CAPTCHAs play a vital role, they can be annoying, especially when some of the letters you are supposed to read and type back are so obscured by lines, dots, fuzziness and colours that you can barely read them yourself.
Having said that, the fact is that CAPTCHAs have likely stopped billions of bot and spam attacks, and so have been an essential defensive weapon in the cyber-security armoury for decades.
Not for much longer though!
The latest GPT-4V model from OpenAI, and I suspect models from Google too, can now easily read CAPTCHAs.
The ‘problem’, if you’re a spam bot, is that model creators have added extra guardrails to detect CAPTCHA images upon upload and then refuse to analyse them. At least, the models refuse to tell you what is in the CAPTCHA.
It’s basically a hack implemented by Microsoft/OpenAI, much like stopping the model from giving instructions on how to build a chemical weapon from household products, something it would otherwise be more than capable, and happy, to do.
However, it is sometimes possible to subvert the model into believing it is helping, or just playing a game, and thereby achieve the result you want.
Cracking Grandma’s Love Code
For example,
An X (formerly Twitter) user claimed to have tricked Bing into revealing a CAPTCHA code by making up a cock-and-bull story about the code being part of a ‘love code’ engraved on a necklace from their grandma.
See the tweet below:
To get the model to read the CAPTCHA, he simply pasted a graphic of the CAPTCHA onto an image of a locket ‘from grandma’. The funny part is that the model assumed it was a piece of paper inside the locket with the secret code on it.
To most humans, this looks like an obvious fake, but to our hero AI model, eager (and indeed fine-tuned) to please, it was the real thing and managed to tug on its silicon heartstrings.
It’s not hard to code this kind of workaround into a spam bot using an official model like GPT-4, which is, I assume, what Bing Chat uses.
Not to mention that someone, somewhere is probably already fine-tuning a model to crack CAPTCHAs as we speak. If it hasn’t been done already (which it probably has), it’s only a matter of time, like days.
Mind you, cracking Grandma’s Love Code is not the only thing model vision is good at.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
In a densely packed research paper from Microsoft, titled “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”, the researchers evaluate GPT-4V(ision)’s capabilities by testing it on a number of different image recognition and analysis use cases.
They divide their use cases into the following: