This is the first in a series on language models and covers the advances leading up to the release of ChatGPT.
0) Prologue: The Turing test
In October 1950, Alan Turing proposed a test: was it possible to have a conversation with a machine and not be able to tell it apart from a human? He called this “the imitation game”, and it was introduced in the paper “Computing Machinery and Intelligence”. He intended the test as a proxy for the deeper and vaguer question, “can a machine think?”.
Seventy years later, in 2020, large language models like OpenAI’s GPT-3 (the model that would later power ChatGPT) passed modern, rigorous variants of this test.
In 2022, OpenAI released ChatGPT publicly and it immediately captured the world’s “attention”.
It was the first chatbot you could have an extended conversation with on almost any topic (the first obvious application of a Turing test breaker).
And since then, we have seen how disruptive this technology has been: companies like OpenAI and Anthropic, which train and host these models, have become some of the fastest growing in history.
While it might seem like it on the surface, such progress doesn’t happen in a vacuum or overnight. Under the covers, there are gradual advances that eventually culminate in such an event. And indeed, there was a flurry of activity (in terms of papers) leading up to the 2020 breakthrough, and since then a number of other important developments as these models continue to gain new capabilities and improve.
Since the landscape is starting to stabilize, it’s a good time to review some of the key papers leading up to this breakthrough.
The chart below shows a timeline of the papers we’ll be covering in this article (14 on the axis means the year 2014, and so on).

The key architecture that caused a quantum leap to materialize was called the Transformer. So, what was the deep insight behind it?
I) Transformers: subtracting, not adding

A single deep learning architecture, called the Transformer, pushed natural language models to new heights within a few short months of its release. It was introduced in the famous 2017 paper, “Attention is all you need”.
So what was the key advance that facilitated this? What was the “missing element” that was not there in the previous state of the art that the transformer introduced?
What’s really interesting is that if you consider the delta between the Transformer and the previous state of the art, nothing new was added. Instead, a specific element (recurrence) was subtracted. This is reflected in the title of the paper, “Attention is all you need”, meaning you can do away with the “other stuff” that is not attention. But if this famous paper didn’t invent attention, which one did?
II) Translation is where it all started

Although ChatGPT is a chatbot (a prerequisite for passing the Turing test), the use case that was driving all the early advances towards the transformer architecture was language translation. In other words, translating from one human language to another.
So, there is a “source statement” in language-1 (e.g., English) and the goal is to convert it to a “target statement” in language-2 (e.g., Spanish).
This is essentially a “sequence to sequence” task. Given an input sequence, return an output sequence. And there are many other things besides translation that can be framed as sequence to sequence tasks. For instance, a chatbot is also a sequence to sequence task. It takes the input sequence from the user and returns the sequence that is the response from the chatbot.
In general, progress in these kinds of models happens iteratively. There is some architecture that is the current state of the art on some task. Researchers understand its weaknesses, the things that make it hard to work with. A new architecture is proposed that addresses those weaknesses. It’s run through the benchmarks and, if it succeeds, becomes the new dominant architecture. This is also how the Transformer came to be.
The first neural network based language translation models operated in three steps (at a high level). First, an encoder would embed the “source statement” into a vector space, resulting in a “source vector”. Then, the source vector would be mapped to a “target vector” through a neural network (some non-linear mapping), and finally a decoder would map that vector to the “target statement”.
People quickly realized that the single vector that was supposed to encode the source statement had too much responsibility, since the source statement could be arbitrarily long. So, instead of one vector for the entire statement, the idea became: convert each word into a vector, and add an intermediate element that picks out the specific words the decoder should focus on at each step. This intermediate architectural element was dubbed “the attention mechanism”.
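To make the shift concrete, here is a minimal sketch in plain NumPy (all names, sizes, and numbers are invented for illustration) of the difference between squashing a sentence into one fixed vector and keeping one vector per word while forming a weighted “context” over them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "word vectors" for a 5-word source sentence, 8 dimensions each (made-up numbers).
word_vectors = rng.normal(size=(5, 8))

# Old approach: the whole sentence must be squashed into ONE fixed-length vector,
# no matter how long the sentence is.
single_source_vector = word_vectors.mean(axis=0)

# Attention approach: keep all per-word vectors, and at each decoding step
# compute weights saying how much each source word matters right now.
scores = rng.normal(size=5)                       # in a real model these come from the decoder state
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights are positive and sum to 1
context_vector = weights @ word_vectors           # weighted sum: the decoder's per-step "focus"

print(single_source_vector.shape, context_vector.shape)  # both (8,), but the context changes per step
```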
It so happened that this intermediate mechanism, responsible for helping the decoder pick out the words to pay attention to, also had very desirable scaling characteristics.
The next idea was to make it the centerpiece of the entire architecture. And this is what led to the current state of the art model, the Transformer.
Let’s look at the key papers in language translation leading up to the transformer.
II-A) Attention is born
Since attention is apparently “all you need” (see section I), let’s first understand what attention even is. For that, we have to go to the paper that introduced it.
2014: “Neural machine translation by jointly learning to align and translate” https://arxiv.org/abs/1409.0473
This paper first introduced the “attention mechanism”. It’s a way for the decoder, as it generates each position of the target statement, to “attend to” the most relevant parts of the source sentence during translation.
Here are the key points:
1) They started with the encoder-decoder mechanism for translating between languages, as described above. The key limitation called out was the encoder step (taking a source statement and encoding it to a vector in a high dimensional space). If the source statement was very long (especially longer than the typical lengths observed in the training data), the performance of simple encoder-decoder models would deteriorate, because too much responsibility was placed on that single fixed-length vector to encode the full context of the source statement.
2) Quoting from the paper on their new approach: “The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.” In other words, they moved away from encoding the entire input sentence as a single vector and towards encoding the individual words of the input sentence as vectors.
3) On the decoder, in section 3 they say: “Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach, the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.” This is the first mention of the attention mechanism: the decoder decides which parts of the input sentence to “pay attention” to as it generates the output sequence. A rough sketch of this weighting step is shown below.
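Here is that weighting step in the additive (“alignment model”) form the paper describes, written in plain NumPy with made-up weights and sizes; this is an illustration of the idea, not the paper’s implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 8                                    # hidden size (arbitrary, for illustration)
annotations = rng.normal(size=(6, d))    # h_1..h_6: one "annotation" vector per source word
decoder_state = rng.normal(size=d)       # s_{i-1}: decoder state before emitting target word i

# Additive alignment model: e_ij = v^T tanh(W s_{i-1} + U h_j)
W, U, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
energies = np.tanh(decoder_state @ W + annotations @ U) @ v   # one score per source word

alphas = np.exp(energies) / np.exp(energies).sum()   # attention weights over source words
context = alphas @ annotations                       # what the decoder "retrieves" for this step

print(np.round(alphas, 2))   # which source words this target position attends to
```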
The mechanism by which the words were converted to vectors was based on recurrent neural networks (RNNs). Details of this can be obtained from the paper itself. These recurrent neural networks relied on hidden states to encode the past information of the sequence. While it’s convenient to have all that information encoded into a single vector, it’s not good for parallelizability since that vector becomes a bottleneck and must be computed before the rest of the sentence can be processed. And this limits the extent to which the power of GPUs can be brought to bear on training these models.
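A tiny sketch (toy NumPy, invented sizes) of why this matters for parallelism: the recurrent update below has to be run one step at a time, while attention scores between all word pairs can be computed as a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
inputs = rng.normal(size=(10, d))         # a 10-word sentence, one vector per word
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))

h = np.zeros(d)
for x in inputs:                          # strictly sequential: step t needs h from step t-1
    h = np.tanh(h @ W_h + x @ W_x)        # the whole past is squeezed into this one vector

# By contrast, attention scores between all pairs of words come from one matrix product,
# which a GPU can compute for every position at once.
scores = inputs @ inputs.T                # shape (10, 10)
```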
II-B) And now it’s all you need, apparently
We now get to the most famous paper of them all, the one that introduced the Transformer architecture that would later go on to beat the Turing test.
2017: “Attention is all you need” https://arxiv.org/abs/1706.03762
This one originated at Google Brain.
From the title, you can infer that the authors talk about attention like it’s already a thing; it was three years old at the time. So if they didn’t invent “attention”, what was their novel contribution? As the title suggests, they simplified the architecture down to “just attention”, doing away with recurrence completely. Well, they did combine attention with simple feed-forward networks, so the title is a bit of a lie; in fact, most of the parameters live in the feed-forward layers. But they got rid of the recurrent layers completely. Just attention and feed-forward layers, repeated: in parallel (“multi-head” attention) and in sequence (stacked blocks).
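As a minimal sketch of what one such block looks like, here is a single-head version in plain NumPy with invented sizes and weights; real Transformer blocks add multiple heads, residual connections, layer normalization, and positional information, all omitted here:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d = 16                                     # model width (toy)
tokens = rng.normal(size=(7, d))           # 7 token vectors

# Self-attention: every token looks at every other token, no recurrence.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
attn = softmax(Q @ K.T / np.sqrt(d)) @ V   # scaled dot-product attention, all positions at once

# Position-wise feed-forward layer: where most of the parameters live.
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
out = np.maximum(attn @ W1, 0) @ W2        # ReLU feed-forward applied to each position

print(out.shape)   # (7, 16): same shape in and out, so blocks can be stacked
```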
Since attention had the nice property of being parallelizable, they could scale to larger architectures and train them far more efficiently by leveraging the power of GPUs.
With this new, simpler architecture, they set a new state of the art on the major translation benchmarks.
Kind of wild, given that their core contribution was removing a key component from existing models and simplifying the architecture. This could easily have been just an ablation study in the earlier paper that introduced attention.
As to why this might have occurred to them, one can imagine them being frustrated by the hardships the recurrent layers were causing, while the attention layers were easy to train. This might have led them to wonder: “if the recurrent layers are so problematic, why not do away with them entirely?”
III) Beyond translation

This is where OpenAI first enters the scene. Unlike research labs inside large product companies, it could chase a mandate of “general intelligence” on language tasks.
III-A) Generative Pre-training
In this paper, they introduced the first GPT (Generative Pre-trained Transformer) model, GPT-1. The model was meant to be a general-purpose toolkit capable of performing a wide range of language tasks. It had about 117 million parameters.
2018: “Improving Language Understanding by Generative Pre-Training” https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Unlike Vaswani et al., the authors of the previous paper, who were focused on language translation, the authors of this paper were interested in building a general agent capable of excelling at multiple language tasks. This makes sense for a pure research organization, which is what OpenAI was at that point. The big idea in this paper is: don’t train models for every task from scratch.
First train a model that is generally good at language in an unsupervised manner on a large corpus of text.
Note that this step, training a general model on a large corpus of text, was the subject of a landmark copyright lawsuit (brought against Anthropic, one of the companies that trains these models, by authors of books it trained its models on), one that is extremely consequential for the future of such AI models.
On June 23, 2025, U.S. District Judge William Alsup ruled that Anthropic’s use of lawfully acquired (purchased and scanned) copyrighted books to train its AI models constituted “fair use” under U.S. copyright law. He described the training as “quintessentially transformative,” likening it to how “any reader aspiring to be a writer” learns and synthesizes content in their own words.
Then, tune it further in a supervised manner on task-specific data. Since the transformer is a sequence-to-sequence model, all we have to do is frame the task as a sequence-to-sequence problem. For example, if the task is sentiment analysis, the input becomes the sentence whose sentiment needs to be deciphered, and the target output becomes “POSITIVE” or “NEGATIVE”.
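Here is a rough illustration of that framing (the input/target format is invented for this example, not the exact scheme from the paper): every task becomes nothing more than pairs of text in and text out.

```python
# Hypothetical fine-tuning pairs: the task is expressed purely as text in, text out.
sentiment_examples = [
    ("The acting was wooden and the plot made no sense.", "NEGATIVE"),
    ("A warm, funny film with a terrific ending.",        "POSITIVE"),
]

# The same framing works for translation, summarization, Q&A, chat, ...
translation_examples = [
    ("Translate to Spanish: The book is on the table.", "El libro está sobre la mesa."),
]

for source, target in sentiment_examples + translation_examples:
    print(f"input:  {source}\ntarget: {target}\n")
```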
This two-step approach is similar to how a human first learns general language skills and then specializes in a specific field like law.
Take Bobby Fischer, the chess grandmaster who first learnt Russian (since many of the best chess books of the time were written in it) and then read them to get good at chess.
III-B) Few shot learning
2020: “Language Models are Few-Shot Learners” https://arxiv.org/abs/2005.14165
This is the paper that first introduced the famous model GPT-3. A couple of years later, in November 2022, OpenAI released ChatGPT to the public; the model underlying the chatbot was a fine-tuned descendant of the one in this paper. GPT-3 had 175 billion parameters.
The authors spend a lot of time marveling at how good humans are at generally learning to do novel language tasks with just a few illustrative examples. They then dream about AI models showing the same kind of generalizability without having to re-train the model for every single task. They argue that scaling the models to more and more parameters can take us towards this goal.
Quoting: “In recent years the capacity of transformer language models has increased substantially, from 100 million parameters, to 300 million parameters, to 1.5 billion parameters, to 8 billion parameters, 11 billion parameters, and finally 17 billion parameters. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.”
The idea is to give the model demonstrative examples at inference time as opposed to using them to train its parameters. If no such examples are provided in-context, it is called “zero-shot”. If one example is provided, “one-shot”, and if a few are provided, “few-shot”.
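A quick sketch of what the three settings look like as raw prompts, using the English-to-French example from the paper’s figures (the exact templates vary by task):

```python
task_description = "Translate English to French:"
examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
]
query = "cheese"

def build_prompt(n_shots: int) -> str:
    """Zero-shot: n_shots=0, one-shot: n_shots=1, few-shot: n_shots>=2."""
    lines = [task_description]
    for en, fr in examples[:n_shots]:
        lines.append(f"{en} => {fr}")      # demonstrations placed in the context window
    lines.append(f"{query} =>")            # the model is asked to continue this text
    return "\n".join(lines)

print(build_prompt(0))   # zero-shot
print(build_prompt(2))   # few-shot: no parameters are updated, only the prompt changes
```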
The graph below, taken from the paper, shows not only how performance improves as the number of model parameters goes up, but also how the models are able to take advantage of the one or few examples shown to them. The performance in the one- and few-shot cases starts to pull away from zero-shot as the number of parameters increases.

A fascinating experiment was evaluating the model’s performance on simple arithmetic tasks like two- and three-digit addition and subtraction. Quoting: “On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction.” Even models with 13 billion parameters did poorly on two-digit addition.
And this paragraph must have made the authors feel like proud parents:
“To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms “<NUM1> + <NUM2> =” and “<NUM1> plus <NUM2>”…”. They found only a handful of exact matches, suggesting the model was not simply regurgitating memorized answers.
Looking ahead and Conclusion
These were some of the key papers leading up to the GPT-3 model that was released to the public as a chatbot (ChatGPT) in late 2022. Hopefully they provide a glimpse into the iterative evolution that led to the breaking of the Turing test.
There have been many notable papers since that have removed limitations and further improved the capabilities of these models.
First, there was a need to align the responses of the models with human preferences, to prevent the models from being toxic, unhelpful, and so on. This is where RLHF (Reinforcement Learning from Human Feedback) was put into play. It adapted a technique previously used to teach models to play video games to the task of tuning the parameters of language models. The OpenAI paper was titled “Training language models to follow instructions with human feedback” and came out in early 2022.
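The heart of the reward-modelling step behind RLHF can be sketched as a pairwise preference loss: given a human’s choice between two candidate responses, push the reward of the chosen one above the rejected one. The toy function below (scalar rewards, made-up numbers) shows just that loss; in the real pipeline the rewards come from a neural network, and the language model is then tuned against that reward model with reinforcement learning.

```python
import numpy as np

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Pairwise (Bradley-Terry style) loss used in reward modelling:
    # it is small when the chosen response scores higher than the rejected one.
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

print(preference_loss(2.0, -1.0))   # chosen clearly preferred -> small loss
print(preference_loss(-1.0, 2.0))   # ranking inverted -> large loss
```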
If you were an early adopter of these models, you might remember that if you asked about current news, the model would reply with something like “I am a language model trained on a snapshot of the internet before 2022” and was unable to answer questions about events since that snapshot. Further, as we saw in section III-B, these models wouldn’t achieve perfect scores even on simple arithmetic. Why rely on the generative process for these kinds of things when we have specialized tools? Instead of simply saying it wasn’t trained on current affairs, the model could call a search or news API and retrieve the information it needed. Similarly, instead of trying to do arithmetic through its generative process, it could call a calculator API. This is where the Toolformer paper (https://arxiv.org/abs/2302.04761) from Meta’s AI lab (FAIR at the time) comes in: it taught these models to use external tools such as calculators and search APIs.
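As a toy illustration of the idea (not Toolformer’s actual training procedure or exact call syntax): the model emits a marked-up tool call inside its text, a simple wrapper executes the call, and the result is spliced back into the output.

```python
import re

def run_calculator(expression: str) -> str:
    # Toy calculator "tool"; a real system would validate the expression carefully.
    return str(eval(expression, {"__builtins__": {}}))

def fill_tool_calls(text: str) -> str:
    # Replace each [Calculator(...)] marker with the tool's result.
    return re.sub(r"\[Calculator\((.*?)\)\]",
                  lambda m: run_calculator(m.group(1)),
                  text)

# Imagine the language model generated this text, deferring the arithmetic to a tool:
generated = "The invoice total is [Calculator(127 + 89 + 450)] dollars."
print(fill_tool_calls(generated))   # -> "The invoice total is 666 dollars."
```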
This article covered the advances up to the release of ChatGPT, which can fairly be called a pivotal moment for AI models. Next up in the series, I’ll cover follow-up advances like the ones mentioned in this section, which have continued to push the boundaries. Stay tuned.