Essay
During our recordings of the podcast, we came to talk about the use of large language models (LLMs) as search engines and databases, and the idea that LLMs like GPT are like “lossy compressed images of the internet”. The argument, roughly, treats lossy compression as a method of identifying statistical regularities in the thing to be stored, so that it can be approximately reconstructed on demand later. This, according to the analogy, explains the hallucinations and misbehaviors of LLMs: they are compression artefacts. Viewing LLMs as copies of their training data suggests a particular way of looking at their behavior. If you think that an LLM’s training amounts to memorizing its training data and making it accessible via prompt queries, then the idea of using LLMs as search engines or databases seems natural – all that needs to be done is to manage the compression artefacts. The use of LLMs to generate novel content, on the other hand, seems senseless – surely you would at best be repackaging what already exists, and at worst you’d be directly plagiarising the works of others.
However, in that unpublished podcast conversation between Nicholas Guttenberg and Tanner Lund, the point was raised that there’s a distinction between compressed files and compression algorithms. When you look at the structure of LLMs, how they’re trained, and what specifically they’re trained to do, they don’t behave as one would expect a compressed file to behave; instead they act like compression algorithms.
GPT models, for example, are trained to predict the probability distribution over the next token, not just to say what the correct next token is. The distinction is important because it means that – if the model is trained correctly – there should not be just one continuation (the ‘correct’ file you’re trying to decompress), but a whole distribution of continuations weighted according to their likelihood. It’s not just that the LLM is blurry and getting things wrong: even the idealized ‘best possibly trained on the internet’ LLM should not reproduce the documents that constitute ‘the internet’ exactly. Such models are evaluated not on their training data but on unseen test data, whose relationship with the training data is just that, in some sense, it should be drawn from a similar distribution.
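To make that concrete, here is a minimal sketch – assuming the Hugging Face `transformers` library and the public “gpt2” checkpoint, chosen purely for illustration – of what ‘predicting a distribution’ means in practice: the model hands back a probability for every token in its vocabulary, not a single ‘correct’ continuation.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library and the
# public "gpt2" checkpoint (used only as an illustration): the model returns a
# probability for every token in its vocabulary, not one "correct" continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The inventor of the telephone was"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    next_token_logits = model(input_ids).logits[0, -1]   # scores for the next token
probs = torch.softmax(next_token_logits, dim=-1)          # a full distribution

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p.item():.3f}")
```

Training pushes this whole distribution toward the distribution of plausible continuations, which is not the same thing as pushing it toward any single stored document.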
In a compressed file, there is no such sense of unseen data – the file represents a specific document, and the correct behavior is to recover that document. For a compression algorithm, however, what matters is its behavior on a user’s files – files the programmer never gets to see in detail.
There’s a more formal connection between LLMs and compression algorithms as well. In information theory, the better you can infer something’s distribution, the better you can compress it – to do so, you store only the part you can’t predict. Compression algorithms therefore implicitly represent priors and inferences about the distribution of the data – things that can be known independently and universally, without regard to the specific message or thing being compressed – whereas compressed files represent the remainder that is unpredictable given only that context. This is exactly the way that LLMs are trained and evaluated.
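As a toy illustration of that link (with made-up symbol frequencies, nothing to do with any particular LLM): the expected number of bits per symbol when you encode data drawn from a true distribution p using a code built from a model q is the cross-entropy, H(p) plus the KL divergence from p to q, so the better the model matches the data, the shorter the code.

```python
# A toy sketch with invented frequencies: better inference of the true
# distribution p means a shorter expected code length, since the expected
# bits per symbol under a model q is H(p) + KL(p || q).
import math

p = {"e": 0.5, "t": 0.3, "z": 0.2}            # the "true" distribution (invented)
q_good = {"e": 0.45, "t": 0.35, "z": 0.20}    # a model close to p
q_bad = {"e": 1 / 3, "t": 1 / 3, "z": 1 / 3}  # a model that predicts nothing useful

def expected_bits(p, q):
    """Expected code length per symbol when coding p-distributed data with q."""
    return sum(p[s] * -math.log2(q[s]) for s in p)

print(f"good model: {expected_bits(p, q_good):.3f} bits/symbol")
print(f"bad model:  {expected_bits(p, q_bad):.3f} bits/symbol")
print(f"entropy of p (the floor): {expected_bits(p, p):.3f} bits/symbol")
```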
The key point is that LLMs are more like compression algorithms than they are compressed (lossy or lossless) files.
This isn’t a new idea; it goes all the way back to Shannon in 1948, who worked out the mathematics of sending messages across noisy channels and just how much power you need at minimum to do so. In that context, the point was how much redundancy you would need to include in order to guarantee that the original message could be reconstructed even if there were errors, but it also implies things in the limit where the channel becomes noiseless.
The idea is: let’s say you’re sending a message, but you and the recipient can agree in advance on some shared information that is universal across all such messages you might wish to send. For example, you can agree on the distribution of letter uses in English. In that case, you only have to transmit the parts of the message which disagree with what that prior information would predict. Or, in a more nuanced sense, if you’re using code words of different lengths to encode sub-parts of your message, you can assign the shortest code word to the most likely continuation, the next-shortest code word to the next most likely continuation, and so on. Even if, say, your message is in romanized Japanese and the model is ‘wrong’ for that, it’s still possible to encode that message – it will just take up more space.
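Here is a small sketch of that codeword idea, using invented letter frequencies and a standard Huffman construction (nothing specific to LLMs): likelier symbols end up with shorter codewords, so typical messages shrink, while unlikely symbols simply cost more bits rather than becoming unencodable.

```python
# A small sketch of the codeword idea: build a Huffman code from invented
# letter frequencies and note that likelier letters get shorter codewords.
import heapq
import itertools

freqs = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075,
         "n": 0.067, "q": 0.002, "z": 0.001}    # illustrative values only

def huffman(freqs):
    tie = itertools.count()                      # tie-breaker so heap entries compare
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)        # merge the two least likely groups
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

for sym, code in sorted(huffman(freqs).items(), key=lambda kv: len(kv[1])):
    print(f"{sym!r}: {code}")                    # 'e' gets a short code, 'z' a long one
```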
Now imagine that instead of something which is static like ‘the frequency of letter usages’, your shared prior information includes things which are dependent on the message so far: the frequency of an ‘a’ given that the last letter was an ‘n’. Or the frequency of letter sequences, or entire words. You could go even further and use a complicated statistical model of the rest of the message conditioned on the entirety of the conversation so far. You can use, say, an LLM like GPT as this shared prior information.
If you do that, then the LLM in that structure fills the same role as the compression algorithm: it’s the shared software that lets one person compress a file, transmit it, and have it decompressed back to the same file that was sent. Furthermore, when viewed this way, an LLM can be used as part of a lossless compression algorithm. The equivalent ‘compressed internet’ would be a file that, for each document on the internet, stores only where that document would disagree with the LLM’s continuation of the document.
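As a sketch of how such a scheme could work losslessly – with a toy character-level bigram predictor standing in for the LLM, and stored ranks standing in for a proper entropy coder – you record, for each position, only where the actual character sits in the model’s ranked predictions. A good predictor makes that record mostly zeros, which is cheap to encode further, and the original text is always recoverable exactly.

```python
# A toy sketch of model-based lossless compression. A character bigram model
# (standing in for an LLM) predicts a ranked list of next characters; we store
# only the rank of the character that actually occurred. Good predictions give
# mostly-zero ranks, and decoding reproduces the text exactly.
from collections import Counter, defaultdict

def build_model(corpus):
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    ranked = {prev: [c for c, _ in ctr.most_common()] for prev, ctr in counts.items()}
    fallback = [c for c, _ in Counter(corpus).most_common()]   # global frequency order
    return ranked, fallback

def predictions(ranked, fallback, prev):
    head = ranked.get(prev, [])
    return head + [c for c in fallback if c not in head]

def encode(text, ranked, fallback):
    ranks = [predictions(ranked, fallback, prev).index(ch)
             for prev, ch in zip(text, text[1:])]
    return text[0], ranks                                      # first character sent as-is

def decode(first, ranks, ranked, fallback):
    out = [first]
    for r in ranks:
        out.append(predictions(ranked, fallback, out[-1])[r])
    return "".join(out)

corpus = "the theory of the thing is that the thing is there in the thing"
ranked, fallback = build_model(corpus)
first, ranks = encode(corpus, ranked, fallback)
assert decode(first, ranks, ranked, fallback) == corpus        # lossless round trip
print(ranks)                                                   # mostly small integers
```

In a real system the rank stream (or, better, the model’s actual probabilities) would be fed to an arithmetic coder, but the division of labor is the same: the model carries the shared expectations, and the file carries only the disagreements.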
To put it another way: compression is the art of communicating only that which is surprising, and leaving the rest unsaid.
The distinction between ‘file’ and ‘algorithm’ leads to different conclusions about the utility of LLMs, the meaning of their outputs, and how they should be evaluated. For one thing, an LLM which is trained to predict the probability distribution of possible messages is actually failing if it always gives you the one particular message that occurred in its training data – that means it’s becoming worse at compressing future, unseen messages. The standard practice for testing machine learning models (including LLMs) is to hold out some data that the training process doesn’t get to see, and to call an architectural change, hyperparameter change, or training step good only if it makes the model better on that unseen data, not just on the training data. It turns out this is really difficult to ensure at scale, so it may well be that GPT memorizes more than it should.
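A toy picture of that practice, using a character bigram model and an invented corpus rather than an actual LLM: the number you watch is the loss on the split the fit never saw, not the loss on the training split.

```python
# A toy sketch of held-out evaluation: fit a character bigram model on a
# training split, then judge it by bits-per-character on text it never saw.
import math
from collections import Counter, defaultdict

text = "the cat sat on the mat and the dog sat on the log and the cat saw the dog " * 8
split = int(0.8 * len(text))
train, held_out = text[:split], text[split:]

counts = defaultdict(Counter)
for prev, nxt in zip(train, train[1:]):
    counts[prev][nxt] += 1
vocab = sorted(set(text))

def bits_per_char(data):
    total = 0.0
    for prev, nxt in zip(data, data[1:]):
        ctr = counts[prev]
        prob = (ctr[nxt] + 1) / (sum(ctr.values()) + len(vocab))   # Laplace smoothing
        total += -math.log2(prob)
    return total / (len(data) - 1)

print(f"train split:    {bits_per_char(train):.3f} bits/char")
print(f"held-out split: {bits_per_char(held_out):.3f} bits/char")  # the number that matters
```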
This matters because it means that as we get better at building LLMs, they should become less like databases, not more. The ‘hallucinations’ aren’t an error that persists only because we’re not there yet; they’re essential to what people training LLMs are actually training them to do – at least in the case of LLMs trained to predict the next token.
Furthermore, this leads to a very different perspective on the use of LLMs for generating content and for inspiring original writing. What a well-trained LLM does is not repackage existing documents in a blurry manner; rather, it identifies redundancy: all those things which we say or write which are surprising neither to us nor to the reader, but which nevertheless need to be said.
And in that sense, the things which LLMs are most suited to producing autonomously are also the things where human authorial intent would matter the least – the boilerplate, the form letters, the idioms and canned phrases that we expect to hear, the tropes and cliches. But that also means that, by looking at the points where an LLM does give diverse outputs, it can serve as a tool for identifying the places where an author could make an intentional decision without going against the expectations of structure. One could even use an LLM’s feedback to structure one’s writing so that it maximizes the opportunities to communicate something surprising to both author and reader (or just to minimize the wordiness of a thing). One could use LLM feedback to detect the moment of falling into cliche and then intentionally subvert it.
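A speculative sketch of that kind of tool, again assuming the `transformers` library and the “gpt2” checkpoint: score each position in a draft by the entropy of the model’s next-token distribution, and read low-entropy spots as boilerplate the model could fill in, high-entropy spots as places where the author has a genuine choice to make.

```python
# A speculative sketch of such a tool, assuming `transformers` and "gpt2":
# score each position of a draft by the entropy of the model's next-token
# distribution. Low entropy ~ expected boilerplate; high entropy ~ a place
# where the author has a genuine decision to make.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

draft = "Once upon a time, in a land far, far away, there lived a lonely cartographer."
ids = tokenizer(draft, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0]                  # one next-token distribution per position
probs = torch.softmax(logits, dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log2()).sum(dim=-1)

tokens = tokenizer.convert_ids_to_tokens(ids[0])
for tok, h in zip(tokens[1:], entropy[:-1]):       # entropy of the prediction *before* each token
    print(f"{tok:>15}  {h:.2f} bits")
```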
It also bears on the question of whether the use of an LLM as a writing aid is inherently plagiaristic. If the model overfits and becomes just a compressed file rather than a compression algorithm – if it increasingly returns exact documents from its training data – then that bends in the direction of LLM usage being plagiaristic. However, if overfitting is avoided and the LLM is trained in such a way that it generalizes, then the only time an LLM should match significant portions of its training data word-for-word is when that word-for-word match would be expected given the context and the prompt – for instance, when it is asked to quote directly from a famous source.
So what’s the takeaway of all this – why bother spilling digital ink nitpicking this metaphor? Putting aside frameworks for reasoning about LLMs and what they are good or bad at, perhaps the key issue is that it is possible to overfit these models, and ultimately we could even decide ‘sure, let’s just do that’. So rather than treating this as a question of fact or technical accuracy between ‘is GPT a lossy image’ and ‘is GPT a compression algorithm’, we would argue: we should not want language models to act as a store of knowledge. We should train, test, regularize, and generally challenge LLMs to act in their full capacity of separating what is surprising from what is predictable, rather than settling for something a traditional database already does better.