Meta AI recently published pre-print research showing off a radical new “Megabyte” framework for building generative pre-trained transformer (GPT) systems.
Dubbed “promising” by OpenAI’s Andrej Karpathy, former director of artificial intelligence at Tesla, the new architecture is designed to process large volumes of data — such as images, novels and video files — without the use of a process known as tokenization.
Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details.
Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with… https://t.co/t240ZPxPm7
— Andrej Karpathy (@karpathy) May 15, 2023
Tokenization is a process comparable to file compression. To process large amounts of data, GPT models convert bytes to tokens. The tokens are then processed by the transformer and used to generate output tokens, which are then decoded.
The tokenization process allows an AI system to process larger strings of data as numbers. The words “my favorite color is red,” if processed by OpenAI’s ChatGPT, for example, would be converted to the token string “3666, 4004, 3124, 318, 2266, 13” for processing.
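To make the contrast concrete, the short Python sketch below sets the six token IDs quoted above next to the raw bytes that a tokenizer-free model would have to read for the same sentence. The token IDs are copied from the example rather than recomputed, the sentence is used without trailing punctuation, and the variable names are illustrative only.

```python
# Token IDs quoted in the article for the example sentence (copied, not recomputed).
token_ids = [3666, 4004, 3124, 318, 2266, 13]

# The same words as raw UTF-8 bytes, which is what a byte-level model reads.
text = "my favorite color is red"
byte_values = list(text.encode("utf-8"))

print(len(token_ids))    # 6
print(len(byte_values))  # 24
print(byte_values[:5])   # [109, 121, 32, 102, 97]
```

The byte sequence is four times longer than the token sequence here, which is the length problem Karpathy points to when he says the devil is in the details.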
Unfortunately, even with tokenization, the amount of data current state-of-the-art systems can process still has a hard limit. For GPT-3.5, the limit is slightly over 4,000 tokens or about 3,000 words, whereas GPT-4 maxes out at around 32,000 tokens or about 24,000 words.
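The word counts follow from a rule of thumb of roughly three English words for every four tokens. The Python sketch below works through that arithmetic. The 0.75 ratio is inferred from the figures above rather than taken from any official constant, and the context sizes of 4,096 and 32,768 tokens are assumed values matching "slightly over 4,000" and "around 32,000."

```python
# Assumption: about 0.75 English words per token, as implied by the article's figures.
WORDS_PER_TOKEN = 0.75

def approx_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return int(tokens * WORDS_PER_TOKEN)

# Assumed context sizes for the two models discussed in the article.
print(approx_words(4096))   # 3072
print(approx_words(32768))  # 24576
```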
Meta’s new Megabyte system ditches tokenization in favor of a novel multi-layer prediction architecture capable of end-to-end modeling over 1 million bytes of data.
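According to the pre-print, Megabyte splits a byte sequence into fixed-size patches, with one model working across patches and a smaller one predicting the bytes inside each patch. The Python sketch below illustrates only the patching step. The patch size of 8, the zero padding and the function name are assumptions chosen for illustration and are not Meta's implementation.

```python
def split_into_patches(data: bytes, patch_size: int = 8, pad: int = 0) -> list[list[int]]:
    """Split raw bytes into equal-length patches, padding the last one if needed."""
    values = list(data)
    remainder = len(values) % patch_size
    if remainder:
        # Pad so that every patch has exactly patch_size entries.
        values += [pad] * (patch_size - remainder)
    return [values[i:i + patch_size] for i in range(0, len(values), patch_size)]

patches = split_into_patches(b"my favorite color is red")
print(len(patches))  # 3
print(patches[0])    # [109, 121, 32, 102, 97, 118, 111, 114]
```

With 8-byte patches, a 1-million-byte input would become 125,000 patches, so the part of the model that looks across the whole input deals with a much shorter sequence than the raw byte count suggests.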
Most English-language text uses 8-bit encoding, in which each character takes up one byte of data. An AI system capable of processing 1 million bytes of data without tokenization could therefore work with text documents of about 1 million characters. At roughly six bytes per English word, counting the space that follows it, that comes to somewhere around 170,000 words, or about seven times GPT-4's limit.
For comparison, GPT-4 can currently handle about 10 feature-length news articles in a single prompt, whereas Megabyte would be able to parse a full-length novel.
Meta’s Megabyte model also performed well on ImageNet tests and benchmarks related to processing audio files, either equaling or surpassing existing byte-based transformer models such as DeepMind’s Perceiver AR on both:
“Megabyte matches the state-of-the-art performance of PerceiverAR whilst using only half the compute.”
The implications of this research could be far-reaching. Tokenization is considered a roadblock in the field due to its hard data limits and the amount of energy and time required to train systems.
Without tokenization, it should be possible to train AI models with stronger foundational support for non-English languages, especially those that can’t be easily encoded in standard 8-bit characters.
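The point about 8-bit characters can be seen by encoding a few words as UTF-8, the most common text encoding on the web. In the Python sketch below, the sample words and the helper name are illustrative choices only. It shows that characters outside the basic Latin alphabet take two or three bytes each, which a byte-level model reads directly without needing a vocabulary built for that language.

```python
def utf8_length(word: str) -> int:
    """Return the number of bytes a word occupies when encoded as UTF-8."""
    return len(word.encode("utf-8"))

# Illustrative sample words meaning "red" in three languages.
samples = {"English": "red", "Russian": "красный", "Japanese": "赤"}

for language, word in samples.items():
    print(language, len(word), utf8_length(word))
# English 3 3
# Russian 7 14
# Japanese 1 3
```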
This could lead to the further democratization of these technologies and enable everything from cryptocurrency trading bots to decentralized autonomous organization technologies to be built in native language codes around the world.
It would also increase the capacity of models like ChatGPT to work with image, video and audio files by generating multimedia clips using around the same time and energy consumption as text.