
Tokens are a big reason why today’s generative AI fails | TechCrunch

Generative AI models don’t process text the same way humans do. Understanding their “token”-based internal environments can help explain some of their strange behaviors—and stubborn limitations.

Most models, from small on-device models like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture known as the transformer. Because of the way transformers build up associations between text and other types of data, they cannot accept or output raw text, at least not without a huge amount of computation.

So, for both pragmatic and technical reasons, today’s transformer models work with text that is broken into smaller, bite-sized pieces called tokens—a process known as tokenization.

Tokens can be words, such as “fantastic”. Or they can be syllables, like “fan”, “tas” and “tic”. Depending on the tokenizer (the model that performs the tokenization), they may even be individual characters in words (e.g. “f”, “a”, “n”, “t”, “a”, “s”, “t”, “i”, “c”).
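To make this concrete, here is a quick sketch using OpenAI’s open-source tiktoken tokenizer (an illustration of the general idea, not code from any model mentioned here); the exact splits depend on the tokenizer’s vocabulary, so your output may differ:

```python
# A minimal sketch using OpenAI's tiktoken library (pip install tiktoken).
# The exact splits depend on the tokenizer's vocabulary; output may differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an encoding used by recent OpenAI models

for text in ["fantastic", "defenestration"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")

# Common words often map to a single token, while rarer or longer words
# tend to be chunked into several subword pieces.
```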

Using this method, transformers can accept more information (in a semantic sense) before reaching an upper limit known as the context window. But tokenization can also introduce bias.

Some tokens have odd spacing that can derail a transformer. A tokenizer may encode “once upon a time” as “once”, “upon”, “a”, “time”, for example, while encoding “once upon a ” (which has a trailing space) as “once”, “upon”, “a”, “ ”. Depending on how the model is prompted (with “once upon a time” or “once upon a ”), the results can be completely different, because the model doesn’t understand (as a human would) that the meaning is the same.

Tokenizers also treat case differently. “Hello” is not necessarily the same as “HELLO” for a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE”, “El” and “O”). This is why many transformers fail the capitalization test.
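Both quirks are easy to see by counting tokens for spacing and case variants. Another illustrative sketch with tiktoken (the counts will differ between tokenizers):

```python
# Illustrative only: token counts for spacing and case variants.
# Exact numbers depend on the tokenizer's vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["once upon a time", "once upon a ",   # trailing space
             "hello", "Hello", "HELLO"]:            # case variants
    ids = enc.encode(text)
    print(f"{text!r:22} -> {len(ids)} token(s): {[enc.decode([i]) for i in ids]}")

# The same words can tokenize differently depending on surrounding whitespace,
# and uppercase strings typically split into more tokens than lowercase ones.
```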

“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”

This “fuzziness” creates even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don’t, for instance, nor do Korean, Thai or Khmer.

A 2023 Oxford study found that, due to differences in how non-English languages are tokenized, a transformer can take twice as long to complete a task formulated in a non-English language compared to the same task formulated in English. The same study (and another) found that users of less “token-efficient” languages are likely to see worse model performance, yet pay more for usage, given that many AI vendors charge per token.

Tokenizers often treat each character in logographic writing systems (systems in which printed symbols represent words without relating to pronunciation, such as Chinese) as a distinct token, resulting in high token counts. Similarly, tokenizers processing agglutinative languages (languages where words are made up of small meaningful units called morphemes, such as Turkish) tend to turn each morpheme into a token, increasing the overall token count. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages need up to 10 times more tokens to capture the same meaning as in English.
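Those gaps are straightforward to measure. The sketch below, an illustration in the spirit of Jun’s analysis rather than her actual code, counts tokens for a short greeting in a few languages; the ratios vary by tokenizer:

```python
# Illustrative comparison of token counts for roughly the same phrase in
# different languages. The ratios depend heavily on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Thai":    "สวัสดี วันนี้คุณเป็นอย่างไรบ้าง",
    "Chinese": "你好，你今天好吗？",
}

for language, text in samples.items():
    print(f"{language:8} {len(enc.encode(text)):3} tokens")

# Languages underrepresented in the tokenizer's training data tend to need
# many more tokens for the same meaning, which means slower and costlier
# inference for their speakers.
```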

Beyond linguistic inequalities, tokenization may explain why today’s models are bad at math.

Digits are rarely tokenized consistently. Because they don’t actually know what numbers are, tokenizers can treat “380” as a single token but represent “381” as a pair (“38” and “1”), effectively destroying the relationships between digits and their results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, especially temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926.)

This is also why models are not good at solving anagram or word reversal problems.
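Both failures trace back to how text gets chunked. The sketch below (again an illustration using tiktoken, not tied to any particular model discussed here) tokenizes nearby numbers and a reversed word; the exact splits vary by tokenizer:

```python
# Illustrative only: how numbers and reversed words get chunked.
# Exact splits depend on the tokenizer's vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["380", "381", "7735", "7926", "lollipop", "popillol"]:
    pieces = [enc.decode([i]) for i in enc.encode(text)]
    print(f"{text:>8} -> {pieces}")

# The model never sees individual digits or letters by default, only these
# chunks, so place value and letter order have to be inferred from statistics
# rather than read directly off the input.
```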

So tokenization clearly presents challenges for generative AI. Can they be solved?

Perhaps.

Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with the raw bytes representing text and other data, is competitive with some transformer models on language-analysis tasks while better handling “noise” such as words with swapped characters, odd spacing and capitalized characters.
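For context, “working directly with raw bytes” means the model consumes the UTF-8 byte sequence rather than token IDs. A minimal illustration of what that input looks like (not MambaByte’s actual code):

```python
# Not MambaByte's code: just a look at the raw UTF-8 bytes a byte-level model
# would consume instead of tokenizer IDs.
text = "Hello, สวัสดี"
byte_values = list(text.encode("utf-8"))
print(len(text), "characters ->", len(byte_values), "bytes")
print(byte_values)

# Byte sequences are longer than token sequences, which is why byte-level
# models lean on architectures (like state space models) whose cost doesn't
# grow quadratically with sequence length.
```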

However, models like MambaByte are in the early stages of research.

“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said. “For transformer models specifically, computation scales quadratically with sequence length, and so we really want to use short text representations.”

Barring a breakthrough in tokenization, it looks like new model architectures will be the key.
