Is it possible for large language models (LLMs) to successfully learn non-English languages?
That’s the question at the center of an ongoing debate among linguists and data scientists. The answer isn’t just a matter of scholarly research, however: whether LLMs can learn so-called “impossible” languages has broader implications, both for how LLMs learn and for the technology’s global societal impact.
Languages that deviate from natural linguistic structures, referred to as impossible languages, typically fall into two categories. The first is not a true language at all, but an artificially constructed one governed by arbitrary rules that no speaker could follow while still making sense. The other category comprises languages that use non-standard characters or grammar, such as Chinese and Japanese.
Low-resource languages, meaning those with limited training data, such as Lao, often face challenges similar to those of impossible languages. However, they are not considered impossible languages unless they also use non-standard characters, as Burmese does.
Revisiting impossible languages
In 2023, Noam Chomsky, widely considered the father of modern linguistics, wrote that LLMs “learn humanly possible and humanly impossible languages with equal facility.”
However, in the Mission: Impossible Language Models paper that received a Best Paper award at the 2024 Association for Computational Linguistics (ACL) conference, researchers put Chomsky’s claim to the test and found that language models actually struggle to learn languages that deviate from natural linguistic structures.
Rogers Jeffrey Leo John, CTO of DataChat Inc., a company he cofounded while working as a data science researcher at the University of Wisconsin, said the Mission: Impossible paper challenged the idea that LLMs can learn impossible languages as effectively as natural ones.
“The models [studied for the paper] exhibited clear difficulties in acquiring and processing languages that deviate significantly from natural linguistic structures,” said John. “Further, the researchers’ findings support the idea that certain linguistic structures are universally preferred or more learnable both by humans and machines, highlighting the importance of natural language patterns in model training. This finding could also explain why LLMs, and even humans, can grasp certain languages easily and not others.”
Measuring the difficulty of an LLM learning a language
An LLM’s fluency in a language falls on a broad spectrum, from predicting the next word in a partial sentence to answering a question, and individual users and researchers often bring different definitions and expectations of fluency to the table. Understanding LLMs’ issues with processing impossible languages therefore starts with defining how researchers, and linguists in general, determine whether a language is difficult for an LLM to learn.

Kartik Talamadupula, a Distinguished Architect (AI) at Oracle who was previously head of Artificial Intelligence at Wand Synthesis AI, an AI platform integrating AI agents with human teams, said that when measuring the ability of an LLM, the bar is always predicting the next token (or word).
“Behavior like ‘answering questions’ or ‘logical reasoning’ or any of the other things that are ascribed to LLMs are just human interpretations of this token completion behavior,” said Talamadupula. “Training on additional data for a given language will only make the model more accurate in terms of predicting that next token, and sequentially, the set of all next tokens, in that particular language.”
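Talamadupula’s point can be made concrete in a few lines of code. The following is a minimal sketch, using the Hugging Face transformers library with the small, publicly available GPT-2 model (both chosen here purely for illustration), that inspects a model’s probability distribution over the next token for a partial sentence:

```python
# Minimal sketch: an LLM's core behavior is scoring candidate next tokens.
# Model choice (gpt2) is illustrative; any causal LM would work similarly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The distribution over the next token comes from the final position.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```

Every higher-level behavior ascribed to LLMs is built by repeatedly sampling from, or maximizing over, this distribution.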
John explained that when a model internalizes statistical patterns, meaning the probabilities of how words, phrases, and complex ideas co-occur, based on exposure to billions or trillions of examples, it can model syntax, infer semantics, and even mimic reasoning. For each language the model masters, that next-word prediction task serves as a powerful training signal.
“If a model sees enough questions and answers in its training data, it can learn: When a sentence starts with ‘What is the capital of France?’, the next few tokens are likely to be ‘The capital of France is Paris,’” said John. “Other capabilities, like question-answering, summarization, [and] translation can all emerge from that next-word prediction task, especially if you fine-tune or prompt the model in the right way.”
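As an illustration of John’s point, the sketch below treats question answering as nothing more than repeated next-token prediction, greedily extending a prompt one token at a time. GPT-2 again stands in for a production LLM; a model this small may still answer imperfectly, which is why scale and data matter.

```python
# Sketch: "answering a question" as repeated next-token prediction.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: What is the capital of France?\nA: The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=False,  # greedy decoding: always take the most likely token
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```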
Sanmi Koyejo, an assistant professor of computer science at Stanford University, said researchers also measure how quickly (in terms of training steps) a model reaches a certain performance threshold when determining whether a language is difficult to learn. He said the Mission: Impossible paper demonstrated that for AI models to learn impossible languages, they often need more training on the data to reach performance levels comparable to those achieved on natural languages.
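One standard performance measure in such comparisons is perplexity, discussed further below. As a rough sketch (again with GPT-2 as a stand-in, and invented example sentences), perplexity can be computed from a model’s average next-token loss; a sentence with scrambled word order, an “impossible” input, typically scores far worse than its natural counterpart:

```python
# Sketch: perplexity as a coarse measure of how predictable text is
# to a model (lower = better learned). Example sentences are invented.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean
        # cross-entropy of its next-token predictions.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("The cat sat on the mat."))  # natural word order
print(perplexity("mat the on sat cat The."))  # scrambled, "impossible" order
```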
Low volume of training data increases difficulty
An LLM learns everything, including language and grammar, through training data. If a topic or language does not have sufficient training data, the LLM’s ability to learn it is significantly limited. The majority of high-quality training data is currently in Chinese and English, and many non-standard languages are effectively impossible for LLMs to learn due to the lack of sufficient data.
Talamadupula said that non-standard languages such as Korean, Japanese, and Hindi often have the same issue as low-resource languages with standard characters: not having enough data for training. This dearth of data makes it difficult to accurately model the probability of next-token generation. When asked about the challenge of implied subjects in non-Western languages, he said that LLMs do not actually understand the subject of a sentence.
“Based on their training data, they just model the probability that a given token, or word, will follow a set of tokens that have already been generated. The more data that is available in a given language, the more accurate the ‘completion’ of a sentence is going to be,” he said.
“If we were to somehow balance all the data available and train a model on a regimen of balanced data across languages, then the model would have the same error and accuracy profiles across languages,” said Talamadupula.
John agreed that because the ability of an LLM to learn a language stems from probability distributions, both the volume and quality of training data significantly influence how well an LLM performs across different languages. Because English and Chinese content dominate most training datasets, LLMs show greater fluency, deeper knowledge, and stronger capabilities in those languages.
“Ultimately, this stems from how LLMs learn languages—through probability distributions. They develop linguistic understanding by being exposed to examples. If a model sees only a few thousand instances of a language, like Xhosa, compared to trillions of English tokens, it ends up learning unreliable token-level probabilities, misses subtleties in grammar and idiomatic usage, and struggles to form strong conceptual links between ideas and their linguistic representations,” said John.
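The effect John describes is visible even in a toy model. The following sketch trains a bare-bones bigram model (a drastic simplification of an LLM, with corpora invented for illustration) and shows how sparse data yields brittle, unreliable probability estimates:

```python
# Sketch: why training-data volume matters. With too few examples, a
# perfectly natural continuation gets zero estimated probability.
from collections import Counter, defaultdict

def train_bigram(corpus: list[str]) -> dict:
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_prob(counts: dict, prev: str, nxt: str) -> float:
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

tiny_corpus = ["the cat sat on the mat"]
larger_corpus = tiny_corpus + [
    "the cat slept on the sofa",
    "the dog sat on the rug",
    "the cat sat by the fire",
]

tiny = train_bigram(tiny_corpus)
larger = train_bigram(larger_corpus)

print(next_word_prob(tiny, "cat", "slept"))    # 0.0 -- never observed
print(next_word_prob(larger, "cat", "slept"))  # > 0 -- seen in more data
```

A low-resource language sits at the tiny-corpus end of this spectrum for nearly every construction it contains.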
Language structure also affects the ability to learn
Research also increasingly shows that the structure of the target language plays a role. Koyejo said the Mission: Impossible paper supports the idea that information locality (related words being close together) is an important property that makes languages learnable by both humans and machines.
“When testing various impossible languages, the researchers of the Mission: Impossible Language Models paper found that randomly shuffled languages (which completely destroys locality) were the hardest for models to learn, showing the highest perplexity scores,” said Koyejo. The Mission: Impossible paper defined perplexity as a coarse-grained metric of language learning; roughly, it measures how surprised a model is by a stretch of text, with lower scores indicating better learning. Koyejo also explained that languages created with local ‘shuffles’, where words were rearranged only within small windows, were easier for models to learn than languages with global shuffles.
“The smaller the window size, the easier the language was to learn, suggesting that preserving some degree of locality makes a language more learnable,” said Koyejo. “The researchers observed a clear gradient of difficulty—from English (high locality) → local shuffles → even-odd shuffles → deterministic shuffles → random shuffles (no locality). This gradient strongly suggests that information locality is a key determinant of learnability.”
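A simplified version of these perturbations can be sketched in a few lines. The paper’s actual procedures differ in detail (for example, it uses deterministic shuffles keyed to sentence properties), so this is illustrative only:

```python
# Sketch of "impossible language" perturbations: a full random shuffle
# destroys information locality; a local shuffle within a small window
# partially preserves it.
import random

def random_shuffle(tokens: list[str], seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    out = tokens[:]
    rng.shuffle(out)  # no locality survives
    return out

def local_shuffle(tokens: list[str], window: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        rng.shuffle(chunk)  # words move only within their window
        out.extend(chunk)
    return out

sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(local_shuffle(sentence, window=2)))  # largely readable
print(" ".join(random_shuffle(sentence)))           # locality destroyed
```

Scoring such variants with a perplexity function like the one sketched earlier illustrates the direction of the locality gradient Koyejo describes, though it does not reproduce the paper’s experiment, which trained models on the perturbed languages from scratch.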
Koyejo also pointed out that another critical element for a model learning a non-standard language is tokenization, with the character systems of East Asian languages creating special challenges. For example, Japanese mixes multiple writing systems, and the Korean alphabet combines syllable blocks. He said that progress in those languages will require increased data and architectural innovations that better suit their unique properties.
“Neither language uses spaces between words consistently. This means standard tokenization methods often produce sub-optimal token divisions, creating inefficiencies in model learning,” said Koyejo. “Our studies on Vietnamese, which shares some structural properties with East Asian languages, highlight how proper tokenization dramatically affects model performance.”
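The tokenization issue is easy to observe directly. The sketch below (example sentences invented; the GPT-2 tokenizer used here is a byte-pair-encoding tokenizer trained mostly on English text) compares how an English sentence and a Japanese sentence of the same meaning are segmented:

```python
# Sketch: BPE tokenization of a language written without word spaces.
# An English-centric tokenizer typically fragments Japanese text into
# many short, sub-optimal tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

english = "I drink coffee every morning."
japanese = "私は毎朝コーヒーを飲みます。"  # same meaning, no spaces between words

for text in (english, japanese):
    ids = tokenizer.encode(text)
    print(len(ids), "tokens:", tokenizer.convert_ids_to_tokens(ids))
```

The Japanese sentence typically yields far more tokens per character, which wastes context length and weakens the statistical signal carried by each token, exactly the inefficiency Koyejo describes.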
Insights into learning
The challenge of LLMs learning non-standard languages is both interesting and impactful, and the issues involved provide key insights into how LLMs actually learn. The Mission: Impossible Language Models paper reaches a similar conclusion, stating, “We argue that there is great value in treating LLMs as a comparative system for human languages in understanding what systems like LLMs can and cannot learn.”
Aaron Andalman, chief science officer and co-founder of Cognitiv and a former MIT neuroscientist, expanded on the paper’s conclusion, adding that LLMs don’t merely learn linguistic structures, but also implicitly develop substantial knowledge about the world during training, which gives them a deeper understanding of the languages they model.
“Effective language processing requires understanding context, which encompasses concepts, relationships, facts, and logical reasoning about real-world situations,” said Andalman. “Consequently, as models grow larger and undergo more extensive training, they accumulate more extensive and nuanced world knowledge.”
Further Reading
- Brubaker, B. Can AI models show us how people learn? Impossible languages point a way. Quanta Magazine, January 13, 2025. https://www.quantamagazine.org/can-ai-models-show-us-how-people-learn-impossible-languages-point-a-way-20250113/
- Chomsky, N. The false promise of ChatGPT. The New York Times, March 8, 2023. https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html
- Kallini, J. et al. Mission: Impossible language models. August 2024. arXiv:2401.06416. https://arxiv.org/abs/2401.06416
- Truong, S. et al. Crossing linguistic horizons: Finetuning and comprehensive evaluation of Vietnamese large language models. arXiv:2403.02715. https://arxiv.org/abs/2403.02715

