They have to be able to talk about us without us

It’s absolutely vital to be able to communicate effectively and efficiently to large groups of people. I’ve been lucky enough to get to refine and test my skills in communicating at scale for a few decades now, and the power of talking to communities is the one area where I’d most like to pass on what I’ve learned, because it’s this set of skills that can have the biggest effect on deciding whether good ideas and good work can have their greatest impact.

My own work crosses many disparate areas. Over the years, I’ve gotten to cycle between domains as distinct as building technology platforms and products for developers and creators, enabling activism and policy advocacy in service of humanist ideals, and more visible external-facing work such as public speaking or writing in various venues like magazines or on this site. (And then sometimes I dabble in my other hobbies and fun stuff like scholarship or research into areas like pop culture and media.)

What’s amazing is, in every single one of these wildly different areas, the exact same demands apply when trying to communicate to broad groups of people. This is true despite the broadly divergent cultural norms across all of these different disciplines. It can be a profoundly challenging, even intimidating, job to make sure a message is being communicated accurately, and in high fidelity, to everyone that you need to reach.

That vital task of communicating to a large group gets even more daunting when you inevitably realize that, even if you were to find the perfect wording or phrasing for your message, you’d still never be able to deliver your story to every single person in your target audience by yourself anyway. There will always be another person whom you’re trying to reach that you just haven’t found yet. So, is it hopeless? Is it simply impossible to effectively tell a story at scale if you don’t have massive resources?

It doesn’t have to be. We can start with one key insight about what it takes to get your most important stories out into the world. It’s a perspective that seems incredibly simple at first, but can lead to a pretty profound set of insights.

They have to be able to talk about us without us.

They have to be able to talk about us without us. What this phrase means, in its simplest form, is that you have to tell a story so clear, so concise, so memorable and evocative that people can repeat it for you even after you’ve left the room. And the people who hear it need to be able to do this the first time they hear the story. Whether it’s the idea behind a new product, the core promise of a political campaign, or the basic takeaway from a persuasive essay (guess what the point of this one is!) — not only do you have to explain your idea and make your case, you have to be teaching your listener how to do the same thing for themselves.

This is a tall order, to be sure. In pop music, the equivalent is writing a hit where people feel like they can sing along to the chorus by the time they get to the end of the song for the first time. Not everybody has it in them to write a hook that good, but if you do, that thing is going to become a classic. And when someone else has done it, you know it because it gets stuck in your head. Sometimes you end up humming it to yourself even if you didn’t want to. Your best ideas — your most vital ideas — need to rest on a messaging platform that solid.

Delivering this kind of story actually requires substance. If you’re trying to fake it, or to force a narrative out of fluff or fakery, that will very immediately become obvious. When you set out to craft a story that travels in your absence, it has to have a body if it’s going to have legs. Bullshit is slippery and smells terrible, and the first thing people want to do when you leave the room is run away from it, not carry it with them.

The mission is the message

There’s another challenge to making a story that can travel in your absence: your ego has to let that happen. If you make a story that is effective and compelling enough that others can tell it, then, well… those other people are going to tell it. Not you. They’ll do it in their own words, and in their own voices, and make it theirs. They may use a similar story, but in their own phrasing, so it will resonate better with their people. This is a gift! They are doing you a kindness, and extending you great generosity. Respond with gratitude, and be wary of anyone who balks at not getting to be the voice or the face of a message themselves. Everyone gets a turn telling the story.

Maybe the simple fact that others will be hearing a good story for the first time will draw them to it, regardless of who the messenger is. Sometimes people get attached to the idea that they have to be the one to deliver the one true message. But a core precept of “talk about us without us” is that there’s a larger mission and goal that everyone is bought into, and this demands that everyone stay aligned to their values rather than to their own personal ambitions around who tells the story.

Whoever will be most effective in a given context should be the person who tells the story there. And this is a forgiving environment, because even if someone doesn’t get to be the voice one day, they’ll get another shot, since repetition and consistency are also key parts of this strategy, thanks to the disciplined approach that it brings to communication.

The joy of communications discipline

At nearly every organization where I’ve been in charge of onboarding team members in the last decade or so, one of the first messages we’ve presented to our new colleagues is, “We are disciplined communicators!” It’s a message that they hopefully get to hear as a joyous declaration, and as an assertion of our shared values. I always try to explicitly instill this value into teams I work with because, first, it’s good to communicate values explicitly, but also because this is a concept that is very seldom directly stated.

It is ironic that this statement usually goes unsaid, because nearly everyone who pays attention to culture understands the vital importance of disciplined communications. Brands that are strictly consistent in their use of things like logos, type, colors, and imagery get such wildly-outsized cultural impact in exchange for relatively modest investment that it’s mind-boggling to me that more organizations don’t insist on following suit. Similarly, institutions that develop and strictly enforce a standard tone of voice and way of communicating (even if the tone itself is playful or casual) capture an incredibly valuable opportunity at minimal additional cost relative to how much everyone’s already spending on internal and external communications.

In an era where every channel is being flooded with AI-generated slop, and when most of the slop tools are woefully incapable of being consistent about anything, simply showing up with an obviously-human, obviously-consistent story is a phenomenal way of standing out. That discipline demonstrates all the best of humanity: a shared ethos, discerning taste, joyful expression, a sense of belonging, an appealing consistency. And best of all, it represents the chance to participate for yourself — because it’s a message that you now know how to repeat for yourself.

Providing messages that individuals can pick up and run with on their own is a profoundly human-centric and empowering thing to do in a moment of rising authoritarianism. When the fascists in power are shutting down prominent voices for leveling critiques that they would like to censor, and demanding control over an increasingly broad number of channels, there’s reassurance in people being empowered to tell their own stories together. Seeing stories bubble up from the grassroots in collaboration, rather than being forced down upon people from authoritarians at the top, has an emotional resonance that only strengthens the substance of whatever story you’re telling.

How to do it

Okay, so it sounds great: Let’s tell stories that other people want to share! Now, uh… how do we do it? There are simple principles we can follow that help shape a message or story into one that is likely to be carried forward by a community on its own.

  • Ground it in your values. When we began telling the story of my last company Glitch, the conventional wisdom was that we were building a developer tool, so people would describe it as an “IDE” — an “integrated development environment”, which is the normal developer jargon for the tool coders use to write their code in. We never described Glitch that way. From day one, we always said “Glitch is the friendly community where you'll build the app of your dreams” (later, “the friendly community where everybody builds the internet”). By talking about the site as a friendly community instead of an integrated development environment, it was crystal clear what expectations and norms we were setting, and what our values were. Within a few months, even our competitors were describing Glitch as a “friendly community” while they were trying to talk about how they were better than us about some feature or the other. That still feels like a huge victory — even the competition was talking about us without us! Make sure your message evokes the values you want people to share with each other, either directly or indirectly.
  • Start with the principle. This is a topic I’ve covered before, but you can't win unless you know what you're fighting for. Identify concrete, specific, perhaps even measurable goals that are tied directly to the values that motivate your efforts. As noted recently, Zohran Mamdani did this masterfully when running for mayor of New York City. While the values were affordability and the dignity of ordinary New Yorkers, the clear, understandable, measurable principle could be something as simple as “free buses”. This is a goal that everyone can get in 5 seconds, and can explain to their neighbor the first time they hear it. It’s a story that travels effortlessly on its own — and that people will be able to verify very easily when it’s been delivered. That’s a perfect encapsulation of “talk about us without us”.
  • Know what makes you unique. Another way of putting this is to simply make sure that you have a sense of self-awareness. But the story you tell about your work or your movement has to be specific. There can’t be platitudes or generalities or vague assertions as a core part of the message, or it will never take off. One of the most common failure states for this mistake is when people lean on slogans. Slogans can have their use in a campaign, for reminding people about the existence of a brand, or supporting broader messaging. But very often, people think a slogan is a story. The problem is that, while slogans are definitely repeatable, slogans are almost definitionally too vague and broad to offer a specific and unique narrative that will resonate. There’s no point in having people share something if it doesn’t say something. I usually articulate the challenge here like this: Only say what only you can say.
  • Be evocative, not comprehensive. Many times, when people are passionate about a topic or a movement, the temptation they have in telling the story is to work in every little detail about the subject. They often think, “if I include every detail, it will persuade more people, because they’ll know that I’m an expert, or it will convince them that I’ve thought of everything!” In reality, when people are not subject matter experts on a topic, or if they’re not already intrinsically interested in that topic, hearing a bunch of extensive minutiae about it will almost always leave them feeling bored, confused, intimidated, condescended-to, or some combination of all of these. Instead, pick a small subset of the most emotionally gripping parts of your story, the aspects that have the deepest human connection or greatest relevance and specificity to the broadest set of your audience, and focus on telling those parts of the story as passionately as possible. If you succeed in communicating that initial small subset of your story effectively, then you may earn the chance to tell the other more complex and nuanced details of your story.
  • Your enemies are your friends. Very often, when people are creating messages about advocacy, they’re focused on competition or rivals. In the political realm, this can be literal opposing candidates, or the abstraction of another political party. In the corporate world, this can be (real or imagined) competitive products or companies. In many cases, these other organizations or products or competitors occupy so much more mental space in your mind, or your team’s mind, than they do in the mind of your potential audience. Some of your audience has never heard of them at all. And a huge part of your audience thinks of you and your biggest rival as… basically the same thing. In a business or commercial context, customers can barely keep straight the difference between you and your competition — you’re both just part of the same amorphous blob that exists as “the things that occupy that space”. Your competitor may be the only other organization in the world that’s fighting just as hard as you are to create a market for the product that you’re selling. The same is true in the political space; sometimes the biggest friction arises over the narcissism of small differences. What we can take away from these perspectives is that our stories have to focus on what distinguishes us, yes, but also on what we might have in common with those whom we might otherwise have perceived to have been aligned with the “enemy”. Those folks might not have sworn allegiance to an opposing force; they may simply have chosen another option out of convenience, and not even seen that choice as being in opposition to your story at all.
  • Find joy in repetition. Done correctly, a disciplined, collaborative, evocative message can become a mantra for a community. There’s a pride and enthusiasm that can come from people becoming proficient in sharing their own version of the collective story. And that means enjoying when that refrain comes back around, or when a slight improvement in the core message is discovered, and everyone finds a way to refine the way they’re communicating about the narrative. A lot of times, people worry that their team will get bored if they’re “just telling the same story over and over all the time”. In reality, as a brilliant man once said, there’s joy in repetition.
  • Don’t obsess over exact wording. This one is tricky; you might say, “but you said we have to be disciplined communicators!” And it’s true: it’s important to be disciplined. But that doesn’t mean you can’t leave room for people to put their own spin on things. Let them translate to their own languages or communities. Let them augment a general principle with a specific, personal connection. If they have their own authentic experience which will amplify a story or drive a point home, let them weave that context into the consistent narrative that’s been shared over time. As long as you’re not enabling a “telephone game” where the story starts to morph into an unrecognizable form, it’s perfectly okay to add a human touch by going slightly off script.

Share the story

Few things are more rewarding than when you find a meaningful narrative that resonates with the world. Stories have the power to change things, to make people feel empowered, to galvanize entire communities into taking action and recognizing their own power. There’s also a quiet reward in the craft and creativity of working on a story that travels, in finding notes that resonate with others, and in challenging yourself to get far enough out of your own head to get into someone else’s heart.

I still have so much to learn about being able to tell stories effectively. I still screw it up so much of the time, and I can look back on many times when I wish I had better words at hand for moments that sorely needed them. But many of the most meaningful and rewarding moments of my life have been when I’ve gotten to be in community with others, as we were not just sharing stories together, but telling a united story together. It unlocks a special kind of creativity that’s a lot bigger than what any one of us can do alone.

The Land of Giants, a conceptual proposal to build power line towers so that they look like people.

Accessible by Design: The Role of the 'lang' Attribute

by Todd Libby

When starting a project, whether it is an application, a mobile app or site, or just a website in general, I still see an alarming number of examples where the language attribute is not included in the <html> element. Not the !DOCTYPE, but the element directly after the DOCTYPE.

I have audited many sites and many frameworks in the past, and I have noticed this omission right from the outset when developers are building sites or applications, especially in the mobile space. And let's face it, in web development we focus on making things for ourselves, and if it works on our computer, it must work everywhere! Right?

I see it more often these days. Surveys show that accessibility education in universities and boot camps is still lacking. New developers entering the field aren't aware of the issue, and framework authors either don't know, don't understand, or just don't make their work accessible.

I am here to discuss the importance of the language attribute in your code.

The Attribute and the Importance of the Language Used

Sometimes, a tiny detail can make or break the experience for millions of users. One of these tiny, powerful details is the lang attribute in your HTML.

The lang attribute is a simple piece of code that tells web browsers and screen readers what human language your page is written in. For example:

<html lang="en"> means the page is in English.
<html lang="es"> means the page is in Spanish.

When you forget this attribute, you're not just missing a semantic tag—you're creating a major accessibility barrier. If you don't tell the computer what language you're using, assistive tools won't know how to read your content correctly.

There Is Data Here and You Should Read It

The WebAIM Million Report is an annual accessibility evaluation of the top one million homepages on the internet, conducted by WebAIM. 2025 marked the seventh year this has been done and the results are not surprising.

Let's show the data for the language attribute.

[Figure: A graph showing the top six accessibility issues found in the top one million websites by WebAIM. Low contrast of text is number one followed by missing alt text, missing labels, empty links, empty buttons and finally missing language attribute.]

For the seventh year in a row, a missing document language made the list.

[Figure: A graph showing the top six accessibility issues found in the top one million websites by WebAIM by year starting in 2019 up to 2025. Low contrast of text is number one followed by missing alt text, missing labels, empty links, empty buttons and finally missing language attribute.]

As with the rest of the items in the data, it has been a common theme the last seven years. Missing language attribute has always been the last item on the repeating list of common failures. So what are the implications?

A numerical look shows the data is still trending to the same six problems in the report. So why is it that these issues are the ones that stay in the top six?

[Figure: The WebAIM Million report showing the percentage of top million websites tested and the percentage of those with issues.]
[Figure: The WebAIM Million Report showing low contrast of text at 79.1% followed by missing alternative text for images at 55.5%, missing form input labels at 48.2%, empty links at 45.4%, empty buttons at 29.6%, and finally missing language attribute at 15.8%.]

What Happens When the Language is Missing? The Wrong Voice Problem

The main group affected by a missing lang attribute is screen reader users. Screen readers are essential tools that read web content aloud. They're mainly used by people who are blind or have low vision, and by others who rely on text-to-speech. They are also used by people who find reading difficult for other reasons; this is common among people with ADHD (Attention-Deficit/Hyperactivity Disorder).

Screen readers don't just use one voice; they use specialized software packages for each language. This software knows the pronunciation rules, rhythm, and stress for English, French, Japanese, etc.

When your page is missing the lang attribute, the screen reader has to guess the language. It usually guesses based on the user's computer settings (for example, if the user's system language is German, the screen reader will try to use the German voice).

Example: English Text Read by a German Voice

Imagine your entire website is in clear English. If a German screen reader tries to read it, it will apply German pronunciation rules.

  • “The” might sound like “Tee-hay.”
  • “Data” might be pronounced with a hard ‘A’ sound instead of a soft one.

The result is garbled, unnatural, and often unintelligible speech. The text is still on the page, but for the screen reader user, the content is lost. They cannot understand your article, buy your product, or use your service.

This single small mistake transforms your helpful website into a frustrating, unusable experience.

It's a Rule, Not a Suggestion (WCAG)

Using the lang attribute isn't just a friendly suggestion; it's a core requirement for making your website accessible.

The Web Content Accessibility Guidelines (WCAG) are the international standard for web accessibility. WCAG Success Criterion 3.1.1 (Language of Page) requires that the default human language of each page can be programmatically determined. This is a Level A requirement, the minimum conformance level, which means it's mandatory for basic accessibility.

If your website fails this check, it does not conform to WCAG at any level.
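
A check like this is easy to automate. The following is a minimal sketch in TypeScript, assuming the page source is available as a string; the function name getDocumentLang is hypothetical, and the regular expressions are a rough stand-in for a real HTML parser, so treat it as an illustration rather than an audit tool.

```typescript
// Returns the value of the lang attribute on the opening <html> tag,
// or null when the tag is missing or has no non-empty lang attribute.
// Assumption: a regex is "good enough" here; a real audit would use an HTML parser.
function getDocumentLang(html: string): string | null {
  // Find the opening <html ...> tag (case-insensitive).
  const tagMatch = /<html\b[^>]*>/i.exec(html);
  if (tagMatch === null) {
    return null;
  }
  // Look for lang="..." or lang='...' inside that tag only.
  // The leading whitespace keeps attributes such as xml:lang or data-lang from matching.
  const langMatch = /\slang\s*=\s*["']([^"']*)["']/i.exec(tagMatch[0]);
  if (langMatch === null) {
    return null;
  }
  const value = (langMatch[1] || "").trim();
  return value.length > 0 ? value : null;
}
```

Calling getDocumentLang on a page that starts with a bare <html> tag returns null, which is the failure case described above. An empty lang="" is treated as missing as well.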

How It Affects Other Tools

The lang attribute helps more than just screen readers:

1. Braille Displays

A refreshable braille display translates text into small patterns of raised bumps. Different languages use different contraction rules in braille (called Grade 2 braille). If the language is not set, the braille translator might use the wrong rules, turning clear text into meaningless gibberish for the braille reader.

2. Automated Translation

When a user relies on tools like Google Translate or a browser's built-in translation feature, telling the tool the source language (the language you wrote it in) ensures a much more accurate translation. If the source language is unclear, the translation quality drops sharply. An example can be found here.

3. Quotation Marks

The lang attribute helps the browser and other user agents select the correct typographical glyphs for quotation marks. This matters most for the <q> element, and for <blockquote> elements styled with CSS generated content such as content: open-quote. For example:

  • In English (lang="en"), quotes are typically “double quotes”.

  • In German (lang="de"), they are often rendered as „low-9 quotes“.

  • In French (lang="fr"), they use « guillemets ».

Beyond the visual quotation marks, providing the correct language helps assistive technologies pronounce the surrounding text accurately, ensuring a fluid and comprehensible reading experience.

Without the correct language, browsers may default to the user's system language or a neutral setting for quotation marks. That default may not match the document's language, which results in incorrect or confusing typography (e.g., using English quote marks for German text).

Without a declared language, a screen reader may also attempt to read the text using incorrect phonetic rules, voice, and accent, which makes the content sound like gibberish and can make it incomprehensible for users who rely on audio output.
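
To make the language dependence concrete, here is a small TypeScript sketch of the kind of lookup a user agent performs. It is an illustration only, not how any browser actually implements it: the QUOTES table, the FALLBACK value, and the quote function are hypothetical names, and the straight-quote fallback is an assumption standing in for the "neutral setting" mentioned above.

```typescript
// Opening and closing quotation marks for the three languages used as examples above.
const QUOTES: { [lang: string]: [string, string] } = {
  en: ["“", "”"],
  de: ["„", "“"],
  fr: ["« ", " »"],
};

// Assumption: plain straight quotes stand in for a "neutral" default.
const FALLBACK: [string, string] = ['"', '"'];

// Wrap text in the quotation marks that match the declared language.
function quote(text: string, lang?: string): string {
  // Reduce a tag such as "de-AT" to its primary subtag "de".
  const primary = (lang || "").toLowerCase().split("-")[0] || "";
  const known = Object.prototype.hasOwnProperty.call(QUOTES, primary);
  // A missing or unknown language falls back to the neutral marks.
  const marks = (known && QUOTES[primary]) || FALLBACK;
  return marks[0] + text + marks[1];
}
```

With lang set to "de" the text comes back in „low-9 quotes“, while a call without a language gets the neutral fallback, which is the mismatch the section describes.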

4. Hyphenation

Proper hyphenation is entirely language-dependent. Hyphenation rules can be complex and unique to each language. When the CSS declaration hyphens: auto is used, the browser or user agent relies on the lang attribute to load the appropriate hyphenation dictionary and apply the correct linguistic rules, which can improve text flow and readability, especially in justified or narrow columns.

For example, with lang="de", a long German compound word such as Rechtsschutzversicherungsgesellschaften (which means insurance companies providing legal protection) will be broken according to German rules.

Most browsers do not provide automatic hyphenation if the language is not declared. This can not only lead to unsightly text blocks with excessive white space between words, but also horizontal scrolling or overflow on mobile devices which severely impacts readability and layout stability.

If the browser attempts to guess the language or uses the wrong default, it could apply the incorrect hyphenation rules, which breaks words in places that are linguistically wrong, which, in turn, confuses the reader.

What About Pages with Two Languages?

What if your page is mostly English but includes a quote in French? If you don't do anything, the screen reader will read the French quote using the English voice, again leading to mispronunciation.

You can fix this instantly by adding the lang attribute to the specific element that changes language:

<p lang="en">
The artist once said, "Always remember this phrase:
<span lang="fr">Je ne regrette rien.</span>" I think that sums up his career.
</p>

In this code, the screen reader switches to the French voice for the quote and then immediately switches back to the English voice for the rest of the sentence. This small change ensures all users hear the content exactly as intended.
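
If your content is assembled in code rather than written by hand, the same markup can be generated. The TypeScript below is a minimal sketch under that assumption; escapeHtml and withLang are hypothetical helper names, not part of any framework.

```typescript
// Escape the characters that are special in HTML text and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap a phrase in a <span> that declares the phrase's own language,
// so assistive technology can switch voices for just that phrase.
function withLang(phrase: string, lang: string): string {
  return '<span lang="' + escapeHtml(lang) + '">' + escapeHtml(phrase) + "</span>";
}
```

For the sentence above, withLang("Je ne regrette rien.", "fr") produces the same span shown in the hand-written example.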

How to Set the Language in Modern Web Frameworks

In modern websites built with tools like React, Vue, or Angular, you usually don't touch the main HTML file very often. Since these tools mostly control the content inside the <body> tag, you have to know where to find the root template file to set the lang attribute correctly.

For example, React uses the file public/index.html, so you would place the attribute directly on the <html> tag in that file.

Framework        | What File to Edit                     | Where to Put the Code
React            | public/index.html                     | Directly on the <html> tag in that file.
Next.js          | app/layout.tsx (or similar root file) | Set the lang in the JSX for the root <html> element.
Vue              | public/index.html                     | Directly on the <html> tag in that file.
Nuxt             | nuxt.config.ts                        | Inside the app.head.htmlAttrs setting in your config file.
Angular          | src/index.html                        | Directly on the <html> tag in that file.
Svelte/SvelteKit | index.html or src/app.html            | Directly on the <html> tag in the main template file.
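
Nuxt is the one entry above where the attribute lives in configuration rather than markup. Here is a minimal sketch of that setting, assuming a Nuxt 3 style config. The variable name nuxtConfigSketch is hypothetical and only exists so the snippet stands on its own; in a real nuxt.config.ts the same object would be passed to defineNuxtConfig() and exported as the default export.

```typescript
// Sketch of the relevant part of nuxt.config.ts.
// Assumption: in a real project this object is wrapped as
//   export default defineNuxtConfig({ ... })
// which Nuxt provides; it is left out here so the snippet is self-contained.
const nuxtConfigSketch = {
  app: {
    head: {
      // Attributes placed on the root <html> element of every page.
      htmlAttrs: {
        lang: "en",
      },
    },
  },
};
```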

Example: Setting the Language in a Static Template

For most simple apps (React, Angular, plain HTML), you will open your main index.html file and change the opening <html> tag like this:

<!DOCTYPE html>
<!-- Change the tag below from <html> so it declares the correct language code -->
<html lang="en">
<head>
<!-- ... -->
</head>
<body>
<!-- Your app code loads here -->
</body>
</html>

Conclusion

The lang attribute is a tiny line of code that provides universal access to your content. It's arguably the easiest, fastest, and most impactful accessibility fix you can make on any website.

By correctly setting the language, you ensure that everyone has equal access to your content. Regardless of whether they use a screen reader, braille display, or translation tool to do so, their tools have the fundamental information they need to do their jobs correctly. It's a simple commitment that makes the web better for everyone.

Don't let a missing two-letter code turn your content into a foreign language for your users and don't be afraid to use it or add it in!

Magic Magikarp Makes Moves

A picture of a life sized magikarp from pokemon

One of the most influential inventions of the 20th century was Big Mouth Billy Bass. A celebrity bigger than the biggest politicians or richest movie stars, there’s almost nothing that could beat Billy. That is, until [Kiara] from Kiara’s Workshop built a Magikarp version of Big Mouth Billy Bass.

Sizing in at over 2 entire feet, the orange k-carp is able to dance, it is able to sing, and it is able to stun the crowd. Magikarp functions the same way as its predecessor; a small button underneath allows the show to commence. Of course, this did not come without its challenges.

Starting the project was easy, just a model found online and some Blender fun to create a basic mold. Dissecting Big Mouth Billy Bass gave direct inspiration for how to construct the new idol in terms of servos and joints. Programming wasn’t even all that much with the use of Bottango for animations. Filling the mold with the silicone filling proved to be a bit more of a challenge.

After multiple attempts with some minor variations in procedure, [Kiara] got the fish star’s skin just right. All it took was a paint job and some foam filling to get the final touches. While this wasn’t the most mechanically challenging animatronic project, we have seen our fair share of more advanced mechanics. For example, check out this animatronic that sees through its own eyes!

Learning How Learning Works

Is it possible for large language models (LLMs) to successfully learn non-English languages?

That’s the question at the center of an ongoing debate among linguists and data scientists. However, the answer isn’t just a matter of scholarly research. The ability or inability of LLMs to learn so-called “impossible” languages has broader implications in terms of both how LLMs learn and the global societal impacts of LLMs.

Languages that deviate from natural linguistic structures, which are referred to as impossible languages, typically fall into two categories. The first is not a true language, but an artificially constructed language that contains arbitrary rules that cannot be followed and still make sense. The other category includes languages that include non-standard characters or grammar, such as Chinese and Japanese.

Low-resource languages, meaning those with limited training data, such as Lao, often face similar challenges to impossible languages. However, they are not considered to be impossible languages unless they also include non-standard characters, such as Burmese.

Revisiting impossible languages

In 2023, Noam Chomsky, considered the founder of modern linguistics, wrote that LLMs “learn humanly possible and humanly impossible languages with equal facility.”

However, in the Mission: Impossible Language Models paper that received a Best Paper award at the 2024 Association for Computational Linguistics (ACL) conference, researchers shared the results of their testing of Chomsky’s theory, having discovered that language models actually struggle to learn impossible languages compared with natural ones.

Rogers Jeffrey Leo John, CTO of DataChat Inc., a company that he cofounded while working at the University of Wisconsin as a data science researcher, said the Mission: Impossible paper challenged the idea that LLMs can learn impossible languages as effectively as natural ones.

“The models [studied for the paper] exhibited clear difficulties in acquiring and processing languages that deviate significantly from natural linguistic structures,” said John. “Further, the researchers’ findings support the idea that certain linguistic structures are universally preferred or more learnable both by humans and machines, highlighting the importance of natural language patterns in model training. This finding could also explain why LLMs, and even humans, can grasp certain languages easily and not others.”

Measuring the difficulty of an LLM learning a language

An LLM’s fluency in a language falls on a broad spectrum, from predicting the next word in a partial sentence to answering a question. Individual users and researchers also bring different definitions and expectations of fluency to the table. Understanding LLMs’ issues with processing impossible languages therefore starts with defining how the researchers, and linguists in general, determine whether a language is difficult for an LLM to learn.

Kartik Talamadupula, a Distinguished Architect (AI) at Oracle who previously was head of Artificial Intelligence at Wand Synthesis AI, an AI platform integrating AI agents with human teams, said that when measuring the ability of an LLM, the bar is always how well it predicts the next token (or word).

“Behavior like ‘answering questions’ or ‘logical reasoning’ or any of the other things that are ascribed to LLMs are just human interpretations of this token completion behavior,” said Talamadupula. “Training on additional data for a given language will only make the model more accurate in terms of predicting that next token, and sequentially, the set of all next tokens, in that particular language.”

John explained that when a model internalizes statistical patterns through probabilities of how words, phrases, and complex ideas co-occur, based on exposure to billions or trillions of examples, it can model syntax, infer semantics, and even mimic reasoning. With this skill mastered in a language, the LLM then uses it as a powerful training signal.

“If a model sees enough questions and answers in its training data, it can learn: When a sentence starts with ‘What is the capital of France?’, the next few tokens are likely to be ‘The capital of France is Paris,’” said John. “Other capabilities, like question-answering, summarization, [and] translation can all emerge from that next-word prediction task, especially if you fine-tune or prompt the model in the right way.”
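The next-token idea that Talamadupula and John describe can be illustrated with a deliberately tiny sketch. This is not how an LLM is built: it is a bigram counter over a hypothetical three-sentence corpus invented for this example, and it only shows how counts of what follows what become a probability distribution over the next token.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "the capital of France is Paris",
    "the capital of Japan is Tokyo",
    "the capital of France is Paris",
]
model = train_bigram(corpus)
probs = next_token_probs(model, "is")
best = max(probs, key=probs.get)  # most likely token after "is"
print(probs, best)
```

Because "Paris" follows "is" in two of the three sentences, it receives the higher probability. A real model conditions on far more context than one preceding word, but the training signal is the same kind of prediction.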

Sanmi Koyejo, an assistant professor of computer science at Stanford University, said researchers also measure how quickly (in terms of training steps) a model reaches a certain performance threshold when determining if a language is difficult to learn or not. He said the Mission: Impossible paper demonstrated that for AIs to learn impossible languages, they often need more training on the data to reach performance levels comparable to those of other languages.

Low volume of training data increases difficulty

An LLM learns everything, including language and grammar, through training data. If a topic or language does not have sufficient training data, the LLM’s ability to learn it is significantly limited. The majority of high-quality training data is currently in Chinese and English, and many non-standard languages are impossible for LLMs to effectively learn, due to the lack of sufficient data.

Talamadupula said that non-standard languages such as Korean, Japanese, and Hindi often have the same issue as low-resource languages with standard characters: not having enough data for training. This dearth of data makes it difficult to accurately model the probability of next-token generation. When asked about the challenge of non-Western languages understanding implied subjects, he said that LLMs do not actually understand a subject in a sentence.

“Based on their training data, they just model the probability that a given token, or word, will follow a set of tokens that have already been generated. The more data that is available in a given language, the more accurate the ‘completion’ of a sentence is going to be,” he said.

“If we were to somehow balance all the data available and train a model on a regimen of balanced data across languages, then the model would have the same error and accuracy profiles across languages,” said Talamadupula.

John agreed that because the ability of an LLM to learn a language stems from probability distributions, both the volume and quality of training data significantly influence how well an LLM performs across different languages. Because English and Chinese content dominate most training datasets, LLMs have a higher fluency, deeper knowledge, and stronger capabilities in those languages.

“Ultimately, this stems from how LLMs learn languages—through probability distributions. They develop linguistic understanding by being exposed to examples. If a model sees only a few thousand instances of a language, like Xhosa, compared to trillions of English tokens, it ends up learning unreliable token-level probabilities, misses subtleties in grammar and idiomatic usage, and struggles to form strong conceptual links between ideas and their linguistic representations,” said John.
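A rough way to see why a few thousand examples give unreliable probabilities is the standard error of an estimated proportion, which shrinks with the square root of the sample size. The sketch below makes a strong simplifying assumption that each observation is independent, which real text is not, and the probability 0.3 is a hypothetical value chosen only for illustration.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

true_p = 0.3  # hypothetical true probability of some next token
for n in (1_000, 1_000_000, 1_000_000_000):
    # More observations of the same context give a tighter estimate.
    print(f"{n:>13,} samples -> standard error {standard_error(true_p, n):.6f}")
```

Every thousandfold increase in data cuts the uncertainty by a factor of roughly 32, which is one intuition for the gap John describes between sparsely and richly represented languages.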

Language structure also affects the ability to learn

Research also increasingly shows that the structure of the target language plays a role. Koyejo said the Mission: Impossible paper supports the idea that information locality (related words being close together) is an important property that makes languages learnable by both humans and machines.

“When testing various impossible languages, the researchers of the Mission: Impossible Language Models paper found that randomly shuffled languages (which completely destroys locality) were the hardest for models to learn, showing the highest perplexity scores,” said Koyejo. The Mission: Impossible paper defined perplexity as a coarse-grained metric of language learning. Koyejo also explained that languages created with local ‘shuffles’, where words were rearranged only within small windows, were easier for models to learn than languages with global shuffles.
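For readers unfamiliar with the metric, perplexity is commonly computed as the exponential of the average negative log-probability a model assigns to each token that actually occurs. The sketch below uses made-up probabilities, not figures from the paper, to show that a model that guesses uniformly among four options scores a perplexity of about 4, while a confident model scores close to 1.

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to the tokens that really came next.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
guessing = perplexity([0.25, 0.25, 0.25, 0.25])
print(confident, guessing)
```

Lower is better: a higher perplexity means the model was, on average, more surprised by the text it was asked to predict.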

“The smaller the window size, the easier the language was to learn, suggesting that preserving some degree of locality makes a language more learnable,” said Koyejo. “The researchers observed a clear gradient of difficulty—from English (high locality) → local shuffles → even-odd shuffles → deterministic shuffles → random shuffles (no locality). This gradient strongly suggests that information locality is a key determinant of learnability.”
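The difference between local and global shuffles can be sketched in a few lines. This is an illustrative approximation, not the paper's exact procedure: the function name, the seeded random shuffle, and the placeholder tokens are all assumptions made for this example. The point is only that a small window bounds how far any word can drift from its original neighbors.

```python
import random

def local_shuffle(tokens, window, seed=0):
    """Shuffle tokens independently inside consecutive windows of the given size."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return out

def max_displacement(original, shuffled):
    """Largest distance any token moved (assumes all tokens are unique)."""
    return max(abs(i - shuffled.index(tok)) for i, tok in enumerate(original))

tokens = [f"w{i}" for i in range(12)]            # placeholder "words"
near = local_shuffle(tokens, window=3)           # locality mostly preserved
far = local_shuffle(tokens, window=len(tokens))  # one window covers everything
print(max_displacement(tokens, near), max_displacement(tokens, far))
```

With a window of 3, no token can move more than two positions, whereas a single window spanning the whole sequence allows a token to land anywhere.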

Koyejo also pointed out that another critical element for a model learning a non-standard language is tokenization, with the character systems of East Asian languages creating special challenges. For example, Japanese mixes multiple writing systems, and the Korean alphabet groups its letters into syllable blocks. He said that progress in those languages will require increased data and architectural innovations that better suit their unique properties.

“Neither language uses spaces between words consistently. This means standard tokenization methods often produce sub-optimal token divisions, creating inefficiencies in model learning,” said Koyejo. “Our studies on Vietnamese, which shares some structural properties with East Asian languages, highlight how proper tokenization dramatically affects model performance.”
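A naive example makes the spacing problem concrete. Production LLM tokenizers are typically subword-based rather than whitespace-based, so the sketch below is only an illustration of why word boundaries cannot be read off the text when a script does not mark them with spaces. The two sample sentences are chosen for this example and both mean "I am a student".

```python
def whitespace_tokenize(text):
    """Split on whitespace, the simplest possible word-level tokenizer."""
    return text.split()

def char_tokenize(text):
    """Fall back to one token per character."""
    return list(text)

english = "I am a student"
japanese = "私は学生です"  # same meaning, written without spaces

print(whitespace_tokenize(english))   # four word tokens
print(whitespace_tokenize(japanese))  # the whole sentence as a single token
print(char_tokenize(japanese))        # six single-character tokens
```

Neither extreme is useful for the Japanese sentence: one giant token carries no reusable structure, and single characters discard word-level groupings, which is why the choice of tokenization scheme matters so much.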

Insights into learning

The challenge of LLMs learning non-standard languages is both interesting and impactful, and the issues provide key insights into how LLMs actually learn. The Mission: Impossible Language Models paper also reaches this conclusion, stating, “We argue that there is great value in treating LLMs as a comparative system for human languages in understanding what systems like LLMs can and cannot learn.”

Aaron Andalman, chief science officer and co-founder of Cognitiv and a former MIT neuroscientist, expanded on the paper’s conclusion, adding that LLMs don’t merely learn linguistic structures but also implicitly develop substantial knowledge about the world during training, which gives them a deeper understanding of the languages.

“Effective language processing requires understanding context, which encompasses concepts, relationships, facts, and logical reasoning about real-world situations,” said Andalman. “Consequently, as models grow larger and undergo more extensive training, they accumulate more extensive and nuanced world knowledge.”

Fizz Buzz in CSS

What is the smallest CSS code we can write to print the Fizz Buzz sequence? I think it can be done in four lines of CSS as shown below:

li { counter-increment: n }
li:not(:nth-child(5n))::before { content: counter(n) }
li:nth-child(3n)::before { content: "Fizz" }
li:nth-child(5n)::after { content: "Buzz" }

Here is a complete working example: css-fizz-buzz.html.
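To see why the four rules produce the right sequence, it can help to model them outside the browser. The Python sketch below is only a hand-written simulation of the cascade, under the assumptions that every li is a sibling in one list with no other element children and that the counter n is not reset anywhere. The function name is made up for this illustration.

```python
def css_fizz_buzz(n):
    """Text rendered for the n-th <li> (1-based) under the four CSS rules."""
    before = ""
    after = ""
    if n % 5 != 0:
        before = str(n)   # li:not(:nth-child(5n))::before { content: counter(n) }
    if n % 3 == 0:
        before = "Fizz"   # li:nth-child(3n)::before, later rule of equal specificity wins
    if n % 5 == 0:
        after = "Buzz"    # li:nth-child(5n)::after
    return before + after

sequence = [css_fizz_buzz(i) for i in range(1, 16)]
print(" ".join(sequence))
```

The order of the second and third rules matters: both target ::before with the same specificity, so for multiples of 3 the later "Fizz" declaration overrides the counter. Multiples of 5 never get a number because the :not() selector excludes them.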

I am neither a web developer nor a code-golfer. I am just an ordinary programmer playing on the sea-shore and diverting myself in now and then finding a rougher pebble or an uglier shell than ordinary, whilst the great ocean of absurd contraptions lay all undiscovered before me.

Seasoned code-golfers looking for a challenge can probably shrink this solution further. However, such wizards are also likely to scoff at any mention of counting lines of code, since this mind sport treats such measures as pointless when all of CSS can be collapsed into a single line. The number of bytes is probably more meaningful. The code can also be minified slightly by removing all whitespace:

$ curl -sS https://susam.net/css-fizz-buzz.html | sed -n '/counter/,/after/p' | tr -d '[:space:]'
li{counter-increment:n}li:not(:nth-child(5n))::before{content:counter(n)}li:nth-child(3n)::before{content:"Fizz"}li:nth-child(5n)::after{content:"Buzz"}

This minified version is composed of 152 characters:

$ curl -sS https://susam.net/css-fizz-buzz.html | sed -n '/counter/,/after/p' | tr -d '[:space:]' | wc -c
152
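The same count can be reproduced offline. The sketch below pastes the four rules into a Python string instead of fetching the page, strips all whitespace, and measures the result. It assumes the rules contain only ASCII characters, so the character count and the byte count reported by wc -c agree.

```python
css = """
li { counter-increment: n }
li:not(:nth-child(5n))::before { content: counter(n) }
li:nth-child(3n)::before { content: "Fizz" }
li:nth-child(5n)::after { content: "Buzz" }
"""

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces removes every space and newline.
minified = "".join(css.split())
size = len(minified.encode("utf-8"))
print(size)
```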

If you manage to create a shorter solution, please do leave a comment.
