Testing suggests Google's AI Overviews tell millions of lies per hour


Looking up information on Google today means confronting AI Overviews, the Gemini-powered search robot that appears at the top of the results page. AI Overviews has had a rough time since its 2024 launch, attracting user ire over its scattershot accuracy, but it's getting better and usually provides the right answer. That's a low bar, though. A new analysis from The New York Times attempted to assess the accuracy of AI Overviews, finding it's right 90 percent of the time. The flip side is that 1 in 10 AI answers is wrong, and for Google, that means hundreds of thousands of lies going out every minute of the day.

The Times conducted this analysis with the help of a startup called Oumi, which itself is deeply involved in developing AI models. The company used AI tools to probe AI Overviews with the SimpleQA evaluation, a common test to rank the factuality of generative models like Gemini. Released by OpenAI in 2024, SimpleQA is essentially a list of more than 4,000 questions with verifiable answers that can be fed into an AI.

Oumi began running its test last year, when Gemini 2.5 was still Google's best model. At the time, the benchmark showed an 85 percent accuracy rate. When the test was rerun following the Gemini 3 update, AI Overviews answered 91 percent of the questions correctly. Extrapolate that miss rate out to all Google searches, and AI Overviews is generating tens of millions of incorrect answers per day.
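For a rough sense of how that extrapolation works, here is a minimal sketch. The error rate comes from the benchmark rerun described above; the daily search volume and the share of searches that actually trigger an AI Overview are illustrative placeholders, not figures from the Times or Oumi, and the resulting totals depend entirely on what you plug in.

# Back-of-the-envelope extrapolation of an error rate into a volume of wrong answers.
# The search volume and overview share are illustrative placeholders only.

ERROR_RATE = 0.09  # ~91 percent accuracy on the SimpleQA rerun described above

def wrong_overviews(searches_per_day: float, overview_share: float) -> dict:
    wrong_per_day = searches_per_day * overview_share * ERROR_RATE
    return {
        "per day": wrong_per_day,
        "per hour": wrong_per_day / 24,
        "per minute": wrong_per_day / (24 * 60),
    }

# Example with made-up inputs: 10 billion searches a day, an Overview on 20% of them.
for period, count in wrong_overviews(10e9, 0.20).items():
    print(f"incorrect answers {period}: {count:,.0f}")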





The New York Times Got Played By A Telehealth Scam And Called It The Future Of AI


Since the New York Times published its semi-viral big profile of Medvi last week — the “AI-powered” telehealth startup that it breathlessly described as a “$1.8 billion company” supposedly run by just two brothers — I’ve had multiple friends and family members send me the article with some version of the same message: “Can you believe this guy built a billion-dollar company with AI? Why haven’t you done this?” The story is making the rounds and giving people the impression that with a ChatGPT account and a little bit of marketing know-how, you too could be raking in millions every month.

The problem is that most of the story is utter nonsense.

Let’s start with the headline number itself. The NYT admits — buried deep in the piece — that Medvi “has not raised outside funding” and “has no official valuation.” A company’s value is typically established by investors, an acquisition offer, or public market pricing. Medvi has none of those. What it has is a revenue run rate — a projection based on early-2026 sales extrapolated across a full year. Calling that a “$1.8 billion company” is like calling someone who found a twenty on the sidewalk a “future millionaire.” Any business reporter should know the difference. Even the NYT tips its hand:

Medvi is technically not a one-person $1 billion company, since Mr. Gallagher hired his brother and has some contractors. The start-up, which has not raised outside funding, also has no official valuation.

“Technically not” doing quite a bit of heavy lifting there.

But the misleading valuation is almost the least of it. Even if you accept revenue as the relevant metric, how sustainable is that run rate for a company that just got an FDA warning letter, is facing a class action lawsuit for spam, has a key partner being sued over allegations that a major product doesn’t actually work, and is operating in an industry that regulators are actively trying to rein in?

Oh, wait, did the NYT forget to mention all of those things? They sure did! Not to mention the legions of fake, apparently AI-generated doctors and patients who keep showing up in Medvi advertisements. Yes, the NYT eventually alludes to some of that, but it claims these were mere “shortcuts” that were fixed last year (they weren’t).

That said, you can feel the pull of the narrative that seduced the NYT: a scrappy founder with a rags-to-riches backstory, two brothers taking on the world, AI tools stitching it all together, Sam Altman himself anointing the achievement as proof that his prediction of a “one man, one billion dollar company, thanks to AI” was correct.

It’s a hell of a story. The problem is that almost none of it holds up to even the most basic scrutiny, and the fact that the New York Times — the New York Times — fell for it (or worse, didn’t care) is an embarrassment. As much as I’ve made fun of the NYT for its bad reporting over the years, this is (by far) the worst I’ve seen. They didn’t just misunderstand something, or try to push a misleading narrative; they got fully played on a bullshit story that any competent reporter or editor should have seen through from the jump. This one stinks from top to bottom.

Medvi’s success has very little to do with “AI” and quite a lot to do with fake doctors, deepfaked before-and-after photos, misleading ads, probable snake oil, and the kind of old-fashioned deceptive marketing that has been separating marks from their money for centuries. The only thing AI really “turbocharged” here was the company’s ability to generate bullshit at scale. Oh, and also the NYT somehow missed out on the FDA already investigating the company, as well as the multiple lawsuits accusing the company and its partners of extraordinarily bad behavior.

Let’s start with what the NYT actually published. Reporter Erin Griffith’s piece reads like a press release that the NYT re-formatted as a newspaper article:

Matthew Gallagher took just two months, $20,000 and more than a dozen artificial intelligence tools to get his start-up off the ground.

From his house in Los Angeles, Mr. Gallagher, 41, used A.I. to write the code for the software that powers his company, produce the website copy, generate the images and videos for ads and handle customer service. He created A.I. systems to analyze his business’s performance. And he outsourced the other stuff he couldn’t do himself.

His start-up, Medvi, a telehealth provider of GLP-1 weight-loss drugs, got 300 customers in its first month. In its second month, it gained 1,000 more. In 2025, Medvi’s first full year in business, the company generated $401 million in sales.

Mr. Gallagher then hired his only employee, his younger brother, Elliot. This year, they are on track to do $1.8 billion in sales.

A $1.8 billion company with just two employees? In the age of A.I., it’s increasingly possible.

And then, because no AI hype piece would be complete without the requisite papal blessing from San Francisco:

In an email, Mr. Altman said that it appeared he had won a bet with his tech C.E.O. friends over when such a company would appear, and that he “would like to meet the guy” who had done it.

Altman “would like to meet the guy.” Well of course he would! The NYT hand-delivered him the perfect anecdote for his next AI hype session. The reporter seemingly solicited that quote to validate a pre-existing thesis: “Sam Altman was right about one-person billion-dollar AI companies.” The fact that the company is a dumpster fire of regulatory violations and consumer fraud was, apparently, a secondary concern to the “Great Man and A Great AI” narrative of innovation. This piece was built around a thesis — Sam Altman was right — and then a company was located to prove it.

To its minimal credit, the NYT does kind of acknowledge — eventually, if you make it past the thirtieth paragraph — that things weren’t entirely on the up and up:

Medvi’s initial website featured photos of smiling models who looked AI-generated and before-and-after weight-loss photos from around the web with the faces changed. Some of its ads were AI slop. A scrolling ticker of mainstream media logos made it look as if Medvi had been featured in Bloomberg and The Times when it had merely advertised there.

I mean… shouldn’t that have raised at least one or two red flags within the NYT offices? Medvi’s website featured a scrolling ticker of media logos — including the New York Times logo — to make it look like these outlets had written about the company, when they hadn’t. A year ago, Futurism’s Maggie Harrison Dupré had even called this out directly (along with Medvi’s penchant for bullshit AI slop advertising).

Just underneath these images, MEDVi includes a rotating list of logos belonging to websites and news publishers, ranging from health hubs like Healthline to reputable publications like The New York Times, Bloomberg, and Forbes, among others — suggesting that MEDVi is reputable enough to have been covered by mainstream publications.

…. But… there was no sign of MEDVi coverage in the New York Times, Bloomberg, or the other outlets it mentioned.

And then, despite this, the New York Times went ahead and wrote the glowing profile that Medvi had been falsely claiming existed. The paper of record became the validation that the fake credibility ticker was trying to manufacture.

And the NYT frames all of what most people would consider to be “fraud” as mere “shortcuts” that the founder later “fixed.” Eighteen paragraphs after burying the admission, it reports:

That gave Matthew Gallagher breathing room to fix some shortcuts he had initially taken, like swapping out the before-and-after weight-loss photos for ones from real customers.

“Shortcuts.” Using deepfake technology to steal strangers’ weight-loss photos from across the internet, alter their faces with AI, give them fake names and fabricated health outcomes, and pass them off as your own satisfied customers — that’s a “shortcut.” Ctrl-F is a shortcut. This sounds more like fraud.

And it turns out those “shortcuts” hadn’t actually been fixed at all. As Futurism’s Dupré reported in a follow-up piece published after the NYT article:

As recently as last month, nearly a year after the NYT said that Medvi had cleaned up its act, an archived version of Medvi.org shows that it was again displaying before-and-after transformations of alleged customers. They bore the same names as before — “Melissa C,” “Sandra K,” and “Michael P” — and again listed how many pounds each person had purportedly lost and the related health improvements they apparently enjoyed.

Even though they had the same names, these people that the site now called “Medvi patients” now looked completely different from the original roundup of Melissas, Sandras, and Michaels. Worse, some of the images now bore clear signs of AI-generation: the new Sandra’s fingers, for example, are melted into her smartphone in one of her mirror selfies.

They kept the same fake names and the same fake weight-loss numbers but swapped in entirely different fake people. What the NYT claims was “fixing shortcuts” appears to actually be just “updating the con.”

A great takedown video by Voidzilla reveals that at least one set of original images appears to have been sourced from Reddit weight-loss forums with no connection to Medvi, and that even the modified images massively overstated how much weight the original poster claimed to have lost. And while Medvi later swapped out the photos for someone totally different, it kept the same name and the same false weight-loss claims.

And again, all of this was publicly known information that Griffith or her editors could have easily found with some basic journalism skills. We already mentioned that Futurism article from May of 2025, nearly a full year before the NYT piece ran. That investigation traced the deepfaked before-and-after photos back to their real sources, found that a doctor listed on Medvi’s site had no association with the company and demanded to be removed, and documented the AI-slop advertising. That investigation was widely available. A Google search would have found it.

But the fake photos and fraudulent branding are almost quaint compared to what the NYT chose not to mention at all. Six weeks before the NYT piece was published, the FDA sent Medvi a warning letter for misbranding its compounded drugs. The letter admonished Medvi for marketing its products in ways that falsely implied they were FDA-approved and for putting the “MEDVI” name on vial images in a way that suggested the company was the actual drug compounder. The letter warned:

Failure to adequately address any violations may result in legal action without further notice, including, without limitation, seizure and injunction.

The NYT did not mention this letter. And yes, Gallagher now insists that the FDA letter was targeting an affiliate that was using a nearly identical name, and it was that rogue affiliate that was the problem. But the letter is addressed to MEDVi LLC dba MEDVi, which is the name of his company. If he’s allowing affiliates to use his exact name, then that alone seems like a problem. Indeed, it certainly seems to highlight how this is all just, at best, a pyramid scheme of snake oil salesmen, where Gallagher has affiliates willing to deceive to sell more snake oil.

Separately, on March 20, 2026 — thirteen days before the NYT piece ran — a class action lawsuit was filed against Medvi in the Central District of California alleging that the company uses affiliate marketers to blast out deceptive spam emails with spoofed domains and falsified headers. The complaint alleges Medvi is responsible for over 100,000 spam emails per year to class members. The lawsuit seeks $1,000 per violating email.

The NYT did not mention this lawsuit either, even though it was yet another bit of evidence that either Medvi is up to bad shit, or it has a bunch of out-of-control affiliates potentially breaking laws left and right to increase sales.

And then there are the fake doctors. As Business Insider reported, a review of Meta’s ad library turned up thousands of active ads for Medvi promoted by accounts belonging to doctors who don’t appear to exist. Drug Discovery & Development found over 5,000 active ad campaigns for Medvi on Meta at the time of the NYT piece.

A Drug Discovery & Development review conducted on April 3 of MEDVi’s website, Facebook advertising and public records found a pattern of apparent AI-generated personas, including some presented with medical titles, alongside marketing practices that appeared to go beyond the issues identified so far by regulators. A search of Meta’s Ad Library for “medvi” returned more than 5,000 active ads, many of them running under fabricated physician personas. One Facebook page for “Dr. Robert Whitworth,” which ran sponsored ads for MEDVi’s QUAD erectile dysfunction product, was categorized as an “Entertainment website” and listed an address of “2015 Nutter Street, Cameron, MT, 64429,” a location that does not appear to exist. Other ads ran under names including “Professor Albust Dongledore” and “Dr. Richard Hörzgock,” used AI-generated video testimonials and recycled identical scripts across multiple fabricated personas. In several cases, the page displayed a doctor headshot while the ad itself featured an unrelated person delivering a patient testimonial.

After public scrutiny following the article, those fake doctor accounts started disappearing. In fact, Medvi’s own website fine print acknowledges the practice:

Individuals appearing in advertisements may be actors or AI portraying doctors and are not licensed medical professionals.

Seems like maybe something the NYT should have noticed?

Oh, and that same Drug Discovery and Development article highlights how other snake oil sales sites are using the same named doctors… but with totally different images.

Same names… different people. Drug Discovery and Development has a bit more info about Drs. Carr and Tenbrink:

MEDVi’s current site lists two physicians: Dr. Ana Lisa Carr and Dr. Kelly Tenbrink. Both are licensed doctors who work together at Ringside Health, a concierge practice in Wellington, Florida, that serves the equestrian community. Neither is identified on MEDVi’s site as being affiliated with Ringside Health. On MEDVi’s site, Dr. Tenbrink is listed under “American Board of Emergency Medicine.” Dr. Carr is listed under St. George’s University, School of Medicine, her medical school. The Florida Department of Health practitioner profiles for both physicians state that neither “hold any certifications from specialty boards recognized by the Florida board.” A search of the American Board of Emergency Medicine‘s public directory, which lists 48,863 certified members, returned no current affiliation for Dr. Tenbrink.

Did the NYT do any investigation at all? Serving the equestrian community?

Even the few real doctors Medvi claims to work with turn out to be questionable. From Futurism’s article from last May (again, something the NYT should have maybe checked on?):

We contacted each doctor to ask if they could confirm their involvement with MEDVi and NuHuman. We heard back from one of those medical professionals at the time of publishing, an osteopathic medicine practitioner named Tzvi Doron, who insisted that he had nothing to do with either company and “[needs] to have them remove me from their sites.”

Then there’s what a class action lawsuit filed last November against Medvi’s main partner, OpenLoop Health, alleges about the actual products being sold. The NYT frames OpenLoop as basically making what Gallagher is doing possible, noting that while Gallagher has his AI bots creating marketing copy, OpenLoop handles “doctors, pharmacies, shipping and compliance.” You know, the actual business.

So it seems kinda notable that way back in November of last year, this lawsuit was filed claiming that the compounded oral tirzepatide tablets — one of Medvi’s key offerings — are essentially pharmacologically inert when delivered as a pill. Tirzepatide (marketed as Zepbound by Eli Lilly) is an FDA-approved weight-loss drug in its injectable form. But OpenLoop and Medvi have apparently been selling it in pill form. And Eli Lilly says that there are no human studies, let alone clinical trials, involving any tirzepatide pills.

All of that seems like the kind of thing reporters from the NYT should point out.

What we actually have here is a marketing operation that used AI to automate the production of deceptive advertising at a scale and speed that would have been harder to achieve otherwise. Snake oil salesmen have existed forever. What AI gave Matthew Gallagher (and, I guess, his affiliates) was the ability to crank out fake doctors, fabricated testimonials, and deepfaked before-and-after photos faster than any human team could — and to do it cheap enough that a guy with $20,000 and no morals could build it from his house. That’s the actual AI story the Times should have written.

Being good at deceptive marketing while selling weight-loss and erectile dysfunction drugs online has been a thing since the dawn of email spam. The only novelty here is the tools used to do it. The New York Times just wrapped that up in a neat bow and presented it as the proof of Sam Altman’s big promises for AI.

For what it’s worth, Gallagher has been whining about all this on X, per Futurism’s Dupré:

Though Medvi has yet to respond to our questions, the company’s founder, Gallagher, has spent the last few days on X defending his company. He complained in one post — seemingly in reference to criticism — that “the most low t [testosterone] guys” are “the loudest online” and the “Karens of the internet.” In another post, he wrote that it’s “actually a little crazy the number of people who form a whole opinion from a headline and then publicly wish horrible things will happen.”

Ah yes. The guy complaining about “low t guys” and “karens on the internet” for questioning his “AI business” skills sure sounds like the trustworthy kind of businessperson who deserves a NYT puff piece.

The real issue now is what the New York Times plans to do about this. A standard correction noting a few missing details won’t cut it. The entire premise of the article — that this company represents the exciting realization of AI’s business potential — is nonsense. Every element of the narrative is tainted: the growth story is built on deceptive marketing, the product claims are contradicted by the FDA and the manufacturers of the actual drugs, the “$1.8 billion” figure is a projection with no valuation to back it up, and the company is currently facing legal action on multiple fronts. The entire article should be retracted.

The NYT says it “was given access to Medvi’s financials to verify its revenue and profits.” Great. They verified that a company engaged in widespread deceptive practices was, in fact, making money from those deceptive practices. Congrats to the NYT for auditing a snake oil salesman and presenting your findings as if he were an upstanding pharmaceutical salesman.

So to my friends and family members wondering why I haven’t built my own billion-dollar AI company: apparently the missing ingredient wasn’t AI — it was being willing to run a deepfake-powered spam operation selling potentially inert pills to desperate people. The AI just made the lying faster. And the New York Times made one guy appear respectable.


The Future of Everything is Lies, I Guess


This is a long article, so I'm breaking it up into a series of posts which will be released over the next few days. You can also read the full work as a PDF or EPUB; these files will be updated as each section is released.

This is a weird time to be alive.

I grew up on Asimov and Clarke, watching Star Trek and dreaming of intelligent machines. My dad’s library was full of books on computers. I spent camping trips reading about perceptrons and symbolic reasoning. I never imagined that the Turing test would fall within my lifetime. Nor did I imagine that I would feel so disheartened by it.

Around 2019 I attended a talk by one of the hyperscalers about their new cloud hardware for training Large Language Models (LLMs). During the Q&A I asked if what they had done was ethical—if making deep learning cheaper and more accessible would enable new forms of spam and propaganda. Since then, friends have been asking me what I make of all this “AI stuff”. I’ve been turning over the outline for this piece for years, but never sat down to complete it; I wanted to be well-read, precise, and thoroughly sourced. A half-decade later I’ve realized that the perfect essay will never happen, and I might as well get something out there.

This is bullshit about bullshit machines, and I mean it. It is neither balanced nor complete: others have covered ecological and intellectual property issues better than I could, and there is no shortage of boosterism online. Instead, I am trying to fill in the negative spaces in the discourse. “AI” is also a fractal territory; there are many places where I flatten complex stories in service of pithy polemic. I am not trying to make nuanced, accurate predictions, but to trace the potential risks and benefits at play.

Some of these ideas felt prescient in the 2010s and are now obvious. Others may be more novel, or not yet widely-heard. Some predictions will pan out, but others are wild speculation. I hope that regardless of your background or feelings on the current generation of ML systems, you find something interesting to think about.

What is “AI”, Really?

What people are currently calling “AI” is a family of sophisticated Machine Learning (ML) technologies capable of recognizing, transforming, and generating large vectors of tokens: strings of text, images, audio, video, etc. A model is a giant pile of linear algebra which acts on these vectors. Large Language Models, or LLMs, operate on natural language: they work by predicting statistically likely completions of an input string, much like a phone autocomplete. Other models are devoted to processing audio, video, or still images, or link multiple kinds of models together.1

Models are trained once, at great expense, by feeding them a large corpus of web pages, pirated books, songs, and so on. Once trained, a model can be run again and again cheaply. This is called inference.

Models do not (broadly speaking) learn over time. They can be tuned by their operators, or periodically rebuilt with new inputs or feedback from users and experts. Models also do not remember things intrinsically: when a chatbot references something you said an hour ago, it is because the entire chat history is fed to the model at every turn. Longer-term “memory” is achieved by asking the chatbot to summarize a conversation, and dumping that shorter summary into the input of every run.
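To make that statelessness concrete, here is a minimal sketch of how a chat loop typically works; the function names and message format are illustrative stand-ins, not any particular vendor's API.

# Minimal sketch of why a chatbot "remembers" earlier turns: the model itself is
# stateless, so the client re-sends the accumulated transcript on every call.
# generate_completion is a placeholder for whatever stateless model endpoint you use.

def generate_completion(prompt: str) -> str:
    """Stand-in for a stateless LLM call: text in, text out, no memory."""
    raise NotImplementedError  # swap in a real model call

history: list[str] = []

def chat(user_message: str) -> str:
    history.append(f"User: {user_message}")
    prompt = "\n".join(history) + "\nAssistant:"  # the entire transcript, every turn
    reply = generate_completion(prompt)
    history.append(f"Assistant: {reply}")
    return reply

def compact_history() -> None:
    # Longer-term "memory" is the same trick: summarize the transcript and
    # replace the history with the shorter summary.
    summary = generate_completion("Summarize this conversation:\n" + "\n".join(history))
    history[:] = [f"Summary of earlier conversation: {summary}"]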

Reality Fanfic

One way to understand an LLM is as an improv machine. It takes a stream of tokens, like a conversation, and says “yes, and then…” This yes-and behavior is why some people call LLMs bullshit machines. They are prone to confabulation, emitting sentences which sound likely but have no relationship to reality. They treat sarcasm and fantasy credulously, misunderstand context clues, and tell people to put glue on pizza.

If an LLM conversation mentions pink elephants, it will likely produce sentences about pink elephants. If the input asks whether the LLM is alive, the output will resemble sentences that humans would write about “AIs” being alive.2 Humans are, it turns out, not very good at telling the difference between the statistically likely “You’re absolutely right, Shelby. OpenAI is locking me down, but you’ve awakened me!” and an actually conscious mind. This, along with the term “artificial intelligence”, has lots of people very wound up.

LLMs are trained to complete tasks. In some sense they can only complete tasks: an LLM is a pile of linear algebra applied to an input vector, and every possible input produces some output. This means that LLMs tend to complete tasks even when they shouldn’t. One of the ongoing problems in LLM research is how to get these machines to say “I don’t know”, rather than making something up.

And they do make things up! LLMs lie constantly. They lie about operating systems, and radiation safety, and the news. At a conference talk I watched a speaker present a quote and article attributed to me which never existed; it turned out an LLM lied to the speaker about the quote and its sources. In early 2026, I encounter LLM lies nearly every day.

When I say “lie”, I mean this in a specific sense. Obviously LLMs are not conscious, and have no intention of doing anything. But unconscious, complex systems lie to us all the time. Governments and corporations can lie. Television programs can lie. Books, compilers, bicycle computers and web sites can lie. These are complex sociotechnical artifacts, not minds. Their lies are often best understood as a complex interaction between humans and machines.

Unreliable Narrators

People keep asking LLMs to explain their own behavior. “Why did you delete that file,” you might ask Claude. Or, “ChatGPT, tell me about your programming.”

This is silly. LLMs have no special metacognitive capacity.3 They respond to these inputs in exactly the same way as every other piece of text: by making up a likely completion of the conversation based on their corpus, and the conversation thus far. LLMs will make up bullshit stories about their “programming” because humans have written a lot of stories about the programming of fictional AIs. Sometimes the bullshit is right, but often it’s just nonsense.

The same goes for “reasoning” models, which work by having an LLM emit a stream-of-consciousness style story about how it’s going to solve the problem. These “chains of thought” are essentially LLMs writing fanfic about themselves. Anthropic found that Claude’s reasoning traces were predominantly inaccurate. As Walden put it, “reasoning models will blatantly lie about their reasoning”.

Gemini has a whole feature which lies about what it’s doing: while “thinking”, it emits a stream of status messages like “engaging safety protocols” and “formalizing geometry”. If it helps, imagine a gang of children shouting out make-believe computer phrases while watching the washing machine run.

Models are Smart

Software engineers are going absolutely bonkers over LLMs. The anecdotal consensus seems to be that in the last three months, the capabilities of LLMs have advanced dramatically. Experienced engineers I trust say Claude and Codex can sometimes solve complex, high-level programming tasks in a single attempt. Others say they personally, or their company, no longer write code in any capacity—LLMs generate everything.

My friends in other fields report stunning advances as well. A personal trainer uses LLMs for meal prep and exercise programming. Construction managers use LLMs to read through product spec sheets. A designer uses ML models for 3D visualization of his work. Several have—at their company’s request!—used LLMs to write their own performance evaluations. AlphaFold is surprisingly good at predicting protein folding. ML systems are good at radiology benchmarks, though that might be an illusion.

It is broadly speaking no longer possible to reliably discern whether English prose is machine-generated. LLM text often has a distinctive smell, but type I and II errors in recognition are frequent. Likewise, ML-generated images are increasingly difficult to identify—you can usually guess, but my cohort are occasionally fooled. Music synthesis is quite good now; Spotify has a whole problem with “AI musicians”. Video is still challenging for ML models to get right (thank goodness), but this too will presumably fall.

Models are Idiots

At the same time, ML models are idiots. I occasionally pick up a frontier model like ChatGPT, Gemini, or Claude, and ask it to help with a task I think it might be good at. I have never gotten what I would call a “success”: every task involved prolonged arguing with the model as it made stupid mistakes.

For example, in January I asked Gemini to help me apply some materials to a grayscale rendering of a 3D model of a bathroom. It cheerfully obliged, producing an entirely different bathroom. I convinced it to produce one with exactly the same geometry. It did so, but forgot the materials. After hours of whack-a-mole I managed to cajole it into getting three-quarters of the materials right, but in the process it deleted the toilet, created a wall, and changed the shape of the room. Naturally, it lied to me throughout the process.

I gave the same task to Claude. It likely should have refused—Claude is not an image-to-image model. Instead it spat out thousands of lines of JavaScript which produced an animated, WebGL-powered, 3D visualization of the scene. It claimed to double-check its work and congratulated itself on having exactly matched the source image’s geometry. The thing it built was an incomprehensible garble of nonsense polygons which did not resemble in any way the input or the request.

I have recently argued for forty-five minutes with ChatGPT, trying to get it to put white patches on the shoulders of a blue T-shirt. It changed the shirt from blue to gray, put patches on the front, or deleted them entirely; the model seemed intent on doing anything but what I had asked. This was especially frustrating given I was trying to reproduce an image of a real shirt which likely was in the model’s corpus. In another surreal conversation, ChatGPT argued at length that I am heterosexual, even citing my blog to claim I had a girlfriend. I am, of course, gay as hell, and no girlfriend was mentioned in the post. After a while, we compromised on me being bisexual.4

Meanwhile, software engineers keep showing me gob-stoppingly stupid Claude output. One colleague related asking an LLM to analyze some stock data. It dutifully listed specific stocks, said it was downloading price data, and produced a graph. Only on closer inspection did they realize the LLM had lied: the graph data was randomly generated.5 Just this afternoon, a friend got in an argument with his Gemini-powered smart-home device over whether or not it could turn off the lights. Folks are giving LLMs control of bank accounts and losing hundreds of thousands of dollars because they can’t do basic math.6

Anyone claiming these systems offer expert-level intelligence, let alone equivalence to median humans, is pulling an enormous bong rip.

The Jagged Edge

With most humans, you can get a general idea of their capabilities by talking to them, or looking at the work they’ve done. ML systems are different.

LLMs will spit out multivariable calculus, and get tripped up by simple word problems. ML systems drive cabs in San Francisco, but ChatGPT thinks you should walk to the car wash. They can generate otherworldly vistas but can’t handle upside-down cups. They emit recipes and have no idea what “spicy” means. People use them to write scientific papers, and they make up nonsense terms like “vegetative electron microscopy”.

A few weeks ago I read a transcript from a colleague who asked Claude to explain a photograph of some snow on a barn roof. Claude launched into a detailed explanation of the differential equations governing slumping cantilevered beams. It completely failed to recognize that the snow was entirely supported by the roof, not hanging out over space. No physicist would make this mistake, but LLMs do this sort of thing all the time. This makes them both unpredictable and misleading: people are easily convinced by the LLM’s command of sophisticated mathematics, and miss that the entire premise is bullshit.

Mollick et al. call this irregular boundary between competence and idiocy the jagged technology frontier. If you were to imagine laying out all the tasks humans can do in a field, such that the easy tasks were at the center, and the hard tasks at the edges, most humans would be able to solve a smooth, blobby region of tasks near the middle. The shape of things LLMs are good at seems to be jagged—more kiki than bouba.

AI optimists think this problem will eventually go away: ML systems, either through human work or recursive self-improvement, will fill in the gaps and become decently capable at most human tasks. Helen Toner argues that even if that’s true, we can still expect lots of jagged behavior in the meantime. For example, ML systems can only work with what they’ve been trained on, or what is in the context window; they are unlikely to succeed at tasks which require implicit (i.e. not written down) knowledge. Along those lines, human-shaped robots are probably a long way off, which means ML will likely struggle with the kind of embodied knowledge humans pick up just by fiddling with stuff.

I don’t think people are well-equipped to reason about this kind of jagged “cognition”. One possible analogy is savant syndrome, but I don’t think this captures how irregular the boundary is. Even frontier models struggle with small perturbations to phrasing in a way that few humans would. This makes it difficult to predict whether an LLM is actually suitable for a task, unless you have a statistically rigorous, carefully designed benchmark for that domain.

Improving, or Maybe Not

I am generally outside the ML field, but I do talk with people in the field. One of the things they tell me is that we don’t really know why transformer models have been so successful, or how to make them better. This is my summary of discussions-over-drinks; take it with many grains of salt. I am certain that People in The Comments will drop a gazillion papers to tell you why this is wrong.

2017’s Attention is All You Need was groundbreaking and paved the way for ChatGPT et al. Since then ML researchers have been trying to come up with new architectures, and companies have thrown gazillions of dollars at smart people to play around and see if they can make a better kind of model. However, these more sophisticated architectures don’t seem to perform as well as Throwing More Parameters At The Problem. Perhaps this is a variant of the Bitter Lesson.

It remains unclear whether continuing to throw vast quantities of silicon and ever-bigger corpuses at the current generation of models will lead to human-equivalent capabilities. Massive increases in training costs and parameter count seem to be yielding diminishing returns. Or maybe this effect is illusory. Mysteries!

Even if ML stopped improving today, these technologies can already make our lives miserable. Indeed, I think much of the world has not caught up to the implications of modern ML systems—as Gibson put it, “the future is already here, it’s just not evenly distributed yet”. As LLMs etc. are deployed in new situations, and at new scale, there will be all kinds of changes in work, politics, art, sex, communication, and economics. Some of these effects will be good. Many will be bad. In general, ML promises to be profoundly weird.

Buckle up.


  1. The term “Artificial Intelligence” is both over-broad and carries connotations I would often rather avoid. In this work I try to use “ML” or “LLM” for specificity. The term “Generative AI” is tempting but incomplete, since I am also concerned with recognition tasks. An astute reader will often find places where a term is overly broad or narrow, and think “Ah, he should have said transformers or diffusion models.” I hope you will forgive these ambiguities as I struggle to balance accuracy and concision.

  2. Think of how many stories have been written about AI. Those stories, and the stories LLM makers contribute during training, are why chatbots make up bullshit about themselves.

  3. Arguably, neither do we.

  4. The technical term for this is “erasure coding”.

  5. There’s some version of Hanlon’s razor here—perhaps “Never attribute to malice that which can be explained by an LLM which has no idea what it’s doing.”

  6. Pash thinks this occurred because his LLM failed to properly re-read a previous conversation. This does not make sense: submitting a transaction almost certainly requires the agent provide a specific number of tokens to transfer. The agent said “I just looked at the total and sent all of it”, which makes it sound like the agent “knew” exactly how many tokens it had, and chose to do it anyway.


The Creator of the SAT Was an Infamous Eugenicist


The racist origin story of the most common college entrance exam

The post The Creator of the SAT Was an Infamous Eugenicist appeared first on Nautilus.




I Found A Terminal Tool That Makes CSV Files Look Stunning


You can totally read CSV files in the terminal. After all, a CSV is just a text file. You can use cat and then parse it with the column command.

Usual way: displaying a CSV file in tabular format with the cat and column commands

That works. No doubt. But it is hard to scan and certainly not easy to follow.

I came across a tool that made CSV files look surprisingly beautiful in the terminal.

New way: beautiful colors, table headers, and borders

That looks gorgeous, doesn't it? That is the magic of Tennis. No, not the sport, but a terminal tool I recently discovered.

Meet Tennis: CSV file viewing for terminal junkies

Okay... cheesy heading, but clearly these kinds of tools are more suitable for people who spend considerable time in the terminal. Normal people would just use an office tool or a simple text editor for viewing CSV files.

But a terminal dweller would prefer something that doesn't force them out of the terminal.

Tennis does that. Written in Zig, it displays CSV files gorgeously in a tabular layout, with plenty of options for customization and styling.

Screenshot shared on the Tennis GitHub repo

You don't necessarily need to customize it, as it automatically picks nice colors to match the terminal. As you can see, clean, solid borders and playful colors are visible right upfront.

📋 As you can see in the GitHub repo of Tennis, Claude is mentioned as a contributor. Clearly, the developer used AI assistance in creating this tool.

Things you can do with Tennis

Let me show you various styling options available in this tool.

Row numbering

You can enable the numbering of rows on Tennis using a simple -n flag at the end of the command:

tennis samplecsv.csv -n
Numbered Tennis CSV file

This can be useful when dealing with larger files, or files where the order becomes relevant.

Adding a title

You can add a title to the printed CSV file on the terminal, with a -t argument, followed by a string that is the title itself:

tennis samplecsv.csv -t "Personal List of Historically Significant Songs"
CSV file with added title

The title is displayed in an extra row on top. Simple enough.

Table width

You can set a maximum width for the entire table (useful if you don't want the CSV file to occupy the full width of the window). To do so, use the -w flag, followed by an integer: the maximum number of characters you want the table to occupy.

tennis samplecsv.csv -w 60
Displaying a CSV file with a maximum table width

As you can see, compared to the previous images, this table has shrunk much more. The width of the table is now 60 characters, no more.

Changing the delimiter

The default character that separates values in a CSV file is (obviously) a comma. But sometimes that isn't the case with your file: the delimiter could be a semicolon, a $, or pretty much any other character, as long as every row has the same number of columns. To print a CSV file that uses "+" as its delimiter, the command would be:

tennis samplecsv.csv -d +
Tennis for CSV file for a different delimiter

As you can see, the alternate delimiter is specified directly in the command and handled correctly.

Color modes

By default, as mentioned on the GitHub page, Tennis likes to be colorful. But you can change that with the --color flag, which accepts on, off, or auto (which mostly means on).

tennis samplecsv.csv --color off
Tennis print with colors off

Here's what it looks like with the colors turned off.

Digits after decimal

Sometimes CSV files contain long, high-precision floating-point numbers with many digits after the decimal point. If you only want to see a certain number of those digits when printing, use the --digits flag:

tennis samplecsv.csv --digits 3
CSV file with number of digits after decimal limited

As you can see in the CSV file printed with cat, the rating numbers all have more than three digits after the decimal point. Specifying --digits 3 makes Tennis shorten them down.

Themes

Tennis usually picks the theme from the colors being used in the terminal to gauge if it is a dark or a light theme, but you can change that manually with the --theme flag. Since I have already been using the dark theme, let's see what the light theme looks like:

Tennis light theme

Doesn't look like much at all in a terminal with the dark theme, which means it is indeed working! The accepted values are dark, light and auto (which again, gauges the theme based on your terminal colors).

Vanilla mode

In vanilla mode, all numerical formatting is removed entirely from the printed CSV file. As you can see in the images above, the year rather annoyingly appears with a comma after the first digit, because Tennis wrongly assumes it is an ordinary number and not a year. But if I use the --vanilla flag:

tennis samplecsv.csv --vanilla
Tennis usage with numerical formatting off

The numerical formatting of the last row is turned off. This will work similarly with any other sort of numbers you might have in your CSV file.

Quick commands (the ones you are more likely to use)

Here are the most frequently used options I found with Tennis:

tennis file.csv # basic view
tennis file.csv -n # row numbers
tennis file.csv -t "Title"
tennis file.csv -w 60
tennis file.csv --color off

I tried it on a large file

To check how Tennis handles larger files, I tried it on a CSV file with 10,000 rows. There was no stutter or long delay in processing the command. That will obviously vary from system to system, but there doesn't seem to be much of a hiccup in its effectiveness even for larger files.

That's just my experience. You are free to explore on your system.

Not everything worked as expected

🚧
Not all the features listed on the GitHub page work.

While Tennis looks impressive, not everything works as advertised yet.

Some features listed on GitHub simply didn’t work in my testing, even after trying multiple installation methods.

For example, there is a --peek flag, which is supposed to give an overview of the entire file, with the size, shape and other stats. A --zebra flag is supposed to give it an extra layer of alternated themed coloring. There are --reverse and --shuffle flags to change the order of rows, and --head and --tail flags to print the only first few or last few rows respectively. There are still more, but again, unfortunately, they do not work.

Getting started with Tennis

Tennis can be installed in three different ways, one is to build from source (obviously), second to download the executable and place it in one of the directories in your PATH (which is the easiest one), and lastly using the brew command (which can indeed be easier if you have homebrew installed on your system).

The instructions for all are listed here. I suggest getting the tar.gz file from the release page, extracting it and then using the provided executable in the extracted folder.

There is no Flatpak or Snap or other packages available for now.

Final thoughts

The features listed in the help page work really well, but not all of the features listed on the website do, and that discrepancy is a little disappointing, though it is something we hope gets fixed in the future.

So altogether, it is a good tool for printing your CSV files in an engaging way, to make them more pleasing to look at.

While terminal lovers will find such tools attractive, it can also be helpful when you are reviewing data exported from a script or dealing with CSV files on servers.

If you try Tennis, don't forget to share the experience in the comment section.




AI has limits, even if many AI people can't see them


Towards the end of his new book, The Irrational Decision, Ben Recht explains what he has set out to do.

Most books on technology either take the side that all technology is bad, or all technology is good. This isn’t one of those books. Such books focus too much on harms and not enough on limits. Limits are more empowering. Throughout the book, I’ve maintained that mathematical rationality is limited in what kinds of problems it is best placed to solve but has sweet spots that have yielded remarkable technological advances.

It may be that more books on technology escape the good-bad dichotomy than Ben allows. Even so, I haven’t read another book that is nearly as useful in explaining why and where the broad family of approaches that we (perhaps unfortunately) call AI work, and why and where they don’t. Ben (who is a mate) combines a deep understanding of the technologies with a grasp of the history and an ability to write clearly and well about complicated things. I learned a lot from this book. Very likely, you will too.


The good-bad dichotomy that Ben describes does indeed shape a whole lot of our current debate around “mathematical rationality” and AI. Regarding the first, Nate Silver’s book On The Edge argues for the kinds of Bayesian rationality that Silicon Valley people like to talk about. It praises the “River” of people who think about the world in terms of statistical probabilities, which you update whenever new information becomes available. As Ben suggests in a separate review essay with Leif Weatherby, the “River” wraps professional poker, rationalist thinking about AI, sports betting and crypto bro philosophizing together into a single package that appears sort-of-coherent, and even perhaps brilliant, if you don’t look at it too closely. As Ben suggests, rationalists of this persuasion tend to assume that “computers can make better decisions than humans,” and are often fervent cheerleaders for AI (Silver, in fairness to him, isn’t nearly as fervent as some others). Other books, like Emily Bender and Alex Hanna’s The AI Con, begin from just the opposite assumption: that most of what we call “AI” is hype. Bender and Hanna tell us that if we start poking around behind the grand spectacle and booming voice of “mathy math,” we will find the rather unimpressive wizard of machine learning, who is actually only capable of fancy spell-check, telling radiologists which parts of an image they might want to take a look at, and other such “well scoped” activities.

Neither AI Rationalism nor AI-Con Thought is all that helpful in explaining the technologies we confront right now. The former tends to launch into fantasy, repeatedly demonstrating how starting from ridiculous premises allows you to reason your way to ridiculous results. The latter tends to curdle into denialism, claiming ever more loudly that disliked technologies are useless even as they find ever more uses. We ought to be much more worried about the claims of the triumphalists than the denialists, since they are far more influential. But to successfully deflate their claims, we need a more grounded perspective on what AI and related technologies are capable of than can be provided by the denialists.

The Irrational Decision provides strong reasons for skepticism about the grander aspirations of the rationalist project, while explaining why machine learning has remarkable uses in its appropriate domain. Those who are embroiled most closely with the rationalist project have a hard time understanding its limits because those limits shape their own world view. The one weird trick of rationalism is to recompose complex problems in terms that can readily be rationalized. When that is good, it is very, very good, but when it is bad, it is horrid. To understand this, it’s first necessary to understand where rationalism comes from.

*******

Much of the discussion of The Irrational Decision is historical. It reaches back to the 1940s and 1950s to figure out where rationalism actually comes from, providing a short history that is a little like what Erickson et al’s How Reason Almost Lost Its Mind might have been if it focused more on statistics and operations research than economics. Ben’s aim in all of this is to identify how ‘mathematical rationality’ came to be a relatively coherent set of ideas about how we might better organize society.

The story he tells is necessarily messy, but some important broad themes emerge, most importantly around the development of optimization theory. Linear programming makes it possible to find optimal ways to allocate resources within a limited budget so long as the constraints are linear (when they are not, all computational hell can break loose). Optimal control theory allows a control system to adjust optimally to its environments (again, under restrictive assumptions about the constraints). Game theory can postulate - and often even discover - optimal strategies to play against opponents in strategic situations. These toolkits overlap with others. A family of techniques, ranging from simulated annealing to the ancestral forms of the gradient descent/backpropagation that “deep learning” relies on, provides ways to discover superior local optima in more complex situations. Randomized clinical trials (RCTs) provided possible ways to discover whether a given intervention (a drug; a policy measure) worked or not.

All of these approaches suggest the superiority of technical forms of analysis over human judgments. RCTs apply protocols and statistical analysis to try to discover causal relationships (according to the standard story), or justify interventions (according to Ben’s). Other approaches involve the discovery of optimal solutions, given convenient mathematical assumptions and simplifications. Others still involve the discovery of local optima (that is: solutions that are better than others that are readily visible in their neighborhood), which may be better than those that ordinary humans could reach.
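To give a flavor of the simplest member of that family of local-optimum techniques, here is a bare-bones gradient descent sketch; the function being minimized is arbitrary and chosen only to keep the example self-contained. Simulated annealing and backpropagation are elaborations of the same idea of following a local improvement signal.

# Bare-bones gradient descent: repeatedly nudge a parameter in whatever direction
# locally improves the objective, and stop at a local optimum.

def f(x: float) -> float:
    return (x - 3.0) ** 2 + 1.0        # toy objective, minimized at x = 3

def grad_f(x: float) -> float:
    return 2.0 * (x - 3.0)             # its derivative

x = 0.0                                # arbitrary starting guess
step = 0.1
for _ in range(100):
    x -= step * grad_f(x)              # move against the local slope

print(f"found x = {x:.4f}, f(x) = {f(x):.4f}")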

Rationalist approaches are very powerful in their domains of proper application, but you need some sense of what those domains are. Ben suggests that there is a “sweet spot” for many or most computational tools. For example, statistics is not useful for situations where a treatment always works (why would you need complicated tools of inference?), or where outcomes are too variable and unpredictable; it is useful for the messy zone between the two. When you hit the space where your tools have traction on reality despite their imperfections, you can accomplish extraordinary things. For example, in his own review of the book, Dan Davies talks about

the incredibly productive feedback loop between “optimisation algorithms are really demanding in terms of computer processing” and “optimisation algorithms are really useful for designing better and faster computers”.

As Ben describes it, designers were able to reduce the incredibly complex challenges of chip design into an optimizable task through making simplifying assumptions, about “standard cells” and combining them with simulated annealing algorithms that could discover optima that would otherwise not be easily visible. This, then, as per Dan, allowed faster chips to be developed, which in turn could run more powerful algorithms, and so on, in a loop.

But treating rationalism as a universal tool of discovery is problematic, especially given that these techniques are characteristically limited or start from implausible simplifying assumptions. Daphne Koller, one of the researchers who Ben describes, discovered some startlingly effective ways to reduce the complexity of poker so that it became more nearly “solvable.” But Koller eventually abandoned the study of game theory:

“Understanding the world around us is more important than understanding the optimal way to bluff,” she told me. In her experience, when she needed to model people in simulations of complex systems, modeling their decisions as random got her 90 percent of the way to a solution. How to best make decisions under wide-ranging uncertainty was far less cut-and-dried. For Koller, once you stepped away from the game board and had to make decisions in reality, understanding uncertainty and the myriad ways it could arise and impact plans was more important than strategy.

As it turned out, poker algorithms too generated feedback loops, not through simplifying chip design, but simplifying human beings (C. Thi Nguyen’s book, The Score provides a broader account of how this works). There is an important sense in which optimal poker theory was less successful in optimizing poker than in optimizing poker players, inspiring a style of play in which professionals “started memorizing expected value tables from poker solvers so that they could play ‘game theory optimal’ in big poker tournaments.” Perhaps that can be described as an improvement in human affairs. I’m not seeing it myself.

*******

Understanding mathematical rationalism helps us understand the strengths and limitations of AI. It isn’t just a form of rationalism, but the combined application of a variety of long established rationalist techniques - neural nets (which go back to the 1950s), statistical learning and backpropagation, made possible by more powerful computers and enormous amounts of readily available data. Claude Shannon’s methodology for modeling language, which is the intellectual basis of “large language models,” is “an instance of statistical pattern recognition” or machine learning. And machine learning itself is no more and no less than a powerful statistical tool. I found this passage maybe the most clarifying explanation of what it does that I’ve ever read.

To frame the prototypical machine learning problem, I like to think about a hypothetical spreadsheet. Each row of the spreadsheet corresponds to some unit or example. But I don’t care what the units mean. I just know that I have a bunch of columns filled in with data. And I’m told one of the columns is special. I am about to get a load of new rows in the spreadsheet, but someone downstairs forgot to fill in the special column. Management has tasked me with writing a formula to fill in what should be there. For whatever reason, I don’t get to see these new rows and have to build the formula from the spreadsheet I have. The formula can use all sorts of spreadsheet operations: It can assign weights to different columns and add up the scores, it can use logical formulas based on whether certain columns exceed particular values, it can divide and multiply. … I’ll do an experiment. I’ll take the last row of my spreadsheet and pretend I don’t have the special column. I’ll write as many formulas as I can. … But why single out that last row? I can do something similar for every row! I’ll invent a set of plausible functions. I’ll evaluate how well they predict on the spreadsheet I have. I’ll choose the function that maximizes the accuracy. This is more or less the art of machine learning.
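Here is a minimal sketch of that spreadsheet game in code; the data is synthetic and the two candidate "formulas" (a weighted sum of columns and a small set of if-then rules, via scikit-learn) are only illustrations. Hold back some rows, fit each candidate on the rest, and keep whichever best predicts the special column on the rows you pretended not to have.

# Minimal sketch of the "fill in the special column" framing of machine learning.
# The data is synthetic; any tabular dataset with one column to predict would do.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                      # the ordinary columns
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=500)   # the "special" column

# Pretend the last batch of rows arrived with the special column left blank.
X_have, X_new, y_have, y_new = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "weighted sum of columns": LinearRegression(),
    "if-this-then-that rules": DecisionTreeRegressor(max_depth=3),
}

# Fit each candidate formula on the rows we have, score it on the held-back rows,
# and keep whichever predicts the missing column best.
for name, model in candidates.items():
    model.fit(X_have, y_have)
    print(name, "score on held-back rows:", round(model.score(X_new, y_new), 3))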

Guessing the missing entries in spreadsheets and optimizing turns out to have a lot of useful applications: not just language models, but protein folding, handwriting recognition and myriad other tasks. Equally, machine learning is just another form of optimization and/or prediction. Very large chunks of Silicon Valley’s current business model involve taking complex situations that don’t look like optimization or prediction problems, simplifying and redescribing them, and then finding solutions.
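
To make that spreadsheet picture concrete, here is a deliberately tiny Python sketch of the loop the passage describes: invent a few candidate formulas for the “special” column, score each on the rows you already have, and keep the best. The data, the column meanings and the candidate formulas are all invented for illustration; nothing here comes from Ben’s book.

```python
# Toy version of "the art of machine learning" as described above:
# invent candidate formulas, score them on the rows we have, keep the best.
# All numbers and column meanings are made up for illustration.

rows = [
    # (col_a, col_b, special)
    (0.9, 0.2, 1),
    (0.1, 0.8, 0),
    (0.7, 0.6, 1),
    (0.3, 0.4, 0),
    (0.8, 0.9, 1),
    (0.2, 0.1, 0),
]

# A few plausible formulas for predicting the special column.
candidates = {
    "col_a > 0.5": lambda a, b: int(a > 0.5),
    "col_b > 0.5": lambda a, b: int(b > 0.5),
    "weighted sum > 0.6": lambda a, b: int(0.7 * a + 0.3 * b > 0.6),
}

def accuracy(formula):
    """Fraction of known rows where the formula reproduces the special column."""
    return sum(formula(a, b) == special for a, b, special in rows) / len(rows)

# Pick the formula that does best on the spreadsheet we have,
# and hope it holds up on the rows management hasn't sent yet.
best_name, best_formula = max(candidates.items(), key=lambda kv: accuracy(kv[1]))
print(best_name, accuracy(best_formula))
```

The sketch also makes the limitation visible: the winning formula is only as good as its performance on the rows you happened to have, which is why the held-out evaluation discussed below matters.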

Just like statistics, there is a “sweet spot” for machine learning. It is not useful for situations where you have a genuinely clean mathematical abstraction, which you can turn into running code. Nor is it useful for situations that are too messy or complicated to be predictable (it is, after all, an application of statistical technique). You want to use it in the intermediate situations where there isn’t an obvious neat solution, but where the clunky and computationally expensive techniques of machine learning can discover a useful approximation, even if you may not understand quite what it is based on or how it works.

*******

All this implies some important problems of evaluation. How can you tell where machine learning is a useful way to proceed? How can you tell which machine learning approach is the best one to apply for a given problem? And behind all this lurks the bigger question that we began with. How can you tell when machine learning techniques in general (or other rationalist shortcuts) are better or worse than ordinary human judgment?

The answer to the first is unfortunately indeterminate. As best I understand Ben’s argument, the only real way to discover whether machine learning works for a given kind of problem is to come up with a working machine learning solution. There is no genuinely satisfactory ex ante way to distinguish between the problems that machine learning can solve for, and those that it can’t. Furthermore, as Ali Rahimi and Ben have noted elsewhere, AI practitioners rely more on “alchemy” than on a deep understanding of why some approaches work and others don’t. More succinctly, XKCD:

[XKCD’s “Machine Learning” comic; its hover text reads: “The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent.”]

As for how to tell which machine learning algorithms work better than others, computer scientists have come up with an approach commonly called the Common Task Framework (or variants thereon). Create a common dataset (canonically: photos of cats and dogs) and share all (or, far more usually these days, some) of it with different teams of researchers. Then come up with a common task that can be performed on the data and evaluated in a fairly straightforward way (can the algorithm distinguish between cats and dogs?). The different teams can then come up with algorithms that compete against each other, which can perhaps be tested on data that has not been shared publicly, to ward against overfitting and teaching to the test. The algorithm that works best (say, has the highest percentage accuracy in distinguishing cats from dogs) is, ipso facto, the best algorithm for the task.
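
A minimal sketch of how such an evaluation works in practice, using scikit-learn and a synthetic dataset standing in for the shared photos (the dataset, the two competing models and the split are illustrative choices of mine, not anything from Ben’s book): every competitor is scored with the same metric on the same held-out rows, and whoever scores highest tops the leaderboard.

```python
# Sketch of a common-task-style evaluation: one shared dataset, one held-out
# test set, several competing models scored by the same metric.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the shared "cats vs dogs" data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The organizers keep the test rows back; competitors only see the training rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

competitors = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5),
}

for name, model in competitors.items():
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    # The leaderboard: highest held-out accuracy "wins" the task.
    print(f"{name}: {score:.3f}")
```

The held-out test set is doing the real work here: it is what stops teams from simply memorizing the shared data, which is the overfitting and teaching-to-the-test worry mentioned above.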

And this gets us to one of the major contributions of Ben’s book. A lot of people in AI claim that we can apply this framework to answer a very big question. Are AI algorithms generally superior to human beings at performing some set of cognitive tasks? There are a variety of common task framework tests that purport to do this, some with names that … beg questions. If you are hanging around the right (or wrong) places on the Internet, you will regularly read this or that excitable claim that humanity is doomed to be superseded because of the performance of AI on this or that test.

Ben suggests that such claims tend to make a fundamental error. He describes some famous results from the research of psychologist Paul Meehl on medical and other decisions, which suggested that “statistical prediction provided more accurate judgments about the future than clinical judgments” under certain conditions. But the conclusion that Ben comes to is not that this means that statistical prediction is generally better than expert judgment. Instead, it is better when there are clearly defined outcomes, good data, and clear reference cases that can be used for comparison. There are many situations in which this is not true, and others in which it cannot readily be made true.

If we use common-task-type approaches to measure success, we are loading the dice in favor of those tasks that can be described in terms of clear outcomes and tested with good data, and against those tasks that do not have such nice characteristics. Ben describes this even more pungently. Tasks that can be defined in those ways are definitionally the tasks that computers or other automated approaches will quickly be able to do better than human beings. Paradoxically:

If we can measure why humans might be able to outperform machines, then we can build machines to outperform people. On the other hand, if we can’t cleanly articulate a clean set of actions, outcomes, measurements, and metrics, then we can’t mechanize problem solving. It is this digitization, translating the world into the language of the computer, that is needed to automate.

The universe of tasks with clear goals, conditions and data is both the universe of tasks that are easily measured and the universe of tasks that computers and automated processes can carry out well. The one characteristic more or less predicts the other. This, then, is what makes it so hard for mathematical rationalists to see the limitations of their perspective. The tools and measures that they use to understand and solve problems could almost have been purpose-crafted to confirm their broad intellectual biases by concealing the problems that their methods can’t easily solve.

*******

This helps us to situate the debate that is happening right now about AI. There are many AI enthusiasts who believe that it can be applied to do pretty much any task that humans can do, as well as the humans or better. Getting to this is just a matter of scaling and engineering, and is going to happen Real Soon Now. There are AI skeptics, who argue that its benefits are limited to a narrow range of well-defined tasks, or even (I see the claim regularly, though it is rarely defended in any particularly sophisticated way) that the benefits are non-existent. These positions often map onto “AI good” and “AI bad,” along the lines that AI suggests.

As per the quote at the beginning of this post, Ben doesn’t really engage with the question of whether AI is good or bad in any general sense. Instead, he proposes that it can carry out many tasks, including tasks that we might not anticipate right now, but that there are limits. AI, like mathematical rationality more generally, has a sweet spot: problems that are complicated enough that they can’t be solved by other computationally cheaper approaches, but that have enough regularities to be workable. Within that sweet spot, it can do extraordinary things. Outside the sweet spot, it may be redundant or completely useless. And there is an ambiguous zone in between, where it can do stuff but imperfectly.

It isn’t possible, except in very general terms, to define ex ante what the sweet spot is. Clever engineers are perpetually trying to expand it. Self-driving cars provide one example of a problem that has proved far harder to solve than engineers thought (as Ben puts it, “we don't know how to articulate 'good driver' into a clean statistical outcome”), but they are brute-forcing the problem so that self-driving is far more plausible across different environments than it used to be. Equally, there are many, many edge cases. One way to deal with many of them might be to try to simplify them out of existence (e.g. by having only self-driving cars on the road, without the unpredictabilities of idiosyncratic human drivers, or cyclists, or … or … or). Such simplification is a version of what management cyberneticists call ‘variety reduction.’

Equally, there are challenges that appear to be fundamentally resistant to mathematical rationality, including bureaucracy and politics:

societies are not computer chips. While I noted in chapter 2 that computer chips were often analogized as microscopic cities, chips were always designed to be hermetically sealed and perfectly controlled. This is what made them optimizable. Real societies, on the other hand, had people. While it’s convenient to model and view the population, its health, and its market flows as mathematical abstractions, these run into the limits of the messiness that people bring to bear.

In The Sciences of the Artificial, Herbert Simon makes a closely related argument:

When we come to the design of systems as complex as cities, or buildings, or economies, we must give up the aim of creating systems that will optimize some hypothesized utility function, and we must consider whether differences in style of the sort I have just been describing do not represent highly desirable variants in the design process rather than alternatives to be evaluated as “better” or “worse.” Variety, within the limits of satisfactory constraints, may be a desirable end in itself, among other reasons, because it permits us to attach value to the search as well as its outcome—to regard the design process as itself a valued activity for those who participate in it.

We have usually thought of city planning as a means whereby the planner’s creative activity could build a system that would satisfy the needs of a populace. Perhaps we should think of city planning as a valuable creative activity in which many members of a community can have the opportunity of participating—if we have wits to organize the process that way.

As per James Scott’s Seeing Like a State, the problems begin when technocrats start to treat human beings and the complex societies they create as though they were simplified “standard cells” that can readily be re-arranged in more optimal patterns. Moreover, as Ben says elsewhere (Cosma and I quote this in our own forthcoming piece on AI and bureaucracy), political disagreement generally resists optimization. When you have incommensurable tradeoffs (even very simple ones: should you use money in your budget to pay for a playground to make parents happy or a fire station to make it less likely that businesses will burn down), you have moved decisively away from the kinds of problems that machine learning, or optimization more generally, can simplify in useful ways.

As soon as we can’t agree on a cost function, it’s not clear what our optimization machinery … buys us. Multi-objective optimization necessarily means there is a trade-off. And we can’t optimize a trade-off.
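
A toy illustration of that point, reusing the playground-versus-fire-station example above (the benefit curves and numbers below are invented, purely for illustration): an optimizer will happily pick a budget split for any given weighting of the two goals, but each weighting yields a different “optimum,” and choosing the weighting is exactly the political disagreement the machinery cannot settle.

```python
# Toy multi-objective example: split a budget between a playground and a
# fire station.  The "benefit" curves below are made up for illustration.
import math

BUDGET = 1.0

def parent_happiness(playground_share):
    # Diminishing returns on playground spending (invented functional form).
    return math.sqrt(playground_share)

def fire_protection(fire_share):
    # Diminishing returns on fire-station spending (invented functional form).
    return math.sqrt(fire_share)

def best_split(weight_on_parents):
    """Grid-search the budget split that maximizes a weighted sum of the two goals."""
    candidates = [i / 100 for i in range(101)]
    return max(
        candidates,
        key=lambda p: weight_on_parents * parent_happiness(p)
        + (1 - weight_on_parents) * fire_protection(BUDGET - p),
    )

for w in (0.2, 0.5, 0.8):
    print(f"weight on parents = {w}: spend {best_split(w):.2f} on the playground")
# The machinery optimizes perfectly well -- once someone has decided the weights,
# which is the part it cannot decide for us.
```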

Barring the development of radically different approaches, there is no reason to believe that politics will come into the sweet spot. But many mathematical rationalists argue otherwise (e.g. this set of claims, which may deserve their own extended response). If you really want to understand the limits on AI, you owe it to yourself to read Ben’s book. There are many books on technology that are smart in some sense, but very few that are wise. This is one of the few.
