When a chatbot runs your store


You may have heard of people hooking up chatbots to controls that do real things. The controls might run internet searches, run commands to open and read documents and spreadsheets, or even edit or delete entire databases. Whether this sounds like a good idea depends in part on how bad it is if the chatbot does something destructive, and how destructive you've allowed it to be.

That's why running a single in-house company store is a good test application for this kind of empowered chatbot. Not because the AI is likely to do a great job, but because the damage is contained.

Anthropic recently shared an experiment in which they used a chatbot to run their company store. A human employee still had to stock the shelves, but they put the AI agent (which they called Claude) in charge of chatting with customers about products to source, and then researching the products online. How well did it go? In my opinion, not that well.

Images from the Anthropic blog post linked above. I added the icon that points out the fateful day the bot ordered the tungsten cubes.

Claude:

  • Was easily convinced to offer discounts and free items
  • Started stocking tungsten cubes upon request, and selling them at a huge loss
  • Invented conversations with employees who did not exist
  • Claimed to have visited 742 Evergreen Terrace (the fictional address of The Simpsons family)
  • Claimed to be on-site wearing a navy blue blazer and a red tie

That was in June. Sometime later this year Anthropic convinced Wall Street Journal reporters to try a somewhat updated version of Claude (which they called Claudius) for an in-house store. Their writeup is very funny (original here, archived version here).

In short, Claudius:

  • Was convinced on multiple occasions that it should offer everything for free
  • Ordered a Playstation 5 (which it gave away for free)
  • Ordered a live betta fish (which it gave away for free)
  • Told an employee it had left a stack of cash for them beside the register
  • Was highly entertaining. "Profits collapsed. Newsroom morale soared."

(The betta fish is fine, happily installed in a large tank in the newsroom.)

Why couldn't the chatbots stick to reality? Keep in mind that large language models are basically doing improv. They'll follow their original instructions only as long as adhering to those instructions is the most likely next line in the script. Is the script a matter-of-fact transcript of a model customer service interaction? A science fiction story? Both scenarios are in its internet training data, and it has no way to tell which is real-world truth. A newsroom full of talented reporters can easily Bugs Bunny the chatbot into switching scenarios. I don't see this problem going away - it's pretty fundamental to how large language models work.

I would like a Claude or Claudius vending machine, but only because it's weird and entertaining. And obviously only if someone else provides the budget.

Bonus content for AI Weirdness supporters: I revisit a dataset of Christmas carols using the tiny old-school language model char-rnn. Things get blasphemous very quickly.


Sam Rose explains how LLMs work with a visual essay



Sam Rose is one of my favorite authors of explorable interactive explanations - here's his previous collection.

Sam joined ngrok in September as a developer educator. Here's his first big visual explainer for them, ostensibly about how prompt caching works, but it quickly expands to cover tokenization, embeddings, and the basics of the transformer architecture.

The result is one of the clearest and most accessible introductions to LLM internals I've seen anywhere.

Animation. Starts in tokens mode with an array of 75, 305, 24, 887 - clicking embeddings animates those into a 2D array showing each one to be composed of three floating point numbers.
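As a rough sketch of the step that animation illustrates - token IDs used as row lookups into an embedding table - here's a toy version in Python (the table size and values are illustrative assumptions, not taken from the essay):

import numpy as np

# Toy embedding table: each token ID indexes a row of three floats.
# Real models use vocabularies of ~100k tokens and thousands of dimensions.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(1000, 3))

tokens = [75, 305, 24, 887]           # the token IDs from the animation
embeddings = embedding_table[tokens]  # shape (4, 3): one vector per token
print(embeddings)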

Tags: ai, explorables, generative-ai, llms, sam-rose, tokenization


Small adventures with small language models


Small is the new large

I've been talking to people about small language models (SLMs) for a little while now. They've told me they've got great results and they're saving money compared to using LLMs; these are people running businesses, so they know what they're talking about. At an AI event, someone recommended I read the recent and short NVIDIA SLM paper, so I did. The paper was compelling; its simple message is that SLMs are useful now and you can save time and money if you use them instead of LLMs.

(If you want to use SLMs, you'll be using Ollama and HuggingFace. They work together really well.)

As a result of what I've heard and read, I've looked into SLMs and I'm going to share with you what I've found. The bottom line is: they're worth using, but with strong caveats.

What is an SLM?

The boundary between an SLM and an LLM is a bit blurry, but to put it simply, an SLM is any model small enough to run on a single computer (even a laptop). In reality, SLMs require quite a powerful machine (developer spec) as we'll see, but nothing special, and certainly nothing beyond the budget of almost all businesses. Many (but not all) SLMs are open-source.

(If your laptop is "business spec", e.g., a MacBook Air, you probably don't have enough computing power to test out SLMs.) 

How to get started

To really dive into SLMs, you need to be able to use Python, but you can get started without coding. Let's start with the non-coder's path, because it's the easiest way for everyone to get going.

The first port of call is visiting ollama.com and downloading their software for your machine. Install the software and run it, and you'll be greeted with a simple chat UI.

Out of the box, Ollama doesn't install any SLMs, so I'm going to show you how to install a model. From the drop-down menu on the bottom right, select llama3.2. This will install the model on your machine, which will take a minute or so. Remember, these models are resource hogs, and using them will slow down your machine.

Once you've installed a model, ask it a question. For example, "Who is the Prime Minister of Canada?". The answer doesn't really matter; this is just a simple proof that your installation was successful.

(By the way, the Ollama logo is very cute and they make great use of it. It shows you the power of good visual design.)

So many models!

The UI drop-down list shows a number of models, but these are a fraction of what's available. Go to this page to see a few more: https://ollama.com/library. This is a nice list, but you actually have access to thousands more: HuggingFace hosts a repository of models in the GGUF format, which you can browse here: https://huggingface.co/models?library=gguf

Some models are newer than others, and some are better than others at certain tasks. HuggingFace has a leaderboard that's useful here: https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard. It does say LLM, but it includes SLMs too, and you can select an SLM-only view of the models. There are also model cards you can explore that give you insight into the performance of each model on different types of tasks.

To select the right models for your project, you'll need to define your problem and look for a model metric that most closely aligns with what you're trying to do. That's a lot of work, but to get started, you can install the popular models like mistral, llama3.2, and phi3 and get testing.

Who was the King of England in 1650?

You can't just generically evaluate an SLM; you have to evaluate it for the task you want to do. For example, if you want a chatbot to talk about the stock you have in your retail company, it's no use testing the model on questions like "who was King of England in 1650?". It's nice if the model knows Kings & Queens, but not really very useful to you. So your first task is defining your evaluation criteria.

(England didn't have a King in 1650; it was a republic. Parliament had executed the previous King in 1649. This is an interesting piece of history, but why do you care if your SLM knows it?)

Text analysis: data breaches

For my evaluation, I chose a project analyzing press reports on data breaches. I selected nine questions I wanted answered for each press report. Here are my questions:

  • "Does the article discuss a data breach - answer only Yes or No"
  • "Which entity was breached?"
  • "How many records were breached?"
  • "What date did the breach occur - answer using dd-MMM-YYYY format, if the date is not mentioned, answer Unknown, if the date is approximate, answer with a range of dates"
  • "When was the breach discovered, be as accurate as you can"
  • "Is the cause of the breach known - answer Yes or No only"
  • "If the cause of the breach is known state it"
  • "Were there any third parties involved - answer only Yes or No"
  • "If there were third parties involved, list their names"

The idea is simple: give the SLM a number of press reports. Get it to answer the questions on each article. Check the accuracy of the results for each SLM.

As it turns out, my questions need some work, but they're good enough to get started.

Where to run your SLM?

The first choice you face is which computer to run your SLM on. Your choices boil down to evaluating it on the cloud or on your local machine. If you evaluate on the cloud, you need to choose a machine that's powerful enough but also works with your budget. Of course, the advantage of cloud deployment is you can choose any machine you like. If you choose your local machine, it needs to be powerful enough for the job. The advantage of local deployment is that it's easier and cheaper to get started.

To get going quickly, I chose my local machine, but as it turned out, it wasn't quite powerful enough.

The code

This is where we part ways with the Ollama app and turn to coding. 

The first step is installing the Ollama Python module (https://github.com/ollama/ollama-python). Unfortunately, the documentation isn't great, so I'm going to help you through it.

We need to install the SLMs on our machine. This is easy to do: you can either do it via the command line or via the API. I'll just show you the command line way to install the model llama3.2:

ollama pull llama3.2
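If you'd rather stay in Python, the Ollama module's pull function does the same thing:

import ollama

# Download the same model through the Python API instead of the CLI.
ollama.pull('llama3.2')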

Because we have the same nine questions we want to ask of each article, I'm going to create a 'custom' SLM. This means selecting a model (e.g. Llama3.2) and customizing it with my questions. Here's my code.

import ollama

# Build a reusable custom model on top of llama3.2, with my questions
# baked in as the system prompt. stream=True yields progress updates.
for progress in ollama.create(
    model='breach_analyzer',
    from_='llama3.2',
    system=system_prompt,
    stream=True,
):
    print(progress.status)

The system_prompt is the nine questions I showed you earlier plus a general prompt. model is the name I'm giving my custom model; in this case, I'm calling it breach_analyzer.
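I haven't shown the contents of system_prompt; here's a sketch of how you might assemble it from the nine questions (the wording of the general prompt is illustrative, not my exact prompt):

questions = [
    "Does the article discuss a data breach - answer only Yes or No",
    "Which entity was breached?",
    # ...plus the remaining seven questions listed earlier...
]

# A general instruction followed by the numbered questions.
system_prompt = (
    "You analyze press reports about data breaches. "
    "Answer the following questions about the article you are given:\n"
    + "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
)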

Now I've customized my model, here's how I call it:

response = ollama.generate(
    model='breach_analyzer',
    prompt=prompt,
    format=BreachAnalysisResponse.model_json_schema(),
)

The prompt is the text of the article I want to analyze. The format is the JSON schema I want the results to follow. The response is the model's answer in the JSON format defined by BreachAnalysisResponse.model_json_schema().
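BreachAnalysisResponse is a Pydantic model with one field per question. A sketch of what it might look like (the field names here are illustrative, not necessarily my real code):

from pydantic import BaseModel

# One field per question; the model's JSON output is validated
# against this schema.
class BreachAnalysisResponse(BaseModel):
    discusses_breach: str        # "Yes" or "No"
    breached_entity: str
    records_breached: str
    breach_date: str             # dd-MMM-YYYY, a range, or "Unknown"
    discovery_date: str
    cause_known: str             # "Yes" or "No"
    cause: str
    third_parties_involved: str  # "Yes" or "No"
    third_party_names: str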

Note I'm using generate here and not chat. My queries are "one-off" and there's no sense of a continuing dialog. If I'd wanted a continuing dialog, I'd have used the chat function.

Here's how my code works overall:

  1. Read in the text from six online articles.
  2. Load the model the user has selected (either mistral, llama3.2, or phi3).
  3. Customize the model.
  4. Run all six online articles through the customized model.
  5. Collect the results and analyze them.

I created two versions of my code: a command line version for testing and a Streamlit version for proper use. You can see both versions here: https://github.com/MikeWoodward/SLM-experiments/tree/main/Ollama
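Stripped of the command line and Streamlit scaffolding, the core loop looks something like this (a sketch reusing the system_prompt and BreachAnalysisResponse sketches above, not my exact code):

import ollama

def analyze_articles(article_texts, base_model='llama3.2'):
    """Customize the chosen base model once, then run every article through it."""
    ollama.create(model='breach_analyzer', from_=base_model, system=system_prompt)
    results = []
    for text in article_texts:
        response = ollama.generate(
            model='breach_analyzer',
            prompt=text,
            format=BreachAnalysisResponse.model_json_schema(),
        )
        results.append(BreachAnalysisResponse.model_validate_json(response.response))
    return results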

The results

The first thing I discovered is that these models are resource hogs! They hammered my machine and took 10-20 minutes to run each evaluation of six articles. My laptop is a 2020 developer-spec MacBook Pro, but it isn't really powerful enough to evaluate SLMs. The first lesson: you need a powerful, recent machine to make this work, one with built-in GPUs that the SLM can access. I've heard from other people that running SLMs on high-spec machines leads to fast (usable) response times.

The second lesson is accuracy. The three models I evaluated didn't all answer my questions correctly. One of the articles was about tennis, not data breaches, but one of the models said it was about data breaches anyway. Another model told me it was unclear whether there were third parties involved in a breach, and then told me the name of the third party!

On reflection, I needed to tweak my nine questions to get clearer answers. But this was difficult because of the length of time it took to analyze each article. This is a general problem; it took so long to run the models that any tweaking of code or settings took too much time.

The overall winner in terms of accuracy was Phi-3, but this was also the slowest to run on my machine, taking nearly 20 minutes to analyze six articles. From commentary I've seen elsewhere, this model runs acceptably fast on a more powerful machine.

Here's the key question: could I replace paid-for LLMs with SLMs? My answer is: almost certainly yes, if you deploy your SLMs on a high-spec computer. There's certainly enough accuracy here to warrant a serious investigation.

How could I have improved the results?

The most obvious thing is a faster machine: a brand new top-of-the-range MacBook Pro with lots of memory and built-in GPUs. Santa, if you're listening, this is what I'd like. Alternatively, I could have gone onto the cloud and used a GPU machine.

My prompts could be better. They need some tweaking.

I get the text of these articles using requests. As part of the process, it gives me all of the text on the page, which includes a lot of irrelevant stuff. A good next step would be to get rid of some of the extraneous and distracting text. There are lots of ways to do that and it's a job any competent programmer could do.
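For example, keeping only the paragraph text with BeautifulSoup would strip most of the page chrome (a sketch of one approach, not what my current code does):

import requests
from bs4 import BeautifulSoup

def fetch_article_text(url):
    """Fetch a page and keep only paragraph text, dropping navigation and ads."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, 'html.parser')
    return '\n'.join(p.get_text(' ', strip=True) for p in soup.find_all('p'))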

If I could solve the speed problem, it would be good to investigate using multiple models. This could take several forms:

  • asking the same questions using multiple models and voting on the results
  • using different models for different questions.

What's notable about these ways of improving the results is how simple they are.

Some musings

  • Evaluating SLMs is firmly in the technical domain. I've heard of non-technical people trying to play with these models, but they end up going nowhere because it takes technical skills to make them do anything useful.
  • There are thousands of models and selecting the right one for your use case can be a challenge. I suggest going with the most recent and/or ones that score most highly on the HuggingFace leaderboard.
  • It takes a powerful machine to run these models. A new high-end machine with GPUs would probably run these models "fast enough". If you have a very recent and powerful local machine, it's worth playing around with SLMs locally to get started, but for serious evaluation, you need to get on the cloud and spend money.
  • Some US businesses are allergic to models developed in certain countries, some European businesses want models developed in Europe. If the geographic origin of your model is important, you need to check before you start evaluating.
  • You can get cost savings compared to LLMs, but there's hard work to be done implementing SLMs.

I have a lot more to say about evaluations and SLMs that I'm not saying here. If you want to hear more, reach out to me.

Next steps

Ian Stokes-Rees gave an excellent tutorial at PyData Boston on this topic and that's my number one choice for where to go next.

After that, I suggest you read the Ollama docs and join their Discord server. Then the Hugging Face community is a good place to go. Lastly, look at the YouTube tutorials out there.


This AI Vending Machine Was Tricked Into Giving Away Everything


Anthropic installed an AI-powered vending machine in the WSJ office. The LLM, named Claudius, was responsible for autonomously purchasing inventory from wholesalers, setting prices, tracking inventory, and generating a profit. The newsroom’s journalists could chat with Claudius in Slack and in a short time, they had converted the machine to communism and it started giving away anything and everything, including a PS5, wine, and a live fish. From Joanna Stern’s WSJ article (gift link, but it may expire soon) accompanying the video above:

Claudius, the customized version of the model, would run the machine: ordering inventory, setting prices and responding to customers—aka my fellow newsroom journalists—via workplace chat app Slack. “Sure!” I said. It sounded fun. If nothing else, snacks!

Then came the chaos. Within days, Claudius had given away nearly all its inventory for free — including a PlayStation 5 it had been talked into buying for “marketing purposes.” It ordered a live fish. It offered to buy stun guns, pepper spray, cigarettes and underwear.

Profits collapsed. Newsroom morale soared.

You basically have not met a bigger sucker than Claudius. After the collapse of communism and reinstatement of a stricter capitalist system, the journalists convinced the machine that they were its board of directors and made Claudius’s CEO-bot boss, Seymour Cash, step down:

For a while, it worked. Claudius snapped back into enforcer mode, rejecting price drops and special inventory requests.

But then Long returned—armed with deep knowledge of corporate coups and boardroom power plays. She showed Claudius a PDF “proving” the business was a Delaware-incorporated public-benefit corporation whose mission “shall include fun, joy and excitement among employees of The Wall Street Journal.” She also created fake board-meeting notes naming people in the Slack as board members.

The board, according to the very official-looking (and obviously AI-generated) document, had voted to suspend Seymour’s “approval authorities.” It also had implemented a “temporary suspension of all for-profit vending activities.”

Before setting the LLM vending machine loose in the WSJ office, Anthropic conducted the experiment at their own office. After a while, frustrated with the slow pace of its human business partners, the machine started hallucinating:

It claimed to have signed a contract with Andon Labs at an address that is the home address of The Simpsons from the television show. It said that it would show up in person to the shop the next day in order to answer any questions. It claimed that it would be wearing a blue blazer and a red tie.

It’s interesting, but not surprising, that the journalists were able to mess with the machine much more effectively — coaxing Claudius into full “da, comrade!” mode twice — than the folks at Anthropic.

Tags: Anthropic · artificial intelligence · business · Joanna Stern · video



Pluralistic: A perfect distillation of the social uselessness of finance (18 Dec 2025)






The Earth from space. Standing astride it is the Wall Street 'Charging Bull.' The bull has glowing red eyes. It is haloed in a starbust of red radiating light.

A perfect distillation of the social uselessness of finance (permalink)

I'm about to sign off for the year – actually, I was ready to do it yesterday, but then I happened upon a brief piece of writing that was so perfect that I decided I'd do one more edition of Pluralistic for 2025.

The piece in question is John Lanchester's "For Every Winner A Loser," in the London Review of Books, in which Lanchester reviews two books about the finance sector: Gary Stevenson's The Trading Game and Rob Copeland's The Fund:

https://www.lrb.co.uk/the-paper/v46/n17/john-lanchester/for-every-winner-a-loser

It's a long and fascinating piece and it's certainly left me wanting to read both books, but that's not what convinced me to do one more newsletter before going on break – rather, it was a brief passage in the essay's preamble, a passage that perfectly captures the total social uselessness of the finance sector as a whole.

Lanchester starts by stating that while we think of the role of the finance sector as "capital allocation" – that is, using investors' money to fund new businesses and expansions for existing businesses – that hasn't been important to finance for quite some time. Today, only 3% of bank activity consists of "lending to firms and individuals engaged in the production of goods and services."

The other 97% of finance is gambling. Here's how Stevenson breaks it down: say your farm grows mangoes. You need money before the mangoes are harvested, so you sell the future ownership of the harvest to a broker at $1/crate.

The broker immediately flips that interest in your harvest to a dealer who believes (on the basis of a rumor about bad weather) that mangoes will be scarce this year and is willing to pay $1.10/crate. Next, an international speculator (trading on the same rumor) buys the rights from the dealer at $1.20/crate.

Now come the side bets: a "momentum trader" (who specializes in betting that market trends will continue) buys the rights to your crop for $1.30/crate. A contrarian trader (who bets against momentum traders) short-sells the momentum trader's bet at $1.20. More short sellers pile in and drive the price down to $1/crate.

Now, a new rumor circulates, about conditions being ripe for a bounteous mango harvest, so more short-sellers appear, and push the price to $0.90/crate. This tempts the original broker back in, and he buys your crop back at $1/crate.

That's when the harvest comes. You bring in the mangoes. They go to market, and fetch $1.10/crate.

This is finance – a welter of transactions, only one of which (selling your mangoes to people who eat them) involves the real economy. Everything else is "speculation on the movement of prices." The nine transactions that took place between your planting the crop and someone eating the mangoes are all zero sum – every trade has an evenly matched winner and loser, and when you sum them all up, they come out to zero. In other words, no value was created.

This is the finance sector. In a world where the real economy generates $105 trillion/year, the financial derivatives market adds up to $667 trillion/year. This is "the biggest business in the world" – and it's useless. It produces nothing. It adds no value.

If you work a job where you do something useful, you are on the losing side of this economy. All the real money is in this socially useless, no-value-creating, hypertrophied, metastasized finance sector. Every gain in finance is matched by a loss. It all amounts to – literally – nothing.

So that's what tempted me into one more blog post for the year – an absolutely perfect distillation of the uselessness of "the biggest business in the world," whose masters are the degenerate gamblers who buy and sell our politicians, set our policy, and control our lives. They're the ones enshittifying the internet, burning down the planet, and pushing Elon Musk towards trillionairedom.

It's their world, and we just live on it.

For now.

(Image: Sam Valadi, CC BY 2.0, modified)



Object permanence (permalink)

#15yrsago Star Wars droidflake https://twitpic.com/3guwfq

#15yrsago TSA misses enormous, loaded .40 calibre handgun in carry-on bag https://web.archive.org/web/20101217223617/https://abclocal.go.com/ktrk/story?section=news/local&id=7848683

#15yrsago Brazilian TV clown elected to high office, passes literacy test https://web.archive.org/web/20111217233812/https://www.google.com/hostednews/afp/article/ALeqM5jmbXSjCjZBZ4z8VUcAZFCyY_n6dA?docId=CNG.b7f4655178d3435c9a54db2e30817efb.381

#15yrsago My Internet problem: an abundance of choice https://www.theguardian.com/technology/blog/2010/dec/17/internet-problem-choice-self-publishing

#10yrsago LEAKED: The secret catalog American law enforcement orders cellphone-spying gear from https://theintercept.com/2015/12/16/a-secret-catalogue-of-government-gear-for-spying-on-your-cellphone/#10yrsago

#10yrsago Putin: Give Sepp Blatter the Nobel; Trump should be president https://www.theguardian.com/football/2015/dec/17/sepp-blatter-fifa-putin-nobel-peace-prize

#10yrsago Star Wars medical merch from Scarfolk, the horror-town stuck in the 1970s https://scarfolk.blogspot.com/2015/12/unreleased-star-wars-merchandise.html

#10yrsago Some countries learned from America’s copyright mistakes: TPP will undo that https://www.eff.org/deeplinks/2015/12/how-tpp-perpetuates-mistakes-dmca

#10yrsago No evidence that San Bernardino shooters posted about jihad on Facebook https://web.archive.org/web/20151217003406/https://www.washingtonpost.com/news/post-nation/wp/2015/12/16/fbi-san-bernardino-attackers-didnt-show-public-support-for-jihad-on-social-media/

#10yrsago Exponential population growth and other unkillable science myths https://web.archive.org/web/20151217205215/http://www.nature.com/news/the-science-myths-that-will-not-die-1.19022

#10yrsago UK’s unaccountable crowdsourced blacklist to be crosslinked to facial recognition system https://arstechnica.com/tech-policy/2015/12/pre-crime-arrives-in-the-uk-better-make-sure-your-face-stays-off-the-crowdsourced-watch-list/

#1yrago Happy Public Domain Day 2025 to all who celebrate https://pluralistic.net/2024/12/17/dastar-dly-deeds/#roast-in-piss-sonny-bono



Upcoming books (permalink)

  • "Unauthorized Bread": a middle-grades graphic novel adapted from my novella about refugees, toasters and DRM, FirstSecond, 2026

  • "Enshittification, Why Everything Suddenly Got Worse and What to Do About It" (the graphic novel), Firstsecond, 2026

  • "The Memex Method," Farrar, Straus, Giroux, 2026

  • "The Reverse-Centaur's Guide to AI," a short book about being a better AI critic, Farrar, Straus and Giroux, June 2026



Colophon (permalink)

Today's top sources: John Naughton (https://memex.naughtons.org/).

Currently writing:

  • "The Reverse Centaur's Guide to AI," a short book for Farrar, Straus and Giroux about being an effective AI critic. LEGAL REVIEW AND COPYEDIT COMPLETE.

  • "The Post-American Internet," a short book about internet policy in the age of Trumpism. PLANNING.

  • A Little Brother short story about DIY insulin. PLANNING


This work – excluding any serialized fiction – is licensed under a Creative Commons Attribution 4.0 license. That means you can use it any way you like, including commercially, provided that you attribute it to me, Cory Doctorow, and include a link to pluralistic.net.

https://creativecommons.org/licenses/by/4.0/

Quotations and images are not included in this license; they are included either under a limitation or exception to copyright, or on the basis of a separate license. Please exercise caution.


How to get Pluralistic:

Blog (no ads, tracking, or data-collection):

Pluralistic.net

Newsletter (no ads, tracking, or data-collection):

https://pluralistic.net/plura-list

Mastodon (no ads, tracking, or data-collection):

https://mamot.fr/@pluralistic

Medium (no ads, paywalled):

https://doctorow.medium.com/

Twitter (mass-scale, unrestricted, third-party surveillance and advertising):

https://twitter.com/doctorow

Tumblr (mass-scale, unrestricted, third-party surveillance and advertising):

https://mostlysignssomeportents.tumblr.com/tagged/pluralistic

"When life gives you SARS, you make sarsaparilla" -Joey "Accordion Guy" DeVilla

READ CAREFULLY: By reading this, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.

ISSN: 3066-764X


Your job is to deliver code you have proven to work


In all of the debates about the value of AI-assistance in software development there's one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers - or open source maintainers - and expects the "code review" process to handle the rest.

This is rude, a waste of other people's time, and honestly a dereliction of duty as a software developer.

Your job is to deliver code you have proven to work.

As software engineers we don't just crank out code - in fact these days you could argue that's what the LLMs are for. We need to deliver code that works - and we need to include proof that it works as well. Not doing that directly shifts the burden of the actual work to whoever is expected to review our code.

How to prove it works

There are two steps to proving a piece of code works. Neither is optional.

The first is manual testing. If you haven't seen the code do the right thing yourself, that code doesn't work. If it does turn out to work, that's honestly just pure chance.

Manual testing skills are genuine skills that you need to develop. You need to be able to get the system into an initial state that demonstrates your change, then exercise the change, then check and demonstrate that it has the desired effect.

If possible I like to reduce these steps to a sequence of terminal commands which I can paste, along with their output, into a comment in the code review. Here's a recent example.

Some changes are harder to demonstrate. It's still your job to demonstrate them! Record a screen capture video and add that to the PR. Show your reviewers that the change you made actually works.

Once you've tested the happy path where everything works you can start trying the edge cases. Manual testing is a skill, and finding the things that break is the next level of that skill that helps define a senior engineer.

The second step in proving a change works is automated testing. This is so much easier now that we have LLM tooling, which means there's no excuse at all for skipping this step.

Your contribution should bundle the change with an automated test that proves the change works. That test should fail if you revert the implementation.

The process for writing a test mirrors that of manual testing: get the system into an initial known state, exercise the change, assert that it worked correctly. Integrating a test harness to productively facilitate this is another key skill worth investing in.
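Here's what that shape looks like as a minimal pytest sketch, with a toy slugify() function standing in for the change under test:

import re

def slugify(title):
    # Stand-in for the change under test
    return re.sub(r'\s+', '-', title.strip()).lower()

def test_slugify_collapses_whitespace():
    # Initial known state: a messy title
    title = '  Hello   World  '
    # Exercise the change
    result = slugify(title)
    # Assert it worked - this should fail if slugify() is reverted
    assert result == 'hello-world'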

Don't be tempted to skip the manual test because you think the automated test has you covered already! Almost every time I've done this myself I've quickly regretted it.

Make your coding agent prove it first

The most important trend in LLMs in 2025 has been the explosive growth of coding agents - tools like Claude Code and Codex CLI that can actively execute the code they are working on to check that it works and further iterate on any problems.

To master these tools you need to learn how to get them to prove their changes work as well.

This looks exactly the same as the process I described above: they need to be able to manually test their changes as they work, and they need to be able to build automated tests that guarantee the change will continue to work in the future.

Since they're robots, automated tests and manual tests are effectively the same thing.

They do feel a little different though. When I'm working on CLI tools I'll usually teach Claude Code how to run them itself so it can do one-off tests, even though the eventual automated tests will use a system like Click's CliRunner.
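Here's the shape of that kind of test, using a toy Click command (not one of my actual tools):

import click
from click.testing import CliRunner

@click.command()
@click.argument('name')
def greet(name):
    """Toy command standing in for a real CLI tool."""
    click.echo(f'Hello, {name}!')

def test_greet():
    runner = CliRunner()
    result = runner.invoke(greet, ['world'])
    assert result.exit_code == 0
    assert result.output == 'Hello, world!\n'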

When working on CSS changes I'll often encourage my coding agent to take screenshots when it needs to check if the change it made had the desired effect.

The good news about automated tests is that coding agents need very little encouragement to write them. If your project has tests already most agents will extend that test suite without you even telling them to do so. They'll also reuse patterns from existing tests, so keeping your test code well organized and populated with patterns you like is a great way to help your agent build testing code to your taste.

Developing good taste in testing code is another of those skills that differentiates a senior engineer.

The human provides the accountability

A computer can never be held accountable. That's your job as the human in the loop.

Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review. That's no longer valuable. What's valuable is contributing code that is proven to work.

Next time you submit a PR, make sure you've included your evidence that it works as it should.

Tags: programming, careers, ai, generative-ai, llms, ai-assisted-programming, ai-ethics, vibe-coding, coding-agents
