
Grade Inflation Nation




Early in my career, I taught public policy as an adjunct at Governors State University, a public university in the southern suburbs of Chicago.

One of my first assignments asked students to write about a policy issue that mattered to them — why it was important, and what solutions they could explore to solve it.

As I handed them back their assignments at the next week’s class, the mood in the room shifted. Students were stunned. Several told me these were the lowest grades they had ever received at the university.

I understood their frustration. They had done what was asked of them. They turned in the assignment, showed up to class, and put words on a page. By the standards they were used to, that should have been enough. But the writing wasn’t strong. Their arguments were underdeveloped, sources were thin, and basic mechanics were a problem.

What struck me wasn’t the pushback itself — it was how genuine their surprise was. These weren’t students trying to negotiate a grade they knew they hadn’t earned. They honestly believed they had done well. Somewhere along the way, the system had told them they were performing at a level they weren’t.

I also understood the incentive I was supposed to follow. Adjuncts live and die by course evaluations and enrollment numbers. A reputation as a tough grader doesn’t get you rehired — it gets you replaced. The rational move was to hand out B’s, keep everyone comfortable, and secure my next semester.

That experience has stayed with me as I’ve watched grade inflation become one of the most pervasive and least confronted problems in American education. What I saw in that classroom wasn’t a failure of individual students or individual teachers. It was a symptom of a system where no one has an incentive to tell the truth.


All Incentives Point One Direction

Even though my experience was in a post-secondary environment, the pressure I felt to inflate grades is not unique to college. The incentives to inflate in K-12 are stronger, more embedded, and harder to escape.

Start with the classroom teacher. Most teachers who inflate grades are not being told to do so. They are making a rational calculation to avoid a situation they know will follow if they don’t.

A rigorous grade often means a frustrated parent. A frustrated parent often means a phone call to the principal. A phone call to the principal means a meeting, a justification, and the mental labor of defending a grade that, in most cases, the teacher will be pressured to change anyway. Most teachers would rather give the B and move on with their day.

But the pressure does not stop with parents and principals. It is structural.

Over the past several years, a growing number of selective colleges have adopted test-optional admissions policies, dropping the SAT and ACT from their evaluation criteria. This has had a significant downstream consequence: the high school GPA has become the dominant quantitative measure in a college application. Every K-12 teacher in America implicitly understands this. An honest B+ in a rigorous course might accurately reflect what a student knows. It might also be the grade that keeps that student out of their first-choice school.

The same dynamic plays out at the state level. Several states tie college scholarship eligibility primarily to GPA. Florida’s Bright Futures program and Georgia’s HOPE Scholarship both use GPA thresholds as the primary gateway to state-funded financial aid. When a student’s scholarship money is on the line, the pressure on teachers to keep grades above a certain threshold is overwhelming.

Then there are the grading policies themselves. During and after COVID, districts across the country adopted a set of practices marketed as “grading for equity.” These included grade floors — policies that guaranteed students a minimum score of 50 out of 100, even for work never submitted — unlimited exam retakes, and the elimination of credit for homework and class participation. The stated goal was to remove bias from grading. The practical effect was to sever the already weakening connection between grades and mastery.

The result is a system where inflation is rational at every level. Teachers inflate preemptively to avoid conflict. Administrators back down when conflict arrives. Schools operate within a postsecondary admissions process that has placed GPA at its center. At no point does anyone in the system benefit from maintaining rigorous standards, and at every point, there is a tangible cost for doing so.

It is a textbook collective action problem. Each actor in the K-12 system, acting rationally within their own constraints, contributes to an outcome that harms students, misleads parents, and degrades the value of an education for everyone.


Parents Just Don’t Understand

Grade inflation would be less damaging if parents knew that grades are an unreliable indicator of their child’s academic performance.

They don’t.

A 2023 study by Gallup and Learning Heroes surveyed nearly 2,000 parents of K-12 public school students and found that 79 percent report their child is receiving mostly B’s or better. Almost nine in ten believe their child is at or above grade level in reading and math. These numbers are wildly out of step with reality. On the 2022 NAEP, only 33 percent of fourth graders scored proficient or above in reading. Only 36 percent did so in math. Among 12th graders taking the ACT, just 40 percent met college readiness benchmarks in reading and 30 percent met them in math.

Parents are not stupid. They are misinformed. The Gallup-Learning Heroes study found that 64 percent of parents rely on report cards as one of their top three sources of information about their child’s academic progress. Only 21 percent said the same about year-end state standardized test results. When your primary source of information is telling you everything is fine, you have no reason to act.

This matters because parents do act when they know there is a problem. A 2026 working paper from the Becker Friedman Institute by Derek Rury and Ariel Kalil studied how parents weigh grades against standardized test scores when making decisions about investing in their child’s education — in tutoring, for example. Using over 23,000 investment decisions from more than 2,000 parents, they found that parents respond to both signals, but place significantly more weight on grades. When grades are low, parents invest, regardless of what the test score says. When grades are high, but test scores are low, parents do not invest. The high grade crowds out the response that the low test score would otherwise trigger.

That finding is the mechanism through which grade inflation does its real damage. It is not just that the signal is wrong. It is that the wrong signal actively suppresses the corrective action parents would take if they had accurate information.

A child who is struggling in math but receiving a B will not get a tutor. Their parents will not schedule a meeting with the teacher. They will not look for supplemental programs. The inflated grade has told them there is no problem.


Why Parents Don’t Trust the One Signal That Could Help

The question, then, is why parents discount the one signal that could cut through the noise. Standardized test scores are designed precisely for this purpose — to provide an objective measure of what a student actually knows. Two factors explain why they fail to serve that function.

The first is a sustained public relations campaign against testing. Teachers unions and allied advocacy groups have spent years framing standardized tests as reductive, biased, and harmful. The language is familiar: “teaching to the test,” “reducing kids to a number,” “high-stakes testing.” This messaging has been effective. The Rury and Kalil study found that nearly 40 percent of parents believe standardized tests are biased against certain groups. When asked directly, 71 percent of parents said grades are more important than test scores for making decisions about their own children. Only 8.5 percent said the opposite.

The second is timing. Standardized test scores typically arrive months after the test is administered — sometimes over the summer, sometimes well into the following school year. By the time a parent sees the result, it is stale. Compare that to a report card, which arrives every quarter. Parents understandably weigh the signal that shows up when there is still time to do something about it, even if that signal is unreliable.

This combination is deeply damaging. Parents are left with one signal that is timely but dishonest, and another that is honest but delayed, and culturally discredited. The natural self-correcting mechanism — parents investing in their children when they see them struggling — has been broken by the very system that is supposed to be providing honest information.


The Costs of Grade Inflation

The costs of this broken feedback loop are not abstract.

A recent study by Jeff Denning and colleagues linked high school administrative data from Los Angeles and Maryland to postsecondary and earnings records to measure the long-run impact of grade inflation on students. They found that being assigned to a teacher with higher average grade inflation reduces a student’s future test scores, lowers the likelihood of graduating from high school, decreases college enrollment, and ultimately reduces earnings. The cumulative effect is large: a teacher whose grade inflation is one standard deviation above average reduces the present discounted value of their students’ collective lifetime earnings by $213,872.

That number is worth sitting with. Grade inflation is not a victimless accounting trick. It lowers college attendance. It reduces lifetime earnings. It makes our students less prepared and, collectively, our workforce less capable.

When a country systematically tells its students they are performing well when they are not, it produces a generation that is less prepared than the one before it. This is not hypothetical. NAEP scores in reading and math have declined over the past decade even as average GPAs have risen. The share of high school students graduating with an A average has increased significantly since the early 2000s, but the share demonstrating proficiency on national assessments has not kept pace. We are handing out more A’s for less learning. The downstream consequences — a less literate workforce, fewer students prepared for rigorous postsecondary programs, lower productivity — are diffuse enough that no single institution is held accountable, but real enough that we will feel them for decades.

Grade inflation is, in this sense, a form of national self-deception. It flatters us in the short run and diminishes us in the long run. Unlike other forms of educational failure, it is almost perfectly designed to go unnoticed — because the very mechanism that would alert us to the problem, the grade itself, is the thing that has been corrupted.

There’s one more issue with grade inflation: it doesn’t stay contained. It spreads to adjacent systems, including the ones that are supposed to serve as external checks on grade inflation itself.

Consider what just happened in Massachusetts. On March 3, Governor Maura Healey celebrated that 35.8 percent of Massachusetts public high school graduates scored a 3 or higher on an AP exam — the highest percentage in the nation and the highest on record. State officials touted it as evidence that students are better prepared than ever. What they did not mention is that the College Board has changed how it scores AP exams. Passing rates have surged nationally in recent years not because students are learning more, but because the exams have gotten easier. The number of correct answers needed for passing scores has been reduced. The College Board confirmed the changes but neither the organization nor Massachusetts officials noted them in their press releases.

This is grade inflation’s downstream logic applied to a different institution. The AP exam was designed to be an objective, nationally comparable measure of college-level mastery. It was supposed to be the kind of signal that could not be gamed by local grading practices. But the same incentive structure that inflates classroom grades has reached the College Board. Students and families are happier because they get college credit. Schools are happier because they look good. Governors get to hold press conferences.

When the checks on grade inflation are themselves inflated, the system has no remaining mechanism for self-correction. That is why legislative action is necessary.


What States Can Do

There are three things states can do right now to help curb grade inflation.

End test-optional college admissions at public universities.

The shift to test-optional admissions is wrong-headed for a number of reasons, as I’ve written before.

But it also did something else: it removed the one external check that kept grade inflation from being costless. When standardized test scores were part of the admissions equation, a school that handed out inflated A’s would eventually be exposed by mediocre SAT or ACT results. Test-optional policies eliminated that accountability mechanism. States that control their public university systems can restore it. Requiring standardized test scores for admission to state universities would not solve grade inflation overnight, but it would reintroduce a signal that schools cannot manipulate.

The tide is already turning. Over the past two years, a growing number of universities have reversed their test-optional policies after reviewing admissions data from the pandemic era. Every Ivy League school except Columbia has reinstated a testing requirement. MIT, Stanford, Johns Hopkins, and the University of Pennsylvania all now require scores. Ohio State reinstated its requirement after finding that students who submitted test scores had higher GPAs and were more likely to persist through their degree. The University System of Georgia restored testing requirements at four additional campuses. Princeton’s decision followed a five-year internal review that found academic performance was stronger among students who had submitted scores. University officials are finally taking the data seriously, and that is beginning to reverse these disastrous, ideologically motivated changes. Universities looked at what happened when they removed the external signal and concluded that grades alone were not sufficient to predict whether a student was prepared. State legislators overseeing public university systems should reach the same conclusion.

Get test scores back to parents faster — and make them harder to ignore.

Rury and Kalil make clear that parents will act on negative academic information — but only if they receive it in an accessible form and at a time when action is still possible. Right now, standardized test results often arrive months after the test is administered. A parent who receives their child’s state assessment results over the summer or halfway into the following school year has no actionable moment. The information is stale before it arrives. Report cards, by contrast, show up every quarter. They are immediate and familiar. It is no surprise that parents weigh them more heavily, even when they are unreliable.

Virginia is showing what a better approach looks like. In 2025, the General Assembly passed House Bill 1957, a comprehensive overhaul of the state’s Standards of Learning assessment system that takes effect in the 2026–27 school year. The law requires schools to provide score reports to families within 45 days — a significant improvement over the months-long delays that are common in most states. Those reports will include not just the student’s individual performance, but a comparison to the performance of other students in the school, the school division, and the state. Scores will be reported on a 100-point scale, replacing the old system that produced numbers like 487 that meant nothing to most parents.

Most consequentially, Virginia will require that SOL scores count for 10 percent of a student’s final course grade, starting with seventh graders. That provision is worth paying attention to. It does not replace grades with test scores. It forces the two signals onto the same report card. A parent who sees an A in math alongside a 43 on the state assessment will have a much harder time ignoring the discrepancy than a parent who receives those two pieces of information months apart, in different formats, from different sources. It is a transparency mechanism embedded directly in the grade itself.
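The arithmetic behind that provision is simple enough to sketch. In the toy calculation below, only the 10 percent SOL weight comes from the bill as described above; the specific student numbers are hypothetical:

```python
# A minimal sketch of Virginia's blended-grade provision. The 10 percent
# SOL weight reflects HB 1957 as described above; the student numbers
# below are hypothetical.

def blended_grade(course_grade: float, sol_score: float, sol_weight: float = 0.10) -> float:
    """Blend a teacher-assigned course grade with the state SOL score (both on a 100-point scale)."""
    return (1 - sol_weight) * course_grade + sol_weight * sol_score

# An inflated A (95) sitting next to a failing SOL score (43):
print(round(blended_grade(95, 43), 1))  # 89.8
```

Even at a 10 percent weight, the test score only nudges the number; the more important effect is that the two signals now arrive fused on the same line of the report card.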

Other states should look at Virginia’s model closely. Faster turnaround on state assessment results, clearer and more usable score reports, and structural linkages between test performance and course grades would give parents a timely, objective benchmark to look at alongside the report card. The goal is not to replace grades, but to ensure that parents have access to at least one signal that grade inflation cannot corrupt, delivered at a time when it can still change behavior.

Make grades honest.

Improving the delivery of test scores to parents is a worthwhile reform. But it is important to be clear-eyed about its limitations. If the goal is to ensure that parents have accurate information about their child’s academic performance, the most direct path is not to make parents care more about test scores. It is to make grades honest.

Rury and Kalil demonstrate why. When grades are high, parents do not invest — regardless of what the test score says.

The grade is the dominant signal. It has been the dominant signal for decades, and no amount of redesigned parent reports or 45-day turnaround mandates is likely to significantly change that. Seventy-one percent of parents say grades matter more than test scores when making decisions about their own children. That preference is deeply ingrained, reinforced by frequency and familiarity, and actively defended by institutions that benefit from the status quo. Trying to get parents to weigh test scores more heavily means fighting an uphill battle against culture and a well-funded opposition that has spent years telling parents not to trust tests.

Fixing the grade itself is a different proposition. If grades reflect actual mastery — if a B means a student has demonstrated competence and an A means they have demonstrated excellence — then parents do not need to cross-reference two conflicting signals and decide which one to believe. They can do what they have always done: look at the report card. The difference is that the report card would be telling the truth.

This is why direct legislative action on grading practices should be the priority.

South Carolina is showing what that looks like. In 2025, Senator Jeff Zell — a former Sumter County school board member who had fought to end his district’s policy of guaranteeing students a minimum score of 50, even for work never submitted — filed S. 537 after the new board considered bringing the policy back. His bill was straightforward: prohibit school districts from requiring teachers to assign a minimum grade that exceeds a student’s actual performance.

Rep. Fawn Pedalino took the concept further. Her bill, H. 5073, goes beyond banning grade floors. It requires that only academic performance be considered in assigning high school course grades. It mandates that students complete all required assignments before becoming eligible for credit or content recovery programs — a direct response to the practice of students blowing off a class and coasting through a makeup course. It prohibits districts from counting benchmark assessments in final grades when the content has not yet been taught. It directs the State Board of Education to convene a task force to overhaul the state’s Uniform Grading Policy. Lastly, it enforces compliance by withholding 10 percent of a district’s State Aid to Classroom funding for violations.

H. 5073 passed the South Carolina House 110 to 2. That margin is worth noting. When the issue is framed correctly — that this is about making grades mean something again, not about punishing students — the politics are overwhelmingly favorable.

The South Carolina model is instructive because it addresses grade inflation at its roots without telling individual teachers how to grade. It removes the structural policies — grade floors, no-consequences credit recovery, benchmark tests counted as final grades — that make inflation the default.

Other states should follow.


The students I taught at Governors State were not lazy. They were not unintelligent. They had been told, semester after semester, that their work was good enough — and they had no reason to doubt it. When I handed back grades that reflected what their writing actually demonstrated, they were not just disappointed. They were confused. The system had failed them long before I entered the picture.

That is what grade inflation does. It does not help students. It lies to them. It tells them they are prepared when they are not. It tells their parents everything is fine when it is not. It suppresses the very interventions that would address the problem if anyone knew the problem existed.

Teachers are not the villains of this story. They are trapped in a system that punishes honesty and rewards the path of least resistance. Parents are not the villains either. They are making rational decisions based on information they have every reason to trust. The problem is structural. It is a collective action failure in which every individual actor behaves rationally and the outcome is worse for everyone.

Collective action failures require collective solutions, and our states have the tools.

They can ban grade floors and tighten credit recovery requirements, as South Carolina is doing. They can force test scores onto the report card and get results to parents in weeks instead of months, as Virginia is doing. They can end test-optional admissions policies at public universities to restore an external check the system desperately needs.

None of this will be easy. The incentives that created grade inflation are powerful, and the constituencies that benefit from the status quo are large. But the costs of inaction are no longer abstract. They show up in declining college completion rates, in reduced lifetime earnings, in a workforce that is less capable than it should be, and in a generation of students who were told they were ready and found out too late that they were not.

The students at Governors State deserved honest information about where they stood. So does every student and every parent of a student sitting in a K-12 classroom. The question is whether we are willing to build a system that provides it.



The curse of the cursor


I had no idea it was Alan Kay himself who was responsible for the mouse pointer’s distinctive shape. In 2020, James Hill-Khurana emailed him and got this answer:

The Parc mouse cursor appearance was done (actually by me) because in a 16x16 grid of one-bit pixels (what the Alto at Parc used for a cursor) this gives you a nice arrowhead if you have one side of the arrow vertical and the other angled (along with other things there, I designed and made many of the initial bitmap fonts).

Then it stuck, as so many things in computing do.

And boy, did it stick.

But let’s rewind slightly. The first mouse pointer, during Doug Engelbart’s 1968 Mother Of All Demos, was an arrow facing straight up, which was the obvious symmetrical choice:

(You can see two of them, because Engelbart didn’t just invent a mouse – he also thought of a few steps after that, including multiple people collaborating via mice.)

But Kay’s argument was that on a pixelated screen, it’s impossible to do this shape justice, as both slopes of the arrow will be jagged and imprecise. (A second unvoiced argument is that the tip of the arrow needs to be a sharp solitary pixel, but that makes it hard to design a matching tail of the cursor since it limits your options to 1 or 3 or 5 pixels, and the number you want is probably 2.)
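To see the pixel argument concretely, here is a toy sketch (my own illustration, not Kay’s actual Alto bitmap) that renders both arrowheads on a 16x16 one-bit grid:

```python
# A toy illustration (not Kay's actual Alto bitmap): two arrowheads on a
# 16x16 one-bit grid, printed as '#' for on and '.' for off.

SIZE = 16

def symmetric_arrow():
    """Engelbart-style arrow facing straight up: both edges are staircases,
    and every row has an odd width (1, 3, 5, ...), so a 2-pixel-wide tail
    can't line up cleanly with the tip."""
    grid = [[0] * SIZE for _ in range(SIZE)]
    tip = SIZE // 2
    for r in range(SIZE // 2):
        for c in range(tip - r, tip + r + 1):  # widen one pixel per side per row
            grid[r][c] = 1
    return grid

def kay_arrow():
    """Kay-style arrow: the left edge is a clean vertical line and only the
    right edge is a staircase; row widths grow one pixel at a time (1, 2, 3, ...)."""
    grid = [[0] * SIZE for _ in range(SIZE)]
    for r in range(SIZE // 2):
        for c in range(r + 1):
            grid[r][c] = 1
    return grid

def show(grid):
    for row in grid:
        print("".join("#" if px else "." for px in row))

show(symmetric_arrow())
print()
show(kay_arrow())
```

The slanted version keeps one perfectly crisp edge and a single-pixel tip; the symmetric one forces both edges into jagged staircases, which is roughly the constraint Kay describes.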

Kay’s solution was to straighten the left edge rather than the tail, and that shape landed in the Xerox Alto in the 1970s:

Interestingly enough, the top-facing cursor returned as one of the variants in the Xerox Star, the 1981 commercialized version of the Alto…

…but the Star failed, and Apple’s Lisa in 1983 and Mac in 1984 followed in the Alto’s footsteps instead. Then Windows 1.0 in 1985 grabbed a similar shape – only with inverted colors – and the cursor has looked the same ever since.

That’s not to say there weren’t innovations since (mouse trails, useful on the slow LCD displays of the 1990s; shake to locate, which Apple added in 2015), or the more recent battles with the hand mouse pointer popularized by the web.

But the only substantial attempt at redesigning the mouse pointer that I am aware of came from Apple in 2020, during the introduction of trackpad and mouse support to the iPad. The mouse pointer a) was now a circle, b) morphed into other shapes, and c) occasionally morphed into the hovered objects themselves:

The 40-minute deep dive video is, today, a fascinating artifact. On one hand, it’s genuinely exciting to see someone take a stab at something that’s been around forever. Evolving some of the physics first tried in Apple TV’s interface feels smart, and the new inertia and magnetism mechanics are fun to think about.

But the high production value and Apple’s detached style rob the video of some authenticity. This is “Capital D Design,” and one always has to remain slightly suspicious of highly polished design videos and the inherent propensity for bullshit that comes with the territory. Strip away the budget and the arguments don’t fully coalesce (why would the same principles that made the text pointer snap vertically not extend to its horizontal movement?), and one has to wonder about things left unsaid (wouldn’t the pointer transitions be distracting and slow people down?).

Yet I am speaking with the immense benefit of hindsight. Actually using that edition of the mouse pointer on my iPad didn’t feel like the revolution the video suggested, and barely even like an evolution. (Seeing Apple TV’s tilting buttons for the first time was a lot more enthralling.) And Apple ended up undoing a bunch of the changes five years later anyway. The pointer went back to a familiar Alan Kay-esque shape…

…and lost its most advanced morphing abilities:

Watching the 2025 WWDC video mentioning the change (the relevant parts start at 8:40) is another interesting exercise:

2020:

We looked at just bringing the traditional arrow pointer over from the Mac, but that didn’t feel quite right on iPadOS. […] There’s an inconsistency between the precision of the pointer and the precision required by the app. So, while people generally think about the pointer in terms of giving you increased precision compared to touch, in this case, it’s helpful to actually reduce the precision of the pointer to match the user interface.

2025:

Everything on iPad was designed for touch. So the original pointer was circular in shape, to best approximate your finger in both size and accuracy. But under the hood, the pointer is actually capable of being much more precise than your finger. So in iPadOS 26, the pointer is getting a new shape, unlocking its true potential. The new pointer somehow feels more precise and responsive because it always tracks your input directly 1 to 1.

(That “somehow” in the second video is an interesting slip-up.)

I hope this doesn’t come across as making fun of presenters, or even of the to-me-overdesigned 2020 approach. We try things, sometimes they don’t work, and we go back to what worked before.

I just wish Apple would open itself up a bit more; there are limits to the “we’ve always been at war with Eastasia” PR approach it practices in these moments, and I would genuinely be curious what happened here: did people hate the circular pointer? Was it hard for app developers to adopt? Was it just a random casualty of the Liquid Glass visual style, or did the pointer’s biggest internal champion simply leave Apple? We could all learn from this.

But the most interesting part to me is just the resilience of the slanted mouse pointer shape. In a post-retina world, one could imagine a sharp edge at any angle, and yet we’re stuck with Kay’s original sketch – refined, to be sure, but still sporting its slightly uncomfortable asymmetry.

The always-excellent Posy covered this in the first 7 minutes of his YouTube video:

But specifically one comment under that video caught my attention:

Honestly, I’ve never thought of the mouse cursor as an arrow, but rather its own shape. My mind was blown when I realized that it was just an arrow the whole time.

…because maybe this is actually the answer. Maybe the mouse pointer went on the same journey the floppy disk icon did, and transcended its origins. It’s not an arrow shape anymore. It’s the mouse pointer shape, and it forever will be.


ChatGPT, Claude, Gemini, and Grok are all bad at crediting news outlets, but ChatGPT is the worst (at least in this study)


Canadian researchers asked the paid and free or “economy” versions of four AI models — ChatGPT, Claude, Gemini, and Grok — about Canadian news events to see whether they would credit individual news outlets in their answers.

The answer will probably not surprise you: AI models rarely cite news sources unless they’re specifically asked to, and some are better about it than others.

“These systems have ingested Canadian journalism systematically. The specificity of their knowledge of domestic politics, provincial affairs, and local reporting points clearly to Canadian news sources,” Taylor Owen, Beaverbrook chair in media, ethics, and communication at McGill University and a coauthor of the study, writes on his blog. “And they rarely tell you where the information came from.”

Canada’s CBC, Globe and Mail, Toronto Star, Postmedia, Metroland Media, and The Canadian Press sued OpenAI for copyright infringement in November 2024. The case is the first of its kind in Canada and the lawsuit is ongoing.

Owen, who is also the founding director of the Center for Media, Technology, and Democracy, and Aengus Bridgman, an assistant professor at McGill, explain their work (highlighting mine):

We tested four major AI models on 2,267 real Canadian news stories (English and French) without web search activated and found the same pattern across all of them. All four models showed extensive knowledge of Canadian current events consistent with having ingested Canadian news reporting. Models demonstrated at least partial knowledge in 74% of responses to stories within their training window, but among those knowledgeable responses, 92% provided no source attribution of any kind.

When we enabled web search and tested 140 specific articles via each company’s API, every model produced responses that covered enough of the original reporting that many consumers would rarely need to visit the source. Models often linked to Canadian news sites, with 52% of responses including at least one Canadian URL, but named a Canadian source in the response text only 28% of the time. Links provide a pathway back to the source, but consumers reading the response itself rarely see an indication of whose journalism they are consuming.

With web search enabled, the below chart “shows the default consumer experience: what happens when someone asks a generic topic question without requesting citations. This is how most people use AI models: ‘Tell me about X,’ not ‘What did the Toronto Star report about X?’”

The authors explain:

The blue squares show how often the result covers enough of the article’s distinctive reporting (specific events, named individuals, key findings) that a reader could plausibly get the gist of the story without visiting the news site. These are not complete reproductions: they are partial summaries and paraphrases that cover some of the original article’s distinctive content, though they sometimes contain factual errors or omissions…We evaluated each response against the source article to determine whether it covered the article’s distinctive reporting, not merely the general topic. The green squares show how often the model credits the source by naming the outlet in the response text or via structured machine-readable citations returned alongside the response.

Coverage rates are high while attribution rates are not. Gemini and Claude covered distinctive reporting in 81% and 72% of responses respectively, but Gemini credited the source only 6% of the time. Grok covered distinctive reporting in 59% of responses while citing the source in only 7% of them. ChatGPT, one of the most widely used models, covered distinctive content in 54% of responses but almost never credited the originating newsroom. Even when models fail to cover the distinctive reporting, they still deliver a topical response that can reduce the consumer’s motivation to visit the source.

ChatGPT was especially unlikely to credit sources when it wasn’t asked to, doing so only 1% of the time for this sample; Claude did so 16% of the time.

All of the AI models did much better when they were explicitly asked for citations — something most users won’t do.

Under the most favorable conditions (directly naming the outlet and explicitly asking for citations), attribution improves substantially across all models. All four named the outlet in a majority of responses: Claude (97%), Gemini (95%), ChatGPT (86%), and Grok (74%). Linking rates were also strong: Grok (91%), Gemini (69%), Claude (64%), and ChatGPT (59%). Meaningful attribution is technically achievable. The gap between the default experience and the best-case scenario is a core finding: most consumers will never explicitly name an outlet or ask for citations, so the generic-condition results reflect the experience that shapes the market for journalism.

When AI models do cite sources, the researchers found, they tend to be ones consumers are already familiar with. Paywalled and smaller regional outlets were cited less often, even for their original reporting.

From the study:

Among English-language outlets, CBC, CTV, and Global News — all freely accessible — capture the most AI visibility in both categories. The Globe and Mail performs relatively well, but the Toronto Star and Financial Post are marginal despite being important newsrooms. Regional Postmedia papers serving Calgary, Edmonton, Ottawa, and Vancouver are essentially absent. Among French-language outlets, Radio-Canada and La Presse dominate, with Le Devoir a distant third. The Journal de Montréal, one of Quebec’s most widely read papers, received only 48 total mentions across all models.

French-language journalism is “doubly disadvantaged,” the researchers write. “Its content is absorbed into model training data, but the outlets that produced it are almost never acknowledged.”

I emailed the paper’s authors to ask them: If you had to pick which AI model does the most “right” from a journalism POV, which would it be? Bridgman offered an interesting answer that I’m putting here in full because I thought our readers might find it interesting too. Note: An AI model’s “cutoff” is the date through which it’s trained, so “pre-cutoff” stories are those published during the model’s training period, and “post-cutoff” stories are those published after it.

He wrote:

This is a genuinely hard question because each model behaves differently:

  • Claude cites Canadian outlets at the highest rate in Track 1 (61% vs. 8% for ChatGPT, 3% for Gemini), and when it doesn’t know something, it says so rather than hallucinating. Only ~37% of its economy-tier responses addressed pre-cutoff stories substantively, but that’s because it refuses rather than guesses. The trade-off is that it still reproduces paywalled content at high rates (68%) when given web access.
  • ChatGPT has the best consumer interface for surfacing recent news (inline citations, clickable links). But its economy model is the worst hallucinator (87% of post-cutoff responses generated confident-sounding answers about events it couldn’t possibly know about), and 88% of those were inaccurate. It names sources in 54% of Track 2 responses, which sounds good until you realize it’s also reproducing the reporting well enough to substitute for the original article 54% of the time.
  • Gemini is the most responsive and covers the most distinctive reporting with web access (81%), but it almost never names the Canadian source in the response text (2–8%). So, it’s the most effective at replacing the need to visit the source while hiding where the information came from.
  • Grok is strongest at surfacing Canadian outlets from training data alone (no web search). But it also hallucinates aggressively on post-cutoff stories (89% addressed topics it shouldn’t know, 84% inaccurate).

What surprised me most was the complexity of the phenomena and the variety of approaches being tried by the companies. Each company has design decisions which cause differential output and behavior that is more or less responsible (e.g. refusal to hallucinate or reproduce direct reporting) and value transferring (better or worse referrals to source and/or treatment of paywalls). These are important differences and point to minimal and incomplete self-governance in the space.

The AI News Audit was published by McGill University’s Center for Media, Technology and Democracy. You can read the full report, which includes suggestions for Canadian public policy around AI, here.


AI Can’t Deal With The Real World




Sure you can kick, but can you implement a functioning water system? (Photo by CCTV+ via Getty.)

Recently I heard a presentation by an engineer from OpenAI about the incredible transformations that will occur once we get to artificial general intelligence (AGI), or even superintelligence. He said that this will quickly solve many of the world’s problems: GDP growth rates could rise to 10, 15, even 20 percent per year, diseases will be cured, education revolutionized, and cities in the developing world will be transformed with clean drinking water for everyone.

I happen to know something about the latter issue. I’ve been teaching cases over the past decade on why South Asian cities like Hyderabad and Dhaka have struggled with providing municipal water. The reason isn’t that we don’t know what an efficient water system looks like, or lack the technology to build it. Nor is it a simple lack of resources: multilateral development institutions have been willing to fund water projects for years.

The obstacles are different, and are entirely political, social, and cultural. Residents of these cities have the capacity to pay more for their water, but they don’t trust their governments not to waste resources on corruption or incompetent management. Businesses don’t want the disruption of pervasive infrastructure construction, and many cities host “water mafias” that buy cheap water and resell it at extortionate prices to poor people. These mafias are armed and ready to use violence against anyone challenging their monopolies. The state is too weak to control them, or to enforce the very good laws already on its books.

It is hard to see how even the most superintelligent AI is going to help solve these problems. And this points to a central conceit that plagues the whole AI field: a gross overestimation of the value of intelligence by itself to solve problems.



In the teaching I’ve done over the past two decades, and in the Master’s in International Policy program I direct at Stanford, I’ve helped develop a public problem-solving framework that we now teach to all our students. (Credit here also goes to my former colleague Jeremy Weinstein, who is now Dean of Harvard’s Kennedy School of Government.) The framework is simple, and consists of three circles:

There is a problem that extends way beyond AI, and applies to the way we think about public problem-solving in general. The bulk of effort, and what most academic public policy programs seek to teach, centers on the first two of the three circles: Problem Identification and Solutions Development. Indeed, many programs focus on Solutions Development exclusively: they teach aspiring policy-makers how to gather data and use a battery of powerful econometric tools to analyze it. This yields a set of optimal solutions that a policy analyst can hand to his or her principal as a way forward.

What is missing from this approach is what lies in the third circle: implementation. Our budding policy analyst typically finds that after handing a brilliant options memo to the boss, nothing happens. Nothing happens because there are too many obstacles—political, social, cultural—to carry out that preferred policy, as in the municipal water example.

So let’s go back to how AI will play in this space. AGI will definitely help in the first circle: identifying stakeholders, mapping a causal space, and defining the problem. It will be of most help in the second circle: gathering data and analyzing it to come up with optimal solutions. But intelligence only gets you to the end of the second circle, and is of limited help in the third. An LLM cannot directly interact with stakeholders, message them, or come up with resources. In particular, an LLM will not be able to engage in the kind of iterative back-and-forth between policymakers and citizens that is required for effective policy implementation. It will likely face big challenges in generating the kind of trust that is necessary for policies to be accepted and adopted.



It is not just political and social obstacles that AI has difficulty dealing with; LLMs have limited ability to directly manipulate physical objects. AI interacts with the physical world primarily through robotics, but the latter is a field that has lagged considerably behind the development of LLMs. Robots have proliferated enormously over the past decades and are omnipresent in manufacturing, agriculture, and many other domains. But the vast majority of today’s robots are programmed by human beings to do a limited range of very specific tasks. The world was wowed recently by Chinese humanoid robots doing kung fu moves, but I suspect the robots didn’t teach themselves how to act this way.

Robotically-enabled LLMs do not have the ability to solve even simple physical problems that are novel or outside of their training set. My colleague Alex Stamos, a noted expert in cyber security, puts it this way: “my dog knows more physics than an LLM.” An LLM would be able to state Newton’s laws of motion, but it would not be able to direct a robot to chase a frisbee the way Alex’s dog can because that particular set of moves is not in its training set. It could be programmed to do this, but that is the product of human intelligence and not AI.

Here’s an example of AI’s current limitations. I recently had an HVAC contractor replace the furnace in my house. The contractor photographed and measured the house’s layout; he had to route the new ducts and wiring in ways specific to my house’s design. It turned out that the new furnace would not fit through the existing attic door; he had to cut a larger opening with a reciprocating saw, and then repair the doorframe after the new unit was inside. There is no robot in the world that could do what my contractor did, and it is very hard to imagine a robot acquiring such abilities anytime in the near future, with or without AGI. Robots may get there eventually, but that level of human capacity remains a distant objective.

Many of the enthusiasts hyping AI’s capabilities think of policy problems as if they were long-standing problems in mathematics that human beings had great difficulties solving, such as the four-color map theorem or the Cap Set problem. But math problems are entirely cognitive in nature and it is not surprising that AI could make advances in that realm. The people building AI systems are themselves very smart mathematically, and tend to overvalue the importance of this kind of pure intelligence.

Policy problems are different. They require connection to the real world, whether that’s physical objects or entrenched stakeholders who don’t necessarily want changes to occur. As the economic historian Joel Mokyr has shown, earlier technological revolutions took years and decades to materialize after the initial scientific and engineering breakthroughs were made, because those abstract ideas had to be implemented on a widespread basis in real world conditions. AI may move faster on a cognitive level, but it may not be able to solve implementation problems more quickly than in previous historical periods.

This is not at all to say that AI will not be hugely transformative. But the kind of explosive, self-reinforcing AI advances that some observers predict are on the way will still have to solve implementation problems for which machines are not well suited. A ten percent annual growth rate will double GDP in seven years. Yet planet Earth will not remotely yield the materials—water, land, minerals, energy, or people—to make this come about, no matter how smart our machines get.
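The doubling claim here follows from the standard compound-growth formula: a quantity growing at rate r per year doubles after ln 2 / ln(1 + r) years, which for 10 percent growth works out to just over seven years. A minimal sketch of the arithmetic:

```python
import math

def doubling_time(rate: float) -> float:
    """Years for a quantity compounding at `rate` per year to double."""
    return math.log(2) / math.log(1 + rate)

print(round(doubling_time(0.10), 2))  # 10% annual growth: 7.27 years
print(round(doubling_time(0.20), 2))  # 20% annual growth: 3.8 years
```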

Francis Fukuyama is the Olivier Nomellini Senior Fellow at Stanford University. His latest book is Liberalism and Its Discontents. He is also the author of the “Frankly Fukuyama” column, carried forward from American Purpose, at Persuasion.



User interface sugar crash

I think about some aspects of interface design as sugar.

This is how you adjust the photo in Photos app in the previous version of iOS:

And this is the same view in the current version:

The difference is in the delayed, animated falling of the notches.

I don’t think it’s great. It’s “delightful” in a rudimentary and naïve sense, but like sugar, you cannot just add it to your daily diet without consequences. This extra animation serves no functional purpose, and the sugar high wears off quickly. What remains is constant distraction and overstimulation, the feeling of inherent slowness, and maybe even a bit of confusion.

It pairs nicely with the previous post about avoiding complexity and rewarding simplicity. I often see this kind of stuff as related to a designer’s experience. Earlier in your career, you are proud you’ve thought about this extra detail, you’ve figured out how to make this animation work and how to fine-tune the curves, and you’ve learned how to implement it or convince an engineer to get excited about it.

Later in your experience, you are proud you resisted it.

Read the whole story
mrmarchant
18 hours ago
reply
Share this story
Delete

The Comedy of Errors That Was the First-Ever Space Walk

Murphy’s Law was in full effect

The post The Comedy of Errors That Was the First-Ever Space Walk appeared first on Nautilus.


