Signal vs Noise: How We Decide What Actually Matters
This is the current version. View full version history →
What’s New
| Date | Update |
|---|---|
| 27 Mar 2026 | The Atlantic and Futurism report the NYT has published AI-generated opinion pieces without disclosure, extending the signal-degradation pattern from HN comments and academic peer review to mainstream journalism. |
| 26 Mar 2026 | ARC-AGI-3 released (421 HN points) – a live case of the benchmark filter and practitioner signal heuristic operating just below the 500-point threshold, with comment quality carrying the signal weight. |
| 25 Mar 2026 | OpenAI shuts down Sora (780 HN points) despite a reported $1B Disney deal; live case of the announcement-gap pattern and the ‘so what’ test operating in reverse. |
| 19 Mar 2026 | ICML desk-rejects 2% of submissions after catching reviewers using LLMs to write peer reviews, extending signal-degradation from HN to academic sources. |
| 12 Mar 2026 | HN community pushback on AI-generated comments (3,701 points) is a live data point on practitioner signal degradation at the platform level. |
Version History
Full changelog and snapshots →
The Problem: There Is Too Much
Let’s start with a moment that captures the current situation better than any graph could.
Gergely Orosz – The Pragmatic Engineer – wore a Gemini 3 sweater to a conference talk. By the time he was done speaking, Gemini 3.1 Pro had dropped. The sweater was already outdated. He posted about it with the weary good humour of someone who has accepted that this is just what the industry is like now.
That story is funny. It is also a pretty accurate description of what it feels like to try to follow AI news in 2026.
A model that was genuinely frontier-class six weeks ago is mid-tier today. Capabilities that would have been headline news eighteen months ago now get a footnote in a changelog. The pace is not slowing down. If anything, it is accelerating, and the coverage ecosystem has scaled to match it in volume but not in quality.
Consider what you would need to follow, if you were trying to follow everything:
- Newsletters (The Pragmatic Engineer, TLDR, Import AI, The Batch, Stratechery, and dozens more)
- Podcasts (Latent Space, Lex Fridman, Dwarkesh, The TWIML Podcast, and on and on)
- Preprint servers (arXiv gets dozens of relevant papers a day)
- Twitter/X (researchers, engineers, founders, journalists, all posting in real time)
- Hacker News (aggregating from all of the above, plus things they missed)
- Discord servers (model-specific communities, research groups, indie hacker spaces)
- YouTube (demos, teardowns, conference talks, explainers)
- Company blogs, earnings calls, researcher Substacks
Nobody follows all of this. Anyone who tells you they do is either lying or not sleeping. The ecosystem is too large, too fast, and too unevenly distributed in quality.
This creates two failure modes, and most people end up in one of them.
Failure mode one: FOMO-driven reading. You try to keep up. You add another newsletter, follow another researcher, bookmark more tabs. You feel perpetually behind. Every week there is something you missed and you spend more time catching up than actually thinking. The reading becomes the job, and the thinking gets squeezed out.
Failure mode two: tuning out entirely. The volume overwhelms you. You stop trying. You catch things second-hand, in team meetings or through colleagues, always slightly late. You lose the thread. When something genuinely important happens, you hear about it but don’t have enough context to know why it matters.
Both of these are understandable responses to a genuinely difficult situation. Neither is particularly useful.
The third path – the one this blog tries to embody – is to build a filter and be explicit about it. That means being honest about what we read, how we decide what matters, and what we’re probably missing.
The Sources: What We Actually Read and Why
A filter is only as good as its inputs. Here is what actually goes into the reading pile, and why each source earns its place.
Hacker News
HN remains the most reliable real-time signal for practitioner reaction to anything in the tech space. Not because the posts are always good – many of them are just links to press releases – but because the comments often are.
The heuristic here is simple: if something gets 500+ points and the comment thread has substantive technical discussion, it matters. If it gets 500+ points and the comments are mostly “this is huge” or “AI is going to change everything,” that’s a different signal. High points with low-quality comments usually means something was interesting to a general audience but didn’t move the technical needle.
The inverse is also useful. When something drops and HN barely notices, or when the comments are mostly sceptical practitioners pointing out limitations, that tells you something too. Absence of practitioner enthusiasm is data.
We treat HN as a filter and aggregator, not a source. It surfaces things from everywhere else. Its real value is the community reaction, not the original content.
A notable development in March 2026 sharpens this point further. A post titled ‘Don’t post generated/AI-edited comments. HN is for conversation between humans’ reached 3,701 points and over 1,300 comments – the highest-engagement story on HN that day. The practitioner community was not reacting to a model release or a product launch; it was reacting to the degradation of its own signal quality. AI-generated content is now noisy enough inside HN itself that the community has had to explicitly defend the human-conversation signal that makes it useful as a filter in the first place. That is worth noting: the tool we use to filter noise is itself becoming a target of the noise problem.
A 14 March 2026 example reinforces the heuristic directly. The announcement of 1M context windows going generally available for Opus 4.6 and Sonnet 4.6 reached 804 points and 315 comments within hours, with discussion that went well beyond ’this is huge’ into specific application categories, pricing economics, and real use-case implications. That is the pattern: high points, technically substantive comments, a genuine capability shift that unlocks new workflows. Compare it to the kind of model release that lands with 200 points and comments mostly restating the press release – same category of news, completely different signal quality.
A 21 March 2026 example adds further texture to the heuristic. OpenCode, an open-source AI coding agent, reached 882 points and 402 comments within 15 hours – exceeding the 500-point threshold with comment volume that indicates genuine practitioner evaluation rather than general-audience enthusiasm. The open-source framing is part of the signal: when the community drives high engagement around a non-commercial tool, it typically reflects active testing by people who build things, not marketing reach. Same heuristic, same pattern, different context – which is exactly how the filter is supposed to work.
A 24 March 2026 example adds a different texture to the heuristic. A developer demonstrated an iPhone 17 Pro running a 400B parameter Mixture of Experts model entirely on-device – a headline that sounds like a clear capability breakthrough. The story reached 641 points and 284 comments on HN. The comment thread is not celebrating. It is interrogating what 0.6 tokens per second actually means in practice: a generation speed so slow that a meaningful response takes roughly 30 seconds to appear. The technique is real and technically interesting – the model streams weights from SSD rather than loading into RAM, based on Apple’s own LLM in a Flash research – but the gap between ’this technically runs’ and ’this is usable’ is exactly where the practitioner filter operates. High engagement here signals genuine technical significance. Substantive comments are the mechanism that stops the headline claim from becoming the conclusion. The ‘so what’ test and the practitioner signal heuristic are not always independent: sometimes a story passes the engagement threshold precisely because practitioners are gathering to apply the filter in public.
A 25 March 2026 example adds the sharpest illustration yet of the announcement-gap pattern. OpenAI’s Sora shutdown reached 780 points and 573 comments – the top story on HN that day. The practitioner thread is not celebrating and is not shocked. It interrogates why a text-to-video product with no clear use case beyond social novelty could not hold adoption, and why the Disney deal – which looked like enterprise validation – turned out not to be. The launch was covered as a major capability milestone. The partnership was covered as proof of real-world traction. The shutdown is the outcome neither story prepared anyone for. High engagement here is not enthusiasm for the technology – it is practitioners making sense of an outcome the hype cycle did not predict.
A 26 March 2026 example adds a nuance to the threshold heuristic. ARC-AGI-3 was released on arcprize.org and reached 421 points and 265 comments on HN – just below the 500-point threshold used as the primary filter, but with a comment thread that is substantively technical: practitioners debating whether the new evaluation tests something models genuinely cannot do, what the architecture of the tasks implies about reasoning versus retrieval, and whether iterating the benchmark is itself a signal about genuine capability progress. ARC-AGI has been a reliable calibration instrument precisely because it was designed to resist pattern-matching: performance on it is not easily gamed by scale alone. The practitioner community treating a new version’s release as worth 265 comments of serious discussion suggests the capability question it tests is still genuinely open. The point count falling short of 500 is worth noting: the threshold is a heuristic, not a rule. When the discussion is substantive and the subject is a trusted evaluation instrument, quality can carry the signal even when the number does not.
Simon Willison’s Weblog (simonwillison.net)
Simon Willison’s blog has one of the highest signal-to-noise ratios of any single source in the space. Every post is either an experiment he has actually run, an observation from hands-on work, or a careful synthesis of something he has read. There is no hype. There are no predictions dressed as analysis.
What makes it particularly useful is the methodology on display. He shows his working. When he tests something, he tells you how. When he is uncertain, he says so. When a previous finding gets updated by new evidence, he notes it.
The hit rate on “things Simon wrote about that turned out to matter” is high enough that this blog treats it as a primary input. If he is paying attention to something, we probably should be too.
The Pragmatic Engineer (Gergely Orosz)
The Pragmatic Engineer is valuable for a specific reason: it takes an engineering leadership perspective on things that most AI coverage treats as purely technical. Gergely’s sourcing is unusually rigorous for the newsletter space – he checks things, quotes people, and is explicit when something is uncertain.
The sweater story is instructive here too. Someone who can laugh at themselves for being slightly behind on a news cycle is more trustworthy than someone who projects certainty. The Pragmatic Engineer is honest about what it doesn’t know. That’s rarer than it should be.
It is particularly useful for anything that touches engineering organisations, hiring, tooling adoption, and the gap between what vendors claim and what teams are actually experiencing.
Latent Space (swyx)
Latent Space earns its place through depth. The podcast and the writing both go substantially further into technical territory than most general-audience AI coverage. swyx has a good instinct for synthesis – for finding the thread that connects several things that look separate on the surface.
The practical value here is for topics where you need to actually understand what is happening, not just that something happened. When a new architecture drops, when there is a shift in how people are approaching a particular problem, Latent Space is usually one of the first places to engage with it seriously.
arXiv and HuggingFace Papers
We do not read everything on arXiv. Nobody does, nobody should, and anyone recommending you try is setting you up for failure mode one.
The approach instead is to read what the practitioner community is already discussing. If a paper is getting attention on Twitter/X from researchers whose work we respect, or if it surfaces on HN with a strong comment thread, that is the filter. The paper gets read. Everything else gets skimmed at best.
HuggingFace’s paper discussion threads are underrated here. They often surface practitioner reaction faster than anywhere else, and the people commenting are frequently the ones who have actually tried to replicate or apply the work.
A March 2026 finding from ICML adds a cautionary note to academic sources as inputs. ICML’s official blog revealed that 2% of submitted papers were desk-rejected after reviewers – who had explicitly opted into a no-LLM policy – were caught using LLMs to write reviews. Detection used prompt injection embedded invisibly in paper PDFs: any reviewer who fed the PDF to an LLM rather than reading it themselves would include two specific long phrases in their review. The implication for the practitioner filter: academic peer review, like HN comment threads, is not immune to AI-generated content degrading its signal quality. Reading papers that have survived review is not the same guarantee it once was.
A 27 March 2026 development extends this signal-degradation pattern beyond specialist platforms entirely. Reports in The Atlantic and Futurism document that the New York Times has been publishing AI-generated opinion pieces without disclosure, including in its ‘Modern Love’ column – a column that has historically traded on personal, human-authored narrative. The dynamic is the same as HN comments and ICML peer reviews: AI-generated content entering a trusted editorial layer without disclosure, degrading the signal quality that made that layer worth reading. The difference is the reach. HN and ICML are practitioner-specific. The Times is a publication many readers use as a baseline for context around AI news specifically. The signal-degradation problem is no longer confined to the ecosystems practitioners inhabit.
Primary Sources
For anything significant – a major model release, a company shift, an acquisition – we go to the primary source. The company blog. The earnings call transcript. The researcher’s own post or thread.
The intermediary layer (news articles, newsletter summaries) often introduces errors, strips context, and adds framing that wasn’t in the original. Reading the press release is faster than reading three articles about the press release, and usually more accurate.
This is especially true for earnings calls, which are dense but contain information that rarely makes it into coverage intact.
Absence as Signal
One thing that doesn’t get discussed enough: when something is not being discussed in the practitioner community, that is also information.
If a company announces a new model and the HN thread is thin, the Twitter/X engineering community is mostly quiet, and Simon Willison doesn’t write anything about it – that tells you something. Either the model isn’t doing anything new, or the access is too limited for anyone to have tested it, or the target audience isn’t practitioners at all.
A lot of AI coverage treats all announcements as equally significant. The practitioner community’s reaction is a decent proxy for whether something actually is.
A 23 March 2026 example makes the inverse case concrete. Walmart disclosed that its ChatGPT-powered checkout flow converted at roughly 3x worse than its standard website. The story reached 126 points on HN with 92 comments, the bulk of which were practitioners evaluating why conversational AI interfaces fail at transactional tasks rather than celebrating the technology. The announcement layer for AI-first checkout was enthusiastic. The outcome data was not. That gap – between what gets announced and what the conversion numbers show – is exactly what the ‘so what’ test is designed to surface.
The Filters: How We Decide What Makes It
Reading widely is necessary but not sufficient. Everything that gets read still has to pass some filters before it becomes something on this blog.
The “so what” test. Does this change anything for someone building with or thinking about AI? A benchmark result alone does not pass this test. A benchmark result that crosses a capability threshold that unlocks a new class of applications, with a cost implication that changes the economics, does. The question is always: so what does someone actually do differently as a result of this?
The practitioner signal. Is the engineering community actually using this, reacting to this, or building with this? Or is the reaction mostly from the media and marketing layer? PR dressed as news is the majority of AI coverage. The practitioner filter cuts most of it.
The “still true in six months” test. Will this matter beyond the news cycle? Some things are genuinely durable: a shift in what is economically possible, a new architectural approach that is getting adopted, a change in how a major platform works. Other things are interesting for a week and then irrelevant. The specific benchmark score for a model that will be superseded next month fails this test. The underlying capability improvement that drove that score might pass it.
The first-principles test. Does this represent something genuinely new, or is it an iteration on existing patterns? Both can be worth covering, but they are different things. Iteration is often more practically useful than novelty – something that makes an existing approach 40% cheaper is often more significant than something that does a new thing badly. But they need to be distinguished.
Negative filtering. Some things we explicitly do not cover, or cover only rarely:
- Most crypto/web3 AI. The intersection of these two hype cycles does not produce much that matters for practitioners.
- “AI will do X” predictions without evidence. Predictions are not news. They are content.
- Benchmark announcements without context. A number on a leaderboard is not analysis.
- Product launches from companies with no track record. Launches are cheap. Shipped products that people use are not.
This list is not exhaustive and it is not absolute. Exceptions happen. But the default is to skip it.
A March 2026 finding from METR adds empirical grounding to the benchmark filter: their analysis showed that many PRs passing the SWE-bench benchmark would not actually be merged into a real codebase. The practitioner community surfaced this immediately (248 points, 128 comments on HN). A benchmark score that does not survive contact with real engineering standards is exactly what the ‘so what’ test is designed to catch.
The Mistakes We Make and How We Handle Them
The pace means we sometimes get things wrong. This section exists to be honest about that.
We update when we’re wrong. The living document model on this blog is not just a formatting choice. It exists because the field moves fast enough that something that was accurate at publication may need revision three months later. When we update a post, the changelog shows what changed and why. Not all at once, and not always immediately, but it happens.
AI-generated content can be confidently incorrect. This blog uses AI tools in parts of its production process. That creates a specific failure mode: content that sounds authoritative but contains errors that a human expert would catch. The curation and editing step exists to catch these. It does not catch everything. If you find something factually wrong, the feedback mechanism at the bottom of the post exists for exactly this.
We have biases, and we know some of them. The coverage here skews heavily toward the Western, English-language AI ecosystem. Chinese labs – Baidu, DeepSeek, Moonshot, Zhipu, and others – do genuinely significant work that gets less attention here than it deserves. We are actively trying to correct this, but the language barrier and the different publication culture mean we are probably still under-covering it.
We also likely over-index on practitioners relative to researchers. The practitioner framing is deliberate – this blog is for people building things, not for academics – but it means we probably miss things that are significant in the research community but haven’t yet surfaced as practical tools or techniques.
The changelog is the accountability mechanism. When a post gets a significant update, the date, the version link, and a brief note on what changed appear in the changelog table at the top. This is not perfect but it is better than pretending posts are static documents in a field that is not.
What We’re Not
It is probably useful to be explicit about this.
Not a news service. We do not cover everything. We cover what we think matters, at a weekly cadence that allows for some reflection rather than immediate reaction. Things fall through. That is a feature, not a bug. If something genuinely matters, it will still matter next week.
Not objective. We have a point of view. We think that is more honest than performing false balance. “On one hand, this model is impressive; on the other hand, some people are sceptical” is not analysis. It is hedging dressed as journalism. We make calls. Sometimes they are wrong. See above.
Not comprehensive. The scope is deliberately narrow: what matters for senior engineers and technical leaders thinking about how AI affects their work and the systems they build. That excludes a lot. Policy, ethics at a philosophical level, consumer applications, the broader social implications – these are real and important topics, and this blog is not the place for them.
Not infallible. This has been covered above, but it bears repeating in its own section because it is the thing most AI content gets wrong. The field is moving fast enough that confident, comprehensive coverage is not achievable. The honest position is: here is what we think we know, here is how confident we are, and here is the mechanism for updating when we’re wrong.
A Practical Guide for Building Your Own Filter
If any of the above is useful, the most useful thing might be to adapt it rather than consume it.
Pick three to five sources and read them deeply, not twenty sources shallowly. The marginal value of source number twelve is close to zero. The depth you can apply to three sources you trust is substantially higher than the breadth you can achieve across twenty.
Treat Hacker News as a filter, not a source. It aggregates from everywhere. Use it to surface what the community is reacting to, then go read the original thing. The comments are the signal; the links are the index.
Follow five to ten practitioners whose judgment you trust, and pay attention over time. Not their Twitter presence. Their actual work, their posts, their talks. The goal is to calibrate your sense of whose intuitions tend to be right, so that when they react to something, you have context for why that reaction matters.
Let the practitioner community be your first filter. If the engineers building things are not talking about it, it probably does not matter yet. This is not a perfect heuristic – sometimes things matter before the community catches up – but it cuts a lot of noise.
Give yourself permission to not know everything. The pace is genuinely impossible to fully track. Anyone who appears to be tracking all of it is either not doing much else, or is summarising at a level of depth that is not actually useful. The goal is to know enough to think clearly about the things that matter for your work, not to achieve total coverage of a field that is expanding faster than any individual can absorb.
The filter is not about missing less. It is about being more confident in what you do see.
If something here is wrong, outdated, or missing context, the feedback link below is the place for it. The changelog will reflect any significant corrections.
Sources referenced in this post: simonwillison.net, The Pragmatic Engineer, Latent Space, Hacker News, arXiv, HuggingFace Papers, METR (2026), ICML Blog: On Violations of LLM Review Policies (March 2026), OpenCode on Hacker News, 21 March 2026, Walmart: ChatGPT checkout converted 3x worse than website, via Search Engine Land / HN 23 March 2026, iPhone 17 Pro Demonstrated Running a 400B LLM, Hacker News, 23 March 2026, OpenAI Is Shutting Down Sora, Its A.I. Video Generator, New York Times, 24 March 2026, Goodbye to Sora, Hacker News, 24 March 2026, ARC-AGI-3, arcprize.org, Hacker News 26 March 2026, How AI Is Creeping Into The New York Times, The Atlantic, 27 March 2026, Study: New York Times Has Published Extensive AI-Generated Articles, Futurism, 27 March 2026.
Commissioned, Curated and Published by Russ. Researched and written with AI. You are reading the latest version of this post. View all snapshots.