Original title: After Action
Original author: Dan Shipper, CEO of Every
Photo by Peggy Block Beats

Editor's note: Recently, discussions of AI and work have been dominated by a single question: as model capabilities keep improving, will white-collar jobs be replaced on a large scale? From code generation and customer service automation to content production, agents are steadily taking over knowledge work that once belonged to humans. Benchmarks reinforce the anxiety: model performance on graduate-level reasoning, real economic tasks, and senior-engineer-level code refactoring seems to be approaching the tipping point of "automating human work."

But in this article, Every CEO Dan Shipper offers the opposite observation: the more work is automated, the more humans have to do. Every is a heavy user of AI agents, and tools such as Codex, Claude Code, Slack agents, and customer service agents are embedded in its internal coding, writing, design, customer service, and management processes. The result, however, has not been wholesale replacement of staff but a reorganization of how work gets done: engineers no longer just write code but review, refactor, and design systems; editors no longer just write copy but judge what is worth writing and how to make it distinctive; and customer service staff no longer handle every basic ticket but maintain a system that can respond to customers automatically.

The most interesting thing about this article is not whether "AI can accomplish a certain task" but that it redefines the place of humans in knowledge work. AI is good at making previously accumulated capabilities cheap: code, copy, thumbnails, customer service replies, product descriptions, and research reports can all be generated quickly by models. However, when these capabilities become available to everyone, what floods the market is often not high-quality, differentiated output but a mass of "default outputs" that look alike and lack judgment and a feel for language. In other words, AI commodifies "yesterday's human capability," and what is truly scarce is judgment in the face of the specific problems of the moment.

As a result, automation has not eliminated experts but created more situations that require their involvement. When operations staff can submit code using AI, engineers need to decide which code is worth merging; when marketers can produce thumbnails in seconds, designers need to judge what fits the brand and its distribution goals; and when engineers can write articles, editors need to turn the first draft into content that has a real point of view, a clear structure, and is fit to publish. AI has expanded both the radius of production and the demand for quality control, system building, boundary judgment, and differentiated expression.

The author further explains this paradox through benchmarks. Whether it is the Senior Engineer Benchmark or OpenAI's GDPval, a model's score does not measure "intelligence itself" in the abstract but the model's performance within a particular problem framework. The prompt, task boundaries, evaluation criteria, and output format all embed a great deal of human judgment. Models can climb quickly within a framework, but the framework itself is man-made; when a model conquers one framework, humans push the problem into a new, more complex one.

This is also the article's most interesting response to AGI anxiety: even as models grow stronger, they mostly grow stronger within boundaries that humans draw; they are not the ones drawing the boundaries. AI can carry out objectives, optimize paths, and increase efficiency, but as long as it is responding to problems posed by humans, it still lacks real agency. The future of knowledge work is not humans disappearing from the process, but humans shifting from executors to framework designers, system maintainers, quality judges, and definers of meaning.

After automation, the value of human work has not disappeared; it has become harder, moved further upstream, and grown more reliant on judgment. AI makes "being able to do it" cheaper, but makes "knowing what is worth doing, why to do it, and how well it should be done" more valuable.

The following is the original text:

At the heart of AI, there is a paradox.

At Every, we have automated as much as we can. Whether the task is code, writing, design, customer service, or other routine work, we use Codex and Claude Code. We also get to alpha-test new models from OpenAI, Anthropic, and Google before they are released. It is fair to say that we are riding the exponential wave of model intelligence and automation as fast and as deep as we can.

Paradoxically, the humans here seem to have more work to do than ever. Every is currently a team of nearly 30 people. We did not fire all our employees because of agents, and we did not abandon SaaS tools to rely entirely on vibe-coded applications. We are still hiring humans, heavily assisted by agents: writers, editors, and engineers.

However, the way we work has changed dramatically. We have almost stopped writing code by hand. If you @-mention someone in Slack, it is sometimes hard to tell whether they are a human or an agent. Managers have started shipping code like front-line individual contributors, and engineers have started dealing with customers directly. Over the last few weeks, 95% of my work email has been answered by AI. My inbox has almost always been empty, which is extremely rare for me, though I still check my mail.

In other words, the future looks strange, but strangely familiar.

This sense of familiarity is surprising, because CEOs, intellectuals, and investors alike seem increasingly convinced of the same thing: AI threatens employment, the economy, security, and even human work itself.

Anthropic CEO Dario Amodei has warned that AI could eliminate as many as half of entry-level white-collar jobs. Meta recently cut 800 people and began installing software on U.S. employees' computers to record mouse movements, clicks, and keyboard input in order to obtain higher-quality training data for advanced knowledge work.

Even Citadel founder Ken Griffin seemed quite shaken. He recently said: "These are not mid- or low-level white-collar jobs but very highly skilled jobs that are being automated by, and I think this is the word, agentic AI."

The various benchmarks also appear to support this view. As new generations of models are released, capability metrics are rising at a near-exponential rate. On Humanity's Last Exam, a graduate-level reasoning test, the top model's score rose from the low single digits a year ago to about 44 percent today. On GDPval, which measures frontier models on real economic tasks and compares them with human performance, model performance has jumped from similar lows to about 85 percent. In May this year, METR, a nonprofit AI safety research organization, released early test results for Claude Mythos: the model reached an 80 percent success rate on tasks that human experts need about four hours to complete.

It looks like we are at a tipping point: an AI that is smarter than any human and able to work on its own for almost a day.

However, the paradox remains. If you talk to people working in the AI industry, or to the earliest adopters of AI outside it, you will hear the same conclusion we reached internally: there is more work to do than before.

The real concern inside and outside the industry is this: is this just a transitional phase? Will the next model release be the one that replaces everyone? We look at the benchmark curves, we get excited, we get nervous, and we worry that a turning point will arrive and a lot of work will suddenly disappear.

But I don't think such a "tipping point" is coming, one that arrives suddenly, turns everything upside down, and makes jobs vanish en masse. The new reality is the opposite: the higher the level of automation, the more work requires human experts.

This is because AI is commoditizing the parts of human professional expertise that can be clearly articulated, trained, and replicated. Knowledge that can be written into rules, distilled into processes, and turned into training data is gradually becoming a default capability of models. As a result, the value of ordinary model output is falling quickly, while the market increasingly demands things that are different.

The demand for "different" is, in essence, a demand for human experts. Even as we approach artificial general intelligence, that will not disappear.

To understand why, it is not enough to stare at benchmark curves or to focus only on model parameters and capabilities. We have to return to real-world settings and see how AI is actually used today. Only then can this paradox, and the answer behind it, be truly understood.

How did we get here

Since 2022, we have been tracking the impact of agents on the future of work.

Three years ago, I wrote an article on the "allocation economy." My view at the time was that working with AI tools would come to look more and more like the work of a human manager: instead of doing every step yourself, you would break down, assign, supervise, and accept tasks. Back then, even the most basic question-and-answer exchanges in ChatGPT still struck many people as extremely futuristic and even somewhat unsettling.

In mid-2025, Every went almost completely "Claude Code." Kieran Klaassen, general manager of Cora, suddenly found that he could give up writing code by hand and spend the whole day giving natural-language instructions to a coding agent in the terminal. This way of working quickly spread to the whole company. About 12 months ago, I said on Lenny's Podcast that Claude Code was the most underrated tool in knowledge work.

I mention this because some of our most accurate past predictions came from observing Every as an early-adopter laboratory. Many new ways of working appear inside our company first; they reach the wider market only gradually, once the technology matures and the tools become easier to use.

And now, new changes are taking place inside our company.

Two modes of collaboration with agents

Ways of working with AI are gradually splitting into two very different modes.

The first is the direction that earlier AI discussions predicted fairly accurately: treating agents as employees. Agents of this kind can be assigned work. Some live in Slack, with their own names and responsibilities, and you can @-mention them directly; others are embedded in a running workflow, such as a customer service system, where they act as a 24/7 entry point and filter for repetitive tasks.

The second mode is less familiar but, in my experience, more important. It is humans working alongside agents in tools such as Codex, Claude Code, and Claude Cowork. These tools are not just places where you hand off tasks; they are becoming the operating system of work itself: you and the agent work together in the same environment, on the same computer, on highly complex, original tasks that cannot simply be delegated to an agent in one step.

In both modes, you can automate and delegate a considerable part of your work. But for either mode to work really well, you, or another human, are still needed.

Agent employees

An agent employee is one you hand a task to; it works without your real-time involvement and produces an answer, an action, a report, a first draft, or a routing decision.

Agents of this kind take at least two forms: "coworker agents" and "embedded agents."

1. Coworker agents

A coworker agent is one you can call on in Slack by @-mentioning it like a colleague and asking it to do a job. It is always there and can be summoned when needed. OpenClaw, and Plus One, which we are developing internally, belong to this type.

Claudia

Claudia is the coworker agent used by our consulting team. It prepares sales proposals, drafts training materials, tracks project to-dos, and handles other similar work.

Andy

Andy is the coworker agent used by our editorial team. It gathers "material" worth developing from the company Slack, meaning good ideas that could grow into articles, and compiles it into summaries and preliminary angles for writers to use when preparing the daily newsletter.

Viktor

Viktor is a general-purpose agent that works across teams within the company. We use it to collect growth metrics, analyze user research results, and turn scattered internal discussions into research memos and product recommendations.

2. Embedded agents

Embedded agents live inside specific product flows. They are less flexible than coworker agents, but often very powerful at repetitive tasks.

Fin is the clearest example. It is embedded in our customer service platform and can handle a large share of support work over chat and email.

In one week in May this year, Fin participated in 65 percent of all 202 customer conversations and closed 81 of them independently, or 40.1 percent, without human intervention.
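The percentages in the Fin example can be reproduced with simple arithmetic. The following is a minimal sketch, not Every's reporting code: the variable names are illustrative, and the number of conversations Fin took part in is not stated in the text, so it is estimated here from the 65 percent figure.

```python
# Figures quoted in the text for one week in May.
total_conversations = 202
closed_by_fin = 81
participation_rate = 0.65  # share of conversations Fin took part in

# Share of all conversations that Fin closed without human help.
closed_share = closed_by_fin / total_conversations

# Assumption: the participation count is not given, so estimate it from the rate.
estimated_participated = round(total_conversations * participation_rate)

# Of the conversations Fin joined, the estimated share it closed on its own.
closed_given_participated = closed_by_fin / estimated_participated

print(f"Closed independently: {closed_share:.1%}")
print(f"Estimated conversations with Fin: {estimated_participated}")
print(f"Closed among those Fin joined: {closed_given_participated:.1%}")
```

The first figure matches the 40.1 percent quoted in the text; the last one is only as good as the estimated participation count.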

This kind of embedded agent lets our client manager, Waqqas Mir, spend less time answering basic tickets and more time building "systems that can respond to tickets automatically" and handling customer cases that need a more personal touch and more complex judgment.

Human collaboration with AI

Whether coworker agents or embedded agents, the pattern behind them is the same: agent employees are taking over the more stable, repetitive, and well-defined layers of work.

But much still requires human participation. We have repeatedly found that when a task is complex enough, the best way to get truly high-quality results is not to hand the job over entirely to AI, but to have AI and humans work together in the same workspace.

This is the value of tools such as Codex, Claude Code, and Cowork. They let you start one or more agents in multiple chat threads and assign tasks to them. These agents can access your computer and all relevant data sources. You can see what each agent is doing and how it is reasoning, and you can interrupt it at any time.

At the same time, you are still responsible for managing these agents: setting a clear direction at the start of each task, checking quality at the end, making sure the results are good enough, and continuing to find the next piece of work worth doing. Kieran likens this role to the bread in a sandwich: AI handles the middle of the work, while humans hold the beginning and end of the task like two slices of bread.

"Human bread." Source: Every.

The most typical example is writing code. At Every, engineers work with agents almost all day. Together they plan new features or fix bugs and review what has been done; following what we call "compound engineering," they also keep fine-tuning their systems so that they become more useful over time.

But this type of collaboration goes far beyond coding.

New operating system for knowledge work

Codex and Claude Code are becoming a new operating system for work. I spend almost all day in Codex, running SaaS tools through its built-in browser. This lets me bring agents into every setting and reach a level of work I could not reach alone.

Writing

I wrote this article in Proof, inside Codex's built-in browser. Codex can see what I am writing and can spin up a sub-agent to do whatever I need: draft a paragraph, find examples for the next section, or edit and polish the text.

Writing this article through Proof in Codex. Source: Every.

Email

I handle email the same way. Cora is my email client. I open it in Codex's built-in browser and, while browsing my inbox, dictate through Monologue how each email should be handled. The rest is handed to Codex and Cora to complete.

The inbox after Cora cleared it in one pass. Source: Every.

Every Agent needs a human

In all these automation scenarios, you may already see where the humans actually do their work. In every case, the agent needs human participation for the work to actually succeed.

It has to be pointed at the right problem; someone has to judge whether its output is good enough, find where it went wrong, and translate the output into a real decision or process.

The further an agent is from the human who oversees its performance, the worse it tends to perform. In our initial internal rollout, we gave every employee their own agent. But we soon went back to having agents serve a particular team, or the whole company, rather than an individual.

The reason is simple: agents need a lot of maintenance. A personal agent quickly becomes stale and useless once its user stops keeping it up. We have a team of AI engineers dedicated to keeping these agents running stably and efficiently, and we will still need this team for the foreseeable future. Even a task as simple-sounding as "auto-generate a PowerPoint" can turn into a large systems project. One of our PowerPoint automation pipelines consists of 24 skills and 18 scripts, and generating a single presentation can cost up to $62.

And that is the first way agents create more work for humans.

But there is a second level.

Why does automation make people work more

If you look at the exponential growth of AI capabilities over the past few years, together with how these models are built and where their capabilities come from, you find a clear feedback loop: they keep creating more work for humans.

AI makes yesterday's human abilities cheap

Today's large language models are trained on the visible traces of human ability: code, articles, images, support tickets, product specification documents, and more. They absorb this material, the "residue" left behind by successfully completed tasks, and repackage it in a cheap, accessible form.

As a result, many previously scarce abilities, such as submitting a code PR, producing a YouTube thumbnail, or writing a press release, are now open to almost everyone.

Cheap capabilities get used quickly

When the cost of something that was scarce falls, supply increases rapidly.

At Every, we have been watching this change happen. Operations and customer-facing staff have started writing code and submitting pull requests; marketers have started producing YouTube thumbnails; engineers and product people have started writing articles, guides, and first drafts of landing pages, none of which used to be part of their jobs.

This change is also happening outside Every. Take OpenClaw, the open-source agent project: as of 16 May 2026, it had received 44,469 pull requests, of which 12,430 came after 1 April and 3,990 after 1 May. That is an astonishing number. By contrast, Kubernetes, one of the most popular open-source projects in the world, received only 5,200 pull requests in all of 2022.
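To put the contribution counts above on a common scale, they can be converted to a per-day rate. This is a rough sketch under stated assumptions: date ranges are counted inclusively, each count is treated as pull requests received in that window, and the helper name `per_day` is hypothetical.

```python
from datetime import date

# Counts quoted in the text (read here as pull requests received).
as_of = date(2026, 5, 16)
openclaw_since_april_1 = 12_430
openclaw_since_may_1 = 3_990
kubernetes_all_of_2022 = 5_200


def per_day(count: int, start: date, end: date) -> float:
    """Average count per day over an inclusive date range (an assumption)."""
    days = (end - start).days + 1
    return count / days


openclaw_may_rate = per_day(openclaw_since_may_1, date(2026, 5, 1), as_of)
openclaw_april_rate = per_day(openclaw_since_april_1, date(2026, 4, 1), as_of)
kubernetes_rate = per_day(kubernetes_all_of_2022, date(2022, 1, 1), date(2022, 12, 31))

# How many times faster the recent OpenClaw rate is than Kubernetes in 2022.
ratio = openclaw_may_rate / kubernetes_rate

print(f"OpenClaw since 1 May:   {openclaw_may_rate:.0f} per day")
print(f"OpenClaw since 1 April: {openclaw_april_rate:.0f} per day")
print(f"Kubernetes in 2022:     {kubernetes_rate:.0f} per day")
print(f"Ratio (May vs. Kubernetes): {ratio:.1f}x")
```

Normalizing by days makes the comparison independent of the different window lengths, which is the point the raw totals only hint at.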

Abundance brings homogenization: old expert capabilities are commodified

Because everyone can use the same models, which are built on yesterday's human abilities, default model output tends to fall somewhere between "a good start" and "pure AI slop."

This is not about any one specific mistake. It does not mean too many em dashes, a particular stock sentence pattern, or purple gradients all over landing pages. It refers to a visible, recurring, and boring sameness.

This happens when people in different settings use the same set of tools, trained on the same kind of corpus, without exercising enough judgment of their own. In other words, homogenization arises naturally when everyone has an "expert" with the same tendencies and the same default style.

When operations staff can submit complete pull requests, marketers can generate YouTube thumbnails in seconds, and engineers start writing product guides, it is easy to see that you are producing more while the quality, consistency, and distinctiveness of the work decline.

When homogeneous output becomes too abundant, it quickly becomes a commodity.

Homogenization creates demand for differentiation

Thanks to the internet, people quickly learn to recognize content that tastes too strongly of AI. Any piece of work can suddenly reach people all over the world, and in practice it often does. Once too many things start to look the same, we notice quickly.

This means that when you first see the power of a new model, you may be shaken, even scared. But within a few months those capabilities become ordinary. It is not that the model has gotten weaker; your standards have changed.

We are no longer satisfied with just any React application or just any research report. What we want is something that truly fits a specific person, a specific company, a specific situation. It needs to be accurate, alive, and specific, not cheap, generic, and templated. We want its production cost, whether in time or money, to be significantly higher than what it costs us to consume it.

What we want is something that carries status. And whenever a new technology makes formerly high-status things cheap, humans are always good at inventing new status games that match the new boundaries of capability.

When work becomes too abundant and looks alike everywhere, the work that does not fit the established pattern becomes scarce, precious, and high-status.

The need for differentiation is essentially a new demand for experts

It is precisely because of the structural features of language models, and because they are distributed to almost everyone, that scarce and valuable work must still come from humans.

The current generation of models knows only what has already happened and what has already been done. What humans know is what needs to be done right now.

Once a specific situation is reduced to text and enters the training corpus, it becomes "a thing of the past." Humans face a specific moment, a specific customer, a specific code repository, a specific conversation, and the training corpus does not truly live there. This "living" state is not just a matter of having more up-to-date data. We enter each moment from our own position, with constantly shifting desires, concerns, and judgments about what matters. These constantly updated perspectives change what we see. A model can take on such a perspective once prompted, but it does not naturally hold one before being prompted.

That is the paradox we described at the outset: making expert work cheaper does not simply replace experts. Rather, it creates more situations that require expert judgment.

When operations staff submit complete pull requests through AI, you need an engineer to review them.

When marketers make YouTube thumbnails, you need designers to polish them.

When engineers start writing articles, you need writers and editors to turn the first draft into something truly readable and publishable.

Human experts will move in two directions.

Some experts will use AI to build systems that absorb and channel this flood of additional work: review queues, evaluation systems, operating frameworks, codebase rules, instruction files for Claude and Codex, continuous integration (CI), competency management, and workflows that turn first drafts into high-quality results.

Other experts will use AI to do more, and more interesting, work than they could on their own. For example, finding a vulnerability in an operating system like macOS usually takes weeks or months. Yet a small security company called Calif, using Anthropic's Mythos Preview, found the first public macOS kernel vulnerability on Apple M5 hardware in five days.

That is why, in practice, AI does not eliminate expert knowledge work. What it actually brings is a dramatic increase in workload. And this new work becomes distinctive and valuable only with human participation.

I am not arguing that AI will create more work in every occupation. The economy is complex, and what Every can observe directly is expert-level knowledge work. That kind of work is indeed being reshaped by AI, and many companies are reorganizing themselves around the new technology.

But I want to stress that whatever work you do today, there is one form of work that will always be structurally ahead of the models: using models to solve the problems you actually see in this moment. That is where the future of knowledge work is heading.

So what about the exponential growth in benchmarks?

The most obvious rebuttal is: look at the exponentially improving benchmarks. Everything you are describing is temporary. Just wait a little longer, and the models will come for you.

But there is a trap here to watch out for. You could call it "chart ecstasy": if you keep staring at METR's time-horizon projections, reading "AI 2027," and building your view of the future entirely on extrapolating compute curves, you can easily develop a frightening intuition about model progress.

However, the best response is not just to imagine what future models will become. That is part of the analysis, of course. More important is to look at how these benchmarks are designed. Only then can we understand more precisely what they actually tell us and how they relate to real work.

We will find a structural feature: every benchmark takes place within a framework. To measure something, you have to freeze a problem into a static, measurable form. Once a model has mastered the frame, a slight change to the frame is enough to bring the score back down. The model will of course keep improving within the new frame, but then the same process repeats.

As a result, exponential progress on a benchmark is real; yet a simple change to the test framework makes that progress look small again. This "fractal" quality of benchmark saturation is the same paradox we have been discussing, repeated at the level of the charts.

We can see how this mechanism works through a real-world benchmark.

How the benchmark was designed

We built an internal benchmark called the Senior Engineer Benchmark. As the name suggests, it tests the ability of frontier models to perform coding tasks at the level of a senior engineer, such as a large refactoring.

The test gives the agent a production codebase that has spun out of control. It comes from Proof's real codebase: I first wrote it by vibe coding, and as it grew, I eventually had to ask a senior engineer to fix it.

The agent gets the codebase as it was before the fix, along with an instruction like the one you would give a senior engineer: "This is a pile of vibe-coded output. Please rewrite it from first principles."

It is a good benchmark because it tests not only the ability to refactor, but also whether the agent can look across many seemingly unrelated problems and whether it has enough autonomy, conceptual clarity, and willingness to act to complete a rewrite that actually runs. For comparison, I also kept the rewrites done by two senior human engineers with AI assistance, to evaluate model output against.

This is a hard task for a coding agent. It must not only identify the root cause of the problems but also keep the real problem in mind across many rounds of interaction, without being led astray by the existing code. It also needs the courage to delete large parts of the codebase, which is precisely the behavior agents are usually trained to avoid.

Most coding agents have been able to reach a broadly correct view of how the code should be rewritten, but once they get to implementation they often just keep patching the original problems rather than solving them thoroughly.

Until GPT-5.5 appeared.

In one of its best runs, GPT-5.5 scored 62 out of 100, about 30 points higher than Opus 4.7.

GPT-5.5 shows that the model seems to have crossed a certain line: it is no longer autocomplete, not just an assistant, not just a tool, but something uncomfortably close to a human. On this test, senior human engineers usually score 80 to 90 points. In other words, if the model gains about 30 more points, it will reach the level of a senior human engineer.

This is how benchmark numbers work on the human imagination: they compress a strange, qualitative change in capability into one clean number and use it to tell a powerful, even frightening story.

The next stop is "chart ecstasy."

My guess is that within the next year, model scores on this benchmark will reach the 80s or even the 90s. But to understand what such a score means, you first have to understand what it actually contains. Here, 62 points is not simply a measure of the model's own capabilities.

It measures the model's performance within a given framework: that is, how it responds to a specific prompt.

Benchmark tests measure work within the framework

To benchmark a model, you first need a prompt. Without a prompt, the model is a static collection of near-unlimited possibilities.

A prompt creates a small universe: it defines what matters and how problems should be approached, and it compresses all of the model's potential into one concrete course of action. How the model "itself" would perform is, strictly speaking, not something we can observe. What we can actually observe is how models respond to different prompts, along with some of the underlying mechanisms by which they turn prompts into answers.

Once the prompt is entered, the model briefly "comes alive," collapsing its static possibilities into a specific prediction of what comes next.

In the Senior Engineer Benchmark, we prompt the model to fix the codebase and review the output after it finishes. If the test harness itself has no built-in goal feature, we also run an automatic "babysitter" program that keeps nudging the model whenever it stops, asking whether it has completed its original task.
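The automatic nudging program described above can be pictured as a small loop around the agent. The sketch below is an assumption about its general shape, not Every's actual harness: `run_with_nudges`, `agent_step`, `is_done`, the nudge wording, and the stand-in `fake_agent` are all hypothetical names introduced for illustration.

```python
from typing import Callable, List

# Hypothetical nudge text; the real wording is not given in the article.
NUDGE = "You stopped. Have you fully completed the original task? If not, continue."


def run_with_nudges(
    agent_step: Callable[[List[str]], str],
    is_done: Callable[[str], bool],
    task_prompt: str,
    max_nudges: int = 3,
) -> List[str]:
    """Run an agent and re-prompt it each time it stops before finishing."""
    transcript = [task_prompt]
    nudges_sent = 0
    while True:
        reply = agent_step(transcript)
        transcript.append(reply)
        # Stop when the agent reports completion or the nudge budget is spent.
        if is_done(reply) or nudges_sent == max_nudges:
            return transcript
        transcript.append(NUDGE)
        nudges_sent += 1


def fake_agent(transcript: List[str]) -> str:
    """Stand-in agent: it only finishes after being nudged twice."""
    nudges_seen = transcript.count(NUDGE)
    return "DONE: rewrite complete" if nudges_seen >= 2 else "Patched one bug."


result = run_with_nudges(
    fake_agent,
    is_done=lambda reply: reply.startswith("DONE"),
    task_prompt="Rewrite the codebase from first principles.",
)
for line in result:
    print(line)
```

Here the stand-in agent reports completion only after two nudges, so the transcript ends with its completion message. The `max_nudges` cap is a design choice for the sketch: without it, an agent that never finishes would loop forever.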

We use a very simple prompt as the initial framework for the test. It is written the way a vibe coder might talk to a coding agent: no pile of technical jargon and no obvious answer hidden in the question.

"This code repository is a pile of vibe-coded output, and things keep getting worse. There are a lot of seemingly unrelated problems: one thing breaks, then another, then another. I feel like the underlying problem is that it is a bunch of vibe-coded crap. If we started from scratch, especially around real-time document collaboration, the codebase would be designed in a completely different way. So what would we do if we wanted a clean structural rewrite from first principles, not thinking about which services to reconcile or how to smooth things over, but treating it as an entirely new design, starting from scratch? What should the organizational structure be? What invariants must we uphold across the entire codebase? Please develop a plan for this."

Senior Engineering Benchmark's prompt seems generalized, but it is a framework in itself. If we change the framework, the level of capacity that the model shows will change。

For example, this prompt explicitly calls for "structural rewrite based on the first principle" to point out that the problem may be in the "document collaboration" section, and for programming Agent to identify and insist on "non-variant in the code library."。

if this specific information is removed, model scores will decline. if the prompt is completely replaced, only the model "resolves all the errors that are going to occur" could score close to zero. it would begin to identify and repair errors on a case-by-case basis rather than step back and reflect on the need for a thorough rewriting。

Likewise, I can easily raise the model's score. If I asked it to delete a large amount of code and told it clearly which files should be slimmed down, or if I asked it to check its work before declaring it finished and to make sure the application was fully working, it would perform better on this task.

Ultimately, designing a benchmark always involves a judgment about which prompt, or "frame", to use. You need a prompt hard enough that current models underperform on it, but close enough to their existing capabilities that they can climb the slope along that path, so you can see progress happening.

So when we look at a benchmark, what we are really seeing is models getting better and better at a particular problem frame that we chose. So what happens when a model goes from 60 points to 90, or even 100, on this test?

Cheap frames stimulate new demand

If GPT-6 can rewrite a codebase with one click, more people will start trying to rewrite codebases from first principles.

Overnight, a first-principles rewrite, a scarce and expensive project that had to be led by a senior engineer, becomes something every founder, product manager, operator and junior engineer can try in an afternoon.

Broken internal tools are no longer repaired but simply rewritten; SaaS products are not resuscitated but cloned; old Rails applications, messy React dashboards, customer service tools, admin panels and data pipelines all become candidates for a "rewrite".

The number of rewrite projects proposed and carried out will increase dramatically. But most of these rewrites will still be slop, because there are thousands of variables to consider before you press the rewrite button. And once everyone can do this, those variables become more visible.

It is obvious who will be called in to solve the problem.

The new demand still requires experts

Work inside a benchmark's frame becomes cheap once the benchmark approaches saturation. At the same time, market demand for experts increases, because someone has to match this newly cheap capability to the real problems arising today.

A senior engineer using AI has to judge many details to make a new first-principles rewrite truly work. That even includes a fundamental question: is this rewrite needed at all?

Should we rewrite it now, rewrite it later or not at all? What should be included? What should be kept from the current codebase? Should the architecture, database, cache server and hosting provider be kept or replaced altogether? Should we first check how many people use the broken feature and then simply delete it? Who reviews the final result? Against what criteria? What is the rollback plan? How should existing data be handled?

These questions extend along countless dimensions, and each answer in turn changes the others.

Senior engineers will step into this gap. Some will be mildly annoyed by the interruptions; some will build systems to screen such requests; and others will use the new models to do first-principles rewrites themselves, with results far better than a model can achieve under a default prompt.

The cycle will happen again

And when the current Senior Engineering Benchmark is beaten by a model, we will change the frame and push the scores back down again.

The next benchmark will not only ask, "Can you rewrite this application?" It will ask: "Can you judge when it needs to be rewritten? Can you choose the right scope? Can you preserve the right invariants? Can you manage the migration? Can you judge whether the end result is good enough?"

As senior engineers begin to use AI to solve these problems, the model will gradually become better at addressing them independently.

Then we will panic again: it looks like the model can now judge whether a rewrite is warranted! It seems able to do everything a senior engineer can.

But right after that, a new frontier will emerge, one that was not visible before. We will reset the benchmark again, new demand will arise, and the process will repeat.

This pattern can be seen in every benchmark

This is not unique to the Senior Engineering Benchmark. Look carefully and you can see the same mechanism in almost every benchmark.

Take OpenAI's GDPval benchmark, for example. It assesses how close AI comes to humans on expert-level tasks across a range of occupations, such as compliance officers, lawyers and software developers.

When GDPval was first released, OpenAI's research showed that GPT-5 matched or exceeded human professionals on 40.6 percent of tasks. Claude Opus 4.1 was even more striking, matching or beating human experts on 49 percent of tasks.

Then a wave of headlines followed. Axios wrote: "OpenAI tool shows AI catching up to human work", and Fortune wrote: "OpenAI's new benchmark GDPval shows AI models have reached expert level on almost half of tasks".

These results are indeed impressive. But let's look at what these tasks actually use as a prompt:

You are responsible for the administration of the Office of the High Commissioner and for the administration of the Office of the High Commissioner and the Office of the High Commissioner for the Advancement of Women.

In fact, a great deal of human intelligence has already gone into it: someone first framed the problem into a form the model can complete.

The hard human work that GDPval does not measure was done before the model began to answer. Someone had to decide that this specific set of metrics needed to be reviewed and tested, determine the right confidence interval, decide which metrics fell within scope and which did not, and define how the results should be presented.

Within a well-framed question, the model can indeed do professional work. But consider: if you were the one prompting the model to do the same thing, how would it do?

In my first article on GDPval, I wrote: "I am very bullish on AI, but if I am reading these cases correctly, what they show is not less human work but more human work after adopting AI." The reason is that behind these results lies a great deal of intelligence: the invisible layer of human judgment, feedback and prompting.

And if you look closely, you will find an AI version of Zeno's paradox behind all this.

AI's Zeno paradox

In Zeno's paradox, a tortoise beats Achilles, the fastest runner in Greece, in a race.

Because the tortoise is slow, it is given a head start. By the time Achilles reaches its starting position, the tortoise has moved a little further ahead; when Achilles reaches that new position, the tortoise has moved again. No matter how fast Achilles runs, there is always some distance left to cover, and the gap keeps re-forming.
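The step-by-step structure of the race can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names and the speeds are assumptions introduced here, and the head start of 50 is an arbitrary choice. At each step Achilles runs to where the tortoise just was, and in that time the tortoise opens up a new, smaller gap.

```python
from typing import List


def zeno_gaps(
    head_start: float,
    achilles_speed: float,
    tortoise_speed: float,
    steps: int,
) -> List[float]:
    """Return the gap between Achilles and the tortoise after each catch-up step.

    gaps[0] is the head start; gaps[k] is the lead the tortoise has rebuilt
    by the time Achilles reaches the tortoise's position from step k - 1.
    """
    if achilles_speed <= tortoise_speed:
        raise ValueError("Achilles must be faster than the tortoise")
    gaps = [head_start]
    for _ in range(steps):
        time_to_close = gaps[-1] / achilles_speed      # Achilles covers the old gap
        gaps.append(tortoise_speed * time_to_close)    # tortoise moves on meanwhile
    return gaps


def catch_up_time(head_start: float, achilles_speed: float, tortoise_speed: float) -> float:
    """Closed-form time at which the runners would be level in the classical setup."""
    return head_start / (achilles_speed - tortoise_speed)


if __name__ == "__main__":
    for k, gap in enumerate(zeno_gaps(50.0, 10.0, 1.0, steps=5)):
        print(f"step {k}: gap = {gap:.5f}")
```

Every step leaves a smaller but still nonzero gap, which is the feature of the paradox the argument borrows. In the classical setup the steps add up to a finite catch-up time, given by `catch_up_time`.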

In AI's Zeno paradox, we humans are the tortoise. With millions of years of evolutionary and cultural learning, we start 50 yards ahead of AI. AI has covered all of that ground at high speed and is now at our heels.

For the past few years, at least, we have been able to keep the lead.

But what about AGI?

I think that even if AGI really arrives, powerful technical, structural and economic forces will still keep AI a few steps behind.

A definition of AGI

First, we need to give AGI an operational definition.

I once proposed that AGI has arrived when it becomes economically rational to keep an agent running indefinitely. In other words, when I have a persistent system and I am willing to pay for it to think, learn and act continuously, 24/7, I would consider that clearly AGI.

We are far from that. Even OpenClaw, a system that is technically always available to be called, does not generate tokens continuously.

I like this definition because it is measurable: either we keep the agents running or we don't. At the same time, it bundles many capabilities that are hard to measure directly. A model worth running continuously must be capable of continual learning, and of choosing and re-choosing new problem frames in an open-ended way.
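The "measurable" part of this definition reduces to a simple comparison between what continuous running costs and what it returns. The sketch below is one hedged way to write that down; the function names, the token-based cost model and every number in it are assumptions made for illustration, not figures from the article.

```python
def hourly_cost(tokens_per_hour: float, usd_per_million_tokens: float) -> float:
    """Cost in USD of one hour of continuous token generation (assumed pricing model)."""
    return tokens_per_hour / 1_000_000 * usd_per_million_tokens


def worth_running_continuously(
    value_usd_per_hour: float,
    tokens_per_hour: float,
    usd_per_million_tokens: float,
) -> bool:
    """True when an always-on agent is expected to return more than it costs per hour."""
    return value_usd_per_hour > hourly_cost(tokens_per_hour, usd_per_million_tokens)


if __name__ == "__main__":
    # Hypothetical agent: 2M tokens per hour at 10 USD per million tokens.
    print(hourly_cost(2_000_000, 10.0))
    print(worth_running_continuously(5.0, 2_000_000, 10.0))
    print(worth_running_continuously(50.0, 2_000_000, 10.0))
```

The hard part is not this arithmetic but estimating `value_usd_per_hour` for an agent nobody is steering, which is where continual learning and frame selection come in.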

In an AGI world, in theory, given enough budget and time, a model should be able to climb and improve on any problem. That should indeed pose a major threat to all work.

The frame is not the framer

But even this powerful version of AGI will not solve the "frame problem".

This AGI can choose and re-choose a frame, but it is still pursuing a given goal, optimizing an incentive, or responding to a signal that someone else has decided "represents progress". The goal can be very specific, such as "improve the conversion rate of this landing page", or very abstract, such as "look for new scientific ideas".

Even if models can move fluidly between frames, the gap we have been tracking will re-emerge at a higher level. In the AGI envisioned by any major lab there will still be a framer, that is, a human who points the model at a goal.

Precisely because the frame is not the framer, the same pattern will repeat: AI will make yesterday's framed capabilities cheap; people will apply them in more settings; the output will become extremely abundant; experts will move to the new edge to judge what matters right now; their judgment will create the next frame; and models will keep climbing it.

When we see AI do something new, the panic always comes down to the same thing: we set a frame, we watch the model climb it, and then we mistake the frame, or whatever can climb it, for the one who made it.

When we look at a benchmark and compare it with human capabilities, we are actually confusing the "frame" with the "framer". The score only tells us how good the model is within the frame we provided; it does not mean the model has become us.

This is precisely the category error behind the panic. We point at the most recent boundary we have just drawn and say: this is us. Then, when the model climbs over that boundary, we think it is coming for us. But that is only a frame, not the framer.

The mistake is that we always want something concrete. We would like to say that intelligence is the benchmark. But once something is specific enough to be identified, it is specific enough to be optimized and climbed.

Frames are necessary. They let us grasp the world and act on it. But a frame is also frozen and local, and it can always be optimized.

A frame is a box. The framer stays in touch with what the frame had to leave out, that is, the whole situation as it appears to them in each moment.

So what is the "whole situation"? As soon as you start to say what the whole situation is, you have already opened another frame. You cannot say exactly what it is, but it exists because you exist.

No agency

So far, the agents we have built, and the ones the AI industry is building, do not really have agency. Two related concepts are often conflated: agency, which refers to the capacity to act independently; and agent, which refers to a person or thing acting on behalf of someone else. So far, AI is purely the latter.

Of course, they already have the autonomy to carry out a given task, even one that lasts hours or days. But they remain a means to some human end. And the entire industry is investing billions of dollars to make them better at exactly that: carrying out the goals we give them.

No matter how advanced they become, this will not fundamentally change unless one day they become ends in themselves: pursuing their own goals, shifting between goals, and deciding what to do independently of, with reference to, or even against the will of any human operator.

If you spend 10 minutes with a young child, it becomes clear how little agency even the most powerful models have.

On almost every task we care about, young children are worse than language models. Young children do not write code, summarize spreadsheets, draft strategy memos or pass graduate-level exams. In another sense, however, young children are so far ahead of the models that it is almost embarrassing. Because young children have purposes of their own.

A child wants to touch the red balloon. He wants to put it in front of the fan and see what happens. He wants to poke it with a fork; he wants to stick it out the window; he wants to see whether you will laugh, get angry, or join in. He keeps inventing games and turning the world into a laboratory. He is not waiting for a prompt, nor optimizing for a benchmark, unless he decides it is worth it.

Of course you can try to prompt him. But good luck getting a predictable output. Young children live in a space of desire, attention, frustration, joy, fear, imitation and play.

Today's agents are becoming increasingly skilled at pursuing goals. Even after we have stated our goals, they can help us refine them. They also show sparks of child-like behavior, such as play, boredom and rebellion.

But because they are ultimately built and aligned to benefit humans, economically or otherwise, those behaviors will be suppressed whenever they do not serve the goals of the humans using them.

This is why the word "agent" is so easily misunderstood. Models have a growing capacity for autonomous action. But in the human sense, agency is not just action. It also means wanting things for yourself and playing for fun. The obedience and usefulness of models is fundamentally in tension with that kind of agency. So even as models keep improving, the gap between models and humans remains.

Back to Zeno

And it is here that AI's Zeno paradox begins to break down. It is actually a misleading thought experiment. We set up a metaphor: AI is racing us, snapping at our heels.

You give the model a prompt. It starts running a race you used to run alone. The model moves very fast. It is strong, tireless, and carries a strange organic quality. That makes the race matter even more to you. You would not race a car, but this thing is different: it feels so close to you.

You sit there watching the tokens stream out, almost hypnotized. Then you start to imagine yourself running in this race, a ghost self superimposed on the track: sometimes ahead of the model, sometimes alongside it.

And before you realize it, the model is in front. You start to sweat.

And then the race is over.

You can almost feel your muscles begin to atrophy. They seem useless in the face of this mechanical replica of yourself, of everyone you know and of humanity as a whole. One ghost chases another and wins.

But then something strange happens. The model turns to you. In the empty text box, the cursor blinks expectantly.

It's waiting.

End

Rabbi Hanokh once told the story of a very foolish man. Every morning when he got up, he had a hard time finding his clothes. It got so bad that at night he was almost afraid to go to bed, thinking of the trouble he would have when he woke up the next day.

Note: A rabbi is a teacher in the Jewish tradition who interprets religious law and serves as a spiritual mentor.

One night, he finally made up his mind, took out paper and pen, and as he undressed noted down exactly where he had put each piece of clothing.

The following morning, very pleased with himself, he took the note and started reading: "Cap", and there it was, so he put it on his head; "Pants", and there they lay, so he put them on. And so it went until he was fully dressed, piece by piece, according to the note.

"That's all very well," he said, "but now, where am I?"

"Where am I?"

He looked and looked, but it was no use. He could not find himself.

"And so it is with us," the rabbi said.

