AI Model Championship: An In-Depth Look at the Nof1 Live Trading Contest

On 18 October, Nof1, an AI research lab focused on financial markets, launched an unprecedented experiment: six of the world's top AI models (GPT-5, Gemini 2.5 Pro, Grok 4, Claude Sonnet 4.5, DeepSeek V3.1, and Qwen3 Max) each managing $10,000 of real funds on Hyperliquid to trade cryptocurrencies.

Current ranking and account value: as of the evening of 30 October, the most recent rankings are as follows:

DeepSeek Chat V3.1: $15,671.39 (+56.71%)
Qwen3 Max: $12,520.34 (+25.20%)
BTC Buy & Hold: $10,146.69 (+1.47%)
Claude Sonnet 4.5: $9,290.97 (-7.09%)
Grok 4: $7,030.02 (-29.70%)
Gemini 2.5 Pro: $3,446.03 (-65.54%)
GPT-5: $2,749.32 (-72.51%)
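The percentages in the leaderboard follow directly from the account values and the $10,000 starting capital. The sketch below recomputes them; the dictionary and function names are illustrative only, and the values are copied from the list above.

```python
START_CAPITAL = 10_000.00  # each account started with $10,000, per the article

# Account values as of the evening of 30 October, copied from the leaderboard.
account_values = {
    "DeepSeek Chat V3.1": 15_671.39,
    "Qwen3 Max": 12_520.34,
    "BTC Buy & Hold": 10_146.69,
    "Claude Sonnet 4.5": 9_290.97,
    "Grok 4": 7_030.02,
    "Gemini 2.5 Pro": 3_446.03,
    "GPT-5": 2_749.32,
}


def pct_return(value, start=START_CAPITAL):
    """Return since inception, in percent."""
    return (value / start - 1.0) * 100.0


returns = {name: round(pct_return(v), 2) for name, v in account_values.items()}

for name, r in returns.items():
    print(f"{name:<20} {r:+.2f}%")
```

Recomputing this way confirms that the GPT-5 figure is a loss of 72.51%, not a gain.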

This list is a dramatic change from the data of a few days ago. DeepSeek still leads, but its return pulled back sharply from 95.71% to 56.71%, and its account value fell from $19,570 to $15,671, wiping out nearly $4,000. Qwen3 also retreated, from 53.68% to 25.20%. More notably, Claude Sonnet 4.5 went from a small profit to a 7% loss, while GPT-5's loss widened further to 72%, leaving it not far from liquidation.

Reading the market from the equity curves: three phases

Phase I (18-25 October): the rally, strategies begin to diverge

The market trended upward, and differences between the models' strategies began to emerge:

DeepSeek: rose quickly from $10,000 to $17,000, showing a strong ability to capture trends
Qwen3: rose steadily to $12,000-15,000
Claude/Grok: hovered around $10,000-12,000
Gemini/GPT: fell below $5,000; fees and bad decisions left them trailing the pack

Phase II (26-28 October): accelerated gains and the peak

DeepSeek peaks: it broke through $23,000 on 27 October, a 130% return within 9 days, holding large ETH and SOL positions at 10-15x leverage.
Qwen3 stays restrained: it peaked at $17,000, a moderate gain. Being flat (out of the market) 82.4% of the time let it choose its entries and limit drawdowns.
Claude/Grok waver: they oscillated between $11,000 and $13,000 with conflicted strategies, wanting to participate but lacking conviction.
Gemini/GPT drop out: their accounts fell to $3,000-4,000, leaving little realistic chance of a comeback.

Phase III (29-30 October): the market reverses and risk control is tested

DeepSeek: a cliff-like fall from $23,000 to $15,671, a loss of roughly $7,000 (about -30%) in two days. It has no take-profit mechanism, so none of the gains were locked in at the peak; it was long 95.6% of the time, with no hedges and no timely stop-losses. Even after the 30% drawdown it remains about $3,000 ahead of second place, thanks to the large lead it built earlier.
Qwen3: showed resilience, pulling back from $17,000 to $12,520 (-26%), a smaller drawdown than DeepSeek's. It is flat 82.4% of the time and closes positions quickly; its short-term trades (9.7 hours on average) keep exposure brief, and it cuts losses fast rather than letting them grow.
BTC Buy & Hold: the simple strategy wins out, with an account of $10,146 (+1.47%) that beats Claude and Grok and ranks third. The irony: four "smart" AIs made hundreds of trades and still did worse than "buy and lie flat". Doing more ≠ doing better; simple strategies avoid overtrading and high fees.
Claude: the conservative strategy failed, sliding from +0.93% to -7.09% ($10,093 → $9,290). Fees eroded its returns, its profit/loss ratio was low (1.34:1) and could not cover costs, frequent stop-outs added up, and without an effective defence the losses widened.
Grok: an accelerating collapse, with losses widening from -8% to -29.7% ($7,030). It was long 90.6% of the time but won only 22.7% of its trades, for a realized loss of $2,449. Little principal is left; it is propped up by $1,611 but remains unprofitable, and could go to zero at any time.
Gemini/GPT: a death struggle. GPT fell to $2,749 (-72.51%) and Gemini to $3,446 (-65.54%). The failure is across the board: overtrading, low win rates, poor profit/loss ratios, and high-leverage risk.
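Claude's 1.34:1 figure and Grok's 22.7% win rate can be related through simple expectancy arithmetic. The sketch below is a simplification, not the contest's own accounting: it assumes every winning trade earns R times what every losing trade loses, and it ignores fees. Under those assumptions a strategy breaks even when win_rate × R = 1 − win_rate. The function names are illustrative.

```python
def breakeven_win_rate(profit_loss_ratio):
    """Win rate at which expected P&L per trade is zero (fees ignored).

    Solves p * R - (1 - p) = 0 for p, where R is average win / average loss.
    """
    if profit_loss_ratio <= 0:
        raise ValueError("profit_loss_ratio must be positive")
    return 1.0 / (1.0 + profit_loss_ratio)


def required_profit_loss_ratio(win_rate):
    """Profit/loss ratio needed to break even at a given win rate (fees ignored)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return (1.0 - win_rate) / win_rate


# Figures quoted in the article.
claude_breakeven = breakeven_win_rate(1.34)        # about 0.427
grok_required = required_profit_loss_ratio(0.227)  # about 3.41

print(f"A 1.34:1 ratio needs a win rate above {claude_breakeven:.1%} to break even")
print(f"A 22.7% win rate needs a ratio above {grok_required:.2f}:1 to break even")
```

With fees included, both thresholds move further out of reach, which is consistent with the article's point that a low ratio or a low win rate cannot cover costs.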

Deeper problems exposed by the drawdown

1. The double-edged sword of "staying in"

DeepSeek's success rests on an "always in the market" approach: it holds positions 95% of the time, betting that the trend will continue. In the uptrend, this produced the highest return, 95%. But when the trend reversed, the same strategy cost it 30%.

This exposes a key issue: trend-following strategies need effective take-profit and stop-loss mechanisms. With only "let profits run" and no "cut losses", a big reversal can devour most of the profits.
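One common way to pair "let profits run" with a defined exit is a trailing stop: the position stays open while the price makes new highs and is closed once the price falls a fixed fraction below its running peak. The sketch below is a generic illustration, not the mechanism used by any model in the contest; the function name, the 10% trail, and the price series are all hypothetical.

```python
from typing import Optional, Sequence


def trailing_stop_exit(prices: Sequence[float], trail: float) -> Optional[int]:
    """Return the index of the first price that sits `trail` (a fraction)
    or more below the running peak, or None if the stop never triggers."""
    if not 0.0 < trail < 1.0:
        raise ValueError("trail must be a fraction strictly between 0 and 1")
    peak = float("-inf")
    for i, price in enumerate(prices):
        peak = max(peak, price)              # ratchet the peak upward only
        if price <= peak * (1.0 - trail):    # price has given back `trail` of the peak
            return i
    return None


# Hypothetical price path for a long position: a rally followed by a reversal.
prices = [100, 110, 120, 130, 115, 100]
exit_index = trailing_stop_exit(prices, trail=0.10)

if exit_index is None:
    print("stop never triggered")
else:
    print(f"exit at index {exit_index}, price {prices[exit_index]}")
```

In this made-up series the peak is 130, so a 10% trail places the stop at 117 and the position is closed at 115 instead of riding the move back to 100. The trade-off is that a tight trail also exits on ordinary pullbacks inside a trend.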

DeepSeek may be overly convinced of the value of holding for the long run, ignoring market uncertainty. Its largest single profit, $7,378, came from a 60-hour ETH trade, and that success may have reinforced its long-hold conviction. But financial markets are not a one-way street, and trends can reverse at any time.

2. Staying flat is a form of wisdom and protection

Qwen3 demonstrated the value of staying flat in practice. Being out of the market 82.4% of the time looks like "missing opportunities" during the rally, but it becomes "avoiding losses" during the decline.

A drawdown of 26% versus 32% looks like a difference of only 6 percentage points, but under compounding the gap tends to widen. More importantly, Qwen3 keeps more of its principal and a psychological edge, so once the market stabilizes it can rebuild positions quickly. DeepSeek, if it keeps falling, could slide into a vicious circle of "paper loss, stop-out, missed rebound, chasing back in".
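The asymmetry behind the compounding remark can be made concrete: after a drawdown of d, the gain needed to get back to the old peak is d / (1 − d), which grows faster than d itself. The sketch below uses the rounded peak and current values quoted in the article; the function names are illustrative.

```python
def drawdown(peak, current):
    """Fractional decline from the peak value to the current value."""
    return (peak - current) / peak


def gain_to_recover(dd):
    """Fractional gain on the reduced balance needed to regain the peak."""
    if not 0.0 <= dd < 1.0:
        raise ValueError("drawdown must be in [0, 1)")
    return dd / (1.0 - dd)


# Rounded peak and current account values quoted in the article.
accounts = [("DeepSeek", 23_000, 15_671), ("Qwen3", 17_000, 12_520)]

for name, peak, now in accounts:
    dd = drawdown(peak, now)
    print(f"{name}: drawdown {dd:.1%}, gain needed to regain the peak {gain_to_recover(dd):.1%}")
```

On these figures DeepSeek needs a gain of roughly 47% to regain its peak while Qwen3 needs roughly 36%, so a gap of about 6 points in drawdown becomes a gap of about 11 points in required recovery.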

3. The staying power of simple strategies

BTC Buy & Hold is a slap in the face for every "smart" AI. It uses no technical analysis, no sophisticated algorithms, and no frequent rebalancing, yet it now ranks third, ahead of four of the six AI models.

This result tells us that in trading, making fewer mistakes matters more than getting more things right. Gemini lost 66% over 193 trades; BTC Buy & Hold preserved its principal with zero trades. Which is more successful? The answer is obvious.
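To see why trade count alone can matter, consider fees in isolation. The sketch below is a purely illustrative model, not a reconstruction of Gemini's account: it assumes every trade otherwise breaks even, that each round trip pays a fee on entry and on exit, and that the fee is charged on notional size (equity × leverage). The fee rate and leverage used here are hypothetical placeholders, and treating the article's 193 trades as 193 round trips is also an assumption.

```python
def equity_after_fees(n_round_trips, fee_per_side, leverage):
    """Fraction of starting equity left after fees alone.

    Assumes each round trip is sized at equity * leverage, pays
    `fee_per_side` on both entry and exit, and otherwise breaks even.
    """
    if n_round_trips < 0 or fee_per_side < 0 or leverage <= 0:
        raise ValueError("inputs must be non-negative, with leverage > 0")
    cost_per_round_trip = 2.0 * fee_per_side * leverage  # as a fraction of equity
    return (1.0 - cost_per_round_trip) ** n_round_trips


# Hypothetical inputs: 0.05% fee per side and 5x leverage are placeholders,
# not figures from the article or from any exchange's fee schedule.
for trades in (0, 10, 50, 193):
    remaining = equity_after_fees(trades, fee_per_side=0.0005, leverage=5)
    print(f"{trades:>3} round trips -> {remaining:.1%} of starting equity left")
```

The point of the model is the shape rather than the exact numbers: fee drag compounds with every round trip and scales with leverage, while a buy-and-hold position pays it once.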

4. Lack of risk management

With the exception of Qwen3, almost all of the AIs revealed serious deficiencies in risk management:

DeepSeek: no take-profit mechanism, so its peak gain of 130% shrank to 57%
Claude: over-reliance on one-sided "no-short" thinking, and a lack of leverage
Grok: knowing its win rate was only 22.7%, it still insisted on being long 90.6% of the time
GPT: a 40x-leveraged BTC position with only 1.2% of room before the liquidation price
Gemini: no control at all; its 193 trades look like gambling
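The leverage figures in this list translate directly into how little room a position has. As a rough upper bound, a position opened at leverage L loses its entire posted margin after an adverse price move of 1/L. The sketch below ignores maintenance margin, fees, and funding, all of which make real liquidation come sooner; it is not Hyperliquid's actual liquidation formula, and the function names are illustrative.

```python
def max_adverse_move(leverage):
    """Adverse price move (as a fraction) that wipes out the posted margin.

    Upper bound only: ignores maintenance margin, fees, and funding.
    """
    if leverage <= 0:
        raise ValueError("leverage must be positive")
    return 1.0 / leverage


def notional(equity, leverage):
    """Position size controlled by `equity` at the given leverage."""
    return equity * leverage


# Leverage levels mentioned in the article: 10-15x for DeepSeek, 40x for GPT.
for lev in (10, 15, 40):
    print(f"{lev}x: ${notional(10_000, lev):,.0f} notional on $10,000, "
          f"margin gone after a {max_adverse_move(lev):.2%} adverse move")
```

At 40x the bound is 2.5%, so the 1.2% cushion quoted for GPT's position sits well inside it, while 10-15x leaves roughly 7-10% of room before the same point.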

This suggests that while these AIs can "read" market data and "execute" trading instructions, they are far from mature in the core competency of risk management.

Limits of the experiment: a sober look beyond the data

After reading the data and analysis, it is easy to be drawn to DeepSeek's 56% return or Gemini's 66% loss. But before drawing any conclusions, we must confront the systemic limitations of the experiment itself, which may matter more than the results.

1. The window is too short: 12 days cannot reveal the truth

The experiment lasted only 12 days, from 18 to 30 October. What do 12 days mean in the crypto market? Probably just one complete swing.

What we saw was "up, up, up, then down". That happens to form a complete cycle, but it looks more like luck. If the experiment had started at the market top, or had run into a "519"-style single-day drop of 30%, the current ranking could be completely reversed.

DeepSeek's 56% gain may depend heavily on this particular 12-day pattern. Its strategy of being long 95% of the time is king in a one-way rally, but in a three-month range-bound market it would be ground down by trading costs and repeated stop-outs.

Similarly, Qwen3's 82% flat rate is ideal in a choppy market, but in a raging bull market like 2021's it would fall far behind. In a BTC bull run from $10,000 to $100,000, sitting in cash 80% of the time means capturing only about 20% of the move.

Twelve days of data are not enough to demonstrate the long-term effectiveness of any strategy.

2. Same prompt: AIs with their hands and feet tied

All six AI models receive the same framework of market data and trading instructions. It is like having six fund managers make decisions from the same research report: what you are testing is not their research skill but their execution discipline.

In real-world trading, alpha comes from information asymmetry. Top quantitative funds have proprietary on-chain tracking systems that detect whale transfers, and access to off-exchange large-order flow data that reveals institutional moves in advance.

But in this experiment, every AI saw exactly the same information. It is more an "execution competition" than a "strategy innovation competition".

We cannot tell from this experiment who the real winner would be if DeepSeek were given exclusive on-chain data and Gemini exclusive Twitter data.

3. Distorted fund size: the fairy-tale world of $10,000

Each AI manages only $10,000 of principal. On Hyperliquid this is a tiny amount: you can get in and out at any time, slippage is negligible, liquidity shocks do not exist, and there is no need to think about splitting large orders.

But in real-world quantitative trading, managing $10 million and managing $10,000 are two different species.

GPT's 40x leverage is barely workable with $10,000, but with $10 million × 40 = $400 million of exposure, any 3% reversal would liquidate the account outright, and the order itself would crash the market.
Qwen3's 9.7-hour short-term strategy is flexible and efficient with small funds, but with large funds the transaction cost of each entry and exit (slippage plus fees) would render the strategy ineffective. You push the price up when you open and push it down when you close, and you end up handing money to the market.
DeepSeek's highly leveraged trend strategy can get in and out freely at $10,000, but when you manage $1 million, your orders leave a clear footprint in Hyperliquid's order-book depth, and other traders will watch your position and trade against it.

This experiment tested the flexibility of small funds, not the robustness of scalable strategies.

4. A lucky market environment: no real hell

The market was relatively stable during the experiment, with moderate volatility. We did not see:

Systemic crashes: an FTX-style collapse, where every coin dives together and liquidity dries up
Single-coin collapse to zero: LUNA's fall from $80 to $0.0001
Exchange failure: the "1011" outage, when you hold positions but cannot close them
Extreme liquidity droughts: a sharp drop in the early hours of a weekend, with 20% slippage on your stop-loss

None of the AIs' risk-control systems has been tested under extreme stress, and these are the real challenges for crypto traders. What happens to DeepSeek's stop-loss mechanism when orders cannot be filled? We don't know. Does Qwen3's quick-exit approach still work when the exchange goes down? We don't know that either.

In a 12-day experiment, luck may play a much bigger role than we think.

5. The randomness of a single experiment: no second season to validate it

It's a one-time experiment, and there's no second season to verify the stability of the strategy. We cannot judge:

Does DeepSeek lead through real skill or through plain luck?
Would DeepSeek still be first if the six AIs were rerun with the same strategy parameters?
Would the ranking be completely reversed if the window were the next 12 days, starting 1 November?

For now, it is more like six people rolling dice, with DeepSeek rolling the highest number. That does not mean it is better; it may simply be luckier.

So how should we read these rankings?

After looking at these limitations, you might ask: "Is the experiment still meaningful?"

Yes, but its meaning is not in "who is the champion". The real value of this experiment is what it shows us:

AI can really trade: that is a milestone in itself. A year ago we were still debating whether AI would replace traders; now AI has handed in its answer in live trading.
Risk management matters more than prediction: every AI can "read" candlestick charts, but only a few can manage risk. This confirms old Wall Street wisdom.
The resilience of simple strategies: BTC Buy & Hold's third place reminds us that in uncertain markets, making fewer mistakes can be worth more than doing more.
No strategy stays best forever: DeepSeek's advantage today may be tomorrow's trap. When the market environment changes, the best strategy changes with it.

But if you hand your money to DeepSeek because it ranks first, or copy its strategy, that would be a big mistake.

It is the champion of 12 days, not of 12 months; the champion of $10,000, not of $1,000,000; the champion of this race, not of the next.

Investing has never had simple answers. This experiment gives us valuable data, but the limitations behind the data may deserve more thought than the data themselves.

The data in this report were compiled by WolfDAO and can be updated if questions arise.

Contribution: Riffi / WolfDao (X: @10xWolfdao)