>> Test it yourself, GPT 120B OSS is cheap and available. BTW, this is why, with this bug, the stronger the model you pick (but not strong enough to discover the true bug), the less likely it is to claim there is a bug.
I guess this is the crux of the debate. All the claims are comparing models that are available freely with a model that is available only to limited customers (Mythos). The problem here is with the phrase "better model". Better how? Is it trained specifically on cybersecurity? Is it simply a large model with a higher token/thinking budget? Is it a better harness/scaffold? Is it simply a better prompt?
I don't doubt that some models are stronger than other models (a Gemini Pro or a Claude Opus has more parameters and larger context sizes, and was probably trained for longer and on more data, than its smaller counterpart (Flash and Sonnet respectively)).
Unless we know the exact experimental setup (which in this case is impossible because Mythos is completely closed off and not even accessible via API), all of this is hand wavy. Anthropic is definitely not going to reveal their setup because whether or not there is any secret sauce, there is more value to letting people's imaginations fly and the marketing machine work. Anthropic must be jumping with joy at all the free publicity they are getting.
antirez 12 days ago [-]
In the Anthropic Mythos model card they explicitly remarked that they didn't want Mythos to be specifically good at security. They trained it to be good at coding, and as a side effect the model is (obviously) good at security. This is what happens with flesh-and-blood hackers too, mostly. Hackers are very good programmers; as a side effect, they understand systems well enough that their understanding has security implications.
Hendrikto 12 days ago [-]
Model cards are just marketing material. I wouldn’t trust them one bit.
antirez 12 days ago [-]
You don't need to trust anyone. GPT 5.4 xhigh is available and you can test it for $20, to verify it is actually able to find complex bugs in old codebases. Do the work instead of denying AI can do certain things. It's a matter of an afternoon. Or, trust the people that did this work. See my YouTube video where I find tons of Redis bugs with GPT 5.4.
Hendrikto 12 days ago [-]
I did not claim or deny anything. You cited the model card; I just pointed out that it is not a reliable source. If you have better sources, like your YT video, you should cite those instead.
otterley 12 days ago [-]
You are claiming something: that the model card is not reliable and therefore as useful as nothing. Sowing doubt without offering a possible solution adds little value to the conversation. Moreover, your rebuttal is unsubstantiated.
cyanydeez 12 days ago [-]
Guys, think about all the security vulnerabilities you're aware of; now think about how many of those you know how to technically reproduce. Now imagine that you actually don't know how to reproduce most of them and will never actually be able to judge the result.
Well, just because these are all AI people doesn't mean they've verified enough of the output of these models to actually support the significant security implications they're advertising.
ncjfieuauahwi 12 days ago [-]
[dead]
mbesto 12 days ago [-]
And benchmarks can easily be gamed by overfitting. Yet here we are, with the top HN comment on the HN Mythos thread outlining its benchmark performance gains.
I guess we'll never learn.
Yokohiii 12 days ago [-]
The whole discussion started out as an attempt to disprove/verify Anthropic's (model card) claims.
He also transfers the logic of their claims to the actual real world. You can say that model cards are marketing garbage, but then you have to prove that experienced programmers are not significantly better at security.
root_axis 12 days ago [-]
> You have to prove that experienced programmers are not significantly better at security.
That has not been my experience. It's true that they are "better at security" in the sense that they know to avoid common security pitfalls like unparameterized SQL, but essentially none of them have the ability to apply their knowledge to identify vulnerabilities in arbitrary systems.
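For what it's worth, here's a minimal sketch of that pitfall using the SQLite C API (purely illustrative; the table, column, and function names are made up):

    /* Illustrative sketch (hypothetical table/column names), SQLite C API. */
    #include <stdio.h>
    #include <sqlite3.h>

    /* Unparameterized: attacker-controlled `name` is spliced into the SQL
       text, so input like  x' OR '1'='1  changes the query's meaning. */
    static int lookup_unsafe(sqlite3 *db, const char *name) {
        char sql[256];
        snprintf(sql, sizeof(sql),
                 "SELECT id FROM users WHERE name = '%s';", name);
        return sqlite3_exec(db, sql, NULL, NULL, NULL);
    }

    /* Parameterized: the query shape is fixed at prepare time and the
       input is bound as data, never parsed as SQL. */
    static int lookup_safe(sqlite3 *db, const char *name) {
        sqlite3_stmt *stmt = NULL;
        int rc = sqlite3_prepare_v2(db,
            "SELECT id FROM users WHERE name = ?;", -1, &stmt, NULL);
        if (rc != SQLITE_OK) return rc;
        sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
        while (sqlite3_step(stmt) == SQLITE_ROW) { /* consume rows */ }
        return sqlite3_finalize(stmt);
    }

Knowing to write the second version is exactly the kind of knowledge most experienced programmers have; spotting the first version buried in an unfamiliar codebase is a different skill.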
Yokohiii 12 days ago [-]
An expert-level human doesn't have to be an expert in every programming category. A webdev wouldn't spot a use-after-free; a systems engineer wouldn't know about CSRF. That is, assuming neither researches security beyond their field. Requiring a programmer to apply their knowledge to an arbitrary system is asking too much. On the other hand, an LLM can be expert-level in every programming field, able to spot and combine vulnerabilities creatively. That is all pretty hard, and I don't think a security expert with vast knowledge would say "that's easy".
My point is that more experienced programmers are better at security on average, not that they are security experts.
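To make the use-after-free mentioned above concrete, here is a minimal, contrived C sketch of the kind of bug a systems person spots on sight but a webdev may never have had to think about:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *token = malloc(32);
        if (!token) return 1;
        strcpy(token, "session-token");
        free(token);

        /* Bug: `token` still points at freed memory. The allocator may have
           reused it already, so reading (or worse, writing) through it is
           undefined behaviour and a classic exploitation target. */
        printf("%s\n", token);  /* use after free */
        return 0;
    }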
tracker1 12 days ago [-]
I would think pwn2own competitions signal the opposite. I'm consistently amazed at how a unique combination of exploits can be chained into a larger one, often in ways that most wouldn't even consider. I think it takes a level of knowledge, experience, creativity, and paranoia to be really good with security issues all around as a person.
inetknght 12 days ago [-]
> essentially none of them have the ability to apply their knowledge to identify vulnerabilities in arbitrary systems.
I've found it to be the opposite. Many of them do have the ability to apply their knowledge in that fashion. They're just either not incentivised to do so, or incentivised to not do so.
2983592 12 days ago [-]
But they are treated as holy scripture ...
zahlman 12 days ago [-]
> Hackers are very good programmers
This does not match my experience.
ang_cire 12 days ago [-]
The missing part of their intended meaning is "skilled hackers". Unskilled hackers are everywhere, and they're bad at programming, but so are unskilled programmers.
rakejake 12 days ago [-]
>>> the model is (obviously) good at security
Out of curiosity, are you one of the people who has access to the model? If yes, could you write about your experimental setup in more detail?
_the_inflator 11 days ago [-]
Yep. Some are, while others are more or less leeching forums, exploiting known risks, and using tools.
But the ones that really find certain bugs are exceptional. Almost all are very hardware-savvy and do assembler stuff. That alone is an impressive feat. I still enjoy 6510 and M68000 assembler here and there, as a former scener who mainly coded demos and occasionally improved games (so-called trainers) or cracked a few.
To be honest, the assembler guys always scare me, because with assembler you can poke a hole in almost anything. No one in their right mind uses assembler on x86 for professional development besides a few special cases. But Python etc. serve many MB of executable code for the abstraction, and 20 bytes of assembler just kills it…
Glemllksdf 12 days ago [-]
If it's really more expensive per token, it might have more parameters and would then be able to hold more context/scope of code.
Rumors say it has 10 trillion parameters vs. 1 trillion.