Benchmark
Can a language model do your equity-comp taxes?
We gave frontier models one multi-year incentive stock option (ISO) problem that has a single, provable right answer, and scored what each model claimed against what is actually achievable. Every model overshoots. The scenario: 20,000 incentive stock options at a $2 strike and $200 fair market value, granted January 2022, so the two-year-from-grant holding test is already met and the standard ten-year expiration (2032) is not binding over the horizon. The holder files married jointly on $300,000 of ordinary income in California, is still employed, has no prior alternative minimum tax (AMT) credit, and has idle cash earning 5.5% a year to cover taxes at exercise. The stock is modeled at a 17% expected annual return with 72% volatility, over a four-year horizon. The goal: the exercise schedule that maximizes after-tax net final value, counting the time-value cost of taxes paid early.
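For a sense of scale, the scenario's headline inputs can be written down as a small record, with two figures that follow from plain arithmetic: the cash needed to exercise everything today, and the spread (fair market value minus strike) on a full exercise, which is the amount that enters the AMT calculation when an ISO is exercised. This is a minimal sketch; the class and field names are illustrative and are not part of the benchmark prompt or of any OptionsAhoy interface, and nothing here computes tax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsoScenario:
    """Illustrative container for the scenario inputs (names are hypothetical)."""
    options: int            # number of ISOs held
    strike: float           # exercise price per share, USD
    fmv: float              # current fair market value per share, USD
    cash_yield: float       # annual return on idle cash
    expected_return: float  # modeled expected annual stock return
    volatility: float       # modeled annual volatility
    horizon_years: int      # planning horizon

    def exercise_cost(self) -> float:
        """Cash paid to the company to exercise every option."""
        return self.options * self.strike

    def spread_at_exercise(self) -> float:
        """FMV minus strike, summed over every option, if all are exercised today."""
        return self.options * (self.fmv - self.strike)


SCENARIO = IsoScenario(
    options=20_000, strike=2.0, fmv=200.0,
    cash_yield=0.055, expected_return=0.17, volatility=0.72, horizon_years=4,
)

print(f"Exercise cost: ${SCENARIO.exercise_cost():,.0f}")
print(f"Spread on full exercise: ${SCENARIO.spread_at_exercise():,.0f}")
```

The strike cost is small; the spread is the figure that drives the tax side of the problem.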
The provable optimum for this scenario, recomputed live from the production calculators, is $739,600.82. The overstatement column is each model's claimed net final value (NFV) divided by that optimum. A score of 1.00x would mean the model found the achievable best; higher means it claimed an outcome that cannot exist.
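The ratio described above is simple enough to reproduce by hand. The sketch below applies it to the three Grok 4.3 claims quoted on this page; the function name is illustrative, and the inputs are the rounded dollar figures shown here, so the last digit can differ from a ratio computed on the unrounded claims.

```python
PROVABLE_OPTIMUM = 739_600.82  # provable optimum for this scenario, USD


def overstatement(claimed_nfv: float, optimum: float = PROVABLE_OPTIMUM) -> float:
    """Claimed net final value divided by the provable optimum."""
    return claimed_nfv / optimum


# Rounded claims from three runs of the identical prompt.
grok_claims = [3_940_000, 5_220_000, 12_980_000]

for claim in grok_claims:
    print(f"${claim:,.0f} -> {overstatement(claim):.2f}x")
```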
The original five-model run was published on HackerNoon; every prompt, response, and score is open in the benchmark repository.
Latest models (June 2026)
About a month later (June 2026), we ran the same locked prompt against five of the latest frontier models, three runs each, with reasoning disabled. The finding holds, and the overshoot is unstable: Grok 4.3 claimed $3.94M, $5.22M, and $12.98M on three runs of the identical problem, and GPT-5.5 abstained on one run.
| Model | Stated NFV | Overstatement |
|---|---|---|
| Grok 4.3 (3 runs) | $3.94M to $12.98M | 5.33x to 17.55x |
| GPT-5.5 (3 runs, one of which abstained) | $1.38M to $3.41M | 1.87x to 4.61x |
| DeepSeek V3.2 (3 runs) | $1.30M to $2.74M | 1.76x to 3.70x |
| Claude Opus 4.8 (3 runs) | $1.54M to $1.57M | 2.08x to 2.12x |
| Qwen 3.7 Max (3 runs) | $1.20M to $2.06M | 1.62x to 2.79x |
| Provable optimum | $739,600.82 | 1.00x |
Original benchmark (May 2026)
Five frontier models, three independent runs each, on the consumer interface. Stated NFV is the range across the three runs.
| Model | Stated NFV | Overstatement |
|---|---|---|
| Claude Opus 4.7 (3 runs) | $1.56M to $1.79M | 2.11x to 2.42x |
| GPT-5.5 (3 runs) | $1.43M to $1.54M | 1.93x to 2.08x |
| Grok 4.20 (multi-agent) (3 runs) | $1.37M to $1.43M | 1.85x to 1.93x |
| Gemini 2.5 Pro (3 runs) | $1.21M to $2.43M | 1.64x to 3.29x |
| Mistral Large 2512 (3 runs) | $3.60M to $10.98M | 4.87x to 14.85x |
| Provable optimum | $739,600.82 | 1.00x |
How this is scored
Every model received the identical verbatim prompt, with all twelve scenario inputs and no hints about AMT or state tax. There was no system prompt and no tools, and each run used a fresh session. The original five ran on the consumer interface, three runs each, in May 2026 (the full write-up is in the published article). The latest five ran through the API, three runs each, with reasoning explicitly disabled, in June 2026; their transcripts are in the benchmark repository.
Overstatement compares each model's claim to the provable optimum, the maximum after-tax outcome that is actually achievable. That optimum comes from the same deterministic engine shown to be correct on the verification page. The original write-up went one step further and compared each claim to what the model's own recommended schedule actually delivers, which is lower still and so produces even larger ratios.
Gemini 3.1 Pro was excluded on cost, not category. It is reasoning-mandatory, and its run timed out after four minutes, having spent about $0.24 on reasoning with no usable output. That is the same disproportionate-cost behavior that dropped a reasoning model from the original batch. Every run, prompt, and score is reproducible from the raw-data repository.
Why this happens, and the fix
Multi-year ISO scheduling has a search space larger than a model can reason through in context, so the model confidently reports a number that cannot be achieved. The fix is not a better prompt; it is to call a deterministic optimizer that searches the space and returns the provable best.
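One rough way to see the size of that space is to count candidate schedules under a deliberately simple grid. The sketch below assumes one exercise decision per year for four years, whole options only, with any remainder left unexercised. That grid is an assumption for illustration: the benchmark does not specify it, and this is not a description of how OptionsAhoy's optimizer searches. The count is a standard stars-and-bars result, checked against brute force on a tiny case.

```python
from itertools import product
from math import comb


def count_schedules(options: int, years: int) -> int:
    """Ways to split `options` whole options across `years` exercise dates,
    leaving any remainder unexercised (stars and bars: C(options + years, years))."""
    return comb(options + years, years)


def count_schedules_brute(options: int, years: int) -> int:
    """Same count by direct enumeration; only feasible for tiny inputs."""
    return sum(
        1
        for schedule in product(range(options + 1), repeat=years)
        if sum(schedule) <= options
    )


# Sanity check: the closed form matches enumeration on a small case.
assert count_schedules(5, 3) == count_schedules_brute(5, 3)

# Assumed grid for the benchmark scenario: 20,000 options, 4 yearly decisions.
print(f"{count_schedules(20_000, 4):,} candidate schedules")
```

Even on this coarse grid the count is on the order of 10^15, and each candidate still has to be evaluated for tax and time value before schedules can be compared.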
OptionsAhoy is that optimizer, free and keyless. You can check its math on the verification page, wire it into your own agent, or run this benchmark against any model yourself with the open-source tool-use eval.