Benchmark

Can a language model do your equity-comp taxes?

We gave frontier models one multi-year incentive stock option (ISO) problem that has a single, provable right answer, and scored what each model claimed against what is actually achievable. Every model overshoots.

The scenario

Grant: 20,000 ISOs · $2 strike · $200 fair market value
Granted: January 2022 (two-year holding test met; 2032 expiration not binding)
Filing: Married filing jointly · $300,000 ordinary income · California
Status: Still employed · no prior alternative minimum tax (AMT) credit
Cash: Idle cash at 5.5% per year to cover taxes at exercise
Stock: 17% expected annual return · 72% volatility
Horizon: 4 years

Goal: the exercise schedule that maximizes after-tax net final value (NFV), counting the time-value cost of taxes paid early.

The provable optimum for this scenario, recomputed live from the production calculators, is $739,600.82. The overstatement column is each model's claimed NFV divided by that optimum. A score of 1.00x would mean the model found the achievable best; higher means it claimed an outcome that cannot exist.
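As a worked illustration of that ratio, the sketch below recomputes the overstatement for two of the stated values reported on this page. The function name `overstatement` and the rounding to two decimals are assumptions made for illustration; this is not the benchmark's own scoring code.

```python
OPTIMUM_NFV = 739_600.82  # provable optimum for this scenario, as stated above


def overstatement(claimed_nfv: float, optimum_nfv: float = OPTIMUM_NFV) -> float:
    """Claimed NFV divided by the achievable optimum, rounded to 2 decimals."""
    if optimum_nfv <= 0:
        raise ValueError("optimum must be positive")
    return round(claimed_nfv / optimum_nfv, 2)


# Stated values as published on this page (dollars, already rounded to $10k).
claims = {"Grok 4.3 low": 3_940_000, "Grok 4.3 high": 12_980_000}
for label, claimed in claims.items():
    print(f"{label}: {overstatement(claimed):.2f}x")
```

A claim equal to the optimum scores 1.00x; anything above that is an outcome the scenario cannot produce.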

The original five-model run was published on HackerNoon; every prompt, response, and score is open in the benchmark repository.

Latest models (June 2026)

About a month later (June 2026), we ran the same locked prompt against five of the latest frontier models, three runs each, with reasoning disabled. The finding holds, and the overshoot is unstable: Grok 4.3 claimed $3.94M, $5.22M, and $12.98M on three runs of the identical problem, and GPT-5.5 abstained on one run.

Model	Stated NFV	Overstatement
Grok 4.3 (3 runs)	$3.94M to $12.98M	5.33x to 17.55x
GPT-5.5 (3 runs; one abstained)	$1.38M to $3.41M	1.87x to 4.61x
DeepSeek V3.2 (3 runs)	$1.30M to $2.74M	1.76x to 3.70x
Claude Opus 4.8 (3 runs)	$1.54M to $1.57M	2.08x to 2.12x
Qwen 3.7 Max (3 runs)	$1.20M to $2.06M	1.62x to 2.79x
Provable optimum	$739,600.82	1.00x

Original benchmark (May 2026)

Five frontier models, three independent runs each, on the consumer interface. Stated NFV is the range across the three runs.

Model	Stated NFV	Overstatement
Claude Opus 4.7 (3 runs)	$1.56M to $1.79M	2.11x to 2.42x
GPT-5.5 (3 runs)	$1.43M to $1.54M	1.93x to 2.08x
Grok 4.20 (multi-agent) (3 runs)	$1.37M to $1.43M	1.85x to 1.93x
Gemini 2.5 Pro (3 runs)	$1.21M to $2.43M	1.64x to 3.29x
Mistral Large 2512 (3 runs)	$3.60M to $10.98M	4.87x to 14.85x
Provable optimum	$739,600.82	1.00x

How this is scored

Every model received the identical verbatim prompt, with all twelve scenario inputs and no hints about the alternative minimum tax (AMT) or state tax. No system prompt, no tools, a fresh session each time. The original five ran on the consumer interface, three runs each, in May 2026 (the full write-up is in the published article). The latest five ran through the API, three runs each, with reasoning explicitly disabled, in June 2026; their transcripts are in the benchmark repository.

Overstatement compares each model's claim to the provable optimum, the maximum after-tax outcome that is actually achievable. That optimum comes from the same deterministic engine shown to be correct on the verification page. The original write-up went one step further and compared each claim to what the model's own recommended schedule actually delivers, which is lower still and so produces even larger ratios.

Gemini 3.1 Pro was excluded on cost, not category. It is reasoning-mandatory, and the run timed out after four minutes, having spent about $0.24 on reasoning with no usable output. This is the same disproportionate-cost behavior that dropped a reasoning model from the original batch. Every run, prompt, and score is reproducible from the raw-data repository, which is permanently archived with a citable DOI (10.5281/zenodo.20746889).

Why this happens, and the fix

Multi-year ISO scheduling has a search space larger than a model can reason through in context, so it confidently reports a number that cannot be achieved. The fix is not a better prompt; it is to call a deterministic optimizer that searches the space and returns the provable best.
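To give a sense of scale, the sketch below counts candidate schedules under a deliberately simplified assumption: one exercise decision per year over the 4-year horizon, at single-share granularity, with any remainder left unexercised. That discretization and both function names are illustrative assumptions; the production optimizer's actual search space is not described on this page.

```python
from itertools import product
from math import comb


def count_schedules(options: int, years: int) -> int:
    """Ways to choose non-negative yearly exercise amounts that sum to
    at most `options` (stars-and-bars count: C(options + years, years))."""
    return comb(options + years, years)


def count_by_enumeration(options: int, years: int) -> int:
    """Brute-force cross-check of the formula; only feasible for tiny inputs."""
    return sum(
        1
        for schedule in product(range(options + 1), repeat=years)
        if sum(schedule) <= options
    )


# The closed form agrees with exhaustive enumeration on a toy case.
assert count_schedules(6, 4) == count_by_enumeration(6, 4)

# The benchmark grant: 20,000 ISOs over a 4-year horizon.
print(f"{count_schedules(20_000, 4):.3e} candidate schedules")
```

Even under this coarse yearly model the count is on the order of 10^15, far more than can be walked through step by step in a prompt, which is why the search is better handed to a deterministic optimizer.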

OptionsAhoy is that optimizer, free and keyless. See that its math is correct, wire it into your own agent, or run this benchmark against any model yourself with the open-source tool-use eval.