Code, Strings, and Keys

Cost vs. Accuracy of GitHub Copilot LLM Models

Model Overview with Pricing and Capabilities

Model	Primary Task Area	Key Strengths	Additional Capabilities	HumanEval Accuracy	SWE-bench Accuracy	Premium Request Multiplier (Paid Plans)	Premium Request Multiplier (Free Plan)	Effective Cost per Request*
GPT-4.1	General-purpose coding and writing	Fast, accurate code completions and explanations	Agent mode, visual	~84%	~55%	0× (included)	1×	$0.00
GPT-4o	General-purpose coding and writing	Fast completions and visual input understanding	Agent mode, visual	~90%	~33%	0× (included)	1×	$0.00
o3	Deep reasoning and debugging	Multi-step problem solving and architecture-level code analysis	Reasoning	~92%	~72%	1×	N/A	$0.04
o4-mini	Fast help with simple tasks	Fast, reliable answers to lightweight coding questions	Lower latency	~90%	~49%	0.33×	N/A	$0.013
Claude Opus 4	Deep reasoning and debugging	Complex problem-solving challenges, sophisticated reasoning	Reasoning, vision	~94%	72.5%	10×	N/A	$0.40
Claude Sonnet 3.5	Fast help with simple tasks	Quick responses for code, syntax, and documentation	Agent mode	93.7%	~60%	1×	1×	$0.04
Claude Sonnet 3.7	Deep reasoning and debugging	Structured reasoning across large, complex codebases	Agent mode	~91%	~62%	1×	N/A	$0.04
Claude Sonnet 3.7 Thinking	Deep reasoning and debugging	Enhanced reasoning with explicit thought processes	Agent mode, reasoning chains	~92%	~64%	1.25×	N/A	$0.05
Claude Sonnet 4	Deep reasoning and debugging	Performance and practicality, perfectly balanced for coding workflows	Agent mode, vision	~92%	72.7%	1×	N/A	$0.04
Gemini 2.5 Pro	Deep reasoning and debugging	Complex code generation, debugging, and research workflows	Reasoning	~88%	63.8%	1×	N/A	$0.04
Gemini 2.0 Flash	Working with visuals	Real-time responses and visual reasoning for UI and diagram-based tasks	Visual, low latency	~85%	~45%	0.25×	1×	$0.01

Plan Allowances

Free Plan (Copilot Free)

Code completions: Up to 2,000 per month
Premium requests: Up to 50 per month
Available models: GPT-4.1, GPT-4o, Claude Sonnet 3.5, Gemini 2.0 Flash
All chat interactions count as premium requests

Paid Plans

Code completions: Unlimited (with included models)
Chat interactions: Unlimited (with included models)
Premium request allowances:
- Copilot Pro: 300 premium requests/month
- Copilot Pro+: 1,500 premium requests/month
- Copilot Business: 300 premium requests/month
- Copilot Enterprise: 1,000 premium requests/month
Overage pricing: $0.04 per additional premium request
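
As a rough illustration of how the allowance and overage rate interact, the sketch below totals the premium requests a month of usage would consume and prices anything beyond the allowance at $0.04. The usage counts are hypothetical, the multipliers are the paid-plan values from the table above, and the assumption that fractional premium requests are billed proportionally is mine rather than something this post confirms.

```python
OVERAGE_RATE_USD = 0.04  # per additional premium request, from the plan allowances above


def monthly_overage(usage, multipliers, allowance):
    """Return (premium_requests_used, overage_cost_usd) for one month.

    usage:       model name -> number of requests sent (hypothetical figures)
    multipliers: model name -> paid-plan premium request multiplier
    allowance:   premium requests included in the plan
    """
    used = sum(count * multipliers[model] for model, count in usage.items())
    extra = max(0.0, used - allowance)  # only requests beyond the allowance are billed
    return used, round(extra * OVERAGE_RATE_USD, 2)


# Paid-plan multipliers copied from the table; usage numbers are made up.
multipliers = {"GPT-4.1": 0.0, "Claude Sonnet 4": 1.0, "Claude Opus 4": 10.0}
usage = {"GPT-4.1": 500, "Claude Sonnet 4": 200, "Claude Opus 4": 20}

used, cost = monthly_overage(usage, multipliers, allowance=300)  # Copilot Business
print(f"premium requests used: {used:.0f}, overage: ${cost:.2f}")
```

In this example the 20 Claude Opus 4 requests consume as many premium requests as the 200 Claude Sonnet 4 requests, which is why the 10× multiplier dominates a budget so quickly.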

Accuracy Rankings

Top Performers by Benchmark

HumanEval (Code Generation)

Claude Opus 4: ~94% – Highest HumanEval score among the listed models
Claude Sonnet 3.5: 93.7% – Excellent for code generation
o3: ~92% – Strong reasoning-based coding
Claude Sonnet 4: ~92% – Balanced performance
Claude Sonnet 3.7 Thinking: ~92% – Enhanced reasoning

SWE-bench (Real-world Software Engineering)

Claude Sonnet 4: 72.7% – Highest SWE-bench score among the listed models
Claude Opus 4: 72.5% – Nearly tied for first
o3: ~72% – Strong on complex problems
Claude Sonnet 3.7 Thinking: ~64% – Good reasoning approach
Gemini 2.5 Pro: 63.8% – Solid on engineering tasks

Model Selection Guide

For Cost Efficiency

Gemini 2.0 Flash (0.25× multiplier) – Best value for quick tasks
o4-mini (0.33× multiplier) – Fast, lightweight responses
GPT-4.1/GPT-4o (0× multiplier) – Free for paid plan users

For Complex Reasoning

Claude Opus 4 (10× multiplier) – Most powerful, highest cost
o3 (1× multiplier) – Strong reasoning at standard cost
Claude Sonnet 4 (1× multiplier) – Balanced performance and cost

For Visual Tasks

Gemini 2.0 Flash – Optimized for visual input, low cost
GPT-4o – Strong visual capabilities, free for paid users
Claude Sonnet 4 – Vision support with reasoning
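
One way to turn the guide above into a repeatable rule is to pick the lowest-multiplier model that clears a minimum SWE-bench score. The sketch below does that for a subset of the table. The `Model` tuple and `cheapest_meeting` helper are hypothetical names, and the approximate ("~") scores are treated as exact values, which overstates their precision.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    swe_bench: float   # percent; "~" values from the table treated as exact
    multiplier: float  # paid-plan premium request multiplier


# Subset of the table above (paid-plan multipliers).
MODELS = [
    Model("GPT-4.1", 55.0, 0.0),
    Model("o3", 72.0, 1.0),
    Model("o4-mini", 49.0, 0.33),
    Model("Claude Opus 4", 72.5, 10.0),
    Model("Claude Sonnet 4", 72.7, 1.0),
    Model("Gemini 2.5 Pro", 63.8, 1.0),
    Model("Gemini 2.0 Flash", 45.0, 0.25),
]


def cheapest_meeting(min_swe_bench: float, models=MODELS) -> Optional[Model]:
    """Return the lowest-multiplier model at or above the threshold.

    Ties on multiplier are broken in favor of the higher SWE-bench score.
    Returns None when no model qualifies.
    """
    candidates = [m for m in models if m.swe_bench >= min_swe_bench]
    if not candidates:
        return None
    return min(candidates, key=lambda m: (m.multiplier, -m.swe_bench))


for threshold in (50, 70):
    choice = cheapest_meeting(threshold)
    print(threshold, choice.name if choice else "no model qualifies")
```

With these figures, a threshold of 50 selects GPT-4.1 (included at 0×), while a threshold of 70 selects Claude Sonnet 4, which matches Claude Opus 4 on SWE-bench at a tenth of the multiplier.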

Notes

*Effective cost per request based on $0.04 base rate × multiplier
Premium request counters reset monthly on the 1st at 00:00:00 UTC
Unused requests don’t carry over to the next month
Rate limiting applies during high demand periods
Model availability varies by plan type
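
The effective-cost column follows directly from the first note. A minimal sketch of that arithmetic, using `Decimal` to avoid floating-point noise in currency values (the `effective_cost` helper is a hypothetical name):

```python
from decimal import Decimal

BASE_RATE = Decimal("0.04")  # USD per premium request, per the first note above


def effective_cost(multiplier: str) -> Decimal:
    """Effective cost per request: base rate times the model's multiplier."""
    return BASE_RATE * Decimal(multiplier)


# Multipliers taken from the table; passed as strings to keep them exact.
for name, mult in [
    ("o4-mini", "0.33"),
    ("Claude Sonnet 3.7 Thinking", "1.25"),
    ("Claude Opus 4", "10"),
]:
    print(f"{name}: ${effective_cost(mult)}")
```

The o4-mini product is 0.0132, which the table rounds to $0.013.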

September 18, 2025

ThomasPowell
