📝 LLM To-Do App Evaluation Showdown

With the rapid progress in generative AI, many developers are turning to Large Language Models (LLMs) for frontend code assistance. But how do these models fare when handed identical requirements for a productivity app? We evaluated seven top LLMs on the same fixed challenge (build a To-Do app with HTML, jQuery, and Tailwind CSS), rating each on speed, cost, and quality. ⭐

Below are the results, with links to each implementation, easy-to-read star ratings, and a brief comparative analysis.


📊 At-a-Glance Comparison Table

| LLM (Result) | Speed (10★) | Cost (10★) | Quality (10★) | Avg Score |
|---|---|---|---|---|
| DeepSeek 3.7 0324 | ★★★★★☆☆☆☆☆ 5/10 | ★★★★★★★★★☆ 9/10 | ★★★★★★★★★★ 10/10 | 8.0 |
| Gemini flash 2.5 exp. | ★★★★★★★☆☆☆ 7/10 | ★★★★★★★★☆☆ 8/10 | ★★★★★★★★★☆ 9/10 | 8.0 |
| Llama 3.3 | ★★★★★★★★★☆ 9/10 | ★★★★★★★★★★ 10/10 | ★★★★★★☆☆☆☆ 6/10 | 8.3 |
| Gemini flash 2.0 | ★★★★★★★★★☆ 9/10 | ★★★★★★★★★☆ 9/10 | ★★★★★★★☆☆☆ 7/10 | 8.3 |
| Llama 4 Maverick | ★★★★★★★☆☆☆ 7/10 | ★★★★★★★★☆☆ 8/10 | ★★★★★☆☆☆☆☆ 5/10 | 6.7 |
| Claude Sonnet 3.7 | ★★★★☆☆☆☆☆☆ 4/10 | ★★☆☆☆☆☆☆☆☆ 2/10 | ★★★★★★★★☆☆ 8/10 | 4.7 |
| OpenAI GPT 4.1 | ★★★☆☆☆☆☆☆☆ 3/10 | ★★★☆☆☆☆☆☆☆ 3/10 | ★★★★★★★★☆☆ 8/10 | 4.7 |
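For readers who want to sanity-check the table, the Avg Score column is simply the unweighted mean of the three 10-point ratings, rounded to one decimal. A small JavaScript sketch (the helper names are ours, not part of any evaluated output):

```javascript
// Mean of the three 10-point ratings, rounded to one decimal place,
// matching the Avg Score column above.
function avgScore(speed, cost, quality) {
  return Math.round(((speed + cost + quality) / 3) * 10) / 10;
}

// Render an n/10 rating as the star strings used in the table.
function stars(n) {
  return "★".repeat(n) + "☆".repeat(10 - n);
}
```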

⭐ Detailed Reviews

DeepSeek 3.7 0324

  • Speed: 5/10 (46.9 tokens/s) ★★★★★☆☆☆☆☆
  • Cost: 9/10 ($0.0049) ★★★★★★★★★☆
  • Quality: 10/10 ★★★★★★★★★★
  • Summary: Perfect prompt adherence, crisp UX, high code clarity, but a bit slow.

Gemini flash 2.5 experimental

  • Speed: 7/10 (92.4 tokens/s) ★★★★★★★☆☆☆
  • Cost: 8/10 ($0.0028) ★★★★★★★★☆☆
  • Quality: 9/10 ★★★★★★★★★☆
  • Summary: Excellent all-rounder; cost slightly above ideal, thorough implementation, stellar code comments.

Llama 3.3

  • Speed: 9/10 (171 tokens/s) ★★★★★★★★★☆
  • Cost: 10/10 ($0.0024) ★★★★★★★★★★
  • Quality: 6/10 ★★★★★★☆☆☆☆
  • Summary: Blazing fast and ultra-cheap, but lacks a backend/localStorage persistence fallback and some UI polish.
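The persistence fallback that Llama 3.3's build omits could be as small as the sketch below (plain JavaScript; the function and key names are illustrative, not taken from the generated app): try localStorage first, and fall back to an in-memory store when it is unavailable.

```javascript
// In-memory fallback used when localStorage is unavailable
// (e.g. private browsing, quota errors, or a non-browser runtime).
const memoryStore = new Map();

function saveTodos(todos) {
  const json = JSON.stringify(todos);
  try {
    localStorage.setItem("todos", json);
  } catch (e) {
    memoryStore.set("todos", json); // degrade gracefully, do not crash
  }
}

function loadTodos() {
  try {
    return JSON.parse(localStorage.getItem("todos")) || [];
  } catch (e) {
    return JSON.parse(memoryStore.get("todos") || "[]");
  }
}
```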

Gemini flash 2.0

  • Speed: 9/10 (181 tokens/s) ★★★★★★★★★☆
  • Cost: 9/10 ($0.00257) ★★★★★★★★★☆
  • Quality: 7/10 ★★★★★★★☆☆☆
  • Summary: Fast and affordable; almost perfect adherence but minor UX and code modularity quirks.

Llama 4 Maverick

  • Speed: 7/10 (118 tokens/s) ★★★★★★★☆☆☆
  • Cost: 8/10 ($0.0023) ★★★★★★★★☆☆
  • Quality: 5/10 ★★★★★☆☆☆☆☆
  • Summary: Attractive UI at a low cost, but faulty drag-and-drop logic significantly harms usability.
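Drag-and-drop bugs like Maverick's often come down to the reorder index math rather than the drag events themselves: removing the dragged item shifts every later index before the insert happens. A hedged sketch of the pure reorder step (our own helper, not Maverick's code), kept separate from the DOM so it can be tested in isolation:

```javascript
// Move the item at index `from` to index `to`, returning a new array.
// Doing the removal first, then the insert, keeps the index math correct.
function moveItem(list, from, to) {
  const copy = list.slice();           // do not mutate the caller's array
  const [item] = copy.splice(from, 1); // remove the dragged item
  copy.splice(to, 0, item);            // insert at the drop position
  return copy;
}
```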

Claude Sonnet 3.7

  • Speed: 4/10 (85.4 tokens/s) ★★★★☆☆☆☆☆☆
  • Cost: 2/10 ($0.062652) ★★☆☆☆☆☆☆☆☆
  • Quality: 8/10 ★★★★★★★★☆☆
  • Summary: Well-organized, maintainable, and feature-complete code but slow and costly.

OpenAI GPT 4.1

  • Speed: 3/10 (80 tokens/s) ★★★☆☆☆☆☆☆☆
  • Cost: 3/10 ($0.0338) ★★★☆☆☆☆☆☆☆
  • Quality: 8/10 ★★★★★★★★☆☆
  • Summary: Functionally strong and readable, but a popup dialog that is always visible mars the experience, and it sits among the slowest and priciest of the lot.
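An always-visible popup is usually the symptom of a dialog whose closed state was never made the default. A minimal sketch of the fix (our own illustration, not GPT 4.1's code): keep the open/closed flag in one place, start it as false, and mirror it into the DOM (for example by toggling Tailwind's `hidden` class).

```javascript
// Dialog state kept in one place; `open` defaults to false so the
// popup starts hidden. In the app this flag would drive a class toggle.
function createDialog() {
  let open = false;
  return {
    isOpen: () => open,
    show: () => { open = true; },
    hide: () => { open = false; },
  };
}
```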

🤔 In-Depth Comparison

  • Best for Code Quality:
    DeepSeek 3.7 0324 and Gemini flash 2.5 experimental lead in code organization and feature completeness, with DeepSeek earning a perfect 10/10 on quality.
  • Fastest & Cheapest:
    Llama 3.3 and Gemini flash 2.0 are the clear speed and budget champions, though at a small cost to completeness.
  • Balanced Performer:
    Gemini flash 2.5 experimental hits a sweet spot with excellent quality and good cost/performance.
  • Room for Improvement:
    Llama 4 Maverick, Claude Sonnet 3.7, and OpenAI GPT 4.1 lagged behind, each for a different reason: bugginess, cost, or slowness.

🚦 Final Verdict

Most LLMs did an impressive job with code structure and UI. DeepSeek 3.7 0324 edged ahead as the top pick on the strength of its perfect quality score and solid cost, with Gemini flash 2.5 experimental not far behind. For those who prize raw speed and price, Llama 3.3 and Gemini flash 2.0 deliver, and their averages even come out slightly higher. Meanwhile, Llama 4 Maverick and Claude Sonnet 3.7 could benefit from bugfixes and optimization.

Curious to see the full implementations? Click any LLM’s name above to dive into the code! 🚀


What do you think? Would you trust an LLM to write your productivity apps? Let us know in the comments!