
I hope we keep beating this dead horse some more, I'm still not tired of it.

Benchmarks aren't everything.

Gemini consistently has the best benchmarks but the worst actual real-world results.

Every time they announce the best benchmarks, I try their tools and products again, and each time I immediately go back to the Claude and Codex models, because Google is just so terrible at building actual products.

They are good at research and benchmaxxing, but the day to day usage of the products and tools is horrible.

Try using Google Antigravity and you will not make it an hour before switching back to Codex or Claude Code, it's so incredibly shitty.


That's been my experience too; can't disagree. Still, when it comes to tasks that require deep intelligence (esp. mathematical reasoning [1]), Gemini has consistently been the best.

[1] https://arxiv.org/abs/2602.10177


What’s so shitty about it?

Please no, let's not.

I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...

I get the impression that Google is focusing on benchmarks but without assessing whether the models are actually improving in practical use-cases.

I.e., they are benchmaxxing.

Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.


> but without assessing whether the models are actually improving in practical use-cases

Which cases? Not trying to be rude, but you didn't even provide examples of the cases you're using Claude/Codex/Gemini for.


I find Gemini is outstanding at reasoning (all topics) and architecture (software/system design). On the other hand, Gemini CLI sucks and so I end up using Claude Code and Codex CLI for agentic work.

However, I heavily use Gemini in my daily work and I think it has its own place. Ultimately, I don't see the point of choosing the one "best" model for everything, but I'd rather use what's best for any given task.


I'm glad someone else is finally saying this. I've been mentioning it left and right, and sometimes I feel like I'm going crazy that more people aren't noticing it.

Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.

For the past few weeks, I've also been using XML-like tags in my prompts more often, sometimes to share previous conversations with `<user>` and `<assistant>` tags. Opus/Sonnet handle this just fine, but Gemini has a mental breakdown: it'll just start talking to itself.
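
To be concrete, a prompt of that shape looks roughly like this (a rough sketch in Python; the tag names are just my own convention, not anything either model requires):

    # Illustrative only: embedding a prior exchange in a new prompt
    # using XML-like <user>/<assistant> tags.
    previous_turns = [
        ("user", "How should I profile this function?"),
        ("assistant", "Use cProfile and sort by cumulative time."),
    ]

    # Render each prior turn as <role>text</role>.
    context = "\n".join(
        f"<{role}>{text}</{role}>" for role, text in previous_turns
    )

    prompt = (
        "Here is an earlier conversation for context:\n"
        + context
        + "\n\nPick up where this conversation left off."
    )

Opus/Sonnet treat the tagged block as context; Gemini often starts role-playing both sides of it.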

Even in sessions where nothing out of the ordinary is going on, it goes crazy. After a while, it'll start saying it's going to do something, and then it pretends like it's done that thing, all in the same turn. A turn that never ends. Eventually it just starts spouting repetitive nonsense.

And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200,000 token mark.


Flash is (was?) better than Pro on these fronts.

I exclusively use Gemini for Chat nowadays, and it's been great mostly. It's fast, it's good, and the app works reliably now. On top of that I got it for free with my Pixel phone.

For development I tend to use Antigravity with Sonnet 4.5, or Gemini Flash if it's about a GUI change in React. The layouts and designs Gemini produces have been superior to those from the Claude models, in my opinion, at least at the time. Flash also works significantly faster.

And all of it is essentially free for now. I can even select Opus 4.6 in Antigravity, but I haven't given it a try yet.


Honestly, it doesn't feel like Google is targeting the agentic coding crowd so much as the knowledge worker / researcher / search-engine-replacement market?

Agree Gemini as a model is fairly incompetent inside their own CLI tool as well as in opencode. But I find it useful as a research and document analysis tool.


For my custom agentic coding setup, I use Claude Code-derived prompts with Gemini models, primarily Flash. It's night and day compared to Google's own agentic products, which are all really bad.

The models are all close enough on the benchmarks, and I think people are attributing too much of the difference in the agentic space to the model itself. I strongly believe the difference is in all the other stuff, which is why Anthropic is far ahead of the competition. They have done great work with Claude Code, Cowork, and their knowledge sharing through docs and blog posts; they're bar none on that last point, imo.


Heh, I find Codex to be a far, far smarter model than Claude Code.

And there's a good reason the most "famous" vibe coders, including the OpenClaw creator, all moved to Codex: it's just better.

Claude writes a lot more code to do anything: tons of redundant code, repeated code, etc. Codex is the only model I've seen which occasionally removes more code than it writes.


One major problem: I don't want a journal with unbreakable encryption where I lose all my data if I ever lose the key.

I already pay for a journaling website where I know I can always recover my journals as long as I have access to my Gmail.

So, while I appreciate this security first mindset, for me it actually becomes less interesting. I want my journal to sync to the cloud, I want to be able to unlock it, I don't want to risk losing years of journals if I forget a single key.


Thanks for the feedback! That point is super valid; that's why I created it with multiple authentication slots in mind (currently, it supports both password and public-key authentication), so you can use several simultaneously and don't have to rely on a single point of failure.

For example, if you set up a password and a key, you can use your key, and if it gets lost or compromised, you can still log in with the password, remove the old key, and generate a new one.

You can do the same in reverse: just use the password and keep the key in a safe place (like a password manager or a physical USB), and if you lose your password, you can still get access with the key.
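
For anyone wondering how this kind of slot setup works in general: conceptually, the journal is encrypted with one random master key, and each slot just wraps that same master key with a different credential, so losing one credential only loses that slot. Here's a rough conceptual sketch in Python (illustrative only, not this app's actual code; it uses the `cryptography` package, and a real public-key slot would wrap the master key with a key pair instead of a second password):

    import os, hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    master_key = AESGCM.generate_key(bit_length=256)  # encrypts the journal itself

    def make_password_slot(password: bytes) -> dict:
        """Wrap the master key with a key derived from this password."""
        salt, nonce = os.urandom(16), os.urandom(12)
        kek = hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1, dklen=32)
        return {"salt": salt, "nonce": nonce,
                "wrapped": AESGCM(kek).encrypt(nonce, master_key, None)}

    def open_password_slot(slot: dict, password: bytes) -> bytes:
        """Recover the master key from a slot using its password."""
        kek = hashlib.scrypt(password, salt=slot["salt"], n=2**14, r=8, p=1, dklen=32)
        return AESGCM(kek).decrypt(slot["nonce"], slot["wrapped"], None)

    # Two independent slots for the same master key: losing one
    # credential doesn't lock you out, because the other slot still
    # unwraps the journal key.
    slot_a = make_password_slot(b"everyday password")
    slot_b = make_password_slot(b"recovery passphrase kept offline")
    assert open_password_slot(slot_b, b"recovery passphrase kept offline") == master_key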

Thanks again!


>as long as I have access to my Gmail

I think you should be more cautious about relying on the services of a company like Google, which can arbitrarily decide to remove your account data or access. A similar story, though the person was fortunate enough to regain access: https://hey.paris/posts/appleid/

You can mitigate hardware failure and data loss, especially for a simple key, but you may not be able to prevent Google from deciding your account is gone one day.


This looks rather over-engineered. I built something myself using SQLite in probably 10% of the code (or less), and queries always ran in single-digit milliseconds.

No offense intended but was this vibe coded? Have you tested and verified the code by hand?


What does verifying code by hand even mean? Do we now have artisanal testing, lovingly hand crafted? I get the current debate about LLM assisted coding but this is mudslinging and not constructive discussion.

Verifying the code means checking whether it actually does what you said it does.

When you vibe code things, the AI often thinks it built something that does X, but when you actually test it/verify it, in fact it doesn't do that at all.

I've seen it this week with both Codex and Claude Code. I built something for scraping financial data from various websites. Claude Code said, yes, I built this and it works. I verified the data and it wasn't accurate; it was data for the previous quarter.

I.e. if you're trusting that the AI built your RAG perfectly without testing it, you're probably mistaken.


"I took a look" is the human version of vibe coding.

OpenClaw was a thing two weeks ago. No one cares anymore about this security-hole-ridden, vibe-coded OpenAI project.

Perhaps not, but the idea is sound. The implementation leaves something to be desired, yes.

I have seldom seen so many bad takes in two sentences.

I would be blown away if Anthropic, OpenAI, and Google don't all offer this functionality in the near future.

Also, I'm not sure many engineers really WANT to be able to do more "vibe coding" away from their laptop.

Is this really what our job is going to be? Typing in prompts from a mobile phone?

Good luck guys but I think a pivot might be in order.


You couldn't vibe-code Dropbox in an hour though...

You wouldn’t vibe-code a car...

hmm, i wonder how long it would take nowadays actually
