Davidzheng · 2026-02-10T05:10:58 1770700258

Honestly, for research-level math, the reasoning level of Gemini 3 is much below GPT 5.2 in my experience, but I think most of the failure is accounted for by Gemini pretending to solve problems it in fact failed to solve, vs GPT 5.2 gracefully saying it failed to prove the result in general.

mapontosevenths · 2026-02-10T05:23:19 1770700999

Have you tried Deep Think? You only get access with the Ultra tier or better... but wow. It's MUCH smarter than GPT 5.2 even on xhigh. Its math skills are a bit scary actually. Although it does tend to think for 20-40 minutes.

Davidzheng · 2026-02-10T15:22:01 1770736921

I tried Gemini 2.5 Deep Think, was not very impressed ... too many hallucinations. In comparison, GPT 5.2 extended time hallucinates like <25% of the time, and if you ask another copy to proofread, it goes even lower.

Davidzheng · 2026-02-08T15:18:26 1770563906

makes you wonder how automatable this babysitter role is...

pfdietz · 2026-02-08T15:41:18 1770565278

That was my reaction.

Davidzheng · 2026-02-08T15:17:27 1770563847

really interested in what the brain does when it "loads" the context for something it's familiar with but that is currently unloaded from working memory. Does it mostly try to align some internal state? Or does it more just load memories into fast access?

Davidzheng · 2026-02-08T14:46:03 1770561963

how do you take over the world if you have access to 1000 normal people? That is, if AGI is taken by the original definition (long forgotten by now) of surpassing the MEDIAN human at almost all tasks. How the rebranding of ASI into AGI happened without anyone noticing is kind of insane

Davidzheng · 2026-02-08T14:43:39 1770561819

"People living in less advanced economies will do OK, but the rest of us not so much" how is this possible? are the less advanced economies protected from outside influences? are they also protected from immigration?

iugtmkbdfil834 · 2026-02-08T16:30:32 1770568232

Not OP, but assuming I am following the argument correctly, I think parent is referring to something else. Advanced economies have participants who function well in that environment and are shaped by it to a large degree. As a result, if one were to ask them to get food in an environment where it is not as easily accessible as it is today, they might stumble. On the other hand, in the old country, a lot of people I knew had a tendency to have a little garden, hunt every so often, forage for mushrooms and so on. In other words, more individuals may be able to survive in less developed economies precisely because they are less developed and less reliant on the conveniences today brings.

Davidzheng · 2026-02-08T17:08:53 1770570533

ah makes a lot of sense! thanks!

Davidzheng · 2026-02-08T09:53:52 1770544432

Sure, but in pure mathematics there are a lot of well-specified problems which no one can solve.

pegasus · 2026-02-08T14:09:17 1770559757

Mathematics is indeed one of those rare fields where intimate knowledge of human nature is not paramount. But even there, I don't expect LLMs to replace top-level researchers. The same evolutionary "baggage" which makes simulating and automating humans away impossible is also what enables (some of) us to have the deep insight into the most abstract regions of maths. In the end it all relies on the same skills developed through millions of years of tuning into the subtleties of 3D geometry, physics, psychology and so on.

Davidzheng · 2026-02-08T09:53:04 1770544384

I find it unbelievable that they can't settle this question themselves, without posting this, simply by asking the AI enough novel questions. I myself have little doubt that the models can solve at least some novel questions (of course, similarity of proofs is a spectrum, so it's hard to draw the line at how original they are)

okintheory · 2026-02-08T09:58:50 1770544730

I settle this question for myself every month: I try asking ChatGPT and Gemini for help, but in my domains they fail miserably at anything that looks new. But, YMMV, that's just the experience of one professional mathematician.

seg_lol · 2026-02-08T10:45:13 1770547513

New doesn't have to mean "the hardest thing yet", but for humans mastering a subdomain, the two are often the same.

nxobject · 2026-02-08T16:46:28 1770569188

Trust, but verify, no? No one benefits from refusing to experiment and test.

Davidzheng · 2026-02-08T09:25:33 1770542733

Even if your argument is correct here, it would only mean this particular method of replacement doesn't immediately work for this job.

sublinear · 2026-02-09T01:50:18 1770601818

Thanks for the reply. I'm not sure I understand. Why does replacement matter so much? Where do these anxieties come from?

I'm saying that this type of employment is a necessary diffusion layer for making decisions, and isn't about "productivity". Payroll for these kinds of jobs is even considered capex. Efficiency misses the point entirely and is tripping over dollars to pick up pennies.

Davidzheng · 2026-02-06T05:25:43 1770355543

"coding agents have been out for more than couple years"?????

gavinflud · 2026-02-06T10:34:54 1770374094

Depends on what we categorize as a coding agent. Devin was released two years ago. Cursor was about the same, and it released agent mode around 1.5 years ago. Aider has been around even longer than that I think.

Davidzheng · 2026-02-06T03:59:50 1770350390

It's important to remember though (this is beside the point for what you're saying) that job displacement of things like secretaries from AI does not require it to be a near-perfect replacement. There are many other factors (for example, if it's much cheaper and can do part of the work, it can dramatically shrink demand, as people can shift to an imperfect replacement in AI).