To add an anecdote, based on logs from my portfolio site: all the major US players (OpenAI, Google, Anthropic, Meta, CommonCrawl) appeared to respect robots.txt, as they claim to (can't say the same of Alibaba).
I do still sometimes get requests with their user agents, but generally from implausible IPs (residential IPs, or "Google-Extended" from an AWS range, or the same IP claiming to be several different bots, ...) and never from the bots' actual published IP addresses, which I did see before adding robots.txt. That makes me believe it's some third party either intentionally trolling or using the larger players as cover for its own bots.
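A check like the one described above (does the claimed bot's request actually come from its published ranges?) could be sketched roughly as follows in Python. This is only an illustration: the bot names and CIDR blocks are placeholders taken from the reserved documentation ranges, not any operator's real list, and `claimed_bot` and `classify` are hypothetical helper names. The real ranges would have to come from each operator's own published list.

```python
import ipaddress

# Placeholder data: reserved documentation networks (RFC 5737), NOT real
# crawler ranges. Substitute each operator's published list here.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24"],
    "OtherBot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent):
    """Return the first known bot name found in the User-Agent, or None."""
    ua = user_agent.lower()
    for name in PUBLISHED_RANGES:
        if name.lower() in ua:
            return name
    return None


def classify(ip, user_agent):
    """Label a request as 'unclaimed', 'verified', or 'spoofed'."""
    bot = claimed_bot(user_agent)
    if bot is None:
        return "unclaimed"  # does not claim to be a known bot
    addr = ipaddress.ip_address(ip)
    networks = (ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[bot])
    # A claimed bot coming from outside its published ranges is suspicious.
    return "verified" if any(addr in net for net in networks) else "spoofed"
```

Running each logged (IP, user agent) pair through `classify` and counting the "spoofed" results per IP would also surface the "same IP claiming to be multiple bots" pattern.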
Using residential IPs is standard operating procedure for companies that rely on collecting information via web scraping. You can rent residential egress IPs. Sometimes this is done in a (kind of) legit way by companies that actually subscribe to residential ISPs. Mostly it's done by malware hijacking consumer devices.