The homepage uses server-rendered Next.js/React, which pushes the boundaries of modern web dev. Not surprising. But I still think it's the future once it's more stable.
That's like saying "having a public website is an invitation to DoS attacks".
There are conventions and reasonable expectations. Until now I did not expect that a tracking pixel would be the basis for crawling; so far most crawlers tend to crawl what's publicly linked, not what's potentially publicly reachable if one knows every URL there is.
Posting a file to a public web server is an implicit invitation for clients (human or automated) to download that file. That's why "secret urls" are universally considered to provide very little security.
There are common conventions (not always followed) around robots.txt and what files to crawl, but I'm not aware of any rules or conventions or standards around URL discovery. Plenty of crawlers attempt to crawl every registered domain name, for example.
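For what it's worth, the robots.txt convention only governs what a well-behaved crawler should fetch, not how it discovers URLs in the first place. A minimal sketch of how a conforming crawler checks those rules, using Python's stdlib parser (the paths and bot name here are made up for illustration):

```python
from urllib import robotparser

# Parse a hypothetical robots.txt that blocks one directory for all agents.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A crawler honoring the convention skips /private/ but may fetch the rest.
print(rp.can_fetch("SomeBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("SomeBot", "https://example.com/public/page"))   # True
```

Note that this is purely advisory: it does nothing against a crawler that simply ignores robots.txt.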
"DOS Attack" is sort of a loaded term since it implies malice. Clearly running a web server doesn't mean you invite malicious attacks (though perhaps you should expect them). Some people consider Googlebot to be a DOS attack since it can easily bring poorly designed sites to their knees.
- marketing wants some tracking, so a developer adds it
- ecommerce websites in the real world tend to "need" these tracking/conversion codes
- you do have legitimate GET requests, like password-reset links with tokens; we also use payment providers who send customers back to us via GET links that include payment tokens, and newsletter-unsubscribe links are often simple token links as well
- and yes, normally a GET request should not change anything (at least not when it's simply repeated), but the sheer fact that they have access to these URLs _and_ are crawling them is bad
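The usual mitigation for token links is exactly the "GET should not change anything" rule above: the GET only shows a confirmation page, and the state change happens on an explicit POST, so a crawler that merely follows the link changes nothing. A minimal sketch of that pattern (the function names and data here are hypothetical, not from any real codebase):

```python
# Hypothetical unsubscribe flow. GET is safe (renders confirmation only);
# only a POST actually performs the state change.
subscribers = {"tok123": "alice@example.com"}  # token -> address (made-up data)

def handle_get(token: str) -> str:
    """What a crawler following the link sees; nothing is mutated."""
    if token in subscribers:
        return "Click the button to confirm unsubscribing."
    return "Unknown token."

def handle_post(token: str) -> str:
    """Only an explicit form submission removes the subscriber."""
    if subscribers.pop(token, None) is not None:
        return "You are unsubscribed."
    return "Unknown token."
```

A crawler issuing `GET /unsubscribe?token=tok123` leaves the list intact; only the confirming POST mutates it.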
My point being that I find it surprising that they would just crawl everything they recorded instead of only crawling pages which are linked publicly or targeted in ad campaigns, combined with the fact that they don't warn you about it.
> My point being that I find it surprising that they would just crawl everything they recorded instead of only crawling pages which are linked publicly or targeted in ad campaigns
There's no way to know which pages are linked publicly without crawling every page for links. So you're right back at square one.
Ultimately, if it's on an Internet-facing web server and not hidden behind an IP whitelist or a secure login, then you have to assume it is public. All you are arguing about is different degrees of "public", which somewhat misses the real issue of website security.
Some crawlers deliberately hit random URLs to check how you're handling 404s. Other crawlers are entirely dishonest and will try to find content that wasn't intended to be public. How are you going to handle them if you're stumped by the Facebook crawlers that you invited onto your site?
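If you do want to treat declared crawlers differently from dishonest ones, the usual first step is user-agent matching, keeping in mind that any crawler can send whatever string it likes, so this is a hint rather than a security boundary. A minimal sketch (the token list is illustrative; `facebookexternalhit` is Facebook's crawler token, the others are examples):

```python
# Substring match against known crawler tokens. Trivially spoofable,
# so treat a match as a hint, never as authentication.
KNOWN_CRAWLER_TOKENS = ("facebookexternalhit", "googlebot", "bingbot")

def is_declared_crawler(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_CRAWLER_TOKENS)
```

For example, `is_declared_crawler("facebookexternalhit/1.1")` is true, while an ordinary browser user agent is not, and a dishonest crawler pretending to be a browser will sail right past this check.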
> ...combined with the fact that they don't warn you about it
It's pretty obvious behavior in my opinion, but maybe they could have been more explicit. However, going back to my previous point, no other crawler advertises what it's going to crawl beforehand. So where do you draw the line? Ranting that Google indexed your site? What about visitors buying stuff through your ecommerce package without prior communication requesting access to the site?
You wouldn't ask customers in a brick-and-mortar store to state their intentions the moment they walked through the door, so why should every HTTP user agent have to do the same? While web security can be both complex and maddening, the responsibility for hardening the site is still yours, not Facebook's.
I did similar research for a project. One vendor looked better than all the others: good documentation and an extensive, versioned API... or so I thought. The documentation was old and incomplete, and the API breaks regularly without notice. So my advice: try to get info from someone who already uses that vendor. I was very shocked and disappointed after a month.