ziflex · on Oct 7, 2018

Nope, just thought that ferrets are cool :)

ziflex · on Oct 2, 2018

This package is more like a runtime. There are plans to create a dedicated server where you would be able to store your queries, schedule them, and set up output streams, as in Spark or Flink. For now, it does not respect robots.txt, but that can easily be added.

Out of the box, there is no scaling mechanism yet, since the project is a WIP. But it's written in Go, which makes it pretty fast.

One idea for how you could scale it is to run a cluster of headless Chrome instances, put a proxy/load balancer in front of it, and give Ferret a URL to the cluster. Ferret will treat it as a single instance of Chrome. The only problem is that you would need to distinguish requests from the CDP (Chrome DevTools Protocol) client and, once a page is open, route all related requests to the same Chrome instance.
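
To make the "same Chrome instance" requirement concrete, here is a minimal Go sketch of the routing table such a proxy might keep. It is not part of Ferret: `StickyRouter`, its methods, and the backend addresses are all hypothetical names, and the sketch assumes page connections use a path of the form `/devtools/page/<id>`. It covers only the bookkeeping, not the actual HTTP/WebSocket proxying.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// StickyRouter is a hypothetical routing table for a proxy that sits in
// front of several headless Chrome instances. It is an illustration only.
type StickyRouter struct {
	mu       sync.Mutex
	backends []string          // addresses of the Chrome instances
	next     int               // round-robin cursor for new pages
	owner    map[string]string // page ID -> backend that opened the page
}

// NewStickyRouter builds a router over the given backend addresses.
func NewStickyRouter(backends ...string) *StickyRouter {
	return &StickyRouter{backends: backends, owner: make(map[string]string)}
}

// PickForNewPage chooses a backend round-robin for a request that opens a page.
func (r *StickyRouter) PickForNewPage() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	b := r.backends[r.next%len(r.backends)]
	r.next++
	return b
}

// Bind records which backend opened a page so later requests can follow it.
func (r *StickyRouter) Bind(pageID, backend string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.owner[pageID] = backend
}

// Route returns the backend that owns the page named in the path.
// Assumption: page traffic uses paths shaped like /devtools/page/<id>.
func (r *StickyRouter) Route(path string) (string, bool) {
	const prefix = "/devtools/page/"
	if !strings.HasPrefix(path, prefix) {
		return "", false
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	b, ok := r.owner[strings.TrimPrefix(path, prefix)]
	return b, ok
}

func main() {
	r := NewStickyRouter("chrome-a:9222", "chrome-b:9222")
	backend := r.PickForNewPage() // a new page lands on some instance
	r.Bind("PAGE1", backend)      // remember where it lives
	routed, ok := r.Route("/devtools/page/PAGE1")
	fmt.Println(routed, ok)
}
```

A real proxy would also have to learn the page ID from the instance's response when the page is created and drop the entry when the page closes.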

ziflex · on Oct 2, 2018

And you can open as many pages as you want in a single query (or as many as your memory allows :) )

ziflex · on Oct 2, 2018

That's what you can do right now :)

https://github.com/MontFerret/ferret/blob/master/docs/exampl...

The document returned from the DOCUMENT() function represents an open browser tab, which lets you perform all interactions with the page.

ianbicking · on Oct 3, 2018

Well, that's what I'm saying... right now, making it represent an open browser tab with a specific state and where everything DOES something isn't declarative. But it could be declarative if you changed how those commands are implemented.

Or, to phrase it another way: if the program represents a PLAN then it's declarative. If it represents a series of things to DO then it's imperative. It seems like it's doing things, but it could plan things with the same syntax.
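
The plan-versus-do distinction can be sketched in a few lines of Go. This is only an illustration of the idea, not how Ferret works: `Plan`, `Step`, `Open`, `Click`, and `Run` are hypothetical names. The calls read like actions, but they only record steps, and nothing is executed until the plan is handed to an executor.

```go
package main

import "fmt"

// Step is one hypothetical command in a scraping query.
type Step struct {
	Op  string
	Arg string
}

// Plan records steps instead of performing them.
type Plan struct {
	Steps []Step
}

// Open looks like an action but only appends an "open" step to the plan.
func (p *Plan) Open(url string) *Plan {
	p.Steps = append(p.Steps, Step{Op: "open", Arg: url})
	return p
}

// Click likewise records a "click" step without touching a browser.
func (p *Plan) Click(selector string) *Plan {
	p.Steps = append(p.Steps, Step{Op: "click", Arg: selector})
	return p
}

// Run hands each recorded step, in order, to the supplied executor.
func (p *Plan) Run(exec func(Step)) {
	for _, s := range p.Steps {
		exec(s)
	}
}

func main() {
	plan := new(Plan).Open("https://example.com").Click("#search")
	// At this point the plan is plain data that could be inspected or reordered.
	fmt.Println(len(plan.Steps), "steps planned")
	plan.Run(func(s Step) { fmt.Println("executing", s.Op, s.Arg) })
}
```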

ziflex · on Oct 3, 2018

Oh yes. The reason for this is that, for now, the language itself is DOM-agnostic; it's just a port of an existing one (https://docs.arangodb.com/3.4/AQL/). So the entire DOM layer is implemented by the standard library, which is pluggable. In the future, I might extend the language to make it less DOM-agnostic by introducing new keywords for dealing with that. But for now you have to pass the document object around, which is not that bad, because you can open as many pages as you want in a single query.

ziflex · on Oct 2, 2018

PRs are welcome :) There is gonna be a separate project within the organization that will do all these things and even more. It's just the beginning :)

ziflex · on Oct 2, 2018

Yes! This is one of the reasons why I wanted to be able to make these changes without redeploying the whole thing!

ziflex · on Oct 2, 2018

It can! :)

Even more, it can interact with these pages! Here is an example that uses the Google Search page: https://github.com/MontFerret/ferret/blob/master/docs/exampl...

the_other_guy · on Oct 3, 2018

Wow, thank you. Now I am absolutely in love with your tool!

ziflex · on Oct 3, 2018

Great! ^_^

ziflex · on Oct 2, 2018

Thank you very much for your valuable feedback. I'm glad that someone has finally got the idea :)

ziflex · on Oct 2, 2018

This is how it works under the hood. But everything is wired for you ;)

ziflex · on Oct 2, 2018

You definitely need to share. Web scraping is tedious. The more ideas we have, the more options we have for coming up with a better solution.