The premise of virtual tables (Postgres FDW) is to not store the data but instead query it from the original source. So, the primary performance challenge is API throttling. Steampipe optimizes these API calls with smart caching, massive parallelization through goroutines and calling the minimum set of APIs to hydrate the exact columns requested. For example, "select instance_id from aws_ec2_instance" will do 2 API calls get 100 instances in 2 pages, while "select instance_id, tags from aws_ec2_instance" would do 2 calls (instance paging) + 100 tag API calls (one per instance). We've also implemented support for qualifiers (i.e. where clauses) so API calls can be reduced even further - e.g. get 1 EC2 instance without pagination etc.
The Postgres planner is not really optimized for foreign tables, but we can give it hints to indicate optimal paths. We've gradually ironed out many cases here in our FDW implementation particularly for joins etc.
Awesome, thanks so much for that! I guess then it sounds like, even if bandwidth were not an issue for a datasource, there would still be the physical limitations of the machine on which steampipe would do any later processing (such as joins between datasources). In other words, there's no sense in which steampipe distributes this work across multiple processes or machine. Is that correct?
Sorry if a dumb question. I could be thinking entirely in terms of the wrong paradigm here since my work in this space was primarily concerned with distributed computing and big data.
Steampipe computes the SQL part in a single Postgres instance (not distributed) - we've not found this to be a significant limit (so far). A difference in this case is that the query is really combining and organizing results that may been processed at the source. In 2022 most services are API first and offer great searching and filtering, so we can use that to do much of the query processing in many cases.
The Postgres planner is not really optimized for foreign tables, but we can give it hints to indicate optimal paths. We've gradually ironed out many cases here in our FDW implementation particularly for joins etc.
If you can tolerate my Aussie accent, I explain many of the details in this Yow Data talk - https://www.youtube.com/watch?v=2BNzIU5SFaw