At what point do we drop the term "regular" expressions altogether for stuff like this? This is going to sound pedantic since I know that most popularly-used regex implementations are themselves non-regular, but I feel like we're just piling more and more stuff on top of good-old-regexes and trying to turn the concept into a catch-all for anything that does pattern matching on text.
I guess it just feels icky that "regular expressions" has inherent meaning (i.e. can be represented entirely by a finite automaton) which has become completely diluted at this point.
That rant aside, cool paper. The idea of bridging formal language theory with modern computational tooling feels timely. I think I would've liked to see more exploration of oracle-based costs, for instance:
* What happens when oracle outputs are inconsistent/uncertain?
* What happens as oracle interactions become more computationally expensive?
Your rant is a bit unfounded for this paper as they do actually take completely standard regular expressions (no backtracking or anything like that) and extend it with one more construct. Calling it «semantic regular expressions» seems perfectly reasonable to me. What else to call it?
As for outside of the computer science sphere (which I find is quite consistent in their terminology): I do agree that it seems like it’s a lost cause and «regex» is now synonymous for «pattern matching using this one specific syntax» :(
Regular has a meaning, and it isn't this. It's had this meaning since the 1950s. OTOH, these expressions do not generate a regular language. That's a good reason to me.
They call it "semantic regular expression" because it apparently already is a lost cause. "Regular expressions with TMs embedded" doesn't quite have the same ring to it. Nobody would see it as a regexp.
> OTOH, these expressions do not generate a regular language.
Okay, sure, you're technically correct here, but only because these expressions generate a subset of a regular language. The LLM can only be invoked on a substring that can be expressed as a regular expression, and then it's only used to remove strings from the language. Their results are based heavily on how regular expressions work. A "semantic context-free grammar" would have different characteristics and behavior.
Maybe throwing in the word "extended" or "augmented" would be a bit clearer, but as a reader I definitely would expect "regular expression" to be part of the name.
Removing strings from the language is what makes it non-regular. E.g., a regular language cannot be exactly a^n b^n (that is: a string is only accepted when it has an equal number of a's and b's), but it sure as hell can contain a^m b^n. Removing the strings where m != n is what makes the language context-free.
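A quick way to see the distinction (toy Python, using `re` purely for illustration):

```python
import re

def in_am_bn(s):
    # regular: any number of a's followed by any number of b's
    return re.fullmatch(r"a*b*", s) is not None

def in_an_bn(s):
    # non-regular: additionally requires equal counts, which needs counting ("memory")
    m = re.fullmatch(r"(a*)(b*)", s)
    return m is not None and len(m.group(1)) == len(m.group(2))
```

The second check piggybacks on the regular match but then filters it with a counting condition, which is exactly the "remove strings afterwards" step that pushes the language out of the regular class.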
To be fair, I could see this basically allowing a form of "back reference" lookup that lets you offload the back reference to other parts. For example, `/(\w+); \1 = (\w+)/.exec("foo; foo = anything")`, but instead of doing a back reference, you could have an oracle lookup.
I haven't looked at the examples in this paper, yet. But I'm having fun imagining ways this could be used.
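Sketching what I have in mind (toy Python; `semantic_match` and the case-insensitive oracle are made up for illustration, not the paper's actual construct):

```python
import re

def semantic_match(pattern, text, oracle):
    # match syntactically first, then let an oracle accept or reject the captures
    m = re.search(pattern, text)
    return m if m and oracle(*m.groups()) else None

# stand-in oracle: case-insensitive equality instead of a real LLM judgment
same_name = lambda a, b: a.lower() == b.lower()

m = semantic_match(r"(\w+); (\w+) = \w+", "foo; FOO = anything", same_name)
```

A plain back reference would reject `"foo; FOO = anything"`, but the oracle version can accept it, which is the "offload the comparison" idea.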
Nested querying is not standard for regular grammars, amongst other aspects introduced in this paper that implicitly require things like memory (again, not standard for regular grammars).
Building off our last research post, we wanted to figure out ways to quantify "ambiguity" and "uncertainty" in prompts/responses to LLMs. We ended up discovering two useful forms of uncertainty: "Structural" and "Conceptual" uncertainty.
In a nutshell: Conceptual uncertainty is when the model isn't sure what to say, and Structural uncertainty is when the model isn't sure how to say it.
You can play around with this yourself in the demo!
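For the curious, a toy illustration of the split (this assumes samples are already labeled by meaning; in practice the grouping would come from an embedding/clustering step, and this is not the demo's actual code):

```python
from collections import Counter
from math import log2

def entropy(counts):
    # Shannon entropy over a Counter of outcomes, in bits
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

# hypothetical samples: (meaning_label, surface phrasing)
samples = [("yes", "Yes."), ("yes", "Yep, that's right."), ("no", "No.")]

# conceptual uncertainty: spread over WHAT the model says
conceptual = entropy(Counter(m for m, _ in samples))
# structural uncertainty: spread over HOW one meaning gets phrased
structural = entropy(Counter(p for m, p in samples if m == "yes"))
```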
I think the author of this post probably meant to caveat that what we call "prompt engineering" TODAY might tend towards snake oil, but prompt engineering _doesn't have to_ be snake oil, and it _doesn't have to_ promote black box mentalities[1]. What's more, fine-tuning is certainly not a panacea - it's not particularly great at injecting net-new context into these foundation models. It's great when you want to "close the aperture" a bit in model outputs. Even suggesting that fine-tuning is somehow a replacement for crafting prompts is just incorrect.
Interestingly, we initially thought that prompt length would play a big factor in the performance of this approach. In practice, though, we discovered that it's actually not as big a factor as we predicted. For instance, Prompt #3 was 410 tokens long, while Prompt #5 was only 88 tokens. The estimation for Prompt #3 aligned fairly well with the IG approach (0.746 cosine similarity, 0.643 Pearson correlation), while the estimation for Prompt #5 seemed to underperform (0.55 cosine similarity, 0.295 Pearson correlation). Meanwhile, Prompt #2 was 57 tokens long and performed quite well (0.852 cosine similarity, 0.789 Pearson correlation).
Re: our definitions of average/long/short prompts -- we weren't really rigorous with those definitions. In general, we considered anything under 100 tokens "short", 100-300 average, and 300+ long.
Our intuition here is that the relationship between performance of the estimation and the prompt structure is less about length, and more about "ambiguity". Again, we don't really have a rigorous definition of that yet, but it's something we are working on. If you take a look at the prompts in the analysis notebook you might get a sense of what I mean: prompts 1-3 are pretty straightforward and mechanical. Prompts 4 & 5 are a bit more open to interpretation. We see performance of the estimation degrade as prompts become more and more open to interpretation.
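For reference, the two agreement metrics quoted above are standard and can be computed like so (plain-Python sketch over two attribution vectors):

```python
import math

def cosine(u, v):
    # cosine similarity: angle between the two vectors, ignoring magnitude
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    # Pearson correlation: cosine of the mean-centered vectors
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
```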
Oh, it’s definitely ambiguity. Any given token’s attention is going to have its weight vary based on its context, and less-ambiguous terms are more likely to be used “near” the other terms that matter. For example, if you tell GPT not to ‘omit’ code from a code sample, it has to disambiguate the meaning of omit. Tell it not to ‘elide’ any code, and it performs a lot better.
“Prompt engineering” is far more linguistic than people seem to realize. It’s not just “say what you mean” when the model has an easier time when you “say what you mean in the most linguistically precise way possible”. Simplified, but workable: it’s a matter of finding less ambiguous/more context-specific tokens/words with a better tf/idf in the pre-training corpus without getting too esoteric.
Another example: storytelling prompts that include "I dislike open-ended conclusions and other rhetorical hooks" often result in fewer (or no) closing statements like, "as night fell, they wondered about their future."
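The tf-idf intuition in a nutshell (toy sketch; you obviously can't run this over the actual pre-training corpus, so treat it as illustrative only):

```python
from math import log

def tf_idf(term, doc, corpus):
    # higher score = term is frequent in this context but rare across the corpus,
    # i.e. more context-specific / less ambiguous in this setting
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * log(len(corpus) / df) if df else 0.0
```

By this measure, a term like 'elide' that appears almost exclusively in code-editing contexts scores higher there than the everyday word 'omit'.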
Great question - there are currently (likely) tons of limitations to this approach as-is. We're planning on testing this on more capable models (e.g., integrated gradients on Llama2) to see how the relationship might change, but here are some initial thoughts:
1. The perturbation method could be improved to more directly capture long-range dependency information across tokens
2. The scoring method could _definitely_ be improved to capture more nuance across perturbations.
I think what we've found is that there does seem to be a relationship between the embedding space and attributions of LLMs, so the next step would be to figure out how to capture more nuance out of that relationship. This sort of side-steps the question you asked, because honestly we'd need to test a lot more to figure out the specific cases where an approach like this falls short.
Anecdotally - we've seen the greatest deviation between the estimation & integrated gradients as prompt "ambiguity" increases. We're thinking about ways to quantify & measure that ambiguity but that's its own can of worms.
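Roughly, the leave-one-out flavor of perturbation looks like this (a toy character-frequency "embedding" stands in for a real model here; this is a hypothetical sketch, not our actual pipeline):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def embed(tokens):
    # stand-in embedding: lowercase-letter frequency vector; a real setup
    # would use a sentence-embedding model
    text = " ".join(tokens)
    return [text.count(chr(c)) for c in range(97, 123)]

def perturbation_attribution(tokens):
    # drop each token and measure how far the prompt's embedding moves;
    # a bigger shift suggests a more salient token
    base = embed(tokens)
    return [1 - cosine(base, embed(tokens[:i] + tokens[i + 1:]))
            for i in range(len(tokens))]
```

The limitation in point 1 shows up here directly: deleting one token at a time can't capture interactions between distant tokens.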
Back-of-the-napkin - the key should never be stored anywhere in the first place. In the absence of keyring/keychain/etc., it'd be trivial to introduce a master-password scheme in the browser client where secret credentials are XOR'd with a key derived from the master password and stored in that masked form.
Obviously not a 'secure' system by any stretch of the imagination but it's an order of magnitude better than storing in plaintext.
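Something like this (Python sketch; XOR against a PBKDF2-derived keystream, which is obfuscation rather than real encryption - an AEAD cipher is the proper tool):

```python
import hashlib

def mask(secret: bytes, master_password: str, salt: bytes) -> bytes:
    # derive a keystream from the master password, XOR it over the secret;
    # applying mask() twice with the same inputs recovers the original
    key = hashlib.pbkdf2_hmac("sha256", master_password.encode(), salt,
                              100_000, dklen=len(secret))
    return bytes(a ^ b for a, b in zip(secret, key))
```

An attacker who grabs the stored blob without the master password gets nothing usable, which is the bar being argued for here: better than plaintext, not better than real crypto.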
As addressed in the post - there are no mitigating factors in the scenario of accidental exposure. The lowest hanging fruit would be a dumb hashing function which uses some master password.
If you've been hit with an OS compromise you're pretty much SOL, but it shouldn't be so easy to grab highly sensitive data from accidentally exposed profiles.
Yes! There are tons of accidentally-uploaded profiles on github, for instance. Search for the readme string and you'll see a number of very dangerous commits.