
We hide and abstract things in programming all the time.

For example, the proper (RFC-compliant) regex for an email is very complicated and often implemented the wrong way.

As somebody who has written a lot of regexes, I'd rather use this library, which has the correct abstraction, than google the regex for http-urls for the 512th time.



This library doesn't help you here though. It doesn't bring you any abstractions on top of regexes (e.g. parsing URLs/emails). This is merely a different syntax for matching parts of a string.


I hope no one is seriously suggesting that regular expressions should be used in programs, such as email address verification and http parsing. As well as being incredibly slow, they are hard to read, and inferior to application-specific parsers (for example, sometimes non-RFC-compliant emails are actually valid, and dealing with whitespace in html is a nightmare).

For the interactive case, the ability for them to be written quickly is what makes them so helpful, and an abstraction library could take away this advantage.


I'm not exactly disagreeing but just curious:

1. How can a non-RFC-compliant email be valid?

2. What about compiled regexes, performance-wise?

3. Sometimes a regex is faster than the overhead of a parser, so wouldn't the choice be dependent on context? In other words, regexes are not always slower, true?

4. Wouldn't some abstraction libraries utilize regexes under the hood? Would that be wrong in your view?

P.S. Some languages allow the option for very readable regexes, e.g. separate each component on its own line, with a comment.
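
To illustrate points 2 and the P.S. together, here is a minimal Python sketch of a precompiled regex written one component per line with comments. The pattern is deliberately loose and not RFC 5322-compliant, and the names SIMPLE_EMAIL and split_address are made up for this example; it only shows the layout, not a recommended validator.

```python
import re

# Compiled once at import time and reused on every call.
# Deliberately simplified: this is NOT an RFC 5322 pattern.
SIMPLE_EMAIL = re.compile(
    r"""
    (?P<local>[^@\s]+)    # local part: anything except "@" and whitespace
    @                     # the separator
    (?P<domain>[^@\s]+)   # domain: same loose rule
    """,
    re.VERBOSE,  # ignore pattern whitespace and allow the comments above
)


def split_address(text):
    """Return (local, domain) if text loosely looks like an address, else None."""
    match = SIMPLE_EMAIL.fullmatch(text)
    if match is None:
        return None
    return match.group("local"), match.group("domain")
```

Because the character classes only exclude "@" and whitespace, non-ASCII addresses pass through this check as well.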


> How can a non-RFC-compliant email be valid?

甲斐@黒川.日本 is a non-RFC 5322-compliant, but still valid, email address.

Unless you're implying that "valid" === RFC 5322-compliant, in which case the example isn't valid ;)

The best way to validate an email address: send an email to that email address containing a confirmation link. Simple, easy.
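
A rough sketch of the confirmation-link idea in Python, under stated assumptions: the in-memory dict, the example.com URL, and the function names are all placeholders, and actually sending the email is left out.

```python
import secrets

# Hypothetical in-memory store; a real service would persist and expire tokens.
pending = {}


def start_confirmation(address):
    """Record a one-time token for address and return the link to email out."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    # example.com and the /confirm path are placeholders, not a real endpoint.
    return f"https://example.com/confirm?token={token}"


def confirm(token):
    """Return the confirmed address, or None if the token is unknown or already used."""
    return pending.pop(token, None)
```

The address is never pattern-matched at all: it counts as valid only once someone follows the link that was sent to it.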


Ah, understood. I was thinking "valid == well-formed" without knowing whether it really works (i.e., could be deleted), whereas I see you rightfully point out that it more reasonably means "it works." Thank you, makes sense.


Simply put, some sites don't enforce the full set of RFC rules; as such, people actually have non-RFC-compliant email addresses that are valid.

How can you 'compile' a regular expression?

Very simple regular expressions might be decently fast, but as soon as you start pulling out the more complicated regular expressions needed for parsing, you get slower. Even simple repeats can have a lot of overhead if not used correctly; have a look at "Looking Inside The Regex Engine" at http://www.regular-expressions.info/repeat.html. An equivalent parser doesn't need to do any form of backtracking, and doesn't care about the structure. For example, I've seen an application use regular expressions for HTML parsing. After spending a while figuring out what they actually did, I found the source HTML had changed its whitespace, but not the DOM structure, which broke the regular expressions.
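
The whitespace breakage can be reproduced with a small Python sketch. The two HTML snippets, the BRITTLE pattern, and the LinkCollector class are invented for illustration and are not the application mentioned above; the parser comes from the standard library's html.parser module.

```python
import re
from html.parser import HTMLParser

# Hypothetical inputs: the same DOM structure with different whitespace.
COMPACT = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
REFLOWED = (
    '<ul>\n'
    '  <li>\n'
    '    <a\n'
    '      href="/a">A</a>\n'
    '  </li>\n'
    '  <li><a  href="/b">B</a></li>\n'
    '</ul>'
)

# A whitespace-sensitive regex of the kind that breaks when markup is reflowed.
BRITTLE = re.compile(r'<li><a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however the tag is laid out."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def links_via_regex(html):
    return BRITTLE.findall(html)


def links_via_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

On COMPACT both functions return the two links; on REFLOWED the regex finds nothing while the parser still returns both.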

As for my reasoning above, I think a lot of 'abstraction' libraries would be faster by operating directly on the data, instead of just converting it to regular expressions. The beauty of regular expressions is the speed at which they can be written.


> I hope no one is seriously suggesting that regular expressions should be used in programs, such as email address verification and http parsing.

I think the comma changes the meaning of your sentence from what you intended. It was refreshing, though!

To be clear, with the comma, this reads essentially as "Don't use regexes in programs. Some examples of programs are ...".



