JavaScript: why is this regex not working for German words?


However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.

I don't believe that this workaround is available in Javascript, though. You can also use Unicode properties like those in Perl and PCRE, and in Ruby 1.9, but not in Python.

In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].

Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.
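For what it's worth, engines that implement ECMAScript 2018 or later do accept \p{...} escapes when the regex carries the u flag, so on those engines the same emulation can be sketched in JavaScript. This is only a minimal sketch under that assumption; UNICODE_WORD and unicodeWords are made-up names, and the sample string is invented for illustration.

```javascript
// Sketch only: assumes an engine with ES2018 Unicode property escapes.
// The `u` flag is required for \p{...} to be recognized.
// UNICODE_WORD and unicodeWords are hypothetical names, not part of any API.
const UNICODE_WORD = /[\p{L}\p{Nd}_]+/gu;

function unicodeWords(text) {
  // match() returns null when nothing matches, so fall back to [].
  return text.match(UNICODE_WORD) || [];
}

console.log(unicodeWords("Äpfel für Müller, Straße 7"));
// [ 'Äpfel', 'für', 'Müller', 'Straße', '7' ]
```

Spelling out \p{L}, \p{Nd} and the underscore keeps it visible that digits and "_" are part of the class, which is the point made above.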

It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the support for regex features in various languages.

Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.

Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.
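To make the gotcha concrete, here is a small sketch of how \b and \d behave in JavaScript. The sample strings are invented for illustration, and the variable names are arbitrary.

```javascript
// Sketch: JavaScript's \b and \d only know about ASCII.

// "ü" is not a \w character, so \b sees a word boundary between "ü" and "b".
// A "whole word" search for "ber" therefore succeeds inside "über".
const boundaryInsideWord = /\bber\b/.test("über");

// \d matches only 0-9; other decimal digits such as
// ARABIC-INDIC DIGIT THREE (U+0663) are rejected.
const asciiDigit = /^\d$/.test("3");
const arabicIndicDigit = /^\d$/.test("\u0663");

console.log(boundaryInsideWord, asciiDigit, arabicIndicDigit); // true true false
```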

Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two Jthingies do not.

SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.

The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.

The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.

The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970sish; it makes absolutely no sense in the 21st century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.

This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.


$this.text().replace(/(\S+)/g, "<span>$1</span>")

Note: Unlike \w+, \S+ will also match periods, commas, etc. at the end of words. So if you parsed this comment with this regex, the first match will be "Note:" not "Note". You'll need to tweak your regex or perform additional checks if this is not what you want.

You could use something like \S+ to match all non-space characters, including non-ASCII characters. This might or might not work depending on how the rest of your string is formatted.
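As a rough sketch of that idea, assuming plain text where words are separated by whitespace (wrapWords is a made-up helper name, and the input string is invented):

```javascript
// Sketch: wrap every run of non-whitespace characters in a <span>.
// wrapWords is a hypothetical helper name used only for this example.
function wrapWords(text) {
  // "$&" in the replacement string stands for the whole match.
  return text.replace(/\S+/g, "<span>$&</span>");
}

console.log(wrapWords("Äpfel für Müller."));
// <span>Äpfel</span> <span>für</span> <span>Müller.</span>
```

The trailing period ends up inside the last span, because \S+ takes any character that is not whitespace.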

\w only matches A-Z, a-z, 0-9, and _ (underscore).
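A quick way to see this, using an invented sample string:

```javascript
// Sketch: \w+ stops at every non-ASCII letter, so German words get split.
const sample = "Äpfel_1 für Müller";

// Only A-Z, a-z, 0-9 and "_" count as word characters here.
const asciiWords = sample.match(/\w+/g);

console.log(asciiWords); // [ 'pfel_1', 'f', 'r', 'M', 'ller' ]
```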