regex Remove all non "word characters" from a String in Java, leaving accented characters?


String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");

I changed \p{N} to \p{Nd} because the former also matches other numeric characters, such as fractions and Roman numeral symbols (Unicode categories No and Nl); the latter matches only decimal digits. See it on regex101.com.

Use [^\p{L}\p{Nd}]+ - this matches all (Unicode) characters that are neither letters nor (decimal) digits.
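To make the difference between the two classes concrete, here is a minimal sketch. The class name `WordCharFilter` and the sample string are assumptions made up for illustration; non-ASCII characters are written as Unicode escapes so the file encoding does not matter.

```java
import java.util.regex.Pattern;

final class WordCharFilter {
    // Matches runs of anything that is neither a Unicode letter nor a decimal digit.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{Nd}]+");

    static String stripNonWord(String input) {
        return NON_WORD.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Accented letters, punctuation, digits and the fraction one-half (U+00BD).
        String sample = "Cr\u00e8me br\u00fbl\u00e9e: 42 (\u00bd)!";

        // With \p{Nd}: accented letters and the digits 4 and 2 survive, U+00BD is removed.
        System.out.println(stripNonWord(sample));

        // With \p{N}: U+00BD (category No) survives as well.
        System.out.println(sample.replaceAll("[^\\p{L}\\p{N}]+", ""));
    }
}
```

Precompiling the pattern is only a convenience when the method is called often; `String.replaceAll` with the same regex behaves identically.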

Works like a charm! But it does not replace '', '', ''. Since I only have this one '', I used .replaceAll("[^\\p{L}\\p{N}]|", ""); any suggestion on how I can remove the others?

works like a charm! thanks!


import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {
    private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
    private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
            "sz" };

    /**
     * Normalizes a String by removing all accents, reducing it to the 128
     * US-ASCII characters. This method handles German umlauts and "sharp-s"
     * correctly.
     * 
     * @param s
     *            The String to normalize
     * @return The normalized String
     */
    public static String normalize(String s) {
        if (s == null)
            return null;

        String n = null;

        n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
        n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

        return n;
    }

    /**
     * Returns a clean representation of a String which might be used safely
     * within an URL. Slugs are a more human friendly form of URL encoding a
     * String.
     * <p>
     * The method first normalizes a String, then converts it to lowercase and
     * removes ASCII characters, which might be problematic in URLs:
     * <ul>
     * <li>all whitespaces
     * <li>dots ('.')
     * <li>(semi-)colons (';' and ':')
     * <li>equals ('=')
     * <li>ampersands ('&')
     * <li>slashes ('/')
     * <li>angle brackets ('<' and '>')
     * </ul>
     * 
     * @param s
     *            The String to slugify
     * @return The slugified String
     * @see #normalize(String)
     */
    public static String slugify(String s) {

        if (s == null)
            return null;

        String n = normalize(s);
        n = StringUtils.lowerCase(n);
        n = n.replaceAll("[\\s.:;&=<>/]", "");

        return n;
    }
}

At times you do not want to remove the characters entirely, but only strip their accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL:

Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.
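As a rough sketch of how such a list could be extended, the variant below keeps the replacements in a map and needs no Commons Lang dependency. The class name `Transliterator` and the two extra entries for U+00E6 and U+00F8 are assumptions chosen for illustration, not part of the original class; the final step is the same "decompose, then drop non-ASCII" idea.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

final class Transliterator {
    // Characters that need a multi-letter replacement before accents are stripped.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        // German umlauts and sharp s, as in the original list.
        REPLACEMENTS.put("\u00c4", "Ae");
        REPLACEMENTS.put("\u00e4", "ae");
        REPLACEMENTS.put("\u00d6", "Oe");
        REPLACEMENTS.put("\u00f6", "oe");
        REPLACEMENTS.put("\u00dc", "Ue");
        REPLACEMENTS.put("\u00fc", "ue");
        REPLACEMENTS.put("\u00df", "sz");
        // Hypothetical extensions: these letters do not decompose under NFD,
        // so without an entry they would simply be dropped.
        REPLACEMENTS.put("\u00e6", "ae");
        REPLACEMENTS.put("\u00f8", "oe");
    }

    static String normalize(String s) {
        if (s == null) {
            return null;
        }
        String n = s;
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            n = n.replace(e.getKey(), e.getValue());
        }
        // Decompose accented letters, then drop everything outside ASCII.
        return Normalizer.normalize(n, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }
}
```

For example, `Transliterator.normalize("Stra\u00dfe \u00fcber caf\u00e9")` yields `Strasze ueber cafe`.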

EDIT: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.
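One reason for the warning is that slugify() strips '&', '<' and '>' but leaves single and double quotes untouched. A minimal hand-rolled escaper might look like the sketch below; the class name `HtmlEscaper` is hypothetical, and in a real project an established encoding library is probably the safer choice.

```java
final class HtmlEscaper {
    // Replaces the five characters that are significant in HTML text and attributes.
    static String escapeHtml(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

The escaper would be applied at the point where the slug is written into markup, for instance `"<a href=\"/posts/" + HtmlEscaper.escapeHtml(slug) + "\">"`.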

Important info on this: you can get the StringUtils class (part of Apache Commons Lang) at commons.apache.org/lang/download_lang.cgi


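// requires java.util.regex.Matcher and java.util.regex.Pattern
String s = "blah";
Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a Unicode block
Matcher m = p.matcher(s);
System.out.println(m.find()); // true if s contains a character from the block
System.out.println(s.replaceAll(p.pattern(), "#")); // replace each run of block characters with '#'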

I was trying to achieve the exact opposite when I stumbled upon this thread. I know it's quite old, but here's my solution nonetheless. You can match on Unicode blocks. In this case, compile the following code (with the right imports):

You should see the following output:
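For readers who want to try the block-based approach end to end, here is a self-contained sketch. The class name `BlockDemo`, the helper `mask`, and the sample input (which contains U+00E4 and U+00E9, both in the Latin-1 Supplement block) are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockDemo {
    // Matches runs of characters from the Latin-1 Supplement block (U+0080 to U+00FF).
    static final Pattern LATIN1_SUPPLEMENT =
            Pattern.compile("[\\p{InLatin-1Supplement}]+");

    // Replaces every run of Latin-1 Supplement characters with a single '#'.
    static String mask(String s) {
        return LATIN1_SUPPLEMENT.matcher(s).replaceAll("#");
    }

    public static void main(String[] args) {
        String s = "bl\u00e4h caf\u00e9";
        Matcher m = LATIN1_SUPPLEMENT.matcher(s);
        System.out.println(m.find()); // prints true: the input contains U+00E4
        System.out.println(mask(s));  // prints bl#h caf#
    }
}
```

Note that a block is a code point range, not a category: Latin-1 Supplement also contains symbols such as the degree sign, so this matches more than accented letters.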


Class java.text.Normalizer is not supported before Android API level 9, so if your app must be compatible with API level 8 (13% of all devices, according to Google's Android dashboard), this method is not viable.

You might want to remove the accents and diacritical marks first, then check at each character position whether the "simplified" string has an ASCII letter there: if it does, the original character at that position is a word character and is kept; if not, it can be removed.
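A possible reading of this idea is sketched below. It simplifies one character at a time so that positions in the original and the simplified text stay aligned. The class name `PositionalFilter` is hypothetical, and the sketch relies on java.text.Normalizer, so it assumes a platform where that class is available.

```java
import java.text.Normalizer;

final class PositionalFilter {
    // Keeps a character only if its accent-free form is a single ASCII letter.
    static String keepWordChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String original = new String(Character.toChars(cp));
            // "Simplify" this one character: decompose it and drop combining marks.
            String simplified = Normalizer.normalize(original, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}+", "");
            if (simplified.length() == 1 && isAsciiLetter(simplified.charAt(0))) {
                out.append(original); // keep the original, accented form
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

With this sketch, `keepWordChars("d\u00e9j\u00e0 vu!")` keeps the accented letters and drops the space and the exclamation mark. Letters that have no decomposition (for example U+00DF) are dropped, as are digits, and input that is already in decomposed form loses its combining marks, so the check would need extending for those cases.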
