
  • "Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
  • UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. (These are relatively rarely used - which is a good job, as very few developers get them right, I suspect. I very much doubt that I do.)
  • UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
  • UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
  • UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
  • ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
  • ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.
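
A minimal hedged Java sketch of how the encodings listed above differ at the byte level (the sample strings are illustrative); the same text produces different byte counts, and a character outside the BMP costs two UTF-16 code units:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String text = "caf\u00E9";   // "café": 4 code points
        System.out.println(text.getBytes(StandardCharsets.US_ASCII).length); // 4, but 'é' is replaced with '?'
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 5: 'é' takes 2 bytes
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 8: 2 bytes per code unit, no BOM
        // UTF-32 is present on common OpenJDK/Oracle JDKs, though not guaranteed by the Charset spec:
        System.out.println(text.getBytes(Charset.forName("UTF-32")).length); // 16: 4 bytes per code point

        String nonBmp = "\uD83D\uDE00";   // one emoji outside the BMP
        System.out.println(nonBmp.length());                            // 2 UTF-16 code units (a surrogate pair)
        System.out.println(nonBmp.codePointCount(0, nonBmp.length()));  // 1 code point
    }
}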

The other big resource, of course, is unicode.org which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.

The term "ANSI" when applied to Microsoft's 8-bit code pages is a misnomer. They were based on drafts submitted for ANSI standardization, but ANSI itself never standardized them. Windows-1252 (the code page most commonly referred to as "ANSI") is similar to ISO 8859-1 (Latin-1), except that Windows-1252 has printable characters in the range 0x80..0x9F, where ISO 8859-1 has control characters in that range. Unicode also has control characters in that range. en.wikipedia.org/wiki/Windows_code_page

@jp2code: I wouldn't - but you need to distinguish between "content that is sent back via HTTP from the web server" and "content that is sent via email". It's not the web page content that sends the email - it's the app behind it, presumably. The web content would be best in UTF-8; the mail content could be in UTF-7, although I suspect that it's fine to keep that in UTF-8 these days.

For UTF-16, IMHO, I would say "2 bytes per code unit" since a code point outside the BMP will be encoded in surrogate pairs as 2 code units (4 bytes).

Misses the differences between UTF-16LE (within .NET) and BE as well as the notion of the BOM.

character encoding - Unicode, UTF, ASCII, ANSI format differences - St...

unicode character-encoding ascii ansi utf

Unicode defines a list of characters (letters, numbers, analphabetic symbols, control codes and others) but their representation (in bytes) is defined by an encoding. The most common Unicode encodings nowadays are UTF-8, UTF-16 and UTF-32. UTF-16 is what is usually associated with Unicode because it's what has been chosen for Unicode support in Windows, Java, the .NET environment, and the C and C++ languages (on Windows). Be aware it's not the only one: during your life you'll for sure also meet UTF-8 text (especially from the web and on Linux file systems) and UTF-32 (outside the Windows world). Two very introductory must-read articles: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and UTF-8 Everywhere - Manifesto. IMO especially the second link (regardless of your opinion on UTF-8 vs UTF-16) is pretty enlightening.

Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)

To see where the issue is, just start with some simple math: Unicode defines around 110K code points (note that not all of them are graphemes). The "Unicode character type" in C, C++, C#, VB.NET, Java and many other languages in the Windows environment (with the notable exception of VBScript on old ASP classic pages) is UTF-16 encoded, so it's two bytes wide (the type name here is intuitive but completely misleading, because it holds a code unit, not a character nor a code point).

Please check this distinction because it's fundamental: a code unit is logically different from a Character and, even if sometimes they coincide, they're not the same thing. How does this affect your programming life? Imagine you have this C# code and your specification (written by someone who thinks about the true definition of Character) says "password length must be 4 characters":

bool IsValidPassword(string text) {
    return text.Length >= 4;
}

That code is ugly, wrong and broken. The Length property returns the number of code units in the text string variable, and now you know they're different things. Your code will validate a two character password as valid when those characters are outside the BMP, because each of them is encoded as two code units, so Length returns 4. Now try to imagine this applied to all layers of your application: a UTF-8 encoded database field naively validated with the previous code (where the input is UTF-16), errors will add up and your Polish friend Witosaw Komicki won't be happy about this. Now imagine you have to validate a user's first name with the same technique and your users are Chinese (but don't worry, if you don't care then they will be your users for a very short time). Another example: this naive C# algorithm to count distinct Characters in a string will fail for the same reason:

myString.Distinct().Count()

If a user enters the Han character 𠀑 (U+20011) then your code will wrongly return... 2, because its UTF-16 representation is 0xD840 0xDC11 (BTW each of them, alone, is not a valid Unicode character because they're a high and a low surrogate, respectively). The reasons are explained in greater detail in this post, where a working solution is also provided, so I just repeat the essential code here:

StringInfo.GetTextElementEnumerator(text)
    .AsEnumerable<string>()
    .Distinct()
    .Count();
(The AsEnumerable<string>() call is needed because GetTextElementEnumerator() returns an IEnumerator rather than an IEnumerable; in Java the rough equivalent for counting code points, though not text elements, is codePointCount().)
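
A hedged Java sketch of the same pitfall and fix (the string literal below is the surrogate pair 0xD840 0xDC11 mentioned above; java.text.BreakIterator is used here as an assumed stand-in for .NET's text elements):

import java.text.BreakIterator;

public class CountingExample {
    public static void main(String[] args) {
        String s = "\uD840\uDC11\uD840\uDC11";   // two copies of the Han character U+20011

        System.out.println(s.length());                        // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));   // 2 code points

        // Counting grapheme clusters (what a user would call "characters"):
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int graphemes = 0;
        while (it.next() != BreakIterator.DONE) {
            graphemes++;
        }
        System.out.println(graphemes);   // 2 here; may differ from code points for combining sequences
    }
}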

Is this something only related to string length? Of course not, if you handle keyboard input Char by Char you may need to fix your code. See for example this question about Korean characters handled in KeyUp event.

Unrelated but IMO helpful to understand: this C code (taken from this post) works on char (ASCII/ANSI or UTF-8) but it'll fail if converted straight to wchar_t, like this:

wchar_t* pValue = wcsrchr(wcschr(pExpression, L'|'), L':') + 1;

Note that C++11 introduced a new set of types with clearer names for handling encodings: char16_t and char32_t for UTF-16 and UTF-32 encoded code units (char8_t for UTF-8 arrived later, in C++20). Be aware that you also have std::u16string and std::u32string (and, since C++20, std::u8string). Note that even if length() (and its size() alias) will still return the count of code units, you can easily perform encoding conversions with the <codecvt> conversion facilities, and using these types IMO makes your code clearer and more explicit (it isn't astonishing that size() of a u16string returns the number of char16_t elements). For more details about character counting in C++ check this nice post. In C things are somewhat easier with char and UTF-8 encoding: this post IMO is a must-read.

Not all languages are similar; they don't even share some basic concepts. For example our current definition of grapheme can be pretty far from our concept of Character. Let me explain with an example: in the Korean Hangul alphabet, letters are combined into a single syllable block (and both letters and syllables are Characters, just represented in a different way when alone and when in a word with other letters). The word 국 (guk) is one syllable composed of three letters, ㄱ, ㅜ and ㄱ (the first and last letter are the same but they're pronounced with different sounds at the beginning or the end of a word, that's why they're transliterated g and k).

Syllables let us introduce another concept: precomposed and decomposed sequences. The Hangul syllable 한 (han) can be represented as a single character (U+D55C) or as a decomposed sequence of the letters ㅎ, ㅏ and ㄴ. If you're, for example, reading a text file you may have both (and users may enter both sequences in your input boxes) but they must compare equal. Note that if you type those letters sequentially they'll always be displayed as a single syllable (copy & paste the single characters - without spaces - and try) but the final form (precomposed or decomposed) depends on your IME.

In Czech "ch" is a digraph and it's treated as a single letter. It has it's own rule for collation (it's between H and I), with Czech sorting fyzika comes before chemie! If you count Characters and you tell your users that word Chechtal is composed by 8 Characters they'll think your software is bugged and your support for their language is merely limited to a bunch of translated resources. Let's add exceptions: in puchoblk (and few other words) C and H are not a digraph and they're separated. Note that there are also other cases like "d" in Slovak and others where it's counted as single character even if it uses two/three UTF-16 code points! Same happens also in many other languages too (for example ll in Catalan). True languages have more exceptions and special cases than PHP!

Note that appearance alone is not always enough for equivalence, for example: A (U+0041 LATIN CAPITAL LETTER A) is not equivalent to А (U+0410 CYRILLIC CAPITAL LETTER A). Conversely the characters ٢ (U+0662 ARABIC-INDIC DIGIT TWO) and ۲ (U+06F2 EXTENDED ARABIC-INDIC DIGIT TWO) are visually and conceptually equivalent, but they are different Unicode code points (see also the next paragraph about numbers and synonyms).

Symbols like ? and ! are sometimes used as characters (for example in the earliest orthography of the Haida language). In some languages (like the earliest written forms of Native American languages) numbers and other symbols have also been borrowed from the Latin alphabet and used as letters (mind this if you have to handle those languages and need to strip symbols from alphanumerics; Unicode can't distinguish this); one example is !Kung, a Khoisan language of Africa. In Catalan, when ll is not a digraph, a middot (U+00B7) is used to separate the characters, as in cel·les (in this case the character count is 6 and the code units/code points are 7, where the hypothetical non-existing word celles would result in 5 characters).

The same word may be written in more than one form. This may be something you have to care about if, for example, you provide a full-text search. For example the Chinese word 家 (house) can be transliterated as jiā in pinyin, and in Japanese the same word may also be written with the same kanji 家, or as いえ in hiragana (and others too), or transliterated in romaji as ie. Is this limited to words? No, also characters; for numbers it's pretty common: 2 (Arabic numeral in the Latin alphabet), ٢ (in Arabic and Persian) and 二 (Chinese and Japanese) are exactly the same cardinal number. Let's add some complexity: in Chinese it's also very common to write the same number in its formal form. I don't even mention prefixes (micro, nano, kilo and so on). See this post for a real world example of this issue. It's not limited to far-east languages only: the apostrophe ' (U+0027 APOSTROPHE), or better ’ (U+2019 RIGHT SINGLE QUOTATION MARK), is often used in Czech and Slovak instead of its superimposed counterpart ʼ (U+02BC MODIFIER LETTER APOSTROPHE): ď and d' are then equivalent (similar to what I said about the middot in Catalan).

Maybe you should properly handle lower case "ss" in German so that it compares equal to ß (and problems will arise for case insensitive comparison). A similar issue exists in Turkish if you have to provide non-exact string matching for i and its forms (see the section about Case).

If you're working with professional text you may also meet ligatures; even in English, for example, æsthetics is 9 code points but 10 characters! The same applies, for example, to the ethel character œ (U+0153 LATIN SMALL LIGATURE OE, absolutely necessary if you're working with French text): hors d'œuvre is equivalent to hors d'oeuvre (and likewise œthel and ethel). Both are (together with German ß) lexical ligatures, but you may also meet typographical ligatures (such as ﬀ, U+FB00 LATIN SMALL LIGATURE FF) and they have their own part of the Unicode character set (presentation forms). Nowadays diacritics are much more common even in English (see tchrist's post about people freed of the tyranny of the typewriter, and please read carefully Bringhurst's citation). Do you think you (and your users) won't ever type façade, naïve and prêt-à-porter, or "classy" coöperation?

Here I don't even mention word counting, because it'll open even more problems: in Korean each word is composed of syllables, but in, for example, Chinese and Japanese, Characters are counted as words (unless you want to implement word counting using a dictionary). Now take the Chinese sentence 是一个示例文本, roughly equivalent to the Japanese sentence これはサンプルのテキストです. How do you count them? Moreover, if they're transliterated to Shì yīgè shìlì wénběn and Kore wa, sanpuru no tekisutodesu, should they be matched in a text search?

Speaking about Japanese: full width Latin Characters are different from half width Characters, and if your input is Japanese romaji text you have to handle this, otherwise your users will be astonished when Ｔ won't compare equal to T (in this case what should be just glyphs became code points).

Unicode (primarily for ASCII compatibility and other historical reasons) has duplicated characters; before you do a comparison you have to perform normalization, otherwise à (a single code point) won't be equal to à (a plus U+0300 COMBINING GRAVE ACCENT). Is this an uncommon corner case? Not really; also take a look at this real world example from Jon Skeet. Also (see the section Culture Differences) precomposed and decomposed sequences introduce duplicates.
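
A hedged Java sketch of that normalization step using java.text.Normalizer; NFC makes the precomposed and decomposed forms compare equal, and NFKC additionally folds compatibility variants such as the full-width Latin letters mentioned above:

import java.text.Normalizer;

public class NormalizeBeforeCompare {
    public static void main(String[] args) {
        String precomposed = "\u00E0";   // 'à' as a single code point
        String decomposed  = "a\u0300";  // 'a' + COMBINING GRAVE ACCENT

        System.out.println(precomposed.equals(decomposed));   // false: different code point sequences
        System.out.println(
            Normalizer.normalize(precomposed, Normalizer.Form.NFC)
                      .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC)));   // true

        // Compatibility normalization folds full-width forms (see the Japanese example above):
        String fullWidthT = "\uFF34";    // FULLWIDTH LATIN CAPITAL LETTER T
        System.out.println(Normalizer.normalize(fullWidthT, Normalizer.Form.NFKC).equals("T"));   // true
    }
}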

Note that diacritics are not the only source of confusion. When the user is typing on his keyboard he'll probably enter ' (U+0027 APOSTROPHE), but it's supposed to also match ’ (U+2019 RIGHT SINGLE QUOTATION MARK), normally used in typography (the same is true for many, many Unicode symbols that are almost equivalent from the user's point of view but distinct in typography; imagine writing a text search inside digital books).

In short two strings must be considered equal (this is a very important concept!) if they are canonically equivalent and they are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed from different Unicode code points.

If you have to perform case insensitive comparison then you'll have even more problems. I assume you do not perform hobbyist case insensitive comparison using toupper() or an equivalent unless, one for all, you want to explain to your users why 'i'.ToUpper() != 'I' for the Turkish language (I is not the upper case of i, which is İ; BTW the lower case letter for I is ı).

Another problem is the eszett ß in German (a ligature for long s + short s, used - in ancient times - also in English, elevated to the dignity of a character). It has an upper case version ẞ, but (at this moment) the .NET Framework wrongly returns "ẞ" != "ß".ToUpper() (yet its use is mandatory in some scenarios, see also this post). Unfortunately ss doesn't always become ẞ (upper case), ss isn't always equal to ß (lower case), and sz is sometimes ẞ in upper case too. Confusing, right?
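
A hedged Java sketch of the case mapping pitfalls from the last two paragraphs (Java shown here; the .NET behaviour described above is analogous but not identical):

import java.util.Locale;

public class CaseMappingGotchas {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");

        System.out.println("i".toUpperCase(Locale.ROOT));   // "I"
        System.out.println("i".toUpperCase(turkish));       // "\u0130" (I with dot above)
        System.out.println("I".toLowerCase(turkish));       // "\u0131" (dotless i)

        System.out.println("\u00DF".toUpperCase(Locale.ROOT));   // "SS": the string grows by one char
        // So length-preserving, char-by-char upper casing is not a safe "case insensitive compare".
    }
}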

Globalization is not only about text: what about dates and calendars, number formatting and parsing, colors and layout. A book won't be enough to describe all things you should care about but what I would highlight here is that few localized strings won't make your application ready for an international market.

Even just about text, more questions arise: how does this apply to regular expressions? How should spaces be handled? Is an em space equal to an en space? In a professional application, how should "U.S.A." be compared with "USA" (in a free-text search)? On the same line of thinking: how should diacritics be managed in comparisons?

How to handle text storage? Forget about safely detecting the encoding: to open a file you need to know its encoding (unless, of course, you're planning to do like HTML parsers with <meta charset="UTF-8">, or XML/XHTML with encoding="UTF-8" in <?xml?>).

What we see as text on our monitors is just a chunk of bytes in computer memory. By convention each value (or group of values, like an int32_t represents a number) represents a character. How that character is then drawn on screen is delegated to something else (to simplify little bit think about a font).

If we arbitrarily decide that each character is represented with one byte then we have 256 symbols available (just as when we use int8_t, System.SByte or java.lang.Byte for a number we have a numeric range of 256 values). What we need now is to decide, for each value, which character it represents; an example of this is ASCII (limited to 7 bits, 128 values) with custom extensions to also use the upper 128 values.

That's done, habemus character encoding for 256 symbols (including letters, numbers, analphabetic characters and control codes). Yes, each ASCII extension is proprietary, but things are clear and easy to manage. Text processing is so common that we just need to add a proper data type to our favorite languages (char in C - note that formally it's not an alias for unsigned char or signed char but a distinct type; char in Pascal; character in FORTRAN and so on) and a few library functions to manage it.

Unfortunately it's not so easy. ASCII is limited to a very basic character set and it includes only the Latin characters used in the USA (that's why its preferred name is US-ASCII). It's so limited that even English words with diacritical marks aren't supported (whether this drove the change in modern usage, or vice-versa, is another story). You'll see it also has other problems (for example its sorting order, with the resulting problems of ordinal vs alphabetic comparison).

How to deal with that? Introduce a new concept: code pages. Keep a fixed set of basic characters (ASCII) and add another 128 characters specific to each language. The value 0x81 will represent the Cyrillic character Б (in DOS code page 866) and a Greek character (in DOS code page 869).

Now serious problems arise: 1) you cannot mix different alphabets in the same text file. 2) To properly understand a text you also have to know which code page it's expressed in. Where? There is no standard method for that, so you'll have to handle this by asking the user or with a reasonable guess (?!). Even nowadays the ZIP file "format" is limited to ASCII for file names (you may use UTF-8 - see later - but it's not standard, because there is no single standard ZIP format). In this post there is a working Java solution. 3) Even code pages are not standard: each environment has a different set (even DOS code pages and Windows code pages differ) and the names vary too. 4) 256 characters are still too few for, for example, the Chinese or Japanese languages, so more complicated encodings have been introduced (Shift JIS, for example).

The situation was terrible at that time (~1985) and a standard was absolutely needed. ISO/IEC 8859 arrived and it, at least, solved point 3 in the previous problem list. Points 1, 2 and 4 were still unsolved and a solution was needed (especially if your target is not just raw text but also special typographic characters). This standard (after many revisions) is still with us nowadays (and it roughly coincides with the Windows-1252 code page) but you'll probably never use it unless you're working with some legacy system.

Standard which emerged to save us from this chaos is world wide known: Unicode. From Wikipedia:

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. [...] the latest version of Unicode contains a repertoire of more than 110,000 characters covering 100 scripts and multiple symbol sets.

Languages, libraries, Operating Systems have been updated to support Unicode. Now we have all characters we need, a shared well-known code for each, and the past is just a nightmare. Replace char with wchar_t (and accept to live with wcout, wstring and friends), just use System.Char or java.lang.Character and live happy. Right?

NO. It's never so easy. Unicode's mission is "...encoding, representation and handling of text..."; it doesn't translate and adapt different cultures into an abstract code (and it's impossible to do so, unless you kill the beauty in the variety of all our languages). Moreover, the encoding itself introduces some (not so obvious?!) things we have to care about.

"UTF-16 is what usually is associated with Unicode" Only if you are on Windows (and you can never be sure they are really using unicode there, due to earlier UCS-2 and that all is BMP problem). And it is certainly not the C and C++ wide encoding (outside windows, that's UTF-32). Also, since when does C# string deal in codepoints instead of UTF-16 codeunits?

@Deduplicator yes this post is pretty limited to Windows world (and cross platform environment .NET and Java). It's complex enough even without C and C++ cross-platform compatibility. On Windows, unless you're using Windows NT 4, it's always UTF-16 (starting with Win2K so it's a pretty safe assumption). You're right about codepoint vs codeunit, fixed where I saw it and where appropriate, tnx!

To @DOWNVOTERS: as you should know votes to answer and to question are unrelated (even if they're both written by same author). Feel free to disagree about question (and also leave your opinion on the meta post about this) but if you also downvote answer please post a short comment to explain your reasons. It'll be greatly appreciated, it'll improve my knowledge and it'll help future readers to better understand this topic.

@AdrianoRepetti My guess is that this answer might not be found useful because it's too long, even when appreciating your commitment. There are other places to find a blog post, article or a book on the subject :) (FWIW I'm on the fence here so I didn't vote).

c# - How can I perform a Unicode aware character by character comparis...

c# .net unicode

In general, you can't do this. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause "replacement characters" (�) to appear in your text when unsupported characters are found.

byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");
byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
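
For instance, a hedged sketch of that lower-level control with java.nio.charset (class and variable names are illustrative): the first encoder reports an error on un-encodable characters, the second substitutes a replacement byte of your choosing:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class StrictLatin1 {
    public static void main(String[] args) throws CharacterCodingException {
        CharsetEncoder strict = Charset.forName("ISO-8859-1").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT);   // throw instead of silently replacing
        ByteBuffer ok = strict.encode(CharBuffer.wrap("caf\u00E9")); // fine: all characters exist in Latin-1
        System.out.println(ok.remaining());                          // 4 bytes

        CharsetEncoder lenient = Charset.forName("ISO-8859-1").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { '*' });                     // custom replacement instead of '?'
        ByteBuffer bytes = lenient.encode(CharBuffer.wrap("\u20AC")); // the euro sign is not in Latin-1
        System.out.println((char) bytes.get());                       // '*'
    }
}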

For more information on character encoding and why it rightfully doesn't make much sense to go from UTF-8 to ISO-8859 (or ASCII or ANSI for that matter), see this explanation: joelonsoftware.com/articles/Unicode.html

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters [or special chars] in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.

It might be worth mentioning that Windows-1252 (Windows Latin 1) extends ISO-8859-1 (official Latin 1) by filling in some of the "Unicode control" characters 0x80 - 0x9f. Even browsers on Mac and Linux respect that. So at some spots use Windows-1252 instead.

How do I convert between ISO-8859-1 and UTF-8 in Java? - Stack Overflo...

java java-me utf-8 character-encoding iso-8859-1

First, the character encoding used is not directly related to the locale. So changing the locale won't help much.

Second, the ï¿½ is typical for the Unicode replacement character U+FFFD being printed in ISO-8859-1 instead of UTF-8. Here's the evidence:

System.out.println(new String("\uFFFD".getBytes("UTF-8"), "ISO-8859-1")); // prints ï¿½

So there are two problems:

  • Your JVM is reading those special characters as the replacement character (it's decoding them with the wrong charset).
  • Your console is displaying the output using the wrong encoding.

For a Sun JVM, the VM argument -Dfile.encoding=UTF-8 should fix the first problem. The second problem is to be fixed in the console settings. If you're using for example Eclipse, you can change it in Window > Preferences > General > Workspace > Text File Encoding. Set it to UTF-8 as well.

byte[] textArray = f.getName().getBytes();

That should have been the following to exclude influence of platform default encoding:

byte[] textArray = f.getName().getBytes("UTF-8");

If that still displays the same, then the problem lies deeper. What JVM exactly are you using? Do a java -version. As said before, the -Dfile.encoding argument is Sun JVM specific. Some Linux machines ships with GNU JVM or OpenJDK's JVM and this argument may then not work.
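
As a hedged side note (a sketch, not part of the fix described above), you can also take the platform default out of the picture on the output side by wrapping System.out in a PrintStream with an explicit charset:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Write to stdout with an explicit charset instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("caf\u00E9");   // these bytes are UTF-8 regardless of -Dfile.encoding
    }
}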

I tried that and it didn't work. java -Dfile.encoding=UTF-8 load_i18n es_ES special____characters.doc I'm probably wrong, but I'm not convinced there's a console issue yet. I redirect the output to a file so there's no console involved and I still get the same results. I do an "od -a" on the file and here's the relevant output: 0000200 e f i l e nl s p e c i a l _ o ? 0000220 = _ o ? = _ o ? = _ c h a r a c 0000240 t e r s . d o c nl r e a d _ i 1

As to the first problem: that may be platform/JVM specific. Hard to tell from here on. As to the second problem: is the file written with an OutputStreamWriter using UTF-8 and viewed with a viewer supporting UTF-8?

@Mark, not sure why you're passing the 'mangled' filename on the command line. The flow seems to be (1) Java gets correct filename from OS (2) Java writes filename to stdout, where it gets mangled (3) you take the mangled filename and pass it back in to a different tool (4) Java hands the mangled filename to the OS, which can't find the file. Fix (2), and the problem goes away; passing the MANGLED filename in (3) is just making things worse.

Also - "I redirect the output to a file so there's no console involved and I still get the same results." -- do you mean redirect in code, using e.g. a Writer, or using your shell's command-line redirection? If the problem is Java's choice of encoding when writing to System.out, it's just those (incorrect) bytes which your shell will redirect into the file, making exactly the same problem.

my file name is " 03. (feat. 74).mp3 " and i got error filenot found in fileinputstream plz help i use your one but still get same error

unicode - How can I open files containing accents in Java? - Stack Ove...

java unicode character-encoding

One great thing about Java is that it is Unicode based. That means you can use characters from writing systems other than the English alphabet (e.g. Chinese or math symbols), not just in data strings, but in class and variable names too.

Here's an example using Unicode characters in class names and variable names.

class 方 {
    String 北 = "north";
    double π = 3.14159;
}

class UnicodeTest {
    public static void main(String[] arg) {
        方 x1 = new 方();
        System.out.println( x1.北 );
        System.out.println( x1.π );
    }
}

Java was created around the time when the Unicode standard had values defined for a much smaller set of characters. Back then it was felt that 16-bits would be more than enough to encode all the characters that would ever be needed. With that in mind Java was designed to use UTF-16. In fact, the char data type was originally used to be able to represent a 16-bit Unicode code point.

The UTF-8 charset is specified by RFC 2279;

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:

When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.

When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
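
A small hedged sketch of the byte-order mark behaviour quoted above; only the plain "UTF-16" charset writes a BOM when encoding:

import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        byte[] utf16   = "A".getBytes(StandardCharsets.UTF_16);    // FE FF 00 41 (BOM + big-endian 'A')
        byte[] utf16be = "A".getBytes(StandardCharsets.UTF_16BE);  // 00 41 (no BOM)
        byte[] utf16le = "A".getBytes(StandardCharsets.UTF_16LE);  // 41 00 (no BOM)
        System.out.println(utf16.length + " " + utf16be.length + " " + utf16le.length); // 4 2 2
    }
}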

UTF-8 and UTF-16 are not character sets; they're two different variable-width encodings of the very same charset: Unicode.

Java Unicode Confusion - Stack Overflow

java unicode

SHA-1 (and all other hashing algorithms) return binary data. That means that (in Java) they produce a byte[]. That byte array does not represent any specific characters, which means you can't simply turn it into a String like you did.

If you need a String, then you have to format that byte[] in a way that can be represented as a String (otherwise, just keep the byte[] around).

Two common ways of representing arbitrary byte[] as printable characters are BASE64 or simple hex-Strings (i.e. representing each byte by two hexadecimal digits). It looks like you're trying to produce a hex-String.

There's also another pitfall: if you want to get the SHA-1 of a Java String, then you need to convert that String to a byte[] first (as the input of SHA-1 is a byte[] as well). If you simply use myString.getBytes() as you showed, then it will use the platform default encoding and as such will be dependent on the environment you run it in (for example it could return different data based on the language/locale setting of your OS).

A better solution is to specify the encoding to use for the String-to-byte[] conversion like this: myString.getBytes("UTF-8"). Choosing UTF-8 (or another encoding that can represent every Unicode character) is the safest choice here.
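
Putting the two points together, a hedged sketch (class and method names are illustrative) that hashes a String with an explicit encoding and formats the resulting byte[] as a hex String:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Hex {
    static String sha1Hex(String input) throws NoSuchAlgorithmException {
        // Explicit encoding: the same String always produces the same bytes.
        byte[] digest = MessageDigest.getInstance("SHA-1")
                                     .digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));   // two hex digits per byte
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(sha1Hex("hello"));   // 40 hex characters (20 bytes)
    }
}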

Java String to SHA1 - Stack Overflow

java string sha1

As you've probably gathered, Go and Java treat strings differently. In Java, a string is a series of codepoints (characters); in Go, a string is a series of bytes. Text manipulation functions in Go understand UTF-8 codepoints when necessary, but since the string is represented as bytes, the indices they return and work with are byte indexes, not character indexes.

As you observe in the comments, you can use a RuneReader and FindReaderIndex to get indexes in characters rather than bytes. strings.Reader provides an implementation of RuneReader, so you can use strings.NewReader to wrap a string in a RuneReader.

Another option is to take the substring you want the length of in characters and pass it to utf8.RuneCountInString, which returns the number of runes (characters) in a UTF-8 string. Using a RuneReader is probably more efficient, however.

The return value of FindReaderIndex is unclear (char-index or byte-index) to me based on the doc. Specifically, I'm not sure what 's' is referring to. "The match itself is at s[loc[0]:loc[1]]." golang.org/pkg/regexp/#Regexp.FindReaderIndex

@NickSiderakis You're right, it would appear that, ambiguous as it is, it still returns byte indices. In which case, your best option is to use utf8.RuneCountInString to count how many characters occur before the match.

google app engine - Shared GAE datastore, Go <-> Java, regexp.FindStri...

java google-app-engine utf-8 character-encoding go

It looks like you have a character encoding problem. Your Objective-C code treats strings as sequences of 8-bit characters, but Java strings are sequences of 16-bit UTF-16 code units, so you need to pick an explicit encoding when converting Java Strings into bytes. On the other hand, it may be a good idea to consider the byte ordering in your arrays, depending on the CPU architecture you are working on (little or big endianness).

3DES encryption in iPhone app always produces different result from 3D...

java iphone objective-c encryption 3des

ï¿½ is a sequence of three characters - 0xEF 0xBF 0xBD - and is the UTF-8 representation of the Unicode codepoint 0xFFFD. The codepoint in itself is the replacement character for illegal UTF-8 sequences.

Apparently, for some reason, the set of routines involved in your source code (on Linux) is handling the PNG header inaccurately. The PNG header starts with the byte 0x89 (and is followed by 0x50, 0x4E, 0x47), which is correctly handled in Windows (which might be treating the file as a sequence of CP1252 bytes). In CP1252, the 0x89 character is displayed as ‰.

On Linux, however, this byte is being decoded by a UTF-8 routine (or a library that thought it was good to process the file as a UTF-8 sequence). Since 0x89 on its own is not a valid codepoint in the ASCII-7 range (ref: the UTF-8 encoding scheme), it cannot be mapped to a valid UTF-8 codepoint in the 0x00-0x7F range. Also, it cannot be mapped to a valid codepoint represented as a multi-byte UTF-8 sequence, for all multi-byte sequences start with a minimum of 2 bits set to 1 (11....), and since this is the start of the file, it cannot be a continuation byte either. The resulting behavior is that the UTF-8 decoder now replaces 0x89 with the UTF-8 replacement character bytes 0xEF 0xBF 0xBD (how silly, considering that the file is not UTF-8 to begin with), which will be displayed in ISO-8859-1 as ï¿½.

If you need to resolve this problem, you'll need to ensure the following in Linux:

  • Read the bytes in the PNG file, using the suitable encoding for the file (i.e. not UTF-8); this is apparently necessary if you are reading the file as a sequence of characters*, and not necessary if you are reading bytes alone. You might be doing this correctly, so it would be worthwhile to verify the subsequent step(s) also.
  • When you are viewing the contents of the file, use a suitable editor/viewer that does not perform any internal decoding of the file to a sequence of UTF-8 bytes. Using a suitable font will also help, for you might want to prevent the scenario where the glyph (for 0xFFFD it is actually the diamond question-mark character �) cannot be represented, and might result in further changes (unlikely, but you never know how the editor/viewer has been written).
  • It is also a good idea to write the files out (if you are doing so) in the suitable encoding - ISO-8859-1 perhaps, instead of UTF-8. If you are processing and storing the file contents in memory as bytes instead of characters, then writing these to an output stream (without the involvement of any String or character references) is sufficient.

* Apparently, the Java Runtime will perform decoding of the byte sequence to UTF-16 codepoints, if you convert a sequence of bytes to a character or a String object.

Hi Vineet, Great writeup! The thing I want to do is to split the parts into String[] to manipulate the data because I need to encode the png binary into base64. Gonna test this out. Thanks!

I don't think you need to split the file. You can feed in the byte stream to a Base64 encoder like Apache Commons Codec, which will do the job for you. What would be necessary is to read the file in the appropriate encoding.
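
A hedged sketch of that approach using only the standard library (java.util.Base64, available since Java 8, instead of Commons Codec; the file name is illustrative): the PNG is read as raw bytes, so no charset ever touches the 0x89 header byte:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class PngToBase64 {
    public static void main(String[] args) throws Exception {
        // Read raw bytes: no character decoding is involved, so the header is preserved as-is.
        byte[] png = Files.readAllBytes(Paths.get("image.png"));
        String base64 = Base64.getEncoder().encodeToString(png);
        System.out.println(base64.substring(0, Math.min(40, base64.length())) + "...");
    }
}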

There's no appropriate encoding. The file should be read as a sequence of bytes.

@ninjalj, you're right. I was mistaken. I was referring to FileOutputStream and not FileInputStream. Edit: Mistaken again. I meant InputStreamReader and OutputStreamWriter. There is also the String object which can be in a different charset if read from a stream with a different encoding.

Hi guys, please don't use the commenting system as a chat room. It is for leaving a few comments and prods for more information to a question or answer, not for long debates. The reason behind this is that most of the time (and this is one of them), a lot if not all the comments belong as edits to the question/answer to make that more complete. If I have to read a half-page answer + 3 pages of comments, the focus on the comments is too big. Please edit in pertinent details into the answer instead. If you really need to chat, find/create a chat-room on the Chat site, link at the top of the page

java - Reading File from Windows and Linux yields different results (c...

java windows linux character-encoding png

Parsing the file as fixed-size blocks of bytes is not good --- what if some character has a byte representation that straddles across two blocks? Use an InputStreamReader with the appropriate character encoding instead:

BufferedReader br = new BufferedReader(
         new InputStreamReader(
         new FileInputStream("myfile.csv"), "ISO-8859-1"));

 char[] buffer = new char[4096]; // character (not byte) buffer 

 while (true)
 {
      int charCount = br.read(buffer, 0, buffer.length);

      if (charCount == -1) break; // reached end-of-stream 

      String s = String.valueOf(buffer, 0, charCount);
      // alternatively, we can append to a StringBuilder

      System.out.println(s);
 }

Btw, remember to check that the unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.

As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.

encoding - Java App : Unable to read iso-8859-1 encoded file correctly...

java encoding character-encoding iso-8859-1

You are in the gray zone of C++ Unicode. Unicode initially started as an extension of the 7-bit ASCII characters, or of multi-byte characters, to plain 16-bit characters, which later became the BMP. Those 16-bit characters were adopted natively by languages like Java and systems like Windows. C and C++, being more conservative from a standards point of view, decided that wchar_t would be an implementation-dependent wide character type that could be 16 or 32 bits wide (or even more...) depending on requirements. The good side was that it was extensible; the dark side was that it was never made clear how non-BMP Unicode characters should be represented when wchar_t is only 16 bits.

UTF-16 was then created to allow a standard representation of those non-BMP characters, with the downside that they need two 16-bit code units, and that std::char_traits<wchar_t>::length would again be wrong if some of them are present in a wstring.

That's the reason why most C++ implementations chose that wchar_t-based basic IO would only process BMP Unicode characters correctly, so that length returns a true number of characters.

The C++-ish way is to use char32_t based strings when full Unicode support is required. In fact std::wstring and wchar_t (prefix L for literals) are implementation-dependent types, and since C++11 you also have char16_t and u16string (prefix u) that explicitly use UTF-16, or char32_t and u32string (prefix U) for full Unicode support through UTF-32. The problem with storing characters outside the BMP in a u16string is that you lose the property size of string == number of characters, which was a key reason for using wide characters instead of multi-byte characters.

One problem for u32string is that the IO library still has no direct specialization for 32-bit characters, but as the converters exist, you can probably use them easily when you process files with a std::basic_fstream<char32_t> (untested, but according to the standard it should work). But you will have no standard stream for cin, cout and cerr, and will probably have to process the native form in string or u16string, and then convert everything to u32string with the help of the standard converters introduced in C++14, or the hard way if using only C++11.

The really dark side is that, as that native part currently depends on the OS, you will not be able to set up a fully portable way to process full Unicode - or at least I know of none.

std::u16string is not limited to the BMP (any more than std::string is when using UTF-8, or std::wstring is when using UTF-16/32, depending on the size of wchar_t). char16_t and std::u16string are specifically designed for UTF-16 (the u"" string prefix returns a full UTF-16 encoded std::u16string, not a UCS-2 encoded one like you are implying). UTF-16 handles the entire Unicode repertoire. char32_t and std::u32string are designed for UTF-32, which also handles the entire Unicode repertoire.

@RemyLebeau: Thanks for commenting! What I meant is that when you use UTF-16 encoded strings in u16string, you lose the property length == number of chars, which was a key reason for using wide characters instead of multi-byte ones. Hope it is clearer now.

the length == number of chars property was lost decades ago with the invention of MBCS charsets. Only code that deals exclusively in English and other Latin-based languages could ever rely on that property. True international apps don't use that property for a long time. That being said, the majority of modern languages fit in the BMP, so the property usually holds up in most texts. Asian languages, and more recently emojis, and less-common uses of Unicode (ancient languages, math/music symbols etc), require the use of surrogates.

Basic issue regarding full unicode in C++ - Stack Overflow

c++ unicode wstring

Java Strings are Unicode: each char is a 16-bit UTF-16 code unit. Your string is - I suppose - a "C" string. You have to know the name of the character encoding used and use a CharsetDecoder.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class Char8859_1Decoder {

   public static void main( String[] args ) throws CharacterCodingException {
      String hex = "6174656ec3a7c3a36f";
      int len = hex.length();
      byte[] cStr = new byte[len/2];
      for( int i = 0; i < len; i+=2 ) {
         cStr[i/2] = (byte)Integer.parseInt( hex.substring( i, i+2 ), 16 );
      }
      CharsetDecoder decoder = Charset.forName( "UTF-8" ).newDecoder();
      CharBuffer cb = decoder.decode( ByteBuffer.wrap( cStr ));
      System.out.println( cb.toString());
   }
}

utf 8 - Java String HEX to String ASCII with accentuation - Stack Over...

java utf-8 hex ascii

"Extended ASCII" is nebulous. There are many extensions to ASCII that define glyphs for the byte values between 127 and 255. These are referred to as code pages. Some of the more common ones include:

  • CP437, the standard on original IBM PCs
  • ISO 8859-1 (Latin-1) and the closely related Windows code page 1252 which extends it, the encoding used for most Western European-language versions of Windows for everything but the console

You really need to know what character encoding your terminal is expecting, otherwise you'll end up printing garbage. In Java, you should be able to check the value of Charset.defaultCharset() (Charset documentation).
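
A minimal hedged check of that value (the output depends entirely on the OS and locale configuration):

import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset());   // e.g. windows-1252 or UTF-8, depending on the platform
    }
}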

There are many more ways to encode characters than just single-byte "extended ASCII" code pages. Unicode requires far more code points than 255, so there are various fixed-width and variable-width encodings that are used frequently. This page seems to be a good guide to character encoding in Java.

character encoding - How to print the extended ASCII code in java from...

java character-encoding extended-ascii

Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.

If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.

If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).
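
A hedged sketch of both approaches, the per-character check and a regex (the sample string is illustrative); any code point above 127 needs more than one byte in UTF-8:

import java.util.regex.Pattern;

public class MultiByteCheck {
    public static void main(String[] args) {
        String s = "abc\u00E9";   // 'é' is two bytes in UTF-8

        // Loop-style approach: check each code point.
        boolean hasMultiByte = s.codePoints().anyMatch(cp -> cp > 127);
        System.out.println(hasMultiByte);   // true

        // Regex approach: match any character outside the ASCII range.
        Pattern nonAscii = Pattern.compile("[^\\x00-\\x7F]");
        System.out.println(nonAscii.matcher(s).find());   // true
    }
}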

java - Regular expression for Multi Bytes string - Stack Overflow

java regex

Characters are a graphical entity which is part of human culture. When a computer needs to handle text, it uses a representation of those characters in bytes. The exact representation used is called an encoding.

There are many encodings that can represent the same character - either through the Unicode character set, or through other character sets like the various ISO-8859 encodings, or the JIS X 0208.

Internally, Java uses UTF-16. This means that each character can be represented by one or two sequences of two bytes. The character you were using, 最, has the code point U+6700, which is represented in UTF-16 as the byte 0x67 and the byte 0x00.

That's the internal encoding. You can't see it unless you dump your memory and look at the bytes in the dumped image.

But the method getBytes() does not return this internal representation. Its documentation says:

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The "platform's default charset" is what your locale variables say it is. That is, UTF-8. So it takes the UTF-16 internal representation, and converts that into a different representation - UTF-8.

Note that

new String(bytes, StandardCharsets.UTF_16);

does not "convert it to UTF-16 explicitly" as you assumed it does. This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding.

But you have given it a sequence of bytes encoded in UTF-8, and told it to interpret that as UTF-16. This is wrong, and you do not get the character - or the bytes - that you expect.
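
A hedged sketch of that distinction: getBytes(charset) converts out of whatever internal form Java uses, and decoding those bytes with a different charset does not round-trip:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodeDecodeDemo {
    public static void main(String[] args) {
        String s = "\u6700";   // the character discussed above, U+6700

        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);   // E6 9C 80
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16);  // FE FF 67 00 (BOM + one code unit)
        System.out.println(Arrays.toString(utf8) + " / " + Arrays.toString(utf16));

        // Decoding with the charset the bytes were actually written in round-trips correctly:
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(s));   // true
        // Decoding UTF-8 bytes as UTF-16 misinterprets them:
        System.out.println(new String(utf8, StandardCharsets.UTF_16).equals(s));  // false
    }
}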

You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same - but uses your platform's default character set for the conversion.

So you misunderstood what getBytes() gives you. It's not the internal representation. You can't get that directly. only getBytes(StandardCharsets.UTF_16) will give you that, and only because you know that UTF-16 is the internal representation in Java. If a future version of Java decided to represent the characters in a different encoding, then getBytes(StandardCharsets.UTF_16) would not show you the internal representation.

Edit: in fact, Java 9 introduced just such a change in internal representation of strings, where, by default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.

Which encoding does Java uses UTF-8 or UTF-16? - Stack Overflow

java encoding utf-8 default utf-16