python What is the difference between a string and a byte string?


In Python 2, str consists of sequences of 8-bit values, while unicode consists of sequences of Unicode characters. One thing to keep in mind is that str and unicode can be used together with operators if the str consists only of 7-bit ASCII characters.

In Python 3, bytes consists of sequences of 8-bit values, while str consists of sequences of Unicode characters. bytes and str cannot be used together with operators like > or +.
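As a rough illustration of the Python 3 behaviour described here, the sketch below tries to combine a str and a bytes value with + and >. The helper name mixes_cleanly is made up for this example; it only reports whether the operation raised a TypeError.

```python
import operator

text = "abc"     # str: a sequence of Unicode characters
data = b"abc"    # bytes: a sequence of 8-bit values


def mixes_cleanly(op):
    """Return True if op(text, data) runs without raising TypeError.

    Hypothetical helper, used only for this illustration.
    """
    try:
        op(text, data)
        return True
    except TypeError:
        return False


can_add = mixes_cleanly(operator.add)     # str + bytes
can_compare = mixes_cleanly(operator.gt)  # str > bytes
are_equal = (text == data)                # == does not raise, it is simply False

print(can_add, can_compare, are_equal)
```

Note that == between the two types does not raise; the values just never compare equal, which can hide bugs.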

It might be useful to use helper functions to convert between str and unicode in Python 2, and between bytes and str in Python 3.
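A minimal sketch of such helpers for Python 3 might look like the following. The names to_str and to_bytes and the UTF-8 default are assumptions made for this example, not something the answer prescribes.

```python
def to_str(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


print(to_str(b"caf\xc3\xa9"))   # bytes in, str out
print(to_bytes("caf\u00e9"))    # str in, bytes out
print(to_str("already text"))   # str passes through unchanged
```

Calling the helpers at the boundaries of a program (file and network I/O) keeps the rest of the code working with a single type.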

You can check this question and its answers to see how to convert between bytes and str in Python 3.


The only thing that a computer can store is bytes.

To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:

  • If you want to store a picture, you must first encode it using PNG, JPEG, etc.
  • If you want to store music, you must first encode it using MP3, WAV, etc.
  • If you want to store text, you must first encode it using ASCII, UTF-8, etc.

MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc. in bytes.

In Python, a byte string is just that: a sequence of bytes. It isn't human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.

On the other hand, a character string, often just called a "string", is a sequence of characters. It is human-readable. A character string can't be directly stored in a computer; it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.

'I am a string'.encode('ASCII')

The above Python code will encode the string 'I am a string' using the encoding ASCII. The result of the above code will be a byte string. If you print it, Python will represent it as b'I am a string'. Remember, however, that byte strings aren't human-readable; it's just that Python decodes them from ASCII when you print them. In Python, a byte string is represented by a b, followed by the byte string's ASCII representation.

A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.

b'I am a string'.decode('ASCII')

The above code will return the original string 'I am a string'.

Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.

Absolutely brilliant. Lucid and easy to understand. However, I would like to mention that this line - "If you print it, Python will represent it as b'I am a string'" - is true only for Python 3, since in Python 2 bytes and str are the same thing.

I am awarding you this bounty for offering a very human-readable explanation to put some clarity in this subject!

Zenadix deserves some kudos here. After some years functioning in this environment, his is the first explanation that clicked with me. I may tattoo it on my other arm (one arm already has "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky).

Link to Joel's post mentioned by @neil.millikin above: joelonsoftware.com/2003/10/08/
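To make the answer's encode/decode round trip concrete, here is a small Python 3 sketch. It shows that a byte string is really a sequence of integers, and that ASCII cannot encode every character. The helper encodes_as_ascii is a name invented for this example.

```python
text = "I am a string"
data = text.encode("ascii")        # str -> bytes

# Indexing or iterating over bytes yields integers, not characters.
first_bytes = list(data[:4])       # [73, 32, 97, 109]

# Decoding with the same encoding restores the original text.
round_trip = data.decode("ascii")


def encodes_as_ascii(s):
    """Return True if every character of s fits in ASCII (hypothetical helper)."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(repr(data), first_bytes, round_trip)
print(encodes_as_ascii("plain"), encodes_as_ascii("caf\u00e9"))
```

The b'...' form that print shows is only a display convention; the underlying values are the integers in first_bytes.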


Assuming Python 3 (in Python 2, this difference is a little less well-defined) - a string is a sequence of characters, i.e. Unicode code points; these are an abstract concept, and can't be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes - things that can be stored on disk. The mapping between them is an encoding - there are quite a lot of these (and infinitely many are possible) - and you need to know which applies in the particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:

>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'

Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:

>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'

To clarify for Python 2 users: the str type is the same as the bytes type; this answer is equivalently comparing the unicode type (does not exist in Python 3) to the str type.

If they can't be directly stored on disk, how are they stored in memory?

To be technically correct, unicode is not the default encoding, rather the utf-8 encoding is the default character encoding to store unicode strings in memory.

@KshitijSaraogi that isn't quite true either; that whole sentence was edited in and is a bit unfortunate. The in-memory representation of Python 3 str objects is not accessible or relevant from the Python side; the data structure is just a sequence of code points. Under PEP 393, the exact internal encoding is one of Latin-1, UCS2 or UCS4, and a utf-8 representation may be cached after it is first requested, but even C code is discouraged from relying on these internal details.
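The point that the same bytes can decode to different strings can also be checked with a pair of encodings that do not depend on byte order. This is a small Python 3 sketch; UTF-8 versus Latin-1 and the sample character are choices made for this example.

```python
text = "\u00e9"                      # one code point: e with acute accent
data = text.encode("utf-8")          # two bytes: b'\xc3\xa9'

as_utf8 = data.decode("utf-8")       # the original character
as_latin1 = data.decode("latin-1")   # two unrelated characters

# A str is measured in code points, a bytes object in bytes.
print(len(text), len(data))
print(as_utf8 == text, as_latin1 == text)
```

Latin-1 never fails to decode, because every byte value maps to some character, so a wrong guess produces garbled text rather than an error.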


Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

So when a computer represents a string, it identifies each character by its unique Unicode number, and these numbers are stored in memory. But you can't directly write the string to disk or transmit it over a network as Unicode numbers, because they are just plain numbers with no agreed byte layout. You should first encode the string to a byte string, for example with UTF-8. UTF-8 is a character encoding capable of encoding all possible characters, and it stores characters as bytes. The encoded string can then be used everywhere, because UTF-8 is supported nearly everywhere. When you open a text file encoded in UTF-8 from another system, your computer decodes it and displays its characters through their unique Unicode numbers. When a browser receives string data encoded in UTF-8 from the network, it decodes the data to a string (assuming the browser uses UTF-8) and displays the string.

In Python 3, you can convert between a string and a byte string:

>>> print('中文'.encode('utf-8'))
b'\xe4\xb8\xad\xe6\x96\x87'
>>> print(b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8'))
中文

In a word, a string is for displaying to humans to read on a computer, and a byte string is for storing to disk and data transmission.
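To see the "unique number for every character" idea in practice, this short Python 3 sketch lists the Unicode code points of a two-character string next to the bytes that UTF-8 uses to store them. The variable names are chosen for this example only.

```python
text = "\u4e2d\u6587"                     # the two characters of '中文'

# ord() gives each character's Unicode number (code point).
code_points = [ord(ch) for ch in text]    # [20013, 25991]

# Encoding turns those numbers into a concrete byte layout.
encoded = text.encode("utf-8")
byte_values = list(encoded)               # [228, 184, 173, 230, 150, 135]

# chr() is the inverse of ord().
rebuilt = "".join(chr(n) for n in code_points)

print(code_points, byte_values, rebuilt == text)
```

Two code points become six bytes here, which is why the length of a string and the length of its encoded form usually differ.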
