Strings and bytes

Strings can be a source of confusion for programmers who previously worked only in Python 2. In Python 3, there is only one datatype capable of storing textual information: str, or simply string. It is an immutable sequence that stores Unicode code points. This is the major difference from Python 2, where str represented byte strings, something that is now handled by bytes objects (but not in exactly the same way).

Strings in Python are sequences. This single fact should be enough to include them in a section covering other container types. But they differ from other container types in one important detail. Strings have very specific limitations on what type of data they can store, and that is Unicode text.

bytes and its mutable alternative, bytearray, differ from str by allowing only bytes as sequence values, and bytes in Python are integers in the 0 <= x < 256 range. This may be a bit confusing at first because, when printed, they may look very similar to strings:

>>> print(bytes([102, 111, 111]))
b'foo'

The bytes and bytearray types allow you to work with raw binary data that may not always have to be textual (for example, audio/video files, images, and network packets). The true nature of these types is revealed when they are converted into other sequence types, such as list or tuple:

>>> list(b'foo bar')
[102, 111, 111, 32, 98, 97, 114]
>>> tuple(b'foo bar')
(102, 111, 111, 32, 98, 97, 114)
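The same integer nature shows up when a single item is accessed. The following is a minimal sketch (the variable names are only illustrative) contrasting indexing and slicing of bytes with indexing of str:

```python
data = b"foo bar"

first_item = data[0]       # indexing a bytes object yields an int
first_slice = data[0:1]    # slicing yields another bytes object
text_item = "foo bar"[0]   # indexing a str yields a one-character str

print(first_item)    # 102
print(first_slice)   # b'f'
print(text_item)     # f
```

If a one-byte bytes object is needed rather than an integer, a slice such as data[0:1] is the usual way to get it.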

A lot of Python 3 controversy was about breaking the backwards compatibility for string literals and how Python deals with Unicode. Starting from Python 3.0, every string literal without any prefix is Unicode. So, literals enclosed by single quotes ('), double quotes ("), or groups of three quotes (single or double) without any prefix represent the str data type:

>>> type("some string")
<class 'str'>

In Python 2, the Unicode literals required a u prefix (like u"some string"). This prefix is still allowed for backwards compatibility (starting from Python 3.3), but does not hold any syntactic meaning in Python 3.

Bytes literals were already presented in some of the previous examples, but let’s explicitly present their syntax for the sake of consistency. Bytes literals are enclosed by single quotes, double quotes, or triple quotes, but must be preceded by a b or B prefix:

>>> type(b"some bytes")
<class 'bytes'>

Note that Python does not provide a syntax for bytearray literals. If you want to create a bytearray value, you need to use a bytes literal and the bytearray() type constructor:

>>> bytearray(b'some bytes')
bytearray(b'some bytes')

It is important to remember that Unicode strings contain abstract text that is independent of any byte representation. They cannot be saved to disk or sent over the network without first being encoded to binary data. There are two ways to encode string objects into byte sequences:

  • Using the str.encode(encoding, errors) method, which encodes the string using a registered codec. The codec is specified using the encoding argument, which defaults to 'utf-8'. The second argument, errors, specifies the error handling scheme. It can be 'strict' (default), 'ignore', 'replace', 'xmlcharrefreplace', or any other registered handler (refer to the built-in codecs module documentation).

  • Using the bytes(source, encoding, errors) constructor, which creates a new bytes sequence. When the source is of the str type, then the encoding argument is obligatory and it does not have a default value. The usage of the encoding and errors arguments is the same as for the str.encode() method.

Binary data represented by bytes can be converted into a string in an analogous way:

  • Using the bytes.decode(encoding, errors) method, which decodes the bytes using the codec registered for encoding. The arguments of this method have the same meaning and defaults as the arguments of str.encode().

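  • Using the str(source, encoding, errors) constructor, which creates a new string instance. As with the bytes() constructor, the encoding argument should be provided explicitly when a bytes sequence is used as a source: if neither encoding nor errors is given, str() does not decode anything and instead returns the printable representation of the bytes object (for example, "b'foo'").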
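The four conversions described above can be sketched as follows. This is only an illustration: the sample text and variable names are made up for the example, and the ASCII codec is used just to trigger the different error handlers.

```python
text = "caf\u00e9"  # the word "café"; the escape keeps this source ASCII-only

# Two equivalent ways of encoding str into bytes
encoded = text.encode("utf-8")
same_encoded = bytes(text, "utf-8")

# Two equivalent ways of decoding bytes into str
decoded = encoded.decode("utf-8")
same_decoded = str(encoded, "utf-8")

# Error handlers decide what happens to characters the codec cannot encode
replaced = text.encode("ascii", errors="replace")            # b'caf?'
ignored = text.encode("ascii", errors="ignore")              # b'caf'
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")  # b'caf&#233;'

# The default 'strict' handler raises an exception instead
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# bytes() refuses a str source when no encoding is given
try:
    bytes(text)
    missing_encoding_failed = False
except TypeError:
    missing_encoding_failed = True

# str() without an encoding does not decode; it returns the repr of the bytes
as_repr = str(encoded)
```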

Tip

Due to changes made in Python 3, some people tend to refer to the bytes instances as byte strings. This is mostly due to historic reasons: bytes in Python 3 is the sequence type that is the closest one to the str type from Python 2 (but not the same). Still, the bytes instance is a sequence of bytes and also does not need to represent textual data. So, in order to avoid any confusion, it is advised to always refer to them as either bytes or byte sequence, despite their similarities to strings. The concept of strings is reserved for textual data in Python 3, and this is now always str.

1. Implementation details

Python strings are immutable. This is also true for bytes objects. This is an important fact, because it has both advantages and disadvantages. It also affects the way strings should be handled efficiently in Python. Thanks to immutability, strings can be used as dictionary keys or set elements because, once initialized, they will never change their value. On the other hand, whenever a modified string is required (even with only a tiny modification), a completely new instance needs to be created. Fortunately, bytearray, as a mutable version of bytes, does not have this issue. Byte arrays can be modified in place (without creating new objects) through item assignment and can be dynamically resized, exactly like lists: using appends, pops, inserts, and so on.
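The difference between the immutable bytes type and the mutable bytearray type can be illustrated with a short sketch. The names used here are arbitrary, and id() is used only to show that the same object is being modified:

```python
frozen = b"hello"
try:
    frozen[0] = ord("j")          # bytes does not support item assignment
    bytes_are_mutable = True
except TypeError:
    bytes_are_mutable = False

buffer = bytearray(b"hello")
identity_before = id(buffer)

buffer[0] = ord("j")              # item assignment takes an int in range(256)
buffer.append(ord("!"))           # grow by a single byte
buffer.extend(b"??")              # grow by several bytes
last = buffer.pop()               # remove and return the last byte as an int
buffer.insert(0, ord(">"))        # insert a byte at the beginning

same_object = id(buffer) == identity_before
snapshot = bytes(buffer)          # immutable copy of the current content

print(buffer)       # bytearray(b'>jello!?')
print(same_object)  # True
```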

2. String concatenation

The fact that Python strings are immutable imposes some problems when multiple string instances need to be joined together. As we stated previously, concatenating immutable sequences results in the creation of a new sequence object. Consider that a new string is built by repeated concatenation of multiple strings, as follows:

substrings = ["These ", "are ", "strings ", "to ", "concatenate."]
s = ""

for substring in substrings:
    s += substring

In the worst case, this results in quadratic runtime cost in the total string length, because every step may copy everything accumulated so far into a new object. In other words, it can be highly inefficient. For handling such situations, the str.join() method is available. It accepts an iterable of strings as the argument and returns a single joined string. The call to join() of the str type can be done in two forms:

# using empty literal
s = "".join(substrings)

# using "unbound" method call
str.join("", substrings)

The first form of the join() call is the most common idiom. The string that provides this method will be used as a separator between concatenated substrings. Consider the following example:

>>> ','.join(['some', 'comma', 'separated', 'values'])
'some,comma,separated,values'

It is worth remembering that the join() method being faster (especially for large lists) does not mean it should be used in every situation where two strings need to be concatenated. Despite being a widely recognized idiom, it does not improve code readability. And readability counts! There are also some situations where join() may not perform as well as ordinary concatenation with the + operator. Here are some examples:

  • If the number of substrings is very small and they are not contained already by some iterable variable (existing list or tuple of strings): in some cases the overhead of creating a new sequence just to perform concatenation can overshadow the gain of using join().

  • When concatenating short literals: thanks to some interpreter-level optimizations, such as constant folding in CPython, some complex literals (not only strings), such as 'a' + 'b' + 'c', can be translated into a shorter form at compile time (here 'abc'). Of course, this is enabled only for constants (literals) that are relatively short.

Ultimately, if the number of strings to concatenate is known beforehand, the best readability is ensured by proper string formatting either using the str.format() method, the % operator, or f-string formatting. In code sections where the performance is not critical or the gain from optimizing string concatenation is very little, string formatting is recommended as the best alternative to concatenation.
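As a rough illustration of the readability argument, the following sketch builds the same message using plain concatenation and the three formatting styles mentioned above. The variable names and the message itself are made up for this example:

```python
name = "spam"
count = 3

# Plain concatenation: every non-string value needs an explicit str() call
by_concat = "Ordered " + str(count) + " cans of " + name + "."

# The three formatting styles keep the template in one readable piece
by_percent = "Ordered %d cans of %s." % (count, name)
by_format = "Ordered {} cans of {}.".format(count, name)
by_fstring = f"Ordered {count} cans of {name}."

print(by_fstring)  # Ordered 3 cans of spam.
```

All four variables hold the same text; the difference lies only in how easy the template is to read and maintain.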

2.1. Constant folding, the peephole optimizer, and the AST optimizer

CPython uses various techniques to optimize your code. The first optimization takes place as soon as source code is transformed into the form of the abstract syntax tree, just before it is compiled into bytecode. CPython can recognize specific patterns in the abstract syntax tree and make direct modifications to it. The other kind of optimization is handled by the peephole optimizer. It implements a number of common optimizations directly on Python’s bytecode. As we mentioned earlier, constant folding is one such feature. It allows the interpreter to convert complex literal expressions (such as "one" + " " + "thing", " " * 79, or 60 * 1000) into a single literal that does not require any additional operations (concatenation or multiplication) at runtime.

Up to and including Python 3.6, all constant folding in CPython was done only by the peephole optimizer. For strings, the resulting constants were limited in length by a hardcoded value. In Python 3.5, this value was equal to 20. In Python 3.7, most of the constant folding optimizations are handled earlier, at the abstract syntax tree level. These particular details are a curiosity rather than something that can be relied on in your day-to-day programming. Information about other interesting optimizations performed by the AST and peephole optimizers can be found in the Python/ast_opt.c and Python/peephole.c files of Python’s source code.
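One way to observe constant folding is to compile a small piece of source code and look at the constants stored in the resulting code object. The following is only a sketch that assumes a reasonably recent CPython interpreter; the constants_of() helper is hypothetical and exists only for this example, and other implementations or versions may store constants differently:

```python
def constants_of(source):
    """Return the constants stored in the code object compiled from source."""
    return compile(source, "<demo>", "exec").co_consts


# On CPython, the literal expressions are expected to be folded at compile
# time, so the code object stores the final value instead of the operands.
folded_text = constants_of('s = "one" + " " + "thing"')
folded_number = constants_of("n = 60 * 1000")

print(folded_text)
print(folded_number)
```

On CPython, the printed tuples should include 'one thing' and 60000, which means no concatenation or multiplication is left to perform at runtime.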

3. String formatting with f-strings

F-strings are one of the most beloved new Python features that came with Python 3.6. They are also one of the most controversial features of that release. F-strings, or formatted string literals, were introduced by the PEP 498 document and add a new way to format strings in Python. Before Python 3.6, there were two basic ways to format strings:

  • Using % formatting, for example, "Some string with included %s value" % "other"

  • Using the str.format() method, for example, "Some string with included {other} value".format(other="other")

Formatted string literals are denoted with the f prefix, and their syntax is closest to the str.format() method, as they use a similar markup for denoting replacement fields in the text that has to be formatted. In the str.format() method, the text substitutions refer to arguments and keyword arguments that are passed to the formatting method. You can use either anonymous substitutions that will translate to consecutive argument indexes, explicit argument indexes, or keyword names.

This means that the same string can be formatted in different ways:

>>> from sys import version_info
>>> "This is Python {}.{}".format(*version_info)
'This is Python 3.7'
>>> "This is Python {0}.{1}".format(*version_info)
'This is Python 3.7'
>>> "This is Python {major}.{minor}".format(major=version_info.major, minor=version_info.minor)
'This is Python 3.7'

What makes f-strings special is that a replacement field can contain any Python expression, which is evaluated at runtime. Inside an f-string, you have access to any variable that is available in the same namespace as the formatted literal. With f-strings, the preceding examples could be written in the following way:

>>> from sys import version_info
>>> f"This is Python {version_info.major}.{version_info.minor}"
'This is Python 3.7'
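Replacement fields are not limited to attribute access. The following sketch, with made-up variable names, uses method calls, function calls, and arithmetic directly inside the literal:

```python
items = [3, 5, 8]
label = "total"

# Each replacement field is an ordinary expression evaluated at runtime
summary = f"{label.upper()}: {sum(items)} from {len(items)} items"
average = f"average is {sum(items) / len(items):.2f}"

print(summary)  # TOTAL: 16 from 3 items
print(average)  # average is 5.33
```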

The ability to use expressions as replacement fields makes formatting code simpler and shorter. You can also use the same format specifiers for replacement fields (for padding, aligning, signs, and so on) as in the str.format() method, and the syntax is as follows:

f"{replacement_field_expression:format_specifier}"

The following is a simple example of code that prints the first ten powers of the number 10 using f-strings and aligns results using string formatting with padding:

>>> for x in range(10):
...     print(f"10^{x} == {10**x:10d}")
...
10^0 ==          1
10^1 ==         10
10^2 ==        100
10^3 ==       1000
10^4 ==      10000
10^5 ==     100000
10^6 ==    1000000
10^7 ==   10000000
10^8 ==  100000000
10^9 == 1000000000
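A few other commonly used format specifiers are sketched below. The square brackets are there only to make the padding visible, and the variable names are illustrative:

```python
value = 42

right = f"[{value:>6d}]"      # right-aligned in a field of width 6
left = f"[{value:<6d}]"       # left-aligned
centered = f"[{value:^6d}]"   # centered
zero_padded = f"[{value:06d}]"  # padded with leading zeros
signed = f"[{value:+d}]"      # always show the sign
ratio = f"[{2 / 3:.3f}]"      # fixed-point with three decimal places

for result in (right, left, centered, zero_padded, signed, ratio):
    print(result)
```

The same specifiers work unchanged with str.format(), for example "[{:>6d}]".format(value).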

The full format specification for Python strings is almost like a separate minilanguage inside Python. The best reference for it is the official documentation, which you can find at https://docs.python.org/3/library/string.html. Another useful internet resource on this topic is https://pyformat.info/, which presents the most important elements of this specification using practical examples.