Strings and bytes ================= The topic of strings may provide some confusion for programmers that used to program only in Python 2. In Python 3, there is only one datatype capable of storing textual information. It is ``str``, or simply string. It is an immutable sequence that stores Unicode code points. This is the major difference from Python 2, where ``str`` represented byte strings: something that is now handled by the ``bytes`` objects (but not exactly in the same way). Strings in Python are sequences. This single fact should be enough to include them in a section covering other container types. But they differ from other container types in one important detail. Strings have very specific limitations on what type of data they can store, and that is Unicode text. ``bytes``, and its mutable alternative, ``bytearray``, differs from ``str`` by allowing only bytes as a sequence value, and bytes in Python are integers in the ``0 <= x < 256`` range. This may be a bit confusing at the beginning, because, when printed, they may look very similar to strings: .. code-block:: python >>> print(bytes([102, 111, 111])) b'foo' The ``bytes`` and ``bytearray`` types allow you to work with raw binary data that may not always have to be textual (for example, audio/video files, images, and network packets). The true nature of these types is revealed when they are converted into other sequence types, such as list or tuple: .. code-block:: python >>> list(b'foo bar') [102, 111, 111, 32, 98, 97, 114] >>> tuple(b'foo bar') (102, 111, 111, 32, 98, 97, 114) A lot of Python 3 controversy was about breaking the backwards compatibility for string literals and how Python deals with Unicode. Starting from Python 3.0, every string literal without any prefix is Unicode. So, literals enclosed by single quotes (``'``), double quotes (``"``), or groups of three quotes (single or double) without any prefix represent the ``str`` data type: .. code-block:: python >>> type("some string") In Python 2, the Unicode literals required a ``u`` prefix (like ``u"some string"``). This prefix is still allowed for backwards compatibility (starting from Python 3.3), but does not hold any syntactic meaning in Python 3. Byte literals were already presented in some of the previous examples, but let's explicitly present their syntax for the sake of consistency. Bytes literals are enclosed by single quotes, double quotes, or triple quotes, but must be preceded with a ``b`` or ``B`` prefix: .. code-block:: python >>> type(b"some bytes") Note that Python does not provide a syntax for ``bytearray`` literals. If you want to create a ``bytearray`` value, you need to use a ``byte``s literal and a ``bytearray()`` type constructor: .. code-block:: python >>> bytearray(b'some bytes') bytearray(b'some bytes') It is important to remember that Unicode strings contain abstract text that is independent from the byte representation. This makes them unable to be saved on the disk or sent over the network without encoding them to binary data. There are two ways to encode string objects into byte sequences: - Using the ``str.encode(encoding, errors)`` method, which encodes the string using a registered codec for encoding. Codec is specified using the encoding argument, and, by default, it is 'utf-8'. The second argument, errors, specifies the error handling scheme. It can be 'strict' (default), 'ignore' , 'replace' , 'xmlcharrefreplace', or any other registered handler (refer to the built-in codecs module documentation). - Using the ``bytes(source, encoding, errors``) constructor, which creates a new bytes sequence. When the source is of the ``str`` type, then the encoding argument is obligatory and it does not have a default value. The usage of the encoding and errors arguments is the same as for the ``str.encode()`` method. Binary data represented by ``bytes`` can be converted into a string in an analogous way: - Using the ``bytes.decode(encoding, errors)`` method, which decodes the bytes using the codec registered for encoding. The arguments of this method have the same meaning and defaults as the arguments of ``str.encode()``. - Using the ``str(source, encoding, error)`` constructor, which creates a new string instance. Similar to the ``bytes()`` constructor, the encoding argument in the ``str()`` call has no default value and must be provided if the bytes sequence is used as a source. .. tip:: Due to changes made in Python 3, some people tend to refer to the ``bytes`` instances as byte strings. This is mostly due to historic reasons: ``bytes`` in Python 3 is the sequence type that is the closest one to the ``str`` type from Python 2 (but not the same). Still, the ``bytes`` instance is a sequence of bytes and also does not need to represent textual data. So, in order to avoid any confusion, it is advised to always refer to them as either bytes or byte sequence, despite their similarities to strings. The concept of strings is reserved for textual data in Python 3, and this is now always ``str``. 1. Implementation details ************************* Python strings are immutable. This is also true for byte sequences. This is an important fact, because it has both advantages and disadvantages. It also affects the way strings should be handled in Python efficiently. Thanks to immutability, strings can be used as dictionary keys or set collection elements because, once initialized, they will never change their value. On the other hand, whenever a modified string is required (even with only tiny modification), a completely new instance needs to be created. Fortunately, ``bytearray``, as a mutable version of ``bytes``, does not have such an issue. Byte arrays can be modified inplace (without creating new objects) through item assignments and can be dynamically resized, exactly like lists: using appends, pops, inserts, and so on. 2. String concatenation *********************** The fact that Python strings are immutable imposes some problems when multiple string instances need to be joined together. As we stated previously, concatenating immutable sequences results in the creation of a new sequence object. Consider that a new string is built by repeated concatenation of multiple strings, as follows: .. code-block:: python substrings = ["These ", "are ", "strings ", "to ", "concatenate."] s = "" for substring in substrings: s += substring This will result in quadratic runtime costs in the total string length. In other words, it is highly inefficient. For handling such situations, the ``str.join()`` method is available. It accepts iterables of strings as the argument and returns joined strings. The call to ``join()`` of the str type can be done in two forms: .. code-block:: python # using empty literal s = "".join(substrings) # using "unbound" method call str.join("", substrings) The first form of the ``join()`` call is the most common idiom. The string that provides this method will be used as a separator between concatenated substrings. Consider the following example: .. code-block:: python >>> ','.join(['some', 'comma', 'separated', 'values']) 'some,comma,separated,values' It is worth remembering that just because it is faster (especially for large lists), it does not mean that the ``join()`` method should be used in every situation where two strings need to be concatenated. Despite being a widely recognized idiom, it does not improve code readability. And readability counts! There are also some situations where ``join()`` may not perform as well as ordinary concatenation with a + operator. Here are some examples: - If the number of substrings is very small and they are not contained already by some iterable variable (existing list or tuple of strings): in some cases the overhead of creating a new sequence just to perform concatenation can overshadow the gain of using ``join()``. - When concatenating short literals: thanks to some interpreter-level optimizations, such as constant folding in CPython, some complex literals (not only strings), such as ``'a' + 'b' +'c'``, can be translated into a shorter form at compile time (here ``'abc'``). Of course, this is enabled only for constants (literals) that are relatively short. Ultimately, if the number of strings to concatenate is known beforehand, the best readability is ensured by proper string formatting either using the ``str.format()`` method, the ``%`` operator, or f-string formatting. In code sections where the performance is not critical or the gain from optimizing string concatenation is very little, string formatting is recommended as the best alternative to concatenation. 2.1. Constant folding, the peephole optimizer, and the AST optimizer ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ CPython uses various techniques to optimize your code. The first optimization takes place as soon as source code is transformed into the form of the abstract syntax tree, just before it is compiled into byte code. CPython can recognize specific patterns in the abstract syntax tree and make direct modifications to it. The other kind of optimizations are handled by the peephole optimizer. It implements a number of common optimizations directly on Python's byte code. As we mentioned earlier, constant folding is one such feature. It allows the interpreter to convert complex literal expressions (such as ``"one" + " " + "thing"``, ``" " * 79``, or ``60 * 1000``) into a single literal that does not require any additional operations (concatenation or multiplication) at runtime. Until Python 3.5, all constant folding was done in CPython only by the peephole optimizer. For strings, the resulting constants were limited in length by a hardcoded value. In Python 3.5, this value was equal to 20. In Python 3.7, most of the constant folding optimizations are handled earlier on the abstract syntax tree level. These particular details are a curiosity rather than a thing that can be relied on in your day-to-day programming. Information about other interesting optimizations performed by AST and peephole optimizers can be found in the *Python/ast_opt.c* and *Python/peephole.c* files of Python's source code. 3. String formatting with f-strings *********************************** F-strings are one of the most beloved new Python features that came with Python 3.6. It's also one of the most controversial features of that release. The f-strings or formatted string literals that were introduced by the PEP 498 document add a new way to format strings in Python. Before Python 3.6, there were two basic ways to format strings: - Using ``%`` formatting for example ``"Some string with included % value" % "other"`` - Using the ``str.format()`` method for example ``"Some string with included {other} value".format(other="other")`` Formatted string literals are denoted with the ``f`` prefix, and their syntax is closest to the ``str.format()`` method, as they use a similar markup for denoting replacement fields in the text that has to be formatted. In the ``str.format()`` method, the text substitutions refer to arguments and keyword arguments that are passed to the formatting method. You can use either anonymous substitutions that will translate to consecutive argument indexes, explicit argument indexes, or keyword names. This means that the same string can be formatted in different ways: .. code-block:: python >>> from sys import version_info >>> "This is Python {}.{}".format(*version_info) 'This is Python 3.7' >>> "This is Python {0}.{1}".format(*version_info) 'This is Python 3.7' >>> "This is Python {major}.{minor}".format(major=version_info.major, minor=version_info.minor) 'This is Python 3.7' What makes f-strings special is that replacement fields can be any Python expression, and it will be evaluated at runtime. Inside of strings, you have access to any variable that is available in the same namespace as the formatted literal. With f-strings, the preceding examples could be written in the following way: .. code-block:: python >>> from sys import version_info >>> f"This is Python {version_info.major}.{version_info.minor}" 'This is Python 3.7' The ability to use expressions as replacement fields make formatting code simpler and shorter. You can also use the same formatting specifiers of replacement fields (for padding, aligning, signs, and so on) as the ``str.format()`` method, and the syntax is as follows: .. code-block:: python f"{replacement_field_expression:format_specifier}" The following is a simple example of code that prints the first ten powers of the number 10 using f-strings and aligns results using string formatting with padding: .. code-block:: python >>> for x in range(10): ... print(f"10^{x} == {10**x:10d}") ... 10^0 == 1 10^1 == 10 10^2 == 100 10^3 == 1000 10^4 == 10000 10^5 == 100000 10^6 == 1000000 10^7 == 10000000 10^8 == 100000000 10^9 == 1000000000 The full formatting specification of the Python string is almost like a separate minilanguage inside Python. The best reference for it is the official documentation which you can find under `https://docs.python.org/3/library/string.html `_. Another useful internet resource for that topic is `https://pyformat.info/ `_, which presents the most important elements of this specification using practical examples.