The difference between bytearray and bytes in Python
Python 3 has two classes representing raw data: bytes and bytearray. At a cursory glance they seem very similar. However, there is a difference that becomes crucial in certain applications.
bytes, like str, is immutable: once created, its contents cannot change.
bytearray is mutable: it can be modified in place.
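As a minimal illustration of that difference (the variable names here are only for demonstration), item assignment fails on bytes but works on bytearray:

```python
data = b"abc"
try:
    data[0] = 0x41          # bytes does not support item assignment
except TypeError as exc:
    error = exc             # keep the exception so we can inspect it

buf = bytearray(b"abc")
buf[0] = 0x41               # allowed: replaces the first byte with b"A"
buf.extend(b"def")          # grows the same object in place
```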
One case where this matters is when we’re dealing with I/O operations, and thus buffering. For example, we may be receiving data over the network and waiting for message headers and terminators to appear in the stream before we can parse the message. So we keep adding incoming bytes to a buffer.
With bytes, we can achieve this with the following (pseudo) code:
    buffer = b''
    while message_not_complete(buffer):
        buffer += read_from_socket()
However, there is a significant cost we’re paying for each addition to the buffer. Since bytes is an immutable type, every time we append more bytes to buffer, Python has to allocate a new object holding the concatenation of buffer and the return value of read_from_socket, and copy the existing contents into it. This repeated copying is slow, and it shows when you’re processing a high volume of data.
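A small sketch can make the reallocation visible. Keeping a second reference to the object shows that += on bytes produces a new object, whereas the same operation on a bytearray updates the existing one. The names before and same are illustrative only.

```python
# bytes: += builds a new object and rebinds the name
buffer = b"abc"
before = buffer                       # second reference to the original object
buffer += b"def"
bytes_rebound = buffer is not before  # the name now points at a new object

# bytearray: += and extend() modify the existing object
array = bytearray(b"abc")
same = array                          # second reference to the same bytearray
array += b"def"
array.extend(b"ghi")
bytearray_in_place = array is same    # still the very same object
```

Here before still holds the original three bytes, while same sees every byte that was added through array.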
A bytearray implementation of buffering looks very similar:
    buffer = bytearray()
    while message_not_complete(buffer):
        buffer.extend(read_from_socket())
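To try the loop without a real network connection, the two helpers can be replaced with stand-ins. The sketch below is only an assumption about what they might look like: read_from_socket returns pre-made chunks instead of calling a socket, and message_not_complete treats a hypothetical b"\r\n" terminator as the end of a message.

```python
TERMINATOR = b"\r\n"  # assumed message terminator, not from the original post

# Simulated network reads; a real version would call sock.recv() instead.
chunks = iter([b"HELLO ", b"WOR", b"LD\r\n"])

def read_from_socket():
    """Return the next simulated chunk of incoming data."""
    return next(chunks)

def message_not_complete(buffer):
    """The message is complete once the terminator appears in the buffer."""
    return TERMINATOR not in buffer

buffer = bytearray()
while message_not_complete(buffer):
    buffer.extend(read_from_socket())

message = bytes(buffer)  # immutable snapshot to hand to a parser
```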
Yet this slight modification can be an order of magnitude faster or more. Because bytearray is mutable, it can be treated similarly to list. It even has similar methods, such as append and extend, and they both perform much better than concatenation of bytes. Here’s a quick test:
    In : %%timeit x = b''
    ...: x += b'x'
    ...:
    100000 loops, best of 3: 3.02 µs per loop

    In : %%timeit x = bytearray()
    ...: x.extend(b'x')
    ...:
    10000000 loops, best of 3: 152 ns per loop
In this test, appending to a bytearray was about 20× faster than concatenating bytes. I recently experienced this firsthand when writing custom network I/O code and mistakenly using bytes for buffering.
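For readers without IPython, a rough equivalent of the comparison can be run as a plain script with the standard timeit module. This is a sketch rather than a rigorous benchmark: the function names and the value of N are arbitrary choices, and the absolute numbers will vary with the machine and Python version.

```python
import timeit

N = 20_000  # one-byte appends per run; an arbitrary choice

def concat_bytes(n=N):
    """Build a buffer by repeated bytes concatenation."""
    buf = b""
    for _ in range(n):
        buf += b"x"          # allocates a new bytes object each time
    return buf

def extend_bytearray(n=N):
    """Build a buffer by extending a bytearray in place."""
    buf = bytearray()
    for _ in range(n):
        buf.extend(b"x")     # appends to the existing object
    return buf

# Take the best of three runs for each approach.
t_bytes = min(timeit.repeat(concat_bytes, number=1, repeat=3))
t_bytearray = min(timeit.repeat(extend_bytearray, number=1, repeat=3))
print(f"bytes +=:         {t_bytes:.4f} s")
print(f"bytearray.extend: {t_bytearray:.4f} s")
```

Increasing N should widen the gap, since each bytes concatenation copies the whole buffer accumulated so far.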