Python: Encode UTF-8 to Base64
The Python class above converts a text string from UTF-8 to Base64, but perhaps you wanted to get down to the nitty-gritty.
A single Base64 digit represents 6 binary bits. For example, lowercase a is represented by index number 26, and binary number 011010
, since 0 + 16 + 8 + 0 + 2 + 0 = 26.
For this example, Base64’s 6-bit binary numbers will be separated out of three concatenated 8-bit binary numbers which represent the UTF-8 characters in the string abc
. For example, when the binary numbers 01100001
01100010
01100011
concatenated to 011000010110001001100011
then separate out to 011000
010110
001001
100011
.
In the Python interactive shell, the string abc
can be manually converted to 6-bit numbers:
>>> ord('a')
97
>>> ord('b')
98
>>> ord('c')
99
Since Python’s bin()
builtin removes leading zeros, the binary numbers need to be padded with leading zeros in order to end up with eight digits each. Below, three binary numbers are concatenated:
>>> bin(97)[2:].rjust(8, '0') + bin(98)[2:].rjust(8, '0') + bin(99)[2:].rjust(8, '0')
'011000010110001001100011'
Then that long binary string needs to be split up into 6-bit pieces, and then the separated 6-bit binary numbers can be converted into characters from a Base64 table:
- Y = 011000
- W = 010110
- J = 001001
- j = 100011
That is how UTF-8 abc
converts to Base64 YWJj
, the hard way.