So, like, what is a byte?

A friend of mine in the filmmaking business, who is exceedingly bright but has never worked with SQL Server before, was reading through the first five posts of this Database Fundamentals series, and asked a great question:

“I guess I’m not understanding what a byte is. I think I’m circling the drain in understanding it, but not floating down.”

She has a way with words.

I answered her immediately, but it reminded me that I did get a little carried away with data types, assuming that everyone reading that post would understand what a byte is.

In the innards of the computer is the CPU, or Central Processing Unit (there might be more than one in a server). The CPU is best described as a hot mess of on-off switches. Just as it is in your house, a switch only has two states.

David Hasselhoff, SQL Server DBA

This is what “binary” means. When the CPU clock ticks over, billions of times per second, if a switch is closed, it’s a 1 (electricity can flow to complete the circuit). If the switch is open, it’s a 0 (electricity cannot pass through it).

The CPU (and memory, and storage system, and network) understand binary, and the software that sits on top of it uses binary as well.

We end up with a series of 1s and 0s that, when arranged in different combinations, represent information in some form or another. Each of these is a binary digit, or bit.

Through a series of decisions in the old days of computing, when we stick eight of these bits of data together, they form a byte.

Now comes the mathematical part of today’s post.

If we have 8 digits that can store two values each, we get a total of 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 combinations. This is more easily typed as 2^8, or 256. In other words, a byte can store a maximum of 256 values.

Here’s a short list of bytes to give you an example (I have not listed every one of the 256 possibilities). We write the bits in groups of four to make them easier to read.

Binary Decimal ASCII Binary Decimal ASCII
0010 0000 32 <space> 1000 0001 129 Å
0010 0001 33 ! 1000 0010 130 Ç
0010 0010 34 1000 0011 131 É
0010 0011 35 # 1000 0100 132 Ñ
0010 0100 36 $ 1000 0101 133 Ö
0010 0101 37 % 1000 0110 134 Ü
0010 0110 38 & 1000 0111 135 á
0010 0111 39 1000 1000 136 à
0010 1000 40 ( 1000 1001 137 â
0010 1001 41 ) 1000 1010 138 ä
0010 1010 42 * 1000 1011 139 ã
0010 1011 43 + 1000 1100 140 å
0010 1100 44 , 1000 1101 141 ç
0010 1101 45 1000 1110 142 é
0010 1110 46 . 1000 1111 143 è
0010 1111 47 / 1001 0000 144 ê
0011 0000 48 0 1001 0001 145 ë
0011 0001 49 1 1001 0010 146 í
0011 0010 50 2 1001 0011 147 ì
0011 0011 51 3 1001 0100 148 î
0011 0100 52 4 1001 0101 149 ï
0011 0101 53 5 1001 0110 150 ñ
0011 0110 54 6 1001 0111 151 ó
0011 0111 55 7 1001 1000 152 ò
0011 1000 56 8 1001 1001 153 ô
0011 1001 57 9 1001 1010 154 ö
0011 1010 58 : 1001 1011 155 õ
0011 1011 59 ; 1001 1100 156 ú
0011 1100 60 < 1001 1101 157 ù
0011 1101 61 = 1001 1110 158 û
0011 1110 62 > 1001 1111 159 ü
0011 1111 63 ? 1010 0000 160
0100 0000 64 @ 1010 0001 161 °
0100 0001 65 A 1010 0010 162 ¢
0100 0010 66 B 1010 0011 163 £
0100 0011 67 C 1010 0100 164 §
0100 0100 68 D 1010 0101 165
0100 0101 69 E 1010 0110 166
0100 0110 70 F 1010 0111 167 ß
0100 0111 71 G 1010 1000 168 ®
0100 1000 72 H 1010 1001 169 ©
0100 1001 73 I 1010 1010 170
0100 1010 74 J 1010 1011 171 ´
0100 1011 75 K 1010 1100 172 ¨
0100 1100 76 L 1010 1101 173
0100 1101 77 M 1010 1110 174 Æ
0100 1110 78 N 1010 1111 175 Ø
0100 1111 79 O 1011 0000 176
0101 0000 80 P 1011 0001 177 ±
0101 0001 81 Q 1011 0010 178
0101 0010 82 R 1011 0011 179
0101 0011 83 S 1011 0100 180 ¥
0101 0100 84 T 1011 0101 181 µ
0101 0101 85 U 1011 0110 182
0101 0110 86 V 1011 0111 183
0101 0111 87 W 1011 1000 184
0101 1000 88 X 1011 1001 185 π
0101 1001 89 Y 1011 1010 186
0101 1010 90 Z 1011 1011 187 ª
0101 1011 91 [ 1011 1100 188 º
0101 1100 92 \ 1011 1101 189 Ω
0101 1101 93 ] 1011 1110 190 æ
0101 1110 94 ^ 1011 1111 191 ø
0101 1111 95 _ 1100 0000 192 ¿
0110 0000 96 ` 1100 0001 193 ¡
0110 0001 97 a 1100 0010 194 ¬
0110 0010 98 b 1100 0011 195
0110 0011 99 c 1100 0100 196 ƒ
0110 0100 100 d 1100 0101 197
0110 0101 101 e 1100 0110 198
0110 0110 102 f 1100 0111 199 «
0110 0111 103 g 1100 1000 200 »
0110 1000 104 h 1100 1001 201
0110 1001 105 i 1100 1010 202
0110 1010 106 j 1100 1011 203 À
0110 1011 107 k 1100 1100 204 Ã
0110 1100 108 l 1100 1101 205 Õ
0110 1101 109 m 1100 1110 206 Œ
0110 1110 110 n 1100 1111 207 œ
0110 1111 111 o 1101 0000 208
0111 0000 112 p 1101 0001 209
0111 0001 113 q 1101 0010 210
0111 0010 114 r 1101 0011 211
0111 0011 115 s 1101 0100 212
0111 0100 116 t 1101 0101 213
0111 0101 117 u 1101 0110 214 ÷
0111 0110 118 v 1101 0111 215
0111 0111 119 w 1101 1000 216 ÿ
0111 1000 120 x 1101 1001 217 Ÿ
0111 1001 121 y 1101 1010 218
0111 1010 122 z 1101 1011 219
0111 1011 123 { 1101 1100 220
0111 1100 124 | 1101 1101 221
0111 1101 125 } 1101 1110 222
0111 1110 126 ~ 1101 1111 223
0111 1111 127 1110 0000 224
1000 0000 128 Ä 1110 0001 225 ·

There are values missing from the above table, for characters that cannot be displayed correctly in a web browser. For a complete table showing all 256 characters, visit PC Guide.com.

How does this affect Unicode values? If you remember in our post about CHAR, NCHAR, VARCHAR and NVARCHAR data types, we discovered that the Unicode versions (those types starting with N) will use two bytes in memory and on disk to store a single character, compared to the non-Unicode (sometimes called ASCII or plain text) data types, which use only one byte per character.

The high-level reason for this is that some alphabets have more than 256 characters, so the code page (the full set of characters in upper- and lower-case where applicable, plus all the numbers, punctuation marks, and so forth) won’t fit in the 256 possibilities available in a single byte.

When we stick two bytes together however, we suddenly have as many as 2^16 values that we can store, for a total of 65,536 possibilities. This is mostly good enough if you’re not storing Japanese in SQL Server.

There are exceptions to this, where some kanji takes up four bytes per character. This is known as UTF-32 (Unicode Transformation Format, 32 bits per character). The good news is, SQL Server does support multi-byte characters wider than standard (UTF-16) Unicode, as long as we pick the correct collation.

I hope this answers any burning questions you may have had about bits and bytes.

Feel free to reach out to me on Twitter at @bornsql.

BONUS! Check out this great video on what a CPU looks like at a massive scale (link from Leon Adato):

%d bloggers like this: