## Sunday, December 4, 2011

### Evaluating RC4 as pseudo-random number generator

Update (September 3, 2012): I made some mistakes in my own implementation of RC4. An analysis of the mistake is documented in Mistake in my RC4 implementation. After fixing the mistake, I no longer see the skewed distribution shown here. Please be advised that none of the posts on this blog are peer reviewed, and mistakes happen.
On the quest for a pseudo-random number generator that is easy to implement, I gave up on Mersenne Twister but took Wikipedia's advice that a stream cipher can be used to generate pseudo-random sequences. The article also notes that RC4's output is not particularly uniform, which makes statistical attacks possible. Because of RC4's simplicity, I decided to implement it and see how bad it really is. Since RC4 outputs a stream of bytes, I took a histogram of the byte values. The initialization key is the integer value of a stack-allocated object pointer, repeated to fill 64 bytes, run on a 64-bit machine. I'm not sure whether the fact that the key is highly regular has anything to do with the non-uniform result that I observed below.
| byte | count | byte | count | byte | count | byte | count | byte | count | byte | count | byte | count | byte | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5499938 | 32 | 50376 | 64 | 81740 | 96 | 253697 | 128 | 50004 | 160 | 144433 | 192 | 147856 | 224 | 147031 |
| 1 | 134854 | 33 | 50103 | 65 | 50386 | 97 | 115718 | 129 | 50024 | 161 | 49930 | 193 | 149593 | 225 | 50260 |
| 2 | 50352 | 34 | 50179 | 66 | 954060 | 98 | 50362 | 130 | 50185 | 162 | 50553 | 194 | 134910 | 226 | 78249 |
| 3 | 115332 | 35 | 50006 | 67 | 50091 | 99 | 50170 | 131 | 50543 | 163 | 50289 | 195 | 50516 | 227 | 49853 |
| 4 | 50214 | 36 | 50065 | 68 | 50256 | 100 | 50052 | 132 | 50288 | 164 | 49950 | 196 | 50532 | 228 | 50303 |
| 5 | 50089 | 37 | 50589 | 69 | 50173 | 101 | 50118 | 133 | 143783 | 165 | 49753 | 197 | 50182 | 229 | 50135 |
| 6 | 207241 | 38 | 49946 | 70 | 50416 | 102 | 50395 | 134 | 50447 | 166 | 50270 | 198 | 50145 | 230 | 134634 |
| 7 | 68820 | 39 | 50147 | 71 | 49985 | 103 | 49774 | 135 | 50252 | 167 | 50502 | 199 | 49368 | 231 | 49987 |
| 8 | 50212 | 40 | 103906 | 72 | 49950 | 104 | 50501 | 136 | 49803 | 168 | 50038 | 200 | 50032 | 232 | 50486 |
| 9 | 695154 | 41 | 49905 | 73 | 50137 | 105 | 50389 | 137 | 50085 | 169 | 50619 | 201 | 66006 | 233 | 49849 |
| 10 | 50439 | 42 | 49865 | 74 | 50472 | 106 | 50284 | 138 | 49508 | 170 | 50045 | 202 | 50336 | 234 | 50083 |
| 11 | 49682 | 43 | 948555 | 75 | 50141 | 107 | 50230 | 139 | 50294 | 171 | 50719 | 203 | 50300 | 235 | 50267 |
| 12 | 50378 | 44 | 49965 | 76 | 140698 | 108 | 50119 | 140 | 49772 | 172 | 50399 | 204 | 49963 | 236 | 50117 |
| 13 | 50466 | 45 | 50171 | 77 | 49814 | 109 | 50012 | 141 | 50385 | 173 | 50031 | 205 | 50072 | 237 | 49945 |
| 14 | 49591 | 46 | 50290 | 78 | 50219 | 110 | 90366 | 142 | 50177 | 174 | 50126 | 206 | 50585 | 238 | 50157 |
| 15 | 49776 | 47 | 50340 | 79 | 49916 | 111 | 49974 | 143 | 50224 | 175 | 50513 | 207 | 50350 | 239 | 50234 |
| 16 | 287052 | 48 | 205931 | 80 | 151232 | 112 | 131863 | 144 | 50079 | 176 | 50026 | 208 | 50226 | 240 | 131317 |
| 17 | 50022 | 49 | 50076 | 81 | 50134 | 113 | 49801 | 145 | 50487 | 177 | 50688 | 209 | 81920 | 241 | 50161 |
| 18 | 118991 | 50 | 50125 | 82 | 50108 | 114 | 50370 | 146 | 66545 | 178 | 78182 | 210 | 49998 | 242 | 50092 |
| 19 | 50209 | 51 | 50706 | 83 | 50292 | 115 | 50297 | 147 | 50236 | 179 | 50411 | 211 | 50458 | 243 | 50180 |
| 20 | 50170 | 52 | 50025 | 84 | 221044 | 116 | 116139 | 148 | 247793 | 180 | 215784 | 212 | 50185 | 244 | 50124 |
| 21 | 50371 | 53 | 49717 | 85 | 50259 | 117 | 50274 | 149 | 50327 | 181 | 758974 | 213 | 49808 | 245 | 90855 |
| 22 | 50365 | 54 | 49979 | 86 | 50302 | 118 | 50523 | 150 | 50228 | 182 | 50272 | 214 | 50685 | 246 | 50228 |
| 23 | 50467 | 55 | 50145 | 87 | 49797 | 119 | 105677 | 151 | 50108 | 183 | 50390 | 215 | 113514 | 247 | 50316 |
| 24 | 100861 | 56 | 78827 | 88 | 113581 | 120 | 138358 | 152 | 57005 | 184 | 53395 | 216 | 147492 | 248 | 50546 |
| 25 | 50158 | 57 | 49803 | 89 | 50264 | 121 | 50209 | 153 | 50095 | 185 | 50334 | 217 | 50396 | 249 | 50078 |
| 26 | 50483 | 58 | 50279 | 90 | 50176 | 122 | 49674 | 154 | 50292 | 186 | 50529 | 218 | 49915 | 250 | 50111 |
| 27 | 50211 | 59 | 49997 | 91 | 50337 | 123 | 50396 | 155 | 50453 | 187 | 49836 | 219 | 50170 | 251 | 50524 |
| 28 | 49859 | 60 | 50507 | 92 | 50313 | 124 | 50280 | 156 | 50059 | 188 | 50229 | 220 | 248053 | 252 | 49981 |
| 29 | 50580 | 61 | 50869 | 93 | 50417 | 125 | 49877 | 157 | 50355 | 189 | 50313 | 221 | 49937 | 253 | 50533 |
| 30 | 50142 | 62 | 50358 | 94 | 50292 | 126 | 50240 | 158 | 49821 | 190 | 50204 | 222 | 50139 | 254 | 49975 |
| 31 | 50265 | 63 | 50165 | 95 | 50352 | 127 | 244313 | 159 | 50410 | 191 | 50529 | 223 | 202938 | 255 | 246009 |
The result is staggeringly non-uniform and heavily biased toward 0. I quickly eyeballed the figures and labeled the counts with colors: red means a strong bias, yellow a slight bias, and green is in the ballpark of an unbiased result, though still above the typical count in this table. The coloring is quite subjective, but I'm only using it to highlight the non-uniform bias.
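For reference, here is a minimal C++ sketch of the textbook algorithm: the key-scheduling pass (KSA) followed by the output loop (PRGA), transliterated from the usual pseudo-code. This is not my original buggy code; the class name and the string-keyed constructor are my own choices for the sketch. A correct implementation reproduces the standard test vectors (e.g., the keystream for key "Key" begins EB 9F 77 81).

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Textbook RC4: key-scheduling algorithm (KSA) plus the
// pseudo-random generation algorithm (PRGA).
class RC4 {
public:
    explicit RC4(const std::string& key) : i_(0), j_(0) {
        for (int k = 0; k < 256; ++k) S_[k] = static_cast<uint8_t>(k);
        int j = 0;
        for (int k = 0; k < 256; ++k) {  // KSA: key-dependent permutation
            j = (j + S_[k] + static_cast<unsigned char>(key[k % key.size()])) & 0xFF;
            std::swap(S_[k], S_[j]);
        }
    }
    uint8_t next() {                     // PRGA: one keystream byte
        i_ = (i_ + 1) & 0xFF;
        j_ = (j_ + S_[i_]) & 0xFF;
        std::swap(S_[i_], S_[j_]);
        return S_[(S_[i_] + S_[j_]) & 0xFF];
    }
private:
    uint8_t S_[256];  // the internal permutation of 0..255
    int i_, j_;
};
```

Feeding each `next()` byte into a 256-bin counter is all the histogram code amounts to.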

Now, this observation also indicates that RC4 is a very poor stream cipher. The way a stream cipher works is essentially that a stream of pseudo-random bytes is generated and XOR-ed byte-wise with the plaintext to obtain the ciphertext. The prevalence of 0 values means that much of the plaintext will pass through to the ciphertext unchanged, because 0 is the XOR identity. The fact that such bad encryption is possible with some key choices should worry a lot of people who are still using RC4.
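To make the XOR-identity point concrete, here is a toy sketch (the function name is illustrative): encryption and decryption are the same XOR pass, and a zero keystream byte leaves the corresponding plaintext byte exposed.

```cpp
#include <cstddef>
#include <string>

// Byte-wise XOR of text with a (repeating) keystream.
// A keystream byte of 0 leaves the plaintext byte unchanged.
std::string xor_with_keystream(const std::string& text, const std::string& ks) {
    std::string out = text;
    for (std::size_t n = 0; n < out.size(); ++n)
        out[n] = static_cast<char>(out[n] ^ ks[n % ks.size()]);
    return out;
}
```

XOR-ing twice with the same keystream recovers the plaintext, which is why the same function serves for both directions.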

By a stroke of luck, I tried a slightly modified RC4 that actually gives me a fairly uniform distribution. The difference from the pseudo-code listing in the RC4 Wikipedia article is marked with a comment below.
    i := 0
    j := 0
    while GeneratingOutput:
        i := (i + 1) mod 256
        j := (j + S[i]) mod 256
        swap values of S[i] and S[j]
        K := S[(S[i] + S[j]) mod 256]
        output K ^ j  // was just K before
    endwhile
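Transliterated into C++, the modified generator might look like the sketch below. I'm assuming the key-scheduling step is unchanged from standard RC4; only the output line differs.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// RC4 with the modified output step: emit K ^ j instead of K.
// Key setup (KSA) is the standard RC4 key-scheduling algorithm.
class RC4ModJ {
public:
    explicit RC4ModJ(const std::string& key) : i_(0), j_(0) {
        for (int k = 0; k < 256; ++k) S_[k] = static_cast<uint8_t>(k);
        int j = 0;
        for (int k = 0; k < 256; ++k) {
            j = (j + S_[k] + static_cast<unsigned char>(key[k % key.size()])) & 0xFF;
            std::swap(S_[k], S_[j]);
        }
    }
    uint8_t next() {
        i_ = (i_ + 1) & 0xFF;
        j_ = (j_ + S_[i_]) & 0xFF;
        std::swap(S_[i_], S_[j_]);
        uint8_t K = S_[(S_[i_] + S_[j_]) & 0xFF];
        return K ^ static_cast<uint8_t>(j_);  // the modification: fold in j
    }
private:
    uint8_t S_[256];
    int i_, j_;
};
```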

This modification gives me the following histogram.
| byte | count | byte | count | byte | count | byte | count | byte | count | byte | count | byte | count | byte | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99723 | 32 | 100388 | 64 | 99739 | 96 | 99866 | 128 | 100026 | 160 | 100026 | 192 | 99831 | 224 | 99581 |
| 1 | 99941 | 33 | 99908 | 65 | 100363 | 97 | 100328 | 129 | 99741 | 161 | 100338 | 193 | 99792 | 225 | 99776 |
| 2 | 99968 | 34 | 100010 | 66 | 99589 | 98 | 100231 | 130 | 100649 | 162 | 100084 | 194 | 100218 | 226 | 99839 |
| 3 | 99896 | 35 | 99924 | 67 | 100465 | 99 | 100002 | 131 | 100093 | 163 | 99815 | 195 | 100478 | 227 | 100137 |
| 4 | 100439 | 36 | 99677 | 68 | 100004 | 100 | 99960 | 132 | 100165 | 164 | 99756 | 196 | 100064 | 228 | 100114 |
| 5 | 100081 | 37 | 99362 | 69 | 99664 | 101 | 99872 | 133 | 99887 | 165 | 100059 | 197 | 99523 | 229 | 100590 |
| 6 | 100570 | 38 | 99924 | 70 | 99648 | 102 | 99945 | 134 | 99459 | 166 | 99854 | 198 | 100170 | 230 | 100069 |
| 7 | 99812 | 39 | 99706 | 71 | 100066 | 103 | 99598 | 135 | 99532 | 167 | 99499 | 199 | 99956 | 231 | 99892 |
| 8 | 99969 | 40 | 100484 | 72 | 100074 | 104 | 99506 | 136 | 99831 | 168 | 100291 | 200 | 100164 | 232 | 99562 |
| 9 | 99840 | 41 | 99756 | 73 | 99128 | 105 | 100178 | 137 | 99646 | 169 | 100348 | 201 | 100121 | 233 | 99412 |
| 10 | 99793 | 42 | 100133 | 74 | 100019 | 106 | 100488 | 138 | 99882 | 170 | 99661 | 202 | 99864 | 234 | 99878 |
| 11 | 100208 | 43 | 100059 | 75 | 99717 | 107 | 100141 | 139 | 100374 | 171 | 100116 | 203 | 99643 | 235 | 100394 |
| 12 | 100157 | 44 | 99772 | 76 | 99846 | 108 | 100085 | 140 | 100180 | 172 | 99632 | 204 | 99928 | 236 | 100123 |
| 13 | 100098 | 45 | 100232 | 77 | 100017 | 109 | 100654 | 141 | 99955 | 173 | 99993 | 205 | 99858 | 237 | 100188 |
| 14 | 100017 | 46 | 100135 | 78 | 99566 | 110 | 99104 | 142 | 99640 | 174 | 99256 | 206 | 100243 | 238 | 99918 |
| 15 | 100259 | 47 | 100403 | 79 | 99459 | 111 | 99887 | 143 | 100255 | 175 | 99818 | 207 | 99612 | 239 | 99840 |
| 16 | 100139 | 48 | 99888 | 80 | 100828 | 112 | 99999 | 144 | 99820 | 176 | 100378 | 208 | 99663 | 240 | 99622 |
| 17 | 100093 | 49 | 100173 | 81 | 99633 | 113 | 100480 | 145 | 99804 | 177 | 99826 | 209 | 100564 | 241 | 99916 |
| 18 | 100257 | 50 | 99987 | 82 | 100502 | 114 | 100253 | 146 | 100554 | 178 | 99998 | 210 | 100183 | 242 | 100319 |
| 19 | 99835 | 51 | 100206 | 83 | 99776 | 115 | 99674 | 147 | 100726 | 179 | 99909 | 211 | 100548 | 243 | 99803 |
| 20 | 100010 | 52 | 99879 | 84 | 100188 | 116 | 99657 | 148 | 100255 | 180 | 99687 | 212 | 99937 | 244 | 100122 |
| 21 | 100424 | 53 | 100257 | 85 | 100197 | 117 | 100375 | 149 | 99866 | 181 | 99741 | 213 | 100669 | 245 | 99972 |
| 22 | 100062 | 54 | 99376 | 86 | 100123 | 118 | 100396 | 150 | 99623 | 182 | 100523 | 214 | 100195 | 246 | 100034 |
| 23 | 99729 | 55 | 100404 | 87 | 100620 | 119 | 100275 | 151 | 99594 | 183 | 99879 | 215 | 99921 | 247 | 100092 |
| 24 | 100076 | 56 | 99792 | 88 | 100383 | 120 | 99712 | 152 | 100641 | 184 | 99514 | 216 | 100226 | 248 | 100600 |
| 25 | 100107 | 57 | 100214 | 89 | 100079 | 121 | 99682 | 153 | 99723 | 185 | 99764 | 217 | 100120 | 249 | 99574 |
| 26 | 99785 | 58 | 100284 | 90 | 99835 | 122 | 100090 | 154 | 99805 | 186 | 99752 | 218 | 99850 | 250 | 100211 |
| 27 | 99839 | 59 | 99459 | 91 | 100332 | 123 | 100185 | 155 | 99557 | 187 | 100176 | 219 | 100167 | 251 | 100213 |
| 28 | 99700 | 60 | 99795 | 92 | 99966 | 124 | 100599 | 156 | 99709 | 188 | 100422 | 220 | 100028 | 252 | 100368 |
| 29 | 99776 | 61 | 99663 | 93 | 99596 | 125 | 99479 | 157 | 100266 | 189 | 100427 | 221 | 100172 | 253 | 99842 |
| 30 | 100319 | 62 | 99734 | 94 | 99962 | 126 | 99874 | 158 | 100068 | 190 | 100238 | 222 | 100058 | 254 | 99921 |
| 31 | 99537 | 63 | 100025 | 95 | 100043 | 127 | 100382 | 159 | 99910 | 191 | 99824 | 223 | 100697 | 255 | 99923 |
This is amazingly uniform, or at least it appears good enough for my purpose. It remains to be seen whether two or more consecutive bytes of output retain the same uniformity.

I think it is quite possible that this slight modification could be immune to the existing statistical attacks on RC4.

## Friday, December 2, 2011

### Interface checking

One of my regrets about C++ is that it doesn't do interface checking when you instantiate templates. Coupled with delayed instantiation (the body of a method is not fully type checked until it is used), static type checking of template code has been hit or miss. And when you do get a type error, it is incomprehensible, riddled with instantiation details that shouldn't be there in the first place. Interface checking means the compiler can do better type checking before template instantiation. It has been in development for some time (e.g., Boost Concept Check, ConceptCpp, N2773), but it didn't make it into the C++11 standard.

Interface checking in most programming languages follows a fulfillment model: an interface specifies a collection of method signatures, and if a class implements all of them (fulfills all the signatures), it is said to implement the interface. This is the case for the Go language's structural interfaces and for the C++ concept checking developed so far. (Java also checks fulfillment, but it additionally requires an explicit implements declaration, which is closer to the model I describe next.)
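Fulfillment-style checking can be approximated in C++ with the usual expression-SFINAE detection idiom. A sketch, with made-up type and method names: the trait is true for any type that happens to provide the required signature, whether or not the type ever meant to implement the interface.

```cpp
#include <type_traits>
#include <utility>

// Fulfillment model: has_quack<T> is true exactly when T provides a
// callable quack(). No declaration of intent is required anywhere.
template <typename T, typename = void>
struct has_quack : std::false_type {};

template <typename T>
struct has_quack<T, decltype(std::declval<T&>().quack(), void())>
    : std::true_type {};

struct Duck   { void quack() {} };
struct Person { void quack() {} };  // fulfills the interface by coincidence
struct Rock   {};                   // does not fulfill it
```

Note that `Person` passes the check just as well as `Duck`, which is exactly the accidental-conformance problem discussed below.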

However, if I were to design interfaces for a new language, I would choose a declarative model, where a class has to explicitly declare the interface it intends to implement, and the interface is uniquely identifiable in the program's namespace. The motivation is to prevent a class from unintentionally implementing an interface by coincidence.

This is especially important for C++, since many operators are overloaded. For example, a random access iterator looks a lot like an integral type at the interface level (in fact it is a super concept, since you can also dereference an iterator), but manipulating the iterator like an integral without observing begin() and end() will move it out of bounds.

This semantic gap can be bridged by a more precise type specification using dependent types (not to be confused with C++'s notion of a dependent type, a class that is complete if and only if a template argument refers to a complete type; "complete" here simply means the compiler knows the type's memory layout, so you can take sizeof(type), for example, to get its size). But we don't yet have an object-oriented dependent type system with interface checking. Even if we could bridge the semantic gap, I argue that declarative interface checking is no less powerful than fulfillment interface checking.

Suppose you have a class Foo that implements all the methods required by interface Xyzzy, but doesn't declare so. In the fulfillment model, Foo can be used to instantiate any template (generic) that requires interface Xyzzy; not so in the declarative model, which requires Foo to explicitly state that it implements Xyzzy. But even if you cannot modify the definition of Foo, all you need to do is subclass Foo, declare that the subclass implements Xyzzy, and pass that to the template.
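The declarative model can be emulated in today's C++ with an explicit trait; all names below are illustrative. Foo has the right methods but never opts in, so the generic rejects it; a trivial subclass that declares the interface is accepted.

```cpp
// Declarative model: a type counts as implementing Xyzzy only if
// someone explicitly says so by specializing this trait.
template <typename T>
struct implements_xyzzy { static const bool value = false; };

// Foo happens to have the methods Xyzzy would require
// (method-signature checking is elided in this sketch).
struct Foo { void frobnicate() {} };

// Adapter: subclass Foo and declare the interface explicitly.
struct FooAsXyzzy : Foo {};
template <>
struct implements_xyzzy<FooAsXyzzy> { static const bool value = true; };

// A generic that requires the declared interface.
template <typename T>
int use_xyzzy(T&) {
    static_assert(implements_xyzzy<T>::value, "T must declare Xyzzy");
    return 1;
}
```

Passing a plain `Foo` to `use_xyzzy` fails to compile with a direct message, while `FooAsXyzzy` is accepted; the opt-in is one line of code.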

Another idea I have (at least in the context of C++ concept checking) is to get rid of the distinction between a class and an interface: any class can be used as an interface, because C++ allows declaration and implementation to be separate. Furthermore, when a class is said to implement another class, that does not make it a subclass of the other class; the compiler only checks that the method signatures are congruent.