Sunday, December 4, 2011

Evaluating RC4 as a pseudo-random number generator

Update (September 3, 2012): I made a mistake in my own implementation of RC4. An analysis of the mistake is documented in Mistake in my RC4 implementation. After fixing the mistake, I no longer see the skewed distribution shown here. Please be advised that none of the posts on this blog are peer reviewed, and mistakes happen.
On the quest for a pseudo-random number generator that is easy to implement, I gave up on the Mersenne Twister but took Wikipedia's advice that a stream cipher can be used to generate pseudo-random sequences. Wikipedia also mentions that RC4 is not a particularly uniform stream cipher, which makes statistical attacks possible. Because of the simplicity of RC4, I decided to implement it and see just how bad it is. Since RC4 outputs a stream of bytes, I took a histogram of the byte values. The initialization key is the integer value of a stack-allocated object pointer, repeated to fill 64 bytes, running on a 64-bit machine. I'm not sure whether the fact that the key is highly regular has anything to do with the non-uniform result that I observed below.
byte   count  byte   count  byte   count  byte   count  byte   count  byte   count  byte   count  byte   count
  0 5499938   32   50376   64   81740   96  253697  128   50004  160  144433  192  147856  224  147031
  1  134854   33   50103   65   50386   97  115718  129   50024  161   49930  193  149593  225   50260
  2   50352   34   50179   66  954060   98   50362  130   50185  162   50553  194  134910  226   78249
  3  115332   35   50006   67   50091   99   50170  131   50543  163   50289  195   50516  227   49853
  4   50214   36   50065   68   50256  100   50052  132   50288  164   49950  196   50532  228   50303
  5   50089   37   50589   69   50173  101   50118  133  143783  165   49753  197   50182  229   50135
  6  207241   38   49946   70   50416  102   50395  134   50447  166   50270  198   50145  230  134634
  7   68820   39   50147   71   49985  103   49774  135   50252  167   50502  199   49368  231   49987
  8   50212   40  103906   72   49950  104   50501  136   49803  168   50038  200   50032  232   50486
  9  695154   41   49905   73   50137  105   50389  137   50085  169   50619  201   66006  233   49849
 10   50439   42   49865   74   50472  106   50284  138   49508  170   50045  202   50336  234   50083
 11   49682   43  948555   75   50141  107   50230  139   50294  171   50719  203   50300  235   50267
 12   50378   44   49965   76  140698  108   50119  140   49772  172   50399  204   49963  236   50117
 13   50466   45   50171   77   49814  109   50012  141   50385  173   50031  205   50072  237   49945
 14   49591   46   50290   78   50219  110   90366  142   50177  174   50126  206   50585  238   50157
 15   49776   47   50340   79   49916  111   49974  143   50224  175   50513  207   50350  239   50234
 16  287052   48  205931   80  151232  112  131863  144   50079  176   50026  208   50226  240  131317
 17   50022   49   50076   81   50134  113   49801  145   50487  177   50688  209   81920  241   50161
 18  118991   50   50125   82   50108  114   50370  146   66545  178   78182  210   49998  242   50092
 19   50209   51   50706   83   50292  115   50297  147   50236  179   50411  211   50458  243   50180
 20   50170   52   50025   84  221044  116  116139  148  247793  180  215784  212   50185  244   50124
 21   50371   53   49717   85   50259  117   50274  149   50327  181  758974  213   49808  245   90855
 22   50365   54   49979   86   50302  118   50523  150   50228  182   50272  214   50685  246   50228
 23   50467   55   50145   87   49797  119  105677  151   50108  183   50390  215  113514  247   50316
 24  100861   56   78827   88  113581  120  138358  152   57005  184   53395  216  147492  248   50546
 25   50158   57   49803   89   50264  121   50209  153   50095  185   50334  217   50396  249   50078
 26   50483   58   50279   90   50176  122   49674  154   50292  186   50529  218   49915  250   50111
 27   50211   59   49997   91   50337  123   50396  155   50453  187   49836  219   50170  251   50524
 28   49859   60   50507   92   50313  124   50280  156   50059  188   50229  220  248053  252   49981
 29   50580   61   50869   93   50417  125   49877  157   50355  189   50313  221   49937  253   50533
 30   50142   62   50358   94   50292  126   50240  158   49821  190   50204  222   50139  254   49975
 31   50265   63   50165   95   50352  127  244313  159   50410  191   50529  223  202938  255  246009
The result is staggeringly non-uniform and heavily biased toward certain byte values, 0 above all. I quickly eyeballed the figures and labeled the counts with colors: red for a strong bias, yellow for a slight bias, and green for counts in the ballpark of an unbiased result yet still above the typical count in this table. The coloring is quite subjective, but I'm only using it to highlight the non-uniform bias.

Now, this observation also indicates that RC4 is a very poor stream cipher. A stream cipher works by generating a stream of pseudo-random bytes and XOR-ing it byte-wise with the plaintext to obtain the ciphertext. The prevalence of 0 values means that much of the plaintext will pass through to the ciphertext unchanged, because 0 is the XOR identity. The fact that such bad encryption is possible with some key choices should worry the many people who are still using RC4.

By a stroke of luck, I tried a slightly modified RC4 that gives me a fairly uniform distribution. The difference from the pseudo-code listing in the RC4 Wikipedia article is marked with a comment in the listing below.
i := 0
j := 0
while GeneratingOutput:
    i := (i + 1) mod 256
    j := (j + S[i]) mod 256
    swap values of S[i] and S[j]
    K := S[(S[i] + S[j]) mod 256]
    output K ^ j  // was just K before
endwhile
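For concreteness, here is one way the modified output step might look in C++. The struct name is mine, and for brevity it starts the PRGA from the identity permutation (a real cipher would run the key schedule first):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// PRGA with the post's modification: emit K ^ j instead of K.
struct ModifiedRC4Prga {
    uint8_t S[256];
    uint8_t i = 0, j = 0;

    ModifiedRC4Prga() {  // identity state; real use runs the KSA first
        for (int k = 0; k < 256; ++k) S[k] = static_cast<uint8_t>(k);
    }

    uint8_t next() {
        i = static_cast<uint8_t>(i + 1);
        j = static_cast<uint8_t>(j + S[i]);
        std::swap(S[i], S[j]);
        uint8_t K = S[static_cast<uint8_t>(S[i] + S[j])];
        return K ^ j;  // the one-line change: XOR the output byte with j
    }
};
```

Note that each modified output byte is exactly the standard output byte XOR-ed with the current value of j, so the change only post-processes the keystream; the internal state evolution is untouched.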
This modification gives me the following histogram.
byte   count  byte   count  byte   count  byte   count  byte   count  byte   count  byte   count  byte   count
  0   99723   32  100388   64   99739   96   99866  128  100026  160  100026  192   99831  224   99581
  1   99941   33   99908   65  100363   97  100328  129   99741  161  100338  193   99792  225   99776
  2   99968   34  100010   66   99589   98  100231  130  100649  162  100084  194  100218  226   99839
  3   99896   35   99924   67  100465   99  100002  131  100093  163   99815  195  100478  227  100137
  4  100439   36   99677   68  100004  100   99960  132  100165  164   99756  196  100064  228  100114
  5  100081   37   99362   69   99664  101   99872  133   99887  165  100059  197   99523  229  100590
  6  100570   38   99924   70   99648  102   99945  134   99459  166   99854  198  100170  230  100069
  7   99812   39   99706   71  100066  103   99598  135   99532  167   99499  199   99956  231   99892
  8   99969   40  100484   72  100074  104   99506  136   99831  168  100291  200  100164  232   99562
  9   99840   41   99756   73   99128  105  100178  137   99646  169  100348  201  100121  233   99412
 10   99793   42  100133   74  100019  106  100488  138   99882  170   99661  202   99864  234   99878
 11  100208   43  100059   75   99717  107  100141  139  100374  171  100116  203   99643  235  100394
 12  100157   44   99772   76   99846  108  100085  140  100180  172   99632  204   99928  236  100123
 13  100098   45  100232   77  100017  109  100654  141   99955  173   99993  205   99858  237  100188
 14  100017   46  100135   78   99566  110   99104  142   99640  174   99256  206  100243  238   99918
 15  100259   47  100403   79   99459  111   99887  143  100255  175   99818  207   99612  239   99840
 16  100139   48   99888   80  100828  112   99999  144   99820  176  100378  208   99663  240   99622
 17  100093   49  100173   81   99633  113  100480  145   99804  177   99826  209  100564  241   99916
 18  100257   50   99987   82  100502  114  100253  146  100554  178   99998  210  100183  242  100319
 19   99835   51  100206   83   99776  115   99674  147  100726  179   99909  211  100548  243   99803
 20  100010   52   99879   84  100188  116   99657  148  100255  180   99687  212   99937  244  100122
 21  100424   53  100257   85  100197  117  100375  149   99866  181   99741  213  100669  245   99972
 22  100062   54   99376   86  100123  118  100396  150   99623  182  100523  214  100195  246  100034
 23   99729   55  100404   87  100620  119  100275  151   99594  183   99879  215   99921  247  100092
 24  100076   56   99792   88  100383  120   99712  152  100641  184   99514  216  100226  248  100600
 25  100107   57  100214   89  100079  121   99682  153   99723  185   99764  217  100120  249   99574
 26   99785   58  100284   90   99835  122  100090  154   99805  186   99752  218   99850  250  100211
 27   99839   59   99459   91  100332  123  100185  155   99557  187  100176  219  100167  251  100213
 28   99700   60   99795   92   99966  124  100599  156   99709  188  100422  220  100028  252  100368
 29   99776   61   99663   93   99596  125   99479  157  100266  189  100427  221  100172  253   99842
 30  100319   62   99734   94   99962  126   99874  158  100068  190  100238  222  100058  254   99921
 31   99537   63  100025   95  100043  127  100382  159   99910  191   99824  223  100697  255   99923
This is amazingly uniform, or at least it appears good enough for my purpose. It remains to be seen whether sequences of two or more consecutive output bytes retain the same uniformity.

I think it is quite possible that this slight modification is immune to the existing statistical attacks on RC4.

Friday, December 2, 2011

Interface checking

One of my regrets for C++ is that it doesn't have interface checking when you instantiate templates. Coupled with delayed instantiation (the body of a template method is not fully type-checked until it is used), static type checking of template code has been hit or miss. Also, when you do get a type error, it is incomprehensible, riddled with instantiation details that shouldn't be there in the first place. Interface checking means that the compiler can do better type checking before template instantiation. Interface checking has been in development for some time (e.g. Boost Concept Check, ConceptCpp, N2773), but it didn't make it into the C++11 standard.

Interface checking in most programming languages follows a fulfillment model: an interface specifies a collection of method signatures, and if a class implements all of those functions (fulfills all the signatures), it is said to implement the interface. This is the case for Go's structural interfaces, and for the C++ concept checking work so far. (Java, by contrast, requires a class to name the interfaces it implements.)

However, if I were to design interface checking for a new language, I would choose a declarative model in which a class must explicitly declare the interfaces it intends to implement, and each interface is uniquely identifiable in the program's namespace. The motivation is to prevent a class from unintentionally implementing an interface by coincidence.

This is especially important for C++ because so many operators are overloaded. For example, a random access iterator looks a lot like an integral type at the interface level (in fact its interface is a superset, since you can also dereference an iterator), but manipulating the iterator like an integer without observing begin() and end() will cause it to go out of bounds.

This semantic gap can be bridged by a more precise type specification using dependent types. (These are not to be confused with C++ dependent types: a C++ dependent type is a class that is complete if and only if a template argument refers to a complete type, where "complete" simply means the compiler knows its memory layout, so you can apply sizeof to it.) But we don't yet have an object-oriented dependent type system with interface checking. Even if we could bridge the semantic gap, I argue that declarative interface checking is no less powerful than fulfillment interface checking.

Suppose you have a class Foo that implements all the methods required by interface Xyzzy, but Foo doesn't declare so. In the fulfillment model, Foo can be used to instantiate any template (generic) that requires interface Xyzzy; not so in the declarative model, which requires Foo to explicitly state that it implements Xyzzy. Even if you cannot modify the definition of Foo, all you need to do is subclass Foo, declare that the subclass implements Xyzzy, and pass that to the template.

Another idea I have (in the context of C++ concept checking, at least) is to get rid of the distinction between a class and an interface: any class could be used as an interface, since C++ allows declaration and implementation to be separate. Furthermore, declaring that a class implements another class would not make it a subclass of that class; the compiler would only check that the method signatures are congruent.