Sunday, December 4, 2011

Evaluating RC4 as a pseudo-random number generator

Update (September 3, 2012): I made a mistake in my own implementation of RC4. An analysis of the mistake is documented in Mistake in my RC4 implementation. After fixing the mistake, I no longer see the skewed distribution shown here. Please be advised that none of the posts on this blog are peer reviewed, and mistakes happen.
On the quest for a pseudo-random number generator that is easy to implement, I gave up on the Mersenne Twister but took Wikipedia's advice that a stream cipher can be used to generate pseudo-random sequences. Wikipedia also mentions that RC4 is not a particularly uniform stream cipher, which makes statistical attacks possible. Because of the simplicity of RC4, I decided to implement it and see just how bad it is. Since RC4 outputs a stream of bytes, I took a histogram of the byte values. The initialization key is the integer value of a stack-allocated object pointer, repeated to fill 64 bytes, running on a 64-bit machine. I'm not sure whether the fact that the key is highly regular has anything to do with the non-uniform result that I observed below.
byte   count  byte   count  byte   count  byte   count  byte   count  byte   count  byte   count  byte   count
  0 5499938   32   50376   64   81740   96  253697  128   50004  160  144433  192  147856  224  147031
  1  134854   33   50103   65   50386   97  115718  129   50024  161   49930  193  149593  225   50260
  2   50352   34   50179   66  954060   98   50362  130   50185  162   50553  194  134910  226   78249
  3  115332   35   50006   67   50091   99   50170  131   50543  163   50289  195   50516  227   49853
  4   50214   36   50065   68   50256  100   50052  132   50288  164   49950  196   50532  228   50303
  5   50089   37   50589   69   50173  101   50118  133  143783  165   49753  197   50182  229   50135
  6  207241   38   49946   70   50416  102   50395  134   50447  166   50270  198   50145  230  134634
  7   68820   39   50147   71   49985  103   49774  135   50252  167   50502  199   49368  231   49987
  8   50212   40  103906   72   49950  104   50501  136   49803  168   50038  200   50032  232   50486
  9  695154   41   49905   73   50137  105   50389  137   50085  169   50619  201   66006  233   49849
 10   50439   42   49865   74   50472  106   50284  138   49508  170   50045  202   50336  234   50083
 11   49682   43  948555   75   50141  107   50230  139   50294  171   50719  203   50300  235   50267
 12   50378   44   49965   76  140698  108   50119  140   49772  172   50399  204   49963  236   50117
 13   50466   45   50171   77   49814  109   50012  141   50385  173   50031  205   50072  237   49945
 14   49591   46   50290   78   50219  110   90366  142   50177  174   50126  206   50585  238   50157
 15   49776   47   50340   79   49916  111   49974  143   50224  175   50513  207   50350  239   50234
 16  287052   48  205931   80  151232  112  131863  144   50079  176   50026  208   50226  240  131317
 17   50022   49   50076   81   50134  113   49801  145   50487  177   50688  209   81920  241   50161
 18  118991   50   50125   82   50108  114   50370  146   66545  178   78182  210   49998  242   50092
 19   50209   51   50706   83   50292  115   50297  147   50236  179   50411  211   50458  243   50180
 20   50170   52   50025   84  221044  116  116139  148  247793  180  215784  212   50185  244   50124
 21   50371   53   49717   85   50259  117   50274  149   50327  181  758974  213   49808  245   90855
 22   50365   54   49979   86   50302  118   50523  150   50228  182   50272  214   50685  246   50228
 23   50467   55   50145   87   49797  119  105677  151   50108  183   50390  215  113514  247   50316
 24  100861   56   78827   88  113581  120  138358  152   57005  184   53395  216  147492  248   50546
 25   50158   57   49803   89   50264  121   50209  153   50095  185   50334  217   50396  249   50078
 26   50483   58   50279   90   50176  122   49674  154   50292  186   50529  218   49915  250   50111
 27   50211   59   49997   91   50337  123   50396  155   50453  187   49836  219   50170  251   50524
 28   49859   60   50507   92   50313  124   50280  156   50059  188   50229  220  248053  252   49981
 29   50580   61   50869   93   50417  125   49877  157   50355  189   50313  221   49937  253   50533
 30   50142   62   50358   94   50292  126   50240  158   49821  190   50204  222   50139  254   49975
 31   50265   63   50165   95   50352  127  244313  159   50410  191   50529  223  202938  255  246009
The result is staggeringly non-uniform and heavily biased toward certain byte values, 0 above all. I quickly eyeballed the figures and labeled the counts with colors: red for a strong bias, yellow for a slight bias, and green for counts in the ballpark of an unbiased result yet still above the typical count in this table. The coloring is quite subjective, but I'm only using it to highlight the non-uniform bias.

Now, this observation also indicates that RC4 is a very poor stream cipher. A stream cipher works by generating a stream of pseudo-random bytes and XOR-ing it byte-wise with the plaintext to obtain the ciphertext. The prevalence of 0 values means that much of the plaintext will pass through to the ciphertext unchanged, because 0 is the XOR identity. The fact that such bad encryption is possible with some key choices should worry the many people who are still using RC4.

By a stroke of luck, I tried a slightly modified RC4 that gives me a fairly uniform distribution. The difference from the pseudo-code listing in the RC4 Wikipedia article is marked with a comment in the listing below.
i := 0
j := 0
while GeneratingOutput:
    i := (i + 1) mod 256
    j := (j + S[i]) mod 256
    swap values of S[i] and S[j]
    K := S[(S[i] + S[j]) mod 256]
    output K ^ j  // was just K before
endwhile
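For concreteness, here is one way the modified output step might look in C++. The struct name is mine, and for brevity it starts the PRGA from the identity permutation (a real cipher would run the key schedule first):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// PRGA with the post's modification: emit K ^ j instead of K.
struct ModifiedRC4Prga {
    uint8_t S[256];
    uint8_t i = 0, j = 0;

    ModifiedRC4Prga() {  // identity state; real use runs the KSA first
        for (int k = 0; k < 256; ++k) S[k] = static_cast<uint8_t>(k);
    }

    uint8_t next() {
        i = static_cast<uint8_t>(i + 1);
        j = static_cast<uint8_t>(j + S[i]);
        std::swap(S[i], S[j]);
        uint8_t K = S[static_cast<uint8_t>(S[i] + S[j])];
        return K ^ j;  // the one-line change: XOR the output byte with j
    }
};
```

Note that each modified output byte is exactly the standard output byte XOR-ed with the current value of j, so the change only post-processes the keystream; the internal state evolution is untouched.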
This modification gives me the following histogram.
byte   count  byte   count  byte   count  byte   count  byte   count  byte   count  byte   count  byte   count
  0   99723   32  100388   64   99739   96   99866  128  100026  160  100026  192   99831  224   99581
  1   99941   33   99908   65  100363   97  100328  129   99741  161  100338  193   99792  225   99776
  2   99968   34  100010   66   99589   98  100231  130  100649  162  100084  194  100218  226   99839
  3   99896   35   99924   67  100465   99  100002  131  100093  163   99815  195  100478  227  100137
  4  100439   36   99677   68  100004  100   99960  132  100165  164   99756  196  100064  228  100114
  5  100081   37   99362   69   99664  101   99872  133   99887  165  100059  197   99523  229  100590
  6  100570   38   99924   70   99648  102   99945  134   99459  166   99854  198  100170  230  100069
  7   99812   39   99706   71  100066  103   99598  135   99532  167   99499  199   99956  231   99892
  8   99969   40  100484   72  100074  104   99506  136   99831  168  100291  200  100164  232   99562
  9   99840   41   99756   73   99128  105  100178  137   99646  169  100348  201  100121  233   99412
 10   99793   42  100133   74  100019  106  100488  138   99882  170   99661  202   99864  234   99878
 11  100208   43  100059   75   99717  107  100141  139  100374  171  100116  203   99643  235  100394
 12  100157   44   99772   76   99846  108  100085  140  100180  172   99632  204   99928  236  100123
 13  100098   45  100232   77  100017  109  100654  141   99955  173   99993  205   99858  237  100188
 14  100017   46  100135   78   99566  110   99104  142   99640  174   99256  206  100243  238   99918
 15  100259   47  100403   79   99459  111   99887  143  100255  175   99818  207   99612  239   99840
 16  100139   48   99888   80  100828  112   99999  144   99820  176  100378  208   99663  240   99622
 17  100093   49  100173   81   99633  113  100480  145   99804  177   99826  209  100564  241   99916
 18  100257   50   99987   82  100502  114  100253  146  100554  178   99998  210  100183  242  100319
 19   99835   51  100206   83   99776  115   99674  147  100726  179   99909  211  100548  243   99803
 20  100010   52   99879   84  100188  116   99657  148  100255  180   99687  212   99937  244  100122
 21  100424   53  100257   85  100197  117  100375  149   99866  181   99741  213  100669  245   99972
 22  100062   54   99376   86  100123  118  100396  150   99623  182  100523  214  100195  246  100034
 23   99729   55  100404   87  100620  119  100275  151   99594  183   99879  215   99921  247  100092
 24  100076   56   99792   88  100383  120   99712  152  100641  184   99514  216  100226  248  100600
 25  100107   57  100214   89  100079  121   99682  153   99723  185   99764  217  100120  249   99574
 26   99785   58  100284   90   99835  122  100090  154   99805  186   99752  218   99850  250  100211
 27   99839   59   99459   91  100332  123  100185  155   99557  187  100176  219  100167  251  100213
 28   99700   60   99795   92   99966  124  100599  156   99709  188  100422  220  100028  252  100368
 29   99776   61   99663   93   99596  125   99479  157  100266  189  100427  221  100172  253   99842
 30  100319   62   99734   94   99962  126   99874  158  100068  190  100238  222  100058  254   99921
 31   99537   63  100025   95  100043  127  100382  159   99910  191   99824  223  100697  255   99923
This is amazingly uniform, or at least it appears good enough for my purpose. It remains to be seen whether sequences of two or more consecutive output bytes retain the same uniformity.

I think it is quite possible that this slight modification is immune to the existing statistical attacks on RC4.

Friday, December 2, 2011

Interface checking

One of my regrets for C++ is that it doesn't have interface checking when you instantiate templates. Coupled with delayed instantiation (the body of a template method is not fully type-checked until it is used), static type checking of template code has been hit or miss. Also, when you do get a type error, it is incomprehensible, riddled with instantiation details that shouldn't be there in the first place. Interface checking means that the compiler can do better type checking before template instantiation. Interface checking has been in development for some time (e.g. Boost Concept Check, ConceptCpp, N2773), but it didn't make it into the C++11 standard.

Interface checking in most programming languages follows a fulfillment model: an interface specifies a collection of method signatures, and if a class implements all of those functions (fulfills all the signatures), it is said to implement the interface. This is the case for Go's structural interfaces, and for the C++ concept checking work so far. (Java, by contrast, requires a class to name the interfaces it implements.)

However, if I were to design interface checking for a new language, I would choose a declarative model in which a class must explicitly declare the interfaces it intends to implement, and each interface is uniquely identifiable in the program's namespace. The motivation is to prevent a class from unintentionally implementing an interface by coincidence.

This is especially important for C++ because so many operators are overloaded. For example, a random access iterator looks a lot like an integral type at the interface level (in fact its interface is a superset, since you can also dereference an iterator), but manipulating the iterator like an integer without observing begin() and end() will cause it to go out of bounds.

This semantic gap can be bridged by a more precise type specification using dependent types. (These are not to be confused with C++ dependent types: a C++ dependent type is a class that is complete if and only if a template argument refers to a complete type, where "complete" simply means the compiler knows its memory layout, so you can apply sizeof to it.) But we don't yet have an object-oriented dependent type system with interface checking. Even if we could bridge the semantic gap, I argue that declarative interface checking is no less powerful than fulfillment interface checking.

Suppose you have a class Foo that implements all the methods required by interface Xyzzy, but Foo doesn't declare so. In the fulfillment model, Foo can be used to instantiate any template (generic) that requires interface Xyzzy; not so in the declarative model, which requires Foo to explicitly state that it implements Xyzzy. Even if you cannot modify the definition of Foo, all you need to do is subclass Foo, declare that the subclass implements Xyzzy, and pass that to the template.

Another idea I have (in the context of C++ concept checking, at least) is to get rid of the distinction between a class and an interface: any class could be used as an interface, since C++ allows declaration and implementation to be separate. Furthermore, declaring that a class implements another class would not make it a subclass of that class; the compiler would only check that the method signatures are congruent.