Life of a Computer Scientist: 2014

Friday, October 24, 2014

Cross-Site Request Forgery Prevention

Cross-site request forgery (CSRF or XSRF) happens when an attacker embeds a tag like <img src="http://snafu.com/cgi-bin/delete_account?confirm=yes"> in a malicious web page on attacker.com. When user visits the page, authentication cookies are sent to snafu.com despite the page origin. Forms that POST to snafu.com can be embedded in a malicious page as well. "Referer" can be forged if snafu.com redirects to canonicalize URL.

One solution to prevent the attack is to both: (1) use an unguessable secret cookie rotated regularly, and (2) require the same secret to appear in the POST request as a hidden form field.

If the secret appears in the URL, and if the attacker may embed content in snafu.com, then when snafu.com loads a page on attacker.com, the "Referer" header would reveal the secret.

Source: https://www.owasp.org/index.php/Cross-Site_Request_Forgery_(CSRF)_Prevention_Cheat_Sheet

Sunday, October 19, 2014

Build Notes of v8 Command Line Shell

The instructions on StackOverflow is a bit outdated. You don't need scons anymore. You don't even need to download GYP in advance.

# Takes about 123MB disk space.
git clone https://chromium.googlesource.com/external/v8
cd v8
# master branch of this repo is the stable edge so it's safe to use.
nice make builddeps && nice make -j4 native
cp out/native/shell ~/where/you/want/it/to/go/shell

The build artifact based on samples/shell.cc is in out/native/shell. There is no need to compile it separately. The resulting shell doesn't have any of the usual browser intrinsics such as window or document (obviously this is not a browser JavaScript). But you can print() to the standard output, read() the contents of a file, load() a JavaScript file and execute it, quit() the interpreter, and obtain its version(). This is quite enough for many "pure" programs that do mostly computations and not a lot of I/O.

Or just use node.js. The command line interpreter \( \texttt{node} \) comes with a suite of system I/O functions you could use out of the box. It doesn't have to run a web server.

Saturday, June 14, 2014

Bitrot Resistance on a Single Drive

This morning I read an interesting summary on Slashdot about bitrot under HFS+. The article is a bit misleading because bitrot is not really an HFS+ problem per se, but the author's qualm is that HFS+ does not support data integrity feature like ZFS. This merely echoes a sentiment that ZFS is slowly making its way to Mac OS X which was not happening due to the NetApp v. Sun Microsystems (now Oracle) patent lawsuit claiming that the copy-on-write feature in ZFS infringes on NetApp's WAFL technology. Although the lawsuit was dismissed in 2010, it appears that Apple had lost interest.

But is ZFS really the silver bullet against bitrot? My argument is that bitrot should not be handled at the filesystem layer, but done at the block storage layer, if you only put data on a single drive.

ZFS data integrity uses block checksumming. Rather than storing the checksum in the block, the block pointers are really a pair (address, checksum). This pointer structure is propagated all the way to the root, forming a Merkle tree. You can configure the filesystem to store several copies of a block, which divides the usable data capacity of a drive by the number of copies (or \( 1 / k \) of the capacity, where \( k \) is the number of copies). ZFS can also manage a disk pool for you in a RAID-like configuration. RAID-Z uses parity so the drive capacity is \( (n - 1)/n \) where \( n \) is the number of drives (which must be identically sized). Most people don't have disk arrays, so they're stuck with multiple copies at \( 1 / k \) capacity.

Given that the original author of the bitrot article suffered only 23 files over a 6 year period (on the filesystem there are usually millions of files), reducing your drive capacity to a half or a third seems a dire price to pay.

The ideal case of bitrot resistance is if the block storage performs Reed-Solomon error correction. This have been the way CD-ROM stores data. CD-ROM physical sectors are 2352 bytes, but data sectors are 2048 bytes; the hidden bytes are used for error correction.

Note that Reed-Solomon is generally more robust the more data there are. For hard drives (magnetic spinning discs, not SSDs), it would make sense to do Reed-Solomon on a whole cylinder, since hard drive performance overhead is bottlenecked on seeking: movement of the reading heads to a particular cylinder position is a delicate operation. Reading the extra sectors in a cylinder is free. The drive doesn't have to implement this. The extra error correction can be done by the logical volume manager, which is a block storage layer underneath the filesystem but above the actual device.

But one complication is logical block addressing (LBA) that numbers sectors linearly in order to overcome disk size limited by BIOS int 13h, which is an outdated API for accessing disk drives in the MS-DOS days. LBA hides the disk geometry due to the linear numbering. The error-correcting logical block layer could be more efficient if it knows the underlying disk geometry. Otherwise the error correction might span cross two physical cylinders, which is suboptimal. But it might not be too big of a deal since moving the disk head to the next cylinder is much easier than swinging it across the disk surface. The notion of a cylinder is also largely decimated in modern disks because outer cylinders have bigger circumference, and can have more sectors than inner cylinders. It would suffice to pretend that a pseudo-cylinder be merely a span of continuous sectors.

To get the best performance, the filesystem might wish to use a logical block size equaling the cylinder size. Depending on the disk geometry, the block size would typically be ~1MB. This would waste space unless the filesystem can pack small files, like BtrFS. Therefore, it is possible to achieve bitrot resistance on a single drive without performance impact.

Now, if you want to put data across an array of drives, then doing Reed-Solomon error correction across all drives that might fail independently would be the ideal. Note that you don't want to do error correction across drives with correlated failure, e.g. (1) drives from the same manufacturing batch subject to the same wear, or (2) drives that can be destroyed by the same power failure or surge at a site.

Saturday, April 12, 2014

VirtualBox shared folder Linux notes

I sank my Friday evening fighting with VirtualBox 4.3.10 r93012 with a Mac OS X 10.9.2 host and Ubuntu Linux guest kernel 3.13. Whenever I try to mount my shared folder xyzzy, I'd get cryptic error messages like these:

root@ubuntu:~# mount -t vboxsf xyzzy /media/xyzzy
mount: wrong fs type, bad option, bad superblock on xyzzy,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount. helper program)
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
root@ubuntu:~# dmesg | tail
...
[ 1292.150065] sf_read_super_aux err=-22
root@ubuntu:~# mount -t vboxsf -o gid=1000,uid=1000 xyzzy /media/xyzzy
mount: Protocol error
root@ubuntu:~# dmesg | tail
...
[ 1348.460432] sf_read_super_aux err=-71

Making the shared folder automount and reboot works, but I want to mount it to a specific path. Reinstalling GuestAddition doesn't help. It turns out that it's an old problem resurfacing. The root cause is that /sbin/mount.vboxsf is an invalid symlink. The fix is this:

root@ubuntu:~# updatedb  # first make sure locate is up to date.
root@ubuntu:~# locate mount.vboxsf  # find the real location.
/opt/VBoxGuestAdditions-4.3.10/lib/VBoxGuestAdditions/mount.vboxsf
/sbin/mount.vboxsf  # this is the bad symlink.
root@ubuntu:~# ln -sf /opt/VBoxGuestAdditions-4.3.10/lib/VBoxGuestAdditions/mount.vboxsf /sbin/mount.vboxsf

Now I can mount.

If I want to add the mount point to /etc/fstab so that it gets mounted at boot time, the problem is that vboxsf is not loaded automatically early in the boot, so again it fails to mount. The solution is to add vboxsf to /etc/modules.

root@ubuntu:~# echo vboxsf >> /etc/modules

Now when I reboot, the shared folder is mounted correctly at boot time.

Wednesday, March 26, 2014

Really Understanding Python @staticmethod and @classmethod

Nowadays people don't write blog posts explaining subtle programming concepts. They write surprisingly well-versed posts on StackOverflow which I personally use a lot. But in this case, I noticed the answers to the question "What is the difference between @staticmethod and @classmethod in Python?" are unsatisfactory. That gave me the excuse to write another blog post.

Python has three types of methods. Here is an example illustrating their use in Python.

class MyClass(object):

    def some_instance_method(self, *args, **kwds):
        pass

    @classmethod
    def some_class_method(cls, *args, **kwds):
        pass

    @staticmethod
    def some_static_method(*args, **kwds):
        pass

Python is unique in that its instance methods take an explicit instance argument. You can call it any name, but the convention is to call it self. Many languages like C++ and Java have an implicit this instead. Syntax wise, the class method would take a class (which is also an object in Python) as its first argument, and a static method takes no explicit class nor object references.

Many people understand that static methods do not use per-instance object states, but they implement functionality grouped as part of the class. That's why you can forego the explicit references. But nobody seems to explain well what class methods are good for. Class states? That's a synonym to the dirty word "global variables" which everyone knows you should avoid. But class methods have legitimate uses.

The first legitimate use is to use a class method like a static method, but better.

Imagine you have a static method that wants to use another static method in the same class, or a static method that wants to use constants defined in the class. You could refer to the class by its full name like this:

class Smoothie(object):

    YOGURT = 1
    STRAWBERRY = 2
    BANANA = 4
    MANGO = 8

    @staticmethod
    def blend(*mixes):
        return sum(mixes) / len(mixes)

    @staticmethod
    def eternal_sunshine():
        return Smoothie.blend(
            Smoothie.YOGURT, Smoothie.STRAWBERRY,
            Smoothie.BANANA)

    @staticmethod
    def mango_lassi():
        return Smoothie.blend(
            Smoothie.YOGURT, Smoothie.MANGO)

But with class method, you could write instead:

class Smoothie(object):

    YOGURT = 1
    STRAWBERRY = 2
    BANANA = 4
    MANGO = 8

    @staticmethod
    def blend(*mixes):
        return sum(mixes) / len(mixes)

    @classmethod
    def eternal_sunshine(cls):
        return cls.blend(cls.YOGURT, cls.STRAWBERRY, cls.BANANA)

    @classmethod
    def mango_lassi(cls):
        return cls.blend(cls.YOGURT, cls.MANGO)

This results in a much more succinct code. Imagine the headache you would have saved renaming the class if you later decide that SimpleSmoothie is a better name for just Smoothie. Indeed, making code refactoring easier is one of the first use case I discovered with Python class methods.

This is what you get if you want to see what the class methods return:

>>> Smoothie.eternal_sunshine()
2
>>> Smoothie.mango_lassi()
4

There is a subtle performance benefit. If you refer to the full class name in a static method, the Python interpreter has to first look up the local variables, not finding it there, and then look up the global variables. With a class method, cls would be found in the local variables, so there is no need to look up the globals.

With this in mind, the second legitimate use is to allow a subclass to override static and class methods.

class BetterSmoothie(Smoothie):

    YOGURT = 'yogurt'
    STRAWBERRY = 'strawberry'
    BANANA = 'banana'
    MANGO = 'mango'

    @staticmethod
    def blend(*mixes):
        return ', '.join(mixes)

Now, without touching the class methods eternal_sunshine() and mango_lassi(), they will use the new blend() as well as the new constants, and produce a different kind of smoothie mix.

>>> BetterSmoothie.eternal_sunshine()
'yogurt, strawberry, banana'
>>> BetterSmoothie.mango_lassi()
'yogurt, mango'

If you use a class method for your class factory, then it will be able to construct derived classes too. It may or may not make sense depending on your class design.

class Xyzzy(object):

    def __init__(self, which_factory):
        self.which_factory = which_factory

    @classmethod
    def factory_foo(cls):
        return cls('factory_foo')

    @classmethod
    def factory_bar(cls):
        return cls('factory_bar')

class Yappy(Xyzzy):

    def __init__(self, which_factory):
        super(Yappy, self).__init__(which_factory)
        print which_factory

Notice that only the Yappy class prints out the factory type, but not Xyzzy. And this is what you get:

>>> Xyzzy.factory_foo()
<__main__.Xyzzy object at 0x1027d1050>
>>> Xyzzy.factory_bar()
<__main__.Xyzzy object at 0x1027c7e10>
>>> Yappy.factory_foo()
factory_foo
<__main__.Yappy object at 0x1027d10d0>
>>> Yappy.factory_bar()
factory_bar
<__main__.Yappy object at 0x1027c7e10>

With class method factories, even though the factories are defined in the base class, they are able to construct the derived classes as well.

Class method in Python is really a unique feature. Some use cases shown above demonstrate how to use class methods to refer to other static or class methods in the same class and how to allow a subclass to change behavior of the base class. The closest analog in C++ would be the curiously recurring template pattern.

Friday, March 21, 2014

Programming Language as a Data Mining Problem

I've been playing with busybox for an embedded system lately. Busybox is a collection of Unix tools like ls, grep, cp, etc. all compiled into a single monolithic statically linked binary. The size of the single monolithic binary is smaller than the sum of individually compiled binaries.

Obviously, static linking would duplicate shared library code among the binaries, so the individually compiled binaries would contain a lot of duplicate code. But even if the binaries were to be dynamically linked against one shared library so we avoid code duplication, the shared library typically contains unused code as well. Whoever builds the shared library cannot predict in advance what code will eventually be used by busybox. Static linking the monolithic busybox binary allows the linker to throw away any unused code, thus achieving the size reduction. Of course, they also cut some corners in the functionality, but it's harder to tell exactly how much the savings would be.

The objective is to put together a complete program with only code that the program actually uses, even though the programmer might write code that is unused for the problem at hand. The unused code might be written for solving other problems, but only the relevant ones must go into the program.

The C language has other peculiarities that keeps us further away from this objective. A single *.c source file is a single compilation unit, and it is compiled to a single *.o object file. The linker only considers whether a *.o is used or unused. It is either entirely included or entirely omitted. If the *.c contains multiple functions, some are unused, then the unused functions are included as well. Note that dead-code elimination only removes intra-function dead code. A more sophisticated compiler and linker would put each function into its own section in the *.o so the linker can cherry pick them. This is explained more in Link time dead code and data elimination using GNU toolchain.

In a high-level view, the compiler takes in source code and produces key-value pairs of blobs, and each blob may reference other blobs by their keys. The linker traverses the blobs as a dependency tree. This is a data mining problem.

If you think C is bad, many modern languages are worse. Java classes are all stored in a *.jar which is a zip file, so the distributed binary is completist. Java partially mitigates the problem by only loading classes on-demand, but loading a class requires scanning the zip file's central directory which is \( O(n) \).

If we treat programming language design as a data mining problem, it is immediately obvious that we want code retrieval to be as efficient as possible, preferably \( O(1) \). But there is more. The compiler reads in a source file (as if it's a punched card) and outputs a structural representation that allows the code to be intelligently surveyed. Human can use this database to learn about the architecture of the codebase. The compiler can use it to compile other code. The linker can use it to eliminate dead and duplicate code. The debugger can use it to show diagnostic.

When people implement a programming language, they don't usually realize that they're designing a database as well. If we do, we could simplify the model and allow the code to be queried more efficiently and more expressively.