Life of a Computer Scientist: July 2021

Pictured below is the Gropius House in Lincoln, Massachusetts, the former residence of Walter Gropius, who designed and built it. Gropius founded the Bauhaus school and later chaired the Department of Architecture at the Harvard Graduate School of Design.

But the Harvard architecture revival I'm writing about is not that kind of architecture. It is a computer architecture from the punched-tape era that kept programs and data on two separate pathways. The name comes from the Harvard Mark I, a machine designed by Howard Aiken at Harvard University. He took inspiration from the Analytical Engine, a mechanical computer designed by Charles Babbage. Babbage was a mentor to Ada Lovelace, who is often regarded as the first computer programmer and was the daughter of the poet Lord Byron.

There is poetic justice in the fact that computer programming came out of the conjunction of mathematics and poetry.

What makes Harvard architecture computers notably different from today's computers is that programs were read-only, and data could not be executed as a program. In today's computers, programs and data are stored on the same storage medium. Anyone who can put data on a computer may also be able to run it as a program.
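
The separation can be illustrated with a toy machine. This is only a sketch under made-up assumptions: the four-opcode instruction set and the function name run_harvard are invented for illustration and do not model the real Harvard Mark I. The point is that the program counter only ever indexes the program memory, and STORE can only ever write to the data memory.

```python
from typing import List, Tuple

# (opcode, operand): a made-up toy instruction set, not the Mark I's.
Instruction = Tuple[str, int]


def run_harvard(program: Tuple[Instruction, ...], data: List[int]) -> List[int]:
    """Run a toy accumulator machine with separate program and data memories."""
    acc = 0
    pc = 0  # the program counter indexes only the program memory
    while pc < len(program):
        op, arg = program[pc]
        if op == "LOAD":       # acc <- data[arg]
            acc = data[arg]
        elif op == "ADD":      # acc <- acc + data[arg]
            acc += data[arg]
        elif op == "STORE":    # data[arg] <- acc; writes reach data memory only
            data[arg] = acc
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return data


# The program is an immutable tuple; the data is a mutable list.
program = (("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", 0))
result = run_harvard(program, [2, 3, 0])
print(result)
```

No instruction in this sketch can name a program address as a write target, so no value placed in the data list can ever be fetched as an instruction.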

Security breaches typically happen in two steps: (1) an attacker gains write access to a data path, and (2) the attacker then leverages that data path to upload and execute unauthorized programs on a computer they do not own. Step (1) is not that hard, because many computer systems that handle data actively solicit it from non-privileged users. Step (2) relies on knowing a vulnerability in a running program that tricks it into transferring control to arbitrary data.

A case in point is WordPress, a blogging platform notorious for vulnerabilities that result in security breaches. That's partly because it is built on PHP, a language that makes it easy to run data as a program:
Once file upload is enabled on a site, PHP accepts uploads for any script on that site, even if that particular script never expects a file upload.
A moderately complex site breaks its code into several files, and the first file uses include() or require() to load the others and bootstrap the program's functions. The file to be included can be a computed path that could point to any file on the filesystem.
PHP can be configured to limit which paths may be included using open_basedir. This limits exfiltration of system files that are not part of the public site, such as /etc/passwd, but uploaded files are often moved under open_basedir, where they can then be included.
Any file type can be included, even an image file with PHP code embedded in it.

An attacker only needs to upload a malicious image to WordPress and trick it into running the image. This is less of an issue for a private WordPress installation, but on a shared blogging site where anyone can register an account, such as wordpress.com, any user could potentially cause a security breach and steal other people's passwords. That's why people shouldn't reuse passwords across multiple sites.
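
One way to picture a defense against the computed-include problem is to resolve the path first and refuse anything that lands in the upload area. The sketch below is written in Python rather than PHP, and is_safe_include, code_root, and upload_root are hypothetical names chosen for illustration. They are not part of PHP or WordPress.

```python
import os


def is_safe_include(path: str, code_root: str, upload_root: str) -> bool:
    """Return True only if path resolves inside code_root and outside upload_root.

    Hypothetical helper: it mimics the check a site could make before
    loading a computed path as code.
    """
    # realpath collapses ".." segments and follows symlinks, so tricks such as
    # "lib/../uploads/cat.gif" are judged by where they really point.
    real = os.path.realpath(path)
    code = os.path.realpath(code_root)
    uploads = os.path.realpath(upload_root)

    def inside(child: str, parent: str) -> bool:
        return os.path.commonpath([child, parent]) == parent

    return inside(real, code) and not inside(real, uploads)
```

With a site rooted at a directory such as /srv/site and uploads kept in /srv/site/uploads (both assumed paths), a file under the site's own library directory would pass, while an uploaded image would be rejected even though it sits under the same root that open_basedir would allow.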

Once attackers hijack a computer, they also hijack whatever data was entrusted to that computer. They may also modify legitimate programs already on the computer and hide illicit code there, which makes it very costly to detect and clean up after a security incident. This is how ransomware works in a nutshell: some crooks have turned security breaches into a profit-making criminal enterprise.

If today's computers had separate program and data pathways, a security breach would be limited to the original unauthorized access to the data path and would not escalate into a complete hijack of the computer.

Modern computers provide some facilities to separate code and data, but they are optional and must be deployed judiciously in a production environment in order to be effective.

Executable space protection designates parts of the memory only for handling data and prohibits executing it as code. This protects programs that are already running from remote code injection vulnerabilities.
W^X (write xor execute) enforces that writable data is non-executable. This is typically part of executable space protection for running programs, but the same principle can be applied to programs and data stored on disk.
Copy-on-write storage can keep snapshots of data as it is modified. This can make it easier to restore the filesystem to its state before a security breach.
Some operating systems support making a filesystem read-only, e.g. read-only bind mounts.
Some filesystems have an option to make every file on a volume non-executable, e.g. the ZFS exec=off property or the noexec mount option.
Code signing verifies that a program comes from a trusted party and has not been modified in an unauthorized manner. Verification and execution must happen atomically. Otherwise an attacker has an opportunity to modify the program between verification and execution to evade detection.
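
The atomicity requirement can be sketched as follows. This is a simplification under stated assumptions: a SHA-256 digest comparison stands in for real code signing (which would verify a public-key signature), the program is a Python script, and run_verified is a hypothetical name. The essential idea is that the bytes that are verified are the same bytes that are executed, and the file is never reopened in between.

```python
import hashlib
import hmac


def run_verified(path: str, expected_sha256: str) -> dict:
    """Execute a Python script only if its SHA-256 digest matches expected_sha256.

    Hypothetical helper; a digest check is a stand-in for signature verification.
    """
    with open(path, "rb") as f:
        code_bytes = f.read()  # read the file exactly once

    digest = hashlib.sha256(code_bytes).hexdigest()
    if not hmac.compare_digest(digest, expected_sha256):
        raise PermissionError(f"digest mismatch for {path}")

    namespace: dict = {}
    # Compile and run the in-memory bytes that were just verified. Reopening
    # the file here would give an attacker a window to swap its contents.
    exec(compile(code_bytes, path, "exec"), namespace)
    return namespace
```

A version that checked the file on disk and then launched it by path would verify one thing and possibly run another, which is exactly the gap described above.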

They all work toward the same principle as the Harvard architecture: programs are read-only, and writable data cannot be executed as a program.
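
Applied to files on disk, this principle can be audited with a small script. The sketch below assumes a POSIX system and only inspects permission bits, so it is an approximation: it does not account for ACLs, mount options, or the fact that a privileged user can write regardless of mode. The name find_writable_executables is hypothetical.

```python
import os
import stat
from typing import List


def find_writable_executables(root: str) -> List[str]:
    """List regular files under root that have both a write bit and an execute bit set."""
    any_write = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            mode = os.lstat(full).st_mode  # lstat: do not follow symlinks
            # A file violates W^X if someone may write it and someone may execute it.
            if stat.S_ISREG(mode) and mode & any_write and mode & any_exec:
                offenders.append(full)
    return sorted(offenders)
```

Every path it reports is a file that could be modified and then run, which is the combination the facilities above try to rule out.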

Ultimately, the reason why programs must be read-only has to do with accountability requirements imposed by the Sarbanes-Oxley Act in response to corporate scandals such as Enron. A program running in production must have an established provenance, and someone must have reviewed it and certified that it serves its intended purpose. If the program is writable after certification, its provenance is in doubt.

In a sense, if a company falls victim to a ransomware attack, there must be a loophole in the accountability of its information systems. A revival of the Harvard architecture would close that loophole and eliminate ransomware attacks.

Thursday, July 15, 2021

Revival of Harvard Architecture for the Mitigation of Ransomware Attacks