Friday, December 19, 2025

My advice if you are a 13 year old vibe coding to become the next Bill Gates...

(Insert obligatory AI generated Ghibli styled graphics here.)

If you are a 13 year old vibe coding to become the next Bill Gates, here are the computer science concepts you will need to know to successfully direct AI to do the right things. I assume you want to build the next big thing, not just fixate on the '70s BASIC interpreter that Bill Gates wrote when he was 20 without AI. Also, if you want to go into fundamental research on AI or quantum computing, vibe coding is probably not what you're after, but it can be helpful to learn about randomized algorithms.

If you ask AI the same question today, you will get some hand-wavy advice, and here are my takes on why it is not that useful.

  • Master prompt engineering.
    • Why this is not useful: the only way to master prompting is to have the right vocabulary for the fundamental concepts in computing, and to understand the concepts behind that vocabulary. It is not effective to prompt with an alphabet soup of jargon unless you use it in a meaningful way.
  • Learn the tech stack (e.g. Gemini, GitHub Copilot).
    • Why this is not useful: tech companies are going to make AI as easy to use as tap water. Obviously, there is fascinating practical engineering about water resourcing and plumbing infrastructure to get the water to your tap, but it's not exactly rocket science to learn how to turn on the faucet. Anyone who thinks they have a particular edge in prompt engineering will find out that it is quickly disappearing.
  • Problem solving and strategic vision.
    • Why this is not useful: don't just dream about things in your head. It is even more important to learn to try things and observe the outcome. If something didn't happen as expected, try to understand why. This is not something AI can do for you, since AI can only learn from its training data. Always validate your ideas in the real world and pivot as necessary.
  • Financial literacy.
    • If you pay attention in high school math class, you should have the tools you need to predict the outcome of your decisions. Watch this video about why Math Just Got Important.

Instead of recommending a specific framework or product (you can ask AI for more timely recommendations), here is a list of timeless computer science concepts that you will want to learn:

  • Computer architecture, especially the memory hierarchy and the principle of locality in the context of cloud services. Also the separation of code and data, which is important for security.
    • Why is it important: the memory hierarchy and the principle of locality let you understand how all computer systems operate under the same space-time constraints, similar to relativity in Physics, and the cost-benefit trade-offs needed to optimize within them (e.g. caching).
    • Without the discipline to separate code and data, mixing the two is a constant source of exploits, compromising cybersecurity at national levels.
  • Algorithms and data structures: when you learn about sorting, set your sights on the time and space complexity analysis and try not to get bogged down in the mechanics. Hash tables are going to be relevant to load balancing, and graph traversal to network architecture.
    • Learn programming in the context of algorithms and data structures so you have a vocabulary to describe them.
    • Why is it important: complexity analysis tells you why your system is provably not scalable as the dataset grows larger, even if you add more computing resources to it.
  • Network architecture: DNS, IP addressing, client/server and load balancing.
    • Why is it important: although many "cloud" providers have this abstracted away, understanding the network architecture lets you design your cloud in a more cost effective way and be able to debug when issues arise.
  • Server side: HTTP and REST, cryptography (TLS, JOSE), SQL (particularly SQL injection, which is what happens when you fail to separate code and data; see the sketch after this list).
    • Why is it important: although many frameworks also abstract away most of these details, your system architecture is limited by the assumptions made by the underlying protocols. Ultimately this boils down to trade-offs, whether to accept these limitations as "good enough" for your application or seek alternatives, e.g. HTTP long polling vs. WebSocket, X.509 vs. JWK, SQL vs. NoSQL.
  • Client side: fundamentally, at least HTML/DOM, CSS, and Javascript.
    • They are very capable nowadays, so you don't really need a separate framework, but feel free to let AI try different frameworks.
    • Why is it important: you still need to know HTML, CSS and Javascript in order to debug framework code.
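
To make the code/data separation concrete, here is a small hedged sketch in C using SQLite (the table and names are made up for illustration): pasting user input into the SQL string mixes data into code, while a prepared statement keeps the two apart.

/* Sketch: SQL injection is what happens when data is mixed into code. */
#include <stdio.h>
#include <sqlite3.h>

void lookup_user(sqlite3 *db, const char *name) {
    /* BAD: user input becomes part of the SQL code. A name like
     * x' OR '1'='1 changes the meaning of the query. */
    char unsafe[256];
    snprintf(unsafe, sizeof unsafe,
             "SELECT id FROM users WHERE name = '%s';", name);
    (void)unsafe;  /* shown only as the anti-pattern; never execute it */

    /* GOOD: the SQL code is fixed; user input is bound as pure data. */
    sqlite3_stmt *stmt;
    if (sqlite3_prepare_v2(db, "SELECT id FROM users WHERE name = ?;",
                           -1, &stmt, NULL) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("id = %d\n", sqlite3_column_int(stmt, 0));
        sqlite3_finalize(stmt);
    }
}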

There are some more advanced topics that could be relevant for domain-specific work like games, especially triple-A titles with good graphics and physics, unless you are satisfied writing yet another unimpressive Minesweeper.

In a world where the cost of answers is dropping to zero, the value of the question becomes everything. It is still important to learn the concepts so you have a vocabulary to express ideas in your head, and to observe if your ideas work in the real-world and pivot if not.

Friday, December 12, 2025

Secretive Key Attestation Possible?

On an old MacBook Pro 2019, I've been trying out Secretive (secretive.dev) to let me use Touch ID (via the Secure Enclave) as a hardware SSH key, much like how I would use a Yubikey on other computers.

One of the things I would like to do is create a certificate authority for key signing, to avoid having to manually add new keys to each host I have to SSH into. The certificate authority needs to be able to tell that the hardware key was indeed generated by some qualified hardware, so the private key cannot be easily exfiltrated. Also, I should not be able to cheat by generating a key pair on a computer, making a copy somewhere, then loading the private key onto the hardware token and pretending it was generated there.

Normally, using U2F with SSH gives me the option to generate a key attestation that can then be verified by the CA before signing the public key certificate. Essentially, I tell ssh-keygen to write the attestation to another file as the new key is generated, and the attestation has to be parsed by a custom script in order to check it against the vendor certificates published by the FIDO Alliance.

There are several ways to generate a key from the Secure Enclave: the Security framework's SecKeyCreateRandomKey with kSecAttrTokenIDSecureEnclave, CryptoKit's SecureEnclave.P256, and DeviceCheck's DCAppAttestService. As far as I can tell, only the last one produces any form of key attestation.
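
For reference, here is a hedged C sketch of the plain, unattested path through the Security framework (compile with -framework Security; actually reaching the enclave also requires an appropriately signed binary):

/* Hedged sketch: generate a P-256 key inside the Secure Enclave using the
 * Security framework's C API. Note there is no attestation output anywhere
 * in this API. */
#include <Security/Security.h>
#include <stdio.h>

int main(void) {
    int bits = 256;  /* the Secure Enclave only supports P-256 */
    CFNumberRef keySize = CFNumberCreate(NULL, kCFNumberIntType, &bits);

    const void *privKeys[] = { kSecAttrIsPermanent };
    const void *privVals[] = { kCFBooleanTrue };
    CFDictionaryRef priv = CFDictionaryCreate(NULL, privKeys, privVals, 1,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    const void *keys[] = { kSecAttrKeyType, kSecAttrKeySizeInBits,
                           kSecAttrTokenID, kSecPrivateKeyAttrs };
    const void *vals[] = { kSecAttrKeyTypeECSECPrimeRandom, keySize,
                           kSecAttrTokenIDSecureEnclave, priv };
    CFDictionaryRef attrs = CFDictionaryCreate(NULL, keys, vals, 4,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    CFErrorRef err = NULL;
    SecKeyRef key = SecKeyCreateRandomKey(attrs, &err);
    puts(key ? "generated a Secure Enclave key" : "failed");
    /* SecKeyCopyPublicKey(key) would export the public half; there is no
     * API here to prove the key was born inside the enclave. */
    return key ? 0 : 1;
}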

Note that the Secretive 3.0.4 release notes mention "GitHub attestation", but that is entirely different: it just says the binary built by GitHub Actions will now have provenance.

Even though DeviceCheck.DCAppAttestService seems like a likely candidate for generating attested keys, the fact that it requires a validation round trip to Apple's server can be problematic. It likely only works for apps uploaded to the App Store (which Secretive is not), and people would not be able to compile their own binaries and expect it to work. I'm also not sure how people may feel about Apple keeping a record of all their SSH keys.

Probably, it's easier to just buy another Yubikey.

Thursday, December 11, 2025

Certificate Signing Request with Authentication Credentials

A PKCS #10 (RFC 2986) certificate signing request (X509_REQ) typically contains at minimum the subject and the public key to be signed into an X.509 certificate issued by a certificate authority (CA). Here is an example of what I send to Let's Encrypt:

$ openssl asn1parse -item X509_REQ -in worship.likai.org.csr 
X509_REQ: 
  req_info: 
    version: 0
    subject: CN=worship.likai.org
    pubkey:     X509_PUBKEY: 
      algor: 
        algorithm: id-ecPublicKey (1.2.840.10045.2.1)
        parameter: OBJECT:secp384r1 (1.3.132.0.34)
      public_key:  (0 unused bits)
        0000 - 04 e4 5c 30 85 6f b5 37-35 f4 e9 21 3b 2c 74   ..\0.o.75..!;,t
        ... omitted for brevity ...
        005a - 33 2f c8 fb c1 21 86                           3/...!.
    attributes:
      <EMPTY>
  sig_alg: 
    algorithm: ecdsa-with-SHA256 (1.2.840.10045.4.3.2)
    parameter: <ABSENT>
  signature:  (0 unused bits)
    0000 - 30 66 02 31 00 ce 23 b9-c3 b0 53 24 2f d3 8c b0   0f.1..#...S$/...
    ... omitted for brevity ...
    0060 - cc fe a2 64 fa f2 7a e6-                          ...d..z.

X509_REQ can also contain extensions, most notably subjectAltName. Some of the extensions are documented in x509v3_config, which includes the ability to send arbitrary extensions. PKCS #9 (RFC 2985) defines the attributes that can be used in a PKCS #10 CSR. Arbitrary extensions are defined by an object identifier (OID) key and some ASN.1 value. We can look up the OID assignments.

As currently defined, X509_REQ does not have the ability to include authentication credentials, i.e. the CA only knows the subject and its public key. The X509_REQ itself is signed by the subject's private key for integrity, but the CA does not know whether the subject is who they claim to be.

That kind of trust has to be verified out of band. In the case of Let's Encrypt (how it works), the ACME protocol performs domain validation by fetching the result of a challenge response over plaintext HTTP at the domain name matching the CN subject in the CSR. This requires the host to be reachable from the public Internet, which is not feasible for internal servers. There are other challenge types: TLS-ALPN-01 also requires the host to be reachable from the public Internet; DNS-01 does not require the host to be reachable, but the requestor must be able to modify the DNS record, which can present other difficulties.

Enrollment over Secure Transport (EST) is another protocol to automate certificate signing (like ACME). It authenticates the requestor at the API layer but not as part of the CSR, and binding the API authentication to the CSR remains a challenge.

There are a few potential OIDs we can investigate for their suitability to include authentication credential as part of the CSR:

  • PKCS #9 defines a challengePassword OID 1.2.840.113549.1.9.7, but it is supposedly used to request certificate revocation later. It seems to be unimplemented. Also, it is generally a bad idea to repurpose a field for something other than its original intent (a sketch of embedding it anyway follows below).
    • The content of challengePassword is plaintext equivalent, which means the CSR should be protected by other means so it is not publicly accessible.
  • RFC 7894 proposes additional fields to disambiguate the PKCS #9 challengePassword, notably: otpChallenge, revocationChallenge, estIdentityLinking.
    • It assumes that the primary authentication is done by the underlying transport such as EST. The additional fields can provide a second-factor authentication.
  • The SMI mechanism OID 1.3.6.1.5.5 references additional mechanisms tracked by IANA. In particular:
    • The aforementioned RFC 7894.
    • There is a working draft draft-ietf-lamps-csr-attestation-08 to attest the CSR with evidence, which is intended to come from a hardware TPM or PSA. Presumably, the evidence can be further extended to cover FIDO U2F or Passkeys. This evidence must not contain plaintext-equivalent secrets.
  • Kerberos OID 1.2.840.113554.1.2.2 (RFC 1964) does not have an OID for the challenge response, as far as I can tell.

These proposals tend to avoid putting plaintext secrets into the CSR, but the ability to secure secret credentials in a CSR remains elusive.
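
For concreteness, here is a hedged sketch of how one would embed the PKCS #9 challengePassword attribute into a CSR using OpenSSL's C API, keeping in mind the caveat above that the attribute is plaintext equivalent:

/* Hedged sketch: add a PKCS #9 challengePassword attribute to a CSR.
 * The secret is stored as plaintext, so the CSR must be kept confidential.
 * Error handling is reduced to the return value for brevity. */
#include <openssl/x509.h>

int add_challenge_password(X509_REQ *req, const char *secret) {
    return X509_REQ_add1_attr_by_NID(req, NID_pkcs9_challengePassword,
                                     MBSTRING_ASC,  /* ASCII source string */
                                     (const unsigned char *)secret,
                                     -1 /* -1 = use strlen(secret) */);
}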

Envelope Proposal

To be able to include secrets in a CSR, we can use the CA's public key to encrypt the plaintext and store the ciphertext in an envelopedData, which has a few competing OID definitions.

The envelopedData defined by PKCS #7 (RFC 2315) is obsoleted by the latest version, RFC 5652 (CMS), which seems to be the correct candidate to store encrypted secrets destined for the CA. If the CA has multiple certificates, they can all be included as recipientInfos.

Inside the envelopedData should be a SEQUENCE that contains at least the SET OF Attribute{{ IOSet }} as defined in PKCS #10 (RFC 2986). This also opens up the possibility of including anything from an LDAP directory OID 2.5.4 (RFC 4519 or ISO/IEC 9594-6:2020 or X.520) that might need to be protected from public access, such as userPassword 2.5.4.35 or maybe encryptedUserPassword 2.5.4.35.2.
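
As a sketch of this envelope idea, OpenSSL's CMS API can produce such an envelopedData encrypted to the CA's certificate; the function below is illustrative, and serializing the result into a CSR attribute is left out:

/* Hedged sketch: encrypt a secret to the CA's certificate as a CMS
 * EnvelopedData, with one recipientInfo per CA certificate. The result
 * could be serialized with i2d_CMS_ContentInfo() and carried in a CSR. */
#include <openssl/cms.h>
#include <openssl/evp.h>

CMS_ContentInfo *envelope_secret(X509 *ca_cert, const char *secret) {
    STACK_OF(X509) *recips = sk_X509_new_null();
    sk_X509_push(recips, ca_cert);          /* one recipientInfo per cert */

    BIO *in = BIO_new_mem_buf(secret, -1);  /* -1 = use strlen(secret) */
    CMS_ContentInfo *cms = CMS_encrypt(recips, in, EVP_aes_256_cbc(), 0);

    BIO_free(in);
    sk_X509_free(recips);                   /* frees the stack, not the cert */
    return cms;                             /* NULL on error */
}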

Afterthoughts

On the one hand, having a global OID means applications and protocols can share relevant definitions when appropriate, but navigating the maze of existing OID assignments can be an exercise in frustration. It is similar to code reuse: sometimes the code you borrow almost does what you need except for some corner cases, but you cannot change the corner-case behavior because it may break existing users relying specifically on that corner case. So you have to choose the OID carefully based on how you expect it to evolve in the future, and try to find the OID that best aligns with the use case.

It is tempting to start from scratch and define exactly what an application needs in an application-specific way without any historical baggage, like JOSE (JWS/JWT). Their RFCs are long, but still shorter than the maze of OIDs, RFCs, and ITU/ISO standards. It would be nearly trivial to abbreviate the CSR into another JSON object.

It is also interesting how OIDs leave breadcrumbs showing how the Internet standards evolved. At first, RSA developed the PKCS under its OID 1.2.840.113549 {iso(1) member-body(2) us(840) rsadsi(113549)}, then this namespace was donated to IANA. Similarly, the OID 1.3.6.1 {iso(1) identified-organization(3) dod(6) internet(1)} used to be owned by the DoD (Department of Defense), now hijacked by IANA as well.

Tuesday, December 9, 2025

Runtime TLS Certificate Hot-Reload with OpenSSL

For long-running microservices that can act as both client and server, the lifetime of the microservice might outlive the validity of its TLS certificate, especially if an organization opts for a security policy that rotates TLS certificates quickly (e.g. every day). We need to be able to reload both the client and server certificates without restarting the service.

OpenSSL provides the functionality as follows:

  • For a client, the caller should use SSL_CTX_set_client_cert_cb() to provide a callback which is "called when a client certificate is requested by a server and no certificate was yet set for the SSL object." The caller provides a client_cert_cb() function that fills in the X509 ** and EVP_PKEY ** (pointers to pointers), which OpenSSL will use to call SSL_use_certificate() and SSL_use_PrivateKey() internally.
  • For a server, the caller should use SSL_CTX_set_tlsext_servername_callback() to provide a callback which is called during SNI processing. The caller provides a cb() function that modifies the SSL * object directly to override the SSL_CTX using SSL_set_SSL_CTX().

The callback can use an out-of-band method to reload the private key and certificate as needed. Once a connection has been authenticated during handshake, it will probably remain active even after the certificate expires, so the callbacks only apply to new connections.
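
To illustrate the client side, here is a hedged sketch of such a callback that re-reads the certificate and key from disk on each new handshake; the file paths are made up, and production code would cache and re-parse only when the files change:

#include <openssl/ssl.h>
#include <openssl/pem.h>
#include <stdio.h>

static int client_cert_cb(SSL *ssl, X509 **x509, EVP_PKEY **pkey) {
    (void)ssl;  /* not needed in this sketch */
    FILE *cf = fopen("/etc/svc/tls/client.crt", "r");  /* hypothetical paths */
    FILE *kf = fopen("/etc/svc/tls/client.key", "r");
    *x509 = cf ? PEM_read_X509(cf, NULL, NULL, NULL) : NULL;
    *pkey = kf ? PEM_read_PrivateKey(kf, NULL, NULL, NULL) : NULL;
    if (cf) fclose(cf);
    if (kf) fclose(kf);
    if (*x509 && *pkey)
        return 1;          /* present this certificate */
    X509_free(*x509);      /* both free functions are NULL-safe */
    EVP_PKEY_free(*pkey);
    *x509 = NULL;
    *pkey = NULL;
    return 0;              /* proceed without a client certificate */
}

/* During setup: SSL_CTX_set_client_cert_cb(ctx, client_cert_cb); */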

To actively disconnect TLS transport with an expired certificate, some kind of connection pool management will be needed to wrap OpenSSL. This connection pool can also handle load balancing.

It is generally a good idea to allow certificate rotation to have some validity overlap so connections can be "upgraded" to the new certificates gradually. It avoids the problem where all connections die and have to be reconnected all at once, which can have undesirable latency impact.

Saturday, October 18, 2025

Compiling s3backer on MacOS, 2025 edition

s3backer allows you to mount a single large file, split into smaller chunks on S3, as a local file via FUSE. In contrast, s3fs-fuse mounts S3 objects as individual files. s3backer is useful for hosting disk images formatted with a native filesystem. Apple subsequently introduced the Sparse Bundle image format, which also internally splits the image into multiple files, but some reported that a Sparse Bundle is limited to 100,000 bands, which forces a band size greater than 8.4MB for large images. On the other hand, s3backer supports better local caching, more granular block sizes, and built-in compression and encryption.

This is a quick note about how to compile s3backer.

  • If not already, install the Xcode command line tools: xcode-select --install
  • I recommend using either MacPorts or Homebrew to install recent versions of automake, autoconf, and (lib)curl.
  • Install MacFUSE by downloading the appropriate package. This installs the header files under /usr/local/include and library files under /usr/local/lib.
  • Clone s3backer: git clone https://github.com/archiecobbs/s3backer

The commands I use to build:

git clean -fxd
mkdir -p m4
autoreconf -iv

configure \
  --prefix=$HOME/local \
  CPPFLAGS='-DFUSE_DARWIN_ENABLE_EXTENSIONS=0 -I/usr/local/include/fuse3 -I/opt/local/include' \
  LDFLAGS='-L/usr/local/lib -L/opt/local/lib'

make -j8 install

Some explanations:

  • git clean -fxd - removes any untracked files from the working dir.
  • mkdir -p m4 - output files for autoreconf.
  • autoreconf -iv - runs aclocal, automake, and autoconf.
  • configure \
    • --prefix=$HOME/local - pick your own install directory.
    • CPPFLAGS
      • -DFUSE_DARWIN_ENABLE_EXTENSIONS - on MacOS, the fuse3 headers by default expect the code to use the Darwin file structs. This would have been useful for MacOS-native filesystems, but s3backer uses the POSIX file structs.
      • -I/usr/local/include/fuse3 - MacFUSE comes with both legacy fuse2 and the newer fuse3. If we included -I/usr/local/include, then #include <fuse.h> would pull in fuse2. This might have been useful for legacy code, but s3backer uses fuse3.
      • -I/opt/local/include - add the MacPorts headers (for curl).
    • LDFLAGS
      • -L/usr/local/lib - MacFUSE library files.
      • -L/opt/local/lib - MacPorts library files.
  • make -j8 install - makes the s3backer binary and installs it to the --prefix configured above, using 8 parallel invocations.

It is worth noting that due to my specific setup, explained below, I ran into additional problems similar to this github issue: https://github.com/pyenv/pyenv/issues/3309, but I don't expect most people to run into it. I'll document my problem here for posterity.

During configure, gcc was not able to find the libcurl symbols. Upon closer inspection, gcc --version output x86_64-apple-darwin24.6.0 for the target, even though I'm now using an Apple Silicon machine, which should report arm64-apple-darwin24.6.0. The correct target was reported when I ran gcc from the command line directly.

The reason for the wrong gcc target during configure is that I used a Python script to build s3backer out of tree. The script creates a unique temporary directory, changes into it, then runs the s3backer configure script from there. A long time ago, when I was still using an Intel Mac, I installed Python from a python.org package. The x86 executables installed under /Library/Frameworks/Python.framework were migrated by the Migration Assistant when I bought a new Apple Silicon Mac.

When I ran my Python script, it found the x86 Python interpreter, which then ran under Rosetta 2. Subsequently, when an x86 program runs gcc, it also runs the x86 version of gcc, which compiles and links the output binary as x86. However, MacPorts only has the arm64 build of curl installed, which is why configure couldn't find the curl symbols.

I'm not sure about the moral of the story. I would have preferred to start with a clean install, but it would have taken me too long to get the machine up and running. Migration Assistant gave me a usable machine more quickly but left behind some unexpected problems that needed cleaning up later. The technical debt trade-off was not known at the time; it was an accidental one. I expect there will be more x86 files I need to clean up later.

Tuesday, September 30, 2025

AI, The Theranos of Software Engineering

Breaking news! Vibe coding achieved an utterly impressive feat: Macintosh System 7 Ported To X86 With LLM Help. The original System 7 ran on the Motorola 68K. The author claimed that they ported the OS to x86 in 3 days, with a fully functional Finder and GUI, and no access to the original source code. The project files can be found on GitHub.

It is understood that the port ran under QEMU to simplify dealing with real hardware, which would require handling many things like ACPI power management. But as you may have surmised from the title, this is not the only "shortcut" that has been taken.

ISO contains just the kernel?

To start, let's take a look at how to run the project from the instructions provided:

# Build the kernel
make

# Create bootable ISO
make iso

# Clean build artifacts
make clean

# Run in QEMU
qemu-system-i386 -cdrom system71.iso -serial stdio -display sdl -vga std -m 256M

So let's take a look at the Makefile to see how the ISO was built.

ISO_DIR = iso
KERNEL = kernel.elf
ISO = system71.iso
GRUB = grub-mkrescue

# ISO target
iso: $(ISO)

$(ISO): $(KERNEL)
	@echo "Creating bootable ISO..."
	@cp $(KERNEL) $(ISO_DIR)/boot/
	@echo 'menuentry "System 7.1 Portable" {' > $(ISO_DIR)/boot/grub/grub.cfg
	@echo '    multiboot2 /boot/$(KERNEL)' >> $(ISO_DIR)/boot/grub/grub.cfg
	@echo '    boot' >> $(ISO_DIR)/boot/grub/grub.cfg
	@echo '}' >> $(ISO_DIR)/boot/grub/grub.cfg
	@$(GRUB) -o $(ISO) $(ISO_DIR)

The ISO is made using grub-mkrescue. The kernel is linked using the multiboot2 specification, which means GRUB would have switched the x86 CPU into 32-bit protected mode with a flat address space.

But it is a little surprising that the ISO comprises just the kernel.elf and grub.cfg, devoid of all other assets such as fonts, icons, and application binaries. There is no filesystem because everything is compiled into the kernel.

Indeed, you can find the hard-coded icons and the hard-coded font bitmap. FontResources is really just some code to download fonts from the Internet using curl, but it is not compiled into the kernel, nor is it the code used to generate the hard-coded font bitmap. It is probably just dead code.

If there is no filesystem, one wonders why the project contains HFS filesystem code.

Dead Code Galore

There is a huge amount of dead code in the repo, all AI-generated slop. The catalog is impressive: AppleEventManager, ControlManager, ControlPanels, DeskManager, DeviceManager, DialogManager, EventManager, FS (HFS), FileMgr, Finder, FontManager, FontResources, GestaltManager, ListManager, MemoryMgr, MenuManager, PatternMgr, Platform, ProcessMgr, QuickDraw, ResourceMgr, Resources, ScrapManager, SoundManager, TextEdit, WindowManager.

I spot-checked a few: the FS code mentions btrees and catalogs, but it is an in-memory filesystem (no actual disk volume support). The QuickDraw code does not contain any shape-drawing code, i.e. the code purporting to draw a line or an oval does not actually render the pixels; there is no implementation of Bresenham's algorithm.

Only a handful of these managers are referenced in main.c, and even then, only their initialization functions are called. There is no evidence that main.c actually uses any of the managers in any substantial way. For example, even though it includes the QuickDraw header, most of the drawing routines are in main.c itself.

What's in a main?

So what does the main.c actually do?

  • Writing to the serial console using outb.
  • The serial port input is used to simulate GUI interaction, e.g. 'm' activates the menu, 'a' activates the Apple menu.
  • Draws console text in VGA mode.
  • Multiboot2 handover.
  • Drawing of text, apple logo, window, rectangle, and icon.
  • Fake initialization of managers.
  • There is some mouse handling using hard-coded coordinates to map cursor to UI elements. But this is commented out.

In summary, we get a very thin veneer of what is minimally required to draw some UI in QEMU, with the serial port used to interact with it.

Conclusion

This is perhaps not the most flattering case study of using an LLM to vibe code. The AI faithfully created a bunch of dead code, while the part that actually works is a small amount of code minimally required to create the appearance of a working operating system.

This whole thing reminds me of Theranos. Its CEO, Elizabeth Holmes, charmed her investors into believing that she invented a miracle medical diagnostics machine that purported to be able to run a bunch of tests using only a single drop of blood. In the few demos they did, they actually drew vials of blood using the normal process and ran the regular lab tests behind the scenes. The machine never worked.

It took Theranos (timeline) three years from rising to fame in 2015 to fraud indictment in 2018. ChatGPT was first publicly released in 2022. How many more years before people discover that AI is a fraud? Probably never, unless AI somehow starts to put people's lives in danger, not out of malice, just incompetence.

Tuesday, March 4, 2025

Review of Memory Safe C/C++ Proposals

In C++ creator calls for help to defend programming language from 'serious attacks', the article mentioned a few proposals: Profiles (C++) by Bjarne Stroustrup, TrapC by Robin Rowe, Fil-C by Filip Pizlo, Mini-C ("Compiling C to Safe Rust, Formalized"), and Safe C++ (Circle C++) by Sean Baxter.

Assuming we all know why memory safety is important, let's dive into how each of these proposals delivers on memory safety. Bear in mind that these are working proposals, so some of the details are "magic" or wishful thinking that needs to be fleshed out further.

Profiles (C++)

Profiles by Bjarne Stroustrup is not a single proposal, but a collection of proposals. Each profile states a promise about the safety property it provides and the language features or checks needed to achieve it. Profiles enforcement can be specified in code or toggled through compiler flags.

Through the use of a new expect() function, which is like assert(), error handling can be done in one of the following ways: ignore, log (and ignore), log (and throw an exception), throw an exception, or exit the program.

There are several profiles; here are their summaries:

  • Profile: Type stipulates that "every object is used only in accordance with its definition." It may be a union of several profiles such as Ranges, Invalidation, Algorithms, Casting, RAII and Union. One idea is to eliminate raw pointer handling from collection objects through a new span abstraction.
  • Profile: Arithmetic detects overflow, underflow, and conversion errors.
  • Profile: Concurrency detects race condition and deadlocks. It acknowledges that this is the "least mature of the suggested profiles" and "has received essentially no work specifically related to profiles."
  • Profile: Ranges detects out of range indexing.
  • Profile: Pointers stipulates that "every pointer points to an object or is the nullptr; every iterator points to an element or the end-of-range; every access through a pointer or iterator is not through the nullptr nor through a pointer to end-of range." It introduces a new language feature not_null, which looks like a type qualifier but is not clearly specified.
  • Profile: Algorithms stipulates that "no range errors from mis-specified ranges (e.g., pairs of iterators or pointer and size). No dereferences of invalid iterators. No dereference of iterators to one-past-the-end of a range." There is some overlap with the pointers profile regarding the iterator end of range. It introduces a new language feature not_end(c, p) which returns whether iterator p is at the end of the container c.
  • Profile: Initialization stipulates that "every object is explicit initialized."
  • Profile: Casting prevents integer truncation by narrowing cast. Provides a narrow_cast<> with runtime checking.
  • Profile: Invalidation prevents "access through an invalidated pointer or iterator." The compiler is supposed to "ban calls of non-const functions on a container when a pointer to an element of the container has been taken," and suggests that the compiler does it through "serious static analysis involving both type analysis and flow analysis" (i.e. magic).
  • Profile: RAII prevents resource leaks by representing every resource as a scoped object. The constructor and destructor handle the acquisition and release of the resource. This can also be used to do reference counting.
  • Profile: Union says "every field of a union is used only as set" and suggests later providing pattern matching (i.e. algebraic data types) as an alternative.

It is clear that Profiles leverage many C++ only features and will not apply to C. However, the strength of this approach is that it recognizes safety as a synthesis of many issues that can be addressed incrementally. It allows legacy code to be incrementally updated to satisfy one profile at a time, so there is less upfront cost towards memory safety.

Profiles can also become unnecessarily broad. For example, concurrency through flow analysis is another can of worms that requires analyzing arbitrary interleavings of concurrent accesses to detect race conditions. Invalidation is also magic, as most code does not sufficiently express its intent to transfer resource ownership. On the other hand, it is unclear whether all the profiles together will guarantee what people now expect from "safe" Rust.

TrapC

TrapC by Robin Rowe proposes a new dialect of C with the following modifications:

  • malloc() always allocates from a garbage collected heap, and free() is a no-op.
  • Pointers are instrumented with type and size information, and pointer dereferencing is checked in runtime.
  • Access violations can be caught using a new trap statement. Unhandled violations will terminate the program.
  • goto and union are not supported.
  • TrapC can call C functions but not vice versa.
  • Typesafe printf() with a generic "{}" format specifier.
  • No special provision for thread-safety.

It is supposed to be able to compile unmodified legacy C code with additional runtime checks. When access violations cause unwanted program termination, users can write trap handlers as necessary. The white paper suggests a possible goal to "produce executables that are smaller and faster than from C compilers", but it is not clear how that is possible given the additional runtime checking overhead (i.e. magic).

There is some escape analysis: if a scoped object is returned as a pointer, it becomes heap allocated, similar to Go. Through this feature, TrapC proclaims that "it is possible to have code that is wrong in C, yet right in TrapC," but it is not clear how much legacy code that used to have undefined behavior in C will now benefit from having defined behavior.

Fil-C

Fil-C by Filip Pizlo proposes a new dialect of C with the following modifications:

  • malloc() always allocates from a garbage collected heap, but free() puts an object on a free list.
  • Pointers are instrumented with type and size information, and pointer dereferencing is checked in runtime.
  • Garbage collection implementation supports concurrent collection without stop-the-world.

I have some doubts about the free list. The proposal does not prevent pointers from being aliased (having multiple pointers to the same object). Freeing an object will nullify one pointer, but the other pointers remain valid. The proposal may be a little immature.

Much of the manifesto extols the virtues of the author's garbage collector design, so it's not clear whether the author is selling a new language or a new garbage collector. A garbage collector is not supposed to be tied to the language. There is no one-size-fits-all garbage collector, so it ought to be possible to use different garbage collection strategies depending on the workload requirements of the application.

Mini-C, or "Compiling C to Safe Rust, Formalized"

Aymeric Fromherz (Inria, France) and Jonathan Protzenko (Microsoft Azure Research, US) explore how to compile C to Rust without resorting to "unsafe" Rust. The resulting code provides the same strong safety guarantees that Rust provides. Some of the considerations include:

  • Static analysis to translate pointer arithmetic in C to slices and splitting in Rust.
  • Infers when a reference needs mutable borrowing, including references from a struct.

They validated the feasibility of their approach on a subset of C that is already formally verified through other means, but it is probably a long way from being able to accept arbitrary legacy C code.

It relies on the fact that some carefully written C code has internal consistencies that are not explicitly expressed, and we can design inference algorithms to figure out what these internal consistencies are. Some of the inference techniques used in this paper could be applied in reverse to Rust to reduce the notational requirements of Rust code.

The resulting executable does not need garbage collection, but still relies on runtime bounds checking (in Rust).

Safe C++ (also known as Circle C++)

Sean Baxter started the Circle C++ compiler around 2019 as "a compiler that extends C++17 for new introspection, reflection and compile-time execution" with a flair for meta-programming. Some of the memory safety extensions implemented by this compiler over the years are now being proposed as a C++ draft.

Some highlights of the proposal:

  • #feature on safety activates the compile-time memory safety checks, like #pragma in existing compilers.
  • A safe specifier that requires usage of safety language extension in a function (though it still allows explicit unsafe code), like the noexcept specifier that disallows a function from throwing an exception.
  • It still relies on runtime bounds checking.
  • Ownership tracking through checked references T^ (mutable) and const T^ (shared). Each object may have either a single mutable reference, or any number of shared references, but not both at once.
  • Named lifetime parameter /a and borrowing reference T^/a.
  • Lifetime binder template parameter typename T+.
  • A mut statement prefix to establish a mutable context that allows conversions from lvalues to mutable borrows and references.
  • A new standard library with safe containers and algorithms. In particular, it replaces begin() and end() iterators with slice iterators annotated by named lifetime parameters.
  • Pattern matching with "choice type" to enforce that optional values have to be checked before access.
  • Type traits for thread safety: T~is_send, T~is_sync.
  • Type traits for allowing unsafe pointer arithmetics: T~as_pointer, T~as_length; not fully explained.
  • Type traits T~string, T~is_trivially_destructible; not fully explained.

The safety semantics are inspired mostly by Rust, which is mentioned in the proposal 85 times.

Safe C++ may very well provide some concrete design for some of the Profiles work by Stroustrup. In contrast to Profiles, Safe C++'s monolithic "all or nothing" approach might make it more difficult to port legacy code, due to the upfront cost of satisfying all memory safety requirements at once. Perhaps choice type, thread safety, and pointer arithmetic could be split into their own Profiles.

Conclusion

There are several ways to compare and contrast these approaches.

  • Whether they expect significant modification to legacy code:
    • Upfront: Safe C++, Mini-C.
    • Incrementally: Profiles C++.
    • No: TrapC, Fil-C.
  • Whether they force the use of garbage collection:
    • Yes: TrapC, Fil-C.
    • No: Profiles C++, Safe C++, Mini-C.
  • Whether they require C++:
    • Yes: Profiles C++, Safe C++.
    • No: TrapC, Fil-C, Mini-C.

In light of the recent Rust-in-Linux debacle, if we were to port Linux kernel code to a memory safe C dialect, we would not be able to use garbage collection, nor would we be able to use C++. This leaves Mini-C as the only viable option. However, the inference algorithm may not be able to handle the complexity of Linux kernel code, so some kind of object borrowing or lifetime annotation will still be needed.

Incremental safety check is a useful feature, as this alleviates the upfront cost to fix legacy code for all memory safety issues all at once.

It's worth noting that all of the proposals above require runtime bounds checking. Without some kind of size annotation throughout the code, it would be hard for static analysis to infer whether bounds checking can be safely omitted. This precise problem is solved by ATS through the use of dependent types. Perhaps it could be useful to design a dependent type system for a dialect of C for those who aren't used to ML styled programming of ATS. We can take some inspiration from Mini-C to reduce the notational overhead.

Saturday, February 8, 2025

Polling vs. Interrupt

Recently, news reverberated across the interwebs that a Tiny Linux kernel tweak could cut datacenter power use by 30%. The claim is based on this code commit, detailed in the paper Kernel vs. User-Level Networking: Don't Throw Out the Stack with the Interrupts. The main idea is to mix polling and interrupt-driven I/O to get the best of both worlds.

Polling is just busy looping to wait for something to happen. Each check takes only a few CPU cycles, so the turnaround is very efficient. But if we keep looping and nothing happens, the work is wasted.

Interrupt-driven I/O programs the CPU with a handler for when something happens, then tells the CPU to go do something else, or go to sleep, when there is nothing else to do. The CPU has to save the context of the current program and switch contexts between programs, so the overhead is greater. But it is more efficient if there are long waits between events.

When events take some time between arrivals but tend to arrive in bunches (e.g. a Poisson process, with exponentially distributed inter-arrival times), it makes sense to wait for the next event using an interrupt, then process subsequent events by polling. We can further limit the time spent polling to the opportunity cost of taking an interrupt, and it's a win-win: if an event arrives sooner, we save the cost of the interrupt minus the cost of polling, and we are never worse off.
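
The same spin-then-block pattern shows up in user space too. Below is a minimal sketch of the idea using C11 atomics and a pthread condition variable; SPIN_LIMIT is a made-up stand-in for the measured cost of a context switch:

#include <stdatomic.h>
#include <pthread.h>

#define SPIN_LIMIT 4096  /* tune to roughly the cost of one context switch */

static atomic_int ready;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void wait_for_event(void) {
    /* Poll first: cheap turnaround if events arrive in a bunch. */
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (atomic_exchange(&ready, 0))
            return;
    /* Give up and block: the interrupt-like path. */
    pthread_mutex_lock(&lock);
    while (!atomic_exchange(&ready, 0))
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

void signal_event(void) {
    pthread_mutex_lock(&lock);
    atomic_store(&ready, 1);
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}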

This idea is in fact not new. This technique is often used in task schedulers. I remember seeing its use in an old MIT Cilk version circa 1998. I have seen it in some mutex implementations. I have also used the same technique in my Ph.D. work circa 2014 (in the artifact Liu_bu_0017E_429/libikai++-dissertation.tbz2, look for parallel/parallel.cc and the deep_yield() function in particular). This technique just seems very trivial, so I have never seen anyone bother mentioning it.

So will this technique cut datacenter power use by 30%? The paper measured "queries per second" improvements in a memcached microbenchmark, but it is naive to assume that a 30% increase in performance automatically translates to a 30% reduction in power use. That assumes all the datacenter does is receive network packets, with no additional work such as cryptography, marshaling and unmarshaling of wire protocols, local memory and storage I/O, calling out to external services, and computing a response. The 30% figure is the upper bound in the most ideal scenario, and that is what caught the attention of the Internet.

It is a welcome optimization nonetheless, and the authors have done the hard work to validate their claims with benchmarks and a thorough explanation of how it works, so they deserve credit for the execution. Perhaps someone could continue to audit whether other OS subsystems (e.g. block devices, the CPU scheduler) can also benefit from this treatment, for Linux and other OSes.

Friday, January 3, 2025

Why Is RAID 1+0 Better Than RAID 0+1?

In this article, we discuss why RAID 1+0 (stripes of mirrors) is better than RAID 0+1 (mirror of stripes) for those who are building a storage array.

Image credit: Network Storage Server (angle view with case open) from Wikimedia Commons

What is RAID, and why?

RAID (redundant array of independent disks) describes a system of arrangements to combine a bunch of disks into a single storage volume. The arrangements are codified into RAID levels. Here is a summary:

  • RAID 0 (striped) essentially concatenates two or more disks. The stripe refers to the fact that blocks from each of the disks are interleaved. If any disk fails, the whole array fails and suffers data loss.
  • RAID 1 (mirrored) puts the same data across all disks. It does not increase storage capacity or write performance, but reads can be distributed to each of the disks, thereby increasing read performance. It also allows continued operation until the last disk fails, which improves reliability against data loss (assuming that disk failures happen independently, see discussion below).
  • RAID 2, 3, 4, 5 (parity) employ various parity schemes to allow the array to still operate without data loss up to one disk failure.
  • RAID 6 (parity) uses two parity disks to allow up to two disk failures without data loss.
  • RAID-Z (parity) is a parity scheme implemented by ZFS using dynamic stripe width. RAID-Z1 allows one disk to fail without data loss. RAID-Z2 two disks, and RAID-Z3 three disks.

Disk failures may be correlated by many factors (operational temperature, vibration, wear, age, and manufacturing defects shared by the same batch), so they are not fully independent. However, the failures can be somewhat randomized if we mix disks from different vendors or models. Each year, Backblaze publishes hard drive failure rates aggregated by vendor and model (e.g. 2023), which is an invaluable resource to decide which vendor and model to use.

Why use nested RAID?

When disks fail in a RAID, up to the tolerated number of failures without data loss, the failed disk can be replaced with a blank disk, and data is resilvered onto the new disk. The resilver time lower bound is determined by the capacity of the disk divided by how fast we can write to it. A larger disk will take longer to resilver. If disk failures are not fully independent, then there is a greater risk of data loss if another disk fails during resilvering.

By way of example, a 24TB disk at a maximum write speed of 280 MB/s will take about a day to resilver (rule of thumb: each 1 TB takes 1 hour). This makes a large disk somewhat unappealing due to the long resilver time. More realistically, we may want to use smaller disks so they can complete resilvering more quickly.

Creating a RAID from smaller, capacity-bound disks necessitates nesting the RAID arrangements. Here are some ways we can arrange the nesting:

  • RAID 0+1 (a mirror of stripes) takes stripes of disks (RAID 0) and mirrors the stripes (RAID 1). When a disk fails, it impacts the entire stripe, and the whole stripe needs to be resilvered.
  • RAID 1+0 (a stripe of mirrors) takes a bunch of mirrored disks (RAID 1) and creates a stripe (RAID 0) out of the mirrors. When a disk fails, it impacts just the mirror. When the disk is replaced, only the mirror needs to be resilvered.
  • RAID 5+0, RAID 6+0, etc. creates an inner parity RAID combined as an outer stripe (RAID 0).

Not all nestings are created equal!

We can see qualitatively, from the failure-impact point of view, that RAID 1+0 has a smaller impact scope and is therefore better than RAID 0+1. But we can also analyze quantitatively the probability that a second failed disk causes total data loss.

  • In RAID 0+1, suppose we have N total disks arranged into a 2-way mirror of \( N / 2 \) disk stripes. The first failure already brought down one of the mirrors. The second failed disk would bring down the whole RAID if it happens in the other stripe, an approximately 50% chance. Note that we cannot simply swap striped disks between the mirrors to avert the failure, since disk membership belongs to a specific RAID. You might be able to hack around that manually by hex-editing disk metadata, but that is a risky undertaking for the truly desperate.
  • In RAID 1+0, suppose we have N total disks arranged into \( N / 2 \) stripes of 2-way mirrored disks. The first failure already degraded one of the mirrors. The second failed disk would bring down the whole RAID if it happens to be the other disk of the same mirror, an approximately \( 1 / N \) chance. The number of failed disks we can tolerate without data loss is a variation on the Birthday Paradox.

Deep dive into aggregate probability of failure

Another way to analyze failure impact quantitatively is to see what happens if we build a RAID 0+1 or RAID 1+0 array using disks with the same average single-disk failure rate. We compute the aggregate probability of a failure that causes data loss, assuming that disk failures are independent.

  • Suppose the probability of average single disk failure is p.
  • For RAID 0 with N striped disks, the probability that any of the N disks fails is \( P_0^N(p) = 1 - (1 - p)^N \), i.e. the complement of "none of the N disks fail," a double negation.
  • For RAID 1 with N mirrored disks, the probability that all N disks fail at the same time is \( P_1^N(p) = p^N \).
  • For RAID 0+1, with k-way mirrors of \( N / k \) stripes, the failure probability is:
    \[
      \begin{aligned}
        P_{0+1}^{k,N/k}(p) & = P_1^{k}(P_0^{N/k}(p)) \\
          & = P_1^k( 1 - (1 - p)^{N/k} ) \\
          & = ( 1 - (1 - p)^{N/k} )^k
      \end{aligned}
    \]
  • For RAID 1+0, with \( N / k \) stripes of k-way mirrors, the failure probability is:
    \[
      \begin{aligned}
        P_{1+0}^{N/k,k}(p) & = P_0^{N/k}(P_1^k(p)) \\
          & = P_0^{N/k}(p^k) \\
          & = 1 - (1 - p^k)^{N/k}
      \end{aligned}
    \]
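
As a sanity check, here is my own back-of-the-envelope arithmetic with \( p = 0.01 \), \( N = 8 \), and \( k = 2 \):

\[
  \begin{aligned}
    P_{0+1}^{2,4}(0.01) & = (1 - 0.99^4)^2 \approx 1.6 \times 10^{-3} \\
    P_{1+0}^{4,2}(0.01) & = 1 - (1 - 0.01^2)^4 \approx 4.0 \times 10^{-4}
  \end{aligned}
\]

With eight disks in 2-way mirrors, RAID 1+0 is roughly four times less likely to suffer total data loss than RAID 0+1, and the gap widens as N and k grow.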

We can plot \( P_{0+1}(p) \) and \( P_{1+0}(p) \) and compare them against p, the probability of single disk failure, which represents the AFR (annualized failure rate). Here are some examples of p from Backblaze Drive Stats for 2023, picking the model with the largest drive count from each vendor:

MFG      Model            Drive Size  Drive Count  AFR
HGST     HUH721212ALE604  12TB        13,144       0.95%
Seagate  ST16000NM001G    16TB        27,433       0.70%
Toshiba  MG07ACA14TA      14TB        37,913       1.12%
WDC      WUH721816ALE6L4  16TB        21,607       0.30%

In most cases, we are looking at p < 0.01. Some models have an AFR > 10%, but it is hard to tell how accurate that number is due to the small drive count for those models. Models with a small average age can also bias towards high early mortality due to the Bathtub Curve.

We generate the plots in gnuplot using the following commands:

gnuplot> N = 4
gnuplot> k = 2
gnuplot> set title sprintf("N = %g, k = %g", N, k)
gnuplot> plot [0:0.02] x title "p", (1 - (1-x)**(N/k))**k title "RAID 0+1", 1 - (1 - x**k) ** (N/k) title "RAID 1+0"

Number of disks N = 4 with mirror size k = 2 is the smallest possible nested RAID 0+1 or RAID 1+0. At p < 0.01, both RAID 0+1 and RAID 1+0 offer significant improvement over p.

At N = 8, keeping the same mirror size k = 2, we see that both RAID 0+1 and RAID 1+0 still offer improvements over p. However, the RAID 1+0 failure rate doubles, and the RAID 0+1 failure rate more than triples.


At N = 16, the trend of multiplying failure rates continues. Note that RAID 0+1 can now be less reliable than a single disk, while RAID 1+0 still offers a 6x improvement over p.


To get the failure rate under control, we need to increase the mirror size to k = 3. Even so, the RAID 1+0 failure rate (very close to 0) is still orders of magnitude lower than RAID 0+1's.


So, from the quantitative point of view, RAID 1+0 is much less likely to suffer total data loss than RAID 0+1.

Conclusion

As a rule of thumb, when nesting RAID levels, we want the striping to happen at the outermost layer, because striping is the arrangement that accumulates failure rates. When we mirror at the inner layer, we reduce the failure rates by orders of magnitude, so the striping accumulates failure rates much more slowly.

This conclusion also applies to storage pool aware filesystems like ZFS. RAID-Z does not allow adding disks to an existing set. The best practice is to always add disks as mirrored pairs. You can also add a new disk to mirror an existing disk in a pool. Stripes of mirrors offer the most flexibility to expand the storage pool in the future.

This is why RAID 1+0 is better.