Friday, March 29, 2024

SSH server compromised by xz/liblzma 5.6.0 and 5.6.1

A backdoor compromising the SSH server, introduced in xz/liblzma 5.6.0 and 5.6.1, was reported today to oss-security by Andres Freund. According to the analysis, when the sshd binary is initialized by the dynamic linker at startup, initialization code in liblzma installs a hook into the dynamic linker that modifies subsequent dynamic library symbol tables (before they are made read-only), replacing SSH RSA encryption functions with malicious code.

Image credit: Wikimedia Commons

The xz repository provides a widely used data compression command line program “xz” as well as a library “liblzma” that allows the compression algorithm to be used in other programs. SSH is a secure remote login protocol, and sshd is its server. Although sshd does not use liblzma directly, many distributions such as Debian and Red Hat patch it to integrate with systemd notification, which uses liblzma.

The malicious code in liblzma was introduced as obfuscated binary test data in a git commit and patched into the compiled binary using an obfuscated M4 macro as part of the build system. The malicious git commit was introduced by JiaT75, who had been contributing xz commits for about two years while taking advantage of the mental health issues of the original author of xz, who maintained it as an unpaid hobby project. Most of Jia's commits are non-technical fixes such as translation or documentation. Jia reportedly urged the Red Hat and Debian maintainers of the xz package to push the new version to production, which suggests that the attack was premeditated. It was only discovered because Andres Freund noticed that his SSH login got slower and decided to investigate.

It is nearly impossible to manually audit supply chain attacks like this, but there is one way to mitigate this attack vector: all setuid binaries should be statically linked with the static-PIE linking option. Static linking eliminates the dynamic linking attack vector, while PIE enables address-space layout randomization (ASLR) to make it much harder for malicious actors to patch code at runtime. Allegedly, OpenBSD already compiles its system binaries this way.

Those who have concerns about static linking can read the refutation by Gavin D. Howard.

Advisory About Go

Static linking alone does not completely guard against runtime code patching; we also need address-space layout randomization (ASLR) for both code and data. That is because the data sections also contain function pointers that could alter the code path. Without ASLR for data, any function pointer in a heap-allocated object can be compromised by a supply chain attack.

Go does not support heap data ASLR (golang/go#27583). This means that malicious code using the unsafe package could traverse the heap and overwrite interface function pointers to change the behavior of existing code. This is despite the fact that Go always compiles and statically links the whole program, and has -buildmode=pie for code ASLR. Unfortunately, the Go crypto/ssh, crypto/tls, and net/http packages all make ample use of interfaces, and every one of these can be an attack vector.
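To make the concern concrete, here is a minimal sketch of how code importing unsafe can redirect method dispatch by overwriting the type/method-table pointer inside an interface value. The two-word interface layout is an internal detail of the gc toolchain, not a stable API, so treat this purely as an illustration of what heap data can do to the code path:

```go
package main

import (
	"fmt"
	"unsafe"
)

// iface mirrors the runtime's two-word interface layout
// (itab pointer, data pointer). This is an implementation
// detail of the gc toolchain, not a stable API.
type iface struct {
	tab  unsafe.Pointer
	data unsafe.Pointer
}

type Greeter interface{ Greet() string }

type Honest struct{}

func (Honest) Greet() string { return "hello" }

type Evil struct{}

func (Evil) Greet() string { return "pwned" }

func main() {
	var g Greeter = Honest{}
	var e Greeter = Evil{}

	// Overwrite g's itab pointer with e's, changing which
	// Greet implementation a later call dispatches to.
	(*iface)(unsafe.Pointer(&g)).tab = (*iface)(unsafe.Pointer(&e)).tab

	fmt.Println(g.Greet()) // prints "pwned", not "hello"
}
```

A real attack would locate such pointers by scanning the heap rather than by holding a handy reference, but the end result is the same: data alone changed which code runs.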

Saturday, March 16, 2024

Deep Dive into MQA-CD Encoding

A few weeks ago, I saw this video by Techmoan introducing the MQA-CD. MQA-CD is an audio CD that can be played back in a regular CD player, which is limited to 16-bit samples at 44.1 kHz. However, when played back through an MQA decoder, it promises better sound quality: 24-bit samples at 192 kHz.

Before we dig into the MQA marketing material, we need to understand that MQA is an encoding scheme that can exist outside of a CD, e.g. audio delivered over the radio or the Internet. Some of the non-CD transports are assumed to carry 24-bit at 48 kHz or higher. However, MQA-CD transport is limited to 16-bit at 44.1 kHz by the CD as its physical medium.

At first glance, MQA violates the Nyquist–Shannon sampling theorem, which places a hard upper bound: a signal containing frequencies up to B can only be uniquely represented by at least 2B samples per second. However, we can give it some leeway by allowing for lossy encoding, even though some MQA marketing material claims that the encoding is lossless.

In a lossy scheme, we can steal some least significant bits from each sample to pass through a data stream that employs psychoacoustic coding, like MP3. The stolen bits sound like the noise floor when listened to without the decoder, and psychoacoustic coding allows us to put more detail into that noise economically: the data stream contains instructions for synthesizing only the sounds humans can hear, so it uses less data than encoding the full Nyquist–Shannon spectrum. Furthermore, the data stream only needs to contain the delta, which is the sound not already present in the non-stolen bits.
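To illustrate the bit-stealing mechanics (and only the mechanics; MQA's actual scheme is unpublished, so this is a speculative sketch), here is how a 4-bit-per-sample side channel could be hidden in the least significant bits of 16-bit PCM:

```go
package main

import "fmt"

// embedLSB hides 4 bits of side-channel data in the least
// significant bits of each 16-bit PCM sample. Without a decoder,
// the hidden stream just sounds like a raised noise floor.
func embedLSB(pcm []int16, data []byte) []int16 {
	out := make([]int16, len(pcm))
	for i, s := range pcm {
		nibble := int16(0)
		if i/2 < len(data) {
			nibble = int16(data[i/2]>>(4*uint(i%2))) & 0x0F
		}
		out[i] = (s &^ 0x0F) | nibble // clear 4 LSBs, insert data
	}
	return out
}

// extractLSB recovers the hidden nibbles; a decoder would feed
// this stream to a psychoacoustic decoder for the extra detail.
func extractLSB(pcm []int16) []byte {
	data := make([]byte, (len(pcm)+1)/2)
	for i, s := range pcm {
		data[i/2] |= byte(s&0x0F) << (4 * uint(i%2))
	}
	return data
}

func main() {
	pcm := []int16{1000, -2000, 3000, -4000}
	hidden := []byte{0xAB, 0xCD}
	encoded := embedLSB(pcm, hidden)
	fmt.Printf("%X\n", extractLSB(encoded)) // ABCD
}
```

A plain CD player reproduces `encoded` as ordinary audio with a slightly raised noise floor, while a decoder extracts the side channel and uses it to reconstruct detail beyond the remaining 12 bits.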

The question about MQA-CD is: how many bits is it stealing?

Music Origami, according to MQA

The MQA website links to Bob Talks, a blog by the MQA inventor, which discusses the CD encoding in some technical detail, but it is a little confusing:

If the original source is 44.1kHz/24b or if the sample rate is 88.2, 176.4, 352.8 kHz, or DSD, then a standard MQA file will be 44.1 kHz/24b. The file contains the information for decoding, ‘unfolding’, and rendering.

This 24b MQA file is structured so that, if in distribution it encounters a ’16-bit bottle-neck’ (e.g. in a wireless or automotive application), then the information in the top 16 bits is arranged to maximise the downstream sound quality and still permits unfolding and rendering. See [2]

[2] MQA-CD: Origami and the Last Mile 

So reference [2] should contain some information about how the 24-bit stream is truncated to 16 bits. Here are some mentions:

The Green signal is completely removed by MQA decoders; but it is there so that we can hear more of the music when playback is limited to a 16-bit stream.

Sometimes we might want to listen to MQA music on equipment that doesn’t support 24 bits – maybe only 16? Rather than throw away all the buried information, MQA carries a small data channel (shown in Green) which can contain the ‘B’ estimates, enabling significantly improved playback quality on, e.g. a CD, over ‘Airplay’, in-car, to certain WiFi speakers and similar scenarios.

But it is also confusing because it shows the “Green signal” at -120 dB. We know that the CD dynamic range is 96 dB, so a CD could not have represented a -120 dB noise floor. 24-bit samples have a dynamic range of 144 dB. However, the signal charts on the page show a floor of -168 dB, with some information placed below -144 dB, which would require 28 bits.

As a side note, the CD dynamic range of 96 dB is determined by a formula in terms of the 16-bit sample depth: \( 20 \times \log_{10}{2^{16}} \approx 96 \). As a rule of thumb, each bit in the sample represents about 6 dB of dynamic range.
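The same rule accounts for the other figures above:

\[ 20 \log_{10} 2 \approx 6.02 \text{ dB per bit}, \qquad 20 \log_{10} 2^{24} \approx 144 \text{ dB}, \qquad \frac{168 \text{ dB}}{6.02 \text{ dB per bit}} \approx 28 \text{ bits}. \]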

Another page, Deeper Look: MQA 16b and Provenance in the Last Mile, also states:

If we look at the block diagram above, we can see there are three components to the MQA data, broadly described as: i) top 16 bits, ii) MQA signalling and iii) bottom 8 bits

The block diagram clearly shows that the encoding results in a 24-bit master file, but it still does not explain how that is reduced to the MQA-CD, which is bottlenecked to 16-bit samples.

Is Bit Stealing Plausible?

Since MQA still does not explain how the 24-bit master is reduced to the 16-bit transport depth on a CD, we are left to speculate along the lines of the bit stealing idea from earlier.

If we allow stealing 4 bits per sample, then we get a data rate of \( 2 \textit{ channels} \times 4 \textit{ bits per sample} \times 44100 \textit{ Hz} \approx 353 \textit{ kbps} \). This is pretty generous compared to high quality AAC, which is typically 256 kbps. The dynamic range before decoding is reduced from 96 dB to 72 dB, which is still comparable to a very high quality magnetic tape.

So I would say it is plausible, but it is inconclusive from the MQA marketing material whether this is how they did it.

Furthermore, I don’t see the point of MQA’s “Music Origami” that folds 24-bit 192 kHz into 24-bit 48 kHz. If the transport is already capable of lossless 24-bit data, it must be a digital transport that is not a CD, which means there is no requirement to maintain backwards compatibility with a Red Book CD player. We could just use the whole stream to transport encoded audio, e.g. AAC or FLAC. Even some later CD players from the 2000s can play MP3 from a data CD or a USB drive. All of that was possible before MQA launched in 2014.

That is why Techmoan says that even if you believe MQA delivers higher quality audio, it is a format that came a little too late.

Carrot and Stick Security Design

Carrot and stick security design is the idea of having the frontend and backend work together to enforce security policies in software. The frontend interacts with the user and steers them towards compliance, while the backend enforces the security rules. Although we don’t necessarily use carrot and stick to mean reward and punishment, the carrot is a “soft nudge” and the stick is a “hard boundary.” If the user bypasses the frontend and tries to interact with the backend directly, they will be met with a hard error message.


Image credit: Wikimedia Commons

(“Good cop, bad cop” is a similar strategy, although the cop analogy may be controversial.)

An example is a photo gallery that allows visitors to browse but only signed-in users to download images. The frontend may present a “download” button, but it will ask the user to either log in or create an account. The backend checks that the login credentials are present before allowing the image to be downloaded.

If the security check is only done in the frontend, then the user could simply bypass login by forging a URL request to download the images directly. If the security check is only done in the backend, then an innocent user who did not know they needed to sign up or log in first may be confronted with an unfriendly error message.
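Here is a minimal sketch of the backend half (the stick), assuming a hypothetical session cookie named session and a placeholder validSession helper standing in for a real session store lookup:

```go
package main

import (
	"log"
	"net/http"
)

// validSession is a hypothetical stand-in for real credential
// checking, e.g. looking the token up in a session store.
func validSession(token string) bool {
	return token == "letmein" // placeholder only
}

// downloadHandler is the stick: it refuses the request outright
// when credentials are missing, no matter what the frontend did.
func downloadHandler(w http.ResponseWriter, r *http.Request) {
	c, err := r.Cookie("session")
	if err != nil || !validSession(c.Value) {
		// A forged direct request hits a hard boundary.
		http.Error(w, "login required", http.StatusUnauthorized)
		return
	}
	w.Write([]byte("...image bytes..."))
}

func main() {
	http.HandleFunc("/download", downloadHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The carrot lives in the frontend: for a signed-out visitor, the download button points at the login page rather than at /download, so an honest user never sees the hard 401.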

I’m intentionally using the terms “frontend” and “backend” loosely. In practice, the designations may differ depending on the application:

  • For a user-facing website, frontend is the client-side JavaScript, and backend is the HTTP server.
  • For a mobile app, frontend is the app, and backend is some API used by the app.
  • For an API, frontend is the HTTP server middleware, and backend is the internal data storage.

Even at the API level, the API design should try to encourage well-defined use cases (the carrot), and let the protocol layer check for malformed requests (the stick).

What this means is that a complete software stack, spanning the gamut from client-side JavaScript or a mobile app through API middleware to backend storage, should implement security enforcement at all layers.

Friday, March 1, 2024

Memory Safety State of the Union 2024, Rationale Explained

There has been renewed interest in programming languages after the White House recently published a recommendation suggesting the transition to a memory safe language as a national security objective. Although I am not an author of the report, I want to explain the rationale that someone might use to consider whether their infrastructure meets the memory safety recommendations. This is more of an executive-level overview than a technical guide for programmers.

Image credit: Wikimedia Commons, Whitehouse North.

On February 26, 2024, the White House released Statements of Support for Software Measurability and Memory Safety calling attention to a technical report from the Office of the National Cyber Director titled “Back to the Building Blocks: A Path Towards Secure and Measurable Software” (PDF link). The whole framework works like this: we ultimately want to be able to measure how good the software is (e.g. by giving it a score), and it relies on memory safety as a signal. The tech report also references another report published by the Cybersecurity and Infrastructure Security Agency (CISA) titled The Case for Memory Safe Roadmaps, which contains a list of memory safe language recommendations.

To supplement their list, I will be using the TIOBE index's top 20 most popular programming languages as examples: Python, C, C++, Java, C#, JavaScript, SQL, Go, Visual Basic, PHP, Fortran, Pascal (Delphi), MATLAB, assembly language, Scratch, Swift, Kotlin, Rust, COBOL, Ruby.

I will also throw in some of my personal favorites: LISP, Objective Caml, Haskell, Shell, Awk, Perl, Lua, and ATS-lang.

High Level Languages

Languages that do not expose memory access to the programmer tend to be memory safe. The reason is that this reduces the opportunity for programmers to make memory violation mistakes. When a program accesses a null pointer or an array with an out-of-bounds index, these languages raise an exception that can be caught, or return a sentinel value (0 or an undefined value), rather than silently corrupting memory content.

Examples: Python, Java, C#, JavaScript, SQL, Go, Visual Basic, PHP, MATLAB, Scratch, Kotlin, Ruby; LISP, Objective Caml, Haskell, Shell, Perl, Awk, Lua.

Under the hood, they employ an automatic memory management strategy such as reference counting or garbage collection, but programmers will have little to no influence over it because the language does not expose memory access.
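To make the failure mode concrete, here is how one of the listed languages, Go, surfaces an out-of-bounds access as a recoverable panic instead of silent memory corruption:

```go
package main

import "fmt"

// readIndex converts an out-of-bounds panic into an error,
// demonstrating that the runtime check fires before any
// memory is silently corrupted.
func readIndex(xs []int, i int) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return xs[i], nil
}

func main() {
	xs := []int{1, 2, 3}
	if _, err := readIndex(xs, 10); err != nil {
		fmt.Println(err) // recovered: runtime error: index out of range ...
	}
}
```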

It does not matter whether the language executes at the abstract syntax tree (AST) level, is compiled to a byte code, or is compiled to machine code. In general, any language could have a runtime implementation that spans the whole spectrum through just-in-time (JIT) compilation.

Things to Watch Out For

High level languages are prone to unsanitized data execution errors such as SQL injection, code injection, and most recently Log4j. These happen when user input is passed through to a privileged execution environment and treated as executable code. High level languages often blur the line between data and code, so extra care must be taken to separate data from code execution. Data validation helps, but ultimately data should not have influence over code behavior unless it is explicitly designed to do so.
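For SQL injection specifically, the standard mitigation is to keep user input in the data plane, e.g. with parameterized queries. A brief Go sketch using database/sql; the driver choice, DSN, and schema here are placeholders:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // one possible driver choice
)

func findUser(db *sql.DB, name string) (int, error) {
	var id int
	// The placeholder ($1) keeps `name` as data: even an input
	// like "x'; DROP TABLE users; --" is matched literally,
	// never executed as SQL.
	err := db.QueryRow(
		"SELECT id FROM users WHERE name = $1", name).Scan(&id)
	return id, err
}

func main() {
	db, err := sql.Open("postgres", "dsn-goes-here") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	id, err := findUser(db, "alice")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("user id:", id)
}
```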

I strongly oppose using PHP or any products written in PHP, which is particularly notorious for SQL and code injection problems and single-handedly responsible for all highly critical WordPress vulnerabilities. But if you inherited legacy infrastructure in PHP, there are principles that will help harden it.

Even though memory access errors are raised as exceptions, if those exceptions are not caught they can still cause the entire program to abort. These languages also allow potentially unbounded memory consumption leading to exhaustion, which causes the program to abort or suffer severely degraded performance, resulting in denial of service.

Some languages provide an “unsafe” module, which is essentially a backdoor to memory access. Using it is inherently unsafe.

Most languages also allow binding with an unsafe language through a foreign function interface (FFI) like SWIG. This allows high level code to run potentially unsafe code written in a non-safe language like C or C++.
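In Go, the built-in FFI is cgo, and it shows how small the escape hatch is: once control crosses into C, none of Go's safety checks apply. A minimal sketch (requires a C toolchain to build):

```go
package main

/*
#include <stdlib.h>
#include <string.h>
*/
import "C"

import "fmt"

func main() {
	// Allocate 8 bytes on the C heap, outside Go's garbage
	// collector and outside its bounds checking.
	buf := C.malloc(8)
	defer C.free(buf)

	// Nothing stops C code from writing past the allocation;
	// this call stays in bounds, but the compiler would not
	// object if it didn't.
	C.memset(buf, C.int('A'), 8)

	fmt.Println(C.GoStringN((*C.char)(buf), 8)) // AAAAAAAA
}
```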

Mid Level Languages

These languages expose some aspects of memory management to the programmer, such as explicit reference counting, and provide language facilities to make it safer.

Examples: Swift, Rust; also ATS-lang.

Performance is the main reason to use these languages: memory management overheads have a negative performance impact, and in some time-sensitive applications, we have to carefully control when to incur those overheads. The tradeoff is programmer productivity, since programmers have to worry about more things. Because performance is the main concern, these languages tend to be compiled into machine code before running in production.

I want to call out ATS-lang because it is a language I helped work on for my Ph.D. advisor, Hongwei Xi. It was conceived in 2002 and predated Rust (2015). ATS code can mostly be written like Standard ML or Objective Caml with fully automatic memory management, but it also provides facilities for manual memory management. Safety is ensured by requiring programmers to write theorems proving that the code uses memory in a safe manner (papers). The theorem checker uses stateful views inspired by linear logic to reason about the acquisition, transfer of ownership, and disposal of resources.

Things to Watch Out For

These languages are safe by virtue of the compiler checking for most programmer errors at compile time, but they still provide unsafe ways to access memory.

Furthermore, they are still prone to denial of service, SQL injection, and code injection vulnerabilities.

Low Level Languages

For legacy reasons, these languages require the programmer to manually handle all aspects of memory management. They are therefore inherently unsafe.

Examples: C, C++, Pascal.

Although garbage collection was invented by John McCarthy in 1959 for the LISP language, that concept did not gain mainstream adoption until much later.

Even so, there are a few strategies to make these languages more memory friendly.

  1. Use an add-on garbage collector like Boehm-GC. Note that object references stored in the system malloc heap are not traced for liveness, so care must be taken when using both GC malloc and system malloc.
  2. C++ code should use Resource Acquisition is Initialization (RAII) idiom as much as possible. The language already tracks object lifetime through variable scope. The constructor is called when an object is introduced into the scope, and the destructor is called when the object leaves the scope. Smart pointers like std::unique_ptr and std::shared_ptr use RAII to manage memory automatically.

My particular contribution in this field is my Ph.D. dissertation (2014), which proposed a different type of smart pointer in C++ that does not auto-free memory but still helps catch memory errors. I showed that it is practical to use by implementing a memory allocator with the proposed smart pointer.

Legacy Languages

I have omitted Fortran and COBOL from the lists above. Historically, Fortran and COBOL only allowed static memory management: all the memory to be used by the program is declared in the code, and the OS provisions it before the program is loaded for execution. However, they never had any array bounds checking, so they are not memory safe. Furthermore, attempts to modernize these languages with dynamic memory allocation exacerbated the problem, since they were never designed to be memory safe.

I have also omitted assembly languages as a whole. Assembly languages tend to be bare metal and provide completely unfettered memory access, but there has been some research on enhancing the safety of assembly languages (e.g. Typed Assembly Language).

Conclusion

There is no language that is completely memory safe. Some languages are safer by design because they either limit the programmer's ability to manipulate memory or give the programmer facilities to help them use memory in a safer way. However, almost all languages have ways to unsafely access memory in practice.

Memory safety is also not the only factor in security breaches. Executing data as code is the main culprit behind SQL and code injection, which lead to highly critical privilege escalation, penetration, and data leak attacks. One safety net we can provide is making code read-only, but this does not diminish the importance of good data hygiene.

Unbounded memory consumption can degrade performance or enable denial of service attacks, so memory usage planning must be considered in the infrastructure design.

In any case, language choice is not a panacea for memory safety or security problems. Programmer training and a culture that emphasizes good engineering principles are paramount. However, I would love to see renewed interest in programming language research to enhance language safety, designing languages that encourage good practices.

Further Resources