Friday, March 1, 2024

Memory Safety State of the Union 2024, Rationale Explained

There has been renewed interest in programming languages since the White House recently published a recommendation to transition to memory safe languages as a national security objective. Although I am not an author of the report, I want to explain the rationale that someone might use to assess whether their infrastructure meets the memory safety recommendations. This is more of an executive-level overview than a technical guide for programmers.

Image credit: Wikimedia Commons, Whitehouse North.

On February 26, 2024, the White House released Statements of Support for Software Measurability and Memory Safety calling attention to a technical report from the Office of the National Cyber Director titled “Back to the Building Blocks: A Path Towards Secure and Measurable Software” (PDF link). The whole framework works like this: we ultimately want to be able to measure how good the software is (e.g. by giving it a score), and the framework relies on memory safety as one signal for that measurement. The tech report also references another report published by the Cybersecurity and Infrastructure Security Agency (CISA) titled The Case for Memory Safe Roadmaps, which contains a list of memory safe language recommendations.

To supplement their list, I will be using the TIOBE index top 20 most popular programming languages to provide the examples: Python, C, C++, Java, C#, JavaScript, SQL, Go, Visual Basic, PHP, Fortran, Pascal (Delphi), MATLAB, assembly language, Scratch, Swift, Kotlin, Rust, COBOL, Ruby.

I will also throw in some of my personal favorites: LISP, Objective Caml, Haskell, Shell, Awk, Perl, Lua, and ATS-lang.

High Level Languages

Languages that do not expose memory access to the programmer tend to be memory safe, because hiding memory access reduces the opportunity for programmers to make memory violation mistakes. When a program dereferences a null pointer or indexes an array out of bounds, these languages raise an exception that can be caught, or return a sentinel value (such as 0 or an undefined value), rather than silently corrupting memory.
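As a minimal sketch of both behaviors in Python: the helper read_item and the sample data are made up for illustration, while IndexError and dict.get are standard Python.

```python
def read_item(items, index, default=None):
    """Return items[index], or `default` when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        # The runtime checked the bounds and raised an exception instead of
        # reading whatever happens to sit next to the list in memory.
        return default


readings = [10, 20, 30]  # hypothetical sample data
in_range = read_item(readings, 1)
out_of_range = read_item(readings, 7, default=-1)

# Dictionaries offer the sentinel style directly: a missing key yields None.
missing = {"a": 1}.get("b")
```

The out of bounds access never touches memory outside the list. The caller decides whether to recover from the exception or let it propagate.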

Examples: Python, Java, C#, JavaScript, SQL, Go, Visual Basic, PHP, MATLAB, Scratch, Kotlin, Ruby; LISP, Objective Caml, Haskell, Shell, Perl, Awk, Lua.

Under the hood, they employ an automatic memory management strategy such as reference counting or garbage collection, but programmers will have little to no influence over it because the language does not expose memory access.

It does not matter whether the language is executed by interpreting the abstract syntax tree (AST), by compiling to byte code, or by compiling to machine code. In general, any language could have a runtime implementation that spans the whole spectrum through Just In Time compilation.

Things to Watch Out For

High level languages are prone to unsanitized data execution errors such as SQL injection, code injection, and most recently Log4j. This happens when user input is passed through to a privileged execution environment and treated as executable code. High level languages often blur the line between data and code, so extra care must be taken to separate data from code execution. Data validation helps, but ultimately data should not have influence over code behavior unless it is explicitly designed to do so.
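To make the data versus code distinction concrete, here is a small sketch using Python's built-in sqlite3 module. The users table, its contents, and the two lookup functions are hypothetical. The only difference between the functions is whether the input is spliced into the SQL text or passed separately as a parameter.

```python
import sqlite3

# Hypothetical in-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def find_unsafe(name):
    # User input is spliced into the SQL text, so it can change the query.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def find_safe(name):
    # The placeholder keeps the input as a value; it is never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


malicious = "nobody' OR '1'='1"
leaked = find_unsafe(malicious)   # the WHERE clause becomes always true
blocked = find_safe(malicious)    # no user has this literal name
```

With the crafted input, the first function returns every secret in the table, while the second returns nothing because the whole string is compared as a name.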

I strongly oppose using PHP or any product written in PHP. The language is particularly notorious for SQL and code injection problems, and single-handedly responsible for all highly critical WordPress vulnerabilities. But if you inherited legacy infrastructure in PHP, there are principles that will help harden it.

Even though memory access errors are raised as exceptions, an exception that is not caught can still cause the entire program to abort. These languages also still allow potentially unbounded memory consumption leading to exhaustion, which causes the program to abort or suffer severely degraded performance, leading to denial of service.
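One common mitigation is to cap how much input a program is willing to hold in memory, and to handle the resulting error instead of letting it abort the program. The following is only a sketch: MAX_BYTES, read_limited, and handle_request are hypothetical names, and the limit is arbitrary.

```python
import io

MAX_BYTES = 1024  # hypothetical limit; a real service would tune this


def read_limited(stream, max_bytes=MAX_BYTES):
    """Read the whole stream, but refuse input larger than max_bytes."""
    # Ask for one byte more than allowed so oversized input is detectable
    # without ever holding more than max_bytes + 1 bytes in memory.
    data = stream.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError("input exceeds %d bytes" % max_bytes)
    return data


def handle_request(stream):
    # Catch the error at the request boundary so that one bad request
    # does not take down the whole program.
    try:
        return "ok: %d bytes" % len(read_limited(stream))
    except ValueError as err:
        return "rejected: %s" % err


small = handle_request(io.BytesIO(b"x" * 100))
large = handle_request(io.BytesIO(b"x" * 5000))
```

The oversized request is rejected with an error message, and the program keeps running to serve the next one.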

Some languages provide an “unsafe” module which is essentially a backdoor to memory access. Using them is inherently unsafe.

Most languages also allow binding with an unsafe language through a Foreign Function Interface (FFI), often with the help of a binding generator like SWIG. This allows the high level code to run potentially unsafe code written in a non-safe language like C or C++.
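Python's standard ctypes module is one such backdoor, and it shows how quickly the safety checks disappear. In the sketch below the variable names are made up, and the four-element window is deliberately laid over a larger eight-element buffer so that the unchecked read stays inside memory the program owns. In real code the same pointer read could land anywhere.

```python
import ctypes

# Eight ints of backing storage holding the values 0..7.
backing = (ctypes.c_int * 8)(*range(8))

# A four-element array that shares the first half of that storage.
window = (ctypes.c_int * 4).from_buffer(backing)

# Indexing the ctypes array is still bounds checked.
try:
    window[4]
    array_checked = False
except IndexError:
    array_checked = True

# Casting to a raw pointer discards the length, so the same index is now
# accepted and reads memory past the logical end of the window.
ptr = ctypes.cast(window, ctypes.POINTER(ctypes.c_int))
past_the_end = ptr[4]
```

The array rejects index 4, but the pointer happily returns the fifth element of the backing storage. Nothing in the language stops the pointer from reading or writing further out.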

Mid Level Languages

These languages expose some aspects of memory management to the programmer, such as explicit reference counting, and provide language facilities to make it safer.

Examples: Swift, Rust; also ATS-lang.

Performance is the main reason to use these languages: memory management overhead hurts performance, and in some time-sensitive applications we have to carefully control when to incur it. The tradeoff is programmer productivity, since programmers have more things to worry about. Because performance is the main concern, these languages tend to be compiled into machine code before running in production.

I want to call out ATS-lang because it is a language I helped work on for my Ph.D. advisor, Hongwei Xi. It was conceived in 2002 and predates Rust (2015). ATS code can mostly be written like Standard ML or Objective Caml with fully automatic memory management. It also provides facilities to do manual memory management. Safety is ensured by requiring programmers to write theorems to prove that the code uses the memory in a safe manner (papers). The theorem checker uses stateful views inspired by linear logic to reason about acquisition, transfer of ownership, and disposal of resources.

Things to Watch Out For

These languages are safe in the sense that the compiler can catch most programmer errors at compile time, but they still provide unsafe ways to access memory.

Furthermore, they are still prone to denial of service, SQL injection, and code injection vulnerabilities.

Low Level Languages

For legacy reasons, these languages require the programmer to manually handle all aspects of memory management, which makes them inherently unsafe.

Examples: C, C++, Pascal.

Although garbage collection was invented by John McCarthy in 1959 for the LISP language, that concept did not gain mainstream adoption until much later.

Even so, there are a few strategies to make these languages safer with memory.

  1. Use an add-on garbage collector like Boehm-GC. Note that object references stored in the system malloc heap are not traced for liveness, so care must be taken when using both GC malloc and system malloc.
  2. C++ code should use the Resource Acquisition Is Initialization (RAII) idiom as much as possible. The language already tracks object lifetime through variable scope. The constructor is called when an object is introduced into the scope, and the destructor is called when the object leaves the scope. Smart pointers like std::unique_ptr and std::shared_ptr use RAII to manage memory automatically.

My particular contribution in this field is my Ph.D. dissertation (2014), which proposed a different type of smart pointer in C++ that does not auto-free memory but still helps catch memory errors. I showed that it is practical to use by implementing a memory allocator with the proposed smart pointer.

Legacy Languages

I have omitted Fortran and COBOL from all of the lists above. Historically, Fortran and COBOL only allowed static memory management. All the memory to be used by the program is declared in the code, and the OS provisions it before the program is loaded for execution. However, they never had any array bounds checking, so they are not memory safe. Furthermore, attempts to modernize these languages with dynamic memory allocation exacerbated the problem, because the languages were not designed to be memory safe.

I have also omitted assembly languages as a whole. Assembly languages tend to be bare metal and provide completely unfettered memory access, but there has been some research to enhance the safety of assembly languages (e.g. Typed Assembly Language).

Conclusion

There is no language that is completely memory safe. Some languages are safer by design, either because they limit the ability of the programmer to manipulate memory, or because they give the programmer facilities to use memory in a safer way. However, almost all languages have ways to unsafely access memory in practice.

Memory safety violations are also not the only cause of security breaches. Executing data as code is the main reason behind SQL and code injection, and these lead to highly critical privilege escalation, penetration, and data leak attacks. One safety net we can provide is making code read-only, but this does not diminish the importance of good data hygiene.

Unbounded memory consumption can degrade performance or lead to denial of service, so memory usage planning must be part of the infrastructure design.

In any case, the language is not a panacea for memory safety or security problems. Programmer training and a culture that emphasizes good engineering principles are paramount. However, I would love to see a renewed interest in programming language research to enhance language safety, by designing languages that encourage good practices.

