Advantages of generating high-level code during compilation
When you click "Simulate" on a scenario at sedaro.com, compilation begins. First we gather all the models, parameters, and M&S libraries required to run a full simulation. From these, we generate the simulation code, which gets distributed across our cloud infrastructure. We call the units of generated code "bundles," each of which either runs a piece of a simulation or kicks off multiple other bundles. In the case of co-simulation jobs, we'll deliver one top-level bundle to each party involved, enabling simulation across multiple parties and domains without the need to share proprietary code or resources.
We have decided to generate code bundles in high-level languages like Rust and Python, rather than targeting a lower-level representation such as LLVM IR or another pre-existing intermediate representation (IR) made for this purpose. The original motivation for this decision was practical and driven by available time and resources. However, we've discovered several additional benefits to this approach, which we will share here.
Advantages
One requirement that differentiates simulation platforms from game engines is deterministic, repeatable execution, even in distributed settings. Simulating a given scenario multiple times without changes must always yield the same results. This increases visibility into simulations, eases debugging, enables caching across simulation runs, and supports large-scale Monte Carlo studies. To guarantee this determinism, we need to make sure all of the modeling and simulation (M&S) libraries and additional dependencies of a generated bundle remain unchanged. High-level languages provide tools for dependency management that can help us enforce this. By generating dependency files like "Cargo.lock" or a fully pinned "requirements.txt" and utilizing the target language's tools, we get this dependency management for free.
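As a minimal sketch of what generating a pinned dependency file could look like, the function below renders a "requirements.txt" body from a mapping of package names to exact versions. The function name, package names, and versions are hypothetical and chosen for illustration; this is not Sedaro's actual generator.

```python
from typing import Mapping


def render_requirements(pins: Mapping[str, str]) -> str:
    """Render a requirements.txt body with exact version pins.

    `pins` maps package names to exact versions. Entries are sorted so the
    same input always produces byte-identical output.
    """
    lines = [f"{name}=={version}" for name, version in sorted(pins.items())]
    return "\n".join(lines) + "\n"


# Hypothetical dependency set for one bundle (names and versions are illustrative).
print(render_requirements({"numpy": "1.26.4", "example-ms-lib": "2.3.1"}), end="")
```

Sorting the entries keeps the generated file itself reproducible, so two compilations of the same scenario can be compared byte for byte.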
We perform a variety of validations on the IR from which code is generated. One important procedure is type checking and inference. We could implement our own inference algorithms, but this can be tricky. Furthermore, their correctness would depend on how closely our implementation conforms to the target language's own type checker, and the two would need to evolve together. We could utilize generalized algebraic data types in our compiler implementation to offload type checking to the implementation language's type checker, but this would severely restrict our choice of implementation language. Instead, we can simply generate code with type annotations in place and run the target language's type checker on that code. We can then map type errors back to something that the user would understand and present them in the browser and logs.
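The idea of emitting annotated code and mapping checker errors back to the user's model can be sketched as follows. Everything here is an assumption made for illustration: the `Field` node, the one-line-per-node layout, and the diagnostic shape `file:line: error: message` (the default style printed by checkers such as mypy). The sketch does not invoke a type checker itself; it only shows the source map that makes the translation possible.

```python
import re
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical IR node: a named, typed value with a default expression."""
    node_id: str    # identifier the user would recognize from their model
    name: str       # identifier emitted into the generated code
    type_name: str  # target-language type annotation, e.g. "float"
    default: str    # target-language expression


def emit(fields: list[Field]) -> tuple[str, dict[int, str]]:
    """Emit annotated assignments plus a map from output line number to IR node."""
    lines, source_map = [], {}
    for line_no, field in enumerate(fields, start=1):
        lines.append(f"{field.name}: {field.type_name} = {field.default}")
        source_map[line_no] = field.node_id
    return "\n".join(lines) + "\n", source_map


# Matches diagnostics shaped like "bundle.py:2: error: message".
DIAGNOSTIC = re.compile(r"^[^:]+:(?P<line>\d+): error: (?P<message>.*)$")


def explain(diagnostic: str, source_map: dict[int, str]) -> str:
    """Rewrite a checker diagnostic in terms of the IR node that caused it."""
    match = DIAGNOSTIC.match(diagnostic)
    if match is None:
        return diagnostic  # unknown shape: pass it through unchanged
    node = source_map.get(int(match["line"]), "<unknown node>")
    return f"{node}: {match['message']}"
```

In use, the generated source would be written to a file, the target language's checker run over it, and each line of checker output passed through `explain` before it reaches the browser or logs.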
Another way to leverage the compiler tools of the target language is by taking advantage of its optimizations. Sure, LLVM has a massive set of optimization passes, but high-level languages often implement additional bespoke optimizations. For example, when we generate well-written Rust code, Rust's powerful type system and borrow checker facilitate optimizations that would otherwise be impossible. Rust's compiler utilizes at least three IRs in addition to those used by LLVM, each of which codifies invariants not easily reverse-engineered from downstream representations. By targeting LLVM directly, we'd miss out on these additional layers of optimization.
In addition to compiler features, high-level languages exist in an ecosystem of development tools that we can also leverage. Generated code can be inspected in integrated development environments backed by powerful language servers that offer insightful code navigation features. When problems arise, standard debuggers can be used to understand the causes and identify potential solutions. Generated code can even be directly modified for experimentation's sake. When planning out changes to the code generator, doing these kinds of manual edits first before automating them can save significant development time.
While all of these tools exist for low-level languages, there are still advantages to using them with high-level languages. Anyone within the team can inspect the generated code and propose modifications, even if they do not directly contribute to the compiler. Any team member making changes to the inputs provided to the compiler can directly view the effects of their changes on the generated code, for example with a diffing tool. This wouldn't be the case if we introduced a new language such as LLVM IR to serve as a compilation target. Keeping the number of languages used within an organization low is generally of great value for team cohesion and collaboration. Tooling for high-level languages also tends to be highly ergonomic and well-maintained because of the competition between languages to attract new users.
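As a small illustration of the diffing workflow, the sketch below compares two versions of a generated file using Python's standard `difflib` module. The file labels and the example contents are hypothetical.

```python
import difflib


def diff_generated(before: str, after: str) -> str:
    """Return a unified diff between two versions of a generated file."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="bundle.py (before)",
            tofile="bundle.py (after)",
        )
    )


# Hypothetical generated output before and after changing a compiler input.
print(diff_generated("step: float = 1.0\n", "step: float = 0.5\n"), end="")
```

Because the generated code is ordinary source text in a language the team already reads, the same diff can be reviewed in any existing code review tool.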
One final piece of a language's ecosystem that we intend to leverage via code generation is testing frameworks. What can we gain by generating test suites along with generated code? For one, test cases serve as a key form of documentation, demonstrating how functions and services are intended to be used. This can be valuable when code bundles are delivered to customers for co-simulation or on-premise use. Tests might also be used to check properties of the black box functions that plug into the generated code in lieu of performing static analysis on them. In our case, this could include checking that the M&S functions and converters being used by simulations are themselves deterministic. We might also utilize tests as a way to guarantee that certain pre-conditions for executing simulations are met. For example, they might check that dependent services of the simulation are accessible and responsive or proactively detect regression in updated dependencies.
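A determinism check of the kind described above might look like the following sketch, which builds a `unittest` test case around an arbitrary function. The `propagate` function is a hypothetical stand-in for a black-box M&S function, and the helper names are assumptions. Note that a test like this only samples behavior: passing it is evidence of determinism for the given inputs, not proof.

```python
import unittest
from typing import Callable


def make_determinism_test(fn: Callable, args: tuple, runs: int = 3) -> type:
    """Build a TestCase class asserting `fn(*args)` returns the same value every run."""

    class DeterminismTest(unittest.TestCase):
        def test_repeatable(self) -> None:
            first = fn(*args)
            for _ in range(runs - 1):
                self.assertEqual(fn(*args), first)

    return DeterminismTest


def passes(case: type) -> bool:
    """Run a generated TestCase class and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


def propagate(position: float, velocity: float, dt: float) -> float:
    """Stand-in for a black-box M&S function (hypothetical)."""
    return position + velocity * dt


print(passes(make_determinism_test(propagate, (0.0, 1.0, 0.5))))
```

A code generator could emit one such test per black-box function it links into a bundle, using representative inputs drawn from the scenario being compiled.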
Considerations
All of these advantages of generating high-level code are directly analogous to advantages of manually writing high-level code. Nearly all software engineers choose to produce code in high-level languages instead of assembly for good reason. Yet, small compilers for domain-specific languages often make the opposite choice. Of course, generating high-level code is not always the right decision. Compilation itself will necessarily be slower because code must be translated through additional representations. You'll only ever have as much control over data representation and execution as is provided by the target language. If you need to do specific bit-level manipulation, for example, you should obviously avoid languages that abstract away such details.
I will end with a couple of warnings. Compilation to a high-level language can be viewed as a “shallow embedding” of the source language in the target language: we map concepts in the source language to concepts in the target language. I'm advocating for this approach with regard to the implementation of the compiler, not the user-facing interface of the language. Once you start leaking target language features directly to authors of your language, it can become very difficult to evolve your language away from the target or port to a new target language. Lastly, please sanitize your identifiers to avoid injection attacks. Generating syntax trees, parse trees, or token trees instead of raw text can help with this.
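To make the last warning concrete, here is a minimal sketch that targets Python and uses its standard `ast` module (with `ast.unparse`, available from Python 3.9). It validates a user-supplied name before building an assignment as a syntax tree rather than by string concatenation. The function names are hypothetical, and rejecting invalid names outright is only one possible policy; a real generator might instead rename or mangle them.

```python
import ast
import keyword


def safe_identifier(name: str) -> str:
    """Return `name` if it is a plain Python identifier, else raise ValueError."""
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"not a valid identifier: {name!r}")
    return name


def emit_assignment(name: str, value: float) -> str:
    """Build `name = value` as a syntax tree, then render it to source text."""
    node = ast.Assign(
        targets=[ast.Name(id=safe_identifier(name), ctx=ast.Store())],
        value=ast.Constant(value=value),
    )
    module = ast.fix_missing_locations(ast.Module(body=[node], type_ignores=[]))
    return ast.unparse(module)


print(emit_assignment("step", 0.5))

# A name carrying extra code is rejected instead of being pasted into the output.
try:
    emit_assignment("x = 1; import os  #", 0.0)
except ValueError as error:
    print(error)
```

With raw text generation, the second call would have produced a line that imports a module the user chose. With a tree, the name can only ever occupy the slot of a single identifier, and the validation step ensures it is one.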
At Sedaro, we're excited about giving engineers the tools they need to design digital twins of their components, agents, and scenarios. We're not in the business of developing compiler optimizations, but we desire the performance benefits that compilation provides. By generating code in high-level languages and letting downstream compilers and tools do the heavy lifting, we can iterate rapidly and focus on developing new features for our users.