Complete month of July 2022

State of the Journal

Since I am always focused on my work on lovem, I will never get sidetracked. Unrelated: I spent a few days on reworking the journal on this site.


So, no update on the core project today, sorry. I was very unhappy with my first solution for how the Journal entries were created. Way too much to do by hand – that is not what I learned programming for. But mkdocs is Python, and Python I can do. So I did. And now I can write my Journal entries (like this one) as plain Markdown files with very few metadata entries. And I get entries in the navigation, and pages listing the whole month. I even included a single-page version of the journal for each whole month. I feel it is quite fancy. I will need to do a bit of work on the static content of the site, but one step at a time.

What I want

I want to write my Journal entries (aka blog posts) as nice standalone Markdown files, one file per entry. I will need to include a bit of metadata, at least the release date/time. And I want the entries to look fancy without adding the fanciness to each file. Maybe I will be changing the layout later, hmm? And create those teaser pages for me, thank you very much.

And I have all that, now! Just look at the source that is used to generate this entry.

How it works

I use a plugin called mkdocs-gen-files, by @oprypin, that creates additional mkdocs source files on the fly. It does not actually put the files on disk, but they are parsed by mkdocs as if they were in the docs directory.

I have a directory journal next to my docs directory, where I put all my posts, one Markdown file each. My script walks through that directory and processes each file. The content is modified a bit (to put in the card with the author's name and other metadata) and then put in a virtual file inside docs, so that the pages with the entries are created by mkdocs as if I had them inside docs.

The script also generates two pages for each month: one that shows that month's posts as teasers, with a "continue reading" link, and a second one that shows all posts from a month on a single page, so that you can read them without changing pages all the time.

The remaining part is adding all the pages that the script creates to the navigation, in a way that makes sense. The order is critical, being a central aspect of a journal or a log. For that I use another plugin by @oprypin: mkdocs-literate-nav. With it, you can control your navigation (completely or in parts) by adding Markdown source files with lists of links. This goes together well with the gen-files plugin, because I can just create those navigation files in my script.

The plugins are a bit light on documentation. It took me a while to understand that you cannot do multiple layers of nested navigation in those files. That is not a problem, because you can always add another nesting layer by adding more of those nav files as children. Also, what you can do in those files is very limited. I wanted to do some fancy things in the navigation (adding a second link in a single line with an alternative representation). I would guess that those limitations come from the way mkdocs itself handles navigation, so that is okay. But a word on that would have been nice. And the error messages popping up did not help at all, because the actual error happens way later in the process inside mkdocs itself and is some weird side-effect problem.
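For illustration: such a nav file is nothing but a Markdown list of links. A fragment like the ones my script generates could look like this (all file names here are invented for the example):

```markdown
- [State of the Journal](2022-07-31-state-of-the-journal.md)
- [All new once more](2022-07-30-all-new-once-more.md)
- [Complete month of July 2022](2022-07.md)
```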

The script

If you want to take a look, see blogem.py. That link will always show the script in its current state. For the version of the script at the time of writing, see the permalink: the original blogem.py.

TODOs

  • Automated reload in mkdocs serve when I edit entry sources.
    just add parameter -w journal to mkdocs serve
  • Exclude journal overview and full month pages from search.
  • Exclude NAV.md from generating NAV.html.
  • Maybe add tags and/or categories for posts?
  • Maybe enable comments, as in material's blog.
  • Add links to source in github repo.
  • Add links to entry's history in github repo.
  • Support multiple posts per day (by adding time to "released").

All new once more

Reality strikes again, and code will be written from scratch once more. And the reason is this site.


You want me to get to the code. And I really should. I have written so much already, and I want to show it, but there is so much around it. And after I had written up a long text on how I started, I realised that I had no commits from that early stage. So I had to write it all again, slower, and with code presentable in this journal.

If you are reading this live (and no-one is, because I did not even tell anyone I am doing this), you can of course look at the code I was writing earlier; it exists. I put it in a branch too-early. But I will not give explanations for it. I am rewriting it on the master branch, and that is what will be shown and discussed in the journal. I advise you to wait for that.

Yes, it will take a while. As it looks now, it will be slow. But I have written some new posts on the new code already, and I think it is worth it. There will be more background before we get there. Next entry will be a longer one, so there is that.

What is a Virtual Machine anyway?


So, how do you build a virtual machine? There are actually two quite different approaches:

  • Register Machines
  • Stack Machines

Let's take a look at those concepts first. This will be very brief and basic. You can, of course, also have some combination of those concepts, and not everything I say here is true for every implementation of a virtual machine, but it will be close enough for this article.

Register Machines

Most physical computers are register machines. At least those you will be thinking of; you are most likely using one right now to read this article. Virtual register machines use the same concepts, but not in physical hardware – they live inside another computer, as software. This allows them to do some things more flexibly than a real hardware machine could.

A register is nothing more than a dedicated place to store a portion of data where it can be accessed for direct manipulation. It is more or less a variable of the machine's basic data type that has a fixed address and that can be accessed and manipulated directly by the processing unit. Register machines use those to actually compute and change data. All other storage places are only that: places where data is put while it is not needed. Register machines have a multitude of registers, from very few (maybe 4 or 8 in simplistic designs) to hundreds or more in modern computers. The size of the registers often gives the architecture its name. E.g. in the x86-64 architecture, which most current CPUs by Intel and AMD implement, a register is 64 bits long.

The instructions for a register machine are encoded in code words. A code word is a bunch of bytes that tell the machine what to do in the next program step. In simple designs, code words have a fixed length. This code word length is often longer than the register size, so a 16-bit architecture could have 32-bit instructions. The reason is that instructions consist of an operation code that defines what operation should be executed in the next step, but they also contain the arguments passed to that operation. Because the number and size of arguments differ between operations, decoding an instruction can be quite complicated. When you put multiple instructions together, you end up with a program. This representation of a computer program is called machine code. For a virtual machine it is also called bytecode, although I think this term fits better for stack machines (more on that later).

If you want to understand what I tried to describe here, read this really short article: Creating a Virtual Machine/Register VM in C. It builds a simplistic register VM in C (the whole thing is 87 lines long). It demonstrates the principles used in a register machine (fetch, decode, execute), and shows you what a register is and how it is used. You will understand how machine code is decoded and executed. The article only uses 16-bit code words and 16-bit data words (register size). If you know C, you should be able to understand what I am talking about in about an hour of reading and coding. If you ever wanted to understand how a computer works on the inside, this might be a nice place to start, before you read about an actual physical computer.

A register machine normally has multiple stacks it uses. This does not make it a stack machine; those are just needed to store data that is not currently used.

So typical operations would be:

  • "Take the number from register 0, take the number from register 1, add those two numbers together, write the result in register 0."
  • "Take the lower 16 bits of this instruction and write them in register 2."
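The fetch-decode-execute cycle behind such operations can be sketched in a few lines of Rust. This is a completely invented encoding for illustration (not lovem's code, and not the article's): the opcode sits in the top nibble of a 16-bit code word, two register numbers in the next nibbles, and an 8-bit immediate value in the low byte.

```rust
// Invented opcodes for this sketch:
const LOADI: u16 = 0x1; // write the immediate (low byte) into register a
const ADD: u16 = 0x2;   // add register b into register a
const HALT: u16 = 0x0;  // stop execution

/// Run a program: fetch a code word, decode it, execute it – repeat.
fn run(program: &[u16]) -> [i64; 4] {
    let mut regs = [0i64; 4]; // four registers
    let mut pc = 0;
    loop {
        let word = program[pc]; // fetch
        pc += 1;
        let op = word >> 12;                  // decode the opcode...
        let a = ((word >> 8) & 0xf) as usize; // ...and the register numbers
        let b = ((word >> 4) & 0xf) as usize;
        match op {
            // execute
            LOADI => regs[a] = (word & 0xff) as i64,
            ADD => regs[a] += regs[b],
            HALT => return regs,
            _ => panic!("unknown opcode"),
        }
    }
}

fn main() {
    let program = [
        0x1005, // LOADI r0, 5
        0x1117, // LOADI r1, 23
        0x2010, // ADD r0, r1  ("take r0 and r1, add them, write the result to r0")
        0x0000, // HALT
    ];
    println!("r0 = {}", run(&program)[0]); // r0 = 28
}
```

Note how decoding is all bit shifting and masking – with variable-length arguments, as in real architectures, this gets much hairier.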

Lua and Neko are virtual register machines (at least in current versions).

Stack Machines

And then there are stack machines. They are, I think, easier to understand than register machines, but following a program during execution is more confusing, since it is harder to track the data being manipulated.

A stack is just a pile of data. Data is portioned in fixed sizes, a portion is called a word. All you can normally do is put a word on top of the stack - we will call that operation a push, or you can take the word that is currently on top of the stack (if there is one) - we will call that a pop. No other direct manipulations of the stack are allowed (I say "direct manipulations", because indirectly there often are ways that this is done, but that is a detail for later).

Manipulation of data is done this way by the machine. If you want to add two numbers, say 5 and 23, you would write a program that does this:

  1. Push the first number to the stack.
  2. Push the second number to the stack.
  3. Execute the "ADD" operation.

That operation will pop the two numbers from the stack, add them, and push their sum back on the stack (so that after the operation there will be one word less on the stack).
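The three steps above could look like this in Rust, with a plain Vec standing in for the stack (just a sketch to make the idea concrete, not the VM we will build later):

```rust
/// Add 5 and 23 the stack machine way: push both numbers,
/// then "execute ADD" by popping two words and pushing their sum.
fn add_on_stack() -> Vec<i64> {
    let mut stack: Vec<i64> = Vec::new();
    stack.push(5);  // 1. push the first number
    stack.push(23); // 2. push the second number
    // 3. the ADD operation: pop both operands, push the sum
    let b = stack.pop().unwrap();
    let a = stack.pop().unwrap();
    stack.push(a + b);
    stack // one word less on the stack than before ADD: just [28]
}

fn main() {
    println!("{:?}", add_on_stack()); // [28]
}
```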

A stack machine will also typically have some additional place to store words when you do not need them on the stack. These places can relate to variables inside a program.

As you can see from the example above, instructions in a stack machine often do not need to have arguments. If data is to be manipulated, it is always on top of the stack. There is no need to address its location, as you would do in a register machine.

Because of this, the instructions for a stack machine are typically encoded in a single byte. This byte holds a number we will call opcode (short for operation code), that simply identifies the operation to execute. If your operation does need additional arguments, you write them to the bytes following your opcode byte (the oparg), so that the operation can read them from your program. This structure of single bytes encoding our program is why we call this representation bytecode.
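So the 5 + 23 program from above fits into just five bytes. A sketch with invented opcode numbers (lovem defines its own constants, shown in a later entry) – the little helper here walks the bytes and shows why an oparg byte is data, not an instruction of its own:

```rust
// Invented opcode numbers for this sketch:
const PUSH: u8 = 0x02; // followed by a one-byte oparg (the value to push)
const ADD: u8 = 0x10;  // no oparg

/// Count the instructions in a bytecode program by stepping over
/// each opcode and skipping its oparg bytes.
fn count_instructions(pgm: &[u8]) -> usize {
    let mut pc = 0;
    let mut count = 0;
    while pc < pgm.len() {
        let opcode = pgm[pc];
        pc += 1;
        if opcode == PUSH {
            pc += 1; // skip the oparg byte
        }
        count += 1;
    }
    count
}

fn main() {
    // "push 5, push 23, add" – five bytes, but only three instructions:
    let pgm = [PUSH, 5, PUSH, 23, ADD];
    println!("{} bytes, {} instructions", pgm.len(), count_instructions(&pgm));
}
```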

The concept of a stack machine is easy to implement in software, but it is not so easy to do so in hardware. That is why your typical computer is a register machine. There are, however, a lot of historical examples of important physical stack machines.

The most famous example of a virtual stack machine is the Java VM. Java source code is compiled to bytecode that is executed inside a virtual machine, the JVM. This VM is so common that many newer programming languages compile to Java bytecode. That makes it possible to run programs written in those languages on any system that has a JVM – and that includes just about every major and many minor computer systems. A second example of a stack machine is the Python VM.

Some random thoughts on register and stack machines

While writing this down, describing the two kinds of machines, I couldn't help but notice a curious fact:

A register machine manipulates data inside addressable registers. When the data is not needed, it can be stored away in some kind of stack.

A stack machine manipulates data inside a stack. When the data is not needed, it can be stored away in some kind of addressable spaces, not unlike registers.

It looks as if you just need both concepts to work efficiently.

Making virtual a reality

So I have been talking a lot about VMs without doing anything concrete. Well, that is not true – I have done quite a bit already, but I am still describing earlier steps. We will get there.


Registers?

When I was looking around for a scripting language to use inside our embedded devices, I came across an article I mentioned in an earlier post: Creating a Virtual Machine/Register VM in C.

Reading it made me want to try working with a register machine, mainly because I have not done stuff like this since my early semesters. Never hurts to refresh rusty knowledge.

So I started designing a register VM, starting from that code, but more complex: longer data words and longer instruction words, more registers, and so forth. For this project I came up with lovem as a working title. The name has stuck until now, two approaches and a year later. I also started implementing some concepts I still want to add to lovem in my current approach, but that is for a later post to discuss.

I was experimenting with a quite complicated instruction word encoding. I was trying to fit everything in a few bits (32 of them, if I recall correctly) with varying instruction code length and quite long arguments. I wanted to include instructions on three registers, which takes up quite some bits to address. Of course, you can get away with two-register operations only – or, if you are fancy, you can even use a single address or no address at all for most instructions. You will just end up with a lot of register swapping. I guess my rationale for having three addresses in an instruction was code size. For what I want to do, 32-bit instruction words feel quite long (4 bytes per instruction!). And every swap would mean another 4 bytes of program size. So I was trying to optimise for fewer operations by having more flexible instructions.

I do not even know if that rationale makes sense. I guess I would have needed to try different layouts to find out. Or maybe read more about the topic; other people have done similar things, I assume. But I never got that far. The experiment showed me that I do not want to build lovem as a register machine. I think building a clever register based architecture for my goals would make it too complicated. I want simple. To reduce the VM's overhead, but also on principle. Complexity is the enemy.

I'm pretty sure that code still exists somewhere, but there is no sense in publishing it or even in me reading it again, so you will never see it. I think of it as a pre-study with a very useful conclusion: not a register machine.

Stacks!

So a stack machine it is! I looked at a few during my research for lovem, studying instruction sets and design ideas. It is not the first time I have worked with those. In a different project (around the same time I started work on the register based machine), I was starting to implement a stack machine. That one had a different aim and therefore very different challenges. It was more of an object-oriented approach with dynamic program loading and calling code in different programs. It could do quite a few things already, but it will never be continued. I learned a bit about calling conventions and found out that it is not so simple when you want to switch between multiple programs and objects. That is where the project got too frustrating for me (and some external events made it obsolete, so that is okay). But I take it for a pre-study on stack machines and calling conventions. Not that I have developed a proven concept for it, but I know about the problems there...

I had a PoC for lovem as a stack machine back then, too (right after I ditched the register approach). That code won't be published either, but the attempt showed me that this is the road I want to take for a serious approach to creating lovem.

Onwards

I guess this concludes the prehistory of the lovem story. I am, for whatever reason, back on the project, currently with a decent amount of motivation. You never know how long that lasts, but right now I like the idea of continuing the development, while talking about the development process, sharing my thoughts on decisions I make. Next post should start on sharing newer thoughts.

Let there be source code

Finally, I will be showing some source code. Not directly in the journal, but I will link you to GitHub, for a start.


I have written code. And this time, I (re-)started lovem in a public git repository, so you can see what I do, if you are interested. And I hope it puts enough pressure on me to keep at the project for a while.

In fact, there is quite a bit of code there already. I started coding before writing any of this, and it went so well. I like how it feels. I was working every hour I could spare. When a friend asked me what I was doing, I started on a somewhat complex backstory of why I was doing it, instead of actually explaining any of the stuff I was doing – and was interrupted quite early, so there was still more to tell. The next day, I sat down and started to write all of that down as a little story. I wanted to put it somewhere, so I started this journal to publish it. And I decided to do it in blog form, so I am publishing that background story bit by bit.

So, as of writing this, there is a lot of work completed on the VM. It is amazing what it can do for how little code there is. When this post goes public, there should be quite a lot more done...

But where is the code?

Well, if you read this journal, you will know where it lives. Anyway, this is the repo:

https://github.com/kratenko/lovem

I plan to continue sharing my thoughts while I work on the VM. So you will be able to follow my failures and see the attempts that I will be ditching later. I think the format of this journal can work out, but we will see how I like it over time. It will be behind on progress, as I want to take time to share things as they unfold. And this should help to produce a somewhat continuous publication stream. Git being what git is, it should support me in showing you the things I did back in time, using the power of commits.

As things are with blogs, my entries will be very different, depending on what I want to tell and on what I did. So far most posts were conceptual thinking, some research, and a lot of blabla, which I tell because it interests me. In the future, there should be concrete problems that I find and solve in source code – or that I fail to solve.

Back in time

My original first commit was way too late and contained way too much code. Also, I did not plan to show it to you like this, back then. So, as mentioned before, I rolled back and started again, with more commits. And I am keeping tags now, so that I have well-defined versions for my blog posts. That should make it easy for you to follow up, if you want to.

The new, artificial "first commit" is now a tag/release: v0.0.1-journey. You can view the code for any tag online, this one you will find under:

https://github.com/kratenko/lovem/tree/v0.0.1-journey

I think this will be a theme of this journal: linking you to what I did, when I am writing about it. And I will try to share my trains of thought leading to my decisions (and errors, as there will be). I will do that for v0.0.1-journey soon, don't worry – I will explain everything I did. But the next journal entry will be about some decisions again; mainly about the language I am using.

It looks so weird

Now that you have seen some code, I might have to explain a bit again. Depends on where you are coming from, I guess.


So, did you take a look at the code, yet? In case you've forgotten, this is my "initial commit":

https://github.com/kratenko/lovem/tree/v0.0.1-journey

It is not the original initial commit, as I committed way too late, and that was not suitable for writing a story about it. So I created a new, clean version, with just very simple concepts that I can explain in a single entry. In the next entry, that is.

If you are thinking: "What is that weird source code?", then you are in for a real treat (and a lot of pain), should you choose to follow up. The code you are seeing is written in Rust.

Once again: but why?

Why Rust? Because Rust! Writing Rust can feel so good! And for something like a VM, it is such a good choice. If you have never heard of the language (or heard of it, but never looked into it), it is hard to understand why that is. My advice: try it! Use it! Or read along with this journal, code along – you might like it.

When you start, chances are high that you will not like Rust. The compiler is a pedantic pain in the ass. But at the same time it is incredibly polite, trying to help you find out what you did wrong and suggesting what you might want to do instead. And Rust really, really tries to keep you from shooting yourself in the foot. It tries to make common mistakes impossible, or at least hard to do – those mistakes that happen everywhere in C/C++ programs and their like. Yes, those mistakes that are the cause of the majority of all security problems and crashes. Buffer overruns, use after free, double free, memory leaks – to name just some common ones off the top of my head. And Rust does all it can to make those mistakes impossible at compile time, so it does not even add runtime overhead. That is so powerful!

And it is so painful. Half of the things you do when writing C/C++, you will not be able to do in Rust in the same way. Every piece of memory is owned. You can borrow it and return it, but it cannot be owned in two places at once. And if any part of the program has write access to it, no other part may have any access. This makes some data structures complicated or impossible (there are ways around it), and you will have to think quite differently. But if you give in to that way of thinking, you can gain so much. Even peace of mind, as the coding world will look a lot saner inside Rust source code. This will, of course, come at the price that all code in other languages will start to feel dirty to you, but that is the way.
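The borrowing rules in a small sketch (a hypothetical function, nothing to do with the VM): any number of readers, or exactly one writer, never both at the same time.

```rust
/// Demonstrate shared vs. mutable borrows.
fn demo() -> Vec<i64> {
    let mut data = vec![1, 2, 3];
    let r1 = &data; // shared borrow: reading is fine...
    let r2 = &data; // ...even from several places at once
    let _sum = r1[0] + r2[1];
    // The shared borrows are no longer used after this point, so a
    // mutable (exclusive) borrow is now allowed:
    let w = &mut data;
    w.push(4); // while `w` lives, no other access to `data` is permitted
    data
}

fn main() {
    println!("{:?}", demo()); // [1, 2, 3, 4]
}
```

Move the `w.push(4)` line above the `_sum` line and the compiler politely refuses to build the program.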

Also, there are a lot of ways to write code that you cannot retrofit onto a language that already exists. C and C++ will never be freed of their heritage; they will stay what they are, with all their pros and cons. Things are solved differently in Rust. Did I mention there is no NULL? And I have never missed it for a moment. Rust solves the problems other languages solve with NULL by using enums. That comes with certainty and safety all the way. There are no exceptions either; that problem is also solved by using enums. The way the language embraces them, they are a really powerful feature! And there are a lot more convenient ways of organising code that I keep missing in my daily C/C++ life.
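To make the enum point concrete, here is a sketch (a hypothetical function, not from lovem) of how Rust expresses "there might be no result" with the Option enum instead of NULL – and the compiler forces the caller to handle both cases:

```rust
/// Return the first even value, if there is one.
fn find_even(values: &[i64]) -> Option<i64> {
    for &v in values {
        if v % 2 == 0 {
            return Some(v);
        }
    }
    None // no even value found – no null pointer, just an enum variant
}

fn main() {
    // The match must cover both variants, or the code does not compile:
    match find_even(&[1, 3, 4]) {
        Some(v) => println!("found {}", v),
        None => println!("nothing there"),
    }
}
```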

I will not write an introduction to Rust here. At least not your typical "how to get started in Rust" intro. There are a lot of those out there, and I am already 10 posts into my Journal without programming. Maybe the Journal will become a different kind of Rust introduction, as it will try to take you along a real project, as it develops, from the beginning. I will run into problems along the way and try to solve them in Rusty ways. This might be a good way to start thinking in Rust. But, to be honest, I have never finished a project in Rust yet. I got quite a bit running and functional, and I think in some parts in a Rust-like way. But this is as much a learning project for me as for anyone else. I will do weird things. But the basics I have worked with, yeah.

The initial learning curve will be steep! I try not to get too fancy in the first draft, so the code will not be good Rust there! So, if you are shocked at how bad my Rust is – it will be very different, soon. But I want to give everyone a fair chance to hop on without understanding all the concepts. The initial code should not be too hard to follow if you know C/C++, I hope. Learning a new thing (writing a VM) in a new, quite different language is a mouthful, I know.

Didn't you say, you use C/C++?

Yes, I did say that. And I do use those. It is not easy to change that when you have a certain amount of legacy code (and not much experience with the new language, as we do not really have, yet). But we do have a saying these days. Often, after a debugging session that lasted for hours, when we find the bug, understand it, and fix it, there is this realisation that fits in the sentence:

"Mit Rust wär' das nicht passiert." — "This would not have happened with Rust."

So, this will not happen to me with this project, because those things will not happen with Rust!

A VM

The first draft of source code, that will be our VM, explained.


I dumped some source code in front of you, and then I started to talk about programming languages. Time now to explain what I did and why. We only have 132 lines, including comments. We will go through all parts of it. And I will talk a little about how Rust's basic syntax works while I use it. Not too much, since it is not good Rust code yet, but enough to help you start. This will be a longer entry.

I swear, if I do not see some code in this post...

Alright, alright... We will start with our VM:

#[derive(Debug)]
pub struct VM {
    stack: Vec<i64>,
    pc: usize,
    op_cnt: usize,
}

Nothing fancy, just a struct that will represent our Virtual Machine. Only three fields for now:

  1. stack: Obviously our stack machine needs one of those. This will hold values during execution. I am using a Vec. That is nothing more than a chunk of memory that knows how much capacity it has and how many values are in it at the moment. It does support resizing, but I do not want to use that.
  2. pc will be our program counter. That is a register holding the position in the program during execution. It will always point at the instruction that is to be executed next.
  3. op_cnt will be counting the number of operations executed. For now, I want that information out of curiosity, but later it will be useful for limiting execution time for programs.

usize and i64 are Rust's names for integer types. The language is very explicit in those terms (and very strict, as in every aspect). I will not give a real introduction to Rust (there are pages that do that), but I will try to start slowly and give you hints on the important things I introduce, so that you get the chance to learn about them in parallel to this journal. I hope that makes it easier to follow for Rust beginners. To readers that know Rust: please excuse the crude code here! I will make it more rusty, soon. Skip to the next post, if you cannot handle it.

We will also need a program to run in our VM. For a start, a crude array of bytes will do. The VM will be running bytecode, after all. And that really is all it is: a bunch of bytes that you will soon be able to understand.

// assign `pgm` to hold a program:
let pgm = [0x00 as u8, 0x01, 100, 0xff];

We will use a program that is a bit longer, but right now I wanted you to see that a program is actually nothing but a collection of bytes in Rust code. let declares and assigns a variable here, named pgm. It is an array of 4 bytes (u8 is an unsigned 8-bit integer – you might know it as uint8_t from other languages). And that variable will not be variable at all. By default, all variables in Rust are immutable. If you want to change it later, you have to declare it using the modifier mut.
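A tiny sketch of that rule (hypothetical, not part of the VM code):

```rust
/// Demonstrate mutability: only `mut` bindings can be changed.
fn mutate() -> [u8; 3] {
    let x = [1u8, 2, 3];
    // x[0] = 9; // compile error: `x` is not declared as mutable
    let mut y = x; // arrays of Copy types are copied on assignment
    y[0] = 9;      // fine: `y` was declared with `mut`
    y
}

fn main() {
    println!("{:?}", mutate()); // [9, 2, 3]
}
```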

There is no need to modify the program after creation; we just want to read it for execution. But our VM will have to be mutable, as it has changing internal state. Here is our complete main function, creating the (immutable) program and the (mutable) VM, and running the program. Of course, the run(...) method is still missing. And you can already see the program we will be using (with some constants that I have not defined yet).

fn main() {
    // Create a program in bytecode.
    // We just hardcode the bytes in an array here:
    let pgm = [op::NOP, op::PUSH_U8, 100, op::PUSH_U8, 77, op::ADD, op::POP, 0xff];
    // Create our VM instance.
    let mut vm = VM{
        stack: Vec::with_capacity(100),
        pc: 0,
        op_cnt: 0
    };
    // Execute the program in our VM:
    vm.run(&pgm);
}

Behaviour for our VM

So far we only have an initialized data structure and some bytes. Let's do something with it. Rust does not really use objects (and I think that is good). But it has associated functions that work on types, and methods that work on instances of types. We will write some methods for our VM struct. Let's start with the one for reading our program:

impl VM {
    /// Fetch the next byte from the bytecode, increase program counter, and return value.
    fn fetch_u8(&mut self, pgm: &[u8]) -> u8 {
        if self.pc >= pgm.len() {
            panic!("End of program exceeded");
        }
        let v = pgm[self.pc];
        self.pc += 1;
        v
    }
}

The fetch method will work on our VM instance. The first parameter is &mut self – that tells us it works on an instance of the type VM. It works on a reference to the instance (indicated by the &), and it can modify the data (indicated by the mut). It also takes a reference to a slice of u8s, which it will not be able to modify (no mut). It returns a u8.

What it does is simply read and return a byte from the program and increase the VM's internal program counter by one, so that the next call to fetch will return the next byte. Simple.

So, what is that panic!(), you might ask? Well, if we reach that statement, the VM will start to panic, and then it will die. That is not a nice way to act. Do not worry, we will change it to something more reasonable when we start writing better Rust. And what about the naked v in the last line? It makes the function return the value of v.

Now, let's look at that run method, we were calling in main:

impl VM {
    /// Executes a program (encoded in bytecode).
    pub fn run(&mut self, pgm: &[u8]) {
        // initialise the VM to be in a clean start state:
        self.stack.clear();
        self.pc = 0;
        self.op_cnt = 0;

        // Loop going through the whole program, one instruction at a time.
        loop {
            // Log the vm's complete state, so we can follow what happens in console:
            println!("{:?}", self);
            // Fetch next opcode from program (increases program counter):
            let opcode = self.fetch_u8(pgm);
            // We count the number of instructions we execute:
            self.op_cnt += 1;
            // If we are done, break loop and stop execution:
            if opcode == op::FIN {
                break;
            }
            // Execute the current instruction (with the opcode we loaded already):
            self.execute_op(pgm, opcode);
        }
        // Execution terminated. Output the final state of the VM:
        println!("Terminated!");
        println!("{:?}", self);
    }
}

The comments should explain what is going on there. Initialise the VM, then loop over the program, fetching one instruction at a time and executing it, until we reach the end. And you might have noticed that our program will be very talkative. I added a lot of printlns that tell just about everything that happens during execution.

I guess it is time to look at those op:: constants I keep using.

/// Module holding the constants defining the opcodes for the VM.
pub mod op {
    /// opcode: Do nothing. No oparg.
    ///
    /// pop: 0, push: 0
    /// oparg: 0
    pub const NOP: u8 = 0x00;
    /// opcode: Pop value from stack and discard it.
    ///
    /// pop: 1, push: 0
    /// oparg: 0
    pub const POP: u8 = 0x01;
    /// opcode: Push immediate value to stack.
    ///
    /// pop: 0, push: 1
    /// oparg: 1B, u8 value to push
    pub const PUSH_U8: u8 = 0x02;
    /// opcode: Add top two values on stack.
    ///
    /// pop: 2, push: 1
    /// oparg: 0
    pub const ADD: u8 = 0x10;
    /// opcode: Terminate program.
    ///
    /// pop: 0, push: 0
    /// oparg: 0
    pub const FIN: u8 = 0xff;
}

Just 5 u8 constants there, grouped in a module as a namespace, and a lot of comments to explain them. We have 5 different operations for our VM. The only thing missing is some code that actually executes those instructions:

impl VM {
    /// Executes an instruction, using the opcode passed.
    ///
    /// This might load more data from the program (opargs) and
    /// manipulate the stack (push, pop).
    fn execute_op(&mut self, pgm: &[u8], opcode: u8) {
        println!("Executing op 0x{:02x}", opcode);
        match opcode {
            op::NOP => {
                println!("  NOP");
                // do nothing
            },
            op::POP => {
                println!("  POP");
                let v = self.stack.pop().unwrap();
                println!("  dropping value {}", v);
            },
            op::PUSH_U8 => {
                println!("  PUSH_U8");
                let v = self.fetch_u8(pgm);
                println!("  value: {}", v);
                self.stack.push(v as i64);
            },
            op::ADD => {
                println!("  ADD");
                let a = self.stack.pop().unwrap();
                let b = self.stack.pop().unwrap();
                self.stack.push(a + b);
            },
            _ => {
                panic!("unknown opcode!");
            }
        }
    }
}

You can think of the match as a switch statement. It is much more than that, but here we use it as one. Each of our opcodes is handled individually. And we log a lot, so that we can read what is happening when we run it. Ignore the unwrap() calls for the time being. They are just there to sidestep potential runtime errors. Again, not good Rust style, but, you know: later.

The four operations get increasingly complex in what they do. Let's go through them one by one:

  1. NOP – this does nothing, it just wastes bytecode and execution time. I have included it simply to be the most basic operation possible.
  2. POP – this is our first modification of the stack. It simply discards the topmost value, decreasing the stack's size by one.
  3. PUSH_U8 – this is the only operation that reads additional data from the program. It only reads a single byte (increasing the program counter by one), and puts it on top of the stack, increasing the stack's size by one. This is how you can get data from your program into the VM, to work with them. It is how numeric literals in your program are handled.
  4. ADD – the only operation that works on data. It pops its two operands from the stack, adds them, and pushes the sum back on the stack. This is how data is manipulated in a stack machine. The operation reduces the stack's size by one effectively, but there need to be at least 2 values on it for it to be executed.

That is our complete VM so far, and it will execute a program, if you compile and run it (which we will do in the next post).

You can find the complete program here:

https://github.com/kratenko/lovem/blob/v0.0.1-journey/src/main.rs

You can access the repo at this state under the following link (there is also a zip file containing all files):

https://github.com/kratenko/lovem/releases/tag/v0.0.1-journey

How do I work with the code?

The easy way to get the code and play with it is to clone the git repository and check out the tag v0.0.1-journey. If you did not understand any of that, you might want to do a tutorial on git before you continue reading. Anyway, here are some copy&paste commands you can hack into your bash prompt to do what I just told you to do. Use at your own risk; I'm not responsible for what you do to your system.

you@host:~$ git clone https://github.com/kratenko/lovem.git

you@host:~$ cd lovem

you@host:~/lovem$ git checkout v0.0.1-journey

you@host:~/lovem$ cargo run lovem

This will copy all source code and its history from GitHub to your computer, and it will roll the source code back to the state we are looking at in this entry. The last command, cargo run lovem, will compile and execute the program (that is, if Rust is installed in the correct version and ready to run). cargo is Rust's package manager, which handles dependencies and compiles your projects. I will not explain those things further, but now you know what to look for.

Running our first program

Now, that we have a VM, we will run a program on it.


So we built our very first VM and studied the code in detail. It is time to execute a program on it and look at its output. We will look at every single step the program takes. Aren't we lucky that our VM is so talkative during execution?

If you missed the code, look at the previous post, A VM.

Let's go!

/home/kratenko/.cargo/bin/cargo run --color=always --package lovem --bin lovem
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/lovem`
VM { stack: [], pc: 0, op_cnt: 0 }
Executing op 0x00
  NOP
VM { stack: [], pc: 1, op_cnt: 1 }
Executing op 0x02
  PUSH_U8
  value: 100
VM { stack: [100], pc: 3, op_cnt: 2 }
Executing op 0x02
  PUSH_U8
  value: 77
VM { stack: [100, 77], pc: 5, op_cnt: 3 }
Executing op 0x10
  ADD
VM { stack: [177], pc: 6, op_cnt: 4 }
Executing op 0x01
  POP
  dropping value 177
VM { stack: [], pc: 7, op_cnt: 5 }
Terminated!
VM { stack: [], pc: 8, op_cnt: 6 }

Process finished with exit code 0

What just happened?

It is quite talkative. And isn't it nice how easy it is to print the complete state of our VM in Rust? It costs us no effort, as the code for that debug output is generated for us during compilation. Isn't that something?

So, what is happening there? Our program pgm looks like this:

    let pgm = [op::NOP, op::PUSH_U8, 100, op::PUSH_U8, 77, op::ADD, op::POP, 0xff];

Those are 8 bytes making up 6 instructions. Each instruction has a 1-byte opcode. Two of those instructions (the PUSH_U8) carry one byte of argument each, making up the remaining two bytes of our program. Here they are listed:

  1. NOP
  2. PUSH_U8 [100]
  3. PUSH_U8 [77]
  4. ADD
  5. POP
  6. FIN

The NOP does not do anything. I just put it in front of the program to let you see fetching, decoding, and executing without any effects:

VM { stack: [], pc: 0, op_cnt: 0 }
Executing op 0x00
  NOP
VM { stack: [], pc: 1, op_cnt: 1 }

We just increased the program counter by one (we advance one byte in the bytecode), and the operation counter counts this executed instruction. Let's look at the next instruction, that is more interesting:

VM { stack: [], pc: 1, op_cnt: 1 }
Executing op 0x02
  PUSH_U8
  value: 100
VM { stack: [100], pc: 3, op_cnt: 2 }

Here the PC is increased by two. That happens, because we fetch an additional value from the bytecode. The op_cnt is only increased by one. And we now have our first value on the stack! It is the byte we read from the bytecode. Let's do that again:

VM { stack: [100], pc: 3, op_cnt: 2 }
Executing op 0x02
  PUSH_U8
  value: 77
VM { stack: [100, 77], pc: 5, op_cnt: 3 }

Now there are two values on the stack! Time to do something with them. Let's add them up:

VM { stack: [100, 77], pc: 5, op_cnt: 3 }
Executing op 0x10
  ADD
VM { stack: [177], pc: 6, op_cnt: 4 }

Now there is only one value left on the stack, and it is the sum of the two values we had. Quite a lot happened here. The two values we had before were both popped from the stack (so it was briefly empty). The ADD operation added them and pushed their sum back on the stack. So now there is one value on the stack, and it is the result of our adding operation.

What's next?

VM { stack: [177], pc: 6, op_cnt: 4 }
Executing op 0x01
  POP
  dropping value 177
VM { stack: [], pc: 7, op_cnt: 5 }

It is always nice to leave your workplace tidied up when you are done. We can do that by popping our result from the stack, leaving it empty. And besides, our POP operation prints the value it drops. One more instruction to go:

VM { stack: [], pc: 7, op_cnt: 5 }
Terminated!
VM { stack: [], pc: 8, op_cnt: 6 }

Well, not much happening there. Just stopping the VM, because we are done.

Success!

So, we ran a program in a VM. Hooray, we are done. Only 132 lines of code, including excessive comments and logging. That was easy.

Well yeah - it doesn't do much. But you can understand the root principle that makes up a stack machine. It's that simple.

Go play around with it a bit. It is the best way to learn and to understand. I mean it! Write a longer program. What happens to the stack? Add another opcode – how about subtraction? Will your program execute at all? What happens, if it does not?

Turn "fragile" into "rusty"

After we got our Proof of Concept running, we clean up our code and make it look like a respectable Rust program.


Did you play around with the program from the previous post? If you are new to Rust, you really should! At least mess around with our bytecode. You should find that our VM does not react well to errors yet. It simply panics! That is no behaviour for a respectable Rust program.

We will make it more rusty, look at the enhanced version:

Repo: https://github.com/kratenko/lovem/tree/v0.0.2-journey

main.rs: https://github.com/kratenko/lovem/blob/v0.0.2-journey/src/main.rs

If you do not know your way around Rust, some of those things will be difficult to understand. It might be time to read up on some Rust, if you intend to follow my journey onwards. I will not explain everything here, but I will give you some leads right now, if you want to understand the things I did in that change.

It is all in the enums

The most important thing for you to understand will be enums. Yeah, I know. That is what I thought when first learning Rust: "I know enums. Yeah, they are handy and useful, but what could be so interesting about them?"

Well, in fact, enums in Rust completely change the way you are writing code. They are such an important part of the language that they have an impact on just about every part of it.

I introduced an enum to the code:

#[derive(Debug, Clone, PartialEq)]
pub enum RuntimeError {
    EndOfProgram,
    InvalidOperation(u8),
    StackUnderflow,
    StackOverflow,
}

It is obviously a datatype to communicate runtime errors of different natures. And I use it a bit like you would use exceptions in some other languages. Never mind the #[derive...] part for now. That is just for fancy debug output (and a bit more). Once you understand line 33, InvalidOperation(u8), you are on the right track! To put it into easy terms: values of enums in Rust can hold additional values. And, as you see in our RuntimeError, not all variants have to hold the same kind of additional value, or a value at all. This is what makes enums really powerful.

If you know what happens in the return type of fn push in line 70, you are golden. The Result type can communicate a value on success or an error condition on failure. The great difference to typical exceptions from other languages is that there is no special mechanism for passing the errors on, as there is with thrown exceptions. It is just your normal return statement that is used. And this is done, you guessed it, with enums. If you want to read up on Result, try understanding Option first. I am using that in my code, even though you cannot see it.

If you are wondering about the return value of fn push, even though there is no return statement to be seen, you should find out why some of my lines do not have a semicolon ; at the end, while most do.

And then there is that tiny ? in line 101.

Also find out what happens in the match in line 166. It might help if you start with the if let statement.

Bonus points: line 66. If that is clear to you, you have no need to worry; you are into enums and how to use them.

Homework

So, this is what will get you through a lot here. Try to understand those in the given order:

  1. Option
  2. Some(v) vs. None
  3. Result<v, e>
  4. Ok(v) vs. Err(e)
  5. if let Some(v) =
  6. match
  7. Result<(), e>
  8. Ok(())
  9. unwrap()
  10. ?
  11. Bonus: ok(), ok_or(), and their likes

If you understand each of those, and why I put them in the list, you are prepared to handle most Rust things I will be doing in the near future. If you still have problems with parts of it, move on anyway. It gets better after a while, once you use them.

Becoming social

A new way for you to participate in my journey.


After a few days of progress on the project itself, I spent a bit of time on the site again. We have the fancy link to our GitHub repo in the upper right corner now. But more importantly, I added support for comments on my entries. You can now react, ask questions, or share your thoughts.

I am using giscus.app (an idea I, again, copied from @squidfunk and their site on mkdocs-material, which this whole site is built with, more or less). Giscus is an open source app that stores the comments completely inside GitHub discussions, so the content lives alongside the lovem repository, at the one place where everything is stored already anyway. If you want to participate in the comments, you need to log in with your GitHub account. That is great, because I need not care about user management, nor about any database.

Feel free to use this entry to try out the new feature, because that is what I am gonna do!

To the library!

We turn our project from a binary project into a library project.


So far, our lovem cargo project holds a single binary. That is not very useful for something that should be integrated into other projects. What we need is a library. How is that done? Simple: we rename our main.rs to lib.rs.

No main?

But wait, what about fn main()? We do not need that inside a library. But it would be nice to still have some code that we can execute, right? Well, no problem. Your cargo project can only hold a single library, but it can even hold multiple binaries, each with its own fn main(). Just put them in the bin subdir.

Project layout

While we are at it, I split the project up into multiple source files, to get it organised. It is still small, but it will grow soon. Here is where we are at now:

lovem/
  src/
    bin/
      test-run.rs
    lib.rs
    op.rs
    vm.rs
  .gitignore
  Cargo.toml

We skip .gitignore. If you don't know what it is, google .gitignore.

Cargo.toml

So Cargo.toml holds information about our cargo project. There is not much of interest there currently:

[package]
name = "lovem"
version = "0.0.3"
edition = "2021"
authors = ["kratenko"]

[dependencies]

The only real configuration in that file is edition = "2021". Rust has a major edition release every three years. These are used to introduce breaking changes. You have to specify the edition you use explicitly, and there are migration guides. We use the most recent one, 2021.

lib.rs

Rust manages projects by using default project layouts. That is why we need not write a lot into the Cargo.toml. The src directory holds our source code. The fact that it holds a lib.rs makes it a library, and lib.rs is the entry point. This is what is in it:

pub mod op;
pub mod vm;

// re-export main types
pub use crate::vm::VM;

Really not a lot. It declares the two modules op and vm and makes them public. So whatever Rust project uses our library will have access to those modules. The modules live in the files op.rs and vm.rs. What a coincidence: those are exactly the remaining two source files in this directory!

The last line just re-exports a symbol from one of those submodules, so that programs using our library can access it more easily. We will be doing that in our binary.

op.rs

Back in v0.0.2-journey, we already had a module called op to hold the opcodes. We had it stuffed in our main.rs. Now it lives in a separate file, so we do not have to scroll over it every time.

vm.rs

This holds the rest of our source code (except for fn main(), which has no place in a lib). The only new thing compared with our former main.rs is the first line:

use crate::op;

This simply pulls the module op into the namespace of this module, so that we can access our opcode constants as we did before. The rest remains the way we already know.

bin/test-run.rs

So how do we use our lib in a project? That is best illustrated by doing it. And we can do so inside our project itself, because we can add binaries. Just put a Rust source file with a fn main() inside the bin subdir. There we can write a binary that uses the lib, just as we would in a separate project.

We did that in the file test-run.rs:

use lovem::{op, VM};

fn main() {
    // Create a program in bytecode.
    // We just hardcode the bytes in an array here:
    let pgm = [op::NOP, op::PUSH_U8, 100, op::PUSH_U8, 77, op::ADD, op::POP, 0xff];
    // Create our VM instance.
    let mut vm = VM::new(100);
    // Execute the program in our VM:
    match vm.run(&pgm) {
        Ok(_) => {
            println!("Execution successful.")
        }
        Err(e) => {
            println!("Error during execution: {:?}", e);
        }
    }
}

This is the fn main() function from our former main.rs. Instead of having all the functions and definitions, it just has this single line at the top:

use lovem::{op, VM};

Nothing too complicated. It tells the compiler that our program uses the library called lovem (which is, of course, the one we are writing ourselves here). It also tells it to bring the two symbols op and VM from it into our namespace.

The op one is simply the module op defined in op.rs. Because lib.rs declares the module public, we can access it from here. VM does not refer to the module in vm.rs, as that module is called vm (in lower case). VM is actually the struct we defined in vm, that we use to hold the state of our Virtual Machine.

We could include the struct as lovem::vm::VM, which is its full path. But I find that a bit annoying, as VM is the main type of our whole library. We will always be using it. So I re-exported it in lib.rs. Remember the line pub use crate::vm::VM;? That's what it does.

Running the binary

So, how do we run our program now? Back in v0.0.2-journey we simply called cargo run. That actually still works, as long as we have exactly one binary.

But we can have multiple binaries inside our project. If we do, we need to tell cargo which it should run. That can easily be done:

cargo run --bin test-run

The parameter to --bin is the name of the file inside bin, without the .rs. And no configuration is needed anywhere, it works by convention of project layout.

Homework

What, homework again? Yeah, why not. If it fits, I might keep adding ideas for you to play around with. Doing things yourself is how you come to understand them. Stuff we just read, we tend to forget. So here is what might help you understand the project layout stuff I was writing about:

Add a second binary, that runs a different program in the VM (with different bytecode). You have all the knowledge to do so. And then run it with cargo.

Source code

In earlier posts I included explicit links to the source code at the time of writing. That got annoying really fast. So I added a new feature to my blogem.py that I use to write this journal. Entries like this one, which explain a specific state of the lovem source, will carry a tag from now on. It corresponds to a tag inside the git repository, as it did in earlier posts. You will find it in the card at the top of the post (where you see the publishing date and the author), prefixed with a little tag image. For this post it looks like this:

v0.0.3-journey

At the bottom of the entry (if you view it in the entry page, not in the "whole month" page), you will find it again with a list of links that help you access the source in different ways. The best way to work with the code, is to clone the repository and simply check out the tag. I also added a page on this site, explaining how you do that. You can find it under Source Code.

So, in the future I will not be adding explicit links, only these implicit ones. And there will be a link to the explanation page at the bottom. This should be convenient for both you and me.

Early VM decisions

Many design decisions must be made for lovem. Here I talk about some of those in the current state.


I have shared and discussed source code in the recent posts. Now it is time again to write about design decisions. I made a few of them for the code you saw. So far I have not been explaining my reasoning here, and some of you might have wondered already. Let's talk about them.

Let me remind you: lovem is a research project for myself, and an education project for myself as well. None of my choices at this stage are set in stone. I will make lots of mistakes that I will have to correct later. I even choose some paths that I know I will be leaving again. At this stage I might just pick any solution for a problem, as I do not yet know what the right choice is. So: start somewhere, see where it goes. Some of those are deliberately weird or bad choices, but they make things clearer or simpler at this stage.

Let us address two of those choices you can find in the current source code.

Word size

I talked about register sizes defining architecture back in What is a Virtual Machine anyway?. And then I went totally silent about that topic and just used i64 as the type for my stack. Is that a good idea? I used it for simplicity. The idea goes back to when I was experimenting with using a register machine for lovem. Having a single simple datatype that can handle big values seemed simple. After all, other languages/VMs use some version of float as their single numeric datatype:

JavaScript

JavaScript Numbers are Always 64-bit Floating Point

Unlike many other programming languages, JavaScript does not define different types of numbers, like integers, short, long, floating-point etc.

JavaScript numbers are always stored as double precision floating point numbers, following the international IEEE 754 standard.

w3schools.com - retrieved 2022-07-11

Lua

2.3 - Numbers

The number type represents real (double-precision floating-point) numbers. Lua has no integer type, as it does not need it.

Programming in Lua - retrieved 2022-07-11

Well, reducing complexity is good. But having every little number in your programs eat up 8 bytes of memory does not sound low-overhead to me. And that is, after all, the goal. So I guess that will change in the future. But let's keep it for the time being. There will be some interesting things we do in the near future, even if we might dump those features later. I already implemented them during the early phase (when I was not writing a public journal), so not adding them here would be insincere. Having 64 bit values is part of our journey.

Opargs

I have no glossary yet, so you have to live with me inventing terms on the spot. I used that word in the source code already. What I mean by it are the arguments to an instruction inside the bytecode: the bytes that follow the opcode and influence the operation. They are the arguments you give inside your program's code.

As of v0.0.3-journey we only have a single opcode that takes an oparg, and that is PUSH_U8. You can see how there is a fetch_u8() call in the code that handles that operation, and none in the other operations. See execute_op.

So we have different behaviour depending on the opcode: PUSH_U8 fetches an additional byte from the bytecode, the other opcodes do not. Existing VMs handle this differently. The Java VM, for example, also has a dynamic number of opargs. They call them operands:

2.11. Instruction Set Summary

A Java Virtual Machine instruction consists of a one-byte opcode specifying the operation to be performed, followed by zero or more operands supplying arguments or data that are used by the operation. Many instructions have no operands and consist only of an opcode.

The Java® Virtual Machine Specification - Java SE 8 Edition - retrieved 2022-07-11

The Python VM, on the other hand, uses exactly one byte as oparg on all instructions:

The bytecode can be thought of as a series of instructions or a low-level program for the Python interpreter. After version 3.6, Python uses 2 bytes for each instruction. One byte is for the code of that instruction which is called an opcode, and one byte is reserved for its argument which is called the oparg.

[...]

Some instructions do not need an argument, so they ignore the byte after the opcode. The opcodes which have a value below a certain number ignore their argument. This value is stored in dis.HAVE_ARGUMENT and is currently equal to 90. So the opcodes >=dis.HAVE_ARGUMENT have an argument, and the opcodes < dis.HAVE_ARGUMENT ignore it.

Reza Bagheri - Understanding Python Bytecode - in Towards Data Science - retrieved 2022-07-11

That does remove some complexity. And it adds new complexity for opcodes that need more than one oparg byte: those exist in Python, and they are handled with a special opcode that adds an additional oparg byte. I think it also makes execution faster, as fetching can be done in advance. If you do not know how many bytes you need before you read your opcode, you cannot prefetch the next instructions.

For our goal, keeping the bytecode small is much more important than execution time. So I am pretty sure we will stick with the dynamic number of oparg bytes in lovem.

More operations

The basic operation of the VM is working. Let us add a few more opcodes, so that we can do calculations.


We have created a Rust library that holds our virtual stack machine. We can now add multiple executables to it, which makes it easier to write different programs and keep them around (to mess with the VM). We will add a few more opcodes to our repertoire, because only adding numbers is just plain boring.

I put some thought into which opcodes to introduce; but be advised that none of them are final. Not only is the VM experimental and in a very early state, I also introduce codes, on purpose, that I do not intend to keep. This is also a demonstration/introduction, so I add codes that are helpful at the time of writing, for experimenting. FIN is an example of a code that will most likely be removed at some point. But for now it is nice to have a simple way to explicitly terminate the program. It gives some confidence, when we reach that point, that our program works as intended, and that we did not mess up the bytecode.

Arithmetics

Baby steps. No rush here. We had addition as a first example. We will introduce subtraction, multiplication, division, and modulo. Sounds like not much, but we will run into some complications anyway... Here is our addition to op.rs.

/// opcode: Subtract top two values on stack.
///
/// pop: 2, push: 1
/// oparg: 0
pub const SUB: u8 = 0x11;

/// opcode: Multiply top two values on stack.
///
/// pop: 2, push: 1
/// oparg: 0
pub const MUL: u8 = 0x12;

/// opcode: Divide top two values on stack.
///
/// pop: 2, push: 1
/// oparg: 0
pub const DIV: u8 = 0x13;

/// opcode: Calculate modulo of top two values on stack.
///
/// pop: 2, push: 1
/// oparg: 0
pub const MOD: u8 = 0x14;

The order of things

Simple enough, those new codes: just copy and paste from ADD. But it turns out that subtraction is not as easy as addition. Here is the handling code we used for ADD:

op::ADD => {
    println!("  ADD");
    let a = self.pop()?;
    let b = self.pop()?;
    self.push(a + b)
},

Works. But if we copy and use that for SUB:

op::SUB => {
    println!("  SUB");
    let a = self.pop()?;
    let b = self.pop()?;
    self.push(a - b)
},

It turns out that I messed up the order of the operands. That does not matter for addition, but subtraction is not commutative. So let's change that:

op::ADD => {
    println!("  ADD");
    let b = self.pop()?;
    let a = self.pop()?;
    self.push(a + b)
},
op::SUB => {
    println!("  SUB");
    let b = self.pop()?;
    let a = self.pop()?;
    self.push(a - b)
},
op::MUL => {
    println!("  MUL");
    let b = self.pop()?;
    let a = self.pop()?;
    self.push(a * b)
},
op::DIV => {
    println!("  DIV");
    let b = self.pop()?;
    let a = self.pop()?;
    self.push(a / b)
},
op::MOD => {
    println!("  MOD");
    let b = self.pop()?;
    let a = self.pop()?;
    self.push(a % b)
},

So, we learned something. I put the other operators in there as well. But this is too naive. You might already see the problem.

Blowing up the school

As my math teacher liked to say: "... dann fliegt die Schule in die Luft!" – if we do that, the school building will blow up. It was his way of dealing with the fact that pupils are told "you must never divide by zero", but are never given an understandable reason for it. So just own it, and provide a completely absurd one.

What happens if we keep it like this? Well, not much - until you write a program that divides by zero. Then this will happen:

[...]
VM { stack: [4, 0], pc: 4, op_cnt: 2 }
Executing op 0x13
  DIV
thread 'main' panicked at 'attempt to divide by zero', src/vm.rs:142:31
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: core::panicking::panic
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:48:5
   3: lovem::vm::VM::execute_op
             at ./src/vm.rs:142:31
   4: lovem::vm::VM::run
             at ./src/vm.rs:85:13
   5: modulo::main
             at ./src/bin/modulo.rs:10:11
   6: core::ops::function::FnOnce::call_once
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Process finished with exit code 101

Our program panics! I told you earlier that this is not good behaviour. I introduced you to a lot of weird Rust stuff just to avoid panics, so let us not re-introduce them now. What can we do instead?

Division by zero is a runtime error, for sure (at least in the numerical domain we are working with). But it should not be a runtime error in our virtual machine; it should be a runtime error in the program the VM is running. Luckily, we already have a mechanism for that in our VM. So let us add a new runtime error:

/// An error that happens during execution of a program inside the VM.
#[derive(Debug, Clone, PartialEq)]
pub enum RuntimeError {
    EndOfProgram,
    UnknownOpcode(u8),
    StackUnderflow,
    StackOverflow,
    DivisionByZero,
}

And adjust our opcode handlers:

op::DIV => {
    println!("  DIV");
    let b = self.pop()?;
    let a = self.pop()?;
    if b == 0 {
        Err(RuntimeError::DivisionByZero)
    } else {
        self.push(a / b)
    }
},
op::MOD => {
    println!("  MOD");
    let b = self.pop()?;
    let a = self.pop()?;
    if b == 0 {
        Err(RuntimeError::DivisionByZero)
    } else {
        self.push(a % b)
    }
},

We add a check to the DIV and MOD handlers (modulo is a division as well). If we run the program that divides by zero again, we now get this:

[...]
VM { stack: [4, 0], pc: 4, op_cnt: 2 }
Executing op 0x13
  DIV
Error during execution: DivisionByZero

Process finished with exit code 0

Yes, it still fails. But only the execution of the bytecode fails, not the execution of our virtual machine. You can now handle the problem inside your Rust program in a way that fits your needs. Much better. In the next post, we will be using our new instructions in a fancy way that works well with a stack machine.

Homework

Oh, not sure. Play around with it, I guess? As always. Feel free to write a calculation into a program and compare the results. It should work, unless I messed up again. You should, at some point, have written a program in bytecode yourself, so that you know how that feels.

Reverse polish notation

We are using the design of a stack machine to efficiently execute some calculations.


The way stack machines work can be used in programs that execute calculations. We will look at it by implementing an example from the Wikipedia page about stack machines.

I will quote a lot of it here. You can see the full text of the article and its authors when you follow the Wikipedia permalink to the article.

Design

Most or all stack machine instructions assume that operands will be from the stack, and results placed in the stack. The stack easily holds more than two inputs or more than one result, so a rich set of operations can be computed. In stack machine code (sometimes called p-code), instructions will frequently have only an opcode commanding an operation, with no additional fields identifying a constant, register or memory cell, known as a zero address format. This greatly simplifies instruction decoding. Branches, load immediates, and load/store instructions require an argument field, but stack machines often arrange that the frequent cases of these still fit together with the opcode into a compact group of bits.

Wikipedia - retrieved 2022-07-15

So far nothing new - I wrote about all that in my earlier posts.

The selection of operands from prior results is done implicitly by ordering the instructions. [...]

ibid.

Now, here it gets interesting.

[...]

The instruction set carries out most ALU actions with postfix (reverse Polish notation) operations that work only on the expression stack, not on data registers or main memory cells. This can be very convenient for executing high-level languages, because most arithmetic expressions can be easily translated into postfix notation.

For example, consider the expression A*(B-C)+(D+E), written in reverse Polish notation as A B C - * D E + +. Compiling and running this on a simple imaginary stack machine would take the form:

                # stack contents (leftmost = top = most recent):
push A          #           A
push B          #     B     A
push C          # C   B     A
subtract        #     B-C   A
multiply        #           A*(B-C)
push D          #     D     A*(B-C)
push E          # E   D     A*(B-C)
add             #     D+E   A*(B-C)
add             #           A*(B-C)+(D+E)
ibid.

Well, I don't know about a "simple imaginary stack machine" - but as it happens, we have a very real, simple stack machine at our disposal. You know where we will be going next!

Porting the code to lovem

The program from the Wikipedia article uses 5 variables A to E. We do not support any kind of variables, yet, but that isn't important here. We use immediates (literals from your program) to put some concrete values into the calculation. Let's just take some numbers, totally at random:

A = 5, B = 7, C = 11, D = 13, E = 17

And we add a new binary to the project: reverse-polish.rs

//! A small program demonstrating execution of arithmetics in our VM.
//!
//! For an explanation of what we are doing here, look at this wikipedia article:
//! https://en.wikipedia.org/w/index.php?title=Stack_machine&oldid=1097292883#Design
use lovem::{op, VM};

// A*(B-C)+(D+E)
// A B C - * D E + +
// A = 5, B = 7, C = 11, D = 13, E = 17
// 5 * (7 - 11) + (13 + 17) = 10

fn main() {
    // Create a program in bytecode.
    // We just hardcode the bytes in an array here:
    let pgm = [op::PUSH_U8, 5, op::PUSH_U8, 7, op::PUSH_U8, 11, op::SUB, op::MUL,
        op::PUSH_U8, 13, op::PUSH_U8, 17, op::ADD, op::ADD, op::POP, op::FIN];
    // Create our VM instance.
    let mut vm = VM::new(100);
    // Execute the program in our VM:
    match vm.run(&pgm) {
        Ok(_) => {
            println!("Execution successful.")
        }
        Err(e) => {
            println!("Error during execution: {:?}", e);
        }
    }
}

The comments spoil the result, but we want to check that it calculates correctly, so that is okay. The program is the same as before: create a VM and run some hardcoded bytecode on it. Since the VM logs excessively, we will see what happens when we run it. So the only new thing here is the bytecode program. I'll write it down in a more readable form:

push_u8 5
push_u8 7
push_u8 11
sub
mul
push_u8 13
push_u8 17
add
add
pop
fin

To no-one's surprise, this code is the same as in the article - only with the variables replaced by numbers; and I added a pop and a fin at the end, to keep our program clean.

Execution

VM { stack: [], pc: 0, op_cnt: 0 }
Executing op 0x02
  PUSH_U8
  value: 5
VM { stack: [5], pc: 2, op_cnt: 1 }
Executing op 0x02
  PUSH_U8
  value: 7
VM { stack: [5, 7], pc: 4, op_cnt: 2 }
Executing op 0x02
  PUSH_U8
  value: 11
VM { stack: [5, 7, 11], pc: 6, op_cnt: 3 }
Executing op 0x11
  SUB
VM { stack: [5, -4], pc: 7, op_cnt: 4 }
Executing op 0x12
  MUL
VM { stack: [-20], pc: 8, op_cnt: 5 }
Executing op 0x02
  PUSH_U8
  value: 13
VM { stack: [-20, 13], pc: 10, op_cnt: 6 }
Executing op 0x02
  PUSH_U8
  value: 17
VM { stack: [-20, 13, 17], pc: 12, op_cnt: 7 }
Executing op 0x10
  ADD
VM { stack: [-20, 30], pc: 13, op_cnt: 8 }
Executing op 0x10
  ADD
VM { stack: [10], pc: 14, op_cnt: 9 }
Executing op 0x01
  POP
  dropping value 10
VM { stack: [], pc: 15, op_cnt: 10 }
Terminated!
VM { stack: [], pc: 16, op_cnt: 11 }
Execution successful.

The output shows you the stack after every instruction. You can compare it to the stack contents in the Wikipedia listing, and you will find them identical (the order of the stack listing is switched, and of course you have numbers instead of arithmetic expressions with variables – but if you insert our numbers into the Wikipedia listing, they should match).

Our PoC stack machine really can do what the imaginary one is claimed to do. That's nice.

Homework

You should really read the article on Reverse Polish Notation (permalink to article at time of writing). It will give you some background on why it is important, not least historically. The Z3, for example, arguably the first computer ever built, used it.

Go ahead and jump!

All our programs have been linear so far. Let's build the base for jumping around.


In every program we have written so far, each instruction just advances the PC, until we reach the end. That is very linear. We will now introduce a new opcode that jumps to a different position in the program.

A new opcode

How do we implement that? It is actually quite easy. Do you remember what I said about the PC? It is a special register that always points to the instruction in the bytecode that is executed next. So all our operation needs to do is modify the PC. We will give the opcode an oparg of two bytes, so we can tell it where to jump to. Here is our new opcode in op.rs:

/// opcode: Relative jump.
///
/// pop: 0, push: 0
/// oparg: 2B, i16 relative jump
pub const GOTO: u8 = 0x20;

Now we have the dreaded goto. Don't be scared - on the bytecode level, that is all fine. We are not designing a high-level language here; there will be gotos. But how do we fetch an i16 from our bytecode? So far we can only fetch a u8. So we add some more fetching:

Fetch more than a byte

/// Reads the next byte from the bytecode, increases the program counter, and returns the byte.
fn fetch_u8(&mut self, pgm: &[u8]) -> Result<u8, RuntimeError> {
    if let Some(v) = pgm.get(self.pc) {
        self.pc += 1;
        Ok(*v)
    } else {
        Err(RuntimeError::EndOfProgram)
    }
}

/// Reads the next byte from the bytecode, increases the program counter, and returns the byte as i8.
fn fetch_i8(&mut self, pgm: &[u8]) -> Result<i8, RuntimeError> {
    if let Some(v) = pgm.get(self.pc) {
        self.pc += 1;
        Ok(*v as i8)
    } else {
        Err(RuntimeError::EndOfProgram)
    }
}

/// Reads the next two bytes from the bytecode, increases the program counter by two, and returns them as i16.
fn fetch_i16(&mut self, pgm: &[u8]) -> Result<i16, RuntimeError> {
    let hi = self.fetch_i8(pgm)? as i16;
    let lo = self.fetch_u8(pgm)? as i16;
    Ok(hi << 8 | lo)
}

We already know fn fetch_u8(). fn fetch_i8() does almost the same thing, only that it casts the byte from u8 to i8. Simple enough. Casting in Rust has the beautiful syntax <value> as <type>.

So why do we need i8? Because we are building an i16 from an i8 and a u8: just a bit of bit arithmetic. We can pass potential EndOfProgram runtime errors on easily with ? and Result. That allows us to write short but still easy-to-read code, I think. So now we can fetch the value we need for our jump. Let us write the handler for the opcode in fn execute_op() of vm.rs.

Goto

op::GOTO => {
    println!("  GOTO");
    let d = self.fetch_i16(pgm)?;
    self.pc += d;
    Ok(())
}

So, is that all? No, because we made a Rust-beginner-mistake. If we try and compile the code, we get an error:

error[E0308]: mismatched types
   --> src/vm.rs:174:28
    |
174 |                 self.pc += d;
    |                            ^ expected `usize`, found `i16`

Yeah - Rust does not allow us to do calculations with different types of integers. We need to cast everything explicitly. Rust tries to avoid ambiguity, so there are no implicit conversions. And, to be honest, the compiler has a good point. We should care even more about that calculation; we want our VM to be robust. We change the handler to:

op::GOTO => {
    println!("  GOTO");
    let d = self.fetch_i16(pgm)?;
    self.relative_jump(pgm, d)
}

Safe goto

And we add a new method (and we add a new RuntimeError):

/// Executes a checked relative jump; runtime error if the jump leaves the program.
fn relative_jump(&mut self, pgm: &[u8], delta: i16) -> Result<(), RuntimeError> {
    println!("  Jump from {} by {}", self.pc, delta);
    if delta < 0 {
        let d = -delta as usize;
        if self.pc >= d {
            self.pc -= d;
            Ok(())
        } else {
            Err(RuntimeError::InvalidJump)
        }
    } else {
        let d = delta as usize;
        if self.pc + d < pgm.len() {
            self.pc += d;
            Ok(())
        } else {
            Err(RuntimeError::InvalidJump)
        }
    }
}

Enter the loop

Now, let us write a new program that uses the goto opcode:

//! Create a VM and run a small bytecode program in it.
//!
//! This demonstrates the goto operation with an endless loop.
use lovem::{op, VM};

fn main() {
    // Create a program in bytecode.
    // We just hardcode the bytes in an array here:
    let pgm = [op::PUSH_U8, 123, op::GOTO, 0xff, 0xfb, op::FIN];
    // Create our VM instance.
    let mut vm = VM::new(100);
    // Execute the program in our VM:
    match vm.run(&pgm) {
        Ok(_) => {
            println!("Execution successful.")
        }
        Err(e) => {
            println!("Error during execution: {:?}", e);
        }
    }
}

I will write that bytecode down in a more readable format again:

push_u8 123
goto -5
fin

Only 3 instructions, and the fin will never be reached. The 0xff, 0xfb after the op::GOTO is the 2-byte oparg: an i16 with the value -5. But why -5? When the goto is executed, we have already read both oparg bytes, so the PC points to the fin at index 5. Adding -5 to it sets the PC to 0. The next executed instruction will be the push_u8 once again: an endless loop. So will the program run forever? What do you think will happen? Let's try:

VM { stack: [], pc: 0, op_cnt: 0 }
Executing op 0x02
  PUSH_U8
  value: 123
VM { stack: [123], pc: 2, op_cnt: 1 }
Executing op 0x20
  GOTO
  Jump from 5 by -5
VM { stack: [123], pc: 0, op_cnt: 2 }
Executing op 0x02
  PUSH_U8
  value: 123
VM { stack: [123, 123], pc: 2, op_cnt: 3 }
Executing op 0x20
  GOTO

[...]

VM { stack: [123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123], pc: 0, op_cnt: 200 }
Executing op 0x02
  PUSH_U8
  value: 123
Error during execution: StackOverflow

Process finished with exit code 0

There is a push_u8 operation in our endless loop, so it fills our stack until it is full! The program hits a runtime error after 200 executed instructions. Great, now we have tested that, too.

NOPE

That is not very dynamic. We want to make decisions! We want to choose our path. What we want is branching. We will introduce a new opcode that decides which branch the execution of our program takes, based on a value at runtime. If this sounds unfamiliar to you, let me tell you what statement we want to introduce: it is the if statement.

So, how does that work? As mentioned, normally the PC is incremented on each byte we fetch from the bytecode, and the PC always points to the next instruction that will be executed. So if we want to change the path of execution, what we have to do is change the value of the PC.

An operation that simply changes the PC statically would be a GOTO statement. But there is no branching involved in that; the path that will be executed is always clear. The if statement, on the other hand, only alters the PC if a certain condition is met.

A new opcode

/// opcode: Branch if top value is equal to zero.
///
/// pop: 1, push: 0
/// oparg: 2B, i16 relative jump
pub const IFEQ: u8 = 0x21;

Our new operation pops only one value. So what does it get compared to? That's easy: zero. If you need to compare two values to each other, just subtract them instead, and then you can compare with zero. That gives the same result.

And what kind of oparg does this operation take? A signed integer. That is the value that should be added to the PC, if our condition is met. This will result in a relative jump.

Homework

Same as always. Write some bytecode. Try some jumping around. Run into trouble! You can write a program that has a fin in the middle, but executes code that lies behind that instruction.

Don't byte me!

I have had it with these motherloving bytes in this motherloving bytecode!


By now you should have come to a realisation: writing bytecode sucks! It wasn't fun to begin with, but now that we have introduced jumps into our code, we need to count how many bytes a jump spans, with instructions that take different numbers of bytes as opargs. Encoding negative numbers in bytes is also no fun. And just think about it: if you change your program (e.g. add a few instructions), you have to adjust all those relative jumps! How horrible is that? Can't someone else do it? Well, yeah, of course. We invented a machine that can do annoying, monotonous tasks that require accuracy and must be done over and over again. That machine is, of course, the computer.

Well, lucky us that we know how to tell a computer what it should do. So let's write a program that writes bytecode for us. I am not talking about compiling a programming language into our VM; at least not yet, not for a long time. But something that lets us write those instructions in a way that is at least a bit more human-friendly.

Maybe you remember that I already tried to write some of the bytecode programs I showed you in a more readable way, like this:

push_u8 5
push_u8 7
push_u8 11
sub
mul
push_u8 13
push_u8 17
add
add
pop
fin

If that reminded you of something, that is no coincidence.

Assembler

The listing up there looks a bit like assembler code. And in an earlier draft of lovem, I already wrote a program that could translate those listings into bytecode. We will do that again, together. But this will take us some time (that is, multiple journal entries). We need to acquire some additional Rust skills for that, and there is so much to explain inside the assembler program itself.

Once again, I am making this up along the way. Yes, I have a plan, but I will just start to introduce syntax for the assembler, and it might not be ideal. That means I might change it all again later. Like the VM itself, our assembler will be experimental. You are welcome to give me ideas for the syntax; we do have the comments now, under each post, feel free to use them. There is the whole GitHub discussions page as well. And you can still find me on Twitter. Find the link at the bottom of this page.

Command line tool

The assembler will be a binary that you call with parameters: a typical command line tool, just like gcc or rustc. So what we need to do is learn how one writes a command line tool in Rust. One that can read files, because I plan to write assembly programs in text files. And I have no desire to start parsing command line arguments myself. Neither do I want to write an introduction on writing command line tools in Rust. All this has been done. So I kindly direct you to an online book:

Command Line Applications in Rust.

That is where I got what I will be using here. It uses a crate called clap, which seems to be the most widely used library for building command line tools in Rust. It takes about 10 minutes to read. Finding out how to use the clap options I want took longer, but that will not be a problem for you, as I will just be using those options.

This is the first time we are using external crates in Rust. We need to add our dependencies to Cargo.toml, before we can use them:

[dependencies]
clap = { version = "3.2.12", features = ["derive"] }
anyhow = "1.0.58"

Introducing lovas

Now let us start with the assembler. We create a new binary that will become our assembler: lovas.rs

//! An experimental assembler for lovem
use clap::Parser;
use anyhow::{Context, Result};

/// Struct used to declare the command line tool behaviour using clap.
///
/// This defines the arguments and options the tool provides. It is also used to 
/// generate the instructions you get when calling it with `--help`.
#[derive(Parser, Debug)]
#[clap(name = "lovas",
long_about = "An experimental assembler for lovem, the Low Overhead Virtual Embedded Machine.",
)]
struct Cli {
    #[clap(parse(from_os_str), help = "Path to assembler source file.")]
    source: std::path::PathBuf,
}

fn main() -> Result<()> {
    // read, validate, and evaluate command line parameters:
    let args = Cli::parse();
    // read complete source file into String:
    let content = std::fs::read_to_string(&args.source)
        .with_context(
            || format!("could not read file `{}`", args.source.as_path().display().to_string())
        )?;
    // For now, just print our all the lines in the file:
    for (n, line) in content.lines().enumerate() {
        println!("{:4}: '{}'", n + 1, line);
    }
    // We succeeded in our work, so return Ok() as a Result:
    Ok(())
}

As it happens with Rust, the code is very dense. I try to explain what I do inside the code using comments. This does not look like it does too much. Yet it does. You can call it using cargo run --bin lovas, as we learned earlier:

kratenko@jotun:~/git/lovem$ cargo run --bin lovas

    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/lovas`
error: The following required arguments were not provided:
    <SOURCE>

USAGE:
    lovas <SOURCE>

For more information try --help

That is already a lot! It finds out that you did not supply a required argument and tells you so in a somewhat understandable error message. We did not write any of that. And it even tells you how to get help: add --help to your call.

Now, if we use cargo to run our binary, we need to add an extra bit to the call, because we need to tell cargo where its own arguments end and where the arguments to the called binary begin. This is done (as is custom) by adding --, to indicate the end of cargo's arguments. So if we want to pass --help to lovas, we can do it like this:

kratenko@jotun:~/git/lovem$ cargo run --bin lovas -- --help

    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/lovas --help`
lovas 
An experimental assembler for lovem, the Low Overhead Virtual Embedded Machine.

USAGE:
    lovas <SOURCE>

ARGS:
    <SOURCE>
            Path to assembler source file.

OPTIONS:
    -h, --help
            Print help information

How helpful! Also, now you can see why I added those two strings to our Cli struct; they show up in the help message.

Run it

It looks like we need to give it a file to read if we want the program to succeed and not exit with an error. I wrote a little assembly program that we can use: hallo-stack.lass. Our assembler will not do anything too useful with it, because we have not written an actual assembler yet. It will simply print out the lines of the file, prefixed with the line number (the call to .enumerate() is what I use to count the lines while iterating over them).

kratenko@jotun:~/git/lovem$ cargo run --bin lovas -- pgm/hallo-stack.lass

    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/lovas pgm/hallo-stack.lass`
   1: 'push_u8 123'
   2: 'push_u8 200'
   3: 'add'
   4: 'pop'
   5: 'fin'

Neat! I feel this is a lot for such a small program! It is also enough for this journal entry. We will be working on lovas for a bit, now.

Homework

Well - if you have not done so, read the book I linked. At least up until chapter 1.4, I guess; that is what we need for now.

And try to trigger some errors when calling lovas. What if the file you tell it to open does not exist? What if it cannot be read? Do you understand how those error messages propagate through the program and end up as a readable message in your console?

Assemble!

We introduce an API for assembly to our lovem library.


Last time, we built the frame of a command line program that will become our new assembler, lovas. It is time we give that program the power to assemble.

Calling the real assembler

lovas.rs is just the executable wrapper around the actual assembler, which will live inside the library. All lovas.rs does is supply the command line interface. And that CLI part does not belong in a library function. We got it nicely separated, and programs using the library can assemble source to bytecode themselves, without calling an external binary.

We alter lovas.rs a bit. The part that just printed out the source lines is gone. We replace it with a call to a new library function that can translate assembly code into bytecode:

fn main() -> Result<()> {
    ... the same as before ...

    // run the assembler:
    match asm::assemble(&name, &content) {
        Ok(pgm) => {
            // we succeeded and now have a program with bytecode:
            println!("{:?}", pgm);
            Ok(())
        },
        Err(e) => {
            // Something went wrong during assembly.
            // Convert the error report, so that `anyhow` can do its magic
            // and display some helpful error message:
            Err(Error::from(e))
        },
    }
}

The important part is the call to asm::assemble(&name, &content). We created a new module asm inside our lib. It exposes only a single function, assemble, and a few types for error handling. There will be a lot to unpack inside that module.

The good news for us is: we do not need to restrain ourselves as much as we do in the VM itself. Resource usage is not really an issue here, because the assembler is not meant to run in a restricted environment. The idea of lovem is that you write your programs elsewhere, outside the restricted environment, and only run the compiled bytecode in the VM on the restricted device. And since the scope handled by the assembler will still be defined by that restricted device, we expect to only write relatively small and simple programs. With modern computers doing the assembling, we can use as much memory as we want.

Oh, by the way... yes, I seem to stick to these short, cryptic names for the parts of lovem: VM, Pgm, op, asm. I kinda like it that way, and it goes well with the register names etc. That feels right for something as low-level as a VM. And I do my best to always document those things properly, so that your IDE of choice will always show you what each thing is.

ASM

I wrote a very basic assembler inside asm.rs, and it is already over 250 lines long. Quite a lot to unpack. As before, I try to explain as much as possible inside the source code itself, using comments. This makes it easier to follow, and you can even do so inside the source in the repo, without reading this blog.

There are four types that I introduce inside the mod:

/// Errors that can happen during assembly.
#[derive(Debug, Clone)]
pub enum AsmError {
    InvalidLine,
    UnknownInstruction(String),
    UnexpectedArgument,
    MissingArgument,
    InvalidArgument,
}

/// Report of failed assembly attempt.
///
/// Wraps the error that occurred during assembly and supplied information where it did.
#[derive(Debug)]
pub struct AsmErrorReport {
    /// Name of the program that failed to assemble.
    name: String,
    /// Line the error occurred during assembly.
    line: usize,
    /// Error that occurred.
    error: AsmError,
}

/// A single instruction parsed from the line of an assembly program.
#[derive(Debug)]
struct AsmInstruction {
    /// Number of line the instruction was read from.
    ///
    /// The number of the line the instruction was taken from, most likely
    /// from a source file. Line counting starts at 1.
    line_number: usize,
    /// Opcode defining which operation is to be executed.
    opcode: u8,
    /// Arguments used for execution of the operation.
    ///
    /// Zero or more bytes.
    oparg: Vec<u8>,
    /// Position inside bytecode (starting at 0).
    ///
    /// Number of bytes that come before this instruction in the program.
    pos: usize,
}

/// An assembler program during parsing/assembling.
#[derive(Debug)]
struct AsmPgm {
    /// Name of the program (just a string supplied by caller).
    name: String,
    /// Vector of parsed assembler instructions, in the order they are in the source file.
    instructions: Vec<AsmInstruction>,
    /// Current line number during parsing.
    ///
    /// Used for error reporting.
    line_number: usize,
    /// Current position inside bytecode during parsing.
    ///
    /// Used to calculate the exact position an instruction will be in the bytecode.
    text_pos: usize,
    /// The error that happened during parsing/assembling, if any.
    error: Option<AsmError>,
}

AsmError is easy enough to understand. We used the same idea for the RuntimeError inside the VM. When we run into an error while trying to assemble the program, we return an Err(AsmError) instead of Ok(()), so that we can propagate what happened back to the caller. The nice thing is that with speaking names for the enum values, and with the occasional embedded value (as in UnknownInstruction(String)), the debug representation of the AsmError alone is enough to make the user understand what error was detected.

AsmErrorReport is a little wrapper we use to add the information on where we ran into an error. InvalidArgument is a nice hint on how to fix your program; but if that program is 2000 lines long, good luck. When you know the InvalidArgument happened in line 1337, you will find it much faster. Especially in an assembly language, which never has more than a single instruction per line.

AsmInstruction is used to represent a single instruction inside a program, so each instance of this type will be linked to a specific line in the source file. If you don't remember what counts as an instruction in lovem (at least at the time of writing), let me repeat: an instruction consists of exactly one operation that is to be executed, identified by its opcode (a number from 0x00 to 0xff, stored in a single byte). Each instruction has zero or more bytes used as an argument, defining how the operation is to be executed. This argument is called the oparg. We also store the number of the line where we found the instruction in the source code, and the position inside the bytecode where the instruction will be.

AsmPgm will represent the complete program during the assembly process. We collect the instructions we parse from the source in there, in a Vector, and we track the progress during parsing/assembling. This is not the type that will be returned to the caller; it is only used internally (as you can guess from the fact that it is not defined pub).

Where does the program come from?

The only function the mod exports is assemble:

/// Parse assembly source code and turn it into a runnable program (or create report).
pub fn assemble(name: &str, content: &str) -> Result<Pgm, AsmErrorReport> {
    let asm_pgm = AsmPgm::parse(name, content);
    asm_pgm.to_program()
}

It will return an AsmErrorReport if anything goes wrong and the assembling fails. If the assembler succeeds, it returns an instance of Pgm. Now where does that come from? Our VM takes programs in the form of a &[u8]. That will change soon; then it will run programs from a special type Pgm that might hold a bit more than just bytecode. I added another new module to the library: pgm.rs. That one is tiny and only holds the new struct Pgm, which itself is basic. But we have a type that holds a program now. I believe that will be beneficial to us later.

/// Holds a program to be executed in VM.
#[derive(Debug)]
pub struct Pgm {
    /// Some name identifying the program.
    pub name: String,
    /// Bytecode holding the program's instructions.
    pub text: Vec<u8>,
}

So what does the assembler do to create such a Pgm? We will start to go through that in the next entry. This has been enough for today.

Parsing the source


So far we have read an assembly source file into a string, and we got to know some new data structures. It is time we use the one to fill the other. Let us start parsing.

What we know so far is this:

/// Parse assembly source code and turn it into a runnable program (or create report).
pub fn assemble(name: &str, content: &str) -> Result<Pgm, AsmErrorReport> {
    let asm_pgm = AsmPgm::parse(name, content);
    asm_pgm.to_program()
}

Assembler syntax

Our experimental assembler starts out with a simple syntax: only one instruction per line, short opnames to identify the operation to be executed, and optionally a single argument. I have written a short program: hallo-stack.lass.

push_u8 123
push_u8 200
add
pop
fin

Straightforward. And you know the syntax already from my human-friendly listings of bytecode. Parsing that looks simple. We do want to allow additional whitespace, though. And we want to allow comments, for sure. Our assembler needs to handle a bit of noise, as in noice.lass.

# This is an awesome program!
 push_u8 123
push_u8     200      # What are we using the # 200 for?


add
   pop


# let's end it here!
fin

Those two programs should be identical and produce the same bytecode.

One line at a time

The parse() function we call creates an empty instance of AsmPgm and then processes the source file line after line, filling the AsmPgm on the way.

/// Parse an assembly program from source into `AsmPgm` struct.
fn parse(name: &str, content: &str) -> AsmPgm {
    // create a new, clean instance to fill during parsing:
    let mut p = AsmPgm {
        name: String::from(name),
        instructions: vec![],
        line_number: 0,
        text_pos: 0,
        error: None,
    };
    // read the source, one line at a time, adding instructions:
    for (n, line) in content.lines().enumerate() {
        p.line_number = n + 1;
        let line = AsmPgm::clean_line(line);
        if let Err(e) = p.parse_line(line) {
            // Store error in program and abort parsing:
            p.error = Some(e);
            break;
        }
    }
    p
}

content.lines() gives us an iterator that we can use to handle each line of the String content in a for loop. We extend the iterator by calling enumerate() on it; that gives us a different iterator, which counts the values returned by the first iterator and pairs each value with its count. So n will hold the line's index and line will hold the line's content.

We always keep track of where we are in the source. Because enumerate() starts counting at 0 (as things should be), we need to add 1; file lines start counting at 1. The first thing we do with a line is clean it. Then it gets processed by parse_line(line). If that produces an error, we store the error and abort parsing; all our errors are fatal. The final line, p, returns the AsmPgm. We do not use a Result this time, but the AsmPgm can contain an error: parsing was successful only if its error field is None.

Cleaning the noise

/// Removes all noise from an assembler program's line.
fn clean_line(line: &str) -> String {
    // Remove comments:
    let line = if let Some(pair) = line.split_once("#") {
        pair.0
    } else {
        &line
    };
    // Trim start and end:
    let line = line.trim();
    // Reduce all whitespaces to a single space (0x20):
    ANY_WHITESPACES.replace_all(line, " ").to_string()
}

We use multiple techniques to clean our input: splitting, trimming, regular expressions. When we are done, we only have lines that look like those in hallo-stack.lass. The cleaned line can also be completely empty.

I want to add a word about that regexp in ANY_WHITESPACES. Where does it come from? I am using some more Rust magic there, and the crate lazy_static:

use lazy_static::lazy_static;
use regex::Regex;

// Regular expressions used by the assembler.
// lazy static takes care that they are compiled only once and then reused.
lazy_static! {
    static ref ANY_WHITESPACES: Regex = regex::Regex::new(r"\s+").unwrap();
    static ref OP_LINE_RE: Regex = regex::Regex::new(r"^(\S+)(?: (.+))?$").unwrap();
}

I do not pretend to understand the macro magic that happens here. But what happens is that the regular expressions are compiled only once and then kept as a sort of global, static, immutable variable that we can then use again and again all over the program as a reference. Static references are a convenient thing in Rust, if you remember what I told you about ownership. You can always have as many references to an immutable static variable as you like, because there is nothing that can happen to it, and it exists throughout the complete runtime of the program.

Parsing a clean line

/// Handles a single cleaned line from an Assembly program.
fn parse_line(&mut self, line: String) -> Result<(), AsmError> {
    if line == "" {
        // empty line (or comment only) - skip
        return Ok(());
    }
    if let Some(caps) = OP_LINE_RE.captures(&line) {
        let opname = caps.get(1).unwrap().as_str();
        let parm = caps.get(2).map(|m| m.as_str());
        return self.parse_instruction(opname, parm);
    }
    Err(AsmError::InvalidLine)
}

parse_line() processes each line. Empty ones are just skipped. We use another regular expression to find out if the line matches our schema. Because we cleaned the line, the expression can be rather simple: r"^(\S+)(?: (.+))?$". We look for one or more non-whitespace chars for our opname. It can be followed by a single argument, which must consist of one or more chars, separated from the opname by a single space. That is our optional oparg. If the line fits, we found an instruction we can try to parse. That is the job of parse_instruction(). Everything that is neither empty nor an instruction is an error, which we can simply return. It will abort the parsing, and the caller will know that there was an invalid line.

parse_instruction() can also run into an error. We use our tried pattern of returning a Result where the successful outcome does not carry any additional information (which is why we return Ok(())). The error case will return an AsmError that carries the reason for the error. And because of the Result type and Rust's mighty enum system, we can simply return what parse_instruction() returns to us.

Parsing the instruction itself will be covered in the next entry.

