Assembler, Part 1 2017

I don't yet have a name for the Assembler language program-thingy I'm doing so I'm just going to refer to it as Assembler (giev suggestions!) for the time being. Cool? Cool.

As mentioned in a previous post, I'm taking time out from my commercial software and am instead working on fun little projects. Clearly, our definition of the word fun differs greatly.

I wrote a Turing program a long time ago and recently it occurred to me that I could get a more advanced version by essentially emulating a CPU. The Turing machine is pretty much what that is. In this case, I'd be going with a Motorola 68000. Because it's lovely, and because it seems simple enough as compared to the nightmares that are the x86 and x64 architectures. Bear in mind that I don't really know much about assembler/assembly other than tinkerings back in the mid-to-late 1990s.

Most of the time so far has been spent creating a lexer (syntax highlighter) for Scintilla. Being a .NET OOP guy and working with Scintilla.NET (which doesn't attempt to really refactor anything), its API is a complete mess. So it took me quite a while to get highlighting happening, and it's still not done. Thankfully there's a project called ScintillaNET-Kitchen that helps out.

I haven't gone for a multi-document interface or anything that you'd find in more complete IDEs - how do I even know if this project will work satisfactorily? Hell, saying that, I've (partly!) made it possible to switch in new CPUs. I think that itself is a bit much and so don't currently plan on fully adding support for such a feature. Hmm... I could also create a chipset with multiple chips...? No - NO. Stop right there.

Yeah, alright - what the eff is the point?

Fun. As said above.

I'm going to keep adding to it. Forever. CPUs don't tend to do a whole lot on their own; their instruction sets aren't very big, either. Lots of shunting of data all over the place. Of course this is simplified and from someone that doesn't really assembler-me-do, but how hard can it be?

So, if I can get the basics working then I can start adding built-in routines that a program can call to do more advanced stuff. A video display would need to be added, of some sort. Only being able to manipulate registers isn't the most exciting thing after the initial novelty wears off.

Shut up and tell me if it's done yet

Hell no it isn't done. It doesn't even parse the entered code yet. There's the editor with all its syntax highlighting tomfoolery, and there's the Registers window that displays the contents of each register. That bit works; it shows exactly what's in the registers, but they can't be set using code yet.

There's a screenshot below. That's all there currently is. Will see if I have time to get the parsing - at least started - in today.

Assembler, screenshot uno

Edit: Did the opposite of what I said and have continued working on the interface. Oops.


Assembler, Part 2 2017

Part number the two!

Did some more work on the Assembler project (part 1 here) that didn't just involve improving the interface. Ooh, ooh - what?

Parsing. That's what.

Hitting Run (or F5) now results in the entered code being parsed and added to an Instructions collection. The following style is correctly parsed:

<mnemonic><datatype> <source_operand>, <dest_operand>

An example of which is:

move.l #4, d0 ; an optional comment which isn't explicitly captured

mnemonic is the instruction (move, add, etc). datatype is the size, which is one of .b (byte), .w (word), or .l (long-word). source_operand and dest_operand are where stuff happens (values, registers, addresses, and the like).

Each line in the source is read one-by-one. Once a line is read, a regular expression[1] match is attempted (hard-coded add and move for now).

(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)

Note that this expression is far from complete. First change would involve removing the rigid white-space structure.
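Here's a rough sketch of that line-matching step. It's in Python rather than the project's .NET code, purely for illustration; the expression is the one above, unchanged, while the parse_line name and the dictionary it returns are made up for the example.

```python
import re

# The expression from the post, unchanged: mnemonic, optional size, source, destination.
LINE_PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


def parse_line(line):
    """Return the captured parts of one source line, or None if it doesn't match.

    parse_line and the dict layout are hypothetical, not the project's API.
    """
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    mnemonic, size, source, dest = match.groups()
    return {"mnemonic": mnemonic, "size": size, "source": source, "dest": dest}


print(parse_line("move.l #4, d0 ; an optional comment"))
print(parse_line("move #4, d0"))   # size comes back as None
print(parse_line("nop"))           # not add/move, so None
```

Because the source capture is greedy, a comment that itself contains a comma followed by a space would confuse this version, which is one more reason the expression needs work.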

The reason for the datatype being optional in the regex is because I thought it could be omitted from assembler source. Doing a bit of research seems to show that everyone always includes a datatype (or size as I think is the nomenclature), so I've dropped the idea of having a default type. Still, this now means I can detect and specifically give an error about any missing types.

Anyway, back to parsing.

If the line fails the regex match, then onto the next line. If it succeeds, the parsing continues and an Instruction object is created containing mnemonic and datatype. I haven't yet figured out how the source and destination operands are going to be represented. In assembler code, they can be addresses, registers, or whatever else so I don't really know what's going to happen until I just start typing.

Undecided if mnemonics should inherit from a baseMnemonic and be individual classes, or if I should continue with the mnemonic simply being an enum on Instruction. I like instructions being self-contained, but comparisons become more involved. I'm leaning more towards baseMnemonic with a defined Interface so instructions can do whatever the hell they want as long as they take in an input and give back an output.

Thinking further ahead, I'm also undecided if the entire assembler source should be parsed in one go or instead proceed one line at a time. The latter will allow code to be edited while the program is running, which is quite nice, and goes hand-in-hand with the Step debugger feature.

Right now, I can't think of any advantages of parsing in one go (parse each line and populate Instructions immediately before execution even starts). Maybe "compiler" optimisations at a later date? Performance isn't even a minor consideration, so I don't see that as much of a win. I think the step-by-step method simply gives more advantages and so that may very well be the route to take.
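To make the step-by-step idea concrete, here's a minimal Python sketch (again, not the project's code). The Stepper class is hypothetical, and the "parsing" is a stand-in that just strips the comment; the point is only that a line is looked at when it's about to run, so an edit made further down is picked up.

```python
class Stepper:
    """Hypothetical stepper: parses each line only when it is about to execute."""

    def __init__(self, lines):
        self.lines = lines      # shared with the editor, so it can change under us
        self.position = 0       # index of the next line to run

    def step(self):
        """Parse and return the next line, or None when the source runs out."""
        if self.position >= len(self.lines):
            return None
        text = self.lines[self.position]
        self.position += 1
        # Stand-in for real parsing: drop the comment, tidy the whitespace.
        return text.split(";", 1)[0].strip()


source = ["move.l #4, d0 ; set up", "add.l #1, d0"]
stepper = Stepper(source)
first = stepper.step()
source[1] = "add.l #2, d0"      # edited while the program is "running"
second = stepper.step()
print(first, "|", second)
```

Parsing everything up front would have captured the old second line instead, which is the trade-off being weighed above.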

[1] I can highly recommend Regex Pal for writing and testing expressions.

Assembler, Part 3

I've gone ahead and added the BaseMnemonic class, along with an IMnemonic interface that all CPU mnemonics (instructions, op-codes - whatever you want to call them) inherit from.

The past hour or so has been spent improving the interface and adding things where necessary. The main window now has the standard set of root menu items (File, Edit, View, Debug, Tools, Help) and there's a (currently collapsed) project viewer.

Added a Messages window that will show To-do items, syntax errors, and the like. Also added Status Register to the Registers window. The tooltip for each register's value also displays the value of that register's contents in base 10 (Decimal) and likely base 16 (Hex) in the near future, too.

Next up is making use of the BaseMnemonic class by implementing the Move instruction. Oooh!

Messages and Registers windows


Assembler, part 4 2017

What am I doing?

I've once again resumed working on the interface rather than the actual core; I still don't even know if any of this truly works. Sheesh.

Not a whole lot to report from yesterday as I practically spent the day reading while doing the odd bit of UI tinkering. Big thanks to Rob for reminding me about SyntaxBox! Scintilla is pretty horrible overall and SB is done so much better, so I'm glad the editor is now more solid. Code folding was extremely easy to implement, unlike Scintilla where I just couldn't be bothered and worked on another part instead.

Today I started a new control for displaying parsed instructions. Think of it as a ListBox but with no user interaction, and the "current item" is always vertically centred and highlighted; currently adding column support to tabulate the view. Oh, issue: I can't get the background transparent no matter what. I've done millions of transparent controls in the past, so this is confusing the hell out of me.

Next, I need to have a go at actually executing instructions. I've done very little bit-work in .NET so I'm not sure how that's going to go, especially as I don't even know the syntax for VB.Net; presumably it's going to be bat-shit insane compared to C#.

Nothing instead of just null still annoys me to this very day.


Assembler, part 5 2017

In the previous version, Assembler parsed down the entered code into Instructions which contained an enum for denoting the type of that instruction. For example, Move being set as the property would mean it was a Move instruction.

Now that property has been removed and all Instructions inherit from BaseMnemonic (and some other interface). So, the Move instruction is now its own Move class.

All mnemonics for a CPU are defined within a Mnemonics collection when a Machine is instantiated (along with registers and such within the CPU class); there's (currently) a method that allows the mapping of a textual mnemonic to its class counterpart during the parsing stage; this may later be changed to a Dictionary or something.
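A small Python sketch of the mnemonic-per-class idea, with the Dictionary-style lookup mentioned above. BaseMnemonic, Move and Add are the names from these posts, but the execute signature, the create_mnemonic helper and the 32-bit wrap-around are assumptions made for the example, not how the project actually does it.

```python
class BaseMnemonic:
    """Hypothetical base: every mnemonic takes operand values in and gives a result back."""
    name = ""

    def execute(self, source, dest):
        raise NotImplementedError


class Move(BaseMnemonic):
    name = "move"

    def execute(self, source, dest):
        return source                        # destination simply becomes the source value


class Add(BaseMnemonic):
    name = "add"

    def execute(self, source, dest):
        return (dest + source) & 0xFFFFFFFF  # assumed wrap-around at 32 bits


# Text-to-class lookup, built once when the machine is created.
MNEMONICS = {cls.name: cls for cls in (Move, Add)}


def create_mnemonic(text):
    """Map the textual mnemonic found by the parser to an instance of its class."""
    cls = MNEMONICS.get(text.lower())
    if cls is None:
        raise ValueError(f"unknown mnemonic: {text}")
    return cls()


print(create_mnemonic("move").execute(4, 0))
print(create_mnemonic("add").execute(1, 2))
```

With a dictionary the lookup is a single step, and adding an instruction means writing one class and listing it once.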

Some limited parsing is in for determining what the source and destination operands consist of. This is the most complex part as those operands can contain everything from a plain immediate value to relative addressing.

Because of this, each instruction will have to maintain metadata on exactly what's going on within that instruction; having to parse more than once later on to determine what needs to happen is silly, so lots of metadata is required. As a bonus, we'll get a ton of helpful debugging info for "free" when we implement all that gubbins.

On the interface side of things (c'mon, like I can resist UI work) the syntax highlighting now colourises the datatype (or size as is the actual term) for mnemonics. Turned out this was achievable via making the datatypes an operator when it comes to highlighting as they don't need to be on word boundaries - thanks to Rob for the suggestion. Still more highlighting work required.

The Registers window now has a trace of what it parsed from the entered source code and displays the interpreted code. Changed the font to a mono-space one as the other just looked messy when registers' widths didn't line up. Sorry Segoe UI - not this time.

Editor and Registers


Assembler, part 6 2017

Okay, okay; can't put off doing the operand parsing any longer if I want this project to progress any further.

Just like the (incomplete) line-by-line parsing of the entered source, I'm going with Regular Expressions. Everything is going to be regular expressions; regular expressions all the way down.

There's no specific reason why the source and destination operands should be treated differently, so a generic parsing method is used for determining what an operand consists of.

The Operand object has, among other things, a Type enum property that specifies the type (oh, my!) of operand. This Type dictates what occurs when the instruction is executed. Currently, the following types have been defined:


There will no doubt be more as I add parsing for additional features once I get the existing ones working. Types will certainly become more granular, such as specifying whether an Address is relative or absolute. Any BaseMnemonic-inherited class should be given as much information as possible to perform its function; there's no reason why it should need to perform any of its own interrogation.
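As an illustration of a generic operand parser, here's a Python sketch. The type names (IMMEDIATE, REGISTER, ADDRESS, UNKNOWN) are hypothetical stand-ins rather than the project's actual list, and the syntax rules are a simplified assumption: # marks an immediate, $ marks hex, d0-d7/a0-a7 are registers, and a bare number is treated as an absolute address.

```python
import re
from dataclasses import dataclass
from enum import Enum


class OperandType(Enum):
    # Hypothetical names; not the project's own list of types.
    IMMEDIATE = "immediate"
    REGISTER = "register"
    ADDRESS = "address"
    UNKNOWN = "unknown"


@dataclass
class Operand:
    text: str
    type: OperandType
    value: object = None


def _number(text):
    """Convert '$ff' (hex) or '255' (decimal) to an int."""
    return int(text[1:], 16) if text.startswith("$") else int(text)


_NUMBER = r"(\$[0-9a-fA-F]+|\d+)"
RULES = [
    (OperandType.IMMEDIATE, re.compile("#" + _NUMBER), _number),
    (OperandType.REGISTER, re.compile(r"([ad][0-7])", re.IGNORECASE), str.lower),
    (OperandType.ADDRESS, re.compile(_NUMBER), _number),
]


def parse_operand(text):
    """Classify one operand; the same routine serves source and destination."""
    text = text.strip()
    for operand_type, pattern, convert in RULES:
        match = pattern.fullmatch(text)
        if match:
            return Operand(text, operand_type, convert(match.group(1)))
    return Operand(text, OperandType.UNKNOWN)


print(parse_operand("#4"))
print(parse_operand("d0"))
print(parse_operand("$1000"))
```

Anything the rules don't recognise, such as (a0), comes back as UNKNOWN, which is where more granular types would slot in.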

The next big step is to have the Add mnemonic adding an immediate value to a register, but before that occurs, I'm going to need to sort out a couple of other things first.

Bit-work. All operations can work with either a byte, word, or long-word. Long-word is easy, but the other two will require bit manipulation, which I haven't really done much of in .NET.
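The bit-work boils down to masking. A Python sketch, assuming registers are held as plain 32-bit integers and that a byte or word write to a data register only touches the low 8 or 16 bits (which is how the 68000's data registers behave; address registers are a different story). The function names are made up for the example.

```python
# Masks for the three operation sizes: byte, word, long-word.
SIZE_MASKS = {"b": 0xFF, "w": 0xFFFF, "l": 0xFFFFFFFF}


def write_sized(register, value, size):
    """Return the new 32-bit register contents after writing `value` at `size`.

    Bits above the operation size are kept from the old register value.
    """
    mask = SIZE_MASKS[size]
    return (register & ~mask & 0xFFFFFFFF) | (value & mask)


def read_sized(register, size):
    """Return only the low byte, word, or long-word of a register."""
    return register & SIZE_MASKS[size]


print(hex(write_sized(0x12345678, 0xAB, "b")))    # 0x123456ab
print(hex(write_sized(0x12345678, 0xBEEF, "w")))  # 0x1234beef
print(hex(read_sized(0x12345678, "w")))           # 0x5678
```

The same two operations (AND with a mask, OR the new bits in) translate directly to And/Or in VB.Net or & and | in C#.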

Memory. Sure, the registers are currently just an integer(?) and so easy enough to work with, but what about addresses? There's not much point having address support if there's no memory to read/write with! I haven't really thought about this.

Initial thinking would be just to allocate an array that's a property of the BaseMachine class. I'm not going to add any CPU caches or anything of the sort as it won't make any performance difference as everything is so high level. Machine (memory, CPU, registers, etc) state will easily be serialisable and so can be "saved stated" for whatever reason.
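A sketch of the "just allocate an array" approach in Python. This is an assumption-heavy illustration: it stores bytes rather than integers, lays multi-byte values out big-endian (as the 68000 does), and the Memory class with its read, write, snapshot and restore methods is hypothetical.

```python
class Memory:
    """A flat block of bytes; multi-byte values are stored big-endian like the 68000."""

    def __init__(self, size):
        self.cells = bytearray(size)

    def _check(self, address, width):
        if address < 0 or address + width > len(self.cells):
            raise IndexError(f"address out of range: {address:#x}")

    def write(self, address, value, width):
        """Store the low `width` bytes of `value` starting at `address`."""
        self._check(address, width)
        masked = value & ((1 << (8 * width)) - 1)
        self.cells[address:address + width] = masked.to_bytes(width, "big")

    def read(self, address, width):
        """Read `width` bytes starting at `address` as an unsigned integer."""
        self._check(address, width)
        return int.from_bytes(self.cells[address:address + width], "big")

    def snapshot(self):
        """Copy the whole memory out, e.g. for a saved state."""
        return bytes(self.cells)

    def restore(self, saved):
        self.cells = bytearray(saved)


memory = Memory(16)
memory.write(0, 0x12345678, 4)
print(hex(memory.read(0, 1)), hex(memory.read(2, 2)))  # 0x12 0x5678
```

Since the whole state is one block of bytes, saving and restoring it is a straight copy.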

Assembler, part 7

Haven't quite gone ahead with actually executing instructions as yet because I re-worked the way a "machine" is implemented.

I ended-up ripping out the hard coded properties and features and made everything generic. There's now a set of base classes that are used for implementing the CPU, Registers, Memory, and Mnemonics.

+ BaseMachine
  + Cpu
  + Memory

The first machine being implemented (as said from the very start) is the Motorola 68000. It's the first CPU I've used assembler with back in, ooh, 1995, maybe? It's also simpler and makes more sense than today's CPUs. Don't wanna go making things harder for myself.


The potential issue here is that registers are now loosely typed. I'm not sure how to go about implementing the usp and ssp registers. The pc (Program Counter) register can be added to the base class as all CPUs will require one. Not entirely sure what it will count, though.

When a new register is added to the Registers collection, you need to specify its base name, such as D (for a Data reg) or A (for an Address one). An index is automatically applied. The D, in this case, doesn't actually mean anything as far as the program is concerned. You can call the register(s) FloppyJobbies instead, if you wanted.

Anyway! If you wanted to add 8 Data registers, you'd do the following:

AddRegister("D", 8)

This will create 8 registers, all named D0 to D7. Basically, the index is applied directly after the base name.

Do CPUs only really have (general purpose) Data and Address registers? Could I replace the literal with an enum, instead?

Fetching a register? Like so:


With name being, say, "D6" to get the seventh Data register (the index starts at D0).

As I've (currently) implemented the Register collection as a list of type <Register>, I've added a Dictionary for caching purposes to save having to reiterate over the list each time a register is fetched. But, as registers are uniquely-named, I may just change the base collection to a Dictionary itself.
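For illustration, here's how the dictionary-backed version might look in Python. Only the AddRegister("D", 8) behaviour (base name plus index) comes from the post; the class layout, the get method and the duplicate-name check are assumptions.

```python
class Register:
    def __init__(self, name):
        self.name = name
        self.value = 0


class Registers:
    """Hypothetical collection keyed by register name, so lookups need no iteration."""

    def __init__(self):
        self._by_name = {}

    def add_register(self, base_name, count):
        """Create `count` registers named base_name0 .. base_name(count - 1)."""
        for index in range(count):
            name = f"{base_name}{index}"
            if name in self._by_name:
                raise ValueError(f"register already exists: {name}")
            self._by_name[name] = Register(name)

    def get(self, name):
        """Fetch a register by its full name, e.g. 'D6'."""
        return self._by_name[name]

    def names(self):
        return list(self._by_name)


registers = Registers()
registers.add_register("D", 8)
registers.get("D6").value = 42
print(registers.names())
print(registers.get("D6").value)
```

Python dictionaries keep insertion order, so the names still come back as D0 through D7; in .NET a plain Dictionary makes no such promise, which may matter for the Registers window.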


No changes have been made to mnemonics.


This currently just has a Map array of Integers. Other than that, I haven't done any further work on memory. Registers, above, is the current focus.