Assembler, Part 2

Part number the two!

Did some more work on the Assembler project (part 1 here) that didn't just involve improving the interface. Ooh, ooh - what?

Parsing. That's what.

Hitting Run (or F5) now results in the entered code being parsed and added to an Instructions collection. The following style is correctly parsed:

<mnemonic><datatype> <source_operand>, <dest_operand>

An example of which is:

move.l #4, d0 ; an optional comment which isn't explicitly captured

mnemonic is the instruction (move, add, etc.). datatype is the size, which is one of .b (byte), .w (word), or .l (long-word). source_operand and dest_operand are where stuff happens (values, registers, addresses, and the like).

Each line in the source is read one-by-one. Once a line is read, a regular expression[1] match is attempted (hard-coded add and move for now).

(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)

Note that this expression is far from complete. The first change would be to relax the rigid white-space structure.
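To show what the expression actually captures, here's a minimal sketch in Python (picked purely for illustration, not necessarily what the project is written in). The names LINE_PATTERN and split_line are made up for this example, and anchoring the match at the start of the line is an assumption.

```python
import re

# The expression from the post, unchanged.
LINE_PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


def split_line(line):
    """Return (mnemonic, datatype, source, dest), or None if the line doesn't match.

    Hypothetical helper: re.match anchors at the start of the line, which is
    an assumption about how the match is attempted.
    """
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return match.groups()


print(split_line("move.l #4, d0 ; an optional comment"))
# → ('move', 'l', '#4', 'd0')
```

One thing this makes visible: the greedy (.*) backtracks to the last ", " on the line, so a comment containing a comma followed by a space would end up swallowed into the source operand.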

The datatype is optional in the regex because I thought it could be omitted from assembler source. A bit of research suggests that everyone always includes a datatype (or size, as I think the nomenclature goes), so I've dropped the idea of having a default type. Still, this now means I can detect a missing type and give a specific error for it.
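A sketch of how that missing-type check could look, again in Python for illustration only. MissingSizeError and require_size are hypothetical names, and the error wording is invented. Because the size group is optional, a line without a size still matches, and the group simply comes back empty.

```python
import re

# The expression from the post, unchanged.
LINE_PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


class MissingSizeError(ValueError):
    """Hypothetical error for an instruction written without a size suffix."""


def require_size(line, line_number):
    """Return the size letter for a matching line, or None if the line doesn't match.

    Raises MissingSizeError when the instruction matched but no size was given.
    """
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    mnemonic, size = match.group(1), match.group(2)
    if size is None:
        # The optional group didn't capture anything: "move #4, d0" or "move. #4, d0".
        raise MissingSizeError(
            f"line {line_number}: '{mnemonic}' needs a size (.b, .w or .l)"
        )
    return size
```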

Anyway, back to parsing.

If the line fails the regex match, it's on to the next line. If it succeeds, the parsing continues and an Instruction object is created containing mnemonic and datatype. I haven't yet figured out how the source and destination operands are going to be represented. In assembler code they can be addresses, registers, or whatever else, so I don't really know what's going to happen until I just start typing.
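The loop described above could be sketched roughly like this in Python (for illustration only). The Instruction fields beyond mnemonic and datatype are an assumption: since the operand representation isn't decided yet, this sketch just keeps the raw operand text. parse_source is a made-up name.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The expression from the post, unchanged.
LINE_PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


@dataclass
class Instruction:
    mnemonic: str
    datatype: Optional[str]  # None when the size suffix was left out
    source_text: str         # assumption: raw operand text as a placeholder
    dest_text: str           # assumption: raw operand text as a placeholder


def parse_source(source):
    """Read the source line by line and collect an Instruction per matching line."""
    instructions = []
    for line in source.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # failed the regex match, so on to the next line
        instructions.append(Instruction(*match.groups()))
    return instructions
```

Keeping the operands as plain strings for now means the decision about addresses versus registers can be made later without touching the loop.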

Undecided if mnemonics should inherit from a baseMnemonic and be individual classes, or if I should continue with the mnemonic simply being an enum on Instruction. I like instructions being self-contained, but comparisons become more involved. I'm leaning more towards baseMnemonic with a defined Interface so instructions can do whatever the hell they want as long as they take in an input and give back an output.
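For what the baseMnemonic option might look like, here's a small Python sketch (illustration only). BaseMnemonic, execute, Move, Add, and the MNEMONICS lookup are all hypothetical, and the sketch deliberately ignores sizes, flags, and overflow.

```python
from abc import ABC, abstractmethod


class BaseMnemonic(ABC):
    """Hypothetical interface: take in an input, give back an output."""

    name = ""

    @abstractmethod
    def execute(self, source_value, dest_value):
        """Return the new value for the destination operand."""


class Move(BaseMnemonic):
    name = "move"

    def execute(self, source_value, dest_value):
        # The destination simply takes the source value.
        return source_value


class Add(BaseMnemonic):
    name = "add"

    def execute(self, source_value, dest_value):
        # No wrap-around for .b/.w/.l here; that is left out of this sketch.
        return dest_value + source_value


# Looking a mnemonic up by name keeps comparisons as simple as with an enum.
MNEMONICS = {cls.name: cls() for cls in (Move, Add)}

print(MNEMONICS["add"].execute(4, 3))
# → 7
```

A lookup table like this goes some way towards the comparison worry: the parser only ever deals in names, and each class keeps its behaviour to itself.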

Thinking further ahead, I'm also undecided if the entire assembler source should be parsed in one go or instead proceed one line at a time. The latter will allow code to be edited while the program is running, which is quite nice, and goes hand-in-hand with the Step debugger feature.

Right now, I can't think of any advantages of parsing in one go (parse each line and populate Instructions immediately before execution even starts). Maybe "compiler" optimisations at a later date? Performance isn't even a minor consideration, so I don't see that as much of a win. I think the step-by-step method simply gives more advantages and so that may very well be the route to take.
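A rough Python sketch of the step-by-step route (illustration only; Stepper and step are made-up names). Each line is parsed only at the moment it's stepped onto, so an edit to a line that hasn't been reached yet is picked up automatically.

```python
import re

# The expression from the post, unchanged.
LINE_PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


class Stepper:
    """Hypothetical line-at-a-time parser over an editable list of source lines."""

    def __init__(self, lines):
        self.lines = lines   # shared with the editor, so edits are visible here
        self.position = 0

    def step(self):
        """Parse the next matching line and return its fields, or None at the end."""
        while self.position < len(self.lines):
            line = self.lines[self.position]
            self.position += 1
            match = LINE_PATTERN.match(line)
            if match is not None:
                return match.groups()
        return None


lines = ["move.l #4, d0", "add.l #1, d0"]
stepper = Stepper(lines)
stepper.step()               # parses only the first line
lines[1] = "add.l #2, d0"    # edit made while "running"
print(stepper.step())
# → ('add', 'l', '#2', 'd0')
```

Parsing everything up front would have captured the old "add.l #1, d0" and needed a re-parse after the edit.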

[1] I can highly recommend Regex Pal for writing and testing expressions.