Sunday, 15th October. Day of the HomeLab-2 game jam deadline. My port of The Revenge, or at least its first half, is fully playable. It even has quicksaving/loading implemented, with an in-memory buffer. Unfortunately, I didn't have time to find out what it would entail to implement persistent saving on cassette, but at least with quicksaving you can play cautiously enough to avoid frustrating deaths.
But can you actually win the game? To make sure, I wanted to play through from start to finish, using the hacked-up game scripts I ended up with, which have a lot of the locations and scripts stripped. I could have just played the whole thing on a HomeLab-2 emulator, but I wanted something a bit more streamlined.
So I ended up writing a no-frills bytecode interpreter as a normal desktop Haskell program. By no-frills, I mean that quality-of-life built-in commands like INVENTORY or SAVE were left unimplemented. But of course all the command handlers that were implemented via script were available. And on top of that, the big advantage of this interpreter was that it supported playing with an external input transcript file: the transcript's lines are executed at startup, then the game drops into interactive mode, with all subsequent player commands appended to the transcript. This way, if I found something fishy, I could fix the script files, re-run from the game's beginning up to that point, and then go on playing the fixed version.
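The replay-then-append loop is simple enough to sketch. Here's a minimal Python version (the real interpreter is Haskell, and `execute` here stands in for the whole bytecode engine):

```python
def run_with_transcript(execute, transcript, interactive_input):
    """Replay `transcript`, then go interactive, recording new commands.

    `execute` runs one game command; `interactive_input` yields the
    player's live commands. Appending those to the transcript means the
    next run replays them too, so fixed script files can be re-tested
    from the very beginning of the game.
    """
    for cmd in transcript:            # replay phase: no prompting
        execute(cmd)
    for cmd in interactive_input:     # interactive phase
        execute(cmd)
        transcript.append(cmd)
    return transcript
```

The key property is that the transcript file is both the input and the output of a session, so every playthrough extends a reproducible record.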
Technically, everything worked without issues. However, game-design-wise, I did find two faults that needed fixing. Both of them involved the game depending on its graphics to guide the player:
There's one location where the textual description of an alleyway doesn't mention a big tree growing next to a fence, but the tree is shown in the picture. In fact, the whole point of the puzzle, as emphasized by the built-in hint for this location, is to "use your eyes".
I changed this one so that the tree is included in the description of the fence, thereby still requiring you to use your (in this case, your character's) "eyes".
The second one involves a tree with a carved symbol, made of 45-degree lines and the picture of a hoe, that is actually a map. By default, the location's picture doesn't show the carving, but it appears when the user types EXAMINE TREE. So the obvious fix was to just change the text response to include a carved "mysterious message" with something like SE N SE NE W DIG.
Writing this new interpreter, finding and fixing these mistakes, and then getting all the way to the game's finale took most of the free time I had that day, since, of course, this wasn't the only thing I spent Sunday on. But hey, it all came together just in time.
Except, when I took the finalized version and converted it into WAV for submission, the resulting audio file didn't load correctly on my emulator. It seemed to get to the start of the game, but then it immediately crashed, seemingly leaving the machine in a weird state where the screen would keep flashing between blank and the correct start screen.
I was tearing my hair out at this point. I'VE WORKED ON THIS ALL THIS TIME, MADE IT ALL WORK, PUT A REASONABLE VERSION OF A 64 KB GAME INTO THIS 16 KB MACHINE, AND NOW THE WHOLE THING JUST CRASHES FOR NO REASON WHEN I WANT TO SUBMIT IT, MERE HOURS BEFORE THE DEADLINE?!?!
But, you might ask, how was this different from running the game in the same emulator, with no issues, all week long during development?
First of all, during development, I didn't load my game using an emulated cassette. I didn't load any of my games (Snake or HL-2048) that way. Instead, I hacked my emulator so that it initialized the RAM contents with my game, starting at the right address. Then you can just type something like CALL 17000 (using your program's start address) into BASIC and your program starts.
This was to save time: even something as simple as Snake takes more than half a minute to load from tape, and given how crunched I was for time, I never implemented any flags for temporary faster-than-real-time emulation. For The Revenge, the 16 KB game takes almost three and a half minutes to load. This doesn't sound like a lot, but imagine if you had to develop something where it takes three and a half minutes to try anything out.
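As a sanity check, these load times follow directly from the tape format's 1.6 ms-per-bit timing (described below). A quick back-of-the-envelope calculation, ignoring per-record overhead like file names and sync padding:

```python
BIT_MS = 1.6  # one bit pattern on tape is 1.6 ms long

def load_time_seconds(n_bytes):
    """Rough tape loading time, ignoring file-name and sync overhead."""
    return n_bytes * 8 * BIT_MS / 1000

# A 16 KB program: 16384 * 8 * 1.6 ms is roughly 210 s, about 3.5 minutes.
```

That's 625 bits per second, or about 78 bytes per second, which matches both the "more than half a minute" for a small game and the "almost three and a half minutes" for a 16 KB one.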
OK, so the dev environment didn't match prod, a tale as old as time. But what is causing the crash when loading from tape? And why was there absolutely no similar problem when submitting Snake and HL-2048? To understand that, I need to talk a bit about HomeLab-2's cassette format.
As I have mentioned in an earlier post, bits are stored on tape as 1.6 ms long patterns. Unsurprisingly, eight bits, one after the other, make up one byte. But what do the bytes mean?
A file on cassette is made up of one or more records. Apart from irrelevant details like the file name and some 0-byte padding used by the firmware for synchronization, each record consists of a starting address, and then a sequence of bytes that will be loaded to RAM from that address. If you assembled a program that starts at, let's say, address 0x4200, you might think it's enough to have a single record with that start address. But then, after loading, seemingly nothing would happen: the user would have to somehow know that they have to type CALL 16896 to actually start the program.
For example, on a Commodore 64, the standard way of solving this is to include a BASIC starter: a one-liner BASIC program that only consists of 10 SYS 16896 (SYS being C-64 BASIC's equivalent of CALL). BASIC programs have a known start location, so you would put that program at that location, and then after loading, the user can just type RUN to start the BASIC program which, in turn, starts the machine-code program.
If you wanted to do the same on the HomeLab-2, you could either have two records, one consisting of the one-line BASIC starter and the second containing the meat of your program; or you could assemble the machine-code program right after the BASIC program and put everything into one record. The problem with either approach is that the in-memory representation of BASIC programs on the HomeLab-2 is quite fragile. When you save a BASIC program to tape using the built-in BASIC SAVE instruction, what you get is a record that loads to the very start of RAM, and then contains 256 bytes of internal firmware state until it gets to the actual first byte of the program itself. So you don't just save your program – you save your whole firmware state. And that is because that firmware state includes the pointers to the first and last lines of the BASIC program.
This makes it pretty much impossible to sanely generate a tape image from scratch. The only easy option is to boot up the machine, type in your BASIC loader program, and then save that as a "template" to prepend to your machine-code program. But even with this approach, I was unable to get it working reliably when the machine-code program needs to start right after the BASIC program, which it must, since it requires all the memory it can get.
In fact, for The Revenge, I couldn't even spare the space for the BASIC starter program. But even for Snake and HL-2048, I wanted to generate everything from scratch instead of fiddling around with these 256 extra bytes of who-knows-how-redundantly-intertwined firmware state.
There's a dirty trick for avoiding the BASIC starter altogether; in fact, it's even better, since it results in an auto-starter, i.e. a program that automagically starts after loading, no RUN required. The trick hinges on the fact that after a program is normally loaded, the LOAD command returns to the usual OK BASIC prompt and waits for the user to type in their next command.
However, waiting for the user to type in a command involves the firmware's input routine at 0x28, which in turn calls through an indirection at 0x4002, i.e. in RAM! This means we can hijack it with a two-byte-long record that starts at 0x4002 and contains our program's real start address. A second record then contains the main program, with no BASIC starter. After the full program is loaded, BASIC calls 0x28, which in turn jumps to the address stored at 0x4002, which jumps to our program.
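Here's a hypothetical sketch of building those two records, in Python. Real records also carry a file name and zero-byte sync padding, which I'm omitting; `0x4200` in the test is just an example load address, and the little-endian layout matches the Z80:

```python
def le16(addr):
    """A 16-bit address, Z80-style little-endian."""
    return bytes([addr & 0xFF, (addr >> 8) & 0xFF])

def autostart_records(program, load_addr):
    """Two simplified cassette records implementing the auto-start trick.

    Record 1 loads two bytes to 0x4002, overwriting the firmware's input
    vector with our program's start address; record 2 is the program
    itself. After loading, the firmware's wait-for-input call lands
    straight in our code.
    """
    vector_patch = le16(0x4002) + le16(load_addr)
    main = le16(load_addr) + bytes(program)
    return vector_patch, main
```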
And so, after what felt like an hour of desperation, I put all of the above together in my head and figured out the source of my problem. Snake and HL-2048 have very simple directional input and a game loop that consists of checking the relevant keys' state, reacting to that, and then redrawing the screen. The Revenge, instead, uses textual input, and the easiest way of doing that was to outsource the decoding of keypresses into ASCII character codes to the firmware.
Putting it like this, you can probably see already where this is going. I was actually using the 0x28 routine in my program... except, when loading from tape, that routine's vector was overwritten with my program's start address. So the game would start normally, print the first room's description, then start waiting for user input, which would effectively reset the game. Which is why the game was blinking, since one of the very first things it does is to clear the screen...
And of course if you're not loading from cassette, but instead inject the program directly into RAM, the input vector is not changed since the firmware's whole 0x4000..0x40ff area is left alone. Which explains why it was only the final "mastered" WAV file that was showing this issue, not the intermediate development builds.
Of course, this is the classic kind of bug where 99% of the effort is in figuring out what's happening, and then the fix is trivial. So the first thing my program does now is restore the address at 0x4002 to the hard-coded value 0x0306, which is its value at boot, as determined by using the built-in monitor.
So here it is, the obligatory screenshot:
Yeah, not much of a looker. But it's got it where it counts: in content.
Oh, this screenshot reminds me of one more last-minute hack, one that was planned all along but kept getting postponed: since the game always starts printing response messages on a new line, we can statically word-wrap everything ahead of time. This requires absolutely no extra space at runtime, since we're just replacing some spaces with newlines, or even removing a space when a word naturally ends at the last column.
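A sketch of that ahead-of-time wrapping, in Python (assuming the HomeLab-2's 40-column text mode; the real thing ran as part of my conversion pipeline, so details here are illustrative):

```python
def prewrap(text, width=40):
    """Ahead-of-time word wrap: spaces become newlines where needed, and
    the separator is dropped entirely when the previous word ended
    exactly at the last column, because the display wraps by itself
    there. Output is never larger than the input."""
    out, col = [], 0
    for word in text.split():
        if col == 0:
            out.append(word)
            col = len(word)
        elif col == width:            # display wraps on its own: drop the space
            out.append(word)
            col = len(word)
        elif col + 1 + len(word) <= width:
            out.append(" " + word)
            col += 1 + len(word)
        else:
            out.append("\n" + word)   # replace the space with a newline
            col = len(word)
    return "".join(out)
```

The `col == width` branch is the "removing a space" case from above: at the last column the hardware advances to the next line anyway, so neither a space nor a newline is needed.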
And this concludes my RetroChallenge series of posts for 2023. If you've read along, by now we've learned about the HomeLab-2, written some simple games, and then remixed a 35-year-old Hungarian text adventure game to make it fit into the 16 KB of RAM available on this strange, weird, but somehow still charming machine.
And as for the actual game jam? The Revenge came in 4th out of 13 entries. Interestingly, in the week since the official voting period ended, it's been steadily gaining extra votes from players and by now it's tied for second place. Maybe it took people some time to play enough of it to discover how much depth there is to it.
So here we are, with a couple days to go until the deadline, with a text adventure game that is roughly 22 KB in size, targeting a machine with 16 KB of RAM.
One thing to try would be to start trimming content. This is a straightforward idea: if we can remove messages and pieces of the script, then the size goes down. We know that the "empty" game (i.e. just the engine plus its runtime memory usage) is 2 KB, so by the intermediate value theorem, if we remove enough content, the size at some point will get below 16 KB.
But removing content from a text adventure game is not as easy as it might first sound, if you only have a couple days remaining to do that plus finish the engine plus do enough testing to make sure the game can still be finished. The reason for that is that most of the rooms, almost all objects, and most possible interactions are all there as part of some puzzle, with the whole game's start and end connected through a chain of the puzzles. Remove something and you might break a link and make the game unplayable. Perhaps even worse is if technically it doesn't become unplayable, but you remove something that contains the crucial information that would prompt the player to come up with a solution, and thus instead of a puzzle, now you have a "guess the random input you need to type in" game.
The other way to make the game smaller is to cut it in half: take some mid-point event, and turn that into the winning condition. This should allow you to remove all the rooms that are normally available only after that mid-point event; then, you can remove the scripts that are keyed to the removed rooms; and then you can remove the messages that are not referenced by the remaining scripts anymore.
To pull this off, you need a good middle point that works as the new game's ending not just technically, but narratively as well. Luckily, I found such a point in The Revenge. The original game's plot (spoiler alert for a 35-year-old Hungarian text adventure game!) is that a random Japanese rural dude is sent to fetch water, arrives back to find his village razed and burned by some evil local warlord, and swears revenge (TITLE DROP!). He goes to the city and finds the warlord's mansion, but surprise surprise, it's heavily guarded. So after some local adventuring, the protagonist ends up on a ship to China, lives in a monastery for a while acquiring stamina through rigorous exercise, then learns kung-fu from an old master whose daughter he saves from bandits. Armed with deadly fists, he returns to Japan on a perilous trek through shark-infested waters on a dinghy, infiltrates the mansion, and finally confronts the evil warlord. Standard kung-fu B movie plot.
On the one hand, the good news is that there is a very natural way to split this into two episodes: the first episode can end with the protagonist learning kung-fu, leaving the actual confrontation with the Big Bad to the second episode. On the other hand, as you can see from this structure, you don't actually get to remove half the locations by just keeping half the game, because the second half mostly consists of coming back to Japan and revisiting the same places, just now with the ability to defeat opponents in hand-to-hand combat.
On the gripping hand, we don't need to cut down our memory usage by half -- we just need to reduce it by 6 KB, or 30% of our original 20 KB asset size. That should be doable.
For the technical details, the idea is the following. First of all, let's briefly look at what the game script looks like. Here's some example bytecode that scripts a single interactive response.
1d        Length of second room's scripts, for easy indexing
10        Length of next script, for easy skipping if the input doesn't match
3a 78 00  Matches user input consisting of word 3a "fill" and 78 "bucket"
08 78 02  If item 78 is not here (in inventory or in current room), print message 2 and end
06 0a 29  If variable 0a is not 0, print message 29 and end
04 0a     Set variable 0a to ff
02 04     Print message 4
14 0b     Play chime 0b
00        End
05        Length
33 a4 00  Input: 33 "swim" a4 "river"
0c 03     Go to location 03 "island"
04        Length
02 00     Input: 02 "northeast"
0c 04     Go to location 04 "hillside above village"
00        End of handlers for room 2
15        Length of room 3's handlers
...       Room 3's script
(I should note here that since my HomeLab-2 version has no "multimedia", all bytecode operations like show picture P or play chime C are stripped during conversion.)
Now let's suppose we wanted the game to end in this room, without the ability to progress to the island. We can remove the second handler in its entirety (saving 6 bytes in the process), and then statically re-traverse the remaining bytes to find all go to location X commands. Viewing the result as a directed graph, we will see that room 03 is no longer reachable from the starting location, and so we can throw away its script, i.e. the next 22 bytes (including the length). Then we can build similar reachability information from the remaining bytecode, this time not for the rooms, but for the messages. For example, message 29 is definitely needed (since it is referenced in room 2), but there might be other messages that are only referenced in the 22 bytes we've just thrown away.
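The room-pruning pass is plain graph reachability. This Python sketch assumes the bytecode has already been decoded into (opcode, argument) tuples per room, with 0x0c being the go to location opcode from the listing above; the real traversal works on the raw bytes:

```python
def reachable_rooms(scripts, start):
    """Rooms reachable from `start`, following 0x0c "go to location"
    opcodes. `scripts` maps room id -> list of decoded (opcode, args...)
    tuples; decoding itself is elided here."""
    seen, todo = set(), [start]
    while todo:
        room = todo.pop()
        if room in seen:
            continue
        seen.add(room)
        for op, *args in scripts.get(room, []):
            if op == 0x0c:            # go to location X
                todo.append(args[0])
    return seen
```

Unreachable rooms' scripts can then be dropped wholesale, and the same idea applies a second time to find which message IDs are still referenced by the surviving bytecode.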
By the way, for some eye candy, here's what the map looks like. The left-hand side shows the original map. Rooms 8 to 13 and 70 to 76 comprise two of those super annoying mazes that old text adventure games were keen to include. The right-hand side shows the map of The Revenge Episode 1, i.e. my cut-in-half HomeLab-2 version. Simplifying the episode one maze didn't really save much space since all the constituent rooms use the same text descriptions, but I think skipping them simply makes the game better.
Of course, there are also location-agnostic interaction scripts involving carriable items; but since, with the changes to the map, I already fit into memory (if just barely), and time was quickly running out, I decided not to bother pruning the scripting of items that are not acquirable in the first half of the game.
So I had a fully playable game, with even enough space left to implement a simple in-memory game-state saving facility, which is useful because there are quite a few lethal situations in the game. At this point I had one day (a Sunday) remaining, and I thought it was going to be smooth sailing: my plan was to play through the modified script in a desktop interpreter (with replaying and checkpointing capabilities) to make sure the game is winnable, convert it to a WAV file for deployment, and send it in.
Join me in the next post to read why that last day was anything BUT smooth sailing. DUN DUN DUUUUUUUUUUN!
Two weeks before the game jam deadline, I finally had the idea for my main entry. But with most of the first week occupied with work and travel, will the second week be enough to make it a reality?
Years ago, after remaking Time Explorer, I went on to reverse-engineer István Rátkai's subsequent games for the Commodore 64: The Revenge, and New Frontier I and II. This work culminated in a crowd-funded modern Android release. These Android versions were possible because these text adventure games, in a quite forward-looking way for 1988, were originally written in a small homebrew bytecode language that was then packaged up with an interpreter. Think of it like a very primitive version of Infocom's Z-machine, but without all the Lisp-y sensibilities.
So anyway, my idea was to take the game's scripts as-is, and implement my own bytecode interpreter for the HomeLab-2. The Commodore 64 has 64 KB of RAM, while the HomeLab only has 16; but surely if I get rid of the multimedia (the per-room graphics and the small handful of short melodies played at key points of the games), the remaining text and scripts can't be that much?
Well, it turns out that when talking about text adventure games for 8-bit home computers, all that text is actually quite a lot! In The Revenge, for example, just the various messages printed throughout the game fill 16 KB, and then there are 10 more KB of recognized words and game script. Clearly, this 26 KB of data will not fit into 16 KB, especially not while leaving enough space for the game engine itself and the 256 bytes of game state.
Well, what if we try to be a bit more clever about representation? For example, because of the HomeLab-2's fixed upper-case character set, all that text will be in upper-case and without using any Hungarian accented characters, so we can use something similar to ZSCII to store 3 characters in 2 bytes. This, and some other small tricks like encoding most message-printing in just a single byte instead of a byte to signal that this is message-printing followed by the byte of the message ID, gets us to below 20 KB. Still not good, but better!
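To illustrate the packing, here's a hypothetical Python version with a made-up 32-symbol alphabet; five bits per character means three characters fit in a 16-bit word with one bit to spare (the actual character assignment in my converter may differ):

```python
# 32 symbols, 5 bits each; this particular alphabet is hypothetical.
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ.,!?-"

def pack3(a, b, c):
    """Pack three 5-bit character codes into two bytes (15 of 16 bits used)."""
    x, y, z = (ALPHABET.index(ch) for ch in (a, b, c))
    word = (x << 10) | (y << 5) | z
    return bytes([word >> 8, word & 0xFF])

def unpack3(two_bytes):
    """Recover the three characters from a packed 16-bit word."""
    word = (two_bytes[0] << 8) | two_bytes[1]
    return tuple(ALPHABET[(word >> s) & 0x1F] for s in (10, 5, 0))
```

Storing 3 characters in 2 bytes cuts the text by a third, which is where the bulk of the 26 KB to sub-20 KB reduction comes from.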
It was at this point that I also started writing the engine, to see how big it would turn out. Of course, at this point I could only try it out by including a subset of the rooms and the game script, but it at least gave me a size estimate. It came to about 1.5 KB of code to implement everything, plus 256 bytes of game state. Optionally, if we have space left, we can use a second 256 bytes to implement quicksave/quickload.
So what will we do with our budget of 14 KB, given a 20 KB game? Let's find out in the next post.
Previously, we left off our HomeLab-2 game jam story with two somewhat working emulators, and a creeping realization that we still hadn't written a single line of code.
Actually, it was a bit worse than that. My initial "plan" was to "participate in the HomeLab-2 game jam", with nothing about pesky details such as:

- what game to make,
- with what tools,
- and when to find the time for it all.
I found the answers to these questions in reverse order. First of all, since for three of the five weeks I spent in Hungary I was working from home instead of being on leave, we didn't really have much planned for those days, so the afternoons were mostly free.
Because the HomeLab-2 is so weak in its processing power (what with its Z80 only doing useful work in less than 20% of the time if you want to have video output), and also because I have never ever done any assembly programming, I decided now or never: I will go full assembly. Perhaps unsurprisingly, perhaps as a parody of myself, I found a way to use Haskell as my Z80 assembler of choice.
This left me with the question of what game to do. Coming up with a completely original concept was out of the question, simply because I lack both game design experience and ideas. Also, if there's one thing I learnt from the Haskell Tiny Games Jam, it is that it's better to crank out multiple poor-quality entries (and improve in the process) than it is to aim for the stars (a.k.a. that pottery class story that is hard to find an authoritative origin for). Another constraint was that neither of my emulators supported raster graphics, and I was worried that even if they did, it would be too slow on real hardware; so I wanted to come up with games that would work well with character graphics.
After a half-hearted attempt at Tetris (which I stopped working on when someone else submitted a Tetris implementation), the first game I actually finished was Snake. For testing, I just hacked my emulator so that on boot, it loads the game to its fixed starting address, then used CALL from HomeLab BASIC to start it. This was much more convenient than loading from WAV files; doubly so because it took me a while to figure out how exactly to generate a valid WAV file. For the release version, I ended up going via an HTP file (a byte-level representation of the cassette tape contents), which is used by some of the pre-existing emulators. An HTP-to-WAV converter completes the pipeline.
There's not much to say about my Snake. I tried to give it a bit of an arcade machine flair, with an animated attract screen and some simple screen transitions between levels. One of the latter was inspired by Wolfenstein 3D's death transition effect: since the textual video mode has 40×25 characters, a 10-bit maximal LFSR can be used as a computationally cheap way of replacing every character in a seemingly random (yet full-screen-covering) order.
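For reference, here's the idea in Python: a maximal-length 10-bit LFSR steps through every value from 1 to 1023 exactly once, so filtering out the values that don't correspond to a screen cell gives a cheap pseudo-random visit order over all 40×25 = 1000 characters. The tap choice x^10 + x^7 + 1 below is one standard maximal polynomial; the one I used on the Z80 may differ:

```python
def lfsr10(seed=1):
    """Maximal-length 10-bit Galois LFSR (x^10 + x^7 + 1): visits every
    value in 1..1023 exactly once before repeating."""
    s = seed
    while True:
        yield s
        s = (s >> 1) ^ (0x240 if s & 1 else 0)

def dissolve_order(width=40, height=25):
    """Visit all width*height screen cells in a pseudo-random order,
    skipping the LFSR states that fall outside the screen."""
    cells = width * height
    gen = lfsr10()
    order = [0]                       # the LFSR never outputs 0
    for _ in range(1023):
        v = next(gen)
        if v < cells:
            order.append(v)
    return order
```

The appeal on a Z80 is that one step is just a shift and a conditional XOR, with no multiplication, division, or lookup table needed.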
For my second entry, I went with 2048. Shortly after finishing Snake, thanks to Gábor Képes I had the opportunity to try a real HomeLab-2 machine. Feeling how unresponsive the original keyboard is convinced me that real-time games are not the way to go.
The challenge with a 2048-like game on a platform like this is that you have to update a large portion of the screen as the tiles slide around. Having an assembler that is an EDSL made it a breeze to try various speedcoding techniques, but with lots of tiles sliding around, I just couldn't get it to fit within the frame budget, which led to annoying flickering as a half-finished frame would get redrawn before the video refreshing interrupt handler eventually returned to finish the rest of the frame. So I ended up using double buffering, which technically makes each frame longer to draw, but avoids inconsistent visible frames.
Since the original 2048 is a product of modern times, I decided to pair my implementation with a cleaner design: just like a phone app, it boots straight into the game with no intro screen, with all controls shown right on the main screen.
Between these two games, and all the other fun stuff one does when visiting home for a month, September flew by. As October approached, I stumbled upon this year's RetroChallenge announcement and realized the potential for synergy between doing something cool in a last hurrah before the HomeLab-2 game jam deadline, and also blogging about it for RetroChallenge. But this meant less than two weeks to produce a magnum opus. Which is why this blogpost series became retrospective: there was no way to finish my third game on time while also writing about it.
But what even was that third and final game idea? Let's find out in the next post.
I don't blame you if you don't know what a HomeLab-2 is. Up until I listened to this podcast episode, I didn't either. And there's not much info online in English since it never made it out of Hungary.
As interesting as the history of this "Soviet bloc Homebrew Computer Club" machine is, I will skip it here and concentrate on the technical aspects.
The HomeLab-2 is a home computer in the eighties sense: a computer that boots to a BASIC interpreter, with a built-in keyboard and video output that can be connected to a TV.
The core of the machine is the well-known Zilog Z80 CPU, one of the stars of this class of computers (the other one being, of course, the MOS 6502). It is connected to 8 KB of ROM containing the BASIC interpreter, some IO routines for things like loading programs from cassette, and a rudimentary monitor. The system also comes with 16 KB of general-purpose RAM (upgradeable to 32 KB), and 1 KB of text-mode video RAM, coupled with a 2 KB character set ROM that is inaccessible to the CPU.
One interesting aspect of the machine is that due to export restrictions and a weak currency, availability of more specialised ICs was limited, and so the HomeLab-2 was designed around this limitation by only using 7400-series ICs beside the Z80. This meant that a lot of the functionality that you would expect to be done with custom circuitry, chief among them the video signal generation, was done by the CPU bit-banging the appropriate IO lines. This is somewhat similar to the ZX80/81 video generator, in that the CPU "jumps" to video memory so that its program counter can be used as the fastest-possible-updating software counter, and the supporting circuitry makes sure the CPU's data lines are fed NOPs. Concretely, the value appearing on the data bus is 0x3F, which is effectively a NOP (it inverts the carry flag) and makes it easy to conditionally change it to a 0xFF, i.e. a RST 38, which is used to mark end-of-(visible)-line.
To program the HomeLab-2, you don't need to know the exact details of this, but it is important to keep in mind that as long as the video system is turned on, the CPU will spend 80+% of its time drawing the screen, leaving your program with less than 20% of its 4 MHz speed.
Data storage is done to cassette tape, via an audio mic/speaker port. Writing to a specific memory location sets the audio output into its high level for about 10 μs. The on-tape format is based on simple 10 μs-wide square waves 1.6 ms apart: for high bits, this interval is halved by an extra mid-point square. Of course, for the CPU to be able to accurately keep track of the audio signal timing, the video system has to be turned off while accessing the tape.
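In other words, every bit period starts with a pulse, and a 1 bit adds a second pulse halfway through. A sketch of the pulse timing, in Python (an illustration of the format as described, not of the actual firmware routine):

```python
BIT_PERIOD_MS = 1.6  # one bit occupies 1.6 ms on tape

def pulse_times(bits):
    """Offsets (in ms) of the 10 microsecond pulses encoding `bits`:
    one pulse at the start of each bit period, plus an extra mid-period
    pulse for a 1 bit."""
    times = []
    for i, bit in enumerate(bits):
        t = i * BIT_PERIOD_MS
        times.append(t)               # every bit starts with a pulse
        if bit:
            times.append(t + BIT_PERIOD_MS / 2)  # 1 bit: halve the interval
    return times
```

Decoding is the mirror image: measure the gap since the last pulse, and a half-period gap means the current bit is a 1.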
The audio output is also routed to an internal speaker so you can generate sound by modulating this 10 μs square.
To get started with HomeLab-2 development, we need some way of testing programs. A straightforward way of doing that is an emulator of the machine. Unfortunately, at least back in August when I started working on my games, the emulator situation wasn't exactly rosy.
The obvious place to check first is MAME, and indeed it claims to support the HomeLab-2. However, it was obviously written as a quick hack by someone who didn't really invest the time into understanding how the original machine's video system worked. This of course wreaks havoc with the timings, and makes it impossible to get cassette IO working.
Discounting very old emulators running on DOS, the other one I found was Attila Grósz's emulator of the whole HomeLab family, which had an update as recently as May 2022, but its HomeLab-2 support was quite limited. And much more annoyingly, it's Windows-only, closed-source software. I don't want to dunk on the guy, but that's just stupid; especially because looking at the source of actual working emulators is usually a really good way of resolving any ambiguities in documentation during development. And realistically, what benefit can you possibly hope for from keeping closed the source of an emulator of a computer that, in our Lord's year of 2023, probably interests about a dozen people?!
So I did what any responsible adult would do when faced with a limited-time game jam where he has to also learn Z80 assembly and figure out, well, everything: I set out to put all that aside and cobble together my own emulator. With blackjack and hookers, of course.
Oops I guess the section title is a spoiler.
I wanted to make something that people can just use without any fuss, so I decided to target web browsers and make it into a single-page app. The goal was to quickly get something off the ground, publish it so that others can also use it for the game jam, and then later hope for contributions from people who know the machine better.
Because I was in peak "just get the damn thing working" mode, I decided to write vanilla JavaScript instead of transpiling from some statically typed functional language, which is what I would normally do. With JavaScript, I knew that whatever happened, the code might end up a horrible mess of spaghetti, but at least I wouldn't run into situations where I'm "the first guy to try doing that" and everything breaks, which is usually how it goes with these projects of mine.
For the CPU core itself, I found an easy-to-use Z80 emulator library. I connected it to some array-backed ROM and RAM, started rendering the text video RAM onto a canvas, and let the firmware rip. This got me all the way to the BASIC prompt, not bad for a couple minutes of hacking:
Getting from this to actually blinking the cursor and accepting input was much trickier, however. Remember all that detail a couple paragraphs ago about how the video system is implemented, what fake read values appear on the data bus as the video memory is scanned, that sort of stuff? That was not documented at all. The users' manual only mentions that the NMI "can't be used" for user purposes because it is used by the video system. I pieced the rest together mostly from reading the firmware disassembly, observing the CPU's behaviour, looking at the schematics, and doing a lot of "now if I were designing this machine, how would I do things?".
Eventually I got text-mode video fully working; then I gave up on raster video, because I knew I wouldn't need it for the kinds of games I was envisioning. Then I added cassette IO, which necessitated a cassette player UI, which then became way too much, and I kind of lost steam. But hey, at least I lost steam after I got everything working. Well, everything except sound and raster graphics. But definitely everything that I was planning to use for my games!
This emulator, named HonLab (because HomeLab, and it runs on a web page, and honlap is Hungarian for home page, ha ha very clever, get it?! yeah sometimes I crack myself up!) can be used online here and its source code is on GitHub here.
Now, at this point, the game jam deadline was rapidly approaching, and I still hadn't written a single line of Z80 assembly, so it was time to finally...
Oh my god what is wrong with me.
Also, this blogpost is starting to take too long to write, so long story short: based on my previous good experience with Idris 2's JavaScript backend, and also itching to use Stefan Höck's new SPA FRP library, I decided to write a new version from scratch (only reusing the Z80 core), but this time in Idris 2. It's almost as finished as the first version, just missing the ability to save to tape; you can look at its source here. It was exactly the kind of project that I initially wanted to avoid: one where a significant amount of my time went into reporting upstream bugs and even fixing some. Time enjoyed wasting, and all that.
Also, because the two emulators do look the same from the outside, I won't bother making another screenshot; you wouldn't notice it anyway.
So by the next post, we'll finally get to the beginning of September, when I started writing Actual Lines of Code.
The Hungarian retro-gaming podcast Checkpoint recently did an episode on the HomeLab series of computers from the early 1980s. I had never heard of this Hungarian home computer before, and it turns out there's a good reason for that: these machines never really made it to mass production. Only a couple hundred of them were made, a lot of them built from kits.
Gábor Képes, the guest of the episode, announced a game jam for the HomeLab-2, to spur the creation of new games for the member of the HomeLab family that is most lacking in software. Gábor also has an infectious enthusiasm for the machine, and so I decided I would try my hand at it.
Then I saw that this year's RetroChallenge would also take place in October, making it a perfect opportunity to double-dip: I would do a couple of small games in September to get to know the system, then do a more substantial project in October, finish it by the deadline of the 15th, and spend the rest of the month retroactively documenting it all for RetroChallenge.
So in the upcoming posts, I will tell a bit about the machine itself; the emulator I had to create to have an easier time developing; the second emulator I wrote for no good reason, really; and the games that I entered into the competition (my first ever assembly projects!).
I had an idea for a retro-gaming project that would require a MOS 6502 emulator that runs smoothly in the browser and can be customized easily. Because I only need the most basic functionality from the emulation (I don't need to support interrupts, timing accuracy, or even the notion of cycles), I thought I'd just quickly write one. This post is not about the actual retro-gaming project that prompted this, but instead about my experience with the performance of the generated code using various functional-for-web languages.
As I usually do in situations like this, I started with a Haskell implementation to serve as a kind of executable specification, to make sure my understanding of the details of various 6502 instructions is correct. This Haskell implementation is nothing fancy: the outside world is modelled as a class MonadIO m => MonadMachine m, and the CPU itself runs in MonadMachine m => ReaderT CPU m, using IORefs in the CPU record for registers.
Ironing out all the wrinkles took a whole day, but once it worked well enough, it was time for the next step: rewriting it in a language that can then target the browser. PureScript seemed like an obvious choice: it's used a lot in the real world so it should be mature enough, and with how simple my Haskell code is, PureScript's idiosyncrasies compared to Haskell shouldn't really come into play beyond the syntax level. The one thing that annoyed me to no end was that numeric literals are not overloaded, so all Word8s in my code had to be manually fromIntegral'd; and, in an emulator of an eight-bit CPU, there's a ton of Word8 literals...
The second contender was Idris 2. I've had good experience with Idris 1 for the web when I wrote the ICFP Bingo web app, but that project was all about DOM manipulation and no computation. I was curious what performance I could get from Idris 2's JavaScript backend.
And then I had to include Asterius, a GHC-based compiler emitting WebAssembly. Its GitHub page states it is "actively maintained by Tweag I/O", but it's actually in quite a rough shape: the documentation on how to build it is out of date, so the only way to try it is via a 20G Docker container...
Notably missing from this list is GHCJS. Unfortunately, I couldn't find an up-to-date version of it; it seems the project, or at least work on integrating with standard Haskell tools like Stack, has died off.
To compare performance, I load the same memory image into each emulator, set the program counter to the same starting point, and run for 4142 instructions until a certain target instruction is reached. To smooth over the effects of the browser's JavaScript JIT compiler, each test first runs 100 times as a warm-up, and is then measured over another 100 runs.
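In JavaScript, the measurement loop is along these lines. Note that `benchmark` and `runOnce` are names I'm making up here for illustration; the real harness is in the repo linked below:

```javascript
// Sketch of the benchmarking harness. `runOnce` should execute the whole
// 4142-instruction workload once on a given emulator implementation.
function benchmark(runOnce, warmup = 100, measured = 100) {
  // Warm-up runs let the JIT settle on its optimized code paths.
  for (let i = 0; i < warmup; i++) runOnce();

  // Measured runs: average wall-clock time per run, in milliseconds.
  const t0 = Date.now();
  for (let i = 0; i < measured; i++) runOnce();
  return (Date.now() - t0) / measured;
}
```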
Besides the PureScript, Idris 2, and GHC/Asterius implementations, I have also added a fourth version to serve as the baseline: vanilla JavaScript. Of course, I tried to make it as close to the functional versions as possible; I hope what I wrote is close to what could reasonably be expected as the output of a compiler.
The following numbers come from the collected implementations in this GitHub repo. The PureScript and Idris 2 versions have been improved based on ideas from the respective Discord channels. For PureScript, using the CPS-transformed version of Reader helped; and in the case of Idris 2, Stefan Höck's changes of arguments instead of ReaderT, and using PrimIO when looping over instructions, improved performance dramatically.
| Implementation | Generated code size (bytes) | Average time of 4142 instructions (ms) |
|----------------|-----------------------------|----------------------------------------|
| JavaScript | 12,877 | 0.98 |
| ReasonML/ReScript | 27,252 | 1.77 |
| Idris 2 | 60,379 | 6.38 |
| Clean | 225,283 | 39.41 |
| PureScript | 151,536 | 137.03 |
| GHC/Asterius | 1,448,826 | 346.73 |
So Idris 2 comes out way ahead of the pack here: unless you're willing to program in JavaScript, it's by far your best bet both for tiny deployment size and superb performance. All that remains to improve is to compile monad transformer stacks better, so that the original ReaderT code works as well as the version using implicit parameters.
To run the benchmark yourself, check out the GitHub repo, run make in the top-level directory, then open _build/index.html in a web browser and run await measureAll() in the JavaScript console.
I've added ReScript (ReasonML for the browser), which comes in as the new functional champion! I still wouldn't want to write this program in ReScript, though, because of the extra pain caused by its lack of not only overloaded literals, but even type-driven operator resolution...
Also today, I have received a pull request from Camil Staps that adds a Clean implementation.
This post is about an optimization to the Intel 8080-compatible CPU that I describe in detail in my book Retrocomputing in Clash. It didn't really fit anywhere in the book, and it isn't as closely related to the FPGA design focus of the book, so I thought writing it as a blog post would be a good idea.
Just like the real 8080 from 1974, my Clash implementation is microcoded: the semantics of each machine code instruction of the Intel 8080 is described as a sequence of steps, each step being the machine code instruction of an even simpler, internal micro-CPU. Each of these micro-instruction steps is then executed in exactly one clock cycle.
My 8080 doesn't faithfully replicate the hardware 8080's micro-CPU; in fact, it doesn't replicate it at all. It is a from-scratch design based on a black-box understanding of the 8080's instruction set, and the main goal was to make it easy to understand, instead of making it efficient in terms of FPGA resource usage. Of course, since my micro-CPU is different, the micro-instructions have no one-to-one correspondence with the original Intel 8080's, and so the microcode is completely different as well.
In the CPU, after fetching the machine code instruction byte, we look up the microcode for that byte, and then execute it cycle by cycle until we're done. This post is about how to store that microcode efficiently.
To avoid dealing with the low-level details of what exactly goes on in our microcode, for the rest of this blog post let's use a dictionary of a small handful of English words as our running example. Suppose that we want to store the following table:
0. shape
1. shaping
2. shift
3. shapeshifting
4. ape
5. aping
6. ship
7. shipping
8. grape
9. elope
10. shard
11. sharding
12. shared
13. geared
There's a lot of redundancy between these words, and we will see how to exploit that. But does this make it a poor example that won't generalize to our real use case of storing microcode? Not at all. There are lots of 8080 instructions that are just minimal variations of each other, such as doing the exact same operation but on different general purpose registers; thus, their microcode is also going to be very similar, doing the same setup/teardown around a different kernel.
Since our eventual goal is designing hardware, everything ultimately needs a fixed size. The most straightforward representation of our dictionary, then, is as a vector that is sized to fit the longest single word:
type Dictionary = Vec 14 (Vec 13 Char)
The longest word "shapeshifting" is 13 characters. For all 14 possible inputs, we store 13 characters, using a special "early termination" marker like '.' in the middle for those words that are shorter:
0. shape........
1. shaping......
2. shift........
3. shapeshifting
4. ape..........
5. aping........
6. ship.........
7. shipping.....
8. grape........
9. elope........
10. shard........
11. sharding.....
12. shared.......
13. geared.......
We can then use this table very easily in a hardware implementation: after fetching the "instruction", i.e. the dictionary key, we look up the corresponding Vec 13 Char in the dictionary ROM, and keep a 4-bit counter of type Index 13 to process it cycle by cycle.
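To make the lookup scheme concrete, here's a quick JavaScript sketch of the same idea (the real thing is Clash, of course; this is just an illustration, stepping through one padded row entry per "cycle"):

```javascript
// Look up a key's padded row, then step through it entry by entry,
// stopping at the early-termination marker '.'.
function* steps(dictionary, key) {
  const row = dictionary[key];  // the Vec 13 Char equivalent
  for (const ch of row) {
    if (ch === ".") return;     // early termination
    yield ch;                   // one character per clock cycle
  }
}
```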
This is the equivalent of the microcode representation that we use in Retrocomputing in Clash, but it is easy to see that it is very wasteful. In our illustrative example, we store a total of 14 ⨯ 13 = 182 characters, whereas the total length of all strings is only 85, so we waste about 53% of our storage.
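Just to double-check that arithmetic:

```javascript
// Sanity-checking the storage numbers for the running example.
const words = ["shape", "shaping", "shift", "shapeshifting", "ape", "aping",
               "ship", "shipping", "grape", "elope", "shard", "sharding",
               "shared", "geared"];
const longest = Math.max(...words.map(w => w.length));       // longest word
const fixedSize = words.length * longest;                    // fixed-size table
const usefulSize = words.reduce((n, w) => n + w.length, 0);  // actual content
```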
On our 8080-compatible CPU we get similar (slightly worse) numbers: the longest instruction, XTHL, takes 18 cycles. We don't need to store microcode for the first cycle, since that always corresponds to just fetching the instruction byte itself. This leaves us with 17 micro-operations. For all 256 possible machine code instruction bytes, we end up storing a total of 256 ⨯ 17 = 4352 micro-operations, but if we look at the cycle count of each individual 8080 instruction, the useful part is only 1493 micro-operations. That's a waste of about 65%.
No, wait, not this guy.
The obvious way to cut down on some of that fat is to store each word only up to its end. We can use a terminated representation for this, by keeping some end-of-word marker ('.' in the examples below), and concatenating all items:
0. shape.
6. shaping.
14. shift.
20. shapeshifting.
34. ape.
38. aping.
44. ship.
49. shipping.
58. grape.
64. elope.
70. shard.
76. sharding.
85. shared.
92. geared.
Instead of storing 182 characters, we now only store 99. While this is still more than 85, because we also have to store all those word-separating '.' markers, it is still a big improvement.
However, there's a bit of cheating going on here, because with the above table as given, we'd have no way of looking up words by their original index. For example, word #7 is shipping, but if we started at entry number 7 in this representation, we'd get haping. We need to also store a table of contents that gives us the starting address of each dictionary entry:
0. 0
1. 6
2. 14
3. 20
4. 34
5. 38
6. 44
7. 49
8. 58
9. 64
10. 70
11. 76
12. 85
13. 92
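Building the concatenated data together with its table of contents is straightforward; here's a little JavaScript sketch (names are mine, purely for illustration):

```javascript
// Build the terminated representation plus its table of contents.
// '.' is the end-of-word marker, as in the listing above.
function terminated(words) {
  let data = "";
  const toc = [];
  for (const w of words) {
    toc.push(data.length);  // starting address of this entry
    data += w + ".";        // the word, followed by its terminator
  }
  return { data, toc };
}
```

With this, looking up word #7 means starting at toc[7] and reading characters until the next '.' marker.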
If we want to calculate the contribution of the table of contents to the total size, we have to get a bit more precise. Previously, we characterized ROM footprint in units of characters, but now we need to store 7-bit indices as well. To be able to add the two together, we need to also fix the bit width of each character. For now, let's just use 8 bits per character.
The total size in bits, for storing the table of contents and the dictionary in terminated form, comes out to 14 ⨯ 7 + 99 ⨯ 8 = 890. We can compare this to the 14 ⨯ 13 ⨯ 8 = 1456 bits of the fixed-length representation to see that it's a huge improvement.
Not this guy either.
As we've seen, the table of contents takes up 98 bits, or about 11% of our total footprint in the terminated representation. Can we get rid of it?
One way of doing this is to change the starting address of each word to its key. This is already the case for our first word, shape, since its key is 0 and it starts at address 0. However, the next word, shaping, can't start at address 1, since that is where the second letter of the first word resides.
If we store the next character's address instead of making the assumption that it's going to be the next address, we can start each word at the address corresponding to its key, and then leave subsequent letters to addresses beyond the largest key:
0. s → @14
1. s → @18
2. s → @24
3. s → @28
4. a → @40
5. a → @42
6. s → @46
7. s → @49
8. g → @56
9. e → @60
10. s → @64
11. s → @68
12. s → @75
13. g → @80
14. h → a → p → e → @85
18. h → a → p → i → n → g → @85
24. h → i → f → t → @85
28. h → a → p → e → s → h → i → f → t → i → n → g → @85
40. p → e → @85
42. p → i → n → g → @85
46. h → i → p → @85
49. h → i → p → p → i → n → g → @85
56. r → a → p → e → @85
60. l → o → p → e → @85
64. h → a → r → d → @85
68. h → a → r → d → i → n → g → @85
75. h → a → r → e → d → @85
80. e → a → r → e → d → @85
Not only did we get rid of the table of contents, we can also store the terminators more implicitly, by using a special value for the next pointer. In this example, we can use @85 for that purpose, pointing beyond the last cell. This leaves us with just 85 cells compared to the 99 cells with terminators.
However, each cell now contains both a character and a pointer. Since we need to address 85 cells, the latter takes up 7 bits, for a total of 85 ⨯ (8 + 7) = 1275 bits.
This is a step back from the terminated representation's 890 bits. We can make a note, though, that if each character were at least 36 bits wide instead of 8, the linked representation would come out ahead. But the real reason we are interested in the linked-list form is that it suggests a further optimization that finally exploits the redundancy between the words in our dictionary.
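The linked layout described above can be computed in a few lines; here's a JavaScript sketch (not part of the actual microcode compressor, which, as we'll see, is Haskell), using a null next pointer to stand in for the past-the-end terminator @85:

```javascript
// Lay out the linked representation: each word's first letter sits at the
// cell matching its key, and subsequent letters go beyond the largest key.
function linked(words) {
  const cells = [];
  let free = words.length;  // first cell beyond the keys
  words.forEach((w, i) => {
    const addrs = [i];      // this word's cells: its key, then fresh cells
    for (let j = 1; j < w.length; j++) addrs.push(free++);
    for (let j = 0; j < w.length; j++) {
      cells[addrs[j]] = {
        ch: w[j],
        next: j + 1 < w.length ? addrs[j + 1] : null,  // null = terminator
      };
    }
  });
  return cells;
}
```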
This is where we get to the actual meat of this post. Let's focus on the following subset of our linked list representation:
12. s → @75
13. g → @80
75. h → a → r → e → d → @85
80. e → a → r → e → d → @85
We are storing the shared suffix ared twice, when instead, we could redirect the second one to the first occurrence, saving 4 cells:
12. s → @75
13. g → @80
75. h → @76
76. a → r → e → d → @85
80. e → @76
If we apply the same idea to all words, we arrive at the following linked representation. Note that we still start every word at the index corresponding to its key, avoiding the need for a table of contents:
0. s → @14
1. s → @15
2. s → @16
3. s → @20
4. a → @32
5. a → @34
6. s → @35
7. s → @38
8. g → @41
9. e → @42
10. s → @44
11. s → @48
12. s → @52
13. g → @56
14. h → @4
15. h → @5
16. h → i → f → t → @57
20. h → a → p → e → s → h → i → f → t → @29
29. i → n → g → @57
32. p → e → @57
34. p → @29
35. h → i → p → @57
38. h → i → p → @34
41. r → @4
42. l → o → @32
44. h → a → r → @47
47. d → @57
48. h → a → r → d → @29
52. h → @53
53. a → r → e → @47
56. e → @53
It's hard to see what exactly is going on here from this textual format, but things become much cleaner if we display it as a graph:
We can compute the size of this representation along the same lines as the linked-list one, except now we only have 57 cells. This also means that the pointers can be 6 bits instead of 7, for a total size of 57 ⨯ (8 + 6) = 798 bits. A 10% save compared to the 890 bits of the terminated representation!
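Re-doing the arithmetic from this section in one place:

```javascript
// Storage cost of each representation of the running example, in bits,
// assuming 8-bit characters throughout.
const sizes = {
  fixed:      14 * 13 * 8,      // 14 entries of 13 characters
  terminated: 14 * 7 + 99 * 8,  // 7-bit table of contents + 99 characters
  linked:     85 * (8 + 7),     // 85 cells, each a character + 7-bit pointer
  shared:     57 * (8 + 6),     // 57 cells, pointers now fit in 6 bits
};
```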
Going back to our real-world use case of 8080 microcode, each micro-instruction is 15 bits wide. We have already computed that the fixed-length representation uses 4352 ⨯ 15 = 65,280 bits; if we do the same calculation for the other representations, we get 28,286 bits in the terminated representation, 37,492 bits with linked lists, and a mere 13,675 bits, that is, just 547 cells, with the common suffixes shared!
So how do we compute this shared-suffix representation? Luckily, it turns out we can do that in just a handful of lines of code.
To come up with the basic idea, let's start by thinking about why we are aiming to share common suffixes instead of common prefixes, such as the prefix shared between shape and shaping. The answer, of course, is that we need to address each word separately. If we tried to unify shape and shaping, then starting from key 0 or 1 and making it as far as shap wouldn't tell us, on its own, whether we should continue with e or ing for the given key.
On the other hand, once we start a given word, it doesn't matter where subsequent letters are, including even the beginning of other words (as is the case between shaping and aping). So fan-out is bad (it would mean having to make a decision), but fan-in is a-OK.
There's an obvious data structure for exploiting common prefixes: we can put all our words in a trie. We can make one in Haskell by using a finite map from the next key element to the stored data (for terminal nodes) and the rest of the trie:
import Data.List.NonEmpty (NonEmpty(..))
import qualified Data.List.NonEmpty as NE
import qualified Data.Map as M
import Data.Maybe (fromMaybe)

newtype Trie k a = MkTrie{ childrenMap :: M.Map k (Maybe a, Trie k a) }

children :: Trie k a -> [(k, Maybe a, Trie k a)]
children t = [(k, x, t') | (k, (x, t')) <- M.toList $ childrenMap t]
Note that a terminal node doesn't necessarily mean no children, since one full key may be a proper prefix of another key. For example, if we build a trie that stores "FOO" ↦ 1 and "FOOBAR" ↦ 2, then the node at 'F' → 'O' → 'O' will contain value Just 1 and also the child trie for 'B' → 'A' → 'R'.
The main operation on a Trie that we will need is building one from a list of (key, value) pairs via repeated insertion:
empty :: Trie k a
empty = MkTrie M.empty

insert :: (Ord k) => NonEmpty k -> a -> Trie k a -> Trie k a
insert ks x = insertOrUpdate (const x) ks

insertOrUpdate :: (Ord k) => (Maybe a -> a) -> NonEmpty k -> Trie k a -> Trie k a
insertOrUpdate f = go
  where
    go (k :| ks) (MkTrie ts) = MkTrie $ M.alter (Just . update . fromMaybe (Nothing, empty)) k ts
      where
        update (x, t) = case NE.nonEmpty ks of
            Nothing -> (Just $ f x, t)
            Just ks' -> (x, go ks' t)

fromList :: (Ord k) => [(NonEmpty k, a)] -> Trie k a
fromList = foldr (uncurry insert) empty
Ignoring the actual indices for a moment, looking at a small subset of our example consisting of only shape, shaping, and aping, the trie we'd build from it looks like this:
But we want to find common suffixes, not prefixes, so how does all this help us with that? Well, a suffix is just a prefix of the reversed sequence, so watch what happens when we build a trie after reversing each word, and lay it out left-to-right:
In this representation, terminal nodes correspond to starting letters of each word, so we can store the dictionary index as the value associated with them:
At this point, it should be clear how we are going to build our nicely compressed representation: we build a suffix trie, and then flatten it by traversing it bottom up. Before we do that, though, let's take care of one more subtlety: what if we have the same word at multiple indices in our dictionary? This is not an invalid case, and does come up in practice for the multiple NOP instructions of the 8080, all mapping to the exact same microcode. The solution is to simply allow a NonEmpty list of dictionary keys on terminal nodes in the resulting trie:
fromListMany :: (Ord k) => [(NonEmpty k, a)] -> Trie k (NonEmpty a)
fromListMany = foldr (\(ks, x) -> insertOrUpdate ((x :|) . maybe [] NE.toList) ks) empty

suffixTree :: (KnownNat n, Ord a) => Vec n (NonEmpty a) -> Trie a (NonEmpty (Index n))
suffixTree = fromListMany . toList . imap (\i word -> (NE.reverse word, i))
The pipeline going from a suffix tree to flat ROM payload containing the linked-list representation has three steps:
compress :: forall n a. (KnownNat n, Ord a) => Vec n (NonEmpty a) -> [(a, Maybe Int)]
compress = reorder . renumber . links . suffixTree
  where
    reorder = map snd . sortBy (comparing fst)

    renumber xs = [ (flatten addr, (x, flatten <$> next)) | (addr, x, next) <- xs ]
      where
        offset = snatToNum (SNat @n)

        flatten (Left k) = fromIntegral k
        flatten (Right idx) = idx + offset
Since we don't know the full size of the resulting ROM upfront, we have to use Int as the final unified address type; this is not a problem in practice since for our real use case, all this microcode compression code runs at compile time via Template Haskell, so we can dynamically compute the smallest pointer type and just fromIntegral the link pointers into that.
We conclude the implementation with links, the function that computes the next pointers. The cell emitted for each trie node should link to the cell for its parent node, so we pass the parent down as we traverse the trie (this is the next parameter below). The new cell itself is either put in the next empty cell if it is not a terminal node, i.e. if it doesn't correspond to a first letter in our original dictionary; or, it and all its aliases are emitted at their corresponding Left addresses.
links :: Trie k (NonEmpty a) -> [(Either a Int, k, Maybe (Either a Int))]
links = execWriter . flip runStateT 0 . go Nothing
  where
    go next = mapM_ (node next) . children

    node next (k, mx, t') = do
        this <- case mx of
            Nothing -> Right <$> alloc
            Just (x:|xs) -> do
                tell [(Left x', k, next) | x' <- xs]
                return $ Left x
        tell [(this, k, next)]
        go (Just this) t'

    alloc = get <* modify succ
If you want to play around with this, you can find the full code on GitHub; in particular, there's Data.Trie for the no-frills trie implementation, and Hardware.Intel8080.Microcode.Compress implementing the microcode compression scheme described in this post.
And finally, just for the fun of it, this is what the 8080 microcode — that prompted all this — looks like in its suffix tree form, clearly showing the clusters of instructions that share the same epilogue:
On August 15th, the following answer showed up on the Retrocomputing Stack Exchange site:
llvm-mos compiles a considerable subset of C++ to 6502 machine code out of the box. IIRC it's just missing "runtime support"; things like RTTI, exceptions, and the runtime layout of VTables. [...]
I came across this answer about a week later, so just in time for ICFP 2021 to start. I also had my Haskell Love talk coming up, and I really wanted to get the RetroClash page in shape by then, so that I can direct people there at the end of my talk.
But still! LLVM, and consequently, Rust, on a C64! Damn if that isn't the perfect nerd-sniping for me. So I looked around to find this 6502.org forum post that shows compiling a Rust implementation of the Fibonacci function and linking it with a small C main. I've reproduced their code here in a Git repo for easier fiddling.
So after ICFP, although other things kept piling up, including a GM-ing gig playing Trail of Cthulhu's excellent The Dance in the Blood and coordinating with the awesome David Thinnes on getting Arrow DECA support into clash-pong in time for Haskell Love, I carved out a weekend to dust off my Rust-on-AVR CHIP-8 implementation to try to get it running on the C64.
As a refresher, the idea behind the AVR CHIP-8 stuff was to write a no_std Rust crate that implements the CHIP-8 VM spec in an idiomatic way without worrying about the fact that we'll use a completely unstable and underdeveloped LLVM backend, and then worry about the AVR- and peripheral-specific concerns in separate code. In particular, chirp8-engine uses algebraic datatypes as the result of the instruction decoder, and relies on the Rust compiler to implement case-of-case optimization between the decoder and the interpreter. This architecture allowed using a "normal" SDL-based frontend for debugging the CHIP-8 implementation itself. The AVR LLVM experience itself was quite a rough one: most of my output was in the form of LLVM bug reports over the course of several months.
Now, standing in front of the MOS 6502 LLVM backend four years later, I knew I wouldn't be able to sink this much time into it. I just wanted to see how far I can get before hitting the inevitable wall of miscompilation or unimplemented features. I did a bunch of cleanup on chirp8-engine and dived head first into LLVM-MOS.
Even though I grew up on the Commodore 64, I never really learned the details of accessing its hardware from low-level machine code. I used Commodore BASIC, then a BASIC extender called Graphics BASIC, before moving on to x86; and even there, I was always more interested in going "upwards" in abstraction. As a first step, I wanted to see what it would even look like to draw the CHIP-8 screen on the C64.
The CHIP-8 uses 1-bit graphics at 64⨯32 resolution. That is minuscule compared to the C64's 320⨯200 graphical resolution, but it compares nicely with its 40⨯25 text mode. So the idea is to use each quarter character as a CHIP-8 pixel; in other words, we will fit 2⨯2 CHIP-8 pixels onto each C64 character. This uses 32⨯16 characters, so still not quite full screen (there are 4 characters' worth of horizontal and vertical border), but it doesn't look too bad and is very easy to implement by just creating a character set of all 16 possible 2⨯2 tiles.
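Generating that character set is a few lines of bit-twiddling. Here's a JavaScript sketch of the idea; the assignment of tile-index bits to quadrants is an arbitrary choice I'm making for illustration, not necessarily what my actual code uses:

```javascript
// Generate the 16-character set, one 8x8 character bitmap (8 row bytes)
// per possible 2x2 tile. Each quadrant is a 4x4 block of set pixels.
function tileCharset() {
  const chars = [];
  for (let tile = 0; tile < 16; tile++) {
    // Assumed bit layout: bit 0 = top-left, bit 1 = top-right,
    // bit 2 = bottom-left, bit 3 = bottom-right.
    const tl = tile & 1, tr = tile & 2, bl = tile & 4, br = tile & 8;
    const top    = (tl ? 0xf0 : 0) | (tr ? 0x0f : 0);
    const bottom = (bl ? 0xf0 : 0) | (br ? 0x0f : 0);
    // Four copies of the top row byte, then four of the bottom row byte.
    chars.push([top, top, top, top, bottom, bottom, bottom, bottom]);
  }
  return chars;
}
```

With this layout, drawing a 2⨯2 block of CHIP-8 pixels is just writing character number t to the right screen location, where t's bits are the four pixels.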
So the first program validated this idea (and also kicked the tires on LLVM-MOS) by taking a CHIP-8 video memory dump from the startup screen of David Winter's game Hidden, and writing some C code that stores it in an array of 32 uint64_ts, generates the right character set, and draws it on the screen. I was worried about the handling of 64-bit variables by LLVM-MOS, but it compiled everything to 8-bit instructions like a champ, getting me to the following screen:
Very promising! I switched the video bank, but didn't clear the new memory area, so there's garbage all over the unused area; moreover, there's an off-by-one bug in the character set generation, which is why the "b" of "by" looks wonky; but these are all problems with my code, so I can just fix them! There's stuff on screen on first try, compiled with Clang!
This is where the fun started: taking chirp8-engine and writing a C64 frontend for it, all in Rust. This worked remarkably well. I don't know what the relevant difference between LLVM-MOS and LLVM-AVR is, but doing anything with the latter was an uphill battle back in the day; and now, when I wanted to see where it is today, I ended up having to go back a full year's worth of Rust releases to get anything that could even compile CHIRP-8; and it still gets miscompiled somewhere, resulting in strange bugs after the game's been running for a while on an AVR simulator. I plan to look into this some day, to figure out what's going on, but I didn't want to let anything distract me from the C64 project.
So anyway, the C64 frontend just writes the right 2⨯2 tile characters to the right memory address, uses the Kernal's keyboard buffer at 0x00c5 for keyboard input (using the 4⨯4 rectangle spanned by the 1 to V keys), and that's pretty much it. The chirp8-engine crate takes care of everything else.
After centering the virtual screen, and using a nicer color scheme, the moment of truth came: I hardcoded Hidden's program into an array, compiled the whole shebang with LLVM-MOS (again, it worked on first try), and booted it up:
Yep, that's Hidden running all right!
I can't stress enough how smooth that whole process was: zero to playable in one weekend.
I really wanted to add the ability to load games from disk instead of hardcoding a single game. The challenge here, again, was me not knowing how to do, well, pretty much anything involving IO on the C64 beyond the BASIC LOAD command. Let's try loading the directory first!
The most useful resource I've found for this has been this Codebase 64 page. I'm sure a lot of people would think going to Codebase 64 is overkill for this, but what I've found is that they have very good straightforward implementations before getting into the hyper-optimized demoscene stuff. So I took that directory lister, and ported it first to C, then from C to Rust, changing it in the process to give programmatic access to the directory contents instead of just printing them.
Going from assembly to C was straightforward because I am using Clang from the LLVM-MOS project, so I can just add inline assembly for Kernal calls wherever needed. For example, we can load a filename into the Kernal's IO routines with the following C wrapper:
void k_setnam(const char *fname) {
    uint16_t ptr = (uint16_t)(fname);
    __attribute__((leaf)) asm volatile(
        "jsr $ffbd"
        :
        : "a" ((uint8_t)(strlen(fname))),
          "x" ((uint8_t)(ptr >> 0)),
          "y" ((uint8_t)(ptr >> 8))
        :
    );
}
Unfortunately, we can't port this straight to Rust without going through the pain of building an LLVM-MOS-based Rust compiler. Instead, I decided to use normal Rust nightlies, compile to LLVM IR, and pass that to LLVM-MOS's Clang. But this means, as far as Rust is concerned, it is compiling to x86; and so the inline assembly can't address 6502 registers like X above. But that's OK: we can leave k_setnam and similar functions in C, and use Rust's FFI mechanism to import them.
Using these Kernal functions, I implemented a very basic file selector UI which then calls the Kernal's LOAD routine to load the CHIP-8 memory image into a reserved 4 kB area. That memory area is then passed to Rust as a pointer, and we make a slice from it on the other side:
/* main.c */
uint8_t mem[4 * 1024 - 0x200];

select_and_load_file(mem);
set_frame_irq(&irq);
run(mem, scr);
// machine.rs
#[no_mangle]
pub extern "C" fn run(mem: *mut u8, scr: *mut u8) {
    let mut c64 = C64 {
        mem: unsafe { core::slice::from_raw_parts_mut(mem, RAMSIZE) },
        scr: scr,
        vmem: [0; 32],
    };
    let mut cpu = CPU::new();
    loop {
        cpu.step(&mut c64);
    }
}
Why not do the same for scr? Interestingly, I have observed that going through a slice for screen manipulation is massively slower than using a raw pointer. I haven't looked into this, and it could just be that I am doing something wrong; for reference, this is the faster (but still quite slow...) version:
fn draw_row(scr: *mut u8, rows: &[u8], y: ScreenY) {
    let mut ptr = unsafe { scr.offset((4 + (y / 2) as isize) * STRIDE + 4) };
    for i in 0..8 {
        let mut row1 = rows[(y as usize + 0) * 8 + (7 - i)];
        let mut row2 = rows[(y as usize + 1) * 8 + (7 - i)];
        for _ in 0..4 {
            // Take top two bits of row1 and row2
            let ch = (row1 >> 6) | ((row2 >> 6) << 2);
            row1 <<= 2;
            row2 <<= 2;
            unsafe { *ptr = ch; }
            ptr = unsafe { ptr.offset(1) };
        }
    }
}
And for comparison, the slower version that goes through a slice:

fn draw_row(scr: &mut [u8], rows: &[u8], y: ScreenY) {
    let mut ptr = (4 + (y / 2) as usize) * STRIDE + 4;
    for i in 0..8 {
        let mut row1 = rows[(y as usize + 0) * 8 + (7 - i)];
        let mut row2 = rows[(y as usize + 1) * 8 + (7 - i)];
        for _ in 0..4 {
            // Take top two bits of row1 and row2
            let ch = (row1 >> 6) | ((row2 >> 6) << 2);
            row1 <<= 2;
            row2 <<= 2;
            scr[ptr] = ch;
            ptr += 1;
        }
    }
}
It is worth pointing out that the amazing thing about chirp8-c64 is not how well it works, but that it works at all. It is quite slow; perfectly playable for turn-based games like Hidden, but practically unplayable for real-time games. I suspect most of the slowdown is in updating the screen. If that's the case, maybe it could be improved dramatically by writing just a bit of hand-rolled assembly.
As a concrete example, let me show you clear_screen, first in Rust, and then the assembly output from LLVM-MOS. I'll be the first to admit that the Rust version isn't written the most efficient way (for example, we update the non-border parts of color_arr twice); but the hope, after all, is to be able to write nicely readable code like that and let the Rust compiler and the LLVM-MOS backend optimize it away.
#[no_mangle]
pub extern "C" fn clear_screen(scr: *mut u8) {
    let arr = unsafe { core::slice::from_raw_parts_mut(scr, STRIDE * 25) };
    for i in 0..STRIDE * 25 {
        arr[i] = 0x0f;
    }

    let color_arr = unsafe { core::slice::from_raw_parts_mut(0xd800 as *mut u8, STRIDE * 25) };
    for i in 0..STRIDE * 25 {
        color_arr[i] = 0x0b;
    }
    for y in 4..4 + 16 {
        for x in 4..4 + 32 {
            color_arr[y * STRIDE + x] = 0x07;
        }
    }
}
The generated assembly is really something else for what should mostly be just filling two continuous areas of 1000 bytes each:
clear_screen: lda mos8(__rc6) pha lda mos8(__rc7) pha lda mos8(__rc8) pha lda mos8(__rc9) pha lda mos8(__rc10) sta __clear_screen_sstk lda mos8(__rc11) sta __clear_screen_sstk+1 lda mos8(__rc12) sta __clear_screen_sstk+2 lda mos8(__rc13) sta __clear_screen_sstk+3 lda mos8(__rc14) sta __clear_screen_sstk+4 lda mos8(__rc15) sta __clear_screen_sstk+5 lda mos8(__rc16) sta __clear_screen_sstk+6 lda mos8(__rc17) sta __clear_screen_sstk+7 lda mos8(__rc18) sta __clear_screen_sstk+8 lda mos8(__rc19) sta __clear_screen_sstk+9 lda mos8(__rc20) sta __clear_screen_sstk+10 lda mos8(__rc21) sta __clear_screen_sstk+11 lda mos8(__rc22) sta __clear_screen_sstk+12 lda mos8(__rc23) sta __clear_screen_sstk+13 lda mos8(__rc24) sta __clear_screen_sstk+14 lda mos8(__rc25) sta __clear_screen_sstk+15 lda mos8(__rc26) sta __clear_screen_sstk+16 lda mos8(__rc27) sta __clear_screen_sstk+17 lda mos8(__rc28) sta __clear_screen_sstk+18 lda mos8(__rc29) sta __clear_screen_sstk+19 lda mos8(__rc30) sta __clear_screen_sstk+20 lda mos8(__rc31) sta __clear_screen_sstk+21 lda mos8(__rc32) sta __clear_screen_sstk+22 lda mos8(__rc33) sta __clear_screen_sstk+23 lda mos8(__rc34) sta __clear_screen_sstk+24 lda mos8(__rc35) sta __clear_screen_sstk+25 lda mos8(__rc36) sta __clear_screen_sstk+26 lda mos8(__rc37) sta __clear_screen_sstk+27 lda mos8(__rc38) sta __clear_screen_sstk+28 lda mos8(__rc39) sta __clear_screen_sstk+29 ldx #0 lda #-60 stx mos8(__rc2) sta mos8(__rc3) lda #-40 stx mos8(__rc6) sta mos8(__rc7) ldx #-92 stx mos8(__rc8) sta mos8(__rc9) ldx #-52 stx mos8(__rc10) sta mos8(__rc11) ldx #-12 stx mos8(__rc12) sta mos8(__rc13) ldx #28 lda #-39 stx mos8(__rc14) sta mos8(__rc15) ldx #68 stx mos8(__rc16) sta mos8(__rc17) ldx #108 stx mos8(__rc18) sta mos8(__rc19) ldx #-108 stx mos8(__rc20) sta mos8(__rc21) ldx #-68 stx mos8(__rc22) sta mos8(__rc23) ldx #-28 stx mos8(__rc24) sta mos8(__rc25) ldx #12 lda #-38 stx mos8(__rc26) sta mos8(__rc27) ldx #52 stx mos8(__rc28) sta mos8(__rc29) ldx #92 stx 
mos8(__rc30) sta mos8(__rc31) ldx #-124 stx mos8(__rc32) sta mos8(__rc33) ldx #-84 stx mos8(__rc34) sta mos8(__rc35) ldx #-44 stx mos8(__rc36) sta mos8(__rc37) ldx #-4 stx mos8(__rc38) sta mos8(__rc39) lda #3 sta mos8(__rc4) ldx #-24 lda #15 jsr __memset lda mos8(__rc6) sta mos8(__rc2) lda mos8(__rc7) sta mos8(__rc3) lda #3 sta mos8(__rc4) ldx #-24 lda #11 jsr __memset lda mos8(__rc8) sta mos8(__rc2) lda mos8(__rc9) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc10) sta mos8(__rc2) lda mos8(__rc11) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc12) sta mos8(__rc2) lda mos8(__rc13) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc14) sta mos8(__rc2) lda mos8(__rc15) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc16) sta mos8(__rc2) lda mos8(__rc17) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc18) sta mos8(__rc2) lda mos8(__rc19) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc20) sta mos8(__rc2) lda mos8(__rc21) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc22) sta mos8(__rc2) lda mos8(__rc23) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc24) sta mos8(__rc2) lda mos8(__rc25) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc26) sta mos8(__rc2) lda mos8(__rc27) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc28) sta mos8(__rc2) lda mos8(__rc29) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc30) sta mos8(__rc2) lda mos8(__rc31) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc32) sta mos8(__rc2) lda mos8(__rc33) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc34) sta mos8(__rc2) lda mos8(__rc35) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr 
__memset lda mos8(__rc36) sta mos8(__rc2) lda mos8(__rc37) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda mos8(__rc38) sta mos8(__rc2) lda mos8(__rc39) sta mos8(__rc3) lda #0 sta mos8(__rc4) ldx #32 lda #7 jsr __memset lda __clear_screen_sstk+29 sta mos8(__rc39) lda __clear_screen_sstk+28 sta mos8(__rc38) lda __clear_screen_sstk+27 sta mos8(__rc37) lda __clear_screen_sstk+26 sta mos8(__rc36) lda __clear_screen_sstk+25 sta mos8(__rc35) lda __clear_screen_sstk+24 sta mos8(__rc34) lda __clear_screen_sstk+23 sta mos8(__rc33) lda __clear_screen_sstk+22 sta mos8(__rc32) lda __clear_screen_sstk+21 sta mos8(__rc31) lda __clear_screen_sstk+20 sta mos8(__rc30) lda __clear_screen_sstk+19 sta mos8(__rc29) lda __clear_screen_sstk+18 sta mos8(__rc28) lda __clear_screen_sstk+17 sta mos8(__rc27) lda __clear_screen_sstk+16 sta mos8(__rc26) lda __clear_screen_sstk+15 sta mos8(__rc25) lda __clear_screen_sstk+14 sta mos8(__rc24) lda __clear_screen_sstk+13 sta mos8(__rc23) lda __clear_screen_sstk+12 sta mos8(__rc22) lda __clear_screen_sstk+11 sta mos8(__rc21) lda __clear_screen_sstk+10 sta mos8(__rc20) lda __clear_screen_sstk+9 sta mos8(__rc19) lda __clear_screen_sstk+8 sta mos8(__rc18) lda __clear_screen_sstk+7 sta mos8(__rc17) lda __clear_screen_sstk+6 sta mos8(__rc16) lda __clear_screen_sstk+5 sta mos8(__rc15) lda __clear_screen_sstk+4 sta mos8(__rc14) lda __clear_screen_sstk+3 sta mos8(__rc13) lda __clear_screen_sstk+2 sta mos8(__rc12) lda __clear_screen_sstk+1 sta mos8(__rc11) lda __clear_screen_sstk sta mos8(__rc10) pla sta mos8(__rc9) pla sta mos8(__rc8) pla sta mos8(__rc7) pla sta mos8(__rc6) rts
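Before reaching for hand-rolled assembly, one low-effort thing that might help is leaning on core::ptr::write_bytes, which LLVM can lower to a single memset call per region. This is my own sketch, not code from chirp8-c64, and I haven't checked what LLVM-MOS makes of it; the signature is also changed to take the color RAM as a second pointer (instead of hardcoding 0xd800) purely so it can run on a host for illustration:

```rust
const STRIDE: usize = 40;

// Sketch: fill each region with one write_bytes call, then repaint the
// 32x16 non-border window row by row. `color` stands in for the C64's
// color RAM at 0xd800; taking it as a parameter is for host-side testing.
pub fn clear_screen(scr: *mut u8, color: *mut u8) {
    unsafe {
        // Fill the whole character matrix with 0x0f.
        core::ptr::write_bytes(scr, 0x0f, STRIDE * 25);
        // Fill the whole color RAM with the border color...
        core::ptr::write_bytes(color, 0x0b, STRIDE * 25);
        // ...then repaint the non-border window.
        for y in 4..4 + 16 {
            core::ptr::write_bytes(color.add(y * STRIDE + 4), 0x07, 32);
        }
    }
}
```

Whether this actually beats the loop-based version on the 6502 backend would need measuring, but it at least states the intent ("fill this region") directly to the optimizer.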
chirp8-c64 is also not a particularly fancy CHIP-8 implementation in terms of features. You can really tell I just wanted to get to the point where I could claim "it works". Sound isn't implemented; there is no way for the user to change the screen colors, load a new game, and so on. The UI could also use some improvement, and there's a huge performance reserve in using a fastloader instead of the default Kernal routines.
In summary, Rust and LLVM-MOS proved to be a really solid duo for non-performance-intensive development on the Commodore 64!
Just a quick post to peel back the curtain a bit on a Clash example I posted to Twitter. The tweet in question shows the following snippet, claiming it is "a complete Intel 8080-based FPGA computer that runs Tiny BASIC":
logicBoard
    :: (HiddenClockResetEnable dom)
    => Signal dom (Maybe (Unsigned 8))
    -> Signal dom Bool
    -> Signal dom (Maybe (Unsigned 8))
logicBoard inByte outReady = outByte
  where
    CPUOut{..} = intel8080 CPUIn{..}

    interruptRequest = pure False

    (dataIn, outByte) = memoryMap _addrOut _dataOut $ do
        matchRight $ do
            mask 0x0000 $ romFromFile (SNat @0x0800) "_build/intel8080/image.bin"
            mask 0x0800 $ ram0 (SNat @0x0800)
            mask 0x1000 $ ram0 (SNat @0x1000)
        matchLeft $ do
            mask 0x10 $ port $ acia inByte outReady
I got almost a hundred likes on this tweet, which isn't too bad for a topic as niche as hardware design with Haskell and Clash. Obviously, the above tweet was meant as a brag, not as a detailed technical description; but given the traction it got, I thought I might as well try to expand a bit on it.
Now, the background to all this is that I'm working on a book on retrocomputing with Clash. So in this post, I will try to condense the 75k words of text and the 5k lines of code that I have written so far; it's going to be more of an extended table of contents than CliffsNotes.
Tiny BASIC is an interactive BASIC interpreter; these days, we would call it a REPL. The computer above runs one of the original, Intel 8080-based versions of Tiny BASIC as its firmware. When the computer is turned on, it boots straight into a BASIC prompt, just like the Commodore PETs and Apple IIs of yesteryear. The software assumes there is a peripheral controller connected to certain output ports of the CPU; all IO is done via a stream of bytes written to and read from this controller.
Let's start with the interface. We see that logicBoard has two input signals and one output signal. The Maybe (Unsigned 8) input is a byte coupled with an "input ready" line; in other words, the value is Just a byte if there is new input coming in in a given clock cycle, and Nothing otherwise. The output of the same type is, unsurprisingly, the output of the whole computer in the same format. The extra Bool input is for backpressure: some output peripherals might need time to process an output byte before they are ready for the next one.
We will need to connect these incoming and outgoing bytes to the outside world somehow, and in a real hardware implementation, the IO controller actually contained a UART so that input and output was serialized into a stream of bits. However, by decoupling the UART from the ACIA implementation, and exposing input and output as byte events, we make it easy to hook up alternative communication media:
main :: IO ()
main = do
    sim <- simulateIO_ @System (uncurry logicBoard . unbundle) (Nothing, True)
    withTerminal $ runTerminalT $ forever $ sim $ \outByte -> do
        traverse_ printByte outByte
        inByte <- sampleKey
        return (inByte, True)
topEntity
    :: "CLK"   ::: Clock System
    -> "RESET" ::: Reset System
    -> "RX"    ::: Signal System Bit
    -> "TX"    ::: Signal System Bit
topEntity = withEnableGen board
  where
    board rx = tx
      where
        outByte = logicBoard inByte outReady
        inByte = fmap unpack <$> serialRx (SNat @9600) rx
        (tx, outReady) = serialTx (SNat @9600) (fmap pack <$> outByte)
topEntity
    :: "CLK_25MHZ" ::: Clock Dom25
    -> "RESET"     ::: Reset Dom25
    -> "PS2"       ::: PS2 Dom25
    -> "VGA"       ::: VGAOut Dom25 8 8 8
topEntity = withEnableGen board
  where
    board ps2 = vga
      where
        (frameEnd, vga) = video cursor vidWrite
        (vidReady, cursor, vidWrite) = screenEditor outByte
        outByte = logicBoard inByte vidReady
        inByte = keyboard ps2
The Intel 8080 core we are using is also, of course, written in Clash. I can't really go through its implementation in this small space; the chapter where we construct it from scratch is at 13k words and builds heavily on techniques introduced in earlier chapters describing a CPU that uses Brainfuck as its machine code and a CHIP-8 machine. In this post, we'll just look at its interface, to see how it fits into the logicBoard:
declareBareB [d|
  data CPUIn = CPUIn
    { dataIn :: Maybe Value
    , interruptRequest :: Bool
    } |]

declareBareB [d|
  data CPUOut = CPUOut
    { _addrOut :: Maybe (Either Port Addr)
    , _dataOut :: Maybe Value
    , _interruptAck :: Bool
    , _halted :: Bool
    } |]

type Signals dom b = b Covered (Signal dom)
type Pure b = b Bare Identity
type Partial b = Barbie (b Covered) Last

intel8080 :: (HiddenClockResetEnable dom) => Signals dom CPUIn -> Signals dom CPUOut
We use the wonderful Barbies library for CPUIn and CPUOut so that we can switch between a record of signals and a signal of a record. Outwards, Signals dom CPUOut gives easy access to each output pin separately; but internally, the code describing a single clock cycle's state transition creates a full Pure CPUOut inside a Writer monad, with composable Partial CPUOut-producing fragments.
Looking back at logicBoard, we can see that it keeps interruptRequest at False and feeds the dataIn bus from the result of the memory mapper/address decoder. On the output side, addrOut and dataOut are, unsurprisingly, fed into the address decoder; the other pins of CPUOut are not used in this particular computer.
The Intel 8080 has a 16-bit address bus that it uses both for accessing memory and for communication with up to 256 peripheral controllers. The actual details of how a real Intel 8080 signals memory vs. IO port access is quite baroque; but because we are building full computers using an Intel 8080 ISA-compatible CPU, we don't look for pin compatibility. Instead, we capture the morally right addressing interface of the 8080 by using the type Either Port Address (or, with the type synonyms resolved, Either (Unsigned 8) (Unsigned 16)) for addrOut. In this particular computer, the details of the original Tiny BASIC firmware prescribe the following memory layout:
Putting it all together, we need 6 KB of RAM from 0x0800 to 0x1fff. We split it into two parts for easier address decoding: 2 KB from 0x0800 to 0x0fff, and 4 KB from 0x1000 to 0x1fff:
.--------.
| ROM 2KB|<---.
`--------'    |      .------.
              |      |      |     .------.
.--------.    |      |      |     |      |<--------<( inByte
| RAM 2KB|<---|----->| 8080 |<--->| ACIA |
`--------'    |      |      |     |      |>-------->( outByte
              |      |      |     |      |<--------<( outReady
.--------.    |      `------'     `------'
| RAM 4KB|<---'
`--------'
We want to write this, and exactly this, in logicBoard, and the abstraction we use for this is memoryMap. In an address decoder, we start with an address of the whole memory space, and iteratively cut it down into smaller sub-spaces, until we get to something small enough that it maps to a single component. In this case, we start with Either Port Address, and cut it down into e.g. an Unsigned 11 to address a 2K memory component. The dataOut writes and the dataIn reads are restricted along the address decoding, so that the write request only goes to the selected component, and the read result is taken from that as well.
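As a plain-Haskell illustration of this iterative cut-down (my own sketch of what the decoding computes, not the Addressing implementation; the component names are made up), the Tiny BASIC memory map above amounts to a function like:

```haskell
-- Each `mask base` matches the addresses whose high bits agree with `base`,
-- and hands the remaining low bits to the selected component as an offset.
decodeMem :: Int -> Maybe (String, Int)
decodeMem addr
    | addr < 0x0800 = Just ("rom2k", addr)           -- mask 0x0000: 2 KB ROM
    | addr < 0x1000 = Just ("ram2k", addr - 0x0800)  -- mask 0x0800: 2 KB RAM
    | addr < 0x2000 = Just ("ram4k", addr - 0x1000)  -- mask 0x1000: 4 KB RAM
    | otherwise     = Nothing                        -- unmapped
```

The real memoryMap does the same thing, except the selection also routes the write request and the read result, and the whole thing is a Signal-level circuit rather than a per-address function.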
Under the hood, the memory map description uses a Reader of the address and the write, and a Writer of the read result:
newtype Addressing dom addr dat a = Addressing
    { unAddressing :: ReaderT
          (Signal dom (Maybe addr), Signal dom (Maybe dat))
          (Writer (Ap (Signal dom) (First dat)))
          a
    }
    deriving newtype (Functor, Applicative, Monad)

memoryMap
    :: Signal dom (Maybe addr)
    -> Signal dom (Maybe dat)
    -> Addressing dom addr dat a
    -> (Signal dom (Maybe dat), a)
Why the Maybe in the address? Because we want to know if the CPU doesn't need memory access in a given cycle at all; this is useful when memory is shared between the CPU and other components. We can resolve memory access contention by prioritizing some peripherals (returning Nothing in dataIn, forcing the CPU to stall) and de-prioritizing others (by only allowing them to access their memory component when the addressing of that sub-space is Nothing).
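A term-level sketch of one cycle of such an arbiter (plain Haskell with illustrative names, not the actual Clash code):

```haskell
import Data.Maybe (isJust)

-- One cycle of arbitration for a single-ported memory shared between a
-- prioritized peripheral and the CPU: the peripheral wins, and the CPU
-- stalls whenever both wanted access in the same cycle.
arbitrate
    :: Maybe addr          -- the prioritized peripheral's request
    -> Maybe addr          -- the CPU's request
    -> (Maybe addr, Bool)  -- granted address, and whether the CPU stalls
arbitrate (Just p) cpuReq = (Just p, isJust cpuReq)
arbitrate Nothing  cpuReq = (cpuReq, False)
```

In the de-prioritized direction, the peripheral would instead be handed the address slot only when the CPU-side request is Nothing.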
The implementation of memoryMap and its combinators is not particularly interesting; mask routes based on the high bits and leaves the low bits for the sub-space to handle:
mask
    :: (KnownNat k, KnownNat n)
    => (HiddenClockResetEnable dom)
    => Unsigned (n + k)
    -> Addressing dom (Unsigned k) dat a
    -> Addressing dom (Unsigned (n + k)) dat a
ram0 and romFromFile simply wrap Clash memory primitives (zero-initialized synchronous RAM, and synchronous ROM initialized from a bitfile image) into Addressing-compliant forms, and port hooks up an IO peripheral that responds to PortCommands:
type Port dom addr dat a = Signal dom (Maybe (PortCommand addr dat)) -> (Signal dom (Maybe dat), a)

port :: (HiddenClockResetEnable dom, NFDataX dat) => Port dom addr dat a -> Addressing dom addr dat a
This is also where we find out where the second component of memoryMap's return value comes from: the whole point of IO peripherals is that they can have connections, and as such, output signals, going to other parts of the circuit (or directly the outside world), not just the CPU's data bus.
The full Clash source code of the Tiny BASIC computer, including the Intel 8080 core, is available on GitHub. There is no user documentation at all yet; I've been using all the time and energy I can put into hacking for writing my book instead of tidying up the related repositories. So that's still something I need to get around to eventually; pull requests are obviously welcome!
I've been thinking a bit about describing microcode lately. My motivation was the Intel 8080-compatible CPU I've been building for my upcoming Clash book. As with everything else for that book, the challenge is not in getting it to work — rather, it is in writing the code as close as possible to the way you would want to explain it to another person.
So in the context of a microprocessor as simple as the Intel 8080 and using synchronous RAM, I think of the microcode as a sequence of steps, where each step consists of an internal CPU state transition, and a memory read or write request. For example, the machine instruction 0x34 (mnemonic INR M) increments by one the byte pointed to by the register pair HL. In my core, the micro-architecture has an 8-bit value- and a 16-bit address-register; the latter can be used for memory addressing. To use something else for addressing, you need to load it into the address buffer first. So the steps to implement INR M are:

1. Load the value of HL into the address buffer
2. Read memory at the buffered address into the value register
3. Increment the value register by one
4. Update the flags
5. Write the value register back to memory at the buffered address
However, memory access happens on the transition between cycles, so the final write will not be its own step; rather, it happens as the postamble of step 4. Similarly, the correct address will have to be put on the address pins in the preamble of step 2 for the load to work out:
What makes this tricky is that on one hand, we want to describe preambles as part of their respective step, but of course for the implementation it is too late to process them when we get to that step. So I decided to write out the microcode as a sequence of triplets, corresponding to the preamble, the state transition, and the postamble, and then transform it into a format where preambles are attached to the previous step:
[ (Nothing, Get2 rHL, Nothing)
, (Just Indirect, ReadMem, Nothing)
, (Nothing, ALU ADD Const0x01, Nothing)
, (Nothing, UpdateFlags, Just Indirect)
]
Here, Indirect addressing means setting the address pins from the address buffer (as opposed to, e.g. the program counter); if it is in the postamble (i.e. write) position, it also means the write-request pin should be asserted.
So this is what the microcode developer writes, but then we can transform it into a format that consists of a state transition paired with the addressing:
( Nothing
, [ (Get2 rHL, Just (Left Indirect))
  , (ReadMem, Nothing)
  , (ALU ADD Const0x01, Nothing)
  , (UpdateFlags, Just (Right Indirect))
  ]
)
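Ignoring for a moment the typing concerns that the rest of this post is about, the transformation itself can be sketched at the term level like so (my own illustrative reimplementation, not the real code; the error case is exactly the conflict we want to rule out statically):

```haskell
-- Normalize [(preamble, transition, postamble)] into (first preamble,
-- [(transition, addressing)]): each step's read preamble is attached to the
-- previous step's slot, where it may meet that step's write postamble.
normalize :: [(Maybe r, t, Maybe w)] -> (Maybe r, [(t, Maybe (Either r w))])
normalize [] = (Nothing, [])
normalize ((r0, t0, w0) : rest) = (r0, go t0 w0 rest)
  where
    go t w [] = [(t, Right <$> w)]
    go t w ((r, t', w') : rest') = (t, combine w r) : go t' w' rest'

    -- At the term level, a write immediately followed by a read can only be
    -- caught at runtime.
    combine Nothing  Nothing  = Nothing
    combine (Just w) Nothing  = Just (Right w)
    combine Nothing  (Just r) = Just (Left r)
    combine (Just _) (Just _) = error "conflicting write and read"
```

Running this on the INR M triplets above reproduces the normalized form, with the second step's Indirect read attached to the first step's slot.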
So we're done, without having done anything interesting enough to warrant a blog post.
Or are we?
Note that in the format we can actually execute, the addressing at each step is either a Left read address, or a Right write address (or Nothing at all). But what if we had two subsequent micro-steps, where the first one has a write request in its postamble, and the second one has a read request in its preamble? We are describing a CPU that is more than 40 years old and is connected to single-port RAM, so we can't read and write at the same time. This constraint is correctly captured by the Maybe (Either Read Write) type of memory requests in the normalized form, but it is not enforced by our naïve [(Maybe Read, Transition, Maybe Write)] type for what the microcode developer writes.
So this is what I set out to solve: to give an API for writing microcode that has self-contained steps including the read addressing, but still statically disallows conflicting writes and reads from subsequent steps. We start by going full Richard Eisenberg and lifting the memory addressing directives to the type level using singletons. While we're at it, let's also turn on Haskell 98 mode:
{-# LANGUAGE DataKinds, PolyKinds, ConstraintKinds, GADTs, FlexibleContexts #-}
{-# LANGUAGE TypeOperators, TypeFamilies, TypeApplications, ScopedTypeVariables #-}
{-# LANGUAGE StandaloneDeriving, DeriveFunctor #-}

data Step (pre :: Maybe a) (post :: Maybe b) t where
    Step :: Sing pre -> t -> Sing post -> Step pre post t
deriving instance Functor (Step pre post)
The plan, then, is to do enough type-level magic to only allow neighbouring Steps if at most one of the first post- and the second preamble is a type-level Just index.
The operations we want to support on microcode fragments are cons-ing a new Step and appending fragments. For the first one, we need to check that the postamble of the new Step is compatible with the first preamble of the existing fragment; for the latter, we need the same check between the last postamble of the first fragment and the first preamble of the second fragment. First, let's codify what "compatible" means here:
type family Combine (post :: Maybe b) (pre :: Maybe a) :: Maybe (Either a b) where
    Combine Nothing Nothing = Nothing
    Combine (Just post) Nothing = Just (Right post)
    Combine Nothing (Just pre) = Just (Left pre)
Importantly, there is no clause for Combine (Just post) (Just pre).
Getting dizzy with the thin air of the type level? Let's leave ourselves a thread leading back to the term level:
combine
    :: forall a b (post :: Maybe b) (pre :: Maybe a).
       (SingKind a, SingKind b, SingI (Combine post pre))
    => Sing post -> Sing pre -> Demote (KindOf (Combine post pre))
combine _ _ = demote @(Combine post pre)
(This post is not sponsored by Singpost BTW).
For the actual fragments, we can store them internally almost in the normalized format, i.e. as a term-level list of (Maybe a, [(t, Maybe (Either a b))]). Almost, but not quite, because the first a and the last b need to appear in the index, to be able to give a restricted type to cons and append. So instead of storing them in the list proper, we will store them as separate singletons:
data Ends a b
    = Empty
    | NonEmpty (Maybe a) (Maybe b)

data Amble (ends :: Ends a b) t where
    End :: Amble Empty t
    More
        :: forall (a0 :: Maybe a) (bn :: Maybe b) n t. ()
        => Sing a0
        -> [(t, Demote (Maybe (Either a b)))]
        -> t
        -> Sing bn
        -> Amble (NonEmpty a0 bn) t
deriving instance Functor (Amble ends)
Note that we need a special Empty index value for End instead of just NonEmpty Nothing Nothing, because starting with an empty Amble, the first cons needs to change both the front-end preamble and the back-end postamble, whereas later cons operators should only change the front-end.
type family Cons (b1 :: Maybe b) (ends :: Ends a b) where
    Cons b1 Empty = b1
    Cons b1 (NonEmpty a1 bn) = bn
We can now try writing the term-level cons. The first cons is easy, because there is no existing front-end to check compatibility with:
cons
    :: forall (a0 :: Maybe a) b1 (ends :: Ends a b) t. ()
    => Step a0 b1 t -> Amble ends t -> Amble (NonEmpty a0 (Cons b1 ends)) t
cons (Step a0 x b1) End = More a0 [] x b1
cons (Step a0 x b1) (More a1 xs xn bn) = More a0 ((x, _) : xs) xn bn
We get into trouble when trying to fill in the hole in the cons to a non-empty Amble. And we should be, because nowhere in the type of cons so far have we ensured that b1 is compatible with the front-end of ends. We will have to use another type family for that, to pattern-match on Empty and NonEmpty ends:
type family CanCons (b1 :: Maybe b) (ends :: Ends a b) :: Constraint where
    CanCons b1 Empty = ()
    CanCons (b1 :: Maybe b) (NonEmpty a1 bn :: Ends a b) =
        (SingKind a, SingKind b, SingI (Combine b1 a1))
Unsurprisingly, the constraints needed to be able to cons are exactly what we need to fill the hole with the term-level value of combine b1 a1:
cons
    :: forall (a0 :: Maybe a) b1 (ends :: Ends a b) t. (CanCons b1 ends)
    => Step a0 b1 t -> Amble ends t -> Amble (NonEmpty a0 (Cons b1 ends)) t
cons (Step a0 x b1) End = More a0 [] x b1
cons (Step a0 x b1) (More a1 xs xn bn) = More a0 ((x, combine b1 a1) : xs) xn bn
Now we are cooking with gas: we can re-use this idea to implement append by ensuring we CanCons the first fragment's backend to the second fragment:
type family CanAppend (ends1 :: Ends a b) (ends2 :: Ends a b) :: Constraint where
    CanAppend Empty ends2 = ()
    CanAppend (NonEmpty a1 bn) ends2 = CanCons bn ends2

type family Append (ends1 :: Ends a b) (ends2 :: Ends a b) where
    Append Empty ends2 = ends2
    Append ends1 Empty = ends1
    Append (NonEmpty a0 bn) (NonEmpty an bm) = NonEmpty a0 bm

append :: (CanAppend ends1 ends2) => Amble ends1 t -> Amble ends2 t -> Amble (Append ends1 ends2) t
append End ys = ys
append (More a0 xs xn bn) End = More a0 xs xn bn
append (More a0 xs xn bn) (More an ys ym bm) =
    More a0 (xs ++ [(xn, combine bn an)] ++ ys) ym bm
We finish off the implementation by writing the translation into the normalized format. Since the More constructor already contains almost-normalized form, we just need to take care to snoc the final element onto the result:
stepsOf
    :: forall (ends :: Ends a b) t. (SingKind a, SingKind b)
    => Amble ends t
    -> (Maybe (Demote a), [(t, Maybe (Demote (Either a b)))])
stepsOf End = (Nothing, [])
stepsOf (More a0 xs xn bn) = (fromSing a0, xs ++ [(xn, Right <$> fromSing bn)])
What we have so far works, but there are a couple of straightforward improvements that would be a shame not to implement.
As written, you would have to use Step like this:
Step (sing @Nothing) UpdateFlags (sing @(Just Indirect))
All this singing noise would be more annoying than the Eurovision Song Contest, so I wanted to avoid it. The idea is to turn those Sing-typed arguments into just type-level arguments; then do some horrible RankNTypes magic to keep the parameter order. Prenex? What is that?
{-# LANGUAGE RankNTypes #-}

step
    :: forall pre. (SingI pre)
    => forall t. t
    -> forall post. (SingI post)
    => Step pre post t
step x = Step sing x sing
So now we will be able to write code like step @Nothing UpdateFlags @(Just Indirect) and get a Step type inferred that has the preamble and the postamble appearing in the indices.
Suppose we make a mistake in our microcode, and accidentally want to write after one step and read before the next (using >:> for infix cons):
step @Nothing (Get2 rHL) @(Just IncrPC) >:>
step @(Just Indirect) ReadMem @Nothing >:>
step @Nothing (Compute Const01 ADD KeepC SetA) @Nothing >:>
step @Nothing UpdateFlags @(Just Indirect) >:>
End
This is rejected by the type checker, of course; however, the error message is not as informative as it could be, as it faults the missing SingI instance for a stuck type family application:
• No instance for (SingI (Combine ('Just 'IncrPC) ('Just 'Indirect)))
arising from a use of ‘>:>’
With GHC's custom type errors feature, we can add a fourth clause to our Combine type family. Unfortunately, this requires turning on UndecidableInstances for now:
{-# LANGUAGE UndecidableInstances #-}
import GHC.TypeLits

type Conflict post pre =
    Text "Conflict between postamble" :$$:
    Text " " :<>: ShowType post :$$:
    Text "and next preamble" :$$:
    Text " " :<>: ShowType pre

type family Combine (post :: Maybe b) (pre :: Maybe a) :: Maybe (Either a b) where
    Combine Nothing Nothing = Nothing
    Combine (Just post) Nothing = Just (Right post)
    Combine Nothing (Just pre) = Just (Left pre)
    Combine (Just post) (Just pre) = TypeError (Conflict post pre)
With this, the error message changes to:
• Conflict between postamble 'IncrPC and next preamble 'Indirect
Much nicer!
The final difference between what we have described here and the code I use for real is that in the real version, Amble also tracks its length in an index. This is needed because the CPU core is used not just for emulation, but also FPGA synthesis; and in real hardware, we can't just store lists of unbounded size in the microcode ROM. So instead, microcode is described as a length-indexed Amble n ends t, and then normalized into a Vec n instead of a list. Each instruction can be at most 10 steps long; everything is then ultimately normalized into a uniformly typed Vec 10 by padding it with "go start fetching next instruction" micro-ops.
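The padding step at the end can be sketched in plain Haskell (illustrative types of my own; the real code works on length-indexed Vecs, and instructions are assumed to fit in 10 micro-ops):

```haskell
data MicroOp = FetchNext | Op String deriving (Show, Eq)

-- Pad an instruction's micro-ops to a uniform length of 10, filling the
-- tail with "go start fetching the next instruction" micro-ops.
padMicrocode :: [MicroOp] -> [MicroOp]
padMicrocode ops = take 10 (ops ++ repeat FetchNext)
```

In the Vec-based version, the same idea is expressed with the length in the type, so the "at most 10 steps" assumption is checked by the compiler rather than by convention.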
Find the full code on GitHub, next to the rest of the Intel 8080 core.
A couple of weeks ago, I attended some of the FSCD talks that the time zone difference allowed. One of the satellite workshops of FSCD this year was the Workshop on Rewriting Techniques for Program Transformations and Evaluation, where Martin Lester presented a talk titled Program Transformations Enable Verification Tools to Solve Interactive Fiction Games.
Because I haven't been able to find slides or the paper itself online, let me quickly summarize it in my own words here. Back in the late seventies and early eighties, Scott Adams (no relation to the Dilbert comic strip creator) was a prolific text adventure creator; the Digital Antiquarian has written about his games at length, but basically these were simplistic (and, in my opinion, creatively quite weak) games published on very limited platforms, where they found a following. Martin Lester's talk was about taking an off-the-shelf open source interpreter for these interactive fiction games (ScottFree) and feeding it (and a given game data file's contents) to a compiler that turns it into an SMT problem, with the free variables corresponding to the user input. Then, if an SMT solver can find an assignment satisfying a winning end condition, that means we have found a possible transcript of user input that wins the given game.
This combination of semi-formal methods and interactive fiction, plus the fact that I wanted to play around with SMT solver-backed interpreters ever since the Rosette talk at last year's ICFP, meant that the pull of this topic was just irresistible nerd catnip for me. So I took a week of afternoon hacking time from working on my Clash book, and started writing a Scott Adams adventure game engine in Haskell, with the aim of doing something similar to Lester's work.
And so now I'm here to tell you about how it went: basically, a "poor man's Rosette". When I started this, I had never used SMT solvers, and had never even looked at SBV, so I am not claiming I became an expert in just a week. But I managed to bumble my way through to something that works well enough on a toy example, so... let's see.
Step one was to implement a bog-standard interpreter for these Scott Adams adventure games. I didn't implement every possible condition and instruction, just barely enough to get a non-trivial example from ScottKit working. My target was the fourth tutorial adventure; see if you can find a solution either by playing the game, or by reading the game script. Why specifically the fourth tutorial example? Well, the short answer is because that is what Lester used as his example in his talk.
Here is the Git commit of this implementation; as you can see, there is really nothing special in there. The game data is parsed using Attoparsec into a simple datatype representation, which is then executed in ReaderT WriterT State, with the Reader containing the static game data, the Writer the output messages, and the State the game state, which is mostly just the current locations of the various items the player can interact with:
data S = S
    { _currentRoom :: Int16
    , _needLook :: Bool
    , _itemLocations :: Array Int16 Int16
    , _endState :: Maybe Bool
    }
    deriving Show
makeLenses ''S

type Engine = ReaderT Game (WriterT [String] (State S))
The second step was to use SBV so that (mostly) the same interpreter can also be executed symbolically. This is the interesting part. It started with changing the Writer output and the State representation to use SBV's symbolic types, and then following the type errors:
data S = S
    { _currentRoom :: SInt16
    , _needLook :: SBool
    , _itemLocations :: Array Int16 SInt16
    , _endState :: SMaybe Bool
    }
    deriving (Show, Generic, Mergeable)
Arithmetic works out of the box because of the Num instances; because the Haskell Prelude's Eq and Ord typeclasses hardcode the result type to Bool, and we want symbolic SBool results instead, we have to replace e.g. (==) with (.==).
For the control structures, my idea was to write Mergeable instances for the MTL types. The definitions of these instances are very straightforward, and it allowed me to define symbolic versions of things like when or case with literal matches. The end result of all this is that we can write quite straightforward monadic code, just replacing some combinators with their symbolic counterpart. Here's an example of the code that runs a list of instruction codes in the context of their conditions; even without seeing any other definitions it should be fairly straightforward what it does:
execIf :: [SCondition] -> [SInstr] -> Engine SBool
execIf conds instrs = do
    (bs, args) <- partitionEithers <$> mapM evalCond conds
    let b = sAnd bs
    sWhen b $ exec args instrs
    return b
I decided to use concrete lists and arrays of symbolic values instead of symbolic arrays throughout the code. One interesting example of sticking to this approach is in the implementation of listing all items in the current room. The original concrete code looks like this:
let itemsHere =
        [ desc
        | (Item _ _ desc _, loc) <- zip (A.elems items) (A.elems itemLocs)
        , loc == here
        ]
unless (null itemsHere) $ do
    say "I can also see:"
    mapM_ (\desc -> say $ " * " <> desc) itemsHere
For the symbolic version, I had to get rid of the filtering (since loc .== here returns an SBool now), and instead, I create a concrete list of pairs of symbolic locations and concrete descriptions. By going over the full list, I can push all the symbolic ambiguity down to just the output:
let itemLocsWithDesc =
        [ (loc, desc)
        | (Item _ _ desc _, loc) <- zip (A.elems items) (A.elems itemLocs)
        ]
    anyHere = sAny ((.== here) . fst) itemLocsWithDesc
sWhen anyHere $ do
    say_ "I can also see:"
    forM_ itemLocsWithDesc $ \(loc, desc) ->
        sWhen (loc .== here) $ say $ literal $ " * " <> desc
By the way, as the above code shows, I kept the user-visible text messages in the symbolic version. This is completely superfluous for solving, but it allows using the symbolic interpreter with concrete values: since all input is concrete, we can safely assume that the symbolic output values are all constants. In practice, this means we recover the original, interactively playable version from the SBV-based one simply by running inside SBV's Query monad and getValue'ing the concrete String output from the SStrings coming out of the WriterT. I wouldn't be surprised if this turns out to be a major drain on performance, but because my aim was mostly to just get it working, I never bothered checking. Besides, since no one looks at the output in solver mode, maybe Haskell's laziness ensures there's no overhead. I really don't know.
So at this point, we have a symbolic interpreter which we can feed user input line by line:
stepPlayer :: SInput -> Engine (SMaybe Bool)
stepPlayer (verb, noun) = do
    perform (verb, noun)
    finished
The question then is, how do we keep running this (and letting the state evolve) for more and more lines of symbolic input, until we get an sJust sTrue result (meaning the player has won the game)? My original idea was to let the user say how many steps to check, and then generate a full list of symbolic inputs up-front. I asked around on Stack Overflow for something better, and it was this very helpful answer that pointed me in the direction of the Query monad in the first place. With this incremental approach, I can feed it one line of symbolic input, check for satisfiability with the newly yielded constraints, and if there's no solution yet, keep this process going, letting the next stepPlayer call create additional constraints.
I've factored out the whole thing into the following nicely reusable function; this is also the reason I am using ReaderT WriterT State instead of RWS: it lets me peel the stack away naturally down to a State.
loopState :: (SymVal i) => (Int -> Query (SBV i)) -> s -> (SBV i -> State s SBool) -> Query ([i], s)
loopState genCmd s0 step = go 1 s0 []
  where
    go i s cmds = do
        io $ putStrLn $ "Searching at depth: " ++ show i
        cmd <- genCmd i
        let cmds' = cmds ++ [cmd]

        push 1
        let (finished, s') = runState (step cmd) s
        constrain finished

        cs <- checkSat
        case cs of
            Unk -> error $ "Solver said Unknown, depth: " ++ show i
            Unsat -> do
                pop 1
                go (i+1) s' cmds'
            Sat -> do
                cmds' <- mapM getValue cmds'
                return (cmds', s')
Because I'm a complete SBV noob, I was reluctant to attribute problems to SBV bugs first; I ended up with a ton of Stack Overflow questions. However, it turned out I do still have my Midas touch of finding bugs very quickly in anything I start playing around with; this time, it started with SBV generating invalid SMTLib output from my code. Although my initial report was basically just "this several hundred line project Just Doesn't Work", I managed to cut it down into more reasonable size. The SBV maintainers, especially Levent Erkök, have been very helpful with quick turnaround.
The other bug I found was symbolic arrays misbehaving; although I ended up using neither SFunArray nor SArray in the final version of ScottCheck, it is good to know that my silly project has somehow contributed, if just a little, to making SBV itself better.
So, lots of words, but where's the meat? Well first off, my code itself is on GitHub, and could serve as a good introduction to someone wanting to start using SBV with a stateful computation. And second, here is a transcript of ScottCheck verifying that the tutorial game is solvable, with the Z3 backend; the timestamps are in minutes and seconds from the start, to give an idea of how later steps become slower because there's an exponential increase in all possible inputs leading up to it. The words may look truncated, but that's because the actual internal vocabulary of the game only uses three-letter words; further letters from the user are discarded (so COIN parses as COI etc.).
00:00 Searching at depth: 1
00:00 Searching at depth: 2
00:00 Searching at depth: 3
00:00 Searching at depth: 4
00:00 Searching at depth: 5
00:01 Searching at depth: 6
00:01 Searching at depth: 7
00:02 Searching at depth: 8
00:05 Searching at depth: 9
00:11 Searching at depth: 10
00:24 Searching at depth: 11
00:45 Searching at depth: 12
01:24 Searching at depth: 13
02:38 Searching at depth: 14
03:35 Solution found:
03:35   1. GO WES
03:35   2. GET CRO
03:35   3. GO EAS
03:35   4. GO NOR
03:35   5. GET KEY
03:35   6. GO SOU
03:35   7. OPE DOO
03:35   8. GO DOO
03:35   9. GET COI
03:35  10. GO NOR
03:35  11. GO WES
03:35  12. GO NOR
03:35  13. DRO COI
03:35  14. SCO ANY
TL;DR: This is a detailed description of how I got Clashilator working seamlessly from Cabal. It took me three days to figure out how the pieces need to fit together, and most of it was just trawling the documentation of Cabal internals, so I better write this down while I still have any idea what I did.
We set the scene with the dramatis personæ first:
When designing some circuit, it is very useful to be able to simulate its behaviour. Getting debugging information out of a hardware FPGA is a huge hassle; iteration is slow because FPGA synthesis toolchains, by and large, suck; and driving the circuit (i.e. setting the right inputs in the right sequence and interpreting the outputs) can be almost as complicated as the circuit under testing itself. Of course, all of these problems apply doubly to someone like me who is just dabbling in FPGAs.
So during development, instead of synthesizing a circuit design and loading it onto an FPGA, we want to simulate it; and of course we want to use nice expressive languages to write the test bench that envelopes the simulated circuit. One way to do this is what I call very high-level simulation: in this approach, we take the Haskell abstractions we use in our circuit, and reinterpret them in a software context. For example, we might have a state machine described as i -> State s o: instead of lifting it into a signal function, we can just runState it in a normal Haskell program's main and do whatever we want with it.
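A minimal sketch of this "very high-level simulation" approach (a toy example of my own, not from any real circuit): take a transition function of type i -> State s o and drive it directly with runState, no signals involved:

```haskell
import Control.Monad.State

-- A toy state machine: input is a tick, state is a counter,
-- output reports whether the counter wrapped around
counter :: Bool -> State Int Bool
counter tick = do
    n <- get
    let n' = if tick then (n + 1) `mod` 4 else n
    put n'
    return (tick && n' == 0)

-- Drive it in plain software over a list of inputs
simulate :: [Bool] -> [Bool]
simulate ticks = evalState (mapM counter ticks) 0
```

In the real setting, the same counter could instead be lifted into a signal function for synthesis; the point is that the software test bench gets to reuse the State-level description directly.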
However, sometimes we want to simulate whole circuits, i.e. Signal dom i -> Signal dom o functions that might have who knows what registers and memory elements inside. For example, if we have a circuit that generates video output from a frame buffer, there's a difference between a high-level simulation that renders the frame buffer's contents to the screen, and a lower level one that interprets the VGA signal output of the circuit. Timing issues in synchronizing the VGA blanking signals with the color lines will only manifest themselves in the latter. So for this kind of application, Clash of course contains a signal simulator that can be used to feed inputs into a circuit and get outputs. For example, here's a simulation of a Brainfuck computer where only the peripheral lines are exposed: internal connections between the CPU and the RAM and ROM are all part of the Clash circuit.
There is only one problem with the Clash simulator: its performance. This small benchmark shows how long it takes to simulate enough cycles to draw 10 full video frames at 640 ⨯ 480 resolution (i.e. 4,192,000 cycles). Clash does it in ~13 seconds; remember, at 60 FPS, it shouldn't take more than 166 milliseconds to draw 10 frames if we want to simulate it in real time. Of course, real-time simulation at this level of detail isn't necessarily feasible on consumer hardware; but less than one frame per second means any kind of interactive applications are out.
In contrast, Verilator, an open-source Verilog simulator, can run all 4,192,000 cycles in 125 ms. This could be faster than realtime, were it not for the additional overhead of drawing each frame to the screen (and the Haskell FFI of 4 million calls across to C...), and of course this is a very simple circuit that only renders a single bouncing ball; anything more complex will only be slower. But still, that same benchmark shows that 20+ FPS is possible, end-to-end, if we ditch Clash and use Verilator instead. Pong is fully playable at 13 FPS.
The interface between Clash and Verilator is simple: Verilator consumes Verilog code, so we can simply run Clash and point Verilator at its output. However, we still need to connect that Verilator code to the Haskell code to drive the inputs and interpret the outputs. Here are the glue files from the Pong example:
As I was preparing to write the next chapter of a Clash book I've been working on, I made a new Clash project and then, because I needed a Verilator simulation for it, I started going through the steps of making all these files. And of course, I realized this should be all automated. But what is all in that sentence?
Step one was to write generators for the C++ and Haskell source files and the Makefile. This is quite easy, actually; after all, it is the fact that these files are so regular that makes it infuriating writing them by hand. So we do a bit of text template substitution, using Clash's .manifest output as the source of input/output pin names and bus widths. This gives us a simple code generator: you run Clash yourself, point Clashilator at a .manifest file, and it outputs a bunch of files, leaving you ready to run make. Mission accomplished?
No, not really.
While we've eliminated the boilerplate in the source files, one source of boilerplate remains: the Cabal package settings. Here's the relevant HPack package.yaml section from the Pong example:
extra-libraries: stdc++
extra-lib-dirs: verilator
include-dirs: verilator
build-tools: hsc2hs
ghc-options: -O3 -fPIC -pgml g++ -optl-Wl,--whole-archive -optl-Wl,-Bstatic -optl-Wl,-L_build/verilator -optl-Wl,-lVerilatorFFI -optl-Wl,-Bdynamic -optl-Wl,--no-whole-archive
Oof, that hurts. All those magic GHC options just to statically link to libVerilatorFFI.a, to be repeated across all Clash projects that use Verilator...
Also, while Clashilator outputs a Makefile to drive the invocation of Verilator and the subsequent compilation of the C++ bits, it doesn't give you a solution for running *that* Makefile at the right time — not to mention running Clashilator itself!
The problem here is that in order to compile a Verilator-using Haskell program, we first need to compile the other, non-simulation modules into Verilog. And this is tricky because those modules can have external dependencies: remember Clash is just a GHC backend, so you can import other Haskell libraries. And how are we going to bring those libraries in scope? I myself use Stack, but your mileage may vary: you could be using Cabal directly, or some Cabal-Nix integration. In any case, you'd basically need to build your package so you can compile to Verilog so you can run Verilator so you can... build your package.
To solve this seemingly circular dependency, and to get rid of the Cabal file boilerplate, I decided to try and do everything in the Cabal workflow. Whatever your preferred method of building Haskell packages, when the rubber hits the road, they all ultimately run Cabal. If we run Clash late enough in the build process, all dependencies will be installed by then. If we run Verilator early enough in the build process, the resulting library can be linked into whatever executable or library Cabal is building.
If we do all this during cabal build, everything will happen at just the right time.
So, what gives us at least a fighting chance is that Cabal is extensible with so-called hooks. You can write a custom Setup.hs file like this:
import Distribution.Simple

main = defaultMainWithHooks simpleUserHooks
Here, simpleUserHooks is a record with a bunch of fields for extension points; of particular interest to us here is this one:
buildHook :: PackageDescription -> LocalBuildInfo -> UserHooks -> BuildFlags -> IO ()
At this point, we have breached the defenses: inside buildHook, we can basically do arbitrary things as long as the effect is building the package. In particular, we can:
The result of all this is a bunch of new files under the build directory, and modified BuildInfos for all the components marked with x-clashilator-top-is. We put these back into the PackageDescription and then call the default buildHook, which then goes on to compile and link the Haskell simulation integrating the Verilator parts:
clashilatorBuildHook :: PackageDescription -> LocalBuildInfo -> UserHooks -> BuildFlags -> IO ()
clashilatorBuildHook pkg localInfo userHooks buildFlags = do
    pkg' <- clashilate pkg localInfo buildFlags
    buildHook simpleUserHooks pkg' localInfo userHooks buildFlags
All the details are in the full implementation, which is probably worth a look if you are interested in Cabal's internals. As I wrote in the beginning of this post, it took me days to wade through the API to find all the moving parts that can be put together to apply this level of violence to Cabal. The final code also exports a convenience function clashilatorMain for the common case where enabling Clashilator is the only desired Setup.hs customization; and also clashilate itself for hardcore users who want to build their own buildHooks.
The implementation is almost 150 lines of not particularly nice code. It is also missing some features; most notably, it doesn't track file changes, so Clash and Verilator is always rerun, even if none of the Clash source files have changed. It is also completely, utterly untested. But it does give us what we set out to do: completely boilerplate-less integration of Clash and Verilator. A complete example package is here, and here's the money shot: the executables section of the package.yaml file.
executables:
  simulator:
    main: simulator.hs
    verbatim:
      x-clashilator-top-is: MyCircuit.Nested.Top
      x-clashilator-clock: CLK
My last post ended with some sample CλaSH code illustrating the spaghetti-ness you can get yourself into if you try to describe a CPU directly as a function (CPUState, CPUIn) -> (CPUState, CPUOut). I promised some ideas for improving that code.
To start off gently, first of all we can give names to some common internal operations to help readability. Here's the code from the previous post rewritten with a small kit of functions:
intensionalCPU (s0, CPUIn{..}) = case phase s of
    WaitKeyPress reg ->
        let s' = case cpuInKeyEvent of
                Just (True, key) -> goto Fetch1 . setReg reg key $ s
                _ -> s
            out = def{ cpuOutMemAddr = pc s' }
        in (s', out)
    Fetch1 ->
        let s' = goto Exec s{ opHi = cpuInMem, pc = succ $ pc s }
            out = def{ cpuOutMemAddr = pc s' }
        in (s', out)
  where
    s | cpuInVBlank = s0{ timer = fromMaybe 0 . predIdx $ timer s0 }
      | otherwise = s0
To avoid the possibility of screwing up the threading of the internal state, we can use the State CPUState monad:
stateCPU :: CPUIn -> State CPUState CPUOut
stateCPU CPUIn{..} = do
    when cpuInVBlank $
        modify $ \s -> s{ timer = fromMaybe 0 . predIdx $ timer s }
    gets phase >>= \case
        WaitKeyPress reg -> do
            forM_ cpuInKeyEvent $ \(pressed, key) -> when pressed $ do
                setReg reg key
                goto Fetch1
            pc <- gets pc
            return def{ cpuOutMemAddr = pc }
        Fetch1 -> do
            modify $ \s -> s{ opHi = cpuInMem, pc = succ $ pc s }
            goto Exec
            pc <- gets pc
            return def{ cpuOutMemAddr = pc }
At this point, the state-accessing actions either come from the kit, because they are useful in many parts of our CPU description (like setReg), or they read or change individual fields of the CPUState record. This second group can benefit from using lenses:
stateCPU CPUIn{..} = do
    when cpuInVBlank $ timer %= fromMaybe 0 . predIdx
    use phase >>= \case
        WaitKeyPress reg -> do
            forM_ cpuInKeyEvent $ \(pressed, key) -> when pressed $ do
                setReg reg key
                goto Fetch1
            pc <- use pc
            return def{ cpuOutMemAddr = pc }
        Fetch1 -> do
            opHi .= cpuInMem
            pc %= succ
            goto Exec
            pc <- use pc
            return def{ cpuOutMemAddr = pc }
While this code is much more readable than the original one, and allows describing the internal state changes piecewise, it still has a problem with composability. To illustrate this, let's add some more functionality by implementing some CPU instructions (simplified a bit from a real CHIP-8, which can only write to memory through a save-registers instruction, and even that is addressed by a dedicated pointer register; and accessing the framebuffer is also done via multiple-byte blitting only).
stateCPU CPUIn{..} = do
    when cpuInVBlank $ timer %= fromMaybe 0 . predIdx
    use phase >>= \case
        WaitKeyPress reg -> do
            forM_ cpuInKeyEvent $ \(pressed, key) -> when pressed $ do
                setReg reg key
                goto Fetch1
            pc <- use pc
            return def{ cpuOutMemAddr = pc }
        Idle -> do
            goto Fetch1
            pc <- use pc
            return def{ cpuOutMemAddr = pc }
        Fetch1 -> do
            opHi .= cpuInMem
            pc %= succ
            goto Exec
            pc <- use pc
            return def{ cpuOutMemAddr = pc }
        Exec -> do
            opLo .= cpuInMem
            pc %= succ
            goto Fetch1
            decode >>= \case
                WaitKey reg -> do
                    goto $ WaitKeyPress reg
                    pc <- use pc
                    return def{ cpuOutMemAddr = pc }
                LoadImm reg imm -> do
                    setReg reg imm
                    pc <- use pc
                    return def{ cpuOutMemAddr = pc }
                WriteMem addr reg -> do
                    val <- getReg reg
                    goto Idle -- flush write to RAM before starting next fetch
                    return def{ cpuOutMemAddr = addr, cpuOutMemWrite = Just val }
                SetPixel regX regY -> do
                    x <- getReg regX
                    y <- getReg regY
                    pc <- use pc
                    return def{ cpuOutMemAddr = pc, cpuOutFBAddr = (x, y), cpuOutFBWrite = Just True }
                -- And a lot more opcodes here
Since there are going to be a lot of opcodes to support, it could seem like a good idea to refactor the Exec state to be a separate function, leaving only the implementation of the other control phases in the main function:
stateCPU cpuIn@CPUIn{..} = do
    when cpuInVBlank $ timer %= fromMaybe 0 . predIdx
    use phase >>= \case
        WaitKeyPress reg -> do
            forM_ cpuInKeyEvent $ \(pressed, key) -> when pressed $ do
                setReg reg key
                goto Fetch1
            pc <- use pc
            return def{ cpuOutMemAddr = pc }
        Idle -> do
            goto Fetch1
            pc <- use pc
            return def{ cpuOutMemAddr = pc }
        Fetch1 -> do
            opHi .= cpuInMem
            pc %= succ
            goto Exec
            pc <- use pc
            return def{ cpuOutMemAddr = pc }
        Exec -> do
            opLo .= cpuInMem
            pc %= succ
            goto Fetch1
            decode >>= exec cpuIn

exec CPUIn{..} (WaitKey reg) = ... -- same as previous version
exec CPUIn{..} (LoadImm reg imm) = ... -- same as previous version
exec CPUIn{..} (WriteMem addr reg) = ...
exec CPUIn{..} (SetPixel regX regY) = ...
However, this becomes problematic if we want to set some parts of the CPUOut result in a generic way, outside exec. Suppose we implement sound support, which in CHIP-8 is done via another 60Hz timer:
stateCPU cpuIn@CPUIn{..} = do
    when cpuInVBlank $ do
        timer %= fromMaybe 0 . predIdx
        soundTimer %= fromMaybe 0 . predIdx
    soundOn <- uses soundTimer (/= 0)
    def <- return def{ cpuOutSound = soundOn }
This takes care of setting cpuOutSound in the branches for WaitKeyPress, Idle and Fetch1 (albeit in an ugly way, by shadowing def), but now this needs to be passed as an extra parameter to exec (or the code to set cpuOutSound needs to be duplicated).
So what we want is a way to specify the resulting CPUOut such that
This has led me to a design where the CPU is described by a type CPU i s o = RWS i (Endo o) s:
runCPU :: (s -> o) -> CPU i s o () -> (i -> State s o)
runCPU mkDef cpu inp = do
    s <- get
    let (s', f) = execRWS (unCPU cpu) inp s
    put s'
    def <- gets mkDef
    return $ appEndo f def
With this setup, we can rewrite our CPU implementation as:
cpu = do
    CPUIn{..} <- input
    when cpuInVBlank $ do
        timer %= fromMaybe 0 . predIdx
        soundTimer %= fromMaybe 0 . predIdx
    soundOn <- uses soundTimer (/= 0)
    when soundOn $ do
        tell $ \out -> out{ cpuOutSound = True }
    use phase >>= \case
        WaitKeyPress reg -> do
            forM_ cpuInKeyEvent $ \(pressed, key) -> when pressed $ do
                setReg reg key
                goto Fetch1
        Idle -> do
            goto Fetch1
        Fetch1 -> do
            opHi .= cpuInMem
            pc %= succ
            goto Exec
        Exec -> do
            opLo .= cpuInMem
            pc %= succ
            goto Fetch1
            decode >>= exec

exec (WaitKey reg) = do
    goto $ WaitKeyPress reg
exec (LoadImm reg imm) = do
    setReg reg imm
exec (WriteMem addr reg) = do
    val <- getReg reg
    writeMem addr val
    goto Idle -- flush write to RAM before starting next fetch
exec (SetPixel regX regY) = do
    x <- getReg regX
    y <- getReg regY
    writeFB (x, y) True
Most branches don't need to do anything to get the correct final CPUOut; and just like we were able to name the State kit of setReg &c, we can define
writeMem addr val = tell $ \out -> out{ cpuOutMemAddr = addr, cpuOutMemWrite = Just val }
writeFB addr val = tell $ \out -> out{ cpuOutFBAddr = addr, cpuOutFBWrite = Just val }
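The Endo-based output mechanism can be demonstrated in isolation. Below is a simplified, self-contained sketch of my own (Out, writeMemE and step are made-up stand-ins, using plain RWS rather than the CPU newtype): the writer accumulates output-transforming functions, which are applied at the end of the cycle to a default output computed from the final state.

```haskell
import Control.Monad.RWS
import Data.Monoid (Endo(..))

-- Made-up miniature output record (stand-in for CPUOut)
data Out = Out { outAddr :: Int, outWrite :: Maybe Int } deriving (Show, Eq)

-- Reader is unused here; Writer collects Out-transformers; State is the PC
type M = RWS () (Endo Out) Int

-- Like the real writeMem: record an output override instead of returning it
writeMemE :: Int -> Int -> M ()
writeMemE addr val = tell $ Endo $ \out -> out{ outAddr = addr, outWrite = Just val }

-- Run one cycle: apply the collected overrides to a default output
step :: M () -> Int -> (Out, Int)
step act s =
    let (_, s', f) = runRWS act () s
        defOut = Out{ outAddr = s', outWrite = Nothing }  -- default: fetch at PC
    in (appEndo f defOut, s')
```

A branch that tells nothing gets the default output for free, which is exactly why most branches of the real cpu above can stay silent about CPUOut.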
The system described above is implemented here for the reusable library definitions and here for CHIP-8 specifically; the version that uses lenses is in the branch linked above, since I am not 100% convinced yet that the cognitive overhead of the lens library is worth it once enough kit is developed around state access.
There are two questions about the above that I feel haven't been resolved yet:
All in all, even though I couldn't get as far as I originally hoped, I am glad I at least managed to build a CHIP-8 computer in CλaSH that can boot up on real hardware and play original CHIP-8 games; that's pretty much what I originally set out to do. The RetroChallenge format of detailed blog posts concurrent with the actual development work being done is quite taxing on me, because I am slow to write these posts; overall, I have probably spent about half of my RetroChallenge time on coding/design and half on writing about it. But the fixed one-month timeframe and the great progress reports by the other participants have been great motivators; I would probably try something next time where I can expect fewer unknown unknowns.
There is still work to be done even on this CHIP-8 computer: I've found some games that don't work as expected when they try to use the built-in hexadecimal digit font; and there are of course the previously discussed performance problems inherent in addressing the frame buffer as single bits. Also, I would like to adapt it to use a real four-by-four keypad (by scanning the key rows from the FPGA) and a PCD8544 monochrome matrix LCD.
And then, of course, further exploration of the design space of CPU descriptions discussed above; and then moving on to a "real" CPU, probably by rewriting my Lava 6502 core. Also, I will need to get a new FPGA dev board with a chip that is supported by a contemporary toolchain, so that I won't get bogged down by problems like the Spartan-6 synthesis bug.
My entry for RetroChallenge 2018/09 is building a CHIP-8 computer. Previously, I've talked in detail about the video signal generator and the keyboard interface; the only part still missing is the CPU.
Since the CHIP-8 was originally designed to be an interpreted language run as a virtual machine, some of its instructions are quite high-level. For example, the framebuffer is modified via a dedicated blitting instruction; there is a built-in random number generator; and instructions to manipulate two 60 Hz timers. Other instructions are more in line with what one would expect to see in a CPU, and implement basic arithmetic such as addition or bitwise AND. There is also a generic escape hatch instruction but that doesn't really apply to hardware implementations.
The CPU has 16 general-purpose 8-bit registers V0…VF; register VF is also used to report flag results like overflow from arithmetic operations, or collision during blitting. Most instructions operate on these general registers. Since the available memory is roughly 4K, these 8-bit registers wouldn't be too useful as pointers. Instead, there is a 12-bit Index register that is used as the implicit address argument to memory-accessing instructions.
For flow control, the program counter needs 12 bits as well; the CHIP-8 is a von Neumann machine. Furthermore, it has CALL / RET instructions backed by a call-only stack (there is no argument passing or local variables).
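The CALL/RET mechanics can be sketched in plain Haskell (an illustration of mine, using Int and lists instead of the fixed-size Vec and Index types that the hardware version needs):

```haskell
type Addr' = Int  -- stand-in for Unsigned 12

-- CALL pushes the return address and jumps to the target
call :: Addr' -> (Addr', [Addr']) -> (Addr', [Addr'])
call target (pc, stack) = (target, pc : stack)

-- RET pops the return address and jumps back to it
ret :: (Addr', [Addr']) -> (Addr', [Addr'])
ret (_, r : stack) = (r, stack)
ret (pc, [])       = (pc, [])  -- underflow: real hardware would just misbehave
```

Since there is no argument passing and no local variables, the stack really only ever holds return addresses, which is why a shallow fixed-depth stack suffices.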
We can collect all of the registers described above into a single Haskell datatype. I have also added two 8-bit registers for the high and low byte of the current instruction, but in retrospect it would be enough to just store the high byte, since the low byte is coming from RAM exactly when we need to dispatch on it anyway. The extra phase register is to distinguish between execution phases such as fetching the first byte of the next instruction, or for instructions that are implemented in multiple clock cycles, like clearing the frame buffer (more on that below).
type Addr = Unsigned 12
type Reg = Index 16

data CPUState = CPUState
    { opHi, opLo :: Word8
    , pc, ptr :: Addr
    , registers :: Vec 16 Word8
    , stack :: Vec 24 Addr
    , sp :: Index 24
    , phase :: Phase
    , timer :: Word8
    , randomState :: Unsigned 9
    }
I implemented the random number generator as a 9-bit linear-feedback shift register, truncated to its lower 8 bits; this is because a maximal 8-bit LFSR wouldn't generate 0xFF.
lfsr :: Unsigned 9 -> Unsigned 9
lfsr s = (s `rotateR` 1) `xor` b4
  where
    b = fromIntegral $ complement . lsb $ s
    b4 = b `shiftL` 4
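Since the update only needs Data.Bits operations, it can be checked in plain Haskell without Clash. Below is a stand-in of mine (lfsr9 is my name for it, working on 9-bit values stored in a Word16): iterating from 0 cycles through all 511 states other than the all-ones fixed point, so the truncated low byte really does take every value including 0xFF.

```haskell
import Data.Bits
import Data.Word (Word16)
import qualified Data.Set as Set

-- Plain-Haskell stand-in for the Clash lfsr above, on 9-bit values in a Word16
lfsr9 :: Word16 -> Word16
lfsr9 s = rot `xor` b4
  where
    rot = ((s `shiftR` 1) .|. ((s .&. 1) `shiftL` 8)) .&. 0x1FF  -- 9-bit rotateR 1
    b   = (s .&. 1) `xor` 1   -- complemented lsb
    b4  = b `shiftL` 4

-- The orbit starting from 0: 511 distinct states before returning to 0
orbit :: [Word16]
orbit = take 511 (iterate lfsr9 0)
```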
Similar to how a real chip has various pins to interface with other parts, our CPU description will also have multiple inputs and outputs. The input consists of the data lines read from main memory and the framebuffer; the events coming from the keypad, and the keypad state; and the 60 Hz VBlank signal from the video generator. This latter signal is used to implement the timer register's countdown. The keypad's signals are fed into the CPU both as events and statefully; I've decided to do it this way so that only the peripheral interface needs to be changed to accommodate devices that are naturally either parallel (like a keypad matrix scanner) or serial (like a computer keyboard on a PS/2 connector).
type Key = Index 16
type KeypadState = Vec 16 Bool

data CPUIn = CPUIn
    { cpuInMem :: Word8
    , cpuInFB :: Bit
    , cpuInKeys :: KeypadState
    , cpuInKeyEvent :: Maybe (Bool, Key)
    , cpuInVBlank :: Bool
    }
The output is even less surprising: there's an address line and a data out (write) line for main memory and the video framebuffer.
type VidX = Unsigned 6
type VidY = Unsigned 5

data CPUOut = CPUOut
    { cpuOutMemAddr :: Addr
    , cpuOutMemWrite :: Maybe Word8
    , cpuOutFBAddr :: (VidX, VidY)
    , cpuOutFBWrite :: Maybe Bit
    }
As far as CλaSH is concerned, the CPU is extensionally a circuit converting input signals to output signals, just like any other component:
extensionalCPU :: Signal dom CPUIn -> Signal dom CPUOut
The internal CPU state is of no concern at this level. Internally, we can implement the above as a Mealy machine with a state transition function that describes behaviour in any given single cycle:
intensionalCPU :: (CPUState, CPUIn) -> (CPUState, CPUOut)

extensionalCPU = mealy intensionalCPU initialState
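Mealy machines have a simple list-based analogue that can help build intuition for what mealy does (a sketch of mine; Clash's real mealy works on Signals, and its transfer function is curried rather than taking a pair):

```haskell
-- List-based analogue of a Mealy machine: thread the state through the
-- inputs, producing one output per input
mealyL :: (s -> i -> (s, o)) -> s -> [i] -> [o]
mealyL f = go
  where
    go _ []     = []
    go s (i:is) = let (s', o) = f s i in o : go s' is
```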
As far as a circuit is concerned, a clock cycle is a clock cycle is a clock cycle. If we want to do any kind of sequencing, for example to fetch two-byte instruction opcodes from the byte-indexed main memory in two steps, we need to know in intensionalCPU which step is next. This is why we have the phase field in CPUState, so we can read out what we need to do, and store what we want to do next. For example, in my current version the video framebuffer is bit-indexed (addressed by the 6-bit X and the 5-bit Y coordinate), and there is no DMA to take care of bulk writes; so to implement the instruction that clears the screen, we need to write low to all framebuffer addresses, one by one, from (0, 0) to (63, 31). This requires 2048 cycles, so we need to go through the Phase that clears (0, 0), to the one that clears (0, 1), all the way to (63, 31), before fetching the first byte of the next opcode to continue execution. Accordingly, one of the constructors of Phase stores the (x, y) coordinate of the next bit to clear, and we'll need to add some logic so that if phase = ClearFB (x, y), we emit (x, y) on the cpuOutFBAddr line and Just low on the cpuOutFBWrite line. Blitting proceeds similarly, with two sub-phases per phase: one to read the old value, and one to write back the new value (with the bitmap image xor'd to it).
data Phase
    = Init
    | Fetch1
    | Exec
    | StoreReg Reg
    | LoadReg Reg
    | ClearFB (VidX, VidY)
    | Draw DrawPhase (VidX, VidY) Nybble (Index 8)
    | WaitKeyPress Reg
    | WriteBCD Word8 (Index 3)
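The ClearFB sequencing amounts to enumerating coordinates one per cycle; as a plain-Haskell sketch of mine (with Ints standing in for VidX/VidY, counting the Y coordinate fastest):

```haskell
-- Advance the clear-screen phase: next coordinate to clear, or Nothing
-- when the whole 64x32 framebuffer has been covered
nextClear :: (Int, Int) -> Maybe (Int, Int)
nextClear (x, y)
    | y < 31    = Just (x, y + 1)
    | x < 63    = Just (x + 1, 0)
    | otherwise = Nothing
```

Starting from (0, 0) this visits exactly 2048 coordinates, matching the 2048 cycles the instruction takes; on Nothing, the CPU would transition back to Fetch1.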
So how should we write intensionalCPU? We could do it in direct style, i.e. something like
intensionalCPU (s0, CPUIn{..}) = case phase s of
    Fetch1 ->
        let s' = s{ opHi = cpuInMem, pc = succ $ pc s, phase = Exec }
            out = CPUOut{ cpuOutMemAddr = pc s', cpuOutMemWrite = Nothing
                        , cpuOutFBAddr = minBound, cpuOutFBWrite = Nothing }
        in (s', out)
    WaitKeyPress reg ->
        let s' = case cpuInKeyEvent of
                Just (True, key) -> s{ registers = replace reg key (registers s), phase = Fetch1 }
                _ -> s
            out = CPUOut{ cpuOutMemAddr = pc s', cpuOutMemWrite = Nothing
                        , cpuOutFBAddr = minBound, cpuOutFBWrite = Nothing }
        in (s', out)
    -- and lots of other cases as well, of course
  where
    s | cpuInVBlank = s0{ timer = fromMaybe 0 $ predIdx $ timer s0 }
      | otherwise = s0
If you think this is horrible and unreadable and unmaintainable, then yes! I agree! Which is why I've spent most of this RetroChallenge (when not fighting synthesizer crashes) thinking about nicer ways of writing this.
This post is getting long, let's end on this note here. Next time, I am going to explain how far I've gotten so far in this quest for nicely readable, composable descriptions of CPUs.
For most of this week, it seemed I would have to throw in the towel. As I mentioned in my previous entry last Saturday, I ran into what at first seemed like a CλaSH bug. However, further investigation showed that the error message was actually pointing at an internal bug in the Xilinx ISE synthesizer. The same generated VHDL didn't cause any problems when fed into the Yosys open source synthesizer, Altera Quartus, or the newer version of Xilinx Vivado. But the Papilio Pro I'm using is based on the Spartan 6 FPGA, which is not supported by the newer Xilinx tools, so I am stuck with ISE 14.7 from 2013. So the conclusion is, just like all other closed-source proprietary software from FPGA vendors, the Xilinx ISE is simply a piece of shit that falls over under its own weight on perfectly valid VHDL.
I was thinking of ordering a new FPGA board, but I only have until next Wednesday to finish this (I won't be able to put in any work on the last Retrochallenge weekend), so it's not even a given it would get here in time. Also, I'd like to do a bit more research on what board I should get -- on one hand, both Altera and Xilinx have nice, more modern dev boards with good IO ports for my retro-computing-oriented needs, but on the other hand, it feels a bit like throwing good money after bad, since these would still be programmed with proprietary shitty software, with no way forward when (not if!) they break.
Then there's Lattice's ICE40 line, which is fully supported by the open source toolchain IceStorm, but the largest ICE40 is still quite small compared to the Spartan 7 or the Cyclone V series; not to mention that even the nicest ICE40 board I could find doesn't have a USB connector on board, so you have to play around with an Arduino and plug jumper wires into this weird connector to get anything working. Also, while I'm ranting: of course the ICE40 open source toolchain is not from Lattice themselves; instead, its bitstream format had to be reverse-engineered by awesome free software hackers.
So anyway, I had a perfectly functioning board betrayed by its software toolchain. I tried some desperate ideas like generating Verilog instead of VHDL or getting rid of the unguarded block statements, but nothing made any difference. Then, Thursday night, I had an even wilder idea: if the Xilinx ISE is crashing because the generated VHDL triggers some weird corner case in the synthesizer, then maybe keeping the same ISE version but changing the target FPGA model would get over the hurdle? And that's when I remembered I still have my first ever FPGA board: the Papilio One, based on the Spartan 3E. Luckily, the Spartan 3 series is also supported by the ISE 14 series, so the same toolchain can serve both boards.
On Friday morning, I made the necessary changes to my code to target the Papilio One. The clock generator is different between the two models, so I needed to replace that; the other difference was that the Spartan 3 doesn't seem to have wide enough blocks for 64-bit arithmetic. This shouldn't be a problem for the CHIP-8, but CλaSH generates code that converts everything to 64 bits. I initially overcame that by post-processing CλaSH's output with sed, but then I discovered that there is a flag, -fclash-intwidth, to set that properly.
With these changes, I was able to get it through the Xilinx ISE's synthesizer, and all the way through the rest of the pipeline! As before, the code is on GitHub.
And with this, I am where I was supposed to be a week ago at half-time. I probably won't have time to work on this project next weekend since we'll be travelling; this looks like a good time to take inventory of the project.
Initially, I wanted to talk this week about how I plan to structure the CλaSH description of the CHIP-8 CPU. However, I'm postponing that for now, because I ran into what seems like a CλaSH bug, and I want to see my design run on real hardware before I describe it in too much detail. So instead, here's a post on how I am testing in software.
After stripping away all the nice abstractions that I am using in my description of the CPU, what remains is a Mealy machine, which simply means it is described by a state transition and output function s -> i -> (s, o). If that looks familiar, that is not a coincidence: this is, of course, just one argument flip away from the Kleisli category of the State s monad. Just think of it as being either this or that, depending on which one you have more intuition about. A lot more on this in my upcoming blogpost.
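To make that correspondence concrete, here is a minimal plain-Haskell sketch (outside CλaSH; the helper names `toState` and `fromState` are mine, not from the codebase) showing that the two formulations are interchangeable:

```haskell
import Control.Monad.State

-- A Mealy machine: state transition and output in one function.
type Mealy s i o = s -> i -> (s, o)

-- One argument flip (plus a tuple swap) turns it into a Kleisli
-- arrow of State s...
toState :: Mealy s i o -> (i -> State s o)
toState f i = state $ \s -> let (s', o) = f s i in (o, s')

-- ...and back.
fromState :: (i -> State s o) -> Mealy s i o
fromState f s i = let (o, s') = runState (f i) s in (s', o)

-- A toy machine: output the running sum of the inputs.
acc :: Mealy Int Int Int
acc s i = let s' = s + i in (s', s')
```

With this, `evalState (mapM (toState acc) [1,2,3]) 0` steps the toy machine over an input list and yields the running sums `[1,3,6]`.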
My CHIP-8 CPU is currently described by a Mealy machine over these types:
data CPUIn = CPUIn
    { cpuInMem :: Word8
    , cpuInFB :: Bit
    , cpuInKeys :: KeypadState
    , cpuInKeyEvent :: Maybe (Bool, Key)
    , cpuInVBlank :: Bool
    }

data Phase
    = Init
    | Fetch1
    | Exec
    | StoreReg Reg
    | LoadReg Reg
    | ClearFB (VidX, VidY)
    | Draw DrawPhase (VidX, VidY) Nybble (Index 8)
    | WaitKeyPress Reg

data CPUState = CPUState
    { opHi, opLo :: Word8
    , pc, ptr :: Addr
    , registers :: Vec 16 Word8
    , stack :: Vec 24 Addr
    , sp :: Index 24
    , phase :: Phase
    , timer :: Word8
    }

data CPUOut = CPUOut
    { cpuOutMemAddr :: Addr
    , cpuOutMemWrite :: Maybe Word8
    , cpuOutFBAddr :: (VidX, VidY)
    , cpuOutFBWrite :: Maybe Bit
    }

cpu :: CPUIn -> State CPUState CPUOut
Note that all the types involved are pure: signal inputs are turned into pure input by CλaSH's mealy function, and the pure output is similarly turned into a signal output. But what if we didn't use mealy, and ran cpu directly, completely sidestepping CλaSH, yet still running the exact same implementation?
That is exactly what I am doing for testing the CPU. By running its Mealy function directly, I can feed it a CPUIn and consume its CPUOut result while interacting with the real world, completely outside the CλaSH simulation! The main structure of the code that implements this looks as follows:
stateful :: (MonadIO m) => s -> (i -> State s o) -> IO (m i -> (o -> m a) -> m a)
stateful s0 step = do
    state <- newIORef s0
    return $ \mkInput applyOutput -> do
        inp <- mkInput
        out <- liftIO $ do
            s <- readIORef state
            let (out, s') = runState (step inp) s
            writeIORef state s'
            return out
        applyOutput out
I hooked up the main RAM and the framebuffer signals to IOArrays, and wrote some code that renders the framebuffer's contents into an SDL surface and translates keypress events. And, voilà: you can run the CHIP-8 computer interactively, even using good old trace-based debugging (which CλaSH thankfully removes during VHDL generation, so I can leave the trace calls in). The screencap below shows this in action: :main is run from clashi and starts the interactive SDL program, with no Signal types involved.
This week, most of my weekday evenings were quite busy, but I did manage to squeeze in a PS/2 keyboard interface in small installments; then today I went down the rabbit hole of clearing up some technical debt I've accumulated so far by not really looking into how CλaSH handled clock domains.
(Just to preempt any possible confusion: we're talking about the peripheral port of the IBM Personal System/2, introduced in the late '80s, not the PlayStation 2 console.)
Just as VGA is ideal for hobbyist video signal generation, since it is both simple and ubiquitous, PS/2 is the go-to interface for keyboards. It is a bidirectional, synchronous serial protocol with a peripheral-generated clock in the 10-20 kHz range. Any PC keyboard old enough will support it. One important caveat, though, is that the common USB-to-PS/2 adapters don't actually convert signals, so they only work with keyboards that were originally designed with that conversion in mind. Here, we are only concerned with device-to-host communication; it is also possible to communicate in the other direction, e.g. to change the state of the Num Lock LED.
"Synchronous" here means that there is a separate clock line, unlike in the family of asynchronous serial protocols that were used in RS-232; it is this latter one that is usually meant as "serial communication" when unqualified. In synchronous serial communication, everything happens on the clock ticks; in asynchronous communication, there is no separate clock signal, so the data signal has enough structure that the two communicating sides can agree on the exact framing.
Turning the data line of PS/2 into a stream of bits is a straightforward process: the standard prescribes sampling the data line on the falling edge of the clock line. We also apply an 8-cycle debouncer for good measure, just because some pages on the Internet suggest it:
data PS2 dom = PS2
    { ps2Clk :: Signal dom Bit
    , ps2Data :: Signal dom Bit
    }

samplePS2
    :: (HiddenClockReset dom gated synchronous)
    => PS2 dom -> Signal dom (Maybe Bit)
samplePS2 PS2{..} = enable <$> isFalling low ps2Clk' <*> ps2Data'
  where
    ps2Clk' = debounce d3 low ps2Clk
    ps2Data' = debounce d3 low ps2Data
The second step in that pipeline is to shift in the bits, 11 at a time. A leading low bit signals the start of a packet; eight data bits and one parity bit follow; the packet is finished with one high bit. Of course, only the eight data bits are presented externally. I use a WriterT (Last Word8) (State PS2State) monad to implement this logic, and then turn that into a CλaSH Mealy machine, in a pattern that I plan to use a lot in implementing the CHIP-8 CPU later:
data PS2State
    = Idle
    | Bit Word8 (Index 8)
    | Parity Word8
    | Stop (Maybe Word8)

decodePS2
    :: (HiddenClockReset dom gated synchronous)
    => Signal dom (Maybe Bit) -> Signal dom (Maybe Word8)
decodePS2 = flip mealyState Idle $ \bit -> fmap getLast . execWriterT . forM_ bit $ \bit -> do
    state <- get
    case state of
        Idle -> do
            when (bit == low) $ put $ Bit 0 0
        Bit x i -> do
            let x' = shiftInLeft bit x
            put $ maybe (Parity x') (Bit x') $ succIdx i
        Parity x -> do
            let checked = bit /= parity x
            put $ Stop $ enable checked x
        Stop x -> do
            when (bit == high) $ tell $ Last x
            put Idle
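The `mealyState` helper used here is not part of CλaSH proper; it adapts a State-monadic step function into a Mealy machine. Stripped of `Signal` types, the idea can be sketched over plain lists (this list-based version is for illustration only; the real one wraps CλaSH's `mealy`):

```haskell
import Control.Monad.State

-- List-based stand-in for mealyState: run a State-monadic step
-- function over a stream of inputs, threading the state along.
mealyState :: (i -> State s o) -> s -> [i] -> [o]
mealyState step = go
  where
    go _ []     = []
    go s (i:is) = let (o, s') = runState (step i) s in o : go s' is
```

For example, `mealyState (\i -> modify (+ i) >> get) 0 [1,2,3]` produces the running sums `[1,3,6]`.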
To be able to try out on real hardware what I had at this point, I had to leave the trusty LogicStart Mega-Wing of my Papilio Pro, and instead switch over to the Arcade since that one has a PS/2 port. There are actually two ports on it, so that one could connect e.g. a keyboard and a mouse.
This change involved rewriting my UCF file since the pinout is different from the LogicStart. Also, the Arcade has 4+4+4 bits of VGA color output instead of the previous 3+3+2; of course with the black & white graphics of the CHIP-8, that color depth is all going to waste with this project.
Unfortunately, it is not enough to shift in the PS/2 data into a byte: we also have to make sense of that byte. While this could be as straightforward as interpreting each byte as the ASCII code of the character on the key pressed, the reality is not this simple. Keyboards emit so-called scan codes, where one or several bytes can encode a single keypress or key release event (see here for example for a list of some keyboard scan codes). I haven't been able to come up with an elegant way of handling this yet, so for now I just have some messy Mealy machine that returns a 16-bit code, where the high byte is zero for one-byte codes. You can see in the comment my frustration at both the implementation and the spec itself:
data KeyEvent = KeyPress | KeyRelease
    deriving (Generic, NFData, Eq, Show)

data ScanCode = ScanCode KeyEvent Word16
    deriving (Generic, NFData, Eq, Show)

data ScanState
    = Init
    | Extended Word8
    | Code KeyEvent Word8

-- TODO: rewrite this for clarity.
-- All it does is it parses 0xE0 0xXX into an extended (16-bit) code, and everything else into
-- an 8-bit code. The only complication is that the key release marker 0xF0 is always the
-- second-to-last byte. Who does that?!?
parseScanCode
    :: (HiddenClockReset dom gated synchronous)
    => Signal dom (Maybe Word8) -> Signal dom (Maybe ScanCode)
parseScanCode = flip mealyState Init $ \raw -> fmap getLast . execWriterT . forM_ raw $ \raw -> do
    let finish ev ext = do
            tell $ Last . Just $ ScanCode ev $ fromBytes (ext, raw)
            put Init
    state <- get
    case state of
        Init
            | raw == 0xe0 -> put $ Extended raw
            | raw == 0xf0 -> put $ Code KeyRelease 0x00
            | otherwise -> finish KeyPress 0x00
        Extended ext
            | raw == 0xf0 -> put $ Code KeyRelease ext
            | otherwise -> finish KeyPress ext
        Code ev ext -> finish ev ext
  where
    fromBytes :: (Word8, Word8) -> Word16
    fromBytes = unpack . pack
With the video output from last time and the keyboard from this post, but no CPU yet, our options to put everything together into something impressive are somewhat limited. I ended up showing a single CHIP-8 pixel that can be moved around in the CHIP-8 screen space with the arrow keys; this results in something tangible without needing a CPU or even a framebuffer yet. Note how well the code lends itself to using applicative do syntax:
VGADriver{..} = vgaDriver vga640x480at60
ps2 = decodePS2 $ samplePS2 PS2{..}

(dx, dy) = unbundle $ do
    key <- parseScanCode ps2
    pure $ case key of
        Just (ScanCode KeyPress 0xe075) -> (0, -1) -- up
        Just (ScanCode KeyPress 0xe072) -> (0, 1)  -- down
        Just (ScanCode KeyPress 0xe06b) -> (-1, 0) -- left
        Just (ScanCode KeyPress 0xe074) -> (1, 0)  -- right
        _ -> (0, 0)

pixel = do
    x <- fix $ register 0 . (+ dx)
    y <- fix $ register 0 . (+ dy)
    x0 <- (chipX =<<) <$> vgaX
    y0 <- (chipY =<<) <$> vgaY
    pure $ case (,) <$> x0 <*> y0 of
        Just (x0, y0) -> (x0, y0) == (x, y)
        _ -> False
In reality, after getting the PS/2 decoder working, but before hooking it up to the scan code parser, I thought I'd use the serial IO on the Papilio Pro for a quick test, by transmitting the scan codes straight away as they come out of the decoder. This then prompted me to clean up a wart in my UART implementation: it took the clock rate as an extra term-level argument to compute the clock division it needs to do:
tx
    :: (HiddenClockReset domain gated synchronous)
    => Word32 -> Word32 -> Signal domain (Maybe Word8) -> TXOut domain
tx clkRate serialRate inp = TXOut{..}
  where
    (txReady, txOut) = unbundle $ mealyState (tx0 $ clkRate `div` serialRate) (0, Nothing) inp
This bothered me because the clock domain already specifies the clock rate, at the type level. Trying to remove this redundancy has led me down a rabbit hole of what I believe is a CλaSH bug; but at least I managed to work around that for now (until I find an even better way).
This, in turn, forced me to use a proper clock domain with the correct clock period setting in my CHIP-8 design:
-- | 25.175 MHz clock, needed for the VGA mode we use.
-- CLaSH requires the clock period to be specified in picoseconds.
type Dom25 = Dom "CLK_25MHZ" (1000000000000 `Div` 25175000)
But then, this allowed me to start putting pixel clock specifications into the type of VGATimings, allowing me to statically enforce that the clock domain in which the VGA signal generator runs is at exactly the right frequency:
vgaDriver
    :: (HiddenClockReset dom gated synchronous, KnownNat w, KnownNat h)
    => (dom ~ Dom s ps, ps ~ (1000000000000 `Div` rate))
    => VGATimings rate w h
    -> VGADriver dom w h

-- | VGA 640*480@60Hz, 25.175 MHz pixel clock
vga640x480at60 :: VGATimings 25175000 10 10
vga640x480at60 = VGATimings
    { vgaHorizTiming = VGATiming 640 16 96 48
    , vgaVertTiming  = VGATiming 480 11 2 31
    }
I decided to start with the video signal generation so that I have something fun to look at, and also so that I can connect other components to it at later steps.
VGA is a very simple video signal format, with separate digital lines for vertical and horizontal synchronization, and three analog lines for the red, green and blue color channels. It is much simpler than TV signals like PAL or NTSC, which were designed to be backwards-compatible with early black-and-white TV formats and which only support a single row count. It also has some quite low-bandwidth standard modes (with pixel clocks at just tens of MHz), and the whole world is filled with displays that support VGA. Put all that together, and it is ideal for hobbyist computer projects.
The basic mode of operation is that you do a bit of a song and dance on the vertical and horizontal sync lines, and then just keep a counter of where you are, so you know what color to put on the red/green/blue lines. The clock speed to use for this counter is intimately tied to the sync pattern, and this is where VGA timing databases come into play.
I'm going to skip the tale of the electron beam scanning the CRT in þe olde times because every page describing VGA has it; for our purposes here it is enough to just regard it as an abstract serial protocol.
The CHIP-8 has 1-bit graphics with a 64⨯32 resolution. This is not a typo, and there is no unit or scaling factor missing: it really is only 64 (square) pixels across and 32 pixels down. That is not a lot of pixels; to give you an idea, here is a full-screen dump rendered with no scaling:
To make it full-screen, we need to scale it up. The easiest way I could think of was to scale it by a power of 2; that way, we can convert the screen-space X/Y coordinates to computer-space simply by dropping the lowest bits. From 64⨯32, we could scale, for example, by 8 to get 512⨯256, or by 16 for 1024⨯512. Of course, given the awkward 2:1 aspect ratio of the CHIP-8, we can't hope to avoid borders altogether; if we look at some lists of VGA modes, two look promising: the larger one would fit on 1024⨯768 with no vertical bordering, and the other one is a close contender with a small border in 640⨯480, requiring only a 25 MHz pixel clock.
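Dropping the lowest bits is just a right shift after a bounds check. Here is a hedged standalone sketch of the 640⨯480-to-64⨯32 mapping at 8⨯ scale, with the picture centered (the real `chipX`/`chipY` used later in the project operate on CλaSH's sized types, so the details may differ):

```haskell
import Data.Bits (shiftR)

-- Map a 640x480 screen coordinate to a 64x32 CHIP-8 coordinate at
-- 8x scaling, centering the picture: 64-pixel borders on the left
-- and right, 112-pixel borders on the top and bottom.
chipX :: Int -> Maybe Int
chipX x
    | x >= 64 && x < 64 + 512 = Just $ (x - 64) `shiftR` 3
    | otherwise = Nothing  -- horizontal border

chipY :: Int -> Maybe Int
chipY y
    | y >= 112 && y < 112 + 256 = Just $ (y - 112) `shiftR` 3
    | otherwise = Nothing  -- vertical border
```

Because the scale is a power of 2, no division or multiplication circuitry is needed: the conversion is pure wiring plus a comparator.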
A low pixel clock frequency is useful for this project because I'm still learning the ropes with CλaSH, so getting just something to work is an achievement on its own; getting something to work efficiently, at a high clock rate, or using two separate clock domains for the video generator and the CPU would both be more advanced topics for a later version. So for this project, I'm going to go with 640⨯480: divided by 8, this gives us a screen layout that looks like this:
So out of all the 640⨯480 modes, we're going to use 640⨯480@60Hz for this, since the CHIP-8 also has a 60 Hz timer which we'll get for free this way. Let's write down the timing parameters; we'll need a 10-bit counter for both X and Y since the porches push the vertical size to 524.
data VGATiming n = VGATiming
    { visibleSize, pre, syncPulse, post :: Unsigned n
    }

data VGATimings w h = VGATimings
    { vgaHorizTiming :: VGATiming w
    , vgaVertTiming :: VGATiming h
    }

-- | VGA 640*480@60Hz, 25.175 MHz pixel clock
vga640x480at60 :: VGATimings 10 10
vga640x480at60 = VGATimings
    { vgaHorizTiming = VGATiming 640 16 96 48
    , vgaVertTiming  = VGATiming 480 11 2 31
    }
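As a quick sanity check on these numbers: the totals work out to an 800-clock line and a 524-line frame, which at 60 Hz lands within a tenth of a percent of the nominal pixel clock:

```haskell
-- Totals including porches and sync pulses:
hTotal, vTotal, pixelClock :: Int
hTotal = 640 + 16 + 96 + 48  -- 800 pixel clocks per line
vTotal = 480 + 11 + 2 + 31   -- 524 lines per frame

-- 800 * 524 * 60 = 25,152,000 Hz, very close to the nominal
-- 25.175 MHz pixel clock of this mode.
pixelClock = hTotal * vTotal * 60
```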
The output of the signal generator is the vertical and horizontal sync lines and the X/Y coordinates of the pixel being drawn; the idea being that there would be some other circuit that is responsible for ensuring the correct color data is put out for that coordinate. I've also bundled two extra lines to signal the start of a line/frame: the frame one will be used for the 60Hz timer, and the line one can be useful for implementing other machines later on: for example, on the Commodore 64, the video chip can be configured to interrupt the processor at some given raster line.
data VGADriver dom w h = VGADriver
    { vgaVSync :: Signal dom Bit
    , vgaHSync :: Signal dom Bit
    , vgaStartFrame :: Signal dom Bool
    , vgaStartLine :: Signal dom Bool
    , vgaX :: Signal dom (Maybe (Unsigned w))
    , vgaY :: Signal dom (Maybe (Unsigned h))
    }
For the actual driver, a horizontal counter simply counts up to the total horizontal size, and a vertical counter is incremented every time the horizontal one is reset. Everything else can be easily derived from these two counters with just pure functions:
vgaDriver
    :: (HiddenClockReset dom gated synchronous, KnownNat w, KnownNat h)
    => VGATimings w h
    -> VGADriver dom w h
vgaDriver VGATimings{..} = VGADriver{..}
  where
    vgaVSync = activeLow $ pure vSyncStart .<=. vCount .&&. vCount .<. pure vSyncEnd
    vgaHSync = activeLow $ pure hSyncStart .<=. hCount .&&. hCount .<. pure hSyncEnd
    vgaStartLine = hCount .==. pure hSyncStart
    vgaStartFrame = vgaStartLine .&&. vCount .==. pure vSyncStart
    vgaX = enable <$> (hCount .<. pure hSize) <*> hCount
    vgaY = enable <$> (vCount .<. pure vSize) <*> vCount

    endLine = hCount .==. pure hMax
    endFrame = vCount .==. pure vMax
    hCount = register 0 $ mux endLine 0 (hCount + 1)
    vCount = regEn 0 endLine $ mux endFrame 0 (vCount + 1)

    VGATiming hSize hPre hSync hPost = vgaHorizTiming
    hSyncStart = hSize + hPre
    hSyncEnd = hSyncStart + hSync
    hMax = sum [hSize, hPre, hSync, hPost] - 1

    VGATiming vSize vPre vSync vPost = vgaVertTiming
    vSyncStart = vSize + vPre
    vSyncEnd = vSyncStart + vSync
    vMax = sum [vSize, vPre, vSync, vPost] - 1
I hooked it up to a small test circuit, connected it to a small CCTV display I had lying around before I unpacked the bigger old VGA screen, and... nothing. This was frustrating, because in the CλaSH simulator the sync timings looked right. Then Joe had the simple but brilliant idea to just blink an LED at 1 Hz using the pixel clock and see if that is correct -- this immediately uncovered that I was using the wrong clock manager settings: instead of a pixel clock of 25.175 MHz, it was running at 40 MHz. No wonder the signal made no sense to the poor screen... With that out of the way, I finally saw the test pattern:
And so, with a bit of bit truncation, I now have a checkerboard pattern displayed at the CHIP-8 resolution; I've even ended up bringing out the larger screen:
The full code in progress is on GitHub; in particular, the version that generates the checkerboard pattern is in this code.
The 2018 September run of the RetroChallenge starts in a couple of days, and I've been playing around with CλaSH for the last couple of weeks, so I thought I'd aim for the stars and try to make a CHIP-8 computer.
CλaSH is, basically, an alternative GHC backend that emits a hardware description from Haskell code. In the more established Lava family of HDLs, like Kansas Lava that I've used earlier, you write Haskell code which uses the Lava eDSL to describe your hardware, then you run said Haskell code and it emits VHDL or Verilog. With CλaSH, you write Haskell code which you then compile to VHDL/Verilog. This allows you to e.g. use algebraic data types with pattern matching, and it will compile to just the right bit width.
Here is an example of CλaSH code by yours truly: the receiver half of a UART.
There is no shortage of information about the CHIP-8 on the Internet; the Wikipedia page is a good starting point. But basically, it is a VM spec from the mid-'70s, originally meant to be run on home computers built around the 8-bit COSMAC CPU. There are tens, TENS! of games written for it.
CHIP-8 has 35 opcodes, a one-note beeper, 64x32 1-bit framebuffer graphics that is accessible only via an XOR bit blitting operation (somewhat misleadingly called "sprites"), and 16-key input in a 4x4 keypad layout.
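That XOR blitting rule is worth spelling out: drawing a sprite flips framebuffer pixels rather than setting them, and any pixel flipped from on to off is reported back as a collision. A one-row sketch (my own illustration, not the FPGA implementation):

```haskell
import Data.Bits (xor, (.&.))
import Data.Word (Word8)

-- XOR one 8-pixel sprite row onto a framebuffer row. The collision
-- flag is set iff the two rows have any lit pixel in common, i.e.
-- some pixel gets turned off by the draw.
blitRow :: Word8 -> Word8 -> (Word8, Bool)
blitRow fb sprite = (fb `xor` sprite, fb .&. sprite /= 0)
```

Drawing the same sprite twice in the same place therefore erases it, which is how CHIP-8 games typically move objects around.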
I have some experience with CHIP-8 already: it was the first computer I've implemented on an FPGA (in Kansas Lava back then); I've used it as the test application when I was working on Rust for AVR; I even have a super-barebones 2048 game for the platform.
In summary, CHIP-8 is so simple that it is a very good starting point when getting acquainted with any new way of making computers, be it emulators, electronics boards or FPGAs.
I want to build an FPGA implementation of CHIP-8 on the Papilio Pro board I have lying around. It is based on the Xilinx Spartan-6 FPGA, a chip so outdated that the last version of Xilinx's development tools to support it is ISE 14.7 from 2013; but it is what I have, so it will have to do. I plan to work outside-in by first doing the IO bits: a VGA signal generator for the video output, and a PS/2 keyboard controller for the input. Afterwards, I will move on to making a CPU that uses CHIP-8 as its machine language.
If I can finish all that with time remaining, there are some "stretch goals" as well: implementing sound output (the Papilio IO board has a 1-bit PWM DAC that I could use for that); interfacing with a real 4x4 keypad instead of a full PS/2 keyboard; and implementing some way of switching ROM images without rebuilding the FPGA bitfile.
In February this year, Dylan McKay wrote in his blog:
In the coming months the Rust compiler should support AVR support out-of-the-box!
I've been watching this from afar, planning to try it out when AVR support is mainlined into Rust; I thought it would be much easier to wait until the dust settles than trying to track the development versions of LLVM and Rust at the same time. However, it's now May, LLVM 4.0 has come out with the new AVR backend and the Rust compiler has been updated to use it; and my interest in Rust has also started to grow: in late March one of my beachside readings on Gili Meno was O'Reilly's Programming Rust, and a Rust meetup group started in Singapore. And so I started to formulate a plan.
Two years ago, Viki and I designed and prototyped a very minimal AVR-based handheld console that can run CHIP-8 programs. It's made up of an Adafruit Trinket (an AVR ATmega328P running at 12 MHz@3.3V, with a bit of circuitry to make it programmable via serial-over-USB), a serial SRAM chip, an LCD from an old Nokia phone, and a 4x4 keypad.
The 2015 software was, of course, written in C++. So my idea was to rewrite it in Rust, mainly aiming to use Rust's algebraic data types with pattern matching. Given the way one usually programs microcontrollers, I think pattern matching gives me the most bang for my buck for now, until we start designing and implementing funky Rust libraries that use the borrow checker to enforce all kinds of interesting non-memory-allocation invariants.
And so, while twiddling my thumbs waiting for AVR support to show up in the next Rust release, I got busy reimplementing the CHIP-8 engine in Rust, in two modules: chip8-engine is a no_std library implementing the CHIP-8 machine with all IO done over a trait, and chip8-sdl is the executable that links to the library and implements the frontend using SDL. This is the exact same architecture we used back in 2015 to develop the C++ version; it allowed us to debug the CHIP-8 implementation on an x86 computer, only having to worry about running on the device for the IO-specific parts.
Two weeks ago, I decided to take the plunge and build LLVM and the Rust compiler from the dev branch where AVR support is being worked on.
At this point, we have to note a big difference between the development process on C++ vs. Rust. With the C++ version, once the engine was running OK on the computer, we simply recompiled it using an AVR-targeting C++ compiler and linked it with code that communicates with the serial RAM chip, the LCD, and the keypad. On the other hand, at least for now, AVR support in LLVM and Rust is in its infancy, so even if the engine works on x86, there is absolutely no guarantee that it will do remotely the same on AVR as well.
And so the next step was to use simavr to create a simulator for our board. The simulator implements just enough of the PCD8544 protocol to be able to display the pixels of the LCD in an SDL window; implements just enough of the serial SRAM protocol to read/write single bytes; and implements the keypad in the obvious way. I debugged the simulator by running the C++ version of the firmware; then when I decided I trusted the simulator enough, it was time to go back to Rust and see what it takes to compile it to AVR.
First, I tried just compiling the Hello World of microcontrollers: blinking an LED.
Right out of the gate, this fails. It fails even before you get to try compiling your program. It fails because the Core library of Rust itself cannot be compiled on AVR, due to a compiler bug. So the zeroth thing you have to do, before you even do the first thing, is to trim down Rust's libcore until it no longer contains the parts that trigger the aforementioned bug, but still contains enough to be useful. For example, you need to include core::iter if you want to do any for loops. I've put my version of libcore-mini on GitHub; if you need anything that is not included from the real libcore, just try adding it and hope for the best.
BTW, if you use a custom libcore, there's a bunch of magic Rust incantations you need in your code:
#![feature(no_core)]
#![no_core]

extern crate libcore_mini as core;

// These are imported to get for-loops working
#[allow(unused_imports)] use core::option;
#[allow(unused_imports)] use core::iter;
#[allow(unused_imports)] use core::ops;
Also the module providing main() has to jump through a bunch of extra hoops:
pub mod std {
    #[lang = "eh_personality"]
    #[no_mangle]
    pub unsafe extern "C" fn rust_eh_personality(
        state: (), exception_object: *mut (), context: *mut ()) -> () {
    }

    #[lang = "panic_fmt"]
    #[unwind]
    pub extern fn rust_begin_panic(msg: (), file: &'static str, line: u32) -> ! {
        loop {}
    }
}

#[no_mangle]
pub extern fn main() {
    // Finally! Put your real main() here!
}
Note that rust_begin_panic just loops, since there's nothing better I could come up with for a microcontroller.
Once you have your own stripped-down libcore, you can start writing programs. Here's my first one, a blinker that uses a timer interrupt instead of busy-waiting:
use core::intrinsics::{volatile_load, volatile_store};

mod avr {
    pub const DDRB: *mut u8 = 0x24 as *mut u8;
    pub const PORTB: *mut u8 = 0x25 as *mut u8;
    pub const TCCR1B: *mut u8 = 0x81 as *mut u8;
    pub const TIMSK1: *mut u8 = 0x6f as *mut u8;
    pub const OCR1A: *mut u16 = 0x88 as *mut u16;
}
use avr::*;

const MASK: u8 = 0b_0010_0000;

#[no_mangle]
pub extern fn main() {
    unsafe {
        volatile_store(DDRB, volatile_load(DDRB) | MASK);

        // Configure timer 1 for CTC mode, with divider of 64
        volatile_store(TCCR1B, volatile_load(TCCR1B) | 0b_0000_1101);

        // Timer frequency
        volatile_store(OCR1A, 62500);

        // Enable CTC interrupt
        volatile_store(TIMSK1, volatile_load(TIMSK1) | 0b_0000_0010);

        // Good to go!
        asm!("SEI");
        loop {}
    }
}

#[no_mangle]
pub unsafe extern "avr-interrupt" fn __vector_11() {
    volatile_store(PORTB, volatile_load(PORTB) ^ MASK);
}
Again, at first, this failed due to another compiler bug, which I reported here and which got fixed in a couple of hours. Then, for the first weekend, I got into this pattern: compiling something; running it in the simulator; noticing that the MCU gets jammed due to invalid machine code, or that LLVM fails to turn IR into AVR assembly, and so on; reporting the issue at hand; then testing and confirming that the fix works.
As I dug myself deeper into both LLVM and the Rust compiler, after a while I started not just reporting bugs and reducing test cases, but fixing them as well. In fact, if you want to try Rust on AVR today, I very much recommend using LLVM from my fork since it has all my changes that haven't been upstreamed yet, but are all needed to compile my code.
And so now, two weeks later, I can reasonably claim that I know LLVM (something that I always wanted to learn but haven't gotten around to until now), I have a complete Rust implementation of CHIP-8 that runs on my real hardware board, and there are interesting problems to work on in Rustc and LLVM moving forward. Also, the Rust API to the actual AVR IO functionality that I've implemented is very rudimentary; I know of at least one project already that tries to design a nicer API, and we could probably also reuse some of the ideas from Marten's C++ AVR API to do bundled IO port updates.
I've been spending a couple weeks now on getting Időrégész to Android in the best possible way. Időrégész is the Hungarian text adventure game from the '80s that I reverse-engineered and then re-implemented in Inform. My original plan was to just implement a Glulx interpreter; however, initial experiments with a Clojure implementation weren't too promising on the performance front.
I decided to turn to compilation instead of interpretation, then: take the Inform-emitted Glulx image, and compile that directly to the JVM. Of course, that approach would have its own problems with self-modifying code, but my goal is just to get it working well enough that I can compile Időrégész, which is very vanilla as far as Inform programs go.
Most instructions of Glulx map to JVM instructions in a relatively straightforward manner; some unsigned integer operations are implemented by hand in Java and then called via invokestatic. The rather baroque string compression is currently handled at compile time, by just emitting a single Java function that looks like this:
public String resolveString(int ptr) {
    switch (ptr) {
        case 32851: return "Class";
        case 32857: return "Object";
        case 32863: return "Routine";
        // and so on...
    }
}
However, there is one aspect of Glulx that makes the compilation trickier than a straightforward mapping of procedures to procedures and calls to calls: some Glulx opcodes provide fairly detailed access to the runtime state. For example, save must be able to serialize enough of the state that it can later load that back and continue with the next instruction.
In an interpreter, this is a trivial matter, since things like the VM's memory and the stack are already reified data structures. However, when compiled to the JVM, the state of the Glulx program and the state of the JVM are one and the same; so how do you save it, from the inside?
The solution is to not use the JVM's stack as the Glulx stack; rather, function calls are compiled to areturn instructions that return a small Java object describing which function to call and where to jump back once its result is available. Returns are also compiled to areturn instructions, but this time returning an object describing the result value. Function-local variables and the per-function stack are passed to the generated JVM code as function arguments:
Each Glulx function is compiled into a class that extends the Glulx.Fun class, defined in Kotlin as:
package Glulx

abstract class Fun {
    abstract class Cont {
        class Return(val value : Int) : Cont()
        class Call(val args: IntArray, val nextFun : Fun, val contPC : Short) : Cont()
    }

    class Result(val value : Int, val contPC : Short)

    abstract fun enter(stub: Stack.CallStub?, args: IntArray): Stack.CallFrame
    abstract fun exec(result: Result?, localVars: IntArray): Cont
}
Since the JVM doesn't support computed jumps, the continuation address contPC is handled by starting the exec of each Fun with a big tableswitch. Here's an example of a recursively defined factorial function (using the Krakatau JVM assembler's syntax):
.method public exec : (LGlulx/Fun$Result;[I)LGlulx/Fun$Cont;
    .code stack 10 locals 10
        aload_1
        ifnull LSTART
        aload_1
        invokevirtual Method Glulx/Fun$Result getContPC ()S
        tableswitch 0
            LCONT0
            default: LSTART

LSTART:
        ;; if V0=0, jump to base case
        aload_2
        ldc 0
        iaload
        ifeq L0

        ;; START call FACT(V0-1)
        ldc 1
        newarray int
        dup
        ldc 0
        aload_2
        ldc 0
        iaload
        ldc 1
        isub
        iastore
        new Glulx/Fun$Cont$Call
        swap
        dup2
        getstatic Field Glulx/Image/FACT fun LGlulx/Fun;
        ldc 0
        invokespecial Method Glulx/Fun$Cont$Call <init> ([ILGlulx/Fun;S)V
        pop
        areturn

LCONT0:
        aload_1
        invokevirtual Method Glulx/Fun$Result getValue ()I
        ;; END call FACT(V0-1)
        ;; Note the code generated for the call spans an areturn!

        ;; Multiply result by V0
        aload_2
        ldc 0
        iaload
        imul

        ;; Return result -- this is the "real" return
        new Glulx/Fun$Cont$Return
        swap
        dup2
        invokespecial Method Glulx/Fun$Cont$Return <init> (I)V
        pop
        areturn

L0:
        ;; For the base case, we just return 1
        new Glulx/Fun$Cont$Return
        dup
        ldc 1
        invokespecial Method Glulx/Fun$Cont$Return <init> (I)V
        areturn
    .end code
.end method
Running these functions then becomes a matter of mere stack juggling, implemented again in Kotlin:
package Glulx

class Stack {
    class CallFrame(val parent: CallStub?, val localVars: IntArray) {
        constructor(parent: CallStub?, localVarCount: Int):
            this(parent, IntArray(localVarCount))

        fun storeArgs(args: IntArray) {
            for (i in args.zip(localVars).indices)
                localVars[i] = args[i]
        }
    }

    class CallStub(val parent: CallFrame, val parentFun: Fun, val parentPC : Short)
}

fun run(startFun: Fun) {
    var frame = startFun.enter(null, IntArray(0))
    var fn = startFun
    var result : Fun.Result? = null
    while (true) {
        val cont = fn.exec(result, frame.localVars)
        when (cont) {
            is Fun.Cont.Return -> {
                val stub = frame.parent
                if (stub == null) return
                frame = stub.parent
                fn = stub.parentFun
                result = Fun.Result(cont.value, stub.parentPC)
            }
            is Fun.Cont.Call -> {
                val stub = Stack.CallStub(frame, fn, cont.contPC)
                fn = cont.nextFun
                frame = fn.enter(stub, cont.args)
                result = null
            }
        }
    }
}
In the real implementation, there's slightly more state to pass around: Fun.exec also gets as an argument an instance of a generic Environment class which it can use to e.g. access the main Glulx memory, or to issue Glk calls; and there are some boring details about handling 32-, 16- and 8-bit local variables.
Note that the exact same technique can also be used to implement tail recursion (even mutual tail recursion) on a platform that doesn't support it, like the JVM. I am not using it here, but Glulx actually has a tailcall instruction (not used by Időrégész, or sane Inform code in general), so it might come in handy if I want to increase feature coverage.
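To make the trampolining idea concrete outside the JVM, here is a minimal sketch in Python (illustrative only, not the post's actual Kotlin runtime; the Call/Return names merely mirror it), showing how returning continuation records instead of making real calls flattens mutual tail recursion into a loop:

```python
# A minimal trampoline: instead of calling each other directly (and blowing
# the host stack), functions return a Call or Return record, and a driver
# loop keeps the "stack" flat. This mirrors the Cont.Call/Cont.Return scheme
# described above, minus the per-frame local variables.

class Return:
    def __init__(self, value):
        self.value = value

class Call:
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args

def run(fn, *args):
    """Drive a trampolined computation to completion."""
    step = Call(fn, *args)
    while isinstance(step, Call):
        step = step.fn(*step.args)
    return step.value

# Mutually tail-recursive even/odd, written against the trampoline:
def is_even(n):
    return Return(True) if n == 0 else Call(is_odd, n - 1)

def is_odd(n):
    return Return(False) if n == 0 else Call(is_even, n - 1)
```

A call like `run(is_even, 100001)` finishes fine even though a directly recursive version would exceed Python's recursion limit.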
Encsé writes in his blog that one of the reasons he created a tiny CPS-based Scheme interpreter was that he realized he hadn't done any fun side projects all year. So I thought I'd make an inventory of fun hacks I've done in 2015.
Some of the answers I wrote this year on Stack Overflow required me to learn just enough of something new to be able to answer the question:
Then, there were those where it turned out there was a bug to be found by scratching the surface of the question deep enough:
Then there were the answers that were just too much fun to write:
All in all, it seems this has been quite a productive year for me outside the office, even if you exclude the Stack Overflow part. I was actually surprised at how long this list turned out to be while compiling it for this blog post. Maybe I should write a list like this every year from now on...
Once we had our C64, I got the first three volumes of LSI's 1001 játék series from my parents all at once (this must have been around '89); I think the fourth and fifth volumes came out later. I was so far from the scene that I didn't even know there was such a thing as a scene. My scene was the handful of elementary school classmates who had at least a ZX Spectrum. So the point is, these few books were my entire connection to C64 games. Apart from the few games that reached us through my dad's acquaintances, I could only read about the rest. Of course, based on the descriptions, every one of them seemed like the best game EVER!, and there were a few that I even started programming in BASIC, based on how I had imagined them; which, for several reasons, was obviously doomed ab ovo to complete failure.

The third volume of 1001 játék contained a walkthrough of Időrégész, which of course I immediately fell in love with. The text adventure was the only game genre whose programming I could get anywhere with at the time, far enough to produce something playable, so I felt all the more that this game was speaking to me. Moreover, I don't even know whether CoV existed yet, but I certainly didn't know about it (the Magic Candle issue was the first one I ran into, in '92 if I remember correctly), so the style and humor of the walkthrough also opened up a whole new world for me.

Since, for the reasons detailed above, I had no chance of getting it as warez, this became the second game we bought as an original (the first was Impossible Mission 2 (on cassette), but I didn't play that much; that was my dad's game: the nights he played it, with the TV volume turned down to barely audible while the character screamed as it fell, could fill a separate post).

The game did not disappoint, but I couldn't finish it for the life of me, not even with the 1001 játék walkthrough. The breakthrough only came in 1991, via a classmate who figured out that SPOILER ALERT at the priest you also have to pray; it's not enough to hand him the golden cross or whatever it was.

At this point I should probably also write about whether it was actually a good game. I think right now I'm the least able to judge that, because I see it from too close up (as will become clear in a moment). What is certain is that anyone who is Hungarian and had a C64 back then (or even a +4!) knows Időrégész and the other Rátkai games: it's no accident that there was an IDDQD article about it, and Index and Facebook groups. A true cult classic.
Lately I've become interested in the old 8-bit machines, since back in the day their low-level programming, and really their whole architecture, completely passed me by. Now that I'm teaching myself FPGA development, I've reached the point where I can build workable computers, and for that it's of course unavoidable to get to know every little detail of these machines. An unexpected but pleasant side effect is that I can now appreciate the demos, both period ones and those made since, in which every single frame seems impossible to produce on a C64.

So I figured I'd try the full reverse engineering of a game, since I'd never done anything like that before. A non-realtime, non-graphical game like Időrégész seemed like a great starter project. The basic idea (back in early May) was to reverse the program far enough to know what every part of it does, understand it on that basis, and reproduce its behavior at as high a level as possible. To record it in a truly permanent form, I definitely planned to target the Z-machine virtual machine, since in the modern interactive fiction world it has clearly been the established tool for decades.

Naturally, until a few weeks ago I didn't know much about developing for the Z-machine either (despite having pushed a Z-machine virtual machine implementation to the working-prototype stage back in 2004), only that there are two serious players on the scene: there's Inform 7, which goes to brutal lengths to look as if it were natural English, and there's Inform 6, a more traditional, somewhat object/prototype-oriented language. Although Inform 7 seems very interesting (apart from the fact that I loathe programming systems that imitate natural language), I quickly decided that the description of a Hungarian-language game interleaved with English connective text would look silly and confusing. So Inform 6 it was.

And so the original plan came together: take Időrégész, dump the text out of it, document all of its rules, reimplement them in Inform 6, compile to the Z-machine, done.

The last thing I needed was to make my job harder with artificial rules, so I quickly decided that doing the reverse engineering in a so-called "period-correct" manner, restricting myself to the tools that would have been available to me on a real Commodore 64 around '89, would absolutely not be a goal. Instead, I went at the task with the debugging tools of the Vice emulator.

The basic idea was to load Időrégész into Vice, let it unpack itself and so on, and once the game had gotten as far as waiting for input, dump the entire memory, disassemble it with da65, and start digging around in it. For this I found a period cassette crack of Időrégész, which was promising because after starting up it loads nothing more, so I could be sure that (apart from the graphics) everything would be in memory.

Although 64 kilobytes doesn't sound like much, reading through 64K of disassembled machine code and trying to tease out where the interesting parts are didn't really prove feasible; not even with the C64 BASIC and Kernal ROMs switched in in the dump (but more on that question later).

Instead, I started tracing the program counter (PC from here on) outside the ROM, at first while doing nothing in the game. Then I filtered out the PCs touched this way, and pressed a key. I filtered out the resulting PCs too, and at that point the trace was no longer flooded with noise until I typed in a complete line. Then I typed in a (nonsensical) line, and reading the PCs it touched side by side with the disassembled code, I finally hit the jackpot; in my original notes below, the number of exclamation marks shows how excited I got:
ff48: good breakpoint for right after the command has run?
c65a: start of the parser!!!!!!!!!!!!!!!!!!!!!!!! good breakpoint!
So I could finally add the first label to the disassembled code. This then set off a kind of avalanche: from $C662 the lexer loop is clearly visible, comparing the input against the words of a dictionary, which of course immediately yielded the dictionary's address on the one hand, and on the other I found a few helper functions, among them, at $C707, the one that runs on a lexing error (in other words, when we type in a word the game doesn't know), and, starting at $C6E0, the one that processes the successfully tokenized input. To show that there's no magic in any of this, here is that last routine as an example, after I fully understood what everything does and labeled and commented it up:
parse_finish:
        lda #$0D             ; C6E0 A9 0D
        jsr putchr           ; C6E2 20 D2 FF
        lda words_buffer     ; C6E5 AD E8 CF
        cmp #$00             ; C6E8 C9 00
        bne check_verb       ; C6EA D0 08
        ;; Signal error if no (non-pronoun) words were parsed
nowords_error:
        lda #$80             ; C6EC A9 80
        jsr message          ; C6EE 20 D2 C9
        jmp startmainloop    ; C6F1 4C 57 C7
        ;; Signal error if first word is not a verb
check_verb:
        lda pronoun          ; C6F4 AD CF CF
        cmp words_buffer     ; C6F7 CD E8 CF
        bpl LC704            ; C6FA 10 08
        lda #$81             ; C6FC A9 81
        jsr message          ; C6FE 20 D2 C9
        jmp startmainloop    ; C701 4C 57 C7
LC704:
        jmp prepare_sentence ; C704 4C 68 C8
With the dictionary's address in hand, I quickly threw together a Haskell program that pulls all the words out of the memory dump starting at $1000, decodes the representation, and groups them by synonym. This was the first tangible result of the work so far. Let's take a look: each word's index is shown both in hex and in decimal, so that it's convenient to see what's what both when reading the machine code and when reading the BASIC scripts:
0x1 1 n
0x1 1 néz
0x1 1 körülnéz
0x1 1 körülnézek
0x1 1 nézek
0x2 2 é
0x2 2 észak
0x3 3 k
0x3 3 kelet
...
0xb 11 r
0xb 11 lerak
0xb 11 leteszem
0xb 11 lerakom
...
0x37 55 a
0x37 55 az
0x37 55 egy
0x37 55 egyik
0x37 55 meg
0x37 55 kis
0x37 55 kicsit
0x37 55 keveset
0x37 55 valamit
0x38 56 mező
0x38 56 mezőt
0x38 56 mezőn
...
0x61 97 oltár
0x61 97 oltárt
0x64 100 pajzs
0x64 100 pajzsot
...
0x6f 111 lámpa
0x6f 111 lámpát
...
0x7e 126 szerzetes
0x7e 126 szerzetest
0x7e 126 szerzetesnek
...
It's not hard to notice (especially now that I've already picked it apart) that the set of words falls into five parts:

Reversing the message-printing routine at $C9D2 also revealed that, naturally, the program needs the full 64K of memory, so certain data, including the messages to be printed, are stored at memory addresses also used by the ROM. This means that before printing, it switches the ROM off, copies the text to be printed into a temporary area not shadowed by the ROM, switches the ROM back on, and calls the appropriate Kernal routine to do the printing.

Of course, this also means that I couldn't have found the messages in the dumped memory: unless I happened to catch, with the dump, the exact moment when the ROM was switched off, they would be shadowed by the ROM (bank switching). In hindsight, this is no big deal (obviously in Vice you can switch the ROMs in and out of the dump independently of the machine's actual internal state), but until I figured it out, it was quite strange that I couldn't find most of the printed messages anywhere.

But then why were a few messages in the original dump after all? As it turned out, only the most generic part of the game's rules (object handling, looking around, etc.) is written in machine code; the ad-hoc rules are described by a tiny BASIC program. The BASIC program prints the specific messages, like "Törtél egy nádszálat." ("You broke a reed."), by itself, with direct print statements.

So the next step was writing yet another small Haskell program, which extracted the texts hidden by the ROM, as well as other tables, such as the connections between locations. From the "look" routine at $CA30 and the "help" routine at $C89F it's apparent that each location has two mandatory messages and one optional one. The first is the location's description, the second is the hint, and the third is the part of the description that is only printed depending on the state of the flag belonging to the location. Since at this point I was still just getting acquainted with the extractable data, and hadn't yet given any thought to what the final Inform 6 program would look like, the first version simply printed every location's data as text:
ROOM 11
A kastély nagytermében állsz. Egy gyö-
nyörű lady áll előtted. A terem északi
és déli ajtajánál őrök strázsálnak.
Hint: Ki tud ellenállni egy szép hölgy
kérésének?
Puzzle: Keresd meg, és hozd vissza az
elveszett ékszereimet! - szól a lady.
Észak: 10
Dél: 9
Kelet: 15
Of course, this is already a slightly idealized version, since the accented letters are decoded correctly in it. Naturally, there was no such thing as a standard encoding back then, so Időrégész uses a completely custom character set and encoding. For simplicity's sake, I did the decoding lazily: at first I printed the hex code of each non-ASCII character in its place. At that point, message number one looked like this:

A v|0xB0|r kazamat|0xB0|iban vagy. Alagutak indul-nak minden ir|0xB0|nyba.

It's not hard to figure out from this that $B0 must be the code for á; then it's enough to read the second message as far as its first hex code, and so on.
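The incremental process described above is easy to sketch (in Python rather than the Haskell I actually used; only the $B0 = á mapping comes from the game, the rest of the table you would fill in the same way, message by message):

```python
# A sketch of the lazy decoding described above: decode with whatever partial
# character table we have so far, printing unknown bytes as |0xNN|
# placeholders; each newly identified code then gets added to the table.
# (Only the 0xB0 -> 'á' mapping is from the post; the rest is hypothetical.)

def decode(data, table):
    """Decode one message using a partial byte -> char table."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))          # plain ASCII passes through
        elif b in table:
            out.append(table[b])        # already-identified custom code
        else:
            out.append('|0x%02X|' % b)  # unknown: show the hex code
    return ''.join(out)

table = {}
msg = b'A v\xb0r kazamat\xb0iban vagy.'
print(decode(msg, table))   # unknown codes show up as |0xB0|

table[0xB0] = 'á'           # ...after eyeballing the output, record the code
print(decode(msg, table))
```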
By the way, the reason for all the spaces and the odd hyphens is that the messages store no newline characters: the lines simply wrap at the 40th column.
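Reproducing the original screen layout is then just a matter of chopping the decoded text into 40-character rows, as in this small sketch:

```python
# The messages store no newlines: the original 40-column screen layout can be
# reproduced by simply breaking the decoded text into fixed-width rows.

def wrap40(text, width=40):
    """Split a message into fixed-width screen rows."""
    return [text[i:i + width] for i in range(0, len(text), width)]

rows = wrap40('x' * 85)
print([len(r) for r in rows])   # three rows: two full, one partial
```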
I had a much easier time with the rules written in BASIC, thanks to a program called petcat, which decodes the tokenized BASIC used on Commodore machines back into plain text, something like this:
27060 if b<>11 then 27090
27070 if peek(t)=1 then print "Világít." : goto160
27080 print "Most nem ég." : goto160
This snippet handles the player examining object 11 (the lamp): t is the memory address of a flag that stores the lamp's state (whether we have lit it); depending on it, we print the appropriate message, then return to the main loop.

Whoops: but the variable b holds the index of the typed-in noun as a word, and as we have seen, although object 11 (word 111) is indeed the "lamp", word 11 is the verb "lerak" ("put down"). That right there is a bug!

The example above raises an interesting question: what should the reproduction do about the bugs? And in general, what level of faithfulness should we be aiming for?

Perhaps the most unfortunate bug is the syntax error in BASIC line 3910, because of which commands of the form HÚZ ("pull") plus a living being crash the game instantly. You could say, who would ever want to type KIHÚZ PAP LÓHERE ("PULL OUT PRIEST CLOVER"), but nobody can tell me that MEGHÚZOM A LADYT ("I PULL THE LADY") isn't a true-to-life command...

The other bugs are relatively harmless, like examining the lamp in the example above, whose only effect is that the player has to remember whether their lamp is lit. That's not even the only lamp-related bug, by the way: in line 4703, when we try to read the text carved into the wall in the chamber hidden in the well, the program only checks whether the lamp is lit, but not whether the player actually has it.

In the end I fixed these and similar bugs, with one exception. On examining two particular objects, the game says "just your size": the helmet, and... the spade. I'm quite certain the second object was meant to be the suit of armour (since that and the helmet are what we must wear later on), but a spade that happens to be exactly your size is such a funny image (and it's even mentioned in the 1001 játék walkthrough) that I had to leave it in.

Of course, it also occurred to me that I could release a patch for the original game fixing these bugs, but unfortunately that's not as simple as it first seems, since the corrected version often wouldn't fit in the original spot, so it would run onto the next line, and BASIC lines can't easily be shifted around (a bunch of next-line pointers would have to be rewritten, and it's not even certain that there's free space after them to spill into).
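To illustrate what shifting lines would involve, here is a sketch in Python (the layout is the standard Commodore tokenized-BASIC format, nothing game-specific): each line is a 2-byte pointer to the next line, a 2-byte line number, the tokenized bytes, and a 0x00 terminator, with a 0x0000 pointer closing the program. If one line grows, every following line moves, so all the pointers must be recomputed:

```python
# Recompute the next-line pointers of a tokenized Commodore BASIC program
# after its lines have been moved around. Line layout (all little-endian):
# [2-byte next-line pointer][2-byte line number][tokens][0x00], and the
# program ends with a 0x0000 pointer.

def relink(program_bytes, base_addr):
    """Rewrite every next-line pointer to match the lines' actual positions."""
    out = bytearray(program_bytes)
    pos = 0
    while True:
        # A 0x0000 link marks the end of the program.
        if out[pos] == 0 and out[pos + 1] == 0:
            break
        # Find the 0x00 terminator of this line's tokenized text.
        end = pos + 4
        while out[end] != 0:
            end += 1
        next_addr = base_addr + end + 1
        out[pos] = next_addr & 0xFF       # pointer, low byte
        out[pos + 1] = next_addr >> 8     # pointer, high byte
        pos = end + 1
    return bytes(out)

# Two one-token lines ("10 REM" / "20 END") with stale 0xFFFF pointers:
prog = bytes([0xFF, 0xFF, 10, 0, 0x8F, 0,
              0xFF, 0xFF, 20, 0, 0x80, 0,
              0, 0])
fixed = relink(prog, 0x0801)
```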
Where I definitely wanted to follow the original closely, though, was the parser: no frills, verb noun noun, with connective words completely ignored. Sure, I could have tried to Hungarianize Inform's parser, rising to Gálya-esque heights with pronouns and everything, but I don't think that would have been Időrégész anymore.

Having extracted all the text, understood the rules, and brought the BASIC script into readable form, it was time to start thinking about what form the reproduction itself should take.

As I said, I originally wanted to write it for the Z-machine, but as it turned out, that isn't really prepared for non-ASCII characters: Latin-1 characters work more or less, but Latin-2 only with a few interpreters. It turned out, though, that there is a similar but much more modern virtual machine for text adventures: Glulx. That's more promising: it handles Unicode, so it can't have any problem with Hungarian accents; the Inform 6 compiler supports it just as it does the Z-machine; and on top of that, it even supports graphics and sound with ease! But at this point, still in the getting-acquainted phase, extracting the images and sounds from the original program hadn't come up even as an idea.

The final version compiles to both Glulx and the Z-machine; the latter is only 48K, so in principle it's not even impossible that it would run on a C64 Z-machine interpreter. Of course, I consider it out of the question that there's a C64 interpreter that handles Latin-2 characters...

That said, I did have my troubles with Inform 6: it's the same kind of crap as Ruby on Rails, in that sure, in the examples everything Just Works™ beautifully, but the moment you want to customize something, it turns out the whole thing is one big impenetrable mess... The parser alone came to some 300 lines of code, and that's with me at least being able to use the platform's facilities for the lexing.

I think there were fundamentally two ways to write the actual code. I could have generated the list of verbs and objects from the memory dump too, in something like this form (readers who know Inform 6 have an advantage here):
Verb 'n//' 'néz' 'körülnéz' 'körülnézek' 'nézek' * -> VERB1;
Verb 'é//' 'észak' * -> VERB2;
...
Object NOUN56
  with name 'mező' 'mezőt' 'mezőn',
  has scenery;
...
Object NOUN100 "egy kerek pajzs"
  with name 'pajzs' 'pajzsot';
This would have meant that the BASIC script could also have been translated to Inform at least semi-mechanically. Instead, I chose a different path: I transcribed the words by hand, giving each one a meaningful name, and, following the Inform conventions, wrote half location-, half object-oriented code. This of course also means bugs may have slipped in; on the other hand, I think source written this way expresses the structure of Időrégész much better.

Let's look at a concrete example! The BASIC snippet below handles (MÁSZIK POLC) ("CLIMB SHELF"):
4103 if p<>17 or b<>94 then 4110
4104 if peek(t+16)=1 then print r$ : print "Nem történt semmi." : goto 4108
4105 poke t+16,1 : print r$ : print "Innen már eléred az ablakot. Látod, hogy";
4106 print "valaki a várudvaron, a túlsó fal tövében";
4107 print "elejt egy pénzérmét." : poke x+2,15 : gosub500
4108 print "Gyorsan lemászol a polcról, mert már inog alattad." : goto160
In my transcription, this becomes a Polc ("shelf") object placed under room 17, with the following, in my opinion downright readable, code:
Room ROOM17; ! details omitted here
Object -> Polc
  with name 'polc' 'polcot' 'polcra'
  has scenery,
  with before [;
    Maszik:
      if (player notin parent(self)) rfalse;
      if (self has solved) "Nem történt semmi.";
      give self solved;
      move Penz to ROOM15;
      AddScore();
      "Innen már eléred az ablakot. Látod, hogy valaki a várudvaron,
       a túlsó fal tövében elejt egy pénzérmét.^
       Gyorsan lemászol a polcról, mert már inog alattad.";
  ];
So the final transcription workflow became: first generate the code for the rooms, then add before/after handlers and scenery objects to them. The room below, for example, has nothing special in it, so I didn't have to touch the generated code at all:
Room ROOM3
  with description
    "A vár kazamatáiban vagy. Alagutak indulnak minden irányba.",
  picture 1,
  hint "Minden rendben.",
  n_to ROOM2,
  s_to ROOM0,
  e_to ROOM2,
  w_to ROOM4;
Since this transcription of the BASIC script took quite a lot of work, and quickly became boring, I thought I'd spice it up a bit by also having a go at extracting the images.

Already during the first, exploratory read-throughs, I had noticed the routine starting at $CBF8, which reads from disk into RAM. One well-aimed breakpoint immediately proved that this routine really is called when the player moves from one room to another and the two rooms' picture numbers differ.

Since practically all that happens here is that the loaded data (a 192×80 pixel multicolor image) is copied verbatim into the appropriate part of video memory, it seemed promising to concentrate on the data on the disk alone, instead of entering every single room in the emulator and saving out the video memory. To see that what I thought was happening really was happening, I wrote yet another Haskell script to reconstruct the screen into a PNG from the video memory. There was a bit of pain here with the strange order in which the pixels are addressed; but I'm sure the reason for it is to let the text-mode screen share as much circuitry as possible with the graphical one. The bottom of the first correctly decoded image was still garbage, because down there Időrégész is already using text mode:
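The "strange order" is, as far as I can tell, the standard VIC-II bitmap layout (my reading of it; whether the game's format matches exactly is an assumption): the bitmap is stored in 8×8 character cells, first all 8 rows of one cell, then the next cell, precisely so that text mode and bitmap mode can share addressing circuitry. A sketch:

```python
# A sketch of C64 bitmap pixel addressing: the screen is stored cell by cell
# (8x8 character cells), 8 consecutive bytes per cell, 40 cells (320 bytes)
# per cell row. This is my reading of the standard VIC-II layout, offered as
# an illustration of the "strange order" mentioned in the post.

def bitmap_offset(x, y):
    """Byte offset of pixel (x, y) in a 40-column VIC-II bitmap."""
    cell_row, fine_y = divmod(y, 8)
    cell_col = x // 8
    return cell_row * 320 + cell_col * 8 + fine_y

def hires_pixel(bitmap, x, y):
    """One hires pixel; the byte's MSB is the leftmost pixel of the group."""
    byte = bitmap[bitmap_offset(x, y)]
    return (byte >> (7 - x % 8)) & 1
```

In multicolor mode (which the game's 192×80 images use), the same bytes are read as four double-wide pixels of 2 bits each, but the cell-wise byte ordering stays the same.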
After I cut off the bottom and added color handling, the image became recognizable right away:

Applying the same script to the files on the disk immediately yielded the rooms' illustrations:

I don't have a similarly exciting story about the sounds: in the end, I recorded the music from the Vice emulator to wav, then compressed it to Vorbis.

The first public version was a three-room demo that used the final parser and already contained images (but no music yet). I tried to recreate with it the mood that the LucasArts game demos had back in the day: sure, it's only a few locations, sure, it takes a few minutes to finish, but it stands on its own feet, it has a puzzle, and it can be won. I uploaded this version to the IDDQD Facebook group on May 23rd, where a lot of people took an interest. Incidentally, this is also where they convinced me to put the music into the final version; and this is where I learned that there is a separate Rátkai page on Facebook.

I finished the complete game on June 1st and uploaded it to the IDDQD page as a sort of semi-beta, asking the testers not to spread it yet until I had fixed every bug in it. And the enthusiastic testers did indeed find a few. The best part was that they even started swapping stories about the game with each other in the comments!
In my quest to build more and more complicated computers on FPGAs, armed with nothing but a crappy hobbyist mindset and some hazy ideas of how Kansas Lava is supposed to work, I've reached another milestone: my first real computer.
That is, unlike the Brainfuck CPU that I designed myself, or the CHIP-8, which was originally a virtual machine spec (with all the implementation leeway that implies), this latest subject is a bona fide 8-bit home computer from the seventies: the Commodore PET.
The PET is a very simple machine compared to later Commodore models, which is why I thought it would make a good first step on a journey that I hope will one day culminate in implementing a Commodore 64. Its centerpiece is the MOS 6502 CPU (practically the same as the MOS 6510 used in the C=64), and there are only four other components: a text-only video signal generator and three IO interface chips (two PIAs and one VIA) for keyboard, Datasette and extension port communication. Just hooking up one of the PIAs is enough to get a minimal system working with keyboard input.
12 KBytes of PET ROM contain the implementation of basic IO routines (the so-called "kernal"), the full-screen text editor, and Microsoft's BASIC interpreter. Then there's a separate character ROM (not addressable from the CPU) used by the video generator.
The 6502 microprocessor was a staple of the eight-bit home computer era of the late seventies and eighties. By today's standards, it is incredible to imagine what it must have been like to design it manually, drawing the layout with pencils on paper. On the other hand, if it was designed in such a low-tech way, I figured it shouldn't be too difficult to build something compatible using modern tools, even for a hobbyist like myself. And of course there are already dozens of home-built 6502 implementations out there, to varying degrees of compatibility.
The ultimate reference on the 6502 must be the Visual 6502 Project which I deliberately avoided consulting. I don't really see the educational value in copying the original 6502 design; so instead, I went with a more black-box approach by just looking at the opcode descriptions and interrupt model and working from that.
The first milestone I aimed for was to get enough of the CPU working that I can run the sample programs on 6502asm.com, which defines a tiny microcomputer-like architecture that doesn't have interrupts or any fancy video modes: you just have 32×32 pixels with a fixed 16-color palette mapped to main RAM from $0200, and a zero page-mapped register for keyboard input that you can do polling on. The Kansas Lava implementation is really simple and I plan to reuse it later if I do a similar project with the Z80.
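For reference, the 6502asm.com display model is tiny enough that a host-side renderer for it can be sketched in a few lines (this is my reading of its spec; treat the address and palette details as assumptions):

```python
# A sketch of reading the 6502asm.com display out of emulated RAM: the 32x32
# screen lives at $0200 (row-major, one byte per pixel), and the low nibble
# of each byte selects one of 16 colors. Details per my understanding of the
# 6502asm.com description; the exact masking is an assumption.

SCREEN_BASE = 0x0200
SCREEN_SIZE = 32

def pixel_color(ram, x, y):
    """Palette index of pixel (x, y), read row-major from $0200."""
    return ram[SCREEN_BASE + y * SCREEN_SIZE + x] & 0x0F

# Writing like a 6502 program would (sta $0200,x etc.), then reading back:
ram = bytearray(0x10000)
ram[SCREEN_BASE] = 0x01        # pixel at (0, 0)
ram[SCREEN_BASE + 33] = 0x15   # pixel at (1, 1); high nibble is ignored
```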
My workflow was that I would use ca65 to assemble test programs, burn them into a ROM image, and run it in the Kansas Lava simulator for a couple thousand cycles, then render the video RAM into a GTK+ window. I started with this program that does nothing but move data around in memory (drawing the Commodore logo pixel by pixel), and basically I implemented the 6502 opcodes as I went along. After two days of work, I finally got it working:
Seeing this was an incredible feeling. The input was valid 6502 machine code, and my very own CPU managed to run it correctly for the approximately 40,000 cycles that it took to draw this image. There was no stopping at this point: I already had a working VGA frame buffer implementation from the CHIP-8, so the next day I synthesized it and ran it on real hardware, my venerable Papilio Pro:
As I added more and more opcodes and started running more and more complicated programs, things very quickly stopped working. My CPU was full of bugs, and figuring out what went wrong by looking at the simulation logs after running it for tens of thousands of cycles was very tedious.
And so, it was at this point that I started adding unit tests. The framework for writing tests exposes a monad where the available effects are making observations on the state of the system (CPU registers and contents of the memory) and executing instructions. This presents an API that allows writing tests in an imperative way:
php = do
    flags <- observe statusFlags
    sp <- observe regSP
    execute0 0x08
    sp' <- observe regSP
    pushed <- observe $ mem (stackAddr <$> sp)
    assertEq "Stack pointer is decremented" sp' (pred <$> sp)
    assertEq "Status is correctly pushed" pushed flags
A test like this is turned into a ROM image containing $08 at the address pointed to by the reset vector. The simulation is then run until the CPU enters the Fetch internal state for the second time (the first time is when it fetches the opcode under testing, i.e. the PHP ($08) instruction), and then the observations are evaluated by looking at the simulation output in the same cycles as the Fetch state. Of course, this means you shouldn't be able to write tests like the following:
impossiblyDynamicTest = do
    arg <- observe regX
    execute1 0x00 arg
    a' <- observe regA
    assertEq "A is updated" a' arg
This is ensured by observe returning values wrapped in an Obs type, and execute1 requiring unwrapped arguments:
observe :: Query a -> TestM (Obs a)
execute1 :: Byte -> Byte -> TestM ()
assertEq :: (Eq a, Show a) => String -> Obs a -> Obs a -> TestM ()
To allow assertions over derived values, Obs is an applicative functor (in fact, it is the free applicative functor over the co-Yoneda functor of the primitive observations).
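The core idea, observations as deferred descriptions that can be combined applicatively, can be sketched like this (a Python stand-in for the Haskell Obs type; the names here are mine, not the post's actual API):

```python
# A sketch of the Obs idea: an observation is not a value but a *description*
# of what to read from the simulation output, and observations combine
# applicatively, so assertions over derived values stay deferred until the
# simulation has actually run. (Illustrative Python; the real thing is the
# free applicative in Haskell.)

class Obs:
    def __init__(self, run):
        self.run = run                  # run :: simulation output -> value

    @staticmethod
    def pure(x):
        """An observation that ignores the output entirely."""
        return Obs(lambda out: x)

    def map2(self, other, f):
        """Combine two observations with a binary function: this is the
        applicative structure that allows assertions over derived values."""
        return Obs(lambda out: f(self.run(out), other.run(out)))

def observe(key):
    """A primitive observation: read one register from the output."""
    return Obs(lambda out: out[key])

def assert_eq(label, obs1, obs2, sim_output):
    """Nothing is read until the simulation output is available."""
    assert obs1.run(sim_output) == obs2.run(sim_output), label

# "The stack pointer after PHP equals the one before, minus one":
sp_before = observe('regSP')
sp_after = observe('regSP_after')
decremented = sp_before.map2(Obs.pure(1), lambda a, b: a - b)

assert_eq("SP is decremented", sp_after, decremented,
          {'regSP': 0xFF, 'regSP_after': 0xFE})
```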
I think this approach has merit as a general framework for hardware simulator-based unit testing and I intend to extend it and maybe even carve it out into a separate library in the future.
Once I had a sufficiently working CPU, I started building the other pieces around it. I took the PET emulator from the VICE suite and commented out all the PIA and VIA code, replacing writes with nops and reads with hardcoded values, until I was able to boot it up with the stock ROM to get to the READY. prompt. Of course, since the PIA supplying the interrupt used for timing was removed by that point, I had no flashing cursor or keyboard input. All in all, the system got to a steady state in about 80,000 operations. (Since my implementation is not yet cycle-accurate, I had to switch to counting operations instead of cycles beyond this point. Every operation is at least as fast as on the real chip, so I hope by adding some wait cycles I'll be able to take care of this at some later point.)
After hooking up the same hardcoded values on the same addresses to the CPU, the next step was running the simulator and peeking at the video memory area ($8000..$8FFF on the PET), using the original fonts to render the screen. The initial version showed there might be someone home (sorry for crap quality on the screenshot):
By comparing detailed logs from running the emulator and the simulator, I was able to make observations like "the first 12,345 steps seem to be in agreement", which was a big boost to productivity, getting me, in short order, to this:
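The essence of that log comparison is trivial to express; here is a sketch in the same spirit (not the actual tooling I used): given two traces, report the first step at which they disagree.

```haskell
-- Find the index of the first step where the emulator and simulator
-- traces diverge; Nothing means they agree on their common prefix.
-- (zip truncates to the shorter trace, which is fine for this use.)
firstDivergence :: Eq a => [a] -> [a] -> Maybe Int
firstDivergence xs ys =
    case [ i | (i, (x, y)) <- zip [0 ..] (zip xs ys), x /= y ] of
        (i : _) -> Just i
        []      -> Nothing
```

Pointing a function like this at per-step dumps of the register file is what turns "something is wrong somewhere" into "the first 12,345 steps seem to be in agreement".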
After fixing some more bugs in the arithmetic opcodes, I was finally rewarded by this sight:
While working on the CPU, I also started writing the character generator, on top of the VGA signal generator in the kansas-lava-papilio package that I originally made for the CHIP-8. This way, the VGA synchronization signals were abstracted away from me and I just had to take care of pumping out the actual pixels. This turned out to be trickier than I originally thought, since you have to time all the read-aheads just right so that everything is at hand just in time for the next pixel. So before it finishes drawing the 8 pixels that make up a single row of a character, the next character index is loaded from RAM, and then the character ROM is consulted for the first row of the font image of the next indexed character. Initial versions had some ghosting issues, or, even more fun, full character transpositions (like showing the character from one line above in the first position of each line).
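Very roughly, the read-ahead schedule can be modelled like this (a pure illustrative model, not the actual Kansas Lava circuit; the exact columns depend on the RAM and ROM latencies, here assumed to be one cycle each):

```haskell
-- Extra work to kick off at each pixel column (0..7) within a
-- character cell, alongside shifting out the current pixel.
data Action = RequestCharIndex | RequestFontRow | LatchFontRow
    deriving (Eq, Show)

readAhead :: Int -> [Action]
readAhead 5 = [RequestCharIndex]  -- start video RAM read for cell n+1
readAhead 6 = [RequestFontRow]    -- index arrived; start char ROM read
readAhead 7 = [LatchFontRow]      -- font row arrived; reload the shifter
readAhead _ = []
```

Get any of these one column off and you see exactly the kinds of artifacts described above: ghosting when the shifter is reloaded late, transpositions when the wrong index is latched.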
The Commodore PET diverts the vsync signal from the video generator to one of the PIA chips, which generates a CPU interrupt that can be acknowledged by reading from one of its memory-mapped registers. So the next obvious step was to implement this functionality to get the cursor blinking! This required more than just implementing a PIA, since I didn't even have interrupts in the CPU at that point.
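The behaviour described above amounts to a one-bit state machine; a minimal sketch (field and function names are mine, and this ignores the real PIA's register map and control bits):

```haskell
-- Vsync sets the interrupt-pending flag; reading the memory-mapped
-- register returns it and acknowledges (clears) the interrupt.
newtype PIA = PIA { irqPending :: Bool }
    deriving Show

onVSync :: PIA -> PIA
onVSync pia = pia { irqPending = True }

readIRQReg :: PIA -> (Bool, PIA)
readIRQReg pia = (irqPending pia, pia { irqPending = False })
```

The CPU side is then the stock PET ROM's interrupt handler doing exactly such a read, which is why interrupts had to exist in the CPU before the cursor could blink.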
But all that work was totally worth it:
The current version supports keyboard input from PS/2 keyboards (but not all keys are mapped yet), so for the first time since I started working on this more than a month ago, it can be used to write and run BASIC programs!
What you can't see on the video below is that there's still a bug somewhere that causes the classic 10 PRINT "FOO": 20 GOTO 10 program to terminate with an out of memory error after some time.
Apart from fixing these bugs, the big outstanding feature is to add Datasette support so that programs can be loaded from and saved to virtual "cassettes". For a first version, I'll just burn some extra ROM onto the FPGA containing the tape images and hook that up to the PIA controlling the cassette player; but I guess the proper way to do this would be to use something like an SD card reader to get proper persistent, writable storage. Or maybe, alternatively, have some kind of serial-over-USB communication with a computer acting as the Datasette unit.