Blog tags RSS

Entries tagged fpga

A tiny computer for Tiny BASIC

17 November 2020 (programming haskell clash fpga)

Just a quick post to peel back the curtain a bit on a Clash example I posted to Twitter. My tweet in question shows the following snippet, claiming it is "a complete Intel 8080-based FPGA computer that runs Tiny BASIC":

logicBoard
    :: (HiddenClockResetEnable dom)
    => Signal dom (Maybe (Unsigned 8)) -> Signal dom Bool -> Signal dom (Maybe (Unsigned 8))
logicBoard inByte outReady = outByte
  where
    CPUOut{..} = intel8080 CPUIn{..}

    interruptRequest = pure False

    (dataIn, outByte) = memoryMap _addrOut _dataOut $ do
        matchRight $ do
            mask 0x0000 $ romFromFile (SNat @0x0800) "_build/intel8080/image.bin"
            mask 0x0800 $ ram0 (SNat @0x0800)
            mask 0x1000 $ ram0 (SNat @0x1000)
        matchLeft $ do
            mask 0x10 $ port $ acia inByte outReady
    

I got almost a hundred likes on this tweet, which isn't too bad for a topic as niche as hardware design with Haskell and Clash. Obviously, the above tweet was meant as a brag, not as a detailed technical description; but given the traction it got, I thought I might as well try to expand a bit on it.

Now, the background to all this is that I'm working on a book on retrocomputing with Clash. So in this post, I will try to condense the 75k words of text and the 5k lines of code that I have written so far; it's going to be more of an extended table of contents than a Cliffs notes.

Tiny BASIC is an interactive BASIC interpreter; these days, we would call it a REPL. The computer above runs one of the original, Intel 8080-based versions of Tiny BASIC as its firmware. When the computer is turned on, it boots straight into a BASIC prompt, just like the Commodore PETs and Apple II's of yesteryear. The software assumes there is a peripheral controller connected to certain output ports of the CPU; all IO is done via a stream of bytes written to and read from this controller.

Input and output

Let's start with the interface. We see that logicBoard has two input signals and one output signal. The Maybe (Unsigned 8) input is a byte coupled with an "input ready" line; in other words, the value is Just a byte if there is new input coming in in a given clock cycle, and Nothing otherwise. The output of the same type is, unsurprisingly, the output of the whole computer in the same format. The extra Bool input is for backpressure: some output peripherals might need time to process an output byte before they are ready for the next one.

We will need to connect these incoming and outgoing bytes to the outside world somehow, and in a real hardware implementation, the IO controller actually contained a UART so that input and output was serialized into a stream of bits. However, by uncoupling the UART from the ACIA implementation, and exposing input and output as byte events, we make it easy to hook up alternative communication media:

The CPU

The Intel 8080 core we are using is also, of course, written in Clash. I can't really go through its implementation in this small space; the chapter where we construct it from scratch is at 13k words and builds heavily on techniques introduced in earlier chapters describing a CPU that uses Brainfuck as its machine code and a CHIP-8 machine. In this post, we'll just look at its interface, to see how it fits into the logicBoard:

declareBareB [d|
  data CPUIn = CPUIn
    { dataIn :: Maybe Value
    , interruptRequest :: Bool
    } |]

declareBareB [d|
  data CPUOut = CPUOut
      { _addrOut :: Maybe (Either Port Addr)
      , _dataOut :: Maybe Value
      , _interruptAck :: Bool
      , _halted :: Bool
      } |]

type Signals dom b = b Covered (Signal dom)
type Pure b = b Bare Identity
type Partial b = Barbie (b Covered) Last

intel8080 :: (HiddenClockResetEnable dom) => Signals dom CPUIn -> Signals dom CPUOut
    

We use the wonderful Barbies library for CPUIn and CPUOut so that we can switch between a record of signals and a signal of a record. Outwards, Signals dom CPUOut gives easy access to each output pin separately; but internally, the code describing a single clock cycle's state transition creates a full Pure CPUOut inside a Writer monad, with composable Partial CPUOut-producing fragments.

Looking back at logicBoard, we can see that it keeps interruptRequest at False and feeds the dataIn bus from the result of the memory mapper/address decoder. On the output side, addrOut and dataOut is, unsurprisingly, fed into the address decoder; the other pins of CPUOut are not used in this particular computer.

Address decoding

The Intel 8080 has a 16-bit address bus that it uses both for accessing memory and for communication with up to 256 peripheral controllers. The actual details of how a real Intel 8080 signals memory vs. IO port access is quite baroque; but because we are building full computers using an Intel 8080 ISA-compatible CPU, we don't look for pin compatibility. Instead, we capture the morally right addressing interface of the 8080 by using the type Either Port Address (or, with the type synonys resolved, Either (Unsigned 8) (Unsigned 16)) for addrOut. In this particular computer, the details of the original Tiny BASIC firmware prescribe the following memory layout:

Putting it all together, we need 6 KB of RAM from 0x0800 to 0x1fff. We split it into two parts for easier address decoding: 2 KB from 0x0800 to 0x0fff, and 4 KB from 0x1000 to 0x1fff:

   .--------.           
   | ROM 2KB|<---.      
   `--------'    |      .------.                                   
                 |      |      |     .------.                      
   .--------.    |      |      |     |      |<--------<( inByte    
   | RAM 2KB|<---|----->| 8080 |<--->| ACIA |                      
   `--------'    |      |      |     |      |>-------->( outByte   
                 |      |      |     |      |<--------<( outReady  
   .--------.    |      `------'     `------'                      
   | RAM 4KB|<---'      
   `--------'           
    

We want to write this, and exactly this in logicBoard, and the abstraction we use for this is memoryMap. In an address decoder, we start with an address of the whole memory space, and iteratively cut it down into smaller sub-spaces, until we get to something small enough that it maps to a single component. In this case, we start with Either Port Address, and cut it down into e.g. an Unsigned 11 to address a 2K memory component. The dataOut writes and the dataIn reads are restricted along the address decoding, so that the write request only goes to the selected component, and the read result is taken from that as well.

Under the hood, the memory map description uses a Reader of the address and the write, and a Writer of the read result:

newtype Addressing dom addr dat a = Addressing
    { unAddressing :: ReaderT
                      (Signal dom (Maybe addr), Signal dom (Maybe dat))
                      (Writer (Ap (Signal dom) (First dat)))
                      a
    }
    deriving newtype (Functor, Applicative, Monad)

memoryMap
    :: Signal dom (Maybe addr)
    -> Signal dom (Maybe dat)
    -> Addressing dom addr dat a
    -> (Signal dom (Maybe dat), a)
    

Why the Maybe in the address? Because we want to know if the CPU doesn't need memory access in a given cycle at all; this is useful when memory is shared between the CPU and other components. We can resolve memory access contention by prioritizing some peripherals (returning Nothing in dataIn, forcing the CPU to stall) and de-prioritizing others (by only allowing them to access their memory component when the addressing of that sub-space is Nothing).

The implementation of memoryMap and its combinators is not particularly interesting; mask routes based on the high bits and leaves the low bits for the sub-space to handle:

mask
    :: (KnownNat k, KnownNat n)
    => (HiddenClockResetEnable dom)
    => Unsigned (n + k)
    -> Addressing dom (Unsigned k)       dat a
    -> Addressing dom (Unsigned (n + k)) dat a
    

ram0 and romFromFile simply wrap Clash memory primitives (zero-initialized synchronous RAM, and synchronous ROM initialized from a bitfile image) into Addressing-compliant forms, and port hooks up an IO peripheral that responds to PortCommands:

type Port dom addr dat a
    = Signal dom (Maybe (PortCommand addr dat)) -> (Signal dom (Maybe dat), a)

port
    :: (HiddenClockResetEnable dom, NFDataX dat)
    => Port dom addr dat a
    -> Addressing dom addr dat a      
    

This is also where we find out where the second component of memoryMap's return vaue comes from: the whole point of IO peripherals is that they can have connections, and as such, output signals, going to other parts of the circuit (or directly the outside world), not just the CPU's data bus.

The full code

The full Clash source code of the Tiny BASIC computer, including the Intel 8080 core, is available on Github. There is no user documentation at all yet; I've been using all the time and energy I can put into hacking for writing my book instead of tidying up the related repositories. So that's still something I need to get around to eventually; pull requests are obviously welcome!

Integrating Verilator and Clash via Cabal

7 May 2020 (programming haskell clash fpga)

TL;DR: This is a detailed description of how I got Clashilator working seamlessly from Cabal. It took me three days to figure out how the pieces need to fit together, and most of it was just trawling the documentation of Cabal internals, so I better write this down while I still have any idea what I did.

Background information

We set the scene with the dramatis personæ first:

When designing some circuit, it is very useful to be able to simulate its behaviour. Getting debugging information out of a hardware FPGA is a huge hassle; iteration is slow because FPGA synthesis toolchains, by and large, suck; and driving the circuit (i.e. setting the right inputs in the right sequence and interpreting the outputs) can be almost as complicated as the circuit under testing itself. Of course, all of these problems apply doubly to someone like me who is just dabbling in FPGAs.

So during development, instead of synthesizing a circuit design and loading it onto an FPGA, we want to simulate it; and of course we want to use nice expressive languages to write the test bench that envelopes the simulated circuit. One way to do this is what I call very high-level simulation: in this approach, we take the Haskell abstractions we use in our circuit, and reinterpret them in a software context. For example, we might have a state machine described as i -> State s o: instead of lifting it into a signal function, we can just runState it in a normal Haskell program's main and do whatever we want with it.

However, sometimes we want to simulate whole circuits, i.e. Signal dom i -> Signal dom o functions that might have who knows what registers and memory elements inside. For example, if we have a circuit that generates video output from a frame buffer, there's a difference between a high-level simulation that renders the frame buffer's contents to the screen, and a lower level one that interprets the VGA signal output of the circuit. Timing issues in synchronizing the VGA blanking signals with the color lines will only manifest themselves in the latter. So for this kind of applications, Clash of course contains a signal simulator that can be used to feed inputs into a circuit and get outputs. For example, here's a simulation of a Brainfuck computer where only the peripheral lines are exposed: internal connections between the CPU and the RAM and ROM are all part of the Clash circuit.

There is only one problem with the Clash simulator: its performance. This small benchmark shows how long it takes to simulate enough cycles to draw 10 full video frames at 640 ⨯ 480 resolution (i.e. 4,192,000 cycles). Clash does it in ~13 seconds; remember, at 60 FPS, it shouldn't take more than 166 milliseconds to draw 10 frames if we want to simulate it in real time. Of course, real-time simulation at this level of detail isn't necessarily feasable on consumer hardware; but less than one frame per second means any kind of interactive applications are out.

In contrast, Verilator, an open-source Verilog simulator can run all 4,192,000 cycles in 125 ms. This could be faster than realtime, were it not for the additional overhead of drawing each frame to the screen (and the Haskell FFI of 4 million calls accross to C...), and of course this is a very simple circuit that only renders a single bouncing ball; anything more complex will only be slower. But still, that same benchmark shows that 20+ FPS is possibe, end-to-end, if we ditch Clash and use Verilator instead. Pong is fully playable at 13 FPS.

Clash, Verilog and Haskell

The interface between Clash and Verilator is simple: Verilator consumes Verilog code, so we can simply run Clash and point Verilator at its output. However, we still need to connect that Verilator code to the Haskell code to drive the inputs and interpret the outputs. Here are the glue files from the Pong example:

As I was preparing to write the next chapter of a Clash book I've been working on, I made a new Clash project and then, because I needed a Verilator simulation for it, I started going through the steps of making all these files. And of course, I realized this should be all automated. But what is all in that sentence?

Step one was to write generators for the C++ and Haskell source files and the Makefile. This is quite easy, actually; after all, it is the fact that these files are so regular that makes it infuriating writing them by hand. So we do a bit of text template substitution, using Clash's .manifest output as the source of input/output pin names and bus widths. This gives us a simple code generator: you run Clash yourself, point Clashilator at a .manifest file, and it outputs a bunch of files, leaving you ready to run make. Mission accomplished?

No, not really.

Clash, Verilog and... Cabal

While we've eliminated the boilerplate in the source files, one source of boilerplate remains: the Cabal package settings. Here's the relevant HPack package.yaml section from the Pong example:

extra-libraries: stdc++ 
extra-lib-dirs: verilator
include-dirs: verilator
build-tools: hsc2hs

ghc-options:
  -O3
  -fPIC -pgml g++
  -optl-Wl,--whole-archive -optl-Wl,-Bstatic
  -optl-Wl,-L_build/verilator -optl-Wl,-lVerilatorFFI
  -optl-Wl,-Bdynamic -optl-Wl,--no-whole-archive
    

Oof, that hurts. All those magic GHC options just to statically link to libVerilatorFFI.a, to be repeated accross all Clash projects that use Verilator...

Also, while Clashilator outputs a Makefile to drive the invocation of Verilator and the subsequent compilation of the C++ bits, it doesn't give you a solution for running *that* Makefile at the right time — not to mention running Clashilator itself!

The problem here is that in order to compile a Verilator-using Haskell program, we first need to compile the other, non-simulation modules into Verilog. And this is tricky because those modules can have external dependencies: remember Clash is just a GHC backend, so you can import other Haskell libraries. And how are we going to bring those libraries in scope? I myself use Stack but your mileage may vary: you could be using Cabal directly, or some Cabal-Nix integration. In an case, you'd basically need to build your package so you can compile to Verilog so you can run Verilator so you can... build your package.

To solve this seemingly circular dependency, and to get rid of the Cabal file boilerplate, I decided to try and do everything in the Cabal workflow. Whatever your preferred method of building Haskell packages, when the rubber hits the road, they all ultimately run Cabal. If we run Clash late enough in the build process, all dependencies will be installed by then. If we run Verilator early enough in the build process, the resulting library can be linked into whatever executable or library Cabal is building.

If we do all this during cabal build, everything will happen at just the right time.

Is such a thing       even possible? Yes it is.

So, what gives us at least a fighting chance is that Cabal is extensible with so-called hooks. You can write a custom Setup.hs file like this:

import Distribution.Simple

main = defaultMainWithHooks simpleUserHooks
    

Here, simpleUserHooks is a record with a bunch of fields for extension points; of particular interest to us here is this one:

buildHook :: PackageDescription -> LocalBuildInfo -> UserHooks ->
BuildFlags -> IO ()
    

At ths point, we have breached the defenses: inside buildHook, we can basically do arbitrary things as long as the effect is building the package. In particular, we can:

The result of all this is a bunch of new files under the build directory, and modified BuildInfos for all the components marked with x-clashilator-top-is. We put these back into the PackageDescription and then call the default buildHook, which then goes on to compile and link the Haskell simulation integrating the Verilator parts:

clashilatorBuildHook :: PackageDescription -> LocalBuildInfo -> UserHooks -> BuildFlags -> IO ()
clashilatorBuildHook pkg localInfo userHooks buildFlags = do
    pkg' <- clashilate pkg localInfo buildFlags
    buildHook simpleUserHooks pkg' localInfo userHooks buildFlags
    

All the details are in the full implementation, which is probably worth a look if you are interested in Cabal's internals. As I wrote in the beginning of this post, it took me days to wade through the API to find all the moving parts that can be put together to apply this level of violence to Cabal. The final code also exports a convenience function clashilatorMain for the common case where enabling Clashilator is the only desired Setup.hs customization; and also clashilate itself for hardcore users who want to build their own buildHooks.

The implementation is almost 150 lines of not particularly nice code. It is also missing some features; most notably, it doesn't track file changes, so Clash and Verilator is always rerun, even if none of the Clash source files have changed. It is also completely, utterly untested. But it does give us what we set out to do: completely boilerplate-less integration of Clash and Verilator. A complete example package is here, and here's the money shot: the executables section of the package.yaml file.

executables:
  simulator:
    main: simulator.hs
    verbatim:
      x-clashilator-top-is: MyCircuit.Nested.Top
      x-clashilator-clock: CLK
    

Composable CPU descriptions in CλaSH, and wrap-up of RetroChallenge 2018/09

30 September 2018 (programming haskell fpga electronics retrochallenge clash)

My last post ended with some sample CλaSH code illustrating the spaghetti-ness you can get yourself into if you try to describe a CPU directly as a function (CPUState, CPUIn) -> (CPUState, CPUOut). I promised some ideas for improving that code.

To start off gently, first of all we can give names to some common internal operations to help readability. Here's the code from the previous post rewritten with a small kit of functions:

intensionalCPU (s0, CPUIn{..}) = case phase s of
    WaitKeyPress reg ->
        let s' = case cpuInKeyEvent of
                Just (True, key) -> goto Fetch1 . setReg reg key $ s
                _ -> s
            out = def{ cpuOutMemAddr = pc s' }
        in (s', out)                   
    Fetch1 ->
        let s' = goto Exec s{ opHi = cpuInMem, pc = succ $ pc s }
            out = def{ cpuOutMemAddr = pc s' }
        in (s', out)
  where
    s | cpuInVBlank = s0{ timer = fromMaybe 0 . predIdx $ timer s0 }
      | otherwise = s0
    

Using a State monad

To avoid the possibility of screwing up the threading of the internal state, we can use the State CPUState monad:

Continue reading »

CPU modeling in CλaSH

23 September 2018 (programming haskell fpga electronics retrochallenge clash)

My entry for RetroChallenge 2018/09 is building a CHIP-8 computer. Previously, I've talked in detail about the video signal generator and the keyboard interface; the only part still missing is the CPU.

The CHIP-8 instruction set

Since the CHIP-8 was originally designed to be an interpreted language run as a virtual machine, some of its instructions are quite high-level. For example, the framebuffer is modified via a dedicated blitting instruction; there is a built-in random number generator; and instructions to manipulate two 60 Hz timers. Other instructions are more in line with what one would expect to see in a CPU, and implement basic arithmetic such as addition or bitwise AND. There is also a generic escape hatch instruction but that doesn't really apply to hardware implementations.

The CPU has 16 generic-purpose 8-bit registers V0VF; register VF is also used to report flag results like overflow from arithmetic operations, or collision during blitting. Most instructions operate on these general registers. Since the available memory is roughly 4K, these 8-bit registers wouldn't be too useful as pointers. Instead, there is a 12-bit Index register that is used as the implicit address argument to memory-accessing instructions.

For flow control, the program counter needs 12 bits as well; the CHIP-8 is a von Neumann machine. Furthermore, it has CALL / RET instructions backed by a call-only stack (there is no argument passing or local variables).

Modeling the CPU's internal state

We can collect all of the registers described above into a single Haskell datatype. I have also added two 8-bit registers for the high and low byte of the current instruction, but in retrospect it would be enough to just store the high byte, since the low byte is coming from RAM exactly when we need to dispatch on it anyway. The extra phase register is to distinguish between execution phases such as fetching the first byte of the next instruction, or for instructions that are implemented in multiple clock cycles, like clearing the frame buffer (more on that below).

type Addr = Unsigned 12
type Reg = Index 16

data CPUState = CPUState
    { opHi, opLo :: Word8
    , pc, ptr :: Addr
    , registers :: Vec 16 Word8
    , stack :: Vec 24 Addr
    , sp :: Index 24
    , phase :: Phase
    , timer :: Word8
    , randomState :: Unsigned 9
    }
    

I implemented the random number generator as a 9-bit linear-feedback shift register, truncated to its lower 8 bits; this is because a maximal 8-bit LFSR wouldn't generate 0xFF.

lfsr :: Unsigned 9 -> Unsigned 9
lfsr s = (s `rotateR` 1) `xor` b4
  where
    b = fromIntegral $ complement . lsb $ s
    b4 = b `shiftL` 4
    

Input and output "pins"

Similar to how a real chip has various pins to interface with other parts, our CPU description will also have multiple inputs and outputs. The input consists of the data lines read from main memory and the framebuffer; the events coming from the keypad, and the keypad state; and the 60 Hz VBlank signal from the video generator. This latter signal is used to implement the timer register's countdown. The keypad's signals are fed into the CPU both as events and statefully; I've decided to do it this way so that only the peripheral interface needs to be changed to accomodate devices that are naturally either parallel (like a keypad matrix scanner) or serial (like a computer keyboard on a PS/2 connector).

type Key = Index 16
type KeypadState = Vec 16 Bool

data CPUIn = CPUIn
    { cpuInMem :: Word8
    , cpuInFB :: Bit
    , cpuInKeys :: KeypadState
    , cpuInKeyEvent :: Maybe (Bool, Key)
    , cpuInVBlank :: Bool
    }      
    

The output is even less surprising: there's an address line and a data out (write) line for main memory and the video framebuffer.

type VidX = Unsigned 6
type VidY = Unsigned 5

data CPUOut = CPUOut
    { cpuOutMemAddr :: Addr
    , cpuOutMemWrite :: Maybe Word8
    , cpuOutFBAddr :: (VidX, VidY)
    , cpuOutFBWrite :: Maybe Bit
    }
    

So, what is a CPU?

As far as CλaSH is concerned, the CPU is extensionally a circuit converting input signals to output signals, just like any other component:

extensionalCPU :: Signal dom CPUIn -> Signal dom CPUOut
    

The internal CPU state is of no concern at this level. Internally, we can implement the above as a Mealy machine with a state transition function that describes behaviour in any given single cycle:

intensionalCPU :: (CPUState, CPUIn) -> (CPUState, CPUOut)

extensionalCPU = mealy intenstionalCPU initialState
    

As far as a circuit is concerned, a clock cycle is a clock cycle is a clock cycle. If we want to do any kind of sequencing, for example to fetch two-byte instruction opcodes from the byte-indexed main memory in two steps, we need to know in intensionalCPU which step is next. This is why we have the phase field in CPUState, so we can read out what we need to do, and store what we want to do next. For example, in my current version the video framebuffer is bit-indexed (addressed by the 6-bit X and the 5-bit Y coordinate), and there is no DMA to take care of bulk writes; so to implement the instruction that clears the screen, we need to write low to all framebuffer addresses, one by one, from (0, 0) to (63, 31). This requires 2048 cycles, so we need to go through the Phase that clears (0, 0), to the one that clears (0, 1), all the way to (63, 31), before fetching the first byte of the next opcode to continue execution. Accordingly, one of the constructors of Phase stores the (x, y) coordinate of the next bit to clear, and we'll need to add some logic so that if phase = ClearFB (x, y), we emit (x, y) on the cpuOutFBAddr line and Just low on the cpuOutFBWrite line. Blitting proceeds similarly, with two sub-phases per phase: one to read the old value, and one to write back the new value (with the bitmap image xor'd to it)

data Phase
    = Init
    | Fetch1
    | Exec
    | StoreReg Reg
    | LoadReg Reg
    | ClearFB (VidX, VidY)
    | Draw DrawPhase (VidX, VidY) Nybble (Index 8)
    | WaitKeyPress Reg
    | WriteBCD Word8 (Index 3)      
    

So how should we write intensionalCPU? We could do it in direct style, i.e. something like

intensionalCPU (s0, CPUIn{..}) = case phase s of Fetch1 -> let s' = s{ opHi = cpuInMem, pc = succ $ pc s, phase = Exec } out = CPUOut{ cpuOutMemAddr = pc s', cpuOutMemWrite = Nothing , cpuOutFBAddr = minBound, cpuOutFBWrite = Nothing } in (s', out) WaitKeyPress reg -> let s' = case cpuInKeyEvent of Just (True, key) -> s{ registers = replace reg key (registers s), phase = Fetch1 } _ -> s out = CPUOut{ cpuOutMemAddr = pc s', cpuOutMemWrite = Nothing , cpuOutFBAddr = minBound, cpuOutFBWrite = Nothing } in (s', out) -- and lots of other cases as well, of course where s | cpuInVBlank = s0{ timer = fromMaybe 0 $ predIdx $ timer s0 } | otherwise = s0

If you think this is horrible and unreadable and unmaintainable, then yes! I agree! Which is why I've spent most of this RetroChallenge (when not fighting synthesizer crashes) thinking about nicer ways of writing this.

This post is getting long, let's end on this note here. Next time, I am going to explain how far I've gotten so far in this quest for nicely readable, composable descriptions of CPUs.

Back in the game!

22 September 2018 (programming haskell fpga electronics retrochallenge clash)

For most of this week, it seemed I will have to throw in the towel. As I mentioned in my previous entry last Saturday, I ran into what at first seemed like a CλaSH bug. However, further investigation showed that the error message was actually pointing at an internal bug in the Xilinx ISE synthesizer. The same generated VHDL didn't cause any problems when fed into the Yosys open source synthesizer, Altera Quartus, or the newer version of Xilinx Vivado. But the Papilio Pro I'm using is based on the Spartan 6 FPGA, which is not supported by the newer Xilinx tools, so I am stuck with ISE 14.7 from 2013. So the conclusion is, just like all other closed-source proprietary software from FPGA vendors, the Xilinx ISE is simply a piece of shit that falls over under its own weight on perfectly valid VHDL.

I was thinking of ordering a new FPGA board, but I only have until next Wednesday to finish this (I won't be able to put in any work on the last Retrochallenge weekend), so it's not even a given it would get here in time. Also, I'd like to do a bit more research on what board I should get -- on one hand, both Altera and Xilinx have nice, more modern dev boards with good IO ports for my retro-computing-oriented needs, but on the other hand, it feels a bit like throwing good money after bad, since these would still be programmed with proprietary shitty software, with no way forward when (not if!) they break.

Then there's Lattice's ICE40 line which is fully supported by the open source toolchain IceStorm, but the largest ICE40 is still quite small compared to the Spartan 7 or the Cyclone V series; not to mention that even the nicest ICE40 board I could find doesn't have a USB connector on board, so you have to play around with an Arduino and plug jumper wires into this weird connector to get anything working. Also, while I'm ranting, of course the Lattice ICE40 open source toolchain is not from Lattice themselves; instead, its bitstream format had to be reverse-engineered by awesome free software hackers

So anyway, I had a perfectly functioning board betrayed by its software toolchain. I tried some desparate ideas like generating Verilog instead of VHDL or getting rid of the unguarded block statements, but nothing made any difference. Then Thursday night I had an even wilder idea. If the Xilinx ISE is crashing because the generated VHDL is triggering some weird corner case in the synthesizer, then maybe using the same ISE version, but changing the target FPGA model, would get over the hurdle? And that's when I remembered I still have my first ever FPGA board: the Papilio One based on the Spartan 3E. Luckily, the Spartan 3-series is also supported by the 14 series ISE, so the same toolchain can serve both boards.

On Friday morning, I did the necessary changes to my code to target the Papilio One. The clock generator is different between the models, so I needed to replace that; the other difference was that the Spartan 3 doesn't seem to have wide enough blocks for 64-bit arithmetic. This shouldn't be a problem for the CHIP-8, but CλaSH generates code that converts everything into 64 bits. I initially overcame that by post-processing CλaSH's output with sed, but then I discovered that there is a flag -fclash-intwidth to set that properly.

With these changes, I was able to get it through the Xilinx ISE's synthesizer, and all the way through the rest of the pipeline! As before, the code is on GitHub.

And with this, I am where I was supposed to be a week ago at half-time. I probably won't have time to work on this project next weekend since we'll be travelling; this looks like a good time to take inventory of the project.

Older entries:

Entries from all tags