Posts tagged electronics
Composable CPU descriptions in CλaSH, and wrap-up of RetroChallenge 2018/09
30 September 2018 (programming haskell fpga electronics retrochallenge retro clash chip-8)My last post ended with some sample CλaSH code illustrating the spaghetti-ness you can get yourself into if you try to describe a CPU directly as a function (CPUState, CPUIn) -> (CPUState, CPUOut). I promised some ideas for improving that code.
To start off gently, first of all we can give names to some common internal operations to help readability. Here's the code from the previous post rewritten with a small kit of functions:
intensionalCPU (s0, CPUIn{..}) = case phase s of WaitKeyPress reg -> let s' = case cpuInKeyEvent of Just (True, key) -> goto Fetch1 . setReg reg key $ s _ -> s out = def{ cpuOutMemAddr = pc s' } in (s', out) Fetch1 -> let s' = goto Exec s{ opHi = cpuInMem, pc = succ $ pc s } out = def{ cpuOutMemAddr = pc s' } in (s', out) where s | cpuInVBlank = s0{ timer = fromMaybe 0 . predIdx $ timer s0 } | otherwise = s0
Using a State monad
To avoid the possibility of screwing up the threading of the internal state, we can use the State CPUState monad:
Continue reading »CPU modeling in CλaSH
23 September 2018 (programming haskell fpga electronics retrochallenge retro clash chip-8)My entry for RetroChallenge 2018/09 is building a CHIP-8 computer. Previously, I've talked in detail about the video signal generator and the keyboard interface; the only part still missing is the CPU.
The CHIP-8 instruction set
Since the CHIP-8 was originally designed to be an interpreted language run as a virtual machine, some of its instructions are quite high-level. For example, the framebuffer is modified via a dedicated blitting instruction; there is a built-in random number generator; and instructions to manipulate two 60 Hz timers. Other instructions are more in line with what one would expect to see in a CPU, and implement basic arithmetic such as addition or bitwise AND. There is also a generic escape hatch instruction but that doesn't really apply to hardware implementations.
The CPU has 16 generic-purpose 8-bit registers V0…VF; register VF is also used to report flag results like overflow from arithmetic operations, or collision during blitting. Most instructions operate on these general registers. Since the available memory is roughly 4K, these 8-bit registers wouldn't be too useful as pointers. Instead, there is a 12-bit Index register that is used as the implicit address argument to memory-accessing instructions.
For flow control, the program counter needs 12 bits as well; the CHIP-8 is a von Neumann machine. Furthermore, it has CALL / RET instructions backed by a call-only stack (there is no argument passing or local variables).
Modeling the CPU's internal state
We can collect all of the registers described above into a single Haskell datatype. I have also added two 8-bit registers for the high and low byte of the current instruction, but in retrospect it would be enough to just store the high byte, since the low byte is coming from RAM exactly when we need to dispatch on it anyway. The extra phase register is to distinguish between execution phases such as fetching the first byte of the next instruction, or for instructions that are implemented in multiple clock cycles, like clearing the frame buffer (more on that below).
type Addr = Unsigned 12 type Reg = Index 16 data CPUState = CPUState { opHi, opLo :: Word8 , pc, ptr :: Addr , registers :: Vec 16 Word8 , stack :: Vec 24 Addr , sp :: Index 24 , phase :: Phase , timer :: Word8 , randomState :: Unsigned 9 }
I implemented the random number generator as a 9-bit linear-feedback shift register, truncated to its lower 8 bits; this is because a maximal 8-bit LFSR wouldn't generate 0xFF.
lfsr :: Unsigned 9 -> Unsigned 9 lfsr s = (s `rotateR` 1) `xor` b4 where b = fromIntegral $ complement . lsb $ s b4 = b `shiftL` 4
Input and output "pins"
Similar to how a real chip has various pins to interface with other parts, our CPU description will also have multiple inputs and outputs. The input consists of the data lines read from main memory and the framebuffer; the events coming from the keypad, and the keypad state; and the 60 Hz VBlank signal from the video generator. This latter signal is used to implement the timer register's countdown. The keypad's signals are fed into the CPU both as events and statefully; I've decided to do it this way so that only the peripheral interface needs to be changed to accomodate devices that are naturally either parallel (like a keypad matrix scanner) or serial (like a computer keyboard on a PS/2 connector).
type Key = Index 16 type KeypadState = Vec 16 Bool data CPUIn = CPUIn { cpuInMem :: Word8 , cpuInFB :: Bit , cpuInKeys :: KeypadState , cpuInKeyEvent :: Maybe (Bool, Key) , cpuInVBlank :: Bool }
The output is even less surprising: there's an address line and a data out (write) line for main memory and the video framebuffer.
type VidX = Unsigned 6 type VidY = Unsigned 5 data CPUOut = CPUOut { cpuOutMemAddr :: Addr , cpuOutMemWrite :: Maybe Word8 , cpuOutFBAddr :: (VidX, VidY) , cpuOutFBWrite :: Maybe Bit }
So, what is a CPU?
As far as CλaSH is concerned, the CPU is extensionally a circuit converting input signals to output signals, just like any other component:
extensionalCPU :: Signal dom CPUIn -> Signal dom CPUOut
The internal CPU state is of no concern at this level. Internally, we can implement the above as a Mealy machine with a state transition function that describes behaviour in any given single cycle:
intensionalCPU :: (CPUState, CPUIn) -> (CPUState, CPUOut) extensionalCPU = mealy intenstionalCPU initialState
As far as a circuit is concerned, a clock cycle is a clock cycle is a clock cycle. If we want to do any kind of sequencing, for example to fetch two-byte instruction opcodes from the byte-indexed main memory in two steps, we need to know in intensionalCPU which step is next. This is why we have the phase field in CPUState, so we can read out what we need to do, and store what we want to do next. For example, in my current version the video framebuffer is bit-indexed (addressed by the 6-bit X and the 5-bit Y coordinate), and there is no DMA to take care of bulk writes; so to implement the instruction that clears the screen, we need to write low to all framebuffer addresses, one by one, from (0, 0) to (63, 31). This requires 2048 cycles, so we need to go through the Phase that clears (0, 0), to the one that clears (0, 1), all the way to (63, 31), before fetching the first byte of the next opcode to continue execution. Accordingly, one of the constructors of Phase stores the (x, y) coordinate of the next bit to clear, and we'll need to add some logic so that if phase = ClearFB (x, y), we emit (x, y) on the cpuOutFBAddr line and Just low on the cpuOutFBWrite line. Blitting proceeds similarly, with two sub-phases per phase: one to read the old value, and one to write back the new value (with the bitmap image xor'd to it)
data Phase = Init | Fetch1 | Exec | StoreReg Reg | LoadReg Reg | ClearFB (VidX, VidY) | Draw DrawPhase (VidX, VidY) Nybble (Index 8) | WaitKeyPress Reg | WriteBCD Word8 (Index 3)
So how should we write intensionalCPU? We could do it in direct style, i.e. something like
If you think this is horrible and unreadable and unmaintainable, then yes! I agree! Which is why I've spent most of this RetroChallenge (when not fighting synthesizer crashes) thinking about nicer ways of writing this.
This post is getting long, let's end on this note here. Next time, I am going to explain how far I've gotten so far in this quest for nicely readable, composable descriptions of CPUs.
Back in the game!
22 September 2018 (programming haskell fpga electronics retrochallenge retro clash chip-8)For most of this week, it seemed I will have to throw in the towel. As I mentioned in my previous entry last Saturday, I ran into what at first seemed like a CλaSH bug. However, further investigation showed that the error message was actually pointing at an internal bug in the Xilinx ISE synthesizer. The same generated VHDL didn't cause any problems when fed into the Yosys open source synthesizer, Altera Quartus, or the newer version of Xilinx Vivado. But the Papilio Pro I'm using is based on the Spartan 6 FPGA, which is not supported by the newer Xilinx tools, so I am stuck with ISE 14.7 from 2013. So the conclusion is, just like all other closed-source proprietary software from FPGA vendors, the Xilinx ISE is simply a piece of shit that falls over under its own weight on perfectly valid VHDL.
I was thinking of ordering a new FPGA board, but I only have until next Wednesday to finish this (I won't be able to put in any work on the last Retrochallenge weekend), so it's not even a given it would get here in time. Also, I'd like to do a bit more research on what board I should get -- on one hand, both Altera and Xilinx have nice, more modern dev boards with good IO ports for my retro-computing-oriented needs, but on the other hand, it feels a bit like throwing good money after bad, since these would still be programmed with proprietary shitty software, with no way forward when (not if!) they break.
Then there's Lattice's ICE40 line which is fully supported by the open source toolchain IceStorm, but the largest ICE40 is still quite small compared to the Spartan 7 or the Cyclone V series; not to mention that even the nicest ICE40 board I could find doesn't have a USB connector on board, so you have to play around with an Arduino and plug jumper wires into this weird connector to get anything working. Also, while I'm ranting, of course the Lattice ICE40 open source toolchain is not from Lattice themselves; instead, its bitstream format had to be reverse-engineered by awesome free software hackers
So anyway, I had a perfectly functioning board betrayed by its software toolchain. I tried some desparate ideas like generating Verilog instead of VHDL or getting rid of the unguarded block statements, but nothing made any difference. Then Thursday night I had an even wilder idea. If the Xilinx ISE is crashing because the generated VHDL is triggering some weird corner case in the synthesizer, then maybe using the same ISE version, but changing the target FPGA model, would get over the hurdle? And that's when I remembered I still have my first ever FPGA board: the Papilio One based on the Spartan 3E. Luckily, the Spartan 3-series is also supported by the 14 series ISE, so the same toolchain can serve both boards.
On Friday morning, I did the necessary changes to my code to target the Papilio One. The clock generator is different between the models, so I needed to replace that; the other difference was that the Spartan 3 doesn't seem to have wide enough blocks for 64-bit arithmetic. This shouldn't be a problem for the CHIP-8, but CλaSH generates code that converts everything into 64 bits. I initially overcame that by post-processing CλaSH's output with sed, but then I discovered that there is a flag -fclash-intwidth to set that properly.
With these changes, I was able to get it through the Xilinx ISE's synthesizer, and all the way through the rest of the pipeline! As before, the code is on GitHub.
And with this, I am where I was supposed to be a week ago at half-time. I probably won't have time to work on this project next weekend since we'll be travelling; this looks like a good time to take inventory of the project.
- I am very happy with how the video and keyboard peripheral interfaces turned out; the CλaSH code is nice and clean.
- I still need to write a blog post about the design I came up with for implementing the CPU. I'm convinced it should scale to more complex processors; but of course the proof of that pudding will be in implementing a real retro-CPU like a 6502 or a Z80.
- The font ROM is not hooked up yet; I plan to finish that tomorrow.
- The plan was to get it working first, and make it more performant later. I'm afraid I won't have time before the RetroChallenge finishes to implement improvements. The biggest deficiency here is that the framebuffer is accessed one bit at a time, so clearing the screen (a single opcode in CHIP-8!) takes 64⨯32 = 2048 cycles!
- The signal-less high-level simulation makes debugging the CPU implementation very convenient and fast. Being able to run the CHIP-8 machine in real-time, with interactive graphics and input, is immensely satisfying.
- I prefer the code I was able to write in CλaSH, when compared to Kansas Lava. Of course, I don't know how much of that is simply because this is the second time I was implementing these circuits.
- There's a lot to be improved in CλaSH itself. Significant improvements to compilation time are already coming, which will be welcome given that even this tiny CHIP-8 circuit is now taking about 5 minutes to compile; this issue is exacerbated by the fact that compilation cannot use multiple cores. The other problem is that CλaSH generates very inefficient HDL, putting big pressure on the vendor-specific synthesis tools. As we've seen, this pressure can be too big for crappy software like the old version of the Xilinx ISE I am forced to use with my FPGA.
- Writing these posts takes a lot of time! I love reading the posts of other RetroChallenge entrants, but there is no way I could have written more frequent updates. I wouldn't have had time to do the actual work then!
Very high-level simulation of a CλaSH CPU
15 September 2018 (programming haskell fpga electronics retrochallenge retro clash chip-8)Initially, I wanted to talk this week about how I plan to structure the CλaSH description of the CHIP-8 CPU. However, I'm postponing that for now, because I ran into what seems like a CλaSH bug, and I want to see my design run on real hardware before I describe it in too much detail. So instead, here's a post on how I am testing in software.
CPUs as Mealy machines
After stripping away all the nice abstractions that I am using in my description of the CPU, what remains is a Mealy machine, which simply means it is described by a state transition and output function s -> i -> (s, o). If that looks familiar, that is not a coincidence: this is, of course, just one argument flip away from the Kleisli category of the State s monad. Just think of it as being either this or that, depending on which one you have more intuition about. A lot more on this in my upcoming blogpost.
My CHIP-8 CPU is currently described by a Mealy machine over these types:
data CPUIn = CPUIn { cpuInMem :: Word8 , cpuInFB :: Bit , cpuInKeys :: KeypadState , cpuInKeyEvent :: Maybe (Bool, Key) , cpuInVBlank :: Bool } data Phase = Init | Fetch1 | Exec | StoreReg Reg | LoadReg Reg | ClearFB (VidX, VidY) | Draw DrawPhase (VidX, VidY) Nybble (Index 8) | WaitKeyPress Reg data CPUState = CPUState { opHi, opLo :: Word8 , pc, ptr :: Addr , registers :: Vec 16 Word8 , stack :: Vec 24 Addr , sp :: Index 24 , phase :: Phase , timer :: Word8 } data CPUOut = CPUOut { cpuOutMemAddr :: Addr , cpuOutMemWrite :: Maybe Word8 , cpuOutFBAddr :: (VidX, VidY) , cpuOutFBWrite :: Maybe Bit } cpu :: CPUIn -> State CPUState CPUOut
Running the CPU directly
Note that all the types involved are pure: signal inputs are turned into pure input by CλaSH's mealy function, and the pure output is similarly turned into a signal output. But what if we didn't use mealy, and ran cpu directly, completely sidestepping CλaSH, yet still running the exact same implementation?
That is exactly what I am doing for testing the CPU. By running its Mealy function directly, I can feed it a CPUIn and consume its CPUOut result while interacting with the world — completely outside the simulation! The main structure of the code that implements the above looks like this:
stateful :: (MonadIO m) => s -> (i -> State s o) -> IO (m i -> (o -> m a) -> m a) stateful s0 step = do state <- newIORef s0 return $ \mkInput applyOutput -> do inp <- mkInput out <- liftIO $ do s <- readIORef state let (out, s') = runState (step inp) s writeIORef state s' return out applyOutput out
Hooking it up to SDL
I hooked up the main RAM and the framebuffer signals to IOArrays, and wrote some code that renders the framebuffer's contents into an SDL surface and translates keypress events. And, voilà: you can run the CHIP-8 computer, interactively, even allowing you to use good old trace-based debugging (which is thankfully removed by CλaSH during VHDL generation so can even leave them in). The below screencap shows this in action: :main is run from clashi and starts the interactive SDL program, with no Signal types involved.
PS/2 keyboard interface in CλaSH
8 September 2018 (programming haskell fpga electronics retrochallenge retro clash chip-8)This week, most of my weekday evenings were quite busy, but I did manage to squeeze in a PS/2 keyboard interface in small installments; then today I went down the rabbit hole of clearing up some technical debt I've accumulated so far by not really looking into how CλaSH handled clock domains.
PS/2 signals
(Just to preempt any possible confusion, we're talking about the peripheral port of the IBM Personal System/2 introduced in the late '80s, not the Playstation 2 console)
The same way VGA is ideal for hobbyist video signal generation since it is both simple and ubiquitous, PS/2 is the go-to interface for keyboards. It is a two-directional, synchronous serial protocol with a peripheral-generated clock in the 10-20 KHz range. Any PC keyboard old enough will support it. One important caveat, though, is that the common USB-to-PS/2 adapters don't actually convert signals, and so they only work with keyboards that were originally designed with that conversion in mind. Here, we are only concerned with device to host communication; it is also possible to communicate in the other direction to e.g. change the Num Lock LED's state.
"Synchronous" here means that there is a separate clock line, unlike in the family of asynchronous serial protocols that were used in RS-232; it is this latter one that is usually meant as "serial communication" when unqualified. In synchronous serial communication, everything happens on the clock ticks; in asynchronous communication, there is no separate clock signal, so the data signal has enough structure that the two communicating sides can agree on the exact framing.
Turning the data line of PS/2 into a stream of bits is a straightforward process: the standard prescribes sampling the data line on the falling edge of the clock line. We also apply an 8-cycle debouncer for good measure, just because some pages on the Internet suggest it:
data PS2 dom = PS2 { ps2Clk :: Signal dom Bit , ps2Data :: Signal dom Bit } samplePS2 :: (HiddenClockReset dom gated synchronous) => PS2 dom -> Signal dom (Maybe Bit) samplePS2 PS2{..} = enable <$> isFalling low ps2Clk' <*> ps2Data' where ps2Clk' = debounce d3 low ps2Clk ps2Data' = debounce d3 low ps2Data
The second step in that pipeline is to shift in the bits, 11 at a time. A leading low bit signals the start of a packet; eight data bits and one parity bit follow; the packet is finished with one high bit. Of course, only the eight data bits are presented externally. I use a WriterT (Last Word8) (State PS2State) monad to implement this logic, and then turn that into a CλaSH Mealy machine, in a pattern that I plan to use a lot in implementing the CHIP-8 CPU later:
data PS2State = Idle | Bit Word8 (Index 8) | Parity Word8 | Stop (Maybe Word8) decodePS2 :: (HiddenClockReset dom gated synchronous) => Signal dom (Maybe Bit) -> Signal dom (Maybe Word8) decodePS2 = flip mealyState Idle $ \bit -> fmap getLast . execWriterT . forM_ bit $ \bit -> do state <- get case state of Idle -> do when (bit == low) $ put $ Bit 0 0 Bit x i -> do let x' = shiftInLeft bit x put $ maybe (Parity x') (Bit x') $ succIdx i Parity x -> do let checked = bit /= parity x put $ Stop $ enable checked x Stop x -> do when (bit == high) $ tell $ Last x put Idle
A quick change in hardware
To be able to try out on real hardware what I had at this point, I had to leave the trusty LogicStart Mega-Wing of my Papilio Pro, and instead switch over to the Arcade since that one has a PS/2 port. There are actually two ports on it, so that one could connect e.g. a keyboard and a mouse.
This change involved rewriting my UCF file since the pinout is different from the LogicStart. Also, the Arcade has 4+4+4 bits of VGA color output instead of the previous 3+3+2; of course with the black & white graphics of the CHIP-8, that color depth is all going to waste with this project.
PS/2 scan codes
Unfortunately, it is not enough to shift in the PS/2 data into a byte: we also have to make sense of that byte. While this could be as straightforward as interpreting each byte as the ASCII code of the character on the key pressed, the reality is not this simple. Keyboards emit so-called scan codes, where one or several bytes can encode a single keypress or key release event (see here for example for a list of some keyboard scan codes). I haven't been able to come up with an elegant way of handling this yet, so for now I just have some messy Mealy machine that returns a 16-bit code, where the high byte is zero for one-byte codes. You can see in the comment my frustration at both the implementation and the spec itself:
data KeyEvent = KeyPress | KeyRelease deriving (Generic, NFData, Eq, Show) data ScanCode = ScanCode KeyEvent Word16 deriving (Generic, NFData, Eq, Show) data ScanState = Init | Extended Word8 | Code KeyEvent Word8 -- TODO: rewrite this for clarity. -- All it does is it parses 0xE0 0xXX into an extended (16-bit) code, and everything else into -- an 8-bit code. The only complication is that the key release marker 0xF0 is always the -- second-to-last byte. Who does that?!? parseScanCode :: (HiddenClockReset dom gated synchronous) => Signal dom (Maybe Word8) -> Signal dom (Maybe ScanCode) parseScanCode = flip mealyState Init $ \raw -> fmap getLast . execWriterT . forM_ raw $ \raw -> do let finish ev ext = do tell $ Last . Just $ ScanCode ev $ fromBytes (ext, raw) put Init state <- get case state of Init | raw == 0xe0 -> put $ Extended raw | raw == 0xf0 -> put $ Code KeyRelease 0x00 | otherwise -> finish KeyPress 0x00 Extended ext | raw == 0xf0 -> put $ Code KeyRelease ext | otherwise -> finish KeyPress ext Code ev ext -> finish ev ext where fromBytes :: (Word8, Word8) -> Word16 fromBytes = unpack . pack
Driving a CHIP-8 pixel around
With the video output from last time and the keyboard from this post, but no CPU yet, our options to put everything together into something impressive are somewhat limited. I ended up showing a single CHIP-8 pixel that can be moved around in the CHIP-8 screen space with the arrow keys; this results in something tangible without needing a CPU or even a framebuffer yet. Note how well the code lends itself to using applicative do syntax:
VGADriver{..} = vgaDriver vga640x480at60 ps2 = decodePS2 $ samplePS2 PS2{..} (dx, dy) = unbundle $ do key <- parseScanCode ps2 pure $ case key of Just (ScanCode KeyPress 0xe075) -> (0, -1) -- up Just (ScanCode KeyPress 0xe072) -> (0, 1) -- down Just (ScanCode KeyPress 0xe06b) -> (-1, 0) -- left Just (ScanCode KeyPress 0xe074) -> (1, 0) -- right _ -> (0, 0) pixel = do x <- fix $ register 0 . (+ dx) y <- fix $ register 0 . (+ dy) x0 <- (chipX =<<) <$> vgaX y0 <- (chipY =<<) <$> vgaY pure $ case (,) <$> x0 <*> y0 of Just (x0, y0) -> (x0, y0) == (x, y) _ -> False
But wait! There's more!
In reality, after getting the PS/2 decoder working, but before hooking it up to the scan code parser, I thought I'd use the serial IO on the Papilio Pro to do a quick test by just transmitting the scan codes straight away as they come out of the decoder. This has then prompted me to clean up a wart on my UART implementation: they took the clock rate as an extra term-level argument to compute the clock division they need to do:
tx :: (HiddenClockReset domain gated synchronous) => Word32 -> Word32 -> Signal domain (Maybe Word8) -> TXOut domain tx clkRate serialRate inp = TXOut{..} where (txReady, txOut) = unbundle $ mealyState (tx0 $ clkRate `div` serialRate) (0, Nothing) inp
This bothered me because the clock domain already specifies the clock rate, at the type level. Trying to remove this redundancy has led me down a rabbit hole of what I believe is a CλaSH bug; but at least I managed to work around that for now (until I find an even better way).
This, in turn, forced me to use a proper clock domain with the correct clock period setting in my CHIP-8 design:
-- | 25.175 MHz clock, needed for the VGA mode we use. -- CLaSH requires the clock period to be specified in picoseconds. type Dom25 = Dom "CLK_25MHZ" (1000000000000 `Div` 25175000)
But then, this allowed me to start putting pixel clock specifications into the type of VGATimings, allowing me to statically enforce that the clock domain in which the VGA signal generator runs is at exactly the right frequency:
vgaDriver :: (HiddenClockReset dom gated synchronous, KnownNat w, KnownNat h) => (dom ~ Dom s ps, ps ~ (1000000000000 `Div` rate)) => VGATimings rate w h -> VGADriver dom w h -- | VGA 640*480@60Hz, 25.175 MHz pixel clock vga640x480at60 :: VGATimings 25175000 10 10 vga640x480at60 = VGATimings { vgaHorizTiming = VGATiming 640 16 96 48 , vgaVertTiming = VGATiming 480 11 2 31 }
Older posts:
- 2 September 2018: CλaSH video: VGA signal generator
- 29 August 2018: RetroChallenge 2018: CHIP-8 in CλaSH
- 12 May 2017: Rust on AVR: Beyond Blinking
- 28 December 2015: Hacks of 2015
- 2 March 2015: Initial version of my Commodore PET
- 29 March 2014: My first computer
- 19 January 2013: A Brainfuck CPU in FPGA (3 comments)
- 3 March 2011: Gettó-szakállvágó