
All entries

A tiny computer for Tiny BASIC

17 November 2020 (programming haskell clash fpga)

Just a quick post to peel back the curtain a bit on a Clash example I posted to Twitter. My tweet in question shows the following snippet, claiming it is "a complete Intel 8080-based FPGA computer that runs Tiny BASIC":

logicBoard
    :: (HiddenClockResetEnable dom)
    => Signal dom (Maybe (Unsigned 8)) -> Signal dom Bool -> Signal dom (Maybe (Unsigned 8))
logicBoard inByte outReady = outByte
  where
    CPUOut{..} = intel8080 CPUIn{..}

    interruptRequest = pure False

    (dataIn, outByte) = memoryMap _addrOut _dataOut $ do
        matchRight $ do
            mask 0x0000 $ romFromFile (SNat @0x0800) "_build/intel8080/image.bin"
            mask 0x0800 $ ram0 (SNat @0x0800)
            mask 0x1000 $ ram0 (SNat @0x1000)
        matchLeft $ do
            mask 0x10 $ port $ acia inByte outReady
    

I got almost a hundred likes on this tweet, which isn't too bad for a topic as niche as hardware design with Haskell and Clash. Obviously, the above tweet was meant as a brag, not as a detailed technical description; but given the traction it got, I thought I might as well try to expand a bit on it.

Now, the background to all this is that I'm working on a book on retrocomputing with Clash. So in this post, I will try to condense the 75k words of text and the 5k lines of code that I have written so far; it's going to be more of an extended table of contents than a Cliffs Notes.

Tiny BASIC is an interactive BASIC interpreter; these days, we would call it a REPL. The computer above runs one of the original, Intel 8080-based versions of Tiny BASIC as its firmware. When the computer is turned on, it boots straight into a BASIC prompt, just like the Commodore PETs and Apple II's of yesteryear. The software assumes there is a peripheral controller connected to certain output ports of the CPU; all IO is done via a stream of bytes written to and read from this controller.

Input and output

Let's start with the interface. We see that logicBoard has two input signals and one output signal. The Maybe (Unsigned 8) input is a byte coupled with an "input ready" line; in other words, the value is Just a byte if there is new input coming in in a given clock cycle, and Nothing otherwise. The output of the same type is, unsurprisingly, the output of the whole computer in the same format. The extra Bool input is for backpressure: some output peripherals might need time to process an output byte before they are ready for the next one.

We will need to connect these incoming and outgoing bytes to the outside world somehow, and in a real hardware implementation, the IO controller actually contained a UART so that input and output was serialized into a stream of bits. However, by uncoupling the UART from the ACIA implementation, and exposing input and output as byte events, we make it easy to hook up alternative communication media:
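
For instance, here is a minimal sketch of driving logicBoard purely in software with Clash's simulate (my own example, not from the book; sampleBoard is a made-up name, the System domain is assumed, and backpressure is simply ignored by keeping outReady at True):

import Clash.Prelude
import qualified Data.List as L
import Data.Char (chr, ord)
import Data.Maybe (catMaybes)

-- Feed one line of input followed by idle cycles, sample the first 20,000
-- cycles of output, and decode the output bytes as ASCII.
sampleBoard :: String -> String
sampleBoard line =
    L.map (chr . fromIntegral) . catMaybes . L.take 20000 $
        simulate @System board ins
  where
    ins = [ (Just (fromIntegral (ord c)), True) | c <- line ]
          <> L.repeat (Nothing, True)
    board inp = let (inByte, outReady) = unbundle inp
                in logicBoard inByte outReady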

The CPU

The Intel 8080 core we are using is also, of course, written in Clash. I can't really go through its implementation in this small space; the chapter where we construct it from scratch is at 13k words and builds heavily on techniques introduced in earlier chapters describing a CPU that uses Brainfuck as its machine code and a CHIP-8 machine. In this post, we'll just look at its interface, to see how it fits into the logicBoard:

declareBareB [d|
  data CPUIn = CPUIn
    { dataIn :: Maybe Value
    , interruptRequest :: Bool
    } |]

declareBareB [d|
  data CPUOut = CPUOut
      { _addrOut :: Maybe (Either Port Addr)
      , _dataOut :: Maybe Value
      , _interruptAck :: Bool
      , _halted :: Bool
      } |]

type Signals dom b = b Covered (Signal dom)
type Pure b = b Bare Identity
type Partial b = Barbie (b Covered) Last

intel8080 :: (HiddenClockResetEnable dom) => Signals dom CPUIn -> Signals dom CPUOut
    

We use the wonderful Barbies library for CPUIn and CPUOut so that we can switch between a record of signals and a signal of a record. Outwards, Signals dom CPUOut gives easy access to each output pin separately; but internally, the code describing a single clock cycle's state transition creates a full Pure CPUOut inside a Writer monad, with composable Partial CPUOut-producing fragments.

Looking back at logicBoard, we can see that it keeps interruptRequest at False and feeds the dataIn bus from the result of the memory mapper/address decoder. On the output side, addrOut and dataOut are, unsurprisingly, fed into the address decoder; the other pins of CPUOut are not used in this particular computer.

Address decoding

The Intel 8080 has a 16-bit address bus that it uses both for accessing memory and for communication with up to 256 peripheral controllers. The actual details of how a real Intel 8080 signals memory vs. IO port access are quite baroque; but because we are building full computers using an Intel 8080 ISA-compatible CPU, we don't aim for pin compatibility. Instead, we capture the morally right addressing interface of the 8080 by using the type Either Port Addr (or, with the type synonyms resolved, Either (Unsigned 8) (Unsigned 16)) for addrOut. In this particular computer, the original Tiny BASIC firmware prescribes the memory layout: 2 KB of ROM at 0x0000 holding the interpreter, with RAM starting at 0x0800.

Putting it all together, we need 6 KB of RAM from 0x0800 to 0x1fff. We split it into two parts for easier address decoding: 2 KB from 0x0800 to 0x0fff, and 4 KB from 0x1000 to 0x1fff:

   .--------.           
   | ROM 2KB|<---.      
   `--------'    |      .------.                                   
                 |      |      |     .------.                      
   .--------.    |      |      |     |      |<--------<( inByte    
   | RAM 2KB|<---|----->| 8080 |<--->| ACIA |                      
   `--------'    |      |      |     |      |>-------->( outByte   
                 |      |      |     |      |<--------<( outReady  
   .--------.    |      `------'     `------'                      
   | RAM 4KB|<---'      
   `--------'           
    

We want to write this, and exactly this, in logicBoard, and the abstraction we use for it is memoryMap. In an address decoder, we start with an address of the whole memory space, and iteratively cut it down into smaller sub-spaces, until we get to something small enough that it maps to a single component. In this case, we start with Either Port Addr, and cut it down into e.g. an Unsigned 11 to address a 2K memory component. The dataOut writes and the dataIn reads are restricted along the address decoding, so that the write request only goes to the selected component, and the read result is taken from that as well.

Under the hood, the memory map description uses a Reader of the address and the write, and a Writer of the read result:

newtype Addressing dom addr dat a = Addressing
    { unAddressing :: ReaderT
                      (Signal dom (Maybe addr), Signal dom (Maybe dat))
                      (Writer (Ap (Signal dom) (First dat)))
                      a
    }
    deriving newtype (Functor, Applicative, Monad)

memoryMap
    :: Signal dom (Maybe addr)
    -> Signal dom (Maybe dat)
    -> Addressing dom addr dat a
    -> (Signal dom (Maybe dat), a)
    

Why the Maybe in the address? Because we want to know if the CPU doesn't need memory access in a given cycle at all; this is useful when memory is shared between the CPU and other components. We can resolve memory access contention by prioritizing some peripherals (returning Nothing in dataIn, forcing the CPU to stall) and de-prioritizing others (by only allowing them to access their memory component when the addressing of that sub-space is Nothing).
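
To illustrate the idea (a toy example of my own, not code from the book), an arbiter between a high-priority peripheral and the CPU can be as simple as a pure function that we then lift over the address signals:

-- The first requester wins; the Bool says whether the forwarded request
-- belongs to the CPU, i.e. whether the read result should be routed back to it
-- (otherwise the CPU sees Nothing and stalls for that cycle).
arbitrate :: Maybe addr -> Maybe addr -> (Maybe addr, Bool)
arbitrate (Just video) _   = (Just video, False)
arbitrate Nothing      cpu = (cpu, True)

Lifted over signals with liftA2 arbitrate videoAddr cpuAddr, this gives the shared component a single request stream, while the de-prioritized CPU only makes progress when the bus is free.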

The implementation of memoryMap and its combinators is not particularly interesting; mask routes based on the high bits and leaves the low bits for the sub-space to handle:

mask
    :: (KnownNat k, KnownNat n)
    => (HiddenClockResetEnable dom)
    => Unsigned (n + k)
    -> Addressing dom (Unsigned k)       dat a
    -> Addressing dom (Unsigned (n + k)) dat a
    
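For intuition, here is what that routing means with concrete sizes (a standalone toy illustration of mine, not the library code): a 2 KB component mapped at 0x0800 is selected when the top five address bits are 00001, and it only ever sees the low 11 bits.

import Clash.Prelude
import Data.Bits ((.&.))

-- Decode the 2 KB RAM at 0x0800..0x0fff: match on the high bits, keep the low 11.
decodeRam2K :: Unsigned 16 -> Maybe (Unsigned 11)
decodeRam2K addr
    | addr .&. 0xf800 == 0x0800 = Just (fromIntegral (addr .&. 0x07ff))
    | otherwise                 = Nothing
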

ram0 and romFromFile simply wrap Clash memory primitives (zero-initialized synchronous RAM, and synchronous ROM initialized from a bitfile image) into Addressing-compliant forms, and port hooks up an IO peripheral that responds to PortCommands:

type Port dom addr dat a
    = Signal dom (Maybe (PortCommand addr dat)) -> (Signal dom (Maybe dat), a)

port
    :: (HiddenClockResetEnable dom, NFDataX dat)
    => Port dom addr dat a
    -> Addressing dom addr dat a      
    

This is also where we find out where the second component of memoryMap's return value comes from: the whole point of IO peripherals is that they can have connections, and as such, output signals, going to other parts of the circuit (or directly to the outside world), not just the CPU's data bus.

The full code

The full Clash source code of the Tiny BASIC computer, including the Intel 8080 core, is available on GitHub. There is no user documentation at all yet; I've been using all the time and energy I can put into hacking for writing my book instead of tidying up the related repositories. So that's still something I need to get around to eventually; pull requests are obviously welcome!

A "very typed" container for representing microcode

15 September 2020 (programming haskell clash)

I've been thinking a bit about describing microcode lately. My motivation was the Intel 8080-compatible CPU I've been building for my upcoming Clash book. As with everything else for that book, the challenge is not in getting it to work — rather, it is in writing the code as close as possible to the way you would want to explain it to another person.

So in the context of a microprocessor as simple as the Intel 8080 and using synchronous RAM, I think of the microcode as a sequence of steps, where each step consists of an internal CPU state transition, and a memory read or write request. For example, the machine instruction 0x34 (mnemonic INR M) increments by one the byte pointed to by the register pair HL. In my core, the micro-architecture has an 8-bit value- and a 16-bit address-register; the latter can be used for memory addressing. To use something else for addressing, you need to load it into the address buffer first. So the steps to implement INR M are:

  1. Get value of HL register pair into the address buffer
  2. Indirect read via the address buffer into the value buffer
  3. Replace value buffer's contents with its increment
  4. Update the status register (flags like "was the latest value zero")
  5. Indirect write

However, memory access happens on the transition between cycles, so the final write will not be its own step; rather, it happens as the postamble of step 4. Similarly, the correct address will have to be put on the address pins in the preamble of step 2 for the load to work out:

  1. Get HL into address buffer
  2. Set address to address buffer's contents for reading
  3. Store read value from data-in into value buffer
  4. Increment value buffer
  5. Update status register
  6. Set address to address buffer's contents for writing

What makes this tricky is that on one hand, we want to describe preambles as part of their respective step, but of course for the implementation it is too late to process them when we get to that step. So I decided to write out the microcode as a sequence of triplets, corresponding to the preamble, the state transition, and the postamble, and then transform it into a format where preambles are attached to the previous step:

[ (Nothing,       Get2 rHL,          Nothing)
, (Just Indirect, ReadMem,           Nothing)
, (Nothing,       ALU ADD Const0x01, Nothing)
, (Nothing,       UpdateFlags,       Just Indirect)
]
    

Here, Indirect addressing means setting the address pins from the address buffer (as opposed to, e.g. the program counter); if it is in the postamble (i.e. write) position, it also means the write-request pin should be asserted.

So this is what the microcode developer writes, but then we can transform it into a format that consists of a state transition paired with the addressing:

(Nothing,
[ (Get2 rHL,          Just (Left Indirect))
, (ReadMem,           Nothing)
, (ALU ADD Const0x01, Nothing)
, (UpdateFlags,       Just (Right Indirect))
])
    

So we're done, without having done anything interesting enough to warrant a blog post.

Or are we?

Disallowing memory addressing conflicts

Note that in the format we can actually execute, the addressing at each step is either a Left read address, or a Right write address (or Nothing at all). But what if we had two subsequent micro-steps, where the first one has a write request in its postamble, and the second one has a read request in its preamble? We are describing a CPU more than 40 years old, it is to be connected to single-port RAM, so we can't do read and write at the same time. This constraint is correctly captured by the Maybe (Either Read Write) type of memory requests in the normalized form, but it is not enforced by our naïve [(Maybe Read, Transition, Maybe Write)] type for what the microcode developer writes.

So this is what I set out to solve: to give an API for writing microcode that has self-contained steps including the read addressing, but still statically disallows conflicting writes and reads from subsequent steps. We start by going full Richard Eisenberg and lifting the memory addressing directives to the type level using singletons. While we're at it, let's also turn on Haskell 98 mode:

{-# LANGUAGE DataKinds, PolyKinds, ConstraintKinds, GADTs, FlexibleContexts #-}
{-# LANGUAGE TypeOperators, TypeFamilies, TypeApplications, ScopedTypeVariables #-}
{-# LANGUAGE StandaloneDeriving, DeriveFunctor #-}

data Step (pre :: Maybe a) (post :: Maybe b) t where
    Step :: Sing pre -> t -> Sing post -> Step pre post t
deriving instance Functor (Step pre post)
    

The plan, then, is to do enough type-level magic to only allow neighbouring Steps if at most one of the first post- and the second preamble is a type-level Just index.

The operations we want to support on microcode fragments are cons-ing a new Step and appending fragments. For the first one, we need to check that the postamble of the new Step is compatible with the first preamble of the existing fragment; for the latter, we need the same check between the last postamble of the first fragment and the first preamble of the second fragment. First, let's codify what "compatible" means here:

type family Combine (post :: Maybe b) (pre :: Maybe a) :: Maybe (Either a b) where
    Combine Nothing Nothing = Nothing
    Combine (Just post) Nothing = Just (Right post)
    Combine Nothing (Just pre) = Just (Left pre)
    

Importantly, there is no clause for Combine (Just post) (Just pre).

Getting dizzy with the thin air of the type level? Let's leave ourselves a thread leading back to the term level:

combine
    :: forall a b (post :: Maybe b) (pre :: Maybe a).
       (SingKind a, SingKind b, SingI (Combine post pre))
    => Sing post -> Sing pre -> Demote (KindOf (Combine post pre))
combine _ _ = demote @(Combine post pre)
    

(This post is not sponsored by Singpost BTW).

Cons- and append-able fragments

For the actual fragments, we can store them internally almost in the normalized format, i.e. as a term-level list of (Maybe a, [(t, Maybe (Either a b))]). Almost, but not quite, because the first a and the last b need to appear in the index, to be able to give a restricted type to cons and append. So instead of storing them in the list proper, we will store them as separate singletons:

data Ends a b
    = Empty
    | NonEmpty (Maybe a) (Maybe b)

data Amble (ends :: Ends a b) t where
    End :: Amble Empty t
    More
        :: forall (a0 :: Maybe a) (bn :: Maybe b) n t. ()
        => Sing a0
        -> [(t, Demote (Maybe (Either a b)))]
        -> t
        -> Sing bn
        -> Amble (NonEmpty a0 bn) t
deriving instance Functor (Amble ends)      
    

Note that we need a special Empty index value for End instead of just NonEmpty Nothing Nothing, because starting with an empty Amble, the first cons needs to change both the front-end preamble and the back-end postamble, whereas later cons operators should only change the front-end.

type family Cons (b1 :: Maybe b) (ends :: Ends a b) where
    Cons b1 Empty = b1
    Cons b1 (NonEmpty a1 bn) = bn
    

We can now try writing the term-level cons. The first cons is easy, because there is no existing front-end to check compatibility with:

cons
    :: forall (a0 :: Maybe a) b1 (ends :: Ends a b) t. ()
    => Step a0 b1 t -> Amble ends t -> Amble (NonEmpty a0 (Cons b1 ends)) t
cons (Step a0 x b1) End = More a0 [] x b1
cons (Step a0 x b1) (More a1 xs xn bn) = More a0 ((x, _):xs) xn bn
    

We get into trouble when trying to fill in the hole in the cons to a non-empty Amble. And we should be, because nowhere in the type of cons so far have we ensured that b1 is compatible with the front-end of ends. We will have to use another type family for that, to pattern-match on Empty and NonEmpty ends:

    
type family CanCons (b1 :: Maybe b) (ends :: Ends a b) :: Constraint where
    CanCons b1 Empty = ()
    CanCons (b1 :: Maybe b) (NonEmpty a1 bn :: Ends a b) =
        (SingKind a, SingKind b, SingI (Combine b1 a1))
    

Unsurprisingly, the constraints needed to be able to cons are exactly what we need to fill the hole with the term-level value of combine b1 a1:

cons
    :: forall (a0 :: Maybe a) b1 (ends :: Ends a b) t. (CanCons b1 ends)
    => Step a0 b1 t -> Amble ends t -> Amble (NonEmpty a0 (Cons b1 ends)) t
cons (Step a0 x b1) End = More a0 [] x b1
cons (Step a0 x b1) (More a1 xs xn bn) = More a0 ((x, combine b1 a1):xs) xn bn
    

Now we are cooking with gas: we can re-use this idea to implement append by ensuring we CanCons the first fragment's backend to the second fragment:

type family CanAppend (ends1 :: Ends a b) (ends2 :: Ends a b) :: Constraint where
    CanAppend Empty ends2 = ()
    CanAppend (NonEmpty a1 bn) ends2 = CanCons bn ends2

type family Append (ends1 :: Ends a b) (ends2 :: Ends a b) where
    Append Empty ends2 = ends2
    Append ends1 Empty = ends1
    Append (NonEmpty a0 bn) (NonEmpty an bm) = NonEmpty a0 bm

append :: (CanAppend ends1 ends2) => Amble ends1 t -> Amble ends2 t -> Amble (Append ends1 ends2) t
append End ys = ys
append (More a0 xs xn bn) End = More a0 xs xn bn
append (More a0 xs xn bn) (More an ys ym bm) = More a0 (xs ++ [(xn, combine bn an)] ++ ys) ym bm
    

We finish off the implementation by writing the translation into the normalized format. Since the More constructor already contains the almost-normalized form, we just need to take care to snoc the final element onto the result:

stepsOf
    :: forall (ends :: Ends a b) t. (SingKind a, SingKind b)
    => Amble ends t
    -> (Maybe (Demote a), [(t, Maybe (Demote (Either a b)))])
stepsOf End = (Nothing, [])
stepsOf (More a0 xs xn bn) = (fromSing a0, xs ++ [(xn, Right <$> fromSing bn)])
    

Putting a bow on it

What we have so far works, but there are a couple of straightforward improvements that would be a shame not to implement.

Nicer way to take steps

As written, you would have to use Step like this:

Step (sing @Nothing) UpdateFlags (sing @(Just Indirect))      
    

All this singing noise would be more annoying than the Eurovision Song Contest, so I wanted to avoid it. The idea is to turn those Sing-typed arguments into just type-level arguments; then do some horrible RankNTypes magic to keep the parameter order. Prenex? What is that?

{-# LANGUAGE RankNTypes #-}
      
step :: forall pre. (SingI pre) => forall t. t -> forall post. (SingI post) => Step pre post t
step x = Step sing x sing
    

So now we will be able to write code like step @Nothing UpdateFlags @(Just Indirect) and get a Step type inferred that has the preamble and the postamble appearing in the indices.
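
The infix cons used in the error-message example below is then presumably nothing more than an operator synonym for cons; a sketch (the fixity is my guess):

infixr 5 >:>

(>:>)
    :: forall (a0 :: Maybe a) b1 (ends :: Ends a b) t. (CanCons b1 ends)
    => Step a0 b1 t -> Amble ends t -> Amble (NonEmpty a0 (Cons b1 ends)) t
(>:>) = cons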

Custom type error message

Suppose we make a mistake in our microcode, and accidentally want to write after one step and read before the next (using >:> for infix cons):

step @Nothing         (Get2 rHL)                       @(Just IncrPC) >:>
step @(Just Indirect) ReadMem                          @Nothing >:>
step @Nothing         (Compute Const01 ADD KeepC SetA) @Nothing >:>
step @Nothing         UpdateFlags                      @(Just Indirect) >:>
End      
    

This is rejected by the type checker, of course; however, the error message is not as informative as it could be, as it faults the missing SingI instance for a stuck type family application:

• No instance for (SingI (Combine ('Just 'IncrPC) ('Just 'Indirect)))
arising from a use of ‘>:>’

With GHC's custom type errors feature, we can add a fourth clause to our Combine type family. Unfortunately, this requires turning on UndecidableInstances for now:

{-# LANGUAGE UndecidableInstances #-}      
import GHC.TypeLits

type Conflict post pre =
    Text "Conflict between postamble" :$$: Text "  " :<>: ShowType post :$$:
    Text "and next preamble" :$$: Text "  " :<>: ShowType pre

type family Combine (post :: Maybe b) (pre :: Maybe a) :: Maybe (Either a b) where
    Combine Nothing Nothing = Nothing
    Combine (Just post) Nothing = Just (Right post)
    Combine Nothing (Just pre) = Just (Left pre)
    Combine (Just post) (Just pre) = TypeError (Conflict post pre)
    

With this, the error message changes to:

• Conflict between postamble 'IncrPC and next preamble 'Indirect

Much nicer!

Tracking fragment length

The final difference between what we have described here and the code I use for real is that in the real version, Amble also tracks its length in an index. This is needed because the CPU core is used not just for emulation, but also FPGA synthesis; and in real hardware, we can't just store lists of unbounded size in the microcode ROM. So instead, microcode is described as a length-indexed Amble n ends t, and then normalized into a Vec n instead of a list. Each instruction can be at most 10 steps long; everything is then ultimately normalized into a uniformly typed Vec 10 by padding it with "go start fetching next instruction" micro-ops.

The full implementation

Find the full code on GitHub, next to the rest of the Intel 8080 core.

Solving text adventure games via symbolic execution

1 August 2020 (programming haskell smt)

A couple weeks ago, I attended some of the FSCD talks that the time zone difference allowed. One of the satellite workshops of FSCD this year was the Workshop on Rewriting Techniques for Program Transformations and Evaluation, where Martin Lester presented a talk Program Transformations Enable Verification Tools to Solve Interactive Fiction Games.

Because I haven't been able to find slides or the paper itself online, let me summarize it quickly in my own words here. Back in the late seventies and early eighties, Scott Adams (unrelated to the Dilbert comic strip creator) was a prolific text adventure creator; the Digital Antiquarian has written about his games at length, but basically these were simplistic (and in my opinion, creatively quite weak) games published on very limited platforms, where they found a following. Martin Lester's talk was about taking an off-the-shelf open source interpreter for these interactive fiction games (ScottFree) and feeding it (and a given game data file's contents) to a compiler that turns it into an SMT problem, with the free variables corresponding to the user input. Then, if an SMT solver can find an assignment satisfying a winning end condition, that means we have found a possible transcript of user input that wins the given game.

This combination of semi-formal methods and interactive fiction, plus the fact that I had been wanting to play around with SMT solver-backed interpreters ever since the Rosette talk at last year's ICFP, meant that the pull of this topic was just irresistible nerd catnip for me. So I took a week of afternoon hacking time from working on my Clash book, and started writing a Scott Adams adventure game engine in Haskell, with the aim of doing something similar to Lester's work.

An SBV-based Scott Adams solver

And so now I'm here to tell you about how it went, basically a "poor man's Rosette". When I started this, I had never used SMT solvers and had never even looked at SBV, so I am not claiming I became an expert in just a week. But I managed to bumble my way through to something that works well enough on a toy example, so... let's see.

Step one was to implement a bog-standard interpreter for these Scott Adams adventure games. I didn't implement every possible condition and instruction, just barely enough to get a non-trivial example from ScottKit working. My target was the fourth tutorial adventure; see if you can find a solution either by playing the game, or by reading the game script. Why specifically the fourth tutorial example? Well, the short answer is because that is what Lester used as his example in his talk.

Here is the Git commit of this implementation; as you can see, there is really nothing special in there. The game data is parsed using Attoparsec into a simple datatype representation, which is then executed in ReaderT WriterT State, with the Reader containing the static game data, the Writer the output messages, and the State the game state, which is mostly just the current locations of the various items the player can interact with:

data S = S
    { _currentRoom :: Int16
    , _needLook :: Bool
    , _itemLocations :: Array Int16 Int16
    , _endState :: Maybe Bool
    } deriving Show
makeLenses ''S

type Engine = ReaderT Game (WriterT [String] (State S))      
    

The second step was to use SBV so that (mostly) the same interpreter can also be executed symbolically. This is the interesting part. It started with changing the Writer output and the State representation to use SBV's symbolic types, and then following the type errors:

data S = S
    { _currentRoom :: SInt16
    , _needLook :: SBool
    , _itemLocations :: Array Int16 SInt16
    , _endState :: SMaybe Bool
    } deriving (Show, Generic, Mergeable)
    

Arithmetic works out of the box because of the Num instances; because the Haskell Prelude's Eq and Ord typeclasses hardcode the result type to Bool, and we want symbolic SBool results instead, we have to replace e.g. (==) with (.==).
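
A trivial illustration of what that mechanical change looks like (my own snippet; SInt16, SBool and .== come from SBV's EqSymbolic machinery):

import Data.Int (Int16)
import Data.SBV (SBool, SInt16, (.==))

-- Concrete version: plain Bool
sameRoom :: Int16 -> Int16 -> Bool
sameRoom loc here = loc == here

-- Symbolic version: the result is an SBool that can depend on unknown input
sSameRoom :: SInt16 -> SInt16 -> SBool
sSameRoom loc here = loc .== here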

For the control structures, my idea was to write Mergeable instances for the MTL types. The definitions of these instances are very straightforward, and it allowed me to define symbolic versions of things like when or case with literal matches. The end result of all this is that we can write quite straightforward monadic code, just replacing some combinators with their symbolic counterpart. Here's an example of the code that runs a list of instruction codes in the context of their conditions; even without seeing any other definitions it should be fairly straightforward what it does:

execIf :: [SCondition] -> [SInstr] -> Engine SBool
execIf conds instrs = do
    (bs, args) <- partitionEithers <$> mapM evalCond conds
    let b = sAnd bs
    sWhen b $ exec args instrs
    return b      
    
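For reference, once the relevant m () type is Mergeable, a combinator like sWhen can be a thin wrapper over SBV's ite; one plausible definition (a sketch, not necessarily what ScottCheck does):

import Data.SBV (SBool, Mergeable, ite)

sWhen :: (Monad m, Mergeable (m ())) => SBool -> m () -> m ()
sWhen b act = ite b act (return ())
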

I decided to use concrete lists and arrays of symbolic values instead of symbolic arrays throughout the code. One interesting example of sticking to this approach is in the implementation of listing all items in the current room. The original concrete code looks like this:

let itemsHere =
        [ desc
        | (Item _ _ desc _, loc) <- zip (A.elems items) (A.elems itemLocs)
        , loc == here
        ]
unless (null itemsHere) $ do
    say "I can also see:"
    mapM_ (\desc -> say $ " * " <> desc) itemsHere
    

For the symbolic version, I had to get rid of the filtering (since loc .== here returns an SBool now), and instead, I create a concrete list of pairs of symbolic locations and concrete descriptions. By going over the full list, I can push all the symbolic ambiguity down to just the output:

let itemLocsWithDesc =
        [ (loc, desc)
        | (Item _ _ desc _, loc) <- zip (A.elems items) (A.elems itemLocs)
        ]
    anyHere = sAny ((.== here) . fst) itemLocsWithDesc
sWhen anyHere $ do
    say_ "I can also see:"
    forM_ itemLocsWithDesc $ \(loc, desc) ->
        sWhen (loc .== here) $ say $ literal $ " * " <> desc      
    

By the way, as the above code shows, I kept the user-visible text messages in the symbolic version. This is completely superfluous for solving, but it allows using the symbolic interpreter with concrete values: since all input is concrete, we can safely assume that the symbolic output values are all constants. In practice, this means we recover the original interactively playable version from the SBV-based one simply by running inside SBV's Query monad and getValue'ing the concrete String output from the SStrings coming out of the WriterT. I wouldn't be surprised if this turns out to be a major drain on performance, but because my aim was mostly to just get it working, I never bothered checking. Besides, since no one looks at the output in solver mode, maybe Haskell's laziness ensures there's no overhead. I really don't know.

"Turning the crank" on a stateful symbolic computation

So at this point, we have a symbolic interpreter which we can feed user input line by line:

stepPlayer :: SInput -> Engine (SMaybe Bool)
stepPlayer (verb, noun) = do
    perform (verb, noun)
    finished
    

The question then is, how do we keep running this (and letting the state evolve) for more and more lines of symbolic input, until we get an sJust sTrue result (meaning the player has won the game)? My original idea was to let the user say how many steps to check, and then generate a full list of symbolic inputs up-front. I asked around on Stack Overflow for something better, and it was this very helpful answer that pointed me in the direction of the Query monad in the first place. With this incremental approach, I can feed it one line of symbolic input, check for satisfiability with the newly yielded constraints, and if there's no solution yet, keep this process going, letting the next stepPlayer call create additional constraints.

I've factored out the whole thing into the following nicely reusable function; this is also the reason I am using ReaderT WriterT State instead of RWS: it lets me peel away the outer layers naturally and get down to a plain State computation.

loopState :: (SymVal i) => (Int -> Query (SBV i)) -> s -> (SBV i -> State s SBool) -> Query ([i], s)
loopState genCmd s0 step = go 1 s0 []
  where
    go i s cmds = do
        io $ putStrLn $ "Searching at depth: " ++ show i

        cmd <- genCmd i
        let cmds' = cmds ++ [cmd]

        push 1
        let (finished, s') = runState (step cmd) s
        constrain finished
        cs <- checkSat

        case cs of
            Unk -> error $ "Solver said Unknown, depth: " ++ show i
            Unsat -> do
                pop 1
                go (i+1) s' cmds'
            Sat -> do
                cmds' <- mapM getValue cmds'      
                return (cmds', s')
    

SBV bugs discovered on the way

Because I'm a complete SBV noob, I was reluctant to attribute problems to SBV bugs at first; I ended up with a ton of Stack Overflow questions. However, it turned out I do still have my Midas touch of finding bugs very quickly in anything I start playing around with; this time, it started with SBV generating invalid SMTLib output from my code. Although my initial report was basically just "this several hundred line project Just Doesn't Work", I managed to cut it down to a more reasonable size. The SBV maintainers, especially Levent Erkök, have been very helpful with quick turnaround.

The other bug I found was symbolic arrays misbehaving; although I ended up using neither SFunArray nor SArray in the final version of ScottCheck, it is good to know that my silly project has somehow contributed, if just a little, to making SBV itself better.

The money shot

So, lots of words, but where's the meat? Well first off, my code itself is on GitHub, and could serve as a good introduction to someone wanting to start using SBV with a stateful computation. And second, here is a transcript of ScottCheck verifying that the tutorial game is solvable, with the Z3 backend; the timestamps are in minutes and seconds from the start, to give an idea of how later steps become slower because there's an exponential increase in all possible inputs leading up to it. The words may look truncated, but that's because the actual internal vocabulary of the game only uses three letter words; further letters from the user are discarded (so COIN parses as COI etc.).

00:00 Searching at depth: 1
00:00 Searching at depth: 2
00:00 Searching at depth: 3
00:00 Searching at depth: 4
00:00 Searching at depth: 5
00:01 Searching at depth: 6
00:01 Searching at depth: 7
00:02 Searching at depth: 8
00:05 Searching at depth: 9
00:11 Searching at depth: 10
00:24 Searching at depth: 11
00:45 Searching at depth: 12
01:24 Searching at depth: 13
02:38 Searching at depth: 14
03:35 Solution found:
03:35   1. GO WES
03:35   2. GET CRO
03:35   3. GO EAS
03:35   4. GO NOR
03:35   5. GET KEY
03:35   6. GO SOU
03:35   7. OPE DOO
03:35   8. GO DOO
03:35   9. GET COI
03:35  10. GO NOR
03:35  11. GO WES
03:35  12. GO NOR
03:35  13. DRO COI
03:35  14. SCO ANY     
    

Integrating Verilator and Clash via Cabal

7 May 2020 (programming haskell clash fpga)

TL;DR: This is a detailed description of how I got Clashilator working seamlessly from Cabal. It took me three days to figure out how the pieces need to fit together, and most of it was just trawling the documentation of Cabal internals, so I better write this down while I still have any idea what I did.

Background information

We set the scene with the dramatis personæ first:

When designing some circuit, it is very useful to be able to simulate its behaviour. Getting debugging information out of a hardware FPGA is a huge hassle; iteration is slow because FPGA synthesis toolchains, by and large, suck; and driving the circuit (i.e. setting the right inputs in the right sequence and interpreting the outputs) can be almost as complicated as the circuit under testing itself. Of course, all of these problems apply doubly to someone like me who is just dabbling in FPGAs.

So during development, instead of synthesizing a circuit design and loading it onto an FPGA, we want to simulate it; and of course we want to use nice expressive languages to write the test bench that envelopes the simulated circuit. One way to do this is what I call very high-level simulation: in this approach, we take the Haskell abstractions we use in our circuit, and reinterpret them in a software context. For example, we might have a state machine described as i -> State s o: instead of lifting it into a signal function, we can just runState it in a normal Haskell program's main and do whatever we want with it.
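
As a sketch of what that looks like (an illustration of the approach, not code from any particular project): if the core logic is a pure i -> State s o step function, the software test bench is just a traversal over a list of inputs.

import Control.Monad.State

-- Run a state machine, given as a plain step function, over a list of inputs;
-- collect the outputs and the final state. No clocks or Signals involved.
runSoftware :: (i -> State s o) -> s -> [i] -> ([o], s)
runSoftware step s0 = flip runState s0 . mapM step

From there, main is free to pretty-print the outputs, compare them against a golden trace, or drive an interactive UI.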

However, sometimes we want to simulate whole circuits, i.e. Signal dom i -> Signal dom o functions that might have who knows what registers and memory elements inside. For example, if we have a circuit that generates video output from a frame buffer, there's a difference between a high-level simulation that renders the frame buffer's contents to the screen, and a lower level one that interprets the VGA signal output of the circuit. Timing issues in synchronizing the VGA blanking signals with the color lines will only manifest themselves in the latter. So for this kind of application, Clash of course contains a signal simulator that can be used to feed inputs into a circuit and get outputs. For example, here's a simulation of a Brainfuck computer where only the peripheral lines are exposed: internal connections between the CPU and the RAM and ROM are all part of the Clash circuit.

There is only one problem with the Clash simulator: its performance. This small benchmark shows how long it takes to simulate enough cycles to draw 10 full video frames at 640 ⨯ 480 resolution (i.e. 4,192,000 cycles). Clash does it in ~13 seconds; remember, at 60 FPS, it shouldn't take more than 166 milliseconds to draw 10 frames if we want to simulate it in real time. Of course, real-time simulation at this level of detail isn't necessarily feasible on consumer hardware; but less than one frame per second means any kind of interactive applications are out.

In contrast, Verilator, an open-source Verilog simulator, can run all 4,192,000 cycles in 125 ms. This could be faster than realtime, were it not for the additional overhead of drawing each frame to the screen (and the Haskell FFI of 4 million calls across to C...), and of course this is a very simple circuit that only renders a single bouncing ball; anything more complex will only be slower. But still, that same benchmark shows that 20+ FPS is possible, end-to-end, if we ditch Clash and use Verilator instead. Pong is fully playable at 13 FPS.

Clash, Verilog and Haskell

The interface between Clash and Verilator is simple: Verilator consumes Verilog code, so we can simply run Clash and point Verilator at its output. However, we still need to connect that Verilator code to the Haskell code to drive the inputs and interpret the outputs. Here are the glue files from the Pong example:

As I was preparing to write the next chapter of a Clash book I've been working on, I made a new Clash project and then, because I needed a Verilator simulation for it, I started going through the steps of making all these files. And of course, I realized this should be all automated. But what is all in that sentence?

Step one was to write generators for the C++ and Haskell source files and the Makefile. This is quite easy, actually; after all, it is the fact that these files are so regular that makes it infuriating writing them by hand. So we do a bit of text template substitution, using Clash's .manifest output as the source of input/output pin names and bus widths. This gives us a simple code generator: you run Clash yourself, point Clashilator at a .manifest file, and it outputs a bunch of files, leaving you ready to run make. Mission accomplished?

No, not really.

Clash, Verilog and... Cabal

While we've eliminated the boilerplate in the source files, one source of boilerplate remains: the Cabal package settings. Here's the relevant HPack package.yaml section from the Pong example:

extra-libraries: stdc++ 
extra-lib-dirs: verilator
include-dirs: verilator
build-tools: hsc2hs

ghc-options:
  -O3
  -fPIC -pgml g++
  -optl-Wl,--whole-archive -optl-Wl,-Bstatic
  -optl-Wl,-L_build/verilator -optl-Wl,-lVerilatorFFI
  -optl-Wl,-Bdynamic -optl-Wl,--no-whole-archive
    

Oof, that hurts. All those magic GHC options just to statically link to libVerilatorFFI.a, to be repeated across all Clash projects that use Verilator...

Also, while Clashilator outputs a Makefile to drive the invocation of Verilator and the subsequent compilation of the C++ bits, it doesn't give you a solution for running *that* Makefile at the right time — not to mention running Clashilator itself!

The problem here is that in order to compile a Verilator-using Haskell program, we first need to compile the other, non-simulation modules into Verilog. And this is tricky because those modules can have external dependencies: remember Clash is just a GHC backend, so you can import other Haskell libraries. And how are we going to bring those libraries in scope? I myself use Stack but your mileage may vary: you could be using Cabal directly, or some Cabal-Nix integration. In any case, you'd basically need to build your package so you can compile to Verilog so you can run Verilator so you can... build your package.

To solve this seemingly circular dependency, and to get rid of the Cabal file boilerplate, I decided to try and do everything in the Cabal workflow. Whatever your preferred method of building Haskell packages, when the rubber hits the road, they all ultimately run Cabal. If we run Clash late enough in the build process, all dependencies will be installed by then. If we run Verilator early enough in the build process, the resulting library can be linked into whatever executable or library Cabal is building.

If we do all this during cabal build, everything will happen at just the right time.

Is such a thing even possible? Yes it is.

So, what gives us at least a fighting chance is that Cabal is extensible with so-called hooks. You can write a custom Setup.hs file like this:

import Distribution.Simple

main = defaultMainWithHooks simpleUserHooks
    

Here, simpleUserHooks is a record with a bunch of fields for extension points; of particular interest to us here is this one:

buildHook :: PackageDescription -> LocalBuildInfo -> UserHooks -> BuildFlags -> IO ()
    

At this point, we have breached the defenses: inside buildHook, we can basically do arbitrary things as long as the effect is building the package. In particular, we can run Clash on the circuit modules to generate Verilog, run Clashilator and Verilator on that output, and amend the BuildInfo of the affected components so that the resulting library gets compiled and linked in.

The result of all this is a bunch of new files under the build directory, and modified BuildInfos for all the components marked with x-clashilator-top-is. We put these back into the PackageDescription and then call the default buildHook, which then goes on to compile and link the Haskell simulation integrating the Verilator parts:

clashilatorBuildHook :: PackageDescription -> LocalBuildInfo -> UserHooks -> BuildFlags -> IO ()
clashilatorBuildHook pkg localInfo userHooks buildFlags = do
    pkg' <- clashilate pkg localInfo buildFlags
    buildHook simpleUserHooks pkg' localInfo userHooks buildFlags
    

All the details are in the full implementation, which is probably worth a look if you are interested in Cabal's internals. As I wrote in the beginning of this post, it took me days to wade through the API to find all the moving parts that can be put together to apply this level of violence to Cabal. The final code also exports a convenience function clashilatorMain for the common case where enabling Clashilator is the only desired Setup.hs customization; and also clashilate itself for hardcore users who want to build their own buildHooks.
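
In the common case, that presumably means a Setup.hs of just a few lines; a sketch (the module name here is a guess, check the repository for the actual import):

-- Setup.hs (module path assumed for illustration)
import Clashilator.Setup (clashilatorMain)

main :: IO ()
main = clashilatorMain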

The implementation is almost 150 lines of not particularly nice code. It is also missing some features; most notably, it doesn't track file changes, so Clash and Verilator are always rerun, even if none of the Clash source files have changed. It is also completely, utterly untested. But it does give us what we set out to do: completely boilerplate-less integration of Clash and Verilator. A complete example package is here, and here's the money shot: the executables section of the package.yaml file.

executables:
  simulator:
    main: simulator.hs
    verbatim:
      x-clashilator-top-is: MyCircuit.Nested.Top
      x-clashilator-clock: CLK
    

Composable CPU descriptions in CλaSH, and wrap-up of RetroChallenge 2018/09

30 September 2018 (programming haskell fpga electronics retrochallenge clash)

My last post ended with some sample CλaSH code illustrating the spaghetti-ness you can get yourself into if you try to describe a CPU directly as a function (CPUState, CPUIn) -> (CPUState, CPUOut). I promised some ideas for improving that code.

To start off gently, first of all we can give names to some common internal operations to help readability. Here's the code from the previous post rewritten with a small kit of functions:

intensionalCPU (s0, CPUIn{..}) = case phase s of
    WaitKeyPress reg ->
        let s' = case cpuInKeyEvent of
                Just (True, key) -> goto Fetch1 . setReg reg key $ s
                _ -> s
            out = def{ cpuOutMemAddr = pc s' }
        in (s', out)                   
    Fetch1 ->
        let s' = goto Exec s{ opHi = cpuInMem, pc = succ $ pc s }
            out = def{ cpuOutMemAddr = pc s' }
        in (s', out)
  where
    s | cpuInVBlank = s0{ timer = fromMaybe 0 . predIdx $ timer s0 }
      | otherwise = s0
    

Using a State monad

To avoid the possibility of screwing up the threading of the internal state, we can use the State CPUState monad:

Continue reading »
