Tuesday, March 17, 2020

TinyBasic made for my FPGA platform

This is a follow-up to my original FPGA computer post.

In my previous post, I described how I modified the GCC cross-compiler, originally made for the moxie platform, to generate assembly code for my FPGA platform. I have used this new cross-compiler to port TinyBasic to my platform: I downloaded the TinyBasic C code and modified it to be a bit more programmer-friendly. The port can be found here:

https://github.com/milanvidakovic/FPGABasic

Besides the standard BASIC commands, I had the freedom to invent my own commands and to play with them. First of all, I created a MODE command, which sets the video card mode:
0 - text mode,
1 - graphics mode, 640x480 with 2 colors, and
2 - graphics mode, 320x240 with 8 colors.

Besides the MODE command, I now have the following graphics commands:
- PLOT x, y, color
- LINE x1, y1, x2, y2, color
- CIRCLE x, y, r
- DRAW x, y, "TEXT"

I have also added two key-related functions: KEY() and ISKEY(). Both return the virtual code of a pressed key, but KEY() is blocking (it waits until a key is pressed), while ISKEY() returns immediately with the virtual code of the last key pressed.

I have also played with the file system on my "hard disk". I have created the following commands:
- DIR - lists the contents of the "hard disk" root folder,
- LOAD PROGRAM.BAS - loads a BASIC program into the computer's memory,
- SAVE PROGRAM.BAS - saves a BASIC program to the "hard disk",
- EXEC PROGRAM.BIN - loads and executes a binary executable,
- SYS ADDRESS - executes a machine-code program loaded at the given address.

BASIC now boots from the SD card and can be used immediately. Here is the video of the computer booting from the SD card into BASIC:


Saturday, March 14, 2020

Modifying GCC to work with my FPGA computer

This is a follow-up to my original FPGA computer post.

In this post I will talk about modifying the GCC compiler to work with my FPGA platform. I wanted to make a cross-compiler that could compile C programs for my FPGA platform, and this post describes how far I have got. Currently, the modified GCC produces assembly code for my platform as the result of compiling a C program.

By a cross-compiler, I mean a compiler that runs on my PC but produces an executable for my FPGA computer. GCC already supports many cross-compilation targets, and all you have to do is choose the appropriate target when building GCC. That way, you build a cross-compiler that produces executables for the target platform. However, if you have created a new platform, you first need to add support for it to GCC and then build a cross-compiler for that new platform. That step is very complicated: adding your own platform to GCC is an almost impossible job if you have not done something like it before (I haven't).

I have found a very useful blog post that describes how GCC generates code:

https://kristerw.blogspot.com/2017/08/writing-gcc-backend_4.html

Besides that blog, there is a simple target among the cross-compilers that GCC supports: moxie. Its author is Anthony Green, who initially wrote a blog about modifying GCC for a fictional platform, moxie:

http://atgreen.github.io/ggx/

The project managed to become an official part of GCC and is now incorporated in the GCC source tree.

There is a sandboxed environment made for this platform, moxiebox. The code is in this GitHub repository: https://github.com/jgarzik/moxiebox

I have forked that repo and added my own modification of the moxie cross-compiler, adjusted for my own FPGA platform:

https://github.com/milanvidakovic/moxiebox

So, how can you use moxie to make your own GCC cross-compiler? You need to know at least the fundamentals of GCC to do so. First of all, there is a frontend and there is a backend. The frontend parses the source code and produces a target-independent intermediate representation, which is passed to the backend, which in turn generates code for the target platform. Between the frontend and the backend sits the optimizer, which transforms that intermediate representation so that the generated code is faster or smaller.

The backend starts generating target platform code by processing insns. An insn is a kind of virtual assembly instruction created during compilation. Your backend now needs to generate one or more real machine instructions out of each insn. That is where I started to investigate moxie.

First of all, I downloaded moxiebox from GitHub and unpacked it on the disk. Then I installed the packages needed to build moxiebox on my Ubuntu:

sudo apt install device-tree-compiler texinfo flex build-essential libgmp-dev libmpfr-dev libmpc-dev
sudo apt install git subversion cvs

Then I executed the /moxiebox/contrib/download-tools-sources.sh script. After that, moxiebox is ready to be built. You do so by executing the /moxiebox/contrib/build-moxiebox-tools.sh script. It takes about an hour to build everything.

To make the moxie-based GCC tools available on the command line, add them to PATH and LD_LIBRARY_PATH at the end of your .bashrc file:

export PATH=/moxiebox/contrib/root/usr/bin:$PATH
export LD_LIBRARY_PATH=/moxiebox/contrib/root/usr/lib:$LD_LIBRARY_PATH

Now is a good moment to start changing the original moxie cross-compiler to work with my platform. Fortunately, moxie is quite similar to my FPGA CPU, so it was not a huge job.

First of all, I changed the /moxiebox/contrib/gcc/gcc/config/moxie/moxie.md file to generate my FPGA instructions instead of moxie ones. Here is one example:

(define_insn "*movsi"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,r,r,W,A,r,r,B,r")
        (match_operand:SI 1 "moxie_general_movsrc_operand" "O,r,i,r,r,W,A,r,B"))]
  "register_operand (operands[0], SImode)
   || register_operand (operands[1], SImode)"
  "@
   xor.w\\t%0, %0
   mov.w\\t%0, %1
   mov.w\\t%0, %1
   st.w\\t%0, %1
   st.w\\t[%0], %1
   ld.w\\t%0, %1
   ld.w\\t%0, [%1]
   st.w\\t%0, %1
   ld.w\\t%0, %1"
  [(set_attr "length" "2,2,6,2,6,2,6,6,6")])

Then I changed moxie.c and moxie.h to use my own register names and to generate the stack frame prologue and epilogue the way I am used to. Prologue:

emit_insn (gen_movsi_push (hard_frame_pointer_rtx));
emit_move_insn (hard_frame_pointer_rtx, stack_pointer_rtx);
moxie_compute_frame ();

And epilogue:

emit_move_insn (stack_pointer_rtx, hard_frame_pointer_rtx);
emit_insn (gen_movsi_pop (hard_frame_pointer_rtx, hard_frame_pointer_rtx));
emit_jump_insn (gen_returner ());

I tried to make a standard stack frame and to pass all arguments on the stack instead of in registers (it is slower, but I understand it better). I have placed the patch, which replaces the original moxie cross-compiler files, here:

https://github.com/milanvidakovic/moxiebox/blob/master/contrib/contrib.zip

When I try to build the whole package I still get a lot of errors, but I now have a moxiebox-gcc compiler which (unfortunately) cannot build a complete executable, but can generate an assembly file with the -S option:

moxiebox-gcc -S test.c -o test.s

After that, it was easy to use the customasm assembler to generate the executable for my platform.

Conclusion

I have done just half of the job. Currently, only assembly files are generated from the C files. The next step is to change the GNU assembler (and linker) to produce a proper executable.



Friday, March 6, 2020

Cache implemented on my FPGA computer

Introduction

This is a follow-up to my original FPGA computer post.

My FPGA computer uses SDRAM as its main memory. It has static RAM too, but most of it is used as dual-port RAM for the VGA video subsystem. The SDRAM is a 32MB memory with a 16-bit data bus, and a read usually takes about six clock cycles, as does a write. The clock is 100MHz. Knowing all of this, it was about time to do some performance measurement:

I made a simple program that counts from 1 to 10 000 000. If that program is loaded into SDRAM, it takes about 15 seconds to finish. However, if I load it into static RAM, it takes about 6 seconds. So, there was an obvious motivation to try to implement a cache controller. You can look at the Verilog code here:
https://github.com/milanvidakovic/FPGAComputer32/blob/master/cpu.v
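
As a rough back-of-the-envelope check (my own arithmetic, not an additional measurement), the two timings can be converted into average clock cycles per loop iteration at 100MHz. The helper name cycles_per_iteration is hypothetical:

```c
#include <stdint.h>

/* Hypothetical helper: average clock cycles spent per loop iteration,
   given the total run time in seconds, the clock frequency in Hz and
   the number of iterations. Integer division truncates any remainder. */
uint64_t cycles_per_iteration(uint64_t seconds, uint64_t clock_hz,
                              uint64_t iterations)
{
    return (seconds * clock_hz) / iterations;
}
```

With the figures above, cycles_per_iteration(15, 100000000, 10000000) gives 150 cycles per iteration when running from SDRAM, and the same call with 6 seconds gives 60 cycles from static RAM. In other words, roughly 60% of the SDRAM run time is spent waiting for memory.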

Implementation

I haven't used all of the static RAM in my FPGA computer, so I was able to make about 8KB of L1 cache. Here are the details:
- I have 4096 cache lines, each holding two bytes. That is 8KB of cache.
- For each cache line, I have added a 12-bit TAG, used for the direct mapping of the cache line. That consumes an additional 6144 bytes of static RAM (4096 x 12 bits).
- I have implemented a write-through policy, since I didn't have enough resources to implement write-back. I will try to make write-back work, but it requires a complete rework of the cache controller, so, perhaps later...
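
The storage figures in the list above follow from simple arithmetic. Here is a minimal C sketch of it; the helper names are mine, and it assumes the tags are packed bit by bit with no padding, which may not match how the FPGA tools actually lay out the memory:

```c
#include <stdint.h>

/* Bytes of data storage: one entry per cache line. */
uint32_t cache_data_bytes(uint32_t lines, uint32_t bytes_per_line)
{
    return lines * bytes_per_line;
}

/* Bytes of tag storage, assuming tightly packed tags (an assumption). */
uint32_t cache_tag_bytes(uint32_t lines, uint32_t tag_bits)
{
    return (lines * tag_bits + 7) / 8;   /* round up to whole bytes */
}
```

For 4096 lines of two bytes each, cache_data_bytes(4096, 2) is 8192 (8KB). For a 12-bit tag per line, cache_tag_bytes(4096, 12) is 6144.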

How does this work in practice? First of all, here is the Verilog code:

// cache TAG
reg [11:0] tag[4095:0];
// cache line
reg [15:0] cl[4095:0];

Each cache line (a row in the cl array) holds two bytes of data. Whenever the CPU wants to do a read, the address from the address bus goes into the cache controller:

if (tag[addr[11:0]] == addr[23:12]) begin
    // cache hit (required data is in cache)
    data_r <= cl[addr[11:0]];
    state <= next_state;
end
else begin
    // cache miss -> we need to read from SDRAM
    rd_enable_o <= 1'b1;
    if (busy_i) begin
        state <= READ_WAIT;
    end
end

The lower 12 bits of the address (addr[11:0]) select the cache line. To check whether the wanted data is in the cache, the tag is used. The same lower 12 bits select the tag assigned to that cache line. If the upper 12 bits of the address (addr[23:12]) match the stored tag, we have a cache hit and the data can be returned directly from the cache.
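
The address split can also be written out in C, which makes it easy to see which addresses compete for the same cache line. This is only an illustrative sketch; the function names cache_index and cache_tag are hypothetical and do not appear in the Verilog:

```c
#include <stdint.h>

/* Lower 12 bits: which cache line (and tag entry) the address maps to. */
uint32_t cache_index(uint32_t addr)
{
    return addr & 0xFFFu;            /* addr[11:0] */
}

/* Next 12 bits: the value compared against the stored tag. */
uint32_t cache_tag(uint32_t addr)
{
    return (addr >> 12) & 0xFFFu;    /* addr[23:12] */
}
```

For example, the addresses 0x000123 and 0xABC123 share the index 0x123 but have the tags 0x000 and 0xABC, so they map to the same cache line and evict each other.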

If that is not the case, then we need to perform a read from the SDRAM, and then:

rd_enable_o <= 1'b0;
if (rd_ready_i) begin
    data_r <= rd_data_i;
    // we store the fetched data into the cache
    cl[addr[11:0]] <= rd_data_i;
    // write tag
    tag[addr[11:0]] <= addr[23:12];
    state <= next_state;
end

When we finally obtain the data from the SDRAM, we return it to the CPU, but we also store that same data in the cache line and update the tag associated with that cache line with the upper 12 bits of the address.

That was the read cycle. Let's see how a write works. When the CPU wants to write data, it is saved into both the SDRAM and the cache:

// Write through, meaning that we save data in both SDRAM and cache
wr_data_o <= data_to_write;
// now we need to store the data that had to be saved into cache
cl[addr[11:0]] <= data_to_write;
// write tag
tag[addr[11:0]] <= addr[23:12];
wr_enable_o <= 1'b1;
if (busy_i) begin
    state <= WRITE_WAIT;
end

As we can see, data is saved in both SDRAM and cache. Then we just wait until the SDRAM is no longer busy and move on to the next state:

wr_enable_o <= 1'b0;
if (~busy_i) begin
    state <= next_state;
end
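
To experiment with the read and write cycles without an FPGA, the same direct-mapped, write-through behaviour can be modelled in a few lines of C. This is a software sketch, not a translation of the controller: the type and function names are hypothetical, the hit and miss counters are added for illustration, and the valid flag is my own addition that does not exist in the Verilog above. The sdram array must be large enough for every address used.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINES 4096u

/* Software stand-in for the cache; all names are hypothetical. */
typedef struct {
    uint16_t tag[LINES];    /* mirrors: reg [11:0] tag[4095:0] */
    uint16_t cl[LINES];     /* mirrors: reg [15:0] cl[4095:0]  */
    bool     valid[LINES];  /* assumption: not in the Verilog  */
    uint16_t *sdram;        /* backing store, indexed by word address */
    unsigned hits, misses;
} cache_t;

void cache_init(cache_t *c, uint16_t *sdram)
{
    for (uint32_t i = 0; i < LINES; i++) {
        c->tag[i] = 0;
        c->cl[i] = 0;
        c->valid[i] = false;
    }
    c->sdram = sdram;
    c->hits = 0;
    c->misses = 0;
}

uint16_t cache_read(cache_t *c, uint32_t addr)
{
    uint32_t idx = addr & 0xFFFu;                      /* addr[11:0]  */
    uint16_t tag = (uint16_t)((addr >> 12) & 0xFFFu);  /* addr[23:12] */

    if (c->valid[idx] && c->tag[idx] == tag) {
        c->hits++;                 /* cache hit: no SDRAM access */
        return c->cl[idx];
    }
    c->misses++;                   /* cache miss: fetch and fill the line */
    c->cl[idx] = c->sdram[addr];
    c->tag[idx] = tag;
    c->valid[idx] = true;
    return c->cl[idx];
}

void cache_write(cache_t *c, uint32_t addr, uint16_t data)
{
    uint32_t idx = addr & 0xFFFu;

    c->sdram[addr] = data;         /* write-through: SDRAM is always updated */
    c->cl[idx] = data;             /* ...and so is the cache line */
    c->tag[idx] = (uint16_t)((addr >> 12) & 0xFFFu);
    c->valid[idx] = true;
}
```

In this model, a tight loop whose instructions and data fit into 4096 words misses only on the first pass and hits afterwards, which is consistent with the counting program speeding up once the cache is enabled.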

Performance

The cache controller works like a charm! The same counting example now runs (almost) as fast as when it was executed from static RAM (about 6 seconds to count from 1 to 10 million).

Conclusion

A write-through implementation is simpler than write-back and keeps the SDRAM synchronized with the cache. However, it is slower, because the CPU needs to wait for the data to be saved in SDRAM instead of doing a fast save into the cache only. Write-back is faster, since we don't have to wait for the slow SDRAM write, but the cache goes out of sync with the SDRAM (since we saved the data in the cache only). With write-back, when a cache line that holds unsaved data has to be reused for another address, we first need to write its content to the SDRAM and only then write the new content into the cache line.

The write cycle could be implemented as write-back, but with this setup I cannot do that (not enough resources on the FPGA chip). I will investigate that in the future.
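
For comparison, here is a minimal C sketch of how a write-back variant of the same direct-mapped cache could behave. It is not implemented on my FPGA; all names are hypothetical, and the valid and dirty flags are assumptions about what such a design would need. The sdram_writes counter only exists to show when the slow SDRAM is actually touched.

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_LINES 4096u

typedef struct {
    uint16_t tag[WB_LINES];
    uint16_t cl[WB_LINES];
    bool     valid[WB_LINES];
    bool     dirty[WB_LINES];   /* line holds data newer than SDRAM */
    uint16_t *sdram;            /* backing store, indexed by word address */
    unsigned sdram_writes;
} wb_cache_t;

void wb_init(wb_cache_t *c, uint16_t *sdram)
{
    for (uint32_t i = 0; i < WB_LINES; i++) {
        c->tag[i] = 0;
        c->cl[i] = 0;
        c->valid[i] = false;
        c->dirty[i] = false;
    }
    c->sdram = sdram;
    c->sdram_writes = 0;
}

/* Before a line is reused for another address, flush it if it is dirty. */
static void wb_evict(wb_cache_t *c, uint32_t idx)
{
    if (c->valid[idx] && c->dirty[idx]) {
        uint32_t old_addr = ((uint32_t)c->tag[idx] << 12) | idx;
        c->sdram[old_addr] = c->cl[idx];
        c->sdram_writes++;
        c->dirty[idx] = false;
    }
}

void wb_write(wb_cache_t *c, uint32_t addr, uint16_t data)
{
    uint32_t idx = addr & 0xFFFu;
    uint16_t tag = (uint16_t)((addr >> 12) & 0xFFFu);

    if (!(c->valid[idx] && c->tag[idx] == tag))
        wb_evict(c, idx);          /* another address owns this line */
    c->cl[idx] = data;             /* fast path: cache only, no SDRAM write */
    c->tag[idx] = tag;
    c->valid[idx] = true;
    c->dirty[idx] = true;
}

uint16_t wb_read(wb_cache_t *c, uint32_t addr)
{
    uint32_t idx = addr & 0xFFFu;
    uint16_t tag = (uint16_t)((addr >> 12) & 0xFFFu);

    if (c->valid[idx] && c->tag[idx] == tag)
        return c->cl[idx];         /* hit */
    wb_evict(c, idx);              /* miss: save old content first */
    c->cl[idx] = c->sdram[addr];
    c->tag[idx] = tag;
    c->valid[idx] = true;
    c->dirty[idx] = false;
    return c->cl[idx];
}
```

The extra cost is visible in the data structure: every line needs a dirty flag, and both the read and the write path gain an eviction step that may have to wait for an SDRAM write before the actual request can proceed.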