Sunday, 22 August 2021

Floating-point implementation in GCC for my FPGA computer

This is a followup of my original post.

My FPGA computer supports floating-point instructions. My GCC port doesn't. If we try to compile the following code:

float d = 0.5;
float e = 0.2;
float f;

f = d - e;

The compiler will generate the following assembly code:

# small.c:10:   float d = 0.5;
    mov.w   r0, 1056964608  # tmp43,
    st.w    [r13 + (-12)], r0   # d, tmp43
# small.c:11:   float e = 0.2;
    mov.w   r0, 1045220557  # tmp44,
    st.w    [r13 + (-16)], r0   # e, tmp44
# small.c:14:   f = d - e;
    ld.w    r0, [r13 + (-16)]   # tmp45, e
    st.w    [sp + (4)], r0  #, tmp45
    ld.w    r0, [r13 + (-12)]   # tmp46, d
    st.w    [sp], r0    #, tmp46
    call    __subsf3        #
    mov.w   r1, r0  # tmp47,
    mov.w   r0, r1  # tmp48, tmp47
    st.w    [r13 + (-20)], r0   # f, tmp48
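
The two large immediates in the listing (1056964608 and 1045220557) are the raw bit patterns of 0.5 and 0.2 as 32-bit floats. Here is a minimal sketch that reproduces them on a PC, assuming the host uses IEEE 754 single precision for float; the helper name float_bits is made up for this example:

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float.
   Assumes float is an IEEE 754 binary32 value on the host.
   memcpy is used instead of a pointer cast to avoid aliasing problems. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

float_bits(0.5f) gives 1056964608 (0x3F000000) and float_bits(0.2f) gives 1045220557 (0x3E4CCCCD), which are exactly the values the compiler loads into r0.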

We can see that when the C code contains a floating-point subtraction, the generated assembly code places the two operands on the stack and then calls the __subsf3 function. What is __subsf3? It is the soft-float routine for single-precision subtraction, that is, a software implementation of a floating-point operation. You can find the list of all soft-float functions here:

https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html

You can see that there are not only the arithmetic functions. There are also conversion functions, comparison functions, and other functions. For example, if we write this in C:

printf("%f\n", f);

the generated assembly code would be:

ld.w   r0, [r13 + (-20)]   # tmp51, f
st.w    [sp], r0    #, tmp51
call    __extendsfdf2       #
...

The __extendsfdf2 function extends float to double, in order to pass it as an argument to the printf function. 
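
To give an idea of what such an extension involves, here is a minimal sketch that widens the bit fields of a float into a double. It is not the routine I use on the FPGA; the name extend_sf_df is made up, it assumes IEEE 754 formats on the host, and it deliberately ignores subnormal inputs:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a float -> double extension done purely with integer operations.
   Assumption: IEEE 754 binary32/binary64. Subnormal floats are NOT handled. */
double extend_sf_df(float a)
{
    uint32_t u;
    memcpy(&u, &a, sizeof u);

    uint64_t sign = (uint64_t)(u >> 31) << 63;   /* sign bit moves to bit 63 */
    uint32_t exp  = (u >> 23) & 0xFFu;           /* 8-bit biased exponent */
    uint64_t mant = u & 0x007FFFFFu;             /* 23-bit fraction */
    uint64_t bits;

    if (exp == 0 && mant == 0) {
        bits = sign;                                        /* signed zero */
    } else if (exp == 0xFFu) {
        bits = sign | (0x7FFULL << 52) | (mant << 29);      /* infinity or NaN */
    } else {
        /* re-bias the exponent (127 -> 1023), widen the fraction 23 -> 52 bits */
        bits = sign | ((uint64_t)(exp - 127 + 1023) << 52) | (mant << 29);
    }

    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Since every float is exactly representable as a double, the result for normal numbers should be bit-identical to what a (double) cast produces on a PC.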

Next, if we multiply float by the int like this:

int x = 5;
f = d * x;

we will get this:

ld.w   r0, [r13 + (-8)]    # tmp61, x
st.w    [sp], r0    #, tmp61
call    __floatsisf     #
mov.w   r1, r0  # _3,
# small.c:20:   f = d * x; // 0.5 * 5 -> 2.5
mov.w   r0, r1  # tmp62, _3
st.w    [sp + (4)], r0  #, tmp62
ld.w    r0, [r13 + (-12)]   # tmp63, d
st.w    [sp], r0    #, tmp63
call    __mulsf3        #
mov.w   r1, r0  # tmp64,
mov.w   r0, r1  # tmp65, tmp64
st.w    [r13 + (-20)], r0   # f, tmp65

The __floatsisf function converts int to float in order to multiply x (the integer value) with the d (float value). The result of the conversion is stored as a temporary variable tmp62 and then floating-point multiplication of d and tmp62 is done and the result is stored in the f variable.

Writing my own soft-float implementation

Now the obvious step was to write the implementation of all the functions that the compiler calls. Arithmetic functions were by far the simplest ones. For example, here is the __subsf3 function:

float __subsf3(float a, float b)
{
    asm(
        "ld.w r0, [r13 + (8)]\n \
push r1\n \
ld.w r1, [r13 + (12)]\n \
fsub r0, r1\n \
pop r1\n"
    );
}
 
As you can see, all I had to do was to get the arguments and then to call the fsub r0, r1 instruction which does floating point subtraction.

However, the conversion and comparison functions were not that simple. Conversion from float to int goes like this:

int __fixsfsi(float a)
{
    union { fp_t f; rep_t u; } fb;
    fb.f = a;

    int e = ((fb.u & 0x7F800000) >> 23) - 127;
    if (e < 0)
        return 0;

    rep_t r = (fb.u & 0x007FFFFF) | 0x00800000;
    if (e > 23)
        r <<= (e - 23);
    else
        r >>= (23 - e);
    if (fb.u & 0x80000000)
        return -r;
    return r;
}
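
The same logic can be checked on a PC before running it on the FPGA. Below is a minimal portable sketch of it, assuming a 32-bit int and IEEE 754 floats on the host. The name fix_sf_si is made up, and the saturation for exponents above 30 is my addition for the sketch; the routine above does not have it:

```c
#include <stdint.h>
#include <string.h>

/* Portable sketch of the float -> int conversion shown above.
   Assumptions: int is 32 bits wide, float is IEEE 754 binary32. */
int fix_sf_si(float a)
{
    uint32_t u;
    memcpy(&u, &a, sizeof u);

    /* unbiased exponent */
    int e = (int)((u & 0x7F800000u) >> 23) - 127;
    if (e < 0)
        return 0;                       /* |a| < 1 truncates to 0 */

    /* Addition for this sketch: saturate values that do not fit in 32 bits */
    if (e > 30)
        return (u & 0x80000000u) ? INT32_MIN : INT32_MAX;

    /* 24-bit significand with the implicit leading 1 restored */
    uint32_t r = (u & 0x007FFFFFu) | 0x00800000u;
    if (e > 23)
        r <<= (e - 23);
    else
        r >>= (23 - e);                 /* drops the fractional bits */

    return (u & 0x80000000u) ? -(int)r : (int)r;
}
```

For in-range values the result should match a plain (int) cast, because both truncate toward zero.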

And, in the other direction (from int to float):

fp_t __floatsisf(int a)
{
    const int aWidth = sizeof a * 8;

    // Handle zero as a special case to protect clz
    if (a == 0)
        return fromRep(0);

    // All other cases begin by extracting the sign and absolute value of a
    rep_t sign = 0;
    if (a < 0) {
        sign = signBit;
        a = -a;
    }

    // Exponent of (fp_t)a is the width of abs(a).
    const int exponent = (aWidth - 1) - __clzsi2(a);
    rep_t result;

    // Shift a into the significand field, rounding if it is a right-shift
    if (exponent <= significandBits) {
        const int shift = significandBits - exponent;
        result = (rep_t)a << shift ^ implicitBit;
    } else {
        const int shift = exponent - significandBits;
        result = (rep_t)a >> shift ^ implicitBit;
        rep_t round = (rep_t)a << (typeWidth - shift);
        if (round > signBit) result++;
        if (round == signBit) result += result & 1;
    }

    // Insert the exponent
    result += (rep_t)(exponent + exponentBias) << significandBits;
    // Insert the sign bit and return
    return fromRep(result | sign);
}
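
The comparison functions are not listed in this post, so here is only a rough sketch of the idea behind them, in the spirit of __cmpsf2 from the GCC list (-1 if a is less than b, 0 if equal, 1 if greater, and 1 if either argument is NaN). The name cmp_sf is made up and the code assumes IEEE 754 floats; it is not the exact routine from my library:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a soft-float three-way comparison on raw bit patterns.
   Assumption: float is IEEE 754 binary32. */
int cmp_sf(float a, float b)
{
    int32_t ai, bi;
    memcpy(&ai, &a, sizeof ai);
    memcpy(&bi, &b, sizeof bi);

    uint32_t a_abs = (uint32_t)ai & 0x7FFFFFFFu;
    uint32_t b_abs = (uint32_t)bi & 0x7FFFFFFFu;

    /* anything above the infinity pattern is a NaN: report "unordered" as 1 */
    if (a_abs > 0x7F800000u || b_abs > 0x7F800000u)
        return 1;

    /* +0 and -0 compare equal */
    if ((a_abs | b_abs) == 0)
        return 0;

    if ((ai & bi) >= 0) {
        /* at least one operand is non-negative: integer order matches float order */
        if (ai < bi) return -1;
        return (ai == bi) ? 0 : 1;
    }

    /* both operands are negative: sign-magnitude means the order is reversed */
    if (ai > bi) return -1;
    return (ai == bi) ? 0 : 1;
}
```

The trick is that for non-negative floats the bit patterns sort in the same order as the values, so most of the work is handling signs, zeros and NaNs.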

Conclusion

As you can see, my modified GCC compiler does not generate native assembly code for floating-point operations. Instead, it generates soft-float function calls which I had to write on my own (or to borrow from the Internet). 

Saturday, 26 December 2020

Networking with the FPGA computer

This is a followup of my original post.

As I mentioned in the SPI-related post, I have added the SPI interface to my FPGA computer. Not one, but two: one for the SD card, and the other one for the Ethernet card. Today I am going to talk about the Ethernet.

First of all, I have used the ENC28J60 module, which I also use for Ethernet connectivity on my Raspberry Pi Zero and Arduino/ESP32. This is a rather simple module, which uses SPI as the interface to the host computer. Since I have already used this module with the Arduino and ESP32, I have decided to reuse the corresponding Arduino library for this module and to adjust it to work with my FPGA computer.

The library I used for the Arduino is: https://github.com/njh/EtherCard

This library is written in C++. Since I haven't finished porting GCC to my FPGA, I don't have the support for the C++. This means that I had to unwrap the code from C++ to pure C. When I finished that, the only thing that I had to do was to replace Arduino-based SPI code with my FPGA SPI code. For example, one of the original functions was:

static void writeOp (byte op, byte address, byte data) {
    enableChip();
    SpiPtr->beginTransaction(SPISettings(spiClk, MSBFIRST, SPI_MODE0));
    SpiPtr->transfer(op | (address & ADDR_MASK));
    SpiPtr->transfer(data);
    SpiPtr->endTransaction();
    disableChip();
}

My code is:

void enc28j60WriteOp(uint8_t op, uint8_t address, uint8_t data)
{
        chipSelectLowE();
        // issue write command
        spiSendE(op | (address & ADDR_MASK));
        // write data
        spiSendE(data);
        chipSelectHighE();
}

The support for the TCP/IP protocol is built in the EtherCard library. It also had to be modified from C++ to C. After that, I was able to use the library.

Besides simple TCP/IP examples, I have decided to make use of this network support. I have added network drive support in my BASIC interpreter - I have added the DRIVE command. DRIVE 0 selects the SD card. DRIVE 1 selects the UART-based drive which communicates with the Raspbootin application on the PC via UART, while DRIVE 2 sets the network drive which also communicates with the Raspbootin application on the PC, but this time over Ethernet.

Here is the snapshot of the FPGA screen:


In the example above, the drive was set to be the network drive and directory was listed on the PC. The C1.BAS program was loaded from the PC via network drive.

The C code for the DIR command is here:

// called when the client request is complete
void my_callback (uint8_t status, uint16_t off, uint16_t len) {
    memcpy(to_print_buff, eth_buffer + off, len);
    to_print_len = len;
}

...
// DRIVE 2 - ETHERNET NETWORK DRIVE
to_print_len = 0;
browseUrl("/dir", "", server_ip, 0, my_callback);
for (i = 0; i < 1000; i++) { // approx. 1MB max file size
    packetLoop(enc28j60PacketReceive(4500, eth_buffer));
    if (to_print_len > 0) {
        to_print_buff[to_print_len] = 0;
        printf("%s\n", to_print_buff);
        to_print_len = 0;
        return;
    }
}
printf("NETWORK TIMEOUT\n");
...

The code on the PC side is here:

if (req.startsWith("/dir")) {
    File currFile = new File(Rest.path);
    File dir = currFile.getParentFile().getParentFile();
    System.out.println(dir.getCanonicalPath());
    File[] files = dir.listFiles();
    StringBuilder sb = new StringBuilder();
    for (File f : files) {
        if (f.isDirectory()) {
            sb.append("<" + f.getName() + ">");
            sb.append("\n");
        }
    }
    for (File f : files) {
        if (f.isFile()) {
            sb.append(f.getName());
            sb.append("\n");
        }
    }
    String str = sb.toString();
    int size = str.length();
    System.out.println("size: " + size);
    out.print(str);
}

Conclusion

Since I have implemented the SPI interface on the FPGA computer, it was possible to connect the ENC28J60 Ethernet module to it. I cannot stress enough how important it was for me to be able to port GCC to my platform. That allows me to use all sorts of C code instead of programming in assembly language.

Network drive also makes development easy since I do all the programming on my PC, I compile it using the GCC, and then I load that program on the FPGA computer over Ethernet. No need to transfer the program to the SD card (card dance), or to use slower UART. This time I use the Ethernet!


Thursday, 24 December 2020

Adding PS/2 mouse to my FPGA computer

This is a followup of my original post.

So far I had only a PS/2 keyboard on my FPGA computer. The time has come to add the mouse, too. Without any investigation of how a PS/2 mouse works, I first tried to plug the mouse into my PS/2 keyboard connector and watch what would come from it. It didn't work. The keyboard worked, but the mouse didn't. I expected that the mouse would send bytes as I move or click, but it didn't. After a brief investigation, I found out that the PS/2 mouse needs an initialization in order to start sending bytes to the computer.

PS/2 is actually a bidirectional interface. Both computer and mouse/keyboard can send bytes to the other. Well, the initialization sequence for the mouse actually means that the computer needs to send one byte to the mouse. Unfortunately, that is not so simple. In order to send a command to the mouse, host (computer) needs to set both data and clock lines low for a given period of time, then to release both lines, and then to start setting bits of the command in synchronization with the clock that has just started to arrive from the mouse.
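
To make the "setting bits of the command" part more concrete, here is a minimal sketch of how the bits of one host-to-device byte are laid out on the data line: a start bit (0), eight data bits with the least significant bit first, an odd parity bit and a stop bit (1). The function name ps2_host_frame is made up, and which command byte the controller module actually sends is not shown in this post; 0xF4 (Enable Data Reporting) is used below only as a typical example:

```c
#include <stdint.h>

/* Build the 11-bit PS/2 host-to-device frame for one command byte.
   Bit 0 of the result is sent first:
     bit 0      start bit (always 0)
     bits 1..8  data, least significant bit first
     bit 9      odd parity (data bits + parity contain an odd number of 1s)
     bit 10     stop bit (always 1) */
uint16_t ps2_host_frame(uint8_t cmd)
{
    unsigned ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (cmd >> i) & 1u;

    /* parity bit is 1 only when the data has an even number of 1s */
    uint16_t parity = (ones % 2 == 0) ? 1u : 0u;

    return (uint16_t)(((uint16_t)cmd << 1) | (parity << 9) | (1u << 10));
}
```

For 0xF4 the data already contains five 1s, so the parity bit is 0 and the frame is 0x5E8. The host shifts this value out one bit per clock pulse generated by the mouse.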

Fortunately, there is a module that already does all these steps, and can be found here.

I have replaced my original PS2 module with this one and now I have two ports in my computer:

// ####################################
// PS/2 keyboard instance
// ####################################
wire [7:0] ps2_data;
wire ps2_received;
reg [7:0] ps2_data_r;

PS2_Controller #(.INITIALIZE_MOUSE(0)) PS2 (
    // Inputs
    .CLOCK_50           (CLOCK_50),
    .reset              (~KEY[0]),

    // Bidirectionals
    .PS2_CLK            (gpio0[33]),
    .PS2_DAT            (gpio0[31]),

    // Outputs
    .received_data      (ps2_data),
    .received_data_en   (ps2_received)
); 

// ####################################
// PS/2 mouse instance
// ####################################
wire [7:0] ps2_data_mouse;
wire ps2_received_mouse;
reg [7:0] ps2_data_r_mouse;

PS2_Controller PS2_mouse (
    // Inputs
    .CLOCK_50           (CLOCK_50),
    .reset              (~KEY[0]),

    // Bidirectionals
    .PS2_CLK            (gpio0[2]),
    .PS2_DAT            (gpio0[4]),

    // Outputs
    .received_data      (ps2_data_mouse),
    .received_data_en   (ps2_received_mouse)
); 

The default value for the INITIALIZE_MOUSE parameter is 1, so the mouse controller initializes the mouse at reset.

I have allocated another IRQ for the mouse: 

localparam IRQ_PS2_MOUSE   = 5;

In the main irq loop, the mouse actually triggers the CPU interrupt #5:

always @ (posedge clk100) begin
    ...
    // ############################### IRQ2 - PS/2 keyboard #############################
    if (ps2_received) begin
        ps2_data_r <= ps2_data;
        // if we have received a byte from the keyboard, we will trigger the IRQ#2
        irq[IRQ_PS2] <= 1'b1;
    end
    else 
    begin
        irq[IRQ_PS2] <= 1'b0;
    end
    // ############################### IRQ5 - PS/2 mouse #############################
    if (ps2_received_mouse) begin
        ps2_data_r_mouse <= ps2_data_mouse;
        // if we have received a byte from the mouse, we will trigger the IRQ#5
        irq[IRQ_PS2_MOUSE] <= 1'b1;
    end
    else 
    begin
        irq[IRQ_PS2_MOUSE] <= 1'b0;
    end
    ...
end

When the CPU detects that some bit in the irq register is set to 1, it triggers the interrupt handler routine. It does that by first checking if the interrupt vector is not zero. After that, the CPU pushes the current PC register and flags to the stack and then jumps to the interrupt handling routine. 

The mouse interrupt routine receives three bytes from the PS/2 mouse:

unsigned short int *PORT_PS2_MOUSE  = (unsigned short int *)(0x80000000 + 800)  ; // port for PS2 mouse

void ps2_mouse_irq_triggered()
{
asm 
    (
        "push r0\npush r1\npush r2\npush r3\npush r4\npush r5\npush r6\npush r7\npush r8\npush r9\npush r10\npush r11\npush r12\npush r13\n"
    );

    mouse_byte[mouse_counter++] = *PORT_PS2_MOUSE;
    if (mouse_counter == 3)
        mouse_counter = 0;

    asm 
    (
        "pop r13\npop r12\npop r11\npop r10\npop r9\npop r8\npop r7\npop r6\npop r5\npop r4\npop r3\npop r2\npop r1\npop r0\nmov.w sp,r13\npop r13\niret"
    );
}

void init_mouse() {
    mouse_counter = 0;
    *PS2_MOUSE_HANDLER_INSTR    = 1;
    *PS2_MOUSE_HANDLER_ADDR     = (int)&ps2_mouse_irq_triggered;
}

Three bytes for each mouse event come one by one. When all three bytes arrive, we are ready to process them in the main program:

        if ((mouse_counter == 0) && (
            mouse_byte[0] != old_mouse_byte[0] ||
            mouse_byte[1] != old_mouse_byte[1] || 
            mouse_byte[2] != old_mouse_byte[2])) {
                sprintf(str, "mouse: %d, %d, %d", mouse_byte[0], mouse_byte[1], mouse_byte[2]);
                draw(10, 20, RED, str);
                
                old_mouse_byte[0] = mouse_byte[0];
                old_mouse_byte[1] = mouse_byte[1];
                old_mouse_byte[2] = mouse_byte[2];
                ...

The first byte gives the button status: which buttons are pressed. The second byte gives the x-axis speed of movement, or the amount the mouse has moved since the previous report, while the third byte does the same, just for the y-axis. Both the second and the third byte are 8-bit signed values, meaning that if you read a value greater than 127, the value is negative (its magnitude can be calculated by subtracting it from 256).
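
Here is a minimal sketch of how the three bytes could be decoded following the description above. The struct and function names are made up for this example. The button bit positions (bit 0 left, bit 1 right, bit 2 middle) are the standard PS/2 layout, and the movement bytes are treated as plain 8-bit signed values, as described in this post:

```c
#include <stdint.h>

/* Hypothetical decoded form of one 3-byte mouse packet */
typedef struct {
    int left, right, middle;   /* 1 when the button is pressed */
    int dx, dy;                /* signed movement since the last packet */
} mouse_event_t;

/* Convert a raw byte to a signed value: values above 127 are negative */
static int to_signed8(uint8_t v)
{
    return (v > 127) ? (int)v - 256 : (int)v;
}

mouse_event_t decode_mouse_packet(const uint8_t p[3])
{
    mouse_event_t ev;
    ev.left   = p[0] & 1;          /* bit 0: left button */
    ev.right  = (p[0] >> 1) & 1;   /* bit 1: right button */
    ev.middle = (p[0] >> 2) & 1;   /* bit 2: middle button */
    ev.dx = to_signed8(p[1]);
    ev.dy = to_signed8(p[2]);
    return ev;
}
```

For example, the bytes 0x09, 5, 0xFB decode to the left button pressed, dx = 5 and dy = -5.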

The whole demo can be seen here:



Conclusion

Adding PS/2 mouse was not that complicated once I managed to find the proper Verilog controller. After that, it was just the matter of allocating one more interrupt and writing handlers for it.


Saturday, 19 December 2020

To BLIT or not to BLIT

This is a followup of my original post.

I have recently implemented the BLIT instruction for my FPGA computer. It is the simplest version of BLIT: copy the given number of bytes from the source memory location to the destination memory location. The syntax is like this:

mov.w r1, 1024  # destination address is in r1
mov.w r2, 9024  # source address is in r2
mov.w r3, 8000  # number of bytes is in r3
blit            # copy bytes

Registers r1, r2 and r3 are hardcoded. Later I might make it more flexible.

Results are quite impressive. When I copy 32KB using memcpy (not using BLIT), it takes approximately 100 milliseconds. When I use the BLIT instruction, it takes one millisecond!

How is BLIT implemented? Here is the Verilog code:

4'b1000: begin
    // BLIT (r1, r2, r3) - r1 - dst; r2 - src; r3 - count
    case (mc_count)
        0: begin
            addr <= regs[2] >> 1;
            regs[2] <= regs[2] + 2;
            regs[3] <= regs[3] - 2;
            mc_count <= 1;
            next_state <= EXECUTE;
            state <= READ_DATA;
        end
        1: begin
            addr <= regs[1] >> 1;
            data_to_write <= data_r;
            regs[1] <= regs[1] + 2;
            next_state <= EXECUTE;
            state <= WRITE_DATA;
            if (regs[3] <= 0) begin
                mc_count <= 2;
            end
            else 
                mc_count <= 0;
        end
        2: begin
            state <= CHECK_IRQ;
            pc <= pc + 2;
        end
    endcase
end

In the code above we see that the CPU starts memory read at the address pointed by the r2 register in the first mc_count cycle. Then it obtains the word (two bytes) from memory and writes them to the address pointed by the r1 register. Both r1 and r2 are incremented by two and the r3 register is decremented by two; when it reaches zero, the instruction finishes.
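
The same sequence can be modelled in plain C, which is handy for checking the behaviour on a PC. This is only a sketch: the function name blit_model is made up, memory is modelled as an array of 16-bit words, and the count is assumed to be an even number of bytes:

```c
#include <stdint.h>

/* Software model of the BLIT microcode.
   mem   - memory as an array of 16-bit words (byte address >> 1 is the index)
   dst   - destination byte address (r1)
   src   - source byte address (r2)
   count - number of bytes to copy (r3), assumed to be even */
void blit_model(uint16_t *mem, uint32_t dst, uint32_t src, int32_t count)
{
    do {
        /* step 0: read one word from the source, advance src, decrease count */
        uint16_t word = mem[src >> 1];
        src += 2;
        count -= 2;

        /* step 1: write the word to the destination, advance dst */
        mem[dst >> 1] = word;
        dst += 2;
    } while (count > 0);    /* step 2 is reached once the count is used up */
}
```

Note that, like the hardware, the model always copies at least one word, because the count is checked only after the first write.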

Conclusion

The BLIT instruction does not execute in parallel with the CPU. It blocks the CPU while executing. Even with this constraint, it is approximately a hundred times faster than copying bytes across the memory using the memcpy function. Therefore, it is worth using.

Tuesday, 30 June 2020

SPI interface on my FPGA computer

This is a follow-up of my original FPGA computer post.

SPI interface is a kind of a standard when it comes to connecting various peripherals to a computer (or, at least to a microcontroller). There is also I2C interface, but I will focus on the SPI in this post.

SPI stands for Serial Peripheral Interface. It is organized as a master-slave communication. If we presume that our FPGA computer is master, then the peripheral will be slave.

It usually has four important pins:
1. MISO (Master In Slave Out) - a wire which is used to transport data from slave to the master device,
2. MOSI (Master Out Slave In) - a wire which is used to transport data from master to the slave device,
3. SCK - clock (all the data transport is synchronized using this clock line), and
4. SS (Slave Select) - when active, the slave is selected (sometimes it is called CS - chip select). With this wire, it is possible to connect several peripherals to the same three mentioned wires (MISO, MOSI and SCK) and to have separate SS wires to each peripheral.
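
A useful way to think about MISO and MOSI is that the master and the slave each have an 8-bit shift register, and eight clock pulses swap their contents. Here is a minimal software model of that exchange, assuming the most significant bit goes first; the function name spi_exchange is made up for this example:

```c
#include <stdint.h>

/* Software model of one full-duplex SPI byte transfer (MSB first).
   On every clock pulse the master shifts one bit out on MOSI and
   one bit in from MISO, while the slave does the opposite. */
void spi_exchange(uint8_t *master, uint8_t *slave)
{
    for (int i = 0; i < 8; i++) {
        uint8_t mosi = (*master >> 7) & 1u;   /* bit leaving the master */
        uint8_t miso = (*slave >> 7) & 1u;    /* bit leaving the slave */

        *master = (uint8_t)((*master << 1) | miso);
        *slave  = (uint8_t)((*slave << 1) | mosi);
    }
}
```

This is also why reading from an SPI device always means sending something: to receive a byte, the master has to clock out a dummy byte (often 0xFF).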

Why did I choose to use the SPI on my computer? First of all, SD cards have SPI built-in. This means that every SD card is actually a SPI slave device. Next, I use the ENC28J60 Ethernet module for my Arduino/ESP32/RaspberryPi Zero devices for the Ethernet connectivity. That module has SPI interface, too.


How did I integrate SPI into my FPGA computer? I have found a very nice implementation in Verilog here:
https://github.com/nandland/spi-master

BTW, that guy has an excellent YouTube channel here: https://www.youtube.com/channel/UCsdA-aNqtMA1_2T15aXePWw

Next I had to integrate that module into my FPGA computer. I have decided to allocate an interrupt for the incoming data from the SPI and to ignore the module-controlled SS pin (I will manually activate SS signal from code, instead of letting that job to the SPI module):
// ####################################
// SPI Master instance
// ####################################
wire spi_start;
wire [7:0] spi_in;
reg [7:0] spi_out;
wire spi_ready;
wire spi_received;
reg [7:0] spi_in_r;
reg fake_CS;

SPI_Master_With_Single_CS spi0 (
.i_Clk(clk100),
.i_Rst_L(KEY[0]),
.i_TX_Count(1),
.i_TX_DV(spi_start),
.o_RX_Byte(spi_in),
.i_TX_Byte(spi_out),
.o_RX_DV(spi_received),
.o_TX_Ready(spi_ready),

.o_SPI_MOSI(gpio0[32]),
.i_SPI_MISO(gpio0[30]),
.o_SPI_Clk(gpio0[28]),
.o_SPI_CS_n(fake_CS)
);


The code above creates a SPI module named spi0 and connects it to a set of wires and registers. Next, in the main interrupt part, when the spi_received wire goes high (a byte has arrived on SPI), the IRQ_SPI interrupt is triggered:
// ##################### IRQ3 - SPI Master #####################
if (spi_received) begin
spi_in_r <= spi_in;
// if we have received a byte from the MISO,
  // we will trigger the IRQ#3
irq[IRQ_SPI] <= 1'b1;
end
else
begin
irq[IRQ_SPI] <= 1'b0;
end


In the CPU module, the IRQ_SPI interrupt causes processor to go to the predefined interrupt handler routine at the address of 56:
else if (irq_r[IRQ_SPI]) begin
// SPI byte received
pc <= 16'd56;
addr <= 16'd28;
irq_r[IRQ_SPI] <= 0;
end


All you have to do is to put some code at the address of 56 and to return from the interrupt handler routine using the IRET assembly instruction:

spi_irq_triggered:
    push r0
    ld.w    r0, [PORT_SPI_IN]   # PORT_SPI_IN.5_1, PORT_SPI_IN
    ld.s    r0, [r0]    # _2, *PORT_SPI_IN.5_1
    zex.s   r0, r0  # _3, _2
    st.w    [received_byte], r0 # received_byte, _3
    mov.w   r0, 1   # tmp29,
    st.w    [received_from_slave], r0   # received_from_slave, tmp29
    pop r0
    iret

Now that I have the C compiler, the SPI interrupt handler routine can be written in C:
void init_spi()
{
    *SPI_HANDLER_INSTR  = 1;
    *SPI_HANDLER_ADDR   = (int)&spi_irq_triggered;
}

void spi_irq_triggered()
{
    received_byte = *PORT_SPI_IN;
    received_from_slave = 1;
    asm 
    (
        "mov.w sp,r13\npop r13\niret"
    );
}

In order to read the received byte, and to send some byte to the SPI, we need to implement some IO operations. As usual, I have done that in both direct and memory-mapped way. Here is the direct way using the IN and OUT assembly instructions:
// OUT [xx], reg
4'b0100: begin
`ifdef DEBUG
$display("%2x: OUT [%4d], r%-d",ir[3:0], data_r, (ir[15:12]));
`endif
case (mc_count) 
0: begin
// get the xx
addr <= (pc + 2) >> 1;
pc <= pc + 2;
mc_count <= 1;
next_state <= EXECUTE;
state <= READ_DATA;
end
1: begin
mbr <= data_r;
mc_count <= 2;
end
2: begin
case (mbr)
...
PORT_SPI_OUT: begin
spi_out <= regs[ir[15:12]];
spi_start <= 1'b1;
end
...
default: begin
end
endcase  // end of case (data)
mc_count <= 3;
end
3: begin
tx_send <= 1'b0;
spi_start <= 1'b0;
spi_start1 <= 1'b0;
state <= CHECK_IRQ;
pc <= pc + 2;
end
default: begin
end
endcase
end // end of OUT [xx], reg

What happens above? The OUT instruction is written in memory using four bytes. First two bytes are OPCODE of the instruction, and the second two bytes hold the port number (limiting the total number of available ports to 65536, but I think it is enough). 

In the first cycle (step 0) of the OUT instruction, the CPU sets the address to be read to be next two bytes after those two OPCODE bytes. Then the CPU waits for those two bytes to arrive (step 1). 

Then the CPU checks which IO port number has been read from the memory, and if the port number is PORT_SPI_OUT, it means that we are trying to send some byte to the SPI, and the CPU sends the data to that port (step 2). In step 3 the CPU finishes sending and sets the next CPU state to be the IRQ check.

And, here is the memory-mapped IO way:
// Memory mapped IO
case (addr & 32'h3FFFFFFF)
...
PORT_SPI_OUT/2: begin
spi_out <= data_to_write;
spi_start <= 1'b1;
end
...
endcase

Memory-mapped is a bit simpler, but does the same job of sending a byte to the SPI.

OK, now that we have the working SPI interface, how can we use it to work with the SD card? I have made a Frankenstein-like code merging the original Arduino SD card code (written in C++) with some other pieces of code from GitHub, so that now I have some elementary support for the SD cards. For example:

uint8_t sdcard_init(){
  writeCRC_ = errorCode_ = inBlock_ = partialBlockRead_ = type_ = 0;
  // 16-bit init start time allows over a minute
  uint32_t t0 = (uint32_t)get_millis();
  uint32_t arg;
   // must supply min of 74 clock cycles with CS high.
  for (uint8_t i = 0; i < 10; i++) spiSend(0XFF);

  chipSelectLow();

  // command to go idle in SPI mode
  while ((status_ = cardCommand(CMD0, 0)) != R1_IDLE_STATE) {
    if (((uint32_t)get_millis() - t0) > SD_INIT_TIMEOUT) {
      error(SD_CARD_ERROR_CMD0);
      goto fail;
    }
  }
 
  // check SD version
  if ((cardCommand(CMD8, 0x1AA) & R1_ILLEGAL_COMMAND)) {
    type(SD_CARD_TYPE_SD1);
  } else {
    // only need last byte of r7 response
    for (uint8_t i = 0; i < 4; i++) status_ = spiRec();
    if (status_ != 0XAA) {
      error(SD_CARD_ERROR_CMD8);
      goto fail;
    }
    type(SD_CARD_TYPE_SD2);
  }
  ... }

In the code above, we see that there are some spi-related functions, like spiSend() or spiRec(). Here are those:

void spiSend(int b)
{
    received_from_slave = 0;
    unsigned short int busy;
    do 
    { 
        busy = *PORT_SPI_OUT_BUSY;
    } while (busy);
    *PORT_SPI_OUT = b; //send the byte to the SPI
    
    do 
    { 
        busy = *PORT_SPI_OUT_BUSY;
    } while (busy);
}

uint8_t spiRec(void) {
    spiSend(0xFF);
    return read_spi();
}
int read_spi()
{
    while (!received_from_slave || *PORT_SPI_OUT_BUSY) 
    {
    }
    return received_byte;
}

Now, when we look at the spi_irq_triggered() function, we see that whenever that interrupt routine is triggered by the incoming byte from the SPI, that byte is stored in the received_byte variable. That byte is returned from the read_spi() function to the spiRec() function, and from that to the caller function.

OK, what next? How is this used? All of the interaction with the SD card is done by sending card commands and reading and writing 512 bytes of data, in so-called blocks:
uint8_t cardCommand(uint8_t cmd, uint32_t arg) {
  // end read if in partialBlockRead mode
  readEnd();

  // select card
  chipSelectLow();

  // wait up to 300 ms if busy
  waitNotBusy(300);

  // send command
  spiSend(cmd | 0x40);

  // send argument
  for (int8_t s = 24; s >= 0; s -= 8) spiSend(arg >> s);

  // send CRC
  uint8_t crc = 0XFF;
  if (cmd == CMD0) crc = 0X95;  // correct crc for CMD0 with arg 0
  if (cmd == CMD8) crc = 0X87;  // correct crc for CMD8 with arg 0X1AA
  spiSend(crc);

  // wait for response
  for (uint8_t i = 0; ((status_ = spiRec()) & 0X80) && i != 0XFF; i++);
  return status_;
}
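
The two hardcoded CRC bytes (0x95 and 0x87) are not magic numbers. Each one is the 7-bit CRC of the first five bytes of the command, shifted left by one, with the lowest bit set to 1. Here is a minimal sketch that computes them, assuming the CRC7 polynomial x^7 + x^3 + 1 used by SD cards and assuming that CMD0 and CMD8 have the numeric values 0 and 8; the function names are made up for this example:

```c
#include <stddef.h>
#include <stdint.h>

/* CRC7 with polynomial x^7 + x^3 + 1 (0x09), processed MSB first */
uint8_t crc7(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t d = data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc <<= 1;
            /* bit 7 of crc now holds the bit shifted out of the 7-bit register */
            if ((d & 0x80) ^ (crc & 0x80))
                crc ^= 0x09;
            d <<= 1;
        }
    }
    return crc & 0x7F;
}

/* Build the 6-byte frame that cardCommand() sends over SPI */
void sd_command_frame(uint8_t cmd, uint32_t arg, uint8_t frame[6])
{
    frame[0] = (uint8_t)(cmd | 0x40);             /* start bit 0, transmission bit 1 */
    frame[1] = (uint8_t)(arg >> 24);              /* argument, most significant byte first */
    frame[2] = (uint8_t)(arg >> 16);
    frame[3] = (uint8_t)(arg >> 8);
    frame[4] = (uint8_t)arg;
    frame[5] = (uint8_t)((crc7(frame, 5) << 1) | 1);   /* CRC7 plus end bit */
}
```

Building the frame for command 0 with argument 0 ends with 0x95, and command 8 with argument 0x1AA ends with 0x87. Hardcoding is fine here because in SPI mode the card checks the CRC only for these early commands by default, which is why 0xFF is sent for everything else.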

uint8_t readData(uint32_t block,
        uint16_t offset, uint16_t count, uint8_t *dst) {
  uint16_t n;
  if (count == 0) return true;
  if ((count + offset) > 512) {
    goto fail;
  }

  #ifdef FAT_DEBUG
  printf("block: %d, offset: %d, count: %d\n", block, offset, count);
  #endif

  if (!inBlock_ || block != block_ || offset < offset_) {
    block_ = block;
    // use address if not SDHC card
    if (get_type()!= SD_CARD_TYPE_SDHC) block <<= 9;
    if (cardCommand(CMD17, block)) {
      error(SD_CARD_ERROR_CMD17);
      goto fail;
    }
    if (!waitStartBlock()) {
      goto fail;
    }
    offset_ = 0;
    inBlock_ = 1;
  }

  // skip data before offset
  for (;offset_ < offset; offset_++) {
    spiRec();
  }
  // transfer data
  for (uint16_t i = 0; i < count; i++) {
    dst[i] = spiRec();
//    printf("%x ", dst[i]);
  }

  offset_ += count;
  if (!partialBlockRead_ || offset_ >= 512) {
    // read rest of data, checksum and set chip select high
    readEnd();
  }
  return true;

 fail:
  chipSelectHigh();
  #if FAT_DEBUG
  printf("read data error code: %d\n", errorCode_);
  #endif
  return false;
}

uint8_t writeData(uint8_t token, const uint8_t *src) {
  spiSend(token);
  for (uint16_t i = 0; i < 512; i++) {
    spiSend(src[i]);
  }
  spiSend(0xff);  // dummy crc
  spiSend(0xff);  // dummy crc

  status_ = spiRec();
  if ((status_ & DATA_RES_MASK) != DATA_RES_ACCEPTED) {
    error(SD_CARD_ERROR_WRITE);
    chipSelectHigh();
    return false;
  }
  return true;
}

uint8_t writeBlock(uint32_t blockNumber, const uint8_t *src, uint8_t blocking) {
  #if FAT_DEBUG
  printf("Write block number: %d\n", blockNumber);
  #endif
//  return true;
  // don't allow write to first block
  if (blockNumber == 0) {
    error(SD_CARD_ERROR_WRITE_BLOCK_ZERO);
    goto fail;
  }

  // use address if not SDHC card
  if (get_type() != SD_CARD_TYPE_SDHC) {
    blockNumber <<= 9;
  }
  if (cardCommand(CMD24, blockNumber)) {
    error(SD_CARD_ERROR_CMD24);
    goto fail;
  }
  if (!writeData(DATA_START_BLOCK, src)) {
    goto fail;
  }
  if (blocking) {
    // wait for flash programming to complete
    if (!waitNotBusy(SD_WRITE_TIMEOUT)) {
      error(SD_CARD_ERROR_WRITE_TIMEOUT);
      goto fail;
    }
    // response is r2 so get and check two bytes for nonzero
    if (cardCommand(CMD13, 0) || spiRec()) {
      error(SD_CARD_ERROR_WRITE_PROGRAMMING);
      goto fail;
    }
  }
  chipSelectHigh();
  return true;

fail:
  chipSelectHigh();
  return false;
}

Now that we are able to read and write 512-sized blocks, we need to figure out how the data is organized on SD cards. Well, the format is FAT32. That is an ancient format from Microsoft, but it is quite simple and is used everywhere.

The format is described on Wikipedia and in this excellent blog post: https://codeandlife.com/2012/04/02/simple-fat-and-sd-tutorial-part-1/

So, if we want, for example, to list all files in the root folder, here is the code:
file_descriptor_t fd;
int next = 0;
while ((next = getDirEntry(&fd, next)) != 0)
{
    printf("%s %d bytes, cluster: %d (%d)\n", fd.dir_entry.filename, fd.dir_entry.filesize, fd.curr_cluster, fd.dir_entry.first_cluster);
}

The key code is in the getDirEntry() function:
uint32_t getDirEntry(file_descriptor_t *fd, uint32_t index)
{
  int i,j;
  uint16_t cluster;
  uint32_t file_size;
  uint8_t b;
  uint8_t *buf = g_block_buf;
  char filename_upper[12];
  uint32_t counter = 0;

  for (i = 0; i < (dataStartBlock_ - rootDirStart_); i++)
  {
    b = readBlock(rootDirStart_ + i, g_block_buf);
    for(j = 0; j < 16; j++)
    {
      if (*(buf + j*32)==0 || *(buf + j*32)==0x2e || *(buf + j*32)==0xe5 || *(buf + j*32 + 0x0b) == 0xf)
      { 
        continue; // free, or deleted file/folder, or phantom entry for long names?
        if (counter > index)
          return 0;
      }
      
      if(counter == index)
      {
        file_size = *(buf + j*32 + 0x1c);
        file_size += *(buf + j*32 + 0x1c + 1)<<8;
        file_size += *(buf + j*32 + 0x1c + 2)<<16;
        file_size += *(buf + j*32 + 0x1c + 3)<<24;
        cluster = *(buf + j*32 + 0x1a);
        cluster += *(buf + j*32 + 0x1a + 1) << 8;
        cluster += *(buf + j*32 + 0x14 + 0) << 16;
        cluster += *(buf + j*32 + 0x14 + 1) << 24;

        strncpy(filename_upper, (char*)(buf+j*32), 11);
        filename_upper[11] = '\0';

        // fill in dir_entry
        memmove(fd->dir_entry.filename, filename_upper, 12);
        fd->dir_entry.attributes = *(buf + j*32 + 0x0b);
        memmove(fd->dir_entry.unused_attr, buf + j*32 + 0x0c, 14);
        fd->dir_entry.filesize = file_size;
        fd->dir_entry.block = rootDirStart_ + i;
        fd->dir_entry.slot = j;
        fd->dir_entry.first_cluster = cluster;
        fd->curr_cluster = cluster;
        return counter + 1;
      } else if (counter > index) {
        return 0;
      }
      counter++;
    }
  }
  return 0;
}

The code above loads chunks of 512 bytes from the root directory start block, and then tries to iterate through the directory structure until it finds the right entry, given by its index. The directory structure is this:
typedef struct
{
  char filename[12];  /** The file's name and extension, total 11 chars padded with spaces. */
  uint8_t attributes;  /** The file's attributes. Mask of the FAT_ATTRIB_* constants. */
  uint8_t unused_attr[14]; /** Attributes in directory which are unused or unsupported */
  uint16_t first_cluster;     /** The cluster in which the file's first byte resides. */
  uint32_t filesize;   /** The file's size. */
  uint32_t block; /** The number of a block from the rootDirStart_ where this entry resides. */
  uint32_t slot; /** The number of the slot in the block where this entry resides. Each slot is 32 bytes large. */
} dir_entry_t;


Since my FPGA computer is big endian, I couldn't just read bytes for file size and cluster address. Instead, I had to compute those numbers byte-by-byte.
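
FAT stores multi-byte numbers in little-endian order (lowest byte first), so the byte-by-byte approach works on any CPU regardless of its own endianness. Here is a minimal sketch of two helpers that could wrap it; the names read_le16 and read_le32 are made up for this example and are not part of my code above:

```c
#include <stdint.h>

/* Read a 16-bit little-endian value from a byte buffer */
uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Read a 32-bit little-endian value from a byte buffer.
   Each byte is widened to uint32_t before shifting, so the
   shift by 24 cannot overflow a signed int. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

With these helpers, the file size of the entry in slot j would be read_le32(buf + j*32 + 0x1c), and the low and high halves of the first cluster would be read_le16(buf + j*32 + 0x1a) and read_le16(buf + j*32 + 0x14).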

Conclusion

The initial implementation of the SPI was simple enough. It is what you can do with it that matters. I was able to use the SPI to integrate an SD card into my FPGA computer. That way, I no longer need the Arduino/ESP32 to play the role of SD card reader, as I used to.