ECE4760 rp2040 DMA machine

source link: https://people.ece.cornell.edu/land/courses/ece4760/RP2040/C_SDK_DMA_machine/DMA_machine_rp2040.html

Cornell University ECE4760
Direct Memory Access computing machine
RP2040

DMA on RP2040

DMA uses memory controllers separate from the CPU to accelerate data movement between memory locations, or between peripherals and memory. The RP2040 has 12 DMA channels which can stream an aggregate of over 100 megabytes/sec, in many cases without affecting CPU performance. There are a huge number of options available to set up a DMA transfer. You can think of a DMA channel controller as a separate, programmable processor whose main job is moving data. Memory on the RP2040 is arranged as a bus matrix with separate memory bus control masters for each ARM core and for the DMA system, and several memory bus targets accessed by the masters. Each bus target can be accessed on each machine cycle.

Here we use the DMA subsystem to produce a complete computing system, independent of the main ARM CPUs. The DMA machine makes use of memory-copy ability, transport-triggered operations, and self-modifying code. The code consists of a sequence of DMA block descriptors stored in an array. The implemented operations are Turing complete, and run at about the speed of an Arduino. About 8 million DMA blocks/second can be fetched/executed. There is a history of using only memory-moves to build a general CPU. In 2013 Stephen Dolan published x86 mov is Turing-Complete, describing an example of a one-opcode machine. The paper Run-DMA by Michael Rushanan and Stephen Checkoway shows how to do this with one version (Raspberry Pi 2) of ARM DMA. The DMA system on the RP2040 has more transport-triggered functions and is a little easier to build. Joseph Primmer and I built a DMA processor using the Microchip PIC32 DMA system. Addition and branching had to be based on table-lookup. See DMA Weird machine.

The DMA machine is a fetch-execute CPU where the fetch function is done by one DMA channel, which loads DMA control block images from RAM into another (execute) DMA channel. The 'program' which is loaded consists of a carefully crafted series of DMA control blocks which together act as a general purpose computer. By using DMA1 blocks to modify following DMA1 control block images in the array, just before they are transferred to the hardware DMA1 control registers, we can perform addition, increments, conditional jumps, and/or/not logic operations, and any other operations required. The design is made easier by several transport-triggered actions in the DMA subsystem. These include an adder in the 'channel sniffer' and atomic SET/CLEAR/XOR write functions on all SFRs.

The basic fetch/execute machine uses the channel DMA0 read address as a program counter. Every fetch that occurs leaves the read address pointing to the next block location. DMA0 reads the next block from the RAM array and copies it to the DMA1 channel hardware control registers, then chains to the newly loaded DMA1 channel. The DMA1 channel performs whatever data move is specified, then chains to DMA2. DMA2 resets the DMA0 write_address to point to DMA1 control registers. Program branching is implemented by using DMA1 to load a new DMA0 read address to the DMA0 control registers.

Writing a program of DMA blocks is very much like programming in some strange assembly language for a machine with one accumulator register and only memory-to-accumulator operations.

The following diagram is an attempt to summarize all this madness.
Black arrows are data flow. Blue arrows represent chaining between channels.
The control block array is just an array of ints that are read by DMA0 in sets of four.

DMAcpu2.png
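
The fetch/execute loop described above can be modeled on a desktop machine in a few lines of C. This is only a host-side sketch, not RP2040 code: `mem[]`, `PC`, `PROG`, `put_block`, `run`, and the use of `ctrl == 0` as a halt are all assumptions made for illustration, and addresses are word indices rather than bus addresses. Each block is four words (read address, write address, count, control), every fetch advances the program counter by one block, and a jump is just a block whose write address is the program counter.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64u   /* size of the toy address space (assumption) */
#define PC        0u    /* mem[PC] models the DMA0 read address */
#define PROG      16u   /* first word of the block array (assumption) */

uint32_t mem[MEM_WORDS];

/* Store one 4-word block image: read, write, count, control. */
static void put_block(uint32_t at, uint32_t rd, uint32_t wr, uint32_t n, uint32_t ctrl) {
    assert(at + 3u < MEM_WORDS);
    mem[at] = rd; mem[at + 1u] = wr; mem[at + 2u] = n; mem[at + 3u] = ctrl;
}

/* Fetch and execute blocks until a block with ctrl == 0 (model-only halt). */
static int run(int max_steps) {
    int steps = 0;
    while (steps < max_steps) {
        uint32_t pc = mem[PC];
        assert(pc + 3u < MEM_WORDS);
        uint32_t rd = mem[pc], wr = mem[pc + 1u], n = mem[pc + 2u], ctrl = mem[pc + 3u];
        mem[PC] = pc + 4u;                 /* fetch leaves PC at the next block */
        if (ctrl == 0u) break;
        assert(rd < MEM_WORDS && wr < MEM_WORDS);
        for (uint32_t i = 0; i < n; i++) mem[wr] = mem[rd];   /* no address increment */
        steps++;
    }
    return steps;
}

/* Copy a variable, then jump over a block that would overwrite the copy. */
uint32_t demo_jump(void) {
    for (uint32_t i = 0; i < MEM_WORDS; i++) mem[i] = 0u;
    mem[1] = 42u;            /* source variable */
    mem[3] = PROG + 12u;     /* jump target: address of the halt block */
    mem[4] = 99u;            /* value that must never be copied */
    put_block(PROG + 0u,  1u, 2u, 1u, 1u);   /* mem[2] = mem[1] */
    put_block(PROG + 4u,  3u, PC, 1u, 1u);   /* unconditional jump */
    put_block(PROG + 8u,  4u, 2u, 1u, 1u);   /* skipped by the jump */
    put_block(PROG + 12u, 0u, 0u, 0u, 0u);   /* halt */
    mem[PC] = PROG;
    run(100);
    return mem[2];
}
```

Because the fetch reads block images straight from `mem[]`, a block that writes into a later block image changes what that later block does, which is the self-modifying behavior the real machine relies on.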

In addition to a straight copy operation, there are a few transport-triggered operations in the RP2040 DMA system which happen as a side effect of reading or writing a specific address:

  • Writing to certain shadow registers associated with each special function register (SFR) clears, sets, or XORs bits in the SFR.
    The DMA sniffer data register is one SFR that can be used this way.
    This allows universal logic operations in each SFR as a result of transferring data to the register.
    Every peripheral control register (ports, timers, DMA, everything) is an SFR with three shadow registers!
    • For two logic bits A and B:
    • load B then SET bits using A as a mask implements A OR B
    • load B then CLR bits using NOT(A) as a mask implements A AND B
    • load B then XOR bits using A implements A XOR B
  • The DMA sniffer system itself supports computing a CRC32 on-the-fly while a channel is transferring data.
    The sniffer can also do a running add while transferring data. The 32-bit add capability makes
    other operations much easier to implement.
  • Data being copied out of the sniffer data register can be logically inverted or bit-reversed.
  • Any DMA channel transfer can be byte-swapped.
  • Of course, writing certain SFRs may have system functions.
    For instance, writing to an i/o port sets the value of output pins.
    Or setting an interrupt flag will force a cpu interrupt.
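
As a rough illustration of where the shadow registers live, the sketch below computes the XOR, SET, and CLR alias addresses for the sniffer data register. The base address, register offset, and alias offsets are quoted from memory of the RP2040 datasheet and should be checked against the SDK headers. The macro and function names here are hypothetical, and the original page does not show how its `DMA_SNIFF_DATA_SET`-style symbols are defined.

```c
#include <assert.h>
#include <stdint.h>

/* Values as recalled from the RP2040 datasheet; verify before relying on them. */
#define DMA_BASE_ADDR      0x50000000u  /* DMA register block */
#define SNIFF_DATA_OFFSET  0x438u       /* sniffer data register */
#define ALIAS_XOR_OFFSET   0x1000u      /* write here to XOR bits */
#define ALIAS_SET_OFFSET   0x2000u      /* write here to set bits */
#define ALIAS_CLR_OFFSET   0x3000u      /* write here to clear bits */

/* Address of the chosen shadow (alias) register for a peripheral register. */
uint32_t alias_addr(uint32_t reg_addr, uint32_t alias_offset) {
    assert(alias_offset == ALIAS_XOR_OFFSET ||
           alias_offset == ALIAS_SET_OFFSET ||
           alias_offset == ALIAS_CLR_OFFSET);
    return reg_addr + alias_offset;
}
```

A DMA block whose write address is one of these aliases performs the logic operation as a side effect of the copy, with no CPU involvement.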

The programming process has to map these unusual primitive operations into familiar mathematical and logical operations, and some form of conditional branch or jump. The sniffer add operation and the bitwise SFR operations means we can directly implement these functions.
But remember that every basic operation is just a data move.

For add, a sequence of DMA blocks could be:

  1. move one operand to the sniffer_data register
  2. move the other operand to the bit_bucket (discard) with sniffer enabled (this does the add)
  3. move sniffer_data register to the result address.

For shift-left we just do an ADD of a variable to itself (multiply by two).
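
The add and shift-left recipes can be checked with a small host-side model of the sniffer in add mode. The type and function names (`sniffer_t`, `sniff_load`, `sniff_pass`, `dma_add`, `dma_shl1`) are made up for this sketch. It only models the arithmetic, not the DMA hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the sniffer data register in add mode. */
typedef struct { uint32_t data; } sniffer_t;

/* Plain DMA write to sniff_data: replaces the contents. */
static void sniff_load(sniffer_t *s, uint32_t v) { s->data = v; }

/* Transfer with the sniffer enabled: the word is added as it passes through. */
static void sniff_pass(sniffer_t *s, uint32_t v) { s->data += v; }

uint32_t dma_add(uint32_t a, uint32_t b) {
    sniffer_t s;
    sniff_load(&s, a);   /* step 1: operand to sniff_data */
    sniff_pass(&s, b);   /* step 2: other operand to the bit bucket, sniffer on */
    return s.data;       /* step 3: copy sniff_data to the result */
}

/* Shift-left by one is an add of the variable to itself. */
uint32_t dma_shl1(uint32_t a) { return dma_add(a, a); }

void check_sniffer_add(void) {
    assert(dma_add(3u, 4u) == 7u);
    assert(dma_add(0xFFFFFFFFu, 1u) == 0u);   /* wraps modulo 2^32 */
    assert(dma_shl1(5u) == 10u);
}
```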

For a logic operation (OR, AND, XOR, etc):

  1. move one operand to the sniffer_data register
  2. move the other operand to the sniffer SET, CLR, or XOR write address (e.g. DMA_SNIFF_DATA_CLR)
  3. move sniffer_data register to the result address.
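
A host-side sketch of the three-step logic recipe, under the assumption that a SET write ORs the mask in, a CLR write ANDs with the inverted mask, and an XOR write toggles bits. The names `write_kind_t`, `sfr_write`, and `dma_or`/`dma_and`/`dma_xor` are illustrative only. In the model NOT(A) is computed with C's `~`; on the DMA machine forming that mask would take an extra step, for example an XOR write of 0xFFFFFFFF.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { W_PLAIN, W_SET, W_CLR, W_XOR } write_kind_t;

/* Model of a write to an SFR or to one of its three shadow addresses. */
static void sfr_write(uint32_t *reg, write_kind_t kind, uint32_t v) {
    switch (kind) {
    case W_PLAIN: *reg = v;   break;
    case W_SET:   *reg |= v;  break;
    case W_CLR:   *reg &= ~v; break;
    case W_XOR:   *reg ^= v;  break;
    }
}

/* Load b, apply one shadow write with the given mask, read the result back. */
static uint32_t dma_logic(uint32_t b, write_kind_t kind, uint32_t mask) {
    uint32_t sniff_data = 0u;
    sfr_write(&sniff_data, W_PLAIN, b);   /* step 1 */
    sfr_write(&sniff_data, kind, mask);   /* step 2 */
    return sniff_data;                    /* step 3 */
}

uint32_t dma_or(uint32_t a, uint32_t b)  { return dma_logic(b, W_SET, a); }
uint32_t dma_and(uint32_t a, uint32_t b) { return dma_logic(b, W_CLR, ~a); }
uint32_t dma_xor(uint32_t a, uint32_t b) { return dma_logic(b, W_XOR, a); }

void check_logic(void) {
    assert(dma_or(0xF0u, 0x0Fu) == 0xFFu);
    assert(dma_and(0xFCu, 0x3Fu) == 0x3Cu);
    assert(dma_xor(0xFFu, 0x0Fu) == 0xF0u);
}
```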

For subtract of (A-B) we have to explicitly compute the 2's complement negative of B:

  1. move B to the sniffer_data register
  2. move 0xFFFFFFFF to the XOR write address ( DMA_SNIFF_DATA_XOR) to invert bits
  3. move unity to the bit_bucket (discard) with sniffer enabled (this adds 1 to form the 2's complement)
  4. move the A operand to the bit_bucket (discard) with sniffer enabled (this does an add)
  5. move sniffer_data register to the result address.
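
The five subtraction steps map one-to-one onto ordinary 32-bit arithmetic, as this small model shows. `dma_sub` is a hypothetical name, and the local variable `sniff` stands in for the sniffer data register.

```c
#include <assert.h>
#include <stdint.h>

/* Compute a - b using only load, XOR-write, and sniffer-add steps. */
uint32_t dma_sub(uint32_t a, uint32_t b) {
    uint32_t sniff = b;        /* 1. move B to sniff_data */
    sniff ^= 0xFFFFFFFFu;      /* 2. XOR write inverts every bit */
    sniff += 1u;               /* 3. add unity: now the 2's complement of B */
    sniff += a;                /* 4. add A through the sniffer */
    return sniff;              /* 5. copy to the result address */
}

void check_sub(void) {
    assert(dma_sub(10u, 3u) == 7u);
    assert(dma_sub(3u, 10u) == 0xFFFFFFF9u);   /* -7 as a 32-bit value */
    assert(dma_sub(0u, 0u) == 0u);
}
```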

For a shift-right the process is much more annoying. A right-shift is a bit-reversed left-shift :

  1. move the variable to the sniffer_data register
  2. move the sniff_rev_mask to the DMA_SNIFF_CTRL_SET.
    This will cause a write from the sniff_data register to reverse the order of the bits in the word.
  3. move the sniff_data register to a temp_register (with bits reverse-order)
  4. move the temp_register back to sniff_data
  5. move the temp_register to the bit_bucket (discard) with sniffer enabled (doubling it; left-shift)
  6. move the sniff_data register to the result address (with bits reverse-order, restoring the correct order)
  7. move the sniff_rev_mask to the DMA_SNIFF_CTRL_CLR.
    This turns off the bit-reverse option
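
A model of the reverse, double, reverse trick, assuming (as the steps above describe) that the bit-reverse option affects only what is read out of the sniffer data register. `bitrev32` and `dma_shr1` are names invented for the sketch. The result is a logical shift, so the sign bit is not preserved.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the order of the 32 bits in a word. */
static uint32_t bitrev32(uint32_t v) {
    uint32_t r = 0u;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Logical right shift by one using bit-reversed reads and a sniffer add. */
uint32_t dma_shr1(uint32_t x) {
    uint32_t sniff = x;                  /* 1. variable to sniff_data */
                                         /* 2. enable bit-reverse on reads */
    uint32_t temp = bitrev32(sniff);     /* 3. reversed copy to temp */
    sniff = temp;                        /* 4. temp back to sniff_data */
    sniff += temp;                       /* 5. add temp again: left shift */
    uint32_t result = bitrev32(sniff);   /* 6. reversed read restores the order */
                                         /* 7. disable bit-reverse */
    return result;
}

void check_shr(void) {
    assert(dma_shr1(6u) == 3u);
    assert(dma_shr1(1u) == 0u);
    assert(dma_shr1(0x80000000u) == 0x40000000u);   /* no sign extension */
}
```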

An unconditional jump is easy.
One step: move the jump target address to the DMA0 hardware read address control word.

The hardest operation to get right is a conditional jump. Every jump condition (e.g. jump on negative number) must be converted into an absolute address and all data possibilities (e.g. positive, zero, negative) MUST JUMP! This is because the last step of setting up the conditional jump is to push data to the DMA0 hardware. This weird constraint means that jump conditions need to be converted to small integers representing block addresses. I will outline the jump-on-negative-number-in-variable scheme.

  1. move variable to be tested to the sniffer_data register with DMA byte-swap turned on.
    This moves the sign-bit to bit 7. Bits 4-6 will be the same as the sign bit, as long as the absolute value
    of the register is less than pow(2,28).
  2. move 0xFFFFFFeF to the CLR write address ( DMA_SNIFF_DATA_CLR) to isolate bit 4.
    (Or any bits from 4 to 7). The result will be zero for a positive number (or zero) and 16 for a negative number, if you chose bit 4.
  3. move the desired address ADDR of a jump for positive input to the bit_bucket (discard) with sniffer enabled.
    The result will be an address of either ADDR or ADDR+16, with 16 being the size of one block in the program array.
    Each of these addresses may contain an unconditional jump to anywhere else in the program.
  4. move the sniff_data register to the DMA0 hardware read address control word to force the actual jump to one of the two locations.
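
The jump-on-negative scheme can be traced with a short model. `bswap32` and `jump_target` are hypothetical names, the example address in the check is arbitrary, and the `assert` encodes the range restriction stated in step 1.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the four bytes of a word, as the DMA byte-swap option does. */
static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Returns addr for value >= 0 and addr + 16 for value < 0. */
uint32_t jump_target(int32_t value, uint32_t addr) {
    assert(value > -(1 << 28) && value < (1 << 28));
    uint32_t sniff = bswap32((uint32_t)value);   /* 1. sign bit lands in bit 7 */
    sniff &= ~0xFFFFFFEFu;                       /* 2. CLR write keeps only bit 4 */
    sniff += addr;                               /* 3. add the base block address */
    return sniff;                                /* 4. goes to the DMA0 read address */
}

void check_jump(void) {
    assert(jump_target(5, 0x20000100u) == 0x20000100u);
    assert(jump_target(0, 0x20000100u) == 0x20000100u);
    assert(jump_target(-5, 0x20000100u) == 0x20000110u);
}
```

Both outcomes produce a valid block address, which is the point of the scheme: the final write to the fetch channel always happens, so every data value must map to somewhere sensible.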

The programs below are in blog-style reverse time order, newest stuff at the top.
The following program list is in time-order.

  1. Test program to validate basic execution model and test GPIO output, add, OR operation, conditional branch, and unconditional jump.(23dec2022)
  2. Direct Digital Synthesis is used to test timer-regulated execution speed, SPI output, and combining the DMA channel byte-swap function and CLR-masking to isolate the top byte of the 32-bit accumulator to use as an index into a sine-table. Insertion of the pointer to the sine table requires self-modifying code. Performance is good enough to use for audio synthesis rates. (28dec2022)
  3. Updated test program which implements add, subtract, shift-left, shift-right, and a couple of different ways of generating a conditional jump. (2jan2023)
  4. Refactored and generalized version. DMA channel dependencies are cleaned up for compatibility with other software (e.g. VGA generation). The fetch/execute architecture is separated from the DMA program definition. (3jan2023)
  5. Use the DMAcpu machine to read the ROSC random-bit, shift it into the sniffer, then use the result to compute a CRC32 value using the sniffer hardware, and then output that to an SPI channel to make audio white noise. (7jan2023)
  6. The white noise generator was low-pass filtered using a 1-pole IIR, mostly just to see if the DMAcpu could do the arithmetic. A 38-step program took about 4 uSec to compute a low-passed sample. (11jan2023)
  7. Merging the DMAcpu with VGA generation. Since both use the DMA system heavily, a test was necessary to see if either one broke when merged. Video also gave a way to visually test the random number generation quality. (11jan2023)
  8. Refining the DMAcpu random number generator and simulating Diffusion-Limited Aggregation (DLA). While testing the DLA, I noticed that there is some serial correlation in the DMAcpu random number generation. This code eliminates the correlation. (13jan2023)

DMAcpu and DLA, with refined random number generation. (13jan2023)
DLA runs for 100s of millions random number evaluations, rather than the 100s of thousands used to generate the distributions in projects below. The older random generator produced a slightly biased DLA, so I wrote a test program that just plotted sequential rands as points in 2D. A clear, but rare, diagonal line was produced. Introducing a slight delay decorrelated the ROSC bit, but slows down the random number generator to about one uSec. This speed is about the same as the C rand() function, but it is a true (as opposed to pseudorandom) random number generator. The two images use the improved DMAcpu random function. The left image has a one-pixel seed in the center of the screen. The image on the right has the text "ECE 4760" as a seed.

DLA.jpg
DLA_ece4760.jpg

DLA code, ZIP (also correlation test code, random distributions code)

DISCLAIMER! The ROSC is not shown by the manufacturer or by me to have any reliable level of random generation. Further, of the three rp2040's I have tested, each gives a somewhat different oscillator speed, and different distributions of bits. Do not use this for any critical project without doing your own tests!


DMAcpu and 256 color VGA. -- Distribution testing. (11jan2023)
Test programs from the random number generation page were converted to use the DMAcpu-generated random numbers. The ROSC ring oscillator rnd_reg is used to drive a CRC32 in the DMAcpu. The tests chosen were a 20-coin toss binomial distribution and summing several uniform random numbers to form a normal distribution. The serial interface allows the user to choose normal/binomial, a scale factor, and (for normal) the number of uniform numbers to be added to make one approximate normal sample. The images below show a binomial distribution with 1.3 million total events, with each event being the number of heads from 20 coin tosses. The normal distribution is built from 3 million events each consisting of the sum of 12 uniformly distributed numbers. The red dot and blue dots are the expected distributions. There may be slightly too few samples near the peak in the normal distribution. The DMAcpu program is just four blocks which perform a CRC operation on the ROSC random bit and the last CRC result, store the result, signal the thread that a new value is ready, then jump back to the beginning block.

binomial.jpg
normal.jpg

DISCLAIMER! The ROSC is not shown by the manufacturer or by me to have any reliable level of random generation. Further, of the three rp2040's I have tested, each gives a somewhat different oscillator speed, and different distributions of bits. Do not use this for any critical project without doing your own tests!

Code, ZIP


Filtering white noise using DMAcpu.(11jan2023)
The white noise generator was low-pass filtered using a 1-pole IIR, mostly just to see if the DMAcpu could do the arithmetic. I don't think this is a practical use for the DMAcpu, but it did test a lot of functions. The filter implemented is a simple one-pole IIR filter with a filter coefficient set by right shifting:
output = old_output + [(input - old_output) >> n]
The 38-block DMA program took about 4 uSec to compute each sample. Both signals are sent to the DAC, with the filtered output on channel A and unfiltered on channel B. Since my right-shift function only works with positive numbers, the actual function computed was:
output = old_output + (input>>n) - (old_output >> n)
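
For reference, here is the second form of the filter written as ordinary C on unsigned samples. This is a sketch of the arithmetic only, not the 38-block DMA program. `lowpass_step` and `lowpass_settle` are illustrative names, and the inputs are assumed to be non-negative, matching the restriction on the right shift.

```c
#include <assert.h>
#include <stdint.h>

/* One filter update: output = old_output + (input >> n) - (old_output >> n). */
uint32_t lowpass_step(uint32_t old_output, uint32_t input, unsigned n) {
    assert(n < 32u);
    return old_output + (input >> n) - (old_output >> n);
}

/* Feed a constant input for a number of steps, starting from zero output. */
uint32_t lowpass_settle(uint32_t input, unsigned n, unsigned steps) {
    uint32_t out = 0u;
    for (unsigned i = 0; i < steps; i++) {
        out = lowpass_step(out, input, n);
    }
    return out;
}
```

With n=4 each step moves the output roughly 1/16 of the way toward the input, so a larger n gives a lower cutoff frequency at the cost of a slower response.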
Shown below are time domain waveforms and spectra for n=4.
The unfiltered noise is the top trace in the time domain and the magenta spectrum on the right.

time_domain.jpg
two_spectra.jpg

Code, ZIP


Generating white noise from the DMAcpu. (7jan2023)
The program reads the ROSC random-bit and shifts it into the sniffer data register. This shift register is then used as a seed for CRC32 hardware computation and the resulting scrambled bits are truncated to 12-bits for the SPI DAC to produce good sounding white noise. The machine runs at a sample rate of 50 KHz in the example code, but will run as fast as 500 KHz. The image below shows the spectrum at a sample rate of 100 KHz. The spectrum is down about 6 dB at 50 KHz, but flat through the audio spectrum. The time to generate a new audio sample is about 0.9 uSec. (The timing code in the linked program, but not shown below, adds 0.3 uSec.)

dma_white_noise%20(2).jpg

The code shows the simplified block syntax.
Note that the sniffer function alternates between add and CRC32.
Also, sending a value through the sniffer twice, in add mode, doubles it (shift-left)
// dma_sniffer_ set to add: add function code is 0xf in the calc field
build_block(&sniff_calc_mask, DMA_SNIFF_CTRL_SET, 1, STANDARD_CTRL);
// load a random bit from ROSC to sniff data reg: dma_hw->sniff_data
build_block(rnd_reg, &dma_hw->sniff_data, 1, STANDARD_CTRL);
// pass shift-var thru the sniffer twice to the bit_bucket
build_block(&dma_noise_temp, &bit_bucket, 2, STANDARD_CTRL | SNIFF_EN) ;
// store back to shift-var
build_block( &dma_hw->sniff_data, &dma_noise_temp, 1, STANDARD_CTRL);
// dma_sniffer_ set to CRC32: CRC32 function code is 0x0 in the calc field
build_block(&sniff_calc_mask, DMA_SNIFF_CTRL_CLR, 1, STANDARD_CTRL);
// compute CRC32
build_block(&dma_noise_temp, &bit_bucket, 1, STANDARD_CTRL | SNIFF_EN) ;
// dma_sniffer_ set to add
build_block(&sniff_calc_mask, DMA_SNIFF_CTRL_SET, 1, STANDARD_CTRL);
// limit to 12 bit data: mask value is 0xffff000
build_block( &sniff_dac_data_mask, DMA_SNIFF_DATA_CLR, 1, STANDARD_CTRL);
// OR in the DAC control word
build_block( &dac_config_mask, DMA_SNIFF_DATA_SET, 1, STANDARD_CTRL);
// send to DAC
build_block(&dma_hw->sniff_data, &spi0_hw->dr, 1, (DMA_CHAIN_TO(fix_chan) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_16) | DMA_IRQ_QUIET | DMA_EN));
// unconditional jump to start of program
// push the DMA_blocks[0] address into the program counter (fetch channel read pointer)
// !!NOTE that this block throttles the machine to the frequency of Timer 3 !!
// To run at full DMA speed, change to DMA_TREQ(DREQ_FORCE)
build_block(&DMA_blocks_addr, &dma_hw->ch[fetch_chan].read_addr, 1, DMA_CHAIN_TO(fix_chan) | DMA_TREQ( DREQ_DMA_TIMER3) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_IRQ_QUIET | DMA_EN) ;

Code, ZIP


Improved organization of the test program (3jan2023)
The explicit DMA channel numbers were replaced with macros defining the actual channels so that the machine can be more easily used with other DMA-based protocols. The fetch/execute architecture is separated from the DMA program definition to make it easier to modify for new DMA programs. A macro is added to make the DMA channel control register specification more compact and easier to read.

Code, ZIP


More operations and better conditional jumps (2jan2023)
The program asks the user for two numerical values then computes the sum, difference, shifted values and the sign of the first input value. Most of the explanation of this program is above, describing how each operation is decomposed into a series of data-moves.

Test program, project ZIP


An application: Direct Digital Synthesis (12/28/2022)
DDS is an attractive example because it requires fast, precise timing, but is essentially a pointer increment followed by a table-lookup, then a 16-bit SPI load. In other words, mostly data motion. The DMA program takes the form of a linear set of instructions with no branching, just a loop back to the beginning of the program. Rather than running the DMAcpu at full speed, the loop back block is paced by one of the high precision DMA timers set to a fixed frequency. The program ran correctly up to the limit of the SPI DAC chip, 500,000 samples/sec, but for audio synthesis I used a lower frequency pace of 200,000 loops(samples)/sec. At that rate, the DMA machine ran about 25% of the time. The generated frequency matched the math to within the accuracy of my scope (about 0.1%).
Algorithm:

  1. dds_accum += dds_inc (32 bits) where dds_accum is the DDS phase accumulator
    and dds_inc is incremental speed of rotation of the phasor (proportional to the frequency)
    where: dds_inc = Fout * pow(2,32 )/ Fs ; with Fs = 2e5 and Fout the desired sinewave frequency
  2. The high byte of dds_accum becomes the index into sine table:
    Use DMA BSWAP to move it to low byte of sniffer data register
    clear upper bytes using the transport triggered CLR write with mask 0xffffff00
    Multiply by 2 to convert index into a byte-count of short ints
  3. add byte-count to the base address of the sine_table to form a pointer to the next entry
  4. Shove the pointer just computed into the NEXT BLOCK read address, so it can copy the table value to the SPI channel
  5. Do a 2-byte transfer from sine table to SPI_data register.
    The SPI transfer takes about 0.8 uSec, but if the pacing timer is set to 200 KHz no wait is necessary here.
  6. Stall waiting for pacing timer, then jump back to step 1
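
The arithmetic in steps 1 to 3 can be sketched on a host machine as follows. The function names and the example table base address are assumptions for illustration. `entry_addr_ref` is the plain C version and `entry_addr_dma` forms the same pointer using only the byte-swap, CLR-mask, and sniffer-add primitives.

```c
#include <assert.h>
#include <stdint.h>

#define FS_HZ 200000u   /* pacing (sample) rate from the text */

/* dds_inc = Fout * 2^32 / Fs, computed in 64 bits to avoid overflow. */
uint32_t dds_increment(uint32_t fout_hz) {
    assert(fout_hz < FS_HZ / 2u);
    return (uint32_t)(((uint64_t)fout_hz << 32) / FS_HZ);
}

static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reference: high byte of the accumulator indexes a table of 16-bit entries. */
uint32_t entry_addr_ref(uint32_t accum, uint32_t table_base) {
    return table_base + 2u * (accum >> 24);
}

/* Same pointer built from DMA-style data moves. */
uint32_t entry_addr_dma(uint32_t accum, uint32_t table_base) {
    uint32_t sniff = bswap32(accum);   /* high byte moves to the low byte */
    sniff &= ~0xFFFFFF00u;             /* CLR write with mask 0xffffff00 */
    sniff += sniff;                    /* pass sniff_data through the adder: x2 */
    sniff += table_base;               /* add the sine table base address */
    return sniff;
}
```

For example, an output frequency of 50 kHz at Fs = 200 kHz gives an increment of exactly 2^30, so the table index advances by 64 entries per sample.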

The DMA-machine program:

  1. Send a timing pulse to GPIO2. The length of the pulse will be the execution time of the loop.
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
  2. ADD increment to the accumulator. This is the phasor used to look up a sine value
    // === add dds_accum and dds_inc by transport-triggered operation in sniff reg
    build_block(&dds_accum, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // pass dds_inc thru the sniffer to the bit_bucket
    build_block(&dds_inc, &bit_bucket, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
    // store the sniff data reg back to dds_accum
    build_block( &dma_hw->sniff_data, &dds_accum, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
  3. Form pointer to next sine-table entry from the accumulator
    // load dds_accum to sniffer BUT byte reversed! see BSWAP
    build_block(&dds_accum, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | BSWAP) ;
    // clear high bytes -- leave low byte alone -- the clear_high_bytes mask is 0xffffff00
    build_block(&clear_high_bytes, DMA_SNIFF_DATA_CLR, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN ) ;
    // mult by 2 for 'short' array pointer by adding sniffer to itself
    build_block( &dma_hw->sniff_data, &bit_bucket, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
    // add to sine table base address
    build_block(&sine_table_addr, &bit_bucket, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
  4. Move the just-formed sine table pointer into the NEXT BLOCK read address
    build_block(&dma_hw->sniff_data, next_block_addr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN ) ;
  5. Move sine table entry to SPI connected to DAC-- spi0_hw->dr
    // NOTE that the read address is just a place-holder for the previous block to overwrite.
    // NOTE that the SPI CS line is driven automatically by the write to the SPI data reg
    build_block(sine_table_addr, &spi0_hw->dr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_16) | DMA_EN ) ;
  6. Clear the timing GPIO pin
    build_block(&pin_off, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
  7. Jump back to the beginning, but wait for the DMA pacing timer.
    // push the DMA_blocks[0] address into the program counter (DMA0 read pointer)
    // !!NOTE that this block throttles the machine to the frequency of Timer 3 !!
    // To run at full DMA speed, change TREQ to DREQ_FORCE
    build_block(&DMA_blocks_addr, &dma_hw->ch[0].read_addr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_DMA_TIMER3) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_IRQ_QUIET | DMA_EN) ;

DDS program, ZIP


First DMA test program. (12/23/2022)
This program just toggles an i/o pin to trigger an oscilloscope, then runs through basic proof-of-concept constructions. This program performs several operations using only DMA-logic. The DMA machine is completely asynchronous and independent of MAIN, once started. MAIN sets up the DMA machine program, defines variables for the machine, then prints the results of an ADD and OR operation on the serial console. No other microcontroller resources are needed (except memory, of course) to make the machine run. The execution speed is about 8 million blocks/sec. To make life easier I defined a macro to insert DMA control blocks into the array defining the program.
build_block(read_addr, write_addr, count, ctrl)
builds a DMA control block according to the specs in the data sheet. Remember that control blocks are pulled one-at-a-time from the array, placed in the DMA1 hardware registers by DMA0, then triggered to perform the desired data move. After the move, DMA1 chains to DMA2 to reset the DMA0 write address to point to DMA1 control registers.
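
The macro body is not reproduced on this page, so the following is only a guess at its general shape: each call appends four 32-bit words (read address, write address, transfer count, control word) to the program array. The array name `DMA_blocks` appears in the program listings, but `MAX_BLOCKS`, `block_count`, the variables, and the control value `0x1u` are placeholders. On a 64-bit host the pointer casts truncate, which is harmless for this illustration; on the RP2040 pointers are already 32 bits.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCKS 32u                      /* assumed capacity */
uint32_t DMA_blocks[MAX_BLOCKS * 4u];       /* four words per block image */
unsigned block_count = 0u;

/* Append one control block image to the program array (hypothetical body). */
#define build_block(read_addr, write_addr, count, ctrl) do {   \
    assert(block_count < MAX_BLOCKS);                          \
    uint32_t *b_ = &DMA_blocks[4u * block_count++];            \
    b_[0] = (uint32_t)(uintptr_t)(read_addr);                  \
    b_[1] = (uint32_t)(uintptr_t)(write_addr);                 \
    b_[2] = (uint32_t)(count);                                 \
    b_[3] = (uint32_t)(ctrl);                                  \
} while (0)

uint32_t var_a = 7u, var_b = 0u;            /* example variables */

/* Build a two-block program and return the number of blocks stored. */
unsigned demo_build(void) {
    build_block(&var_a, &var_b, 1, 0x1u);   /* control value is a placeholder */
    build_block(&var_b, &var_a, 4, 0x1u);
    return block_count;
}
```

The word order follows the channel register layout in which the control word is written last, so that copying the fourth word into the execute channel is what starts the transfer.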

The DMA-machine program:

  1. Sends two trigger pulses to GPIO2, for an oscilloscope, and to time the machine.
    // === set the GPIO2 pin by transferring a control word directly to the pad control register.
    The parameters on line 2 configure the DMA control register so that the channel runs as soon
    as possible, with a width of 32 bits, no increments, and chaining to channel DMA2 when done.
    The word transferred is 0x3300 which enables output and writes a 1.
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // === clear the pin
    The word transferred is 0x3200 which enables output and writes a 0.
    build_block(&pin_off, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // === set pin -- repeat
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // === clear the pin -- repeat
    build_block(&pin_off, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
  2. Adds two 32-bit variables and store the result back to a variable.
    Three DMA1 blocks:
    // === add two variables by transport-triggered operation in sniff reg
    // assumes: dma_sniffer_enable(1, sniffer_add, true);
    // === load a var to sniff data reg: dma_hw->sniff_data
    build_block(&dma_var_1, &dma_hw->sniff_data, 1, DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // == pass another var thru the sniffer to the bit_bucket--data has to pass thru for the add to work.
    // the bit_bucket is just a dummy variable to discard transferred data. the add occurs as the data passes
    // through the sniffer.
    // result is in sniffer_data register
    // notice the SNIFF_EN is set to turn on the add function for one block
    build_block(&dma_var_2, &bit_bucket, 1, DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
    // = = store the sniff data reg back to var_2
    build_block( &dma_hw->sniff_data, &dma_var_2, 1, DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
  3. Computes the OR of two 32-bit variables and stores the result
    These operations use transport-triggered operations built into SFR to implement logic.
    Three DMA1 blocks:
    // === OR two variables by transport-triggered operation in sniff reg
    // = === load a var to sniff data reg: dma_hw->sniff_data
    build_block(&dma_var_3, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // == load another var to the SET reg. EVERY SFR has SET, CLR, XOR
    build_block(&dma_var_4, DMA_SNIFF_DATA_SET, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // == store the sniff data reg back to var_5
    build_block( &dma_hw->sniff_data, &dma_var_5, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
  4. Multiply a variable by a constant.
    Three blocks:
    // === mult a variable by a constant in sniff reg
    // in this case, times 4
    // by substituting a variable for the '4', you can do general mult
    // == clear sniff data reg: dma_hw->sniff_data (with no clear get MAC operation)
    build_block(&dma_var_0, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // == pass the var thru the sniffer to the bit_bucket 4 times. note SNIFF_EN is on
    build_block(&dma_var_6, &bit_bucket, 4,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
    // == store the sniff data reg back to var_7
    build_block( &dma_hw->sniff_data, &dma_var_7, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
  5. Computes a conditional skip based on user input from a thread.
    Three DMA blocks to compute branch:
    // === conditional skip
    // the dma_flag variable can take only values 0, 16, 32, or 48 as set by user thread
    // these numbers correspond to jumping 0, 1, 2, or 3 blocks ahead.
    // == read flag to sniffer
    build_block(&dma_flag, &dma_hw->sniff_data, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;
    // == form target block address by adding jump on zero address; jump_zero_addr = block_addr(16) ;
    // had to count the blocks to find out the next one AFTER the DMA0 load was block number 16.
    build_block( &jump_zero_addr, &bit_bucket, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN | SNIFF_EN) ;
    // move sniffer data to read addr of DMA0 to force next read from new location
    // sniffer contains zero_jump_address + offset to one_jump
    // == push the new block address to DMA0 block
    build_block(&dma_hw->sniff_data, &dma_hw->ch[0].read_addr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_IRQ_QUIET | DMA_EN) ;
  6. The skip targets are in steps of 16 bytes/DMA block stored in the array.
    The targets just change the length of a pulse on GPIO2.

    // === TARGET if dma_flag == 0 -- THIS is block number 16 in the program list
    // == set pin
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // TARGET if dma_flag == 16
    // == set pin
    build_block(&pin_on, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // TARGET if dma_flag== 32
    // === clear the pin
    build_block(&pin_off, &iobank0_hw->io[2].ctrl, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_EN) ;

    // === TARGET if dma_flag == 48
    // === unconditional jump to start of program (#1 in this list)
    // push the DMA_blocks[0] address into the program counter (DMA0 read pointer)
    build_block(&DMA_blocks_addr, &dma_hw->ch[0].read_addr, 1,
    DMA_CHAIN_TO(2) | DMA_TREQ( DREQ_FORCE) | DMA_DATA_WIDTH(DMA_SIZE_32) | DMA_IRQ_QUIET | DMA_EN) ;
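
Two of the constructions in this program are easy to check with a host-side model: multiplication by repeated sniffer adds (item 4) and the skip-target computation (items 5 and 6). `dma_mul_small` and `skip_target` are invented names, and the model assumes `dma_var_0` holds zero so that loading it clears the sniffer.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by passing x through the adding sniffer `times` times. */
uint32_t dma_mul_small(uint32_t x, uint32_t times) {
    assert(times >= 1u);
    uint32_t sniff = 0u;                 /* load zero; without it you get a MAC */
    for (uint32_t i = 0; i < times; i++) {
        sniff += x;                      /* transfer count = times, no read increment */
    }
    return sniff;                        /* stored to the result variable */
}

/* Block address reached for a flag of 0, 16, 32, or 48 (16 bytes per block). */
uint32_t skip_target(uint32_t jump_zero_addr, uint32_t dma_flag) {
    assert(dma_flag % 16u == 0u && dma_flag <= 48u);
    return jump_zero_addr + dma_flag;    /* written to the fetch channel read address */
}
```

The cost of this multiply grows linearly with the multiplier, since each unit of the constant is one more word pushed through the sniffer.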

Test code, ZIP




Not current below this line!

The DMA test program performs several operations using only DMA-logic. The DMA machine is completely asynchronous and independent of MAIN, once started. MAIN sets up the DMA machine program, defines tables for the machine, then prints the results of an ADD and NOR operation on the serial console. No other microcontroller resources are needed (except memory, of course) to make the machine run. The execution speed is about 200,000 blocks/sec.
The DMA weird-machine program:

  1. Sends a trigger pulse to PortA, for an oscilloscope
    One DMA2 block to transfer one byte to LATA.
  2. Increments a variable to be used in the arithmetic below
    Two DMA2 blocks:
    -- move variable value to low byte of source address field (increment array) of next DMA block.
    -- move value in source address to variable. Contents of increment array is (source addr low-byte value)+1.
  3. Adds two 8-bit variables and stores the result (terminal image showing sum, modulo 256, and result of NOR operation below)
    Four DMA2 blocks:
    -- move variable_1 value to low byte of source address field (increment array) of DMA block 3 blocks later.
    -- move variable_2 value to low byte of source length field (increment array) of DMA block 2 blocks later.
    -- move variable_2 value to low byte of cell length field (increment array) of DMA block 1 block later.
    -- increment through the table specified by the previous three blocks and store result into variable_4.
  4. Computes the NOR of the two 8-bit variables and stores the result
    These operations use transport-triggered operations built into SFR to implement logic.
    Four DMA2 blocks:
    -- move variable1 value to an SFR that supports (CLEAR, SET, INVERT) write locations.
    -- move variable2 value to the SFR SET write location.
    -- move 0xff value to the SFR INVERT write location.
    -- move NOR value in the SFR to variable_3
  5. Sets a print strobe, to be cleared by MAIN when the variables are printed.
    This is necessary because the DMA machine is completely asynchronous with the CPU
    -- One DMA2 block to move a 0x01 to the print strobe, which is cleared by the CPU in MAIN.
  6. Computes a conditional branch to see if the print strobe is cleared, and loop until it is cleared.
    Five DMA blocks to compute branch:
    -- move print_strobe to low byte of source address field (offset array) of next DMA block.
    This effectively multiplies the logical 0/1 to 0/4 because the jump address is 4 bytes
    -- move the offset to low byte of source address field (jump array) of next DMA block.
    This will select the jump address entry from the jump table.
    -- move the actual target block address to the low two bytes of the source address field of the DMA0 block two blocks ahead.
    -- move the next block to DMA0 control registers.
    -- define the DMA0 block to be moved by the previous block.
  7. Increments a variable, modulo 3, to choose one of three output waveforms to send to PortB
  8. Computes a conditional branch to one of three waveform generators based on the mod 3 variable:
    -- 1 microsec pulse
    -- 2 microsec pulse
    -- 8 microsec pulse
  9. Unconditional jump back to the beginning of the program (item 1 on this list)

Fetch/execute machine details:
The syntax given below assumes that DMA0 or DMA2 block images can be built by defining them using PLIB DMA commands, then copying the blocks (in the preparation stage) to the large array or arrays of block images. Macros hide the actual preparation, reducing it to specifying the source address, destination address, source length (in bytes), destination length (in bytes), and the cell transfer length (in bytes). For example:
make_DMA2_block(LED_pattern2, (void*)&LATB, 64, 1, 64);
constructs and stores a block destined to be copied to the DMA2 control block which moves 64 bytes from a memory array, in a burst of 64, to the 1 byte port B output latch. The effect is to generate a burst of output transitions on the port, when the block is later loaded into the DMA2 control register and executed.
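
The effect of that block can be modeled on a host machine. The following is a minimal sketch, not PLIB code: the function names, the alternating test pattern, and the transfer semantics (cell-length bytes move per trigger, with the source and destination pointers wrapping at their respective lengths) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of one DMA cell transfer (not PLIB code).
   Assumed semantics: cell_len bytes move per trigger, and the source and
   destination pointers wrap at src_len and dst_len respectively. */
size_t model_cell_transfer(const uint8_t *src, size_t src_len,
                           volatile uint8_t *dst, size_t dst_len,
                           size_t cell_len)
{
    assert(src_len > 0 && dst_len > 0);
    size_t writes = 0;
    for (size_t k = 0; k < cell_len; k++) {
        dst[k % dst_len] = src[k % src_len]; /* every byte lands in the window */
        writes++;
    }
    return writes;
}

/* Model of make_DMA2_block(LED_pattern2, (void*)&LATB, 64, 1, 64):
   64 source bytes, a 1-byte destination, moved in a single 64-byte cell. */
uint8_t model_port_burst(void)
{
    uint8_t pattern[64];
    volatile uint8_t latb = 0;              /* stand-in for the port B latch */
    for (size_t i = 0; i < 64; i++)
        pattern[i] = (uint8_t)(i & 1u);     /* assumed pattern: 0,1,0,1,... */
    size_t writes = model_cell_transfer(pattern, 64, &latb, 1, 64);
    assert(writes == 64);                   /* 64 writes hit the same byte */
    return latb;                            /* last pattern byte remains */
}
```

Because the destination length is one byte, all 64 writes land on the same address, which is what produces the burst of transitions on the port.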

  • Output to a port:
    There are at least two ways to make a pulse on an output pin. One way, described above, is to move an array to a port.
    The other way is to use the PIC32 special-function-register shadow registers explained in a separate bullet below:
    make_DMA2_block(&test_var_0, (void*)&PORTB, 1, 1, 1); // test_var_0 = 0x01
    make_DMA2_block(&inv_mask, (void*)&PORTBINV, 1, 1, 1); // inv_mask = 0xff
    The first block turns on port B, bit zero. The second block inverts the entire latch byte to turn off bit zero.
  • Increment a variable:
    Increment uses the value of a variable as an index into an array whose contents equal index+1. For this to work you have to be able to copy the variable into the low byte of the source address of the next block. You also have to align the array in memory so that the base address of the array has a zero-valued low-order byte. Preparation includes the array definition, which in this case causes the byte-aligned variable to cycle through three values. The macro next_blk_src_addr calculates the memory address of the source field of the next defined block.
    unsigned char jmp_inc_array[] __attribute__ ((aligned(256))) = {1, 2, 0} ;
    make_DMA2_block(&inc_value, next_blk_src_addr, 1, 1, 1);//load to low-byte of source address in next block
    make_DMA2_block(jmp_inc_array, &inc_value, 1, 1, 1); // read array entry back into variable
  • Logic operations:
    Writable special function registers (e.g. DMA control registers or i/o ports) each have three write-only shadow registers, which act as bit-mask modifiers of the main register. The three shadow registers allow you to set, clear, or invert individual bits in the main register. As an example, writing 0x04 to PORTBINV inverts the third bit of PORTB. The general naming scheme is sfrSET, sfrCLR and sfrINV. I chose to use SFRs that are normally used to hold compare data for DMA transfer termination. The termination function is not turned on, so these registers are otherwise unused by the weird machine. For example, to compute the bit-wise NOR:
    make_DMA2_block(&test_var_1, scratch_sfr, 1, 1, 1); //scratch_sfr load
    make_DMA2_block(&test_var_2, scratch_sfr_set, 1, 1, 1); // OR in another variable
    make_DMA2_block(&inv_mask, scratch_sfr_inv, 1, 1, 1); // invert to make NOR operation
    make_DMA2_block(scratch_sfr, &test_var_3, 1, 1, 1); // store to third variable
  • Add two variables:
    To add two variables you need to supply three pieces of information to one block for one table lookup. The table lookup uses the fact that moving a series of bytes is exactly the counting operation that addition requires. Variable test_var_1 is used as the offset into an array to start the count. Variable test_var_2 is used as both the size of the source and the number of bytes to move, but all of the bytes move to one target byte, essentially making a counter. There is an edge condition when test_var_2=0: the size fields cannot be zero, so the size of the transfer has to be at least 256 bytes, hence the 0x100 size in the last block. As usual, the variable's value is substituted into the low-order byte of a later block's source field or size field. Note that inc_array is 768 bytes long (256x3) to account for the zero edge case and the longest possible increment sequence.
    // --load first operand into block+3 source addr
    make_DMA2_block(&test_var_1, (void*)(DMA_blocks+length_of_block*(N+3)+DCH0SSA_OFFSET), 1, 1, 1);
    // -- load second operand into block+2 source SIZE !!!!CANNOT BE ZERO!!!
    // Hence the 0x100 offset two blocks down
    make_DMA2_block(&test_var_2, (void*)(DMA_blocks+length_of_block*(N+2)+DCH0SSIZ_OFFSET), 1, 1, 1);
    // -- load second operand into block+1 cell SIZE !!!!CANNOT BE ZERO!!!
    // Hence the 0x100 offset in the next block
    make_DMA2_block(&test_var_2, (void*)(DMA_blocks+length_of_block*(N+1)+DCH0CSIZ_OFFSET), 1, 1, 1);
    // --read sum array entry into variable
    make_DMA2_block(inc_array, &test_var_4, 0x100, 1, 0x100);
  • Conditional jump:
    Since every operation is a memory transfer, each branch of a computed jump must jump, so that the operations are uniform. There are several steps required.
    1. Assuming that the branch depends on the value of a byte variable, the branch scheme needs a lookup table in which each entry is 4*(byte value).
      example of a modulo 3 multiply table:
      unsigned char offset_array[] __attribute__ ((aligned(256))) = {0, 4, 8} ;
      The multiply table will be used to generate a 4-byte offset for each possible increment value.
      The table must be aligned in memory so that the lowest byte of the address is zero.
    2. The branch scheme also needs a lookup table in which each entry is the actual memory address of the next block to execute,
      depending on the index, which is the value of the incremented variable
      The address will be moved into the new DMA0 block.
      example of jump table: unsigned int io_jmp_array[3] __attribute__ ((aligned(256))) ;
      later in the code, set
      io_jmp_array[0] = DMA_blocks + one_short_pulse_label*length_of_block ;
      io_jmp_array[1] = DMA_blocks + two_short_pulse_label*length_of_block ;
      io_jmp_array[2] = DMA_blocks + one_long_pulse_label*length_of_block ;
      But NOTE that these are virtual addresses which need to be converted to physical addresses.
      The conversion is done by only using the lower two bytes of the array value.
    3. DMA block N copies the 1-byte increment variable into the low byte of the source address field of block N+1,
      which contains the base address of the offset array.
    4. DMA block N+1 copies the modified offset array address contents into the low byte of the source address field of block N+2,
      which contains the base address of the jump array.
    5. DMA block N+2 copies the modified jump array address contents into the low 2 bytes of the source address field of block N+4,
    6. DMA block N+3 copies block N+4 into the DMA0 control registers
    7. DMA block N+4 is the block (with new source pointer) to actually copy into the DMA0 control registers,
      thus implementing a jump by updating the source pointer to the DMA block list in memory.
    make_DMA2_block(&inc_value, next_blk_src_addr, 1, 1, 1);
    make_DMA2_block(jmp_offset_array, next_blk_src_addr, 1, 1, 1);
    make_DMA2_block(io_jmp_array, (next_blk_src_addr+length_of_block), 2, 2, 2);
    make_DMA2_block(next_blk_addr, DMA0_addr_2, length_of_block, length_of_block, length_of_block);
    make_DMA0_block(DMA_blocks, DMA2_addr_2, number_of_blocks*length_of_block, length_of_block, length_of_block);
  • Unconditional jump:
    An unconditional jump requires two blocks. The first block moves the second block to the DMA0 control registers.
    DMA0_addr is the location of the DMA0 control registers, and DMA_blocks is the address of the jump target.
    The second block, once moved to DMA0, forces a jump to the beginning of the program and starts loading blocks into DMA2.
    make_DMA2_block(next_blk_addr, DMA0_addr, length_of_block, length_of_block, length_of_block);
    make_DMA0_block(DMA_blocks, DMA2_addr, number_of_blocks*length_of_block, length_of_block, length_of_block);
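
The table-lookup operations in the list above (increment, NOR, add, and the computed branch) can be checked on a host machine. The sketch below is a model under stated assumptions, not the DMA program itself: all function and table names are hypothetical, inc_table is assumed to hold (index+1) modulo 256, the jump table holds stand-in numbers rather than real block addresses, and all four bytes of a jump entry are copied where the DMA version copies only the low two.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Tables aligned to 256 bytes so the low byte of each base address is zero. */
_Alignas(256) uint8_t  mod3_inc_table[3] = {1, 2, 0};   /* (index+1) mod 3 */
_Alignas(256) uint8_t  offset_table[3]   = {0, 4, 8};   /* 4*index */
_Alignas(256) uint32_t jump_table[3]     = {100, 200, 300}; /* stand-in targets */
_Alignas(256) uint8_t  inc_table[768];   /* assumed: (index+1) mod 256 */

/* Model of a block that writes a variable into the low byte of the
   source-address field of a later block. */
const uint8_t *patch_low_byte(const void *base, uint8_t value)
{
    uintptr_t addr = (uintptr_t)base;
    assert((addr & 0xffu) == 0);             /* needs 256-byte alignment */
    return (const uint8_t *)(addr | value);
}

/* Increment modulo 3: one lookup through the patched address. v is 0..2. */
uint8_t increment_mod3(uint8_t v)
{
    return *patch_low_byte(mod3_inc_table, v);
}

/* NOR: load the register, OR through sfrSET, then invert through sfrINV. */
uint8_t nor8(uint8_t a, uint8_t b)
{
    uint8_t sfr = a;                         /* write to scratch register */
    sfr = (uint8_t)(sfr | b);                /* write b to the SET address */
    sfr = (uint8_t)(sfr ^ 0xffu);            /* write 0xff to the INV address */
    return sfr;
}

/* Add: move (0x100 + b) bytes, starting at offset a, onto one target byte. */
uint8_t add8(uint8_t a, uint8_t b)
{
    for (unsigned i = 0; i < 768u; i++)
        inc_table[i] = (uint8_t)(i + 1u);
    const uint8_t *src = patch_low_byte(inc_table, a);
    unsigned count = 0x100u + b;             /* size low byte patched with b */
    uint8_t dest = 0;
    for (unsigned k = 0; k < count; k++)
        dest = src[k];                       /* the last byte written wins */
    return dest;
}

/* Computed branch: value -> byte offset -> jump table entry. v is 0..2. */
uint32_t branch_target(uint8_t v)
{
    uint8_t offset = *patch_low_byte(offset_table, v);
    uint32_t target;
    memcpy(&target, patch_low_byte(jump_table, offset), sizeof target);
    return target;
}
```

For example, add8(200, 100) reads inc_table[555], which holds 556 modulo 256, or 44, the sum modulo 256. The largest index ever read is 255+255+255 = 765, which is why the table needs 768 entries.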

Direct Digital Synthesis -- A possible practical use for the DMA machine (and optimizing execution)
DDS uses a table lookup to send sine values to an SPI-attached DAC. It is possible to do DMA transfers to the SPI using framed mode, which autogenerates a chip select on the channel slave-select line. However, the chip select is limited to one pin and there can be only one peripheral on the channel. The serial DMA machine allows you to define an arbitrary chip-select pin and manipulate it. The downside is that the maximum speed for the transfer is around 11.4 Ksamples/sec (when using the standard 192 byte full DMA block definition). The example code waits for a timer event, toggles the chip select, sends two bytes through SPI to the DAC, increments an array pointer, then auto-loops back to the beginning to wait for the next timer event. To turn off the machine so that another SPI device can access the bus, just freeze timer3. The demo code does this with a serial command.
-- The rate-limiting step in the DMA weird machine execution is loading the 192 byte blocks for every operation. Careful consideration of the contents of the DMA control block suggests that the last two words are not needed for this machine (unless you try to use transport-triggered compare). This shaves 32 bytes off. Another 12 bytes can be pulled off the end because each control register has three shadow registers for transport-triggered logic operations. The first word of the block is constant and can be set once, saving 16 bytes. The net result is 132 byte transfers which speeds up execution by about 1.5 times. The sample rate jumps to 18 Ksamples/second. Code.
-- Just running the DAC transfer as fast as possible with NO time-trigger control speeds up the sample rate to 23 Ksamples/sec. The speed up happens because the block size is cut to 100 bytes (minimum). The minimum size does not include the ability to set up a time trigger using the DMA block interrupt detect hardware. Code.
-- Changing the code to use 2-byte transfers to the SPI channel requires a modified increment table, which limits the maximum sine resolution to 128 samples/cycle. The scheme makes a table in which the increment is two, rather than one. The effect is to remove two blocks from the DMA-block DDS loop, raising the maximum synthesis rate to 23.6 KHz (still with timer control). For the DDS sinewave this gives a frequency range from 184 Hz for a 128-sample sine to 2.95 KHz for an 8-sample sine. The 23.6 KHz synthesis rate corresponds to a timer interval of 1700 cycles, so changing the timer interval by one cycle allows frequency control of better than 0.1%. Changing the length of the sine table by one sample yields frequency control of 1/(sine_table_size).
Code. <<use this version for DDS>>
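
The frequency arithmetic for the table-based DDS can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes one table entry is emitted per sample, so the output frequency is the synthesis rate divided by the table length.

```c
#include <assert.h>

/* Output frequency of a table-lookup DDS that emits one table entry per
   sample: f_out = f_sample / table_length. */
double dds_output_hz(double sample_rate_hz, unsigned table_len)
{
    assert(table_len > 0);
    return sample_rate_hz / (double)table_len;
}

/* Fractional frequency change from adjusting the timer interval by one
   cycle: roughly 1 / timer_cycles. */
double timer_step_fraction(unsigned timer_cycles)
{
    assert(timer_cycles > 0);
    return 1.0 / (double)timer_cycles;
}

/* Fractional frequency change from adding or removing one table sample:
   roughly 1 / table_length. */
double table_step_fraction(unsigned table_len)
{
    assert(table_len > 0);
    return 1.0 / (double)table_len;
}
```

With a 23.6 KHz synthesis rate, dds_output_hz(23600.0, 128) gives 184.375 Hz and dds_output_hz(23600.0, 8) gives 2950 Hz. A 1700-cycle timer interval gives a step of about 0.06%, which is finer than the table-length step at any table size up to 128 samples.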

Pseudorandom or random sequence generation
This example uses the CRC hardware module to generate a pseudorandom 16-bit number sequence. Optionally, reading a floating ADC input adds some entropy to make the sequence truly random, but not of cryptographic quality. The sequence is output through the SPI DAC interface for spectral analysis. If the ADC is used, it is read every eighth iteration of the LFSR, with 8 bits of the ADC reading XORed with the lower 8 bits of the LFSR seed. Running the CRC LFSR, emitting the SPI data, and computing the conditional ADC read all runs at about 10 KHz. The code needs a 16-bit SFR to use as a 16-bit ALU. The OCR5 set/reset registers were used. This version of the code optimizes for speed by eliminating possible timer control, so the system just runs as fast as it can. Eliminating SPI output would speed up random number generation by about 30%. Eliminating the ADC read would speed it up by about 25%, but makes the sequence completely repeatable and dependent on the initial seed chosen. The output noise spectrum drops with a 3 dB point at about 25% of the sample frequency and a minimum, at least 30 dB down, at the sample frequency.
Code (with ADC read every three LFSR operations)
Spectrum of DAC output with no ADC reads. Sample rate is about 16.8 KHz.

spectrum_no_ADC.jpg
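
A software model of the sequence generator is sketched below. It is not the CRC-module configuration used by the DMA program: the Galois LFSR form, the tap mask 0xB400 (a commonly used maximal-length 16-bit polynomial), the function names, and the zero-state guard are all assumptions chosen for illustration. Only the idea of XORing an ADC byte into the low 8 bits on every eighth iteration comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed taps: x^16 + x^14 + x^13 + x^11 + 1 (maximal length). */
#define LFSR_TAPS 0xB400u

/* One step of a 16-bit Galois LFSR, shifting right. */
uint16_t lfsr_step(uint16_t state)
{
    uint16_t lsb = (uint16_t)(state & 1u);
    state = (uint16_t)(state >> 1);
    if (lsb)
        state = (uint16_t)(state ^ LFSR_TAPS);
    return state;
}

/* Step with optional entropy: on every eighth iteration, XOR an ADC byte
   into the low 8 bits before stepping. */
uint16_t lfsr_step_with_entropy(uint16_t state, unsigned iteration,
                                uint8_t adc_byte)
{
    if (iteration % 8u == 0u)
        state = (uint16_t)(state ^ adc_byte);
    if (state == 0u)
        state = 1u;   /* guard added here: an all-zero LFSR never leaves zero */
    return lfsr_step(state);
}

/* Number of steps before the state returns to a nonzero seed (no entropy). */
uint32_t lfsr_period(uint16_t seed)
{
    assert(seed != 0u);
    uint16_t s = seed;
    uint32_t n = 0;
    do {
        s = lfsr_step(s);
        n++;
    } while (s != seed);
    return n;
}
```

Without the ADC input the sequence repeats after 65535 steps from any nonzero seed, which is the repeatability noted above. Mixing in ADC bits breaks that cycle at the cost of the extra read.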

Older versions:

Time synced operation:
It is possible to sync overall machine operation to a timer by modifying one DMA2 block definition to trigger a transfer on a timer event. Note that this is a blocking wait, which halts DMA execution until the timer event. This could be useful for a small program that, for example, sends a word to the SPI channel on a regular schedule to run a DAC. The sequential machine would wait for a timer event, drop the chip-select line, transfer a word to the SPI buffer, raise the chip-select line, then loop to wait for the next timer event. The DMA2 block definition which waits, then executes a NOP, could be:
DmaChnOpen(2, 0, DMA_OPEN_AUTO);
DmaChnSetTxfer(2, &inc_value, &bit_bucket, 1, 1, 1);
DmaChnSetEventControl(2, DMA_EV_START_IRQ(_TIMER_3_IRQ));
DmaChnSetEvEnableFlags(2, DMA_EV_CELL_DONE);
DmaChnEnable(2);
memcpy(DMA_blocks+length_of_block*N, &DCH2CON, length_of_block);
N++;
This code runs the main DMA loop at 100 Hz by waiting for timer3 event.

Optimizing test code execution speed
-- The execution speed of the DMA machine is limited by the need to load a 192 byte control block for each operation. By reducing the flexibility of the machine, certain chunks of the DMA2 block do not need to be reloaded each time. An optimized version, with a speed-up of about 1.4, minimizes DMA2 block updates but still allows the full set of functions described above: optimized code. The minimum execution time for one block dropped from 10 µsec to 7 µsec because the bytes per block were reduced from 192 to 132.
-- It is possible to optimize further, but the ability to trigger a block from an outside source (perhaps a timer) is lost. By eliminating the copy of the interrupt control registers, the copy count drops to 100 bytes, and the minimum block execution time drops to 5.5 µsec. The general test code above still runs, but time sync is much harder. The DCHxSSA, source address, register is the first address copied and the DCHxCSIZ, cell size register, is the last (see datasheet page 52).
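
The block-size figures can be cross-checked with a simple proportional model. This is a sketch under the assumption that block execution time scales linearly with the number of bytes loaded. The function names are hypothetical, and the 32, 12 and 16 byte savings are the ones listed in the DDS section above.

```c
#include <assert.h>

/* Bytes left after dropping the unused parts of the 192-byte block image:
   32 (last two words), 12 (trailing shadow registers), 16 (constant first
   word). */
unsigned reduced_block_bytes(void)
{
    return 192u - 32u - 12u - 16u;
}

/* Speed-up expected if load time is proportional to bytes copied. */
double load_speedup(unsigned full_bytes, unsigned reduced_bytes)
{
    assert(reduced_bytes > 0);
    return (double)full_bytes / (double)reduced_bytes;
}

/* Block time predicted by scaling the full-block time by the byte ratio. */
double scaled_block_time_us(double full_time_us, unsigned full_bytes,
                            unsigned reduced_bytes)
{
    assert(full_bytes > 0);
    return full_time_us * (double)reduced_bytes / (double)full_bytes;
}
```

The model predicts about 6.9 µsec for 132 bytes, close to the measured 7 µsec, and about 5.2 µsec for 100 bytes against the measured 5.5 µsec. The growing gap suggests a fixed per-block overhead that does not shrink with the byte count.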

A different (and probably inferior) way to run the Fetch/Execute cycle
The method used above is optimal in terms of wasting no cycles because the fetch/execute cycle is asynchronous. As soon as an operation finishes, the next one can start. However, all four DMA channels are needed to make the machine run. One channel is the fetch unit, another is the execute unit, and the other two are used only to clear interrupt flags in the first two channels. If a timer and output compare unit are used to generate two time-synched interrupt flags, then the two (fetch and execute) DMA channels can be triggered by the interrupt flags. The upside of this scheme is that it frees up two DMA channels. The downside is that the slowest operation determines the execution rate of the machine. Most operations are fairly fast, but add is much slower and branch is a little slower. Including the add operation drops performance by a factor of 10. The branch operation drops performance by a factor of 2.5. Tuning becomes quite difficult. But for reference, a running code (without the add operation) is included which runs about 0.4 as fast as the async code. Code.


Copyright Cornell University January 13, 2023

