1. Don't move data. Use a ring buffer with pointers/indexes to head and tail. This is true regardless of language (C, Java, Do-more, IEC-61131-3, ...).
I know that not ALL blocks of data are FIFO or LIFO; some are just random blocks of data. But start with those that are, especially "large" types like STRINGs or other structures, then BITs if you are moving at the bit level and not optimizing your bit-table moves on byte boundaries via casting. Oh, and tables/blocks that are LONG.
2. Use "memory" type copies where you can, not "value" type copies. MEMCOPY is faster than MOVER and the other "Assignment" type instructions. MEMCOPY just moves bytes from source to destination. All of the others do "type" intelligent assignment, so you can move VALUEs from an unsigned 16-bit integer into a 32-bit signed one (e.g. values from V0..V9 into D0..D9). Note that MEMCOPY would NOT work for the V0..V9 to D0..D9 example, since the element sizes differ; that was just an illustration of the BEHAVIOR of the MOVER-type assignment instructions. MOVER loads the value using the source type, then writes that "value" out using the destination type, properly handling any size/type conversion. It does this regardless of whether the source and destination are the same type/size, hence it is slower than the straight byte-level memory access (think C's memcpy) of MEMCOPY.
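To make the "memory" vs. "value" distinction in #2 concrete, here is a rough C analogy. It is NOT how Do-more implements MEMCOPY or MOVER internally; the function names copy_values and copy_bytes are made up for this illustration. The first does a per-element, type-aware conversion (the MOVER-like behavior), the second is a plain byte copy that only makes sense when both sides have the same element type (the MEMCOPY-like behavior).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Value" copy (hypothetical name): every element is loaded as a
   16-bit unsigned value, widened, and stored as a 32-bit signed value.
   One conversion per element, so it works across types but costs more. */
static void copy_values(const uint16_t *src, int32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = (int32_t)src[i];
    }
}

/* "Memory" copy (hypothetical name): raw bytes, no per-element
   conversion. Source and destination must share the same element type,
   otherwise the bytes land in the wrong places. */
static void copy_bytes(const uint16_t *src, uint16_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    memcpy(dst, src, n * sizeof *src);
}
```

Both are still O(n) in the number of elements; the byte copy just does less work per element.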
Note: #1 is MUCH FASTER than #2: O(1) vs. O(n), if you understand big-O notation. All I'm doing in #2 is changing the "constant" in front of O(n).
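For anyone who wants to see why #1 is O(1), here is a minimal ring buffer sketch in C. The names (Ring, ring_push, ring_pop, RING_CAPACITY) and the capacity of 8 are assumptions for the example, not anything from Do-more. The point is that push and pop only touch one slot and bump an index; nothing already in the table ever gets moved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAPACITY 8  /* hypothetical size, pick what fits your table */

typedef struct {
    int    slots[RING_CAPACITY];
    size_t head;   /* index of the oldest element (next to pop) */
    size_t tail;   /* index of the next free slot (next to push) */
    size_t count;  /* number of elements currently stored */
} Ring;

/* Append to the tail. Returns false if the buffer is full.
   Only one slot is written; existing data stays where it is. */
static bool ring_push(Ring *r, int value)
{
    assert(r != NULL);
    if (r->count == RING_CAPACITY) {
        return false;
    }
    r->slots[r->tail] = value;
    r->tail = (r->tail + 1) % RING_CAPACITY;  /* wrap around */
    r->count++;
    return true;
}

/* Remove from the head (FIFO order). Returns false if empty.
   No shifting of the remaining elements: just advance the head index. */
static bool ring_pop(Ring *r, int *out)
{
    assert(r != NULL && out != NULL);
    if (r->count == 0) {
        return false;
    }
    *out = r->slots[r->head];
    r->head = (r->head + 1) % RING_CAPACITY;  /* wrap around */
    r->count--;
    return true;
}
```

Start with a zeroed Ring (e.g. `Ring r = {0};`). The cost of a push or pop is the same whether the table holds 8 entries or 8000, which is the O(1) behavior; a shift-the-whole-table FIFO pays for every element on every operation.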