The biggest difference came from getting rid of the indexed values in the compares in the inner loop. Instead I use one MATH instruction on the indexed values, and then use its result in the compares.
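
As a rough illustration of that idea (written in Python rather than the original ladder/program code, and assuming purely for the example that the compares test a difference between two indexed values against a low and a high limit), the change looks something like this. The names `count_in_band_indexed`, `count_in_band_hoisted`, `lo` and `hi` are made up for the sketch.

```python
def count_in_band_indexed(data, lo, hi):
    """Before: every compare re-reads the indexed values."""
    count = 0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            # Two compares, each doing its own indexed lookups.
            if data[j] - data[i] >= lo and data[j] - data[i] <= hi:
                count += 1
    return count


def count_in_band_hoisted(data, lo, hi):
    """After: one calculation with the indexed values, result reused."""
    count = 0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            diff = data[j] - data[i]  # the single "MATH" step
            # The compares now only touch the plain result.
            if diff >= lo and diff <= hi:
                count += 1
    return count
```

Both versions give the same answer; the second just does the indexed access once per pass through the inner loop instead of once per compare.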
There also used to be a compare at the top of the outer loop to check for valid data. I got rid of it, and instead gave the outer FOR an ending value that is calculated in another program block.
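
A minimal sketch of that change, again in Python and under assumptions of my own for the example: valid entries sit at the front of the table and a value of 0 marks unused slots. `last_valid_index` stands in for the other program block and `sum_valid` for the loop; both names are hypothetical.

```python
def last_valid_index(data, invalid=0):
    """Stand-in for the separate program block: find the ending value once."""
    end = -1  # -1 means "no valid data at all"
    for i, value in enumerate(data):
        if value == invalid:
            break
        end = i
    return end


def sum_valid(data, end):
    """Outer loop with no validity compare; it simply stops at `end`."""
    total = 0
    for i in range(end + 1):
        total += data[i]
    return total
```

For example, with `data = [3, 5, 7, 0, 0]` the ending value is calculated once as 2, and the loop runs over indexes 0 through 2 without testing each entry.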
The other thing that helped was re-calculating the starting index of the inner FOR on the fly.
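
One way this can work, sketched in Python: if the data is sorted (an assumption for this example, not something stated above), the inner loop does not have to start from the first entry every time. The start index can be moved forward as the outer loop advances, so entries that can no longer match are never revisited. `pairs_within` and `window` are hypothetical names.

```python
def pairs_within(sorted_values, window):
    """Collect (earlier, later) pairs whose difference is at most `window`.

    Assumes `sorted_values` is in ascending order and `window` >= 0.
    """
    pairs = []
    start = 0  # starting index for the inner loop, adjusted on the fly
    for i in range(len(sorted_values)):
        # Move the start forward past entries that are now too far behind.
        while sorted_values[i] - sorted_values[start] > window:
            start += 1
        # The inner loop only covers the entries that can still match.
        for j in range(start, i):
            pairs.append((sorted_values[j], sorted_values[i]))
    return pairs
```

The start index only ever moves forward, so the inner loop skips the early entries entirely instead of comparing and rejecting them on every pass.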