Maybe it's because the representation is quite strange to me, and that I don't quite get how it represents 2^256 numbers with the limbs.
When representing a big number we have to split it into smaller limbs that fit in registers, for example 4 64-bit integers (a0, a1, a2, a3) for a 256-bit integer. This is called the radix 2^64 representation.
This works fine, but the problem is that every operation on the limbs can overflow. For example, to add A + B you have to compute A.a0 + B.a0, which can overflow, and that overflow has to be carried into the next step.
R.a0 = A.a0 + B.a0
if(overflowed) => carry = 1 else 0
R.a1 = A.a1 + B.a1 + carry
if(overflowed) => carry = 1 else 0
R.a2 = A.a2 + B.a2 + carry
if(overflowed) => carry = 1 else 0
R.a3 = A.a3 + B.a3 + carry
And this is just addition. When adding x and y, each having n base-b digits, the result has at most n+1 base-b digits. It is also easy because the carry is either 0 or 1 (for the first limb it is as simple as: if R.a0 < A.a0 => carry = 1 else carry = 0, since a wrapped sum is always smaller than either operand).
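The carry chain above can be sketched in Python, which has no fixed-width integers, so the 64-bit wraparound is simulated with a mask. The names `MASK64` and `add_4x64` are made up for this illustration, and limbs are assumed to be stored least significant first.

```python
MASK64 = (1 << 64) - 1  # keeps a value inside one 64-bit limb


def add_4x64(a, b):
    """Add two numbers given as little-endian lists of 64-bit limbs.

    Returns (limbs, carry_out); carry_out is the extra (n+1)-th digit.
    """
    result = []
    carry = 0
    for x, y in zip(a, b):
        s = (x + y) & MASK64      # wrapping add, as a 64-bit register would do
        c1 = 1 if s < x else 0    # the sum wrapped around, so it overflowed
        t = (s + carry) & MASK64  # add the carry coming from the previous limb
        c2 = 1 if t < s else 0    # adding the carry can overflow too
        result.append(t)
        carry = c1 | c2           # both can never be set at the same time
    return result, carry
```

Note that once a carry comes in, two comparisons are needed per limb, which is part of why this gets tedious.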
When multiplying x by y, each having n base-b digits, the result has at most 2n digits, and the carry is no longer a single bit: it can be almost as large as a full limb, so it has to be computed and stored correctly.
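To see how much bigger the carry gets, here is a minimal schoolbook multiplication sketch in Python. The names `MASK64` and `mul_4x64` are hypothetical, limbs are assumed to be little-endian, and the intermediate `t` stands in for the 128-bit product a real implementation would need.

```python
MASK64 = (1 << 64) - 1  # keeps a value inside one 64-bit limb


def mul_4x64(a, b):
    """Schoolbook multiply of two n-limb numbers; returns 2n limbs."""
    n = len(a)
    r = [0] * (2 * n)
    for i in range(n):
        carry = 0
        for j in range(n):
            # 64x64-bit product plus the old limb plus the carry:
            # the maximum is exactly 2^128 - 1, so it fits in 128 bits.
            t = a[i] * b[j] + r[i + j] + carry
            r[i + j] = t & MASK64  # low 64 bits stay in this limb
            carry = t >> 64        # high 64 bits move on as the carry
        r[i + n] = carry           # this limb has not been written yet
    return r
```

Here the carry is a whole 64-bit value rather than 0 or 1.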
The simplest way to solve this problem is to store smaller limbs in wider integers that can hold the overflow: for example, use 32-bit limbs (UInt32), cast them to 64-bit and compute A.a0 + B.a0, etc. But now you have to add 8 limbs instead of 4, so this can't be the most efficient solution.
But what if we could keep track of the overflow while maximizing the efficiency?
The solution is to leave only a little space empty in each limb. To do that we use a different representation, such as 5 52-bit integers, which is called radix 2^52 (each limb now holds 52 bits instead of 64, except the last one, which holds the remaining 48 bits of a 256-bit number).
Now you have spare room to work with, so you don't have to constantly worry about overflow. You also don't end up with so many limbs that the code size grows.
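As a rough illustration of the layout, this Python sketch splits a 256-bit integer into five limbs and joins them back. The helper names `to_5x52` and `from_5x52` are made up for the example, and limbs are assumed to be least significant first.

```python
MASK52 = (1 << 52) - 1  # the low 52 bits of a limb


def to_5x52(x):
    """Split a 256-bit integer into 5 little-endian limbs of 52 bits.

    The top limb only receives 256 - 4*52 = 48 bits.
    """
    return [(x >> (52 * i)) & MASK52 for i in range(5)]


def from_5x52(limbs):
    """Recombine limbs; also works when limbs are not normalized."""
    return sum(limb << (52 * i) for i, limb in enumerate(limbs))
```

Stored in 64-bit words, each of the lower four limbs has 12 unused bits and the top limb has 16.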
Not only does this simplify your algorithm, it also lets you perform more operations in a row before you need to reduce the result. For example, you can compute A+B+C+D like this, which is very simple and efficient since the overflow is not lost:
R.a0 = A.a0 + B.a0 + C.a0 + D.a0
R.a1 = A.a1 + B.a1 + C.a1 + D.a1
R.a2 = A.a2 + B.a2 + C.a2 + D.a2
R.a3 = A.a3 + B.a3 + C.a3 + D.a3
R.a4 = A.a4 + B.a4 + C.a4 + D.a4
In the end you perform the reduction only once, and reduction algorithms are usually pretty fast when the modulus is a prime of a special form.
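A minimal Python sketch of this "add now, fix up later" idea follows. The names `add_lazy` and `propagate_carries` are hypothetical. The sketch only pushes the accumulated overflow up into the next limb; the modular reduction by the prime is deliberately left out, since it depends on the specific modulus.

```python
MASK52 = (1 << 52) - 1  # the low 52 bits of a limb
MASK64 = (1 << 64) - 1  # the size of the register holding a limb


def add_lazy(*nums):
    """Limb-wise sum of several 5x52 numbers, with no carries at all."""
    out = [sum(column) for column in zip(*nums)]
    if any(v > MASK64 for v in out):
        # A real 64-bit register would have silently lost data here.
        raise OverflowError("too many additions without normalizing")
    return out


def propagate_carries(limbs):
    """Move everything above bit 52 of each limb into the next limb."""
    out = []
    carry = 0
    for v in limbs[:-1]:
        v += carry
        out.append(v & MASK52)
        carry = v >> 52
    # The top limb absorbs the last carry; a modular reduction
    # (not shown) would fold any excess back into the low limbs.
    out.append(limbs[-1] + carry)
    return out
```

With 12 spare bits per limb there is room for thousands of normalized additions before a single carry pass is needed.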
To answer your question: we shouldn't place a value of 2^52 or more in any of the limbs, including the least significant one, because that would make the number non-normalized, and further operations on such values could lead to lost data.