There's no need to bullshit here about "optimising datapaths". SHA-256 is basically just a pair of 32-bit-wide shift registers with some combinatorial logic thrown into the feedback loops. The cryptographers at NIST/NSA/etc. worked really hard to make sure that this logic is not minimisable in any meaningful way, because that would make it susceptible to cryptanalysis. Any "architectural" tricks would already have been exploited by the cryptanalysts. There isn't any way to optimise power by e.g. not clocking parts of the circuit when they are not in productive use, which is where most modern CPUs and GPUs save power. So please, no further low-power bullshit unless you can tell us how your low-power strategy applies to a circuit with 50% signal toggle probability. Nobody's going to run a pocket bitmine on battery power.
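For anyone who wants to poke at the 50% toggle figure, here is a rough Python sketch rather than a gate-level simulation. It models the eight 32-bit working registers of the SHA-256 compression function and counts how many of the 256 state bits flip from one round to the next. The helper names (`compress_trace`, `toggle_rate`, `average_toggle_rate`) are made up for this illustration, it only looks at the register bits and not the combinatorial nodes in between, and the round constants are derived from the primes as the standard describes instead of being typed in by hand.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def first_primes(n):
    """Return the first n primes by trial division (fine for n = 64)."""
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes

def icbrt(n):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
# First 32 fractional bits of the square roots of the first 8 primes.
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
# First 32 fractional bits of the cube roots of the first 64 primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def pad_single_block(msg):
    """SHA-256 padding for messages short enough to fit one 64-byte block."""
    assert len(msg) < 56
    return msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))

def compress_trace(block, state):
    """Run 64 rounds; return (output words, list of register states per round)."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    trace = [(a, b, c, d, e, f, g, h)]
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        # The two "shift registers": a..d and e..h each shift by one word,
        # and only a and e are loaded with freshly computed values.
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
        trace.append((a, b, c, d, e, f, g, h))
    out = [(x + y) & MASK for x, y in zip(state, trace[-1])]
    return out, trace

def toggle_rate(trace):
    """Fraction of the 256 register bits that flip per round, averaged over the trace."""
    flips = 0
    for prev, cur in zip(trace, trace[1:]):
        flips += sum(bin(p ^ q).count("1") for p, q in zip(prev, cur))
    return flips / (256 * (len(trace) - 1))

def average_toggle_rate(n_blocks):
    """Average toggle rate over n_blocks deterministic pseudo-random inputs."""
    rates = []
    for i in range(n_blocks):
        msg = hashlib.sha256(str(i).encode()).digest()  # 32 arbitrary bytes
        _, trace = compress_trace(pad_single_block(msg), H0)
        rates.append(toggle_rate(trace))
    return sum(rates) / len(rates)
```

Calling `average_toggle_rate(50)` should give a number close to 0.5 if the register contents behave like random bits, which is the whole point: there is no quiet part of the state to stop clocking.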
I think this is worth highlighting again: compared to a CPU, every operation is "worst case instruction mix". If anyone is telling you that TDP should be much less than worst-case power on a SHA-256(256) ASIC, then they are full of crap.
To implement their experimental/prototype "cloud on a chip" projects (80+ core CPUs), Intel had to come up with aggressive load-balancing and core de-activation strategies. These projects appear to be shelved, although the management strategies may be applied in future mainstream CPUs. It was probably found that such a chip was no more effective than running multiple virtual machines on, say, a 6-core hyperthreaded chip, because so many cores had to be idled at once.
For the less technical, one way to visualise the difference this makes is to imagine a simple "pong" game on the side of a building, with 60W lightbulbs as the pixels. Imagine a 48x32 playfield with two bats of 12 bulbs each (2x6) and a ball of 1 bulb, for 25 bulbs total. Implementing it as white on black means you only have to have 25 bulbs lit at a time: 1500W, so you could just about do it off a household circuit. However, if you want to do black on white, you have to have (48*32)-25 = 1511 bulbs lit, which would be 90,660W. The difference here is extreme to make the point, but the CPU is more like the white on black, and the SHA core more like the black on white. To get closer to real percentages, we could say a 16x16 array where the CPU can light between 2 and 5 bulbs at once, and the SHA core randomly lights between 7 and 9 bulbs at once. So instantaneous power on the CPU-like array may be between 120 and 300W, whereas the SHA-like array would be between 420 and 540W. But the CPU designer would split the difference and call TDP about 220W, so you might be left with the idea that you could SHA on a 16x16 array of lightbulbs for only 220W of power usage, where the 16x16 lightbulbs are our proxy for a 100+ million transistor array.
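The bulb sums are easy to check for yourself. A quick Python sketch, with names such as `lit_power_w` invented just for this example, and with the bat size read as 12 bulbs per bat so that the total comes to 25:

```python
BULB_W = 60  # watts per lightbulb "pixel"

def lit_power_w(bulbs_lit):
    """Power drawn when the given number of bulbs is lit."""
    return bulbs_lit * BULB_W

# 48x32 pong playfield: two 2x6 bats plus a one-bulb ball.
playfield_bulbs = 48 * 32
sprite_bulbs = 2 * (2 * 6) + 1

white_on_black_w = lit_power_w(sprite_bulbs)                    # only the sprites lit
black_on_white_w = lit_power_w(playfield_bulbs - sprite_bulbs)  # everything else lit

# 16x16 toy array: the CPU lights 2 to 5 bulbs, the SHA core 7 to 9.
cpu_range_w = (lit_power_w(2), lit_power_w(5))
sha_range_w = (lit_power_w(7), lit_power_w(9))

# "Splitting the difference" on the CPU range gives the TDP-style figure.
cpu_midpoint_w = sum(cpu_range_w) / 2
```

The exact midpoint of the CPU range works out to 210W, a touch under the rough "about 220W" above, and either way it is well below the 420W floor of the SHA-like array.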
On top of that, imagine we are building a pneumatic/steam computer in about 1830, with large piston clearances and pipes made out of stitched leather. There's a physical limit to how many switches/pistons can be activated at once: although they should not consume steam to remain in position, there are leaks everywhere and the boiler can maintain only so many PSI. Get a bigger boiler, and the leaks just blow harder and you're not much further ahead. This is sort of what is happening on small process nodes now: everything bleeds electrons. Worse than that, unlike the steam computer, it's a more closed environment, so your leakage doesn't just go into open air; it makes "pressure" in a pipe that should be empty at the moment, leading to spurious operation. Try to raise the activation threshold of your valves and you need more PSI, which in turn leaks harder. Meanwhile the resistance to flow makes more heat and consumes power, and your holes get bigger.
Anyway, when you see the little demo engine going chuff, chuff, chuff, don't be too impressed and think that if they can do it once, they can copy and paste it 20, 50, 200 times. It's not whether the pistons and valves do the right thing; it's whether they do it with close enough tolerance, and whether the leather pipes are stitched together well enough, that 20 or 50 of them can run without losing steam everywhere.