Topic: Does core have any SHA256 SIMD parallelization code for "ONE" message? (Read 265 times)

staff
Activity: 4284
Merit: 8808
I'm just going to ask it, since I can't easily find the list, maybe it's just in front of me and I can't see it. Anyone know how to get the list from Intel which chips have SHA? (and from AMD too.)

https://en.wikipedia.org/wiki/Intel_SHA_extensions

Intel Goldmont chips (server-market Atom) and Ice Lake. (I haven't used it on Ice Lake, but it's finally reported there.) Intel has been pre-announcing it for architectures as far back as Skylake and then failing to deliver.

Anything AMD Zen, Zen+ or Zen 2 (so all the Threadripper and EPYC parts), which is what all of Bitcoin's development using SHA-NI has been done on.

The instruction latency of SHA-NI is such that you're still better off interleaving the independent processing of several messages... but even without that it's much faster than anything else, except maybe a super-wide, many-message AVX-512 version.
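If the goal is to check one particular machine rather than find a list of chips, the CPU can be asked directly. A minimal sketch, assuming GCC or Clang on x86 (it uses the `__get_cpuid_count` helper from `<cpuid.h>`; the function name `has_sha_ni` is made up for this example). The SHA extensions are reported in CPUID leaf 7, subleaf 0, EBX bit 29. On other compilers or architectures this sketch simply reports false.

```cpp
#include <cstdio>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction
#endif

// Hypothetical helper: true if the CPU reports the SHA extensions.
// CPUID leaf 7, subleaf 0: EBX bit 29 is the SHA feature flag.
bool has_sha_ni() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid_count returns 0 when the requested leaf is not supported.
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return ((ebx >> 29) & 1u) != 0;
#else
    return false;  // not x86: no SHA-NI to report in this sketch
#endif
}

// Prints a one-line summary of the check above.
void print_sha_ni_support() {
    std::printf("SHA-NI: %s\n", has_sha_ni() ? "yes" : "no");
}
```

Calling `print_sha_ni_support()` from `main` prints the result for the machine it runs on. On Linux, looking for the `sha_ni` flag in `/proc/cpuinfo` should give the same answer without compiling anything.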
legendary
Activity: 3416
Merit: 1912
The Concierge of Crypto
I'm just going to ask it, since I can't easily find the list, maybe it's just in front of me and I can't see it. Anyone know how to get the list from Intel which chips have SHA? (and from AMD too.) ... there used to be a flag or tick box or option to select these things on Intel's ARK site...

Here is the one where you can find them by features:
https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html

So I can go there and select AES-NI ... but can't find one specific for SHA-NI. (Ok, granted, I'm not even sure why I'd want one, but if you're working on something that uses it, then this would be a good processor to play with your software right? hehe.)

AES-NI is useful for third party full disk encryption that takes advantage of it, such as DiskCryptor, TrueCrypt, VeraCrypt (Bitlocker? I don't use that.)
legendary
Activity: 3416
Merit: 1912
The Concierge of Crypto
Quote
sha1msg1
__m128i _mm_sha1msg1_epu32 (__m128i a, __m128i b)
sha1msg2
__m128i _mm_sha1msg2_epu32 (__m128i a, __m128i b)
sha1nexte
__m128i _mm_sha1nexte_epu32 (__m128i a, __m128i b)
sha1rnds4
__m128i _mm_sha1rnds4_epu32 (__m128i a, __m128i b, const int func)
sha256msg1
__m128i _mm_sha256msg1_epu32 (__m128i a, __m128i b)
sha256msg2
__m128i _mm_sha256msg2_epu32 (__m128i a, __m128i b)
sha256rnds2
__m128i _mm_sha256rnds2_epu32 (__m128i a, __m128i b, __m128i k)

Oh, that's what you need? Didn't know Intel made sorta built in functions for some chips, almost like an ASIC.

AMD seems to have them too for Ryzen processors. It's not easy to sort through the chips list to find them, they could be under SSE or something else.
legendary
Activity: 1042
Merit: 2805
Bitcoin and C♯ Enthusiast
Even older Xeons have AES-NI, ... is that different or part of SHA-NI?

Yes, they are different (in case it wasn't clear, AES is Advanced Encryption Standard). There are hundreds of CPU intrinsics available. You can see Intel's intrinsics here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
AES-NI is one group of them, and it was added to far more CPUs than the SHA extensions, so it is normal to see old or cheap CPUs support it.
legendary
Activity: 3416
Merit: 1912
The Concierge of Crypto
Even older Xeons have AES-NI, ... is that different or part of SHA-NI? I think 5th or 6th generation Xeons can be bought for cheap and support AES-NI. I'm a fan of these refurb or off-lease rack servers and workstations.
legendary
Activity: 1042
Merit: 2805
Bitcoin and C♯ Enthusiast
Damn, I was afraid of this. It looks like assembly code and I can't read it. Gotta put it in the to-do list now.

State of the art is ... get a CPU that doesn't suck. :) SHA-NI is much faster than any of these SIMD techniques, esp. in the one-message case.
Yeah, I've been reading some benchmarks on this. It's fantastic. It is surprising that only a handful of CPUs have SHA Extensions although Intel introduced it in 2013!
staff
Activity: 4284
Merit: 8808
I am currently exploring parallelization of the SHA256 algorithm using SIMD, based on a paper I've found that basically parallelizes the "message scheduling" step, which according to the authors takes up 26% of the computation time.

If I understand Bitcoin Core's code (e.g. AVX2) correctly, it doesn't seem to support computing the SHA256 of a single large input using SIMD (e.g. one message of 512+ bytes); it only has code for computing the SHA256 of multiple messages in parallel (i.e. SHA256 of m1, m2, ..., m8), returning multiple hashes (i.e. h1, h2, ..., h8).

If I am reading the code wrong, please explain how it does that.
And if I am right, is there any reason why this feature wasn't added? It seems useful for computing the message digest of a big transaction, especially legacy ones, which could easily be bigger than 512 bytes.

P.S. If you have any scientific paper about this topic that is newer than 2012 please let me know.

You're looking in the wrong place.

https://github.com/bitcoin/bitcoin/commit/c1ccb15b0e847eb95623f9d25dc522aa02dbdbe8#diff-58b88805302ed488ea34900368aab920

Most of the hashing in Bitcoin is of small messages (e.g. 64 bytes), and the N-message parallelization is much faster when it's available.
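To illustrate why the N-message case parallelizes so easily, here is a rough sketch (not Core's actual code) of the message schedule for 8 independent 64-byte blocks stored "transposed", so that word t of every message sits side by side. All names here (`kLanes`, `schedule_lanes`, `schedule_one`) are made up for the example. The inner loop has no dependency between lanes, which is the shape that maps onto one SIMD register per word, with one message per lane.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;  // hypothetical: 8 messages, e.g. 8 x 32-bit lanes

static inline uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }
static inline uint32_t sigma0(uint32_t x) { return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3); }
static inline uint32_t sigma1(uint32_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10); }

// w[t][lane] is word t of message `lane`. The caller fills rows 0..15.
// Every lane does the same operation on its own data, so the inner loop
// is independent across lanes and can be vectorized directly.
void schedule_lanes(uint32_t w[64][kLanes]) {
    for (int t = 16; t < 64; ++t) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            w[t][lane] = sigma1(w[t - 2][lane]) + w[t - 7][lane] +
                         sigma0(w[t - 15][lane]) + w[t - 16][lane];
        }
    }
}

// Scalar reference for a single message, for comparison.
void schedule_one(uint32_t w[64]) {
    for (int t = 16; t < 64; ++t) {
        w[t] = sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16];
    }
}
```

The same lane-per-message layout carries over to the compression rounds. The catch is that it needs N messages of the same length ready at the same time, which fits things like hashing many 64-byte inputs but not one long message.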

But for big messages there is SIMD too, it's just in different files.

State of the art is ... get a CPU that doesn't suck. :) SHA-NI is much faster than any of these SIMD techniques, esp. in the one-message case.
legendary
Activity: 1042
Merit: 2805
Bitcoin and C♯ Enthusiast
I am currently exploring parallelization of the SHA256 algorithm using SIMD, based on a paper I've found that basically parallelizes the "message scheduling" step, which according to the authors takes up 26% of the computation time.

If I understand Bitcoin Core's code (e.g. AVX2) correctly, it doesn't seem to support computing the SHA256 of a single large input using SIMD (e.g. one message of 512+ bytes); it only has code for computing the SHA256 of multiple messages in parallel (i.e. SHA256 of m1, m2, ..., m8), returning multiple hashes (i.e. h1, h2, ..., h8).

If I am reading the code wrong, please explain how it does that.
And if I am right, is there any reason why this feature wasn't added? It seems useful for computing the message digest of a big transaction, especially legacy ones, which could easily be bigger than 512 bytes.

P.S. If you have any scientific paper about this topic that is newer than 2012 please let me know.
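For anyone following along, this is roughly what "parallelizing the message scheduling" means for a single message. It is only a sketch in plain C++ under my own assumptions, not the paper's method or anyone's production code, and the function names are made up. Each new word is W[t] = sigma1(W[t-2]) + W[t-7] + sigma0(W[t-15]) + W[t-16]. Everything except the sigma1 term depends only on words at least 7 positions back, so four words can be started at once; the sigma1 term then has to be added two words at a time, because words t+2 and t+3 need the finished t and t+1.

```cpp
#include <array>
#include <cstdint>

static inline uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }
static inline uint32_t sigma0(uint32_t x) { return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3); }
static inline uint32_t sigma1(uint32_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10); }

// Textbook schedule: one word per iteration, strictly sequential.
std::array<uint32_t, 64> schedule_reference(const std::array<uint32_t, 16>& block) {
    std::array<uint32_t, 64> w{};
    for (int t = 0; t < 16; ++t) w[t] = block[t];
    for (int t = 16; t < 64; ++t) {
        w[t] = sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16];
    }
    return w;
}

// Same result, computed four words per step in SIMD-friendly phases.
std::array<uint32_t, 64> schedule_blocked(const std::array<uint32_t, 16>& block) {
    std::array<uint32_t, 64> w{};
    for (int t = 0; t < 16; ++t) w[t] = block[t];
    for (int t = 16; t < 64; t += 4) {
        // Phase 1: four independent lanes; all inputs are already final.
        for (int i = 0; i < 4; ++i) {
            w[t + i] = w[t + i - 16] + sigma0(w[t + i - 15]) + w[t + i - 7];
        }
        // Phase 2a: words t and t+1 need w[t-2] and w[t-1], both already final.
        for (int i = 0; i < 2; ++i) w[t + i] += sigma1(w[t + i - 2]);
        // Phase 2b: words t+2 and t+3 need the just-finished w[t] and w[t+1].
        for (int i = 2; i < 4; ++i) w[t + i] += sigma1(w[t + i - 2]);
    }
    return w;
}
```

Both functions return the same 64 words for any input block. The compression rounds themselves are a tight serial chain, so for a single message the schedule is the main part that vectorizes this way.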