When I finish I will upload it.
The modulo is the problem for me :)
Try something like:
%define arg1f XMM0
%define arg2f XMM1
%define arg3f XMM2
%define arg4f XMM3
%define arg1 RDI
%define arg2 RSI
%define arg3 RDX
%define arg4 RCX
%define arg5 R8
%define arg6 R9
%macro MULT 3
; Dot the 4-float vector at [%1+%3] with the four rows held in XMM0, XMM1, XMM4, XMM5
; (arg1f, arg2f, xmm4, xmm5); XMM2, XMM3, XMM6, XMM7 are used as scratch.
; The four scalar results are stored at [%2+%3] .. [%2+12+%3].
movups arg3f,[%1+%3]    ; load the source vector
movaps xmm7,arg3f       ; keep a copy for the remaining three rows
mulps arg3f,arg1f       ; element-wise product with row 0
movshdup arg4f, arg3f   ; horizontal sum: duplicate the odd elements...
addps arg3f, arg4f      ; ...add to get pairwise sums...
movaps xmm6,xmm7        ; (copy of the vector for row 1)
movhlps arg4f, arg3f    ; ...bring the upper pair down...
addss arg3f, arg4f      ; ...low scalar now holds the dot product
movss [%2+%3], arg3f    ; store result 0
; the same multiply/horizontal-add pattern repeats for rows 1-3 below
mulps xmm6,arg2f
movshdup arg4f, xmm6
addps xmm6, arg4f
movaps arg3f,xmm7
movhlps arg4f, xmm6
addss xmm6, arg4f
movss [%2+4+%3], xmm6
mulps arg3f,xmm4
movshdup arg4f, arg3f
addps arg3f, arg4f
movaps xmm6,xmm7
movhlps arg4f, arg3f
addss arg3f, arg4f
movss [%2+8+%3], arg3f
mulps xmm6,xmm5
movshdup arg4f, xmm6
addps xmm6, arg4f
movhlps arg4f, xmm6
addss xmm6, arg4f
movss [%2+8+4+%3], xmm6
%endmacro
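If I read the macro right, it dots the vector at [%1+%3] with the four rows kept in XMM0, XMM1, XMM4 and XMM5, so a full 4x4 product could be driven roughly like this (matA, matB and matC are just placeholder labels for your own buffers, not part of the snippet above):

movups xmm0,[matA]      ; row 0 of the left-hand matrix
movups xmm1,[matA+16]   ; row 1
movups xmm4,[matA+32]   ; row 2
movups xmm5,[matA+48]   ; row 3
MULT matB,matC,0        ; first column vector
MULT matB,matC,16       ; second
MULT matB,matC,32       ; third
MULT matB,matC,48       ; fourth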
Are you reserving some bits at the top of each dword for the carry? Otherwise you're going to lose accuracy in the final answer, because the SIMD adds will just wrap around instead of carrying.
I was thinking of applying the excellent 5x52-bit representation used in libsecp256k1, where the 256-bit number is split across 5 quadwords that each hold 52 bits of the real number (except the most significant qword, which only holds the remaining 48 bits). There's also a 10x26-bit variant for 32-bit machines.
With that technique the 12 spare bits per limb let us defer the carry and add the numbers hundreds of times without corruption.
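For concreteness, the layout I have in mind is just five consecutive qwords per number (names are mine, and libsecp256k1's actual field element struct differs in detail):

section .bss
num_a: resq 5   ; value = limb0 + limb1*2^52 + limb2*2^104 + limb3*2^156 + limb4*2^208
num_b: resq 5   ; each limb normally holds <= 52 significant bits of its 64-bit qword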
We would only need 3 SSE adds, or 2 AVX adds, per addition in that case. But I have heard that using the AVX instructions incurs some kind of downclocking penalty, like AVX-512 does?
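A minimal sketch of the 3-add SSE2 version, reusing the arg1/arg2 defines from the snippet above and assuming the plain 5-qword layout I described (pointers to a in arg1 and b in arg2):

; load the five limbs of a
movdqu xmm0,[arg1]       ; limbs 0-1
movdqu xmm1,[arg1+16]    ; limbs 2-3
movq   xmm2,[arg1+32]    ; limb 4
; load the five limbs of b
movdqu xmm3,[arg2]
movdqu xmm4,[arg2+16]
movq   xmm5,[arg2+32]
; the three lazy adds: no carry propagation needed yet
paddq xmm0,xmm3
paddq xmm1,xmm4
paddq xmm2,xmm5
; clobber a with the sum
movdqu [arg1],xmm0
movdqu [arg1+16],xmm1
movq   [arg1+32],xmm2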
An alternative is just keeping the 5 quadwords of each operand in the R8-R15 registers plus RAX/RBX and then clobbering one of the operands with the sum.
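Something like this, with a's limbs in R8-R12 and b's in R13-R15/RAX/RBX (register assignment is just an example); thanks to the spare bits per limb there is no ADC chain to worry about:

add r8,r13     ; limb 0
add r9,r14     ; limb 1
add r10,r15    ; limb 2
add r11,rax    ; limb 3
add r12,rbx    ; limb 4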