Pages:
Author

Topic: Speeding up signature verification - page 2. (Read 23912 times)

legendary
Activity: 1072
Merit: 1189
January 13, 2013, 12:52:01 PM
#19
You can't build an UTXO set (needed for validation) without doing most of the verification anyway (everything except signature checking, basically).
"most of the verification anyway" ... I like to weight this "most" by real-time not by code size or "logical" verification-classes and then it turns out that the signature checking needs 99% or far more of it (depends on your disk speed).

I sure, I fully agree. In terms of CPU usage, signatures validation outweighs everything. But before the last checkpoint (=exactly what is stored in bootstrap.dat), signature checking is disabled anyway (and this hasn't changed in 0.Cool.

Also, I assume that importing bootstrap.dat on your hardware will take more than 5 minutes. Done completely in RAM, it's probably possible, but the client also maintains a database on disk, and does synchronous writes at times to make sure block data and index/coins remain consistent with eachother.
member
Activity: 70
Merit: 10
January 13, 2013, 12:42:17 PM
#18
You can't build an UTXO set (needed for validation) without doing most of the verification anyway (everything except signature checking, basically).
"most of the verification anyway" ... I like to weight this "most" by real-time not by code size or "logical" verification-classes and then it turns out that the signature checking needs 99% or far more of it (depends on your disk speed).

Or put it in actual numbers: I need up to block height 216283:    112.7 sec cpu-time to put all data into place for every possible (including script validation) check. Well, real time is about 2-3 times larger depending on how quick the disk is (I needed e.g. 260 sec) from which I read the block chain.
And doing full verification (with script-verifications) needs 22419827/940 sec > 6.5 hours (I did not do it, only estimated it from a sample of about 890000 OP_CHECKSIG) in the most recent redeeming scripts.

And I don't believe this huge real-time safe is because I use hashtables and not level-DB structures. :-) Therefore I expect when loading a new bootstrap.dat (say to blockheight 216283) from disk into the client with pre-0.8.0 I only need about 5 minutes.

BTW: B-trees may now much better reasoned because you need NOW much more deletions in your data structures due to the redeemed scripts.

smtp
legendary
Activity: 1120
Merit: 1164
January 13, 2013, 12:11:38 PM
#17
If having to wait ten minutes to an hour is what makes you not want to use Bitcoin, how about you wait until we come up with a overlay network on top of Bitcoin with instant transactions? It'll happen eventually; I'm working on one such concept myself.
I guess waiting > 24 h for the whole block chain is more the truth (not 10 minutes to an hour) this is annoying. :-(

It took me a week to apply for my PayPal account back in the day.
legendary
Activity: 1072
Merit: 1189
January 13, 2013, 11:57:10 AM
#16
At least 0.8.0 will have an option to not verify the bootstrap.dat, I think.

You can't build an UTXO set (needed for validation) without doing most of the verification anyway (everything except signature checking, basically). If you mean trusting someone to build a pre-indexed blockchain for you, you're far better off running an SPV node - it's faster, needs less disk, needs less bandwidth to maintain, and doesn't require trust in a centralized source for your block data.

The current git head (pre-0.Cool code still has some problems that cause initial syncup to be slow (in particular, the block fetching algorithm is crappy, and gets very easily confused to the point where it stops doing anything for several minutes). From reports, it also seems it's significantly slower at verifying on Windows systems. Still, if combined with parallel+optimized signature verification code (which may or may not make it into 0.8 release), and the block fetching system gets fixed, I hope we can get close to a sync time of 1 hour on reasonably recent systems.
legendary
Activity: 1072
Merit: 1189
January 13, 2013, 11:50:07 AM
#15
But what I dislike is the idea to expect in future not only a cpu/processor for a bitcoin-client, but also a (high-end ?) GPU for the customer to get reasonable real-time behaviour of his bitcoin-client. This I call a miss-conception to the future.

In the future - at some point (near or far) - it's not reasonable to expect that every end user will run a fully verifying bitcoin node. If they use a desktop client at all (I assume much more will happen on mobile devices, for example), it will probably be with an SPV client or even lighter. At this point in the future, "reasonable real-time behaviour for the bitcoin-client" will only apply to those that have to (miners, payment processors, merchants) or volunteer to. I expect such people to run a node 24/7, and time to do a sync will only apply once.

That doesn't mean we should tell everyone now to move away (even though many people are, unfortunately but understandably) for fully validating clients. While the economy grows, we need a network to sustain it. In this respect, you are right, and we should do efforts to keep performance as good as possible (and I believe I do what I can for that). But as retep says, correctness is of much higher importance - if there is a problem with the code, it can be a disaster (lost coins, forked network, ...).

In particular, I have reasonable confidence in the 25% speedup optimization for secp256k1 after having spent time on it myself and tested it by comparing the output of the optimized operations with the non-optimized OpenSSL ones. I'd still like some profession cryptographer to look at the code still.

GPU code to do ECDSA verification would in the first place be an experiment, in my opinion. If it indeed helps in significantly faster verification, some may consider it. But as things stand now, I don't think there's an actual need.
member
Activity: 70
Merit: 10
January 13, 2013, 11:37:32 AM
#14
If having to wait ten minutes to an hour is what makes you not want to use Bitcoin, how about you wait until we come up with a overlay network on top of Bitcoin with instant transactions? It'll happen eventually; I'm working on one such concept myself.
I guess waiting > 24 h for the whole block chain is more the truth (not 10 minutes to an hour) this is annoying. :-(
At least 0.8.0 will have an option to not verify the bootstrap.dat, I think.

smtp
member
Activity: 70
Merit: 10
January 13, 2013, 11:32:01 AM
#13
Someone just needs to write it...
Emphasis is "just" ... since February 2011. :->>

But what I dislike is the idea to expect in future not only a cpu/processor for a bitcoin-client, but also a (high-end ?) GPU for the customer to get reasonable real-time behaviour of his bitcoin-client. This I call a miss-conception to the future.

smtp
legendary
Activity: 1120
Merit: 1164
January 13, 2013, 11:30:17 AM
#12
To answer your question "Why risk it?": To not lose more and more potential new users because they are annoyed by the huge real-time necessary to verify the current block chain with the official bitcoin versions.
It seems most people don't see this real danger -- this real-time behaviour might restrict (or even kill) bitcoin usage/spread seriously.  :-(

You know what turns potential new users off of Bitcoin? Headlines like "Bitcoin hacked! Thousands lose all their Bitcoins!"

If having to wait ten minutes to an hour is what makes you not want to use Bitcoin, how about you wait until we come up with a overlay network on top of Bitcoin with instant transactions? It'll happen eventually; I'm working on one such concept myself.
member
Activity: 70
Merit: 10
January 13, 2013, 11:17:15 AM
#11
Risk vs reward. No offense to Hal, but 25% isn't a very big improvement, and signature verification is something that absolutely must work correctly. Get it wrong and people can lose a lot of money very quickly. Why risk it?
Yes, inprinciple I agree, 25% is not much (I only got 22%). But if Hal's secp256k1Verify code is integrated into pre-0.8.0 it could end up to be inserted in the next 0.8.0 release officially. This means, other people recognize this as an important improvement.
And in the light that the probable bottle-neck is (or just has become) this ECDSA_verify heavily(!), I can understand their decision.

To answer your question "Why risk it?": To not lose more and more potential new users because they are annoyed by the huge real-time necessary to verify the current block chain with the official bitcoin versions.
It seems most people don't see this real danger -- this real-time behaviour might restrict (or even kill) bitcoin usage/spread seriously.  :-(

smtp
legendary
Activity: 1072
Merit: 1189
January 13, 2013, 11:15:27 AM
#10
It's not a matter of parallel vs. not parallel, it has to do with the fact that GPUs are terrible at point multiplication, the difficult part of signature verification. A regular CPU far outclasses them at it, and for much less energy.

I'm pretty sure that a good OpenCL implementation of ECDSA verification, running on high-end ATI gpu's, will be significantly faster than what OpenSSL (or any implementation) can do on a CPU. People have already written EC addition in OpenCL, reaching speeds of tens of millions of operations per second. EC multiplication is much slower and more complex to implement efficiently, but I expect it would reach tens of thousands of operations per second at least.

Someone just needs to write it...
hero member
Activity: 798
Merit: 1000
January 13, 2013, 11:09:38 AM
#9
Why wont GPUs help much? Surely checking signatures in parallel would be an improvement?

It's not a matter of parallel vs. not parallel, it has to do with the fact that GPUs are terrible at point multiplication, the difficult part of signature verification. A regular CPU far outclasses them at it, and for much less energy.
legendary
Activity: 1072
Merit: 1189
January 13, 2013, 11:02:30 AM
#8
Risk vs reward. No offense to Hal, but 25% isn't a very big improvement, and signature verification is something that absolutely must work correctly. Get it wrong and people can lose a lot of money very quickly. Why risk it?

The secp256k1-specific optimization has been implemented, and there's a pull request to integrate it in the reference client. It does indeed achieve some 20% improvement.

What smtp asked was about the parallel batch verification, with potentially much higher speedsup (Hal mentioned x4). The problem is that this is a much more invasive change, and the risk of getting crypto stuff wrong is extremely high. It's also not certain which degree of speedup is possible, as we need to do recovery of R before batch verification is possible (and changing this requires a hard fork basically, as the signatures are part of the scripting rules).

That said, if someone is interested in investigating, and potentially supplying a patch that is well testable... by all means, do so.
legendary
Activity: 1190
Merit: 1004
January 13, 2013, 10:41:08 AM
#7
Why wont GPUs help much? Surely checking signatures in parallel would be an improvement?
legendary
Activity: 1120
Merit: 1164
January 13, 2013, 10:24:21 AM
#6
Since nearly 2 years has gone but no activity for this "batch verification"-algorithm has (seemingly) done/published, I like to ask: why?

smtp

Risk vs reward. No offense to Hal, but 25% isn't a very big improvement, and signature verification is something that absolutely must work correctly. Get it wrong and people can lose a lot of money very quickly. Why risk it?
member
Activity: 70
Merit: 10
January 13, 2013, 09:25:57 AM
#5
The batch verification thing does sound interesting. It's almost ideal for BitCoin in fact.

Since nearly 2 years has gone but no activity for this "batch verification"-algorithm has (seemingly) done/published, I like to ask: why?

smtp
Hal
vip
Activity: 314
Merit: 4276
February 08, 2011, 04:08:58 PM
#4
I've put the code up at github. I'm a C programmer more than C++; it shows.

https://github.com/halfinney/bitcoin/tree/secp256k1

Here's a self contained test program that compares the speeds. Build with -lcrypto.

Code:
#include 
#include
#include
#include
#include

#define REPS 1000


// Split a secp256k1 exponent k into two smaller ones k1 and k2 such that for any point Y,
// k*Y = k1*Y + k2*Y', where Y' = lambda*Y is very fast
static int
splitk (BIGNUM *bnk1, BIGNUM *bnk2, const BIGNUM *bnk, const BIGNUM *bnn, BN_CTX *ctx)
{
    BIGNUM *bnc1 = BN_new();
    BIGNUM *bnc2 = BN_new();
    BIGNUM *bnt1 = BN_new();
    BIGNUM *bnt2 = BN_new();
    BIGNUM *bnn2 = BN_new();
    static unsigned char a1b2[] = {
        0x30, 0x86, 0xd2, 0x21, 0xa7, 0xd4, 0x6b, 0xcd,
        0xe8, 0x6c, 0x90, 0xe4, 0x92, 0x84, 0xeb, 0x15,
    };
    static unsigned char b1m[] = {
        0xe4, 0x43, 0x7e, 0xd6, 0x01, 0x0e, 0x88, 0x28,
        0x6f, 0x54, 0x7f, 0xa9, 0x0a, 0xbf, 0xe4, 0xc3,
    };
    static unsigned char a2[] = {
        0x01, 0x14, 0xca, 0x50, 0xf7, 0xa8, 0xe2, 0xf3,
        0xf6, 0x57, 0xc1, 0x10, 0x8d, 0x9d, 0x44, 0xcf,
        0xd8,
    };
    BIGNUM *bna1b2 = BN_bin2bn(a1b2, sizeof(a1b2), NULL);
    BIGNUM *bnb1m = BN_bin2bn(b1m, sizeof(b1m), NULL);
    BIGNUM *bna2 = BN_bin2bn(a2, sizeof(a2), NULL);

    BN_rshift1(bnn2, bnn);
    BN_mul(bnc1, bnk, bna1b2, ctx);
    BN_add(bnc1, bnc1, bnn2);
    BN_div(bnc1, NULL, bnc1, bnn, ctx);
    BN_mul(bnc2, bnk, bnb1m, ctx);
    BN_add(bnc2, bnc2, bnn2);
    BN_div(bnc2, NULL, bnc2, bnn, ctx);

    BN_mul(bnt1, bnc1, bna1b2, ctx);
    BN_mul(bnt2, bnc2, bna2, ctx);
    BN_add(bnt1, bnt1, bnt2);
    BN_sub(bnk1, bnk, bnt1);
    BN_mul(bnt1, bnc1, bnb1m, ctx);
    BN_mul(bnt2, bnc2, bna1b2, ctx);
    BN_sub(bnk2, bnt1, bnt2);

    BN_free(bnc1);
    BN_free(bnc2);
    BN_free(bnt1);
    BN_free(bnt2);
    BN_free(bnn2);
    BN_free(bna1b2);
    BN_free(bnb1m);
    BN_free(bna2);
    return 0;
}

static int
secp256k1Verify(const unsigned char hash[32], const unsigned char *dersig, size_t sigsize, const EC_KEY *pkey)
{
    int rslt = 0;;
    const EC_GROUP *group = EC_KEY_get0_group(pkey);
    const EC_POINT *Y = EC_KEY_get0_public_key(pkey);
    const EC_POINT *G = EC_GROUP_get0_generator(group);
    EC_POINT *Glam = EC_POINT_new(group);
    EC_POINT *Ylam = EC_POINT_new(group);
    EC_POINT *R = EC_POINT_new(group);
    const EC_POINT *Points[3];
    const BIGNUM *bnexps[3];
    BIGNUM *bnp = BN_new();
    BIGNUM *bnn = BN_new();
    BIGNUM *bnx = BN_new();
    BIGNUM *bny = BN_new();
    BIGNUM *bnk = BN_new();
    BIGNUM *bnk1 = BN_new();
    BIGNUM *bnk2 = BN_new();
    BIGNUM *bnk1a = BN_new();
    BIGNUM *bnk2a = BN_new();
    BIGNUM *bnsinv = BN_new();
    BIGNUM *bnh = BN_bin2bn(hash, 32, NULL);
    static unsigned char beta[] = {
        0x7a, 0xe9, 0x6a, 0x2b, 0x65, 0x7c, 0x07, 0x10,
        0x6e, 0x64, 0x47, 0x9e, 0xac, 0x34, 0x34, 0xe9,
        0x9c, 0xf0, 0x49, 0x75, 0x12, 0xf5, 0x89, 0x95,
        0xc1, 0x39, 0x6c, 0x28, 0x71, 0x95, 0x01, 0xee,
    };
    BIGNUM *bnbeta = BN_bin2bn(beta, 32, NULL);
    BN_CTX *ctx = BN_CTX_new();
    ECDSA_SIG *sig = d2i_ECDSA_SIG(NULL, &dersig, sigsize);

    if (sig == NULL)
        goto done;

    EC_GROUP_get_curve_GFp(group, bnp, NULL, NULL, ctx);
    EC_GROUP_get_order(group, bnn, ctx);

    if (BN_is_zero(sig->r) || BN_is_negative(sig->r) || BN_ucmp(sig->r, bnn) >= 0
        || BN_is_zero(sig->s) || BN_is_negative(sig->s) || BN_ucmp(sig->s, bnn) >= 0)
        goto done;

    EC_POINT_get_affine_coordinates_GFp(group, G, bnx, bny, ctx);
    BN_mod_mul(bnx, bnx, bnbeta, bnp, ctx);
    EC_POINT_set_affine_coordinates_GFp(group, Glam, bnx, bny, ctx);
    EC_POINT_get_affine_coordinates_GFp(group, Y, bnx, bny, ctx);
    BN_mod_mul(bnx, bnx, bnbeta, bnp, ctx);
    EC_POINT_set_affine_coordinates_GFp(group, Ylam, bnx, bny, ctx);

    Points[0] = Glam;
    Points[1] = Y;
    Points[2] = Ylam;

    BN_mod_inverse(bnsinv, sig->s, bnn, ctx);
    BN_mod_mul(bnk, bnh, bnsinv, bnn, ctx);
    splitk(bnk1, bnk2, bnk, bnn, ctx);
    bnexps[0] = bnk2;
    BN_mod_mul(bnk, sig->r, bnsinv, bnn, ctx);
    splitk(bnk1a, bnk2a, bnk, bnn, ctx);
    bnexps[1] = bnk1a;
    bnexps[2] = bnk2a;

    EC_POINTs_mul(group, R, bnk1, 3, Points, bnexps, ctx);
    EC_POINT_get_affine_coordinates_GFp(group, R, bnx, NULL, ctx);
    BN_mod(bnx, bnx, bnn, ctx);
    rslt = (BN_cmp(bnx, sig->r) == 0);

    ECDSA_SIG_free(sig);
done:
    EC_POINT_free(Glam);
    EC_POINT_free(Ylam);
    EC_POINT_free(R);
    BN_free(bnp);
    BN_free(bnn);
    BN_free(bnx);
    BN_free(bny);
    BN_free(bnk);
    BN_free(bnk1);
    BN_free(bnk2);
    BN_free(bnk1a);
    BN_free(bnk2a);
    BN_free(bnsinv);
    BN_free(bnh);
    BN_free(bnbeta);
    BN_CTX_free(ctx);
   
    return rslt;
}



main()
{
    EC_KEY *pkey;
    EC_GROUP *group;
    const EC_POINT *ecpub;
    unsigned char sig[100];
    unsigned siglen = sizeof(sig);
    unsigned char hash[32];
    struct timeval tv1, tv2;
    double time1, time2;
    int i;
    int rslt;

    ENGINE_load_builtin_engines();
    CRYPTO_malloc_init();

    group = EC_GROUP_new_by_curve_name(NID_secp256k1);
    pkey=EC_KEY_new();
    EC_KEY_set_group(pkey,group);
    EC_KEY_generate_key(pkey);
    ecpub = EC_KEY_get0_public_key(pkey);
    ECDSA_sign(0, hash, 32, sig, &siglen, pkey);

    rslt = ECDSA_verify(0, hash, 32, sig, siglen, pkey);
    printf("rslt = %d\n", rslt);

    rslt = secp256k1Verify(hash, sig, siglen, pkey);
    printf("rslt = %d\n", rslt);

    hash[0]++;

    rslt = ECDSA_verify(0, hash, 32, sig, siglen, pkey);
    printf("rslt = %d\n", rslt);

    rslt = secp256k1Verify(hash, sig, siglen, pkey);
    printf("rslt = %d\n", rslt);

    hash[0]--;

    gettimeofday(&tv1, NULL);
    for(i=0; i        rslt = ECDSA_verify(0, hash, 32, sig, siglen, pkey);
    }
    gettimeofday(&tv2, NULL);
    printf("rslt = %d\n", rslt);
    time1 = (tv2.tv_sec - tv1.tv_sec + (tv2.tv_usec - tv1.tv_usec)/1000000.) / REPS;
    printf("time: %g\n", time1);

    gettimeofday(&tv1, NULL);
    for(i=0; i        rslt = secp256k1Verify(hash, sig, siglen, pkey);
    }
    gettimeofday(&tv2, NULL);
    printf("rslt = %d\n", rslt);
    time2 = (tv2.tv_sec - tv1.tv_sec + (tv2.tv_usec - tv1.tv_usec)/1000000.) / REPS;
    printf("time: %g\n", time2);
    printf("%f%% speedup\n", (time1-time2)/time1);

    exit(0);
}
legendary
Activity: 1526
Merit: 1134
February 08, 2011, 07:28:00 AM
#3
Great stuff! A 25% win is pretty significant when you look at it in terms of hardware costs rather than initial load time. It would certainly drop the cost of running a VISA scale node.

The batch verification thing does sound interesting. It's almost ideal for BitCoin in fact.
Hal
vip
Activity: 314
Merit: 4276
February 08, 2011, 02:11:52 AM
#2
I implemented an optimized ECDSA verify for the secp256k1 curve used by Bitcoin, based on pages 125-129 of the Guide to Elliptic Curve Cryptography, by Hankerson, Menezes and Vanstone. I own the book but I also found a PDF on a Russian site which is more convenient.

secp256k1 uses the following prime for its x and y coordinates:

p = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f

and the curve order is:

n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141

The first step is to compute values beta, lambda such that for any curve point Q = (x,y):

lambda * Q = (beta*x mod p, y)

This is the so-called efficiently computable endomorphism, and what it means is, you can multiply any curve point by this special value lambda very quickly, by doing a single mod-p multiply.

The book tells (well, hints) how to compute lambda and beta, and here are the values I found:

lambda = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72

beta = 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee


Given that we can multiply by lambda quickly, here is the trick to compute k*Q. First use the shortcut to compute Q' = lambda*Q. Next, k must be decomposed into two parts k1 and k2, each about half the width of n, such that:

k = k1 + k2*lambda mod n

Then

k*Q = (k1 + k2*lambda)*Q = k1*Q + k2*lambda*Q = k1*Q + k2*Q'

That last expression can be evaluated efficiently via a double multiply algorithm, and since k1 and k2 are half length, we get the speedup.

The missing piece is splitting k into k1 and k2. This uses the following 4 values:

a1 = 0x3086d221a7d46bcde86c90e49284eb15
b1 = -0xe4437ed6010e88286f547fa90abfe4c3
a2 = 0x114ca50f7a8e2f3f657c1108d9d44cfd8
b2 = 0x3086d221a7d46bcde86c90e49284eb15

(it's ok that a1 = b2)

Use these as follows to split k:

c1 = RoundToNearestInteger(b2*k/n)
c2 = RoundToNearestInteger(-b1*k/n)

k1 = k - c1*a1 - c2*a2
k2 = -c1*b1 - c2*b2


With all this, I measure about a 25% speedup on raw signature verifications. For Bitcoin, initial block load from a wifi-connected local node is reduced from:

real   36m21.537s
user   24m43.277s
sys   0m27.950s

to:

real   32m59.777s
user   18m21.145s
sys   0m28.262s

Not a big difference, and it would probably be even less significant when fetching over the net.
Hal
vip
Activity: 314
Merit: 4276
February 07, 2011, 05:31:27 PM
#1
In another thread, [mike] wrote:

I don't know of any algorithms that will make ECDSA much faster using GPUs. The best bet for speeding this up (assuming we care) is to exploit the special properties of the secp256k1 curve:

http://portal.acm.org/citation.cfm?id=1514739

http://www.hyperelliptic.org/tanja/KoblitzC.html

I'm trying to inplement the secp256k1 shortcut. Should have results shortly. Unfortunately I only expect about 20% speedup. We'll see.

I'm also looking at batch signature verification:

http://cseweb.ucsd.edu/~mihir/papers/batch.pdf

http://www.math.snu.ac.kr/~jhcheon/publications/2007/BatchVer_PKC2007_CL.pdf

This can theoretically speed up verification by almost a factor of 4 (table 6 in 2nd paper) if you have a lot of signatures in a batch. It requires a slight change in the ECDSA signature format: (r, s) is replaced by (R, s), where R is the EC point of which r is the x coordinate. This change could be applied retroactively, as R is calculated and discarded every time the sig is verified.

We do tend to have sigs in batches, namely blocks; sometimes several dozen or even hundreds, and this will grow. Batch verification returns true iff all sigs are valid. A good block should never have invalid signatures so it makes sense to batch the verify.

I need to research some security aspects, namely: does R need to be checked to see if it's on the curve (ie y^2 = x^3 + 7)? And what is an appropriate security parameter for the probability the batch verify test could be fooled? The papers talk about 2^80 but that seems too conservative.
Pages:
Jump to: