Very good speed!
It took me months to reach my current 45 million points (x only) per second per core on my mobile Xeon.
On my laptop, using 4 cores gives less than a 4x speedup.
I compute three points at a time: x, beta*x, beta2*x (where beta2 means beta^2 mod p). By "45 million points per second" I mean 15 million triples (x, beta*x, beta2*x). Do you compute the y coordinates too?
From these 3 points you can get 6 compressed public keys: "02x", "03x", "02beta*x", "03beta*x", "02beta2*x", "03beta2*x".
If you have the y coordinate too, you can also get the 6 uncompressed keys: "04xy", "04beta*xy", "04beta2*xy", "04x(p-y)", "04beta*x(p-y)", "04beta2*x(p-y)".
So in 1 second I could get about 90 million compressed public keys per core (45 million x values, two keys each), and with the y coordinates at least 170 million compressed and uncompressed public keys per core.
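
To make the key counting above concrete, here is a minimal Python sketch (not my actual code, which is much lower level). It relies on the fact that beta^3 = 1 mod p, so (beta*x)^3 + 7 = x^3 + 7 and the same y works for all three x values. The function names are made up for illustration, and beta is computed as some nontrivial cube root of 1 mod p rather than hard-coded, so which root ends up labelled "beta" and which "beta2" is an assumption; the set of three x values is the same either way.

```python
# Sketch only: derive the 6 related points and their 12 serialized keys
# from one secp256k1 point (x, y). Helper names are hypothetical.

P = 2**256 - 2**32 - 977  # secp256k1 field prime

# Generator point, used here only as a sample input.
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8


def cube_root_of_unity(p=P):
    """Return a nontrivial cube root of 1 mod p (needs p = 1 mod 3)."""
    exp = (p - 1) // 3
    for g in range(2, 100):
        b = pow(g, exp, p)
        if b != 1:
            return b
    raise ValueError("no nontrivial cube root found")


BETA = cube_root_of_unity()
BETA2 = BETA * BETA % P


def on_curve(x, y):
    """Check y^2 = x^3 + 7 (mod p)."""
    return (y * y - x * x * x - 7) % P == 0


def related_points(x, y):
    """The 6 points (x, beta*x, beta2*x) combined with (y, p - y)."""
    xs = (x, BETA * x % P, BETA2 * x % P)
    return [(xi, yi) for yi in (y, P - y) for xi in xs]


def compressed(x, y):
    """'02' + x if y is even, '03' + x if y is odd."""
    return ("03" if y & 1 else "02") + format(x, "064x")


def uncompressed(x, y):
    """'04' + x + y."""
    return "04" + format(x, "064x") + format(y, "064x")


if __name__ == "__main__":
    for px, py in related_points(GX, GY):
        print(compressed(px, py))
        print(uncompressed(px, py))
```

Since p is odd, exactly one of y and p-y is even, which is why every x gives both a "02" and a "03" key without knowing y at all. The private keys belonging to the related points (k*lambda, n-k and so on) are not covered by this sketch.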