
Topic: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? - page 11. (Read 95279 times)

legendary
Activity: 2968
Merit: 1198
I shouldn't have to manually convert divs to multiplications to get the job done faster. It's kind of elemental.

No it isn't elemental and it isn't even a valid optimization (without sacrificing accuracy with -funsafe-math, etc.).

I will insist on that. It is elemental.


Incorrect.  See page 30 here: https://software.intel.com/sites/default/files/article/326703/fp-control-2012-08.pdf

Actually I'd suggest you at least browse the whole document or read some other similar coverage of the topic.

In your particular program it might do the conversion because you have a specific denominator where it can determine that the optimization is safe. It isn't in general.
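A quick self-contained check (my own sketch, not from your program) makes the rounding point concrete: with a power-of-two divisor the reciprocal is exactly representable, so x/d == x*(1/d) for every x, but with a general divisor the reciprocal multiplication involves two roundings instead of one and the results can differ:

Code:
#include <stdio.h>

int main(void)
{
  const double d1 = 2.0,  r1 = 1.0 / d1;  /* 0.5 is exactly representable */
  const double d2 = 49.0, r2 = 1.0 / d2;  /* 1/49 must be rounded */
  int diff1 = 0, diff2 = 0;
  int i;

  for (i = 1; i <= 1000000; i++) {
    double x = (double)i;
    if (x / d1 != x * r1) diff1++;  /* never: multiplying by 0.5 is exact */
    if (x / d2 != x * r2) diff2++;  /* happens: the reciprocal was rounded first */
  }
  printf("/2  vs *0.5    mismatches: %d\n", diff1);
  printf("/49 vs *(1/49) mismatches: %d\n", diff2);
  return 0;
}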

As for the rest of your program, the problem likely is that it is stupid. Nobody writes a program to perform 20 divisions that way, so the heuristics chosen as a compromise between compiler implementation effort, compiler speed, and runtime speed don't happen to handle it. Doing so would likely involve significant complications having to do with storing local variables in portions of an SSE register (which also requires tool chain support for debugging) and also has the potential to make some code slower.

If you instead did this:

Code:
void b(double a[], double d)
{
  int i;
  for (i = 0; i < 20; ++i)
    a[i] /= d;
}

then you would get more or less exactly what you want:

Code:
divpd %xmm1, %xmm2
movapd %xmm2, (%rax)
movapd 16(%rax), %xmm2
divpd %xmm1, %xmm2
movapd %xmm2, 16(%rax)
movapd 32(%rax), %xmm2
divpd %xmm1, %xmm2
movapd %xmm2, 32(%rax)
        ...
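If you want the packed form regardless of what the auto-vectorizer decides, you can also spell it out with SSE2 intrinsics. A minimal sketch of that approach (the function name is just for illustration):

Code:
#include <emmintrin.h>  /* SSE2 intrinsics */

void div20(double a[20], double d)
{
  __m128d vd = _mm_set1_pd(d);          /* broadcast the divisor to both lanes */
  int i;
  for (i = 0; i < 20; i += 2) {
    __m128d v = _mm_loadu_pd(&a[i]);    /* load two doubles */
    v = _mm_div_pd(v, vd);              /* one divpd performs two divisions */
    _mm_storeu_pd(&a[i], v);
  }
}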
legendary
Activity: 1708
Merit: 1049
I shouldn't have to manually convert divs to multiplications to get the job done faster. It's kind of elemental.

No it isn't elemental and it isn't even a valid optimization (without sacrificing accuracy with -funsafe-math, etc.).

I will insist on that. It is elemental.

GCC also has the same behavior (converting divs => muls) at very low optimization levels, because the results are the same. Even at -O1 or -O2. This is *not* reserved for higher-level, unsafe optimizations.

   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;

=> -O0 (no optimization) output like this (divsd = scalar SSE division):

  400744:   f2 0f 5e 45 b0          divsd  xmm0,QWORD PTR [rbp-0x50]
  400749:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
  40074e:   f2 0f 10 45 f0          movsd  xmm0,QWORD PTR [rbp-0x10]
  400753:   f2 0f 5e 45 b0          divsd  xmm0,QWORD PTR [rbp-0x50]
  400758:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
  40075d:   f2 0f 10 45 f0          movsd  xmm0,QWORD PTR [rbp-0x10]
  400762:   f2 0f 5e 45 b0          divsd  xmm0,QWORD PTR [rbp-0x50]
  400767:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
  40076c:   f2 0f 10 45 f0          movsd  xmm0,QWORD PTR [rbp-0x10]

=> -O1 (low optimization) output like:

  400728:   f2 0f 10 44 24 20       movsd  xmm0,QWORD PTR [rsp+0x20]
  40072e:   f2 0f 59 c1             mulsd  xmm0,xmm1
  400732:   f2 0f 59 c1             mulsd  xmm0,xmm1
  400736:   f2 0f 59 c1             mulsd  xmm0,xmm1
  40073a:   f2 0f 59 c1             mulsd  xmm0,xmm1
  40073e:   f2 0f 59 c1             mulsd  xmm0,xmm1
  400742:   66 44 0f 28 d0          movapd xmm10,xmm0
  400747:   f2 0f 11 44 24 20       movsd  QWORD PTR [rsp+0x20],xmm0
  40074d:   f2 0f 10 44 24 08       movsd  xmm0,QWORD PTR [rsp+0x8]
  400753:   f2 0f 59 c1             mulsd  xmm0,xmm1
  400757:   f2 0f 59 c1             mulsd  xmm0,xmm1
  40075b:   f2 0f 59 c1             mulsd  xmm0,xmm1

=> -O2 and -O3: more of the same, but 20 scalar instructions one after the other (and probably intentionally avoiding xmm0, which in my experience is slower):

  40060f:   90                      nop
  400610:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400614:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400618:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40061c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400620:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400624:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400628:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40062c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400630:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400634:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400638:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40063c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400640:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400644:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400648:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40064c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400650:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400654:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400658:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40065c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400660:   66 0f 2e f5             ucomisd xmm6,xmm5

And finally at -Ofast you get ~5 scalar multiplications and the rest of the multiplications are broken down into ...additions (scalar and packed), because adds are faster than muls. -Ofast can degrade accuracy; I don't know if that's related to the muls => adds, I'll have to analyze that at some point. But divs => muls is definitely safe. They may not do it with the proposed 1/n (or it could be conditional on n not being zero), but they do it. We are using such code daily if our binaries are compiled with anything above -O0 (and they are).

edit:

I tried a variation where instead of /g with g=2 (which becomes *0.5), I put in a g value like 2.432985742898957284979048059480928509285309285290853029850235942 ...

Now with that value, up to -O3 it uses divs, and only at the -Ofast level does it turn them into muls and packed adds.

So the difference between the gcc and freepascal compilers is that gcc can easily recognize where divs can be turned into muls safely (where precision is not affected), while freepascal just doesn't do it at all, even with safe divisors that don't affect accuracy like 2 (reciprocal 0.5).
legendary
Activity: 2968
Merit: 1198
But -fprofile-generate on top of -Ofast shouldn't give me 500ms instead of 1200 (~2.5x speedup). That's totally absurd. It's just running a profiler (with OVERHEAD) to monitor how the flow goes, writing the profile on the disk and then using the profile for the next compilation run. It's not supposed to go faster. It never goes faster with a profiler.

Somewhat, but compilation is a series of heuristics. Getting the fastest possible program is likely theoretically and certainly practically impossible.
legendary
Activity: 2968
Merit: 1198
I shouldn't have to manually convert divs to multiplications to get the job done faster. It's kind of elemental.

No it isn't elemental and it isn't even a valid optimization (without sacrificing accuracy with -funsafe-math, etc.).

These are not real numbers. a/b = a*(1/b) is FALSE in general for floating point

This kind of assumption on your part indicates part of the problem you are having in not understanding programming very well.

That said, I agree with you the compiler is just not doing a very good job with SSE instructions in some cases for multiple scalars. It does better with arrays (in some cases at least).
sr. member
Activity: 420
Merit: 262
The source of the brownouts in Mindanao has been caught on video ... very strange phenomenon indeed:

https://www.youtube.com/watch?v=v_lj77Km0Ao

The phenomenon has also been captured on video in Compton, California:

https://www.youtube.com/watch?v=07helw6mjeg

The Canadian version is a bit more stiff like hockey:

https://www.youtube.com/watch?v=ko4goyw1Q84
sr. member
Activity: 420
Merit: 262
That is entirely the wrong conceptualization. The semantics of the C++ code is captured by the type system.

Which, again, is a spec in a book. It's theory.

Compilation and asm output is all about the compiler, no?

If the output is violating the invariants, it is a bug and should be reported.
legendary
Activity: 1708
Merit: 1049
Did the low-level optimizations violate any of the invariants of the C or C++ code? See the problem is that these languages have corner cases where there are insufficient invariants and thus you don't get what you thought the invariants were.

-Ofast probably...

But -fprofile-generate on top of -Ofast shouldn't give me 500ms instead of 1200 (~2.5x speedup). That's totally absurd. It's just running a profiler (with OVERHEAD) to monitor how the flow goes, writing the profile on the disk and then using the profile for the next compilation run. It's not supposed to go faster. It never goes faster with a profiler.

That is entirely the wrong conceptualization. The semantics of the C++ code is captured by the type system.

Which, again, is a spec in a book. It's theory.

Compilation and asm output is all about the compiler, no?
sr. member
Activity: 420
Merit: 262
Elegance and comprehensibility via holistic unification of design concepts. You basically have to know the C compiler source code now to know what it will do. The 1000+ pages of specification is a clusterfuck.

Actually in order to understand what the compiler will try to do, you must first have a good grasp of another few thousand pages: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

...and even then, practice will destroy theoretic behavior. I'm seeing this over and over and over.

Did the low-level optimizations violate any of the invariants of the C or C++ code? See the problem is that these languages have corner cases where there are insufficient invariants and thus you don't get what you thought the invariants were.

The language is just a syntax written in a book for programmers ("that's how you'll code the X language"), plus the text file that the coder writes. But it's all happening in the compiler really.

That is entirely the wrong conceptualization. The semantics of the C++ code is captured by the type system. If the type system invariants are not as tight as you expected, due to limits of expression in the language design or obscure corner cases, then the optimization or generated assembly output will not match what you think it should be.
legendary
Activity: 1708
Merit: 1049
Elegance and comprehensibility via holistic unification of design concepts. You basically have to know the C compiler source code now to know what it will do. The 1000+ pages of specification is a clusterfuck.

Actually in order to understand what the compiler will try to do, you must first have a good grasp of another few thousand pages: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

...and even then, practice will destroy theoretic behavior. I'm seeing this over and over and over. The counterintuitive functionality that results from asm (weird hardware behavior) or compiler parameters (weird translation to asm) is mind-boggling.

I ran into a very weird bug this morning, and I'm still scratching my head. The GCC compiler has an argument where you can run it with -fprofile-generate.

It allows you to run the program in profiling mode: the program saves relevant information about how it runs to a disk file, and then you recompile with -fprofile-use. With -fprofile-use, GCC reads the disk file, sees how execution of the profiled test binary went, and regenerates the binary (now understanding the logic and knowing what it has to do better) so that it performs better.
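For reference, the usual two-step workflow looks like this (same flags as in the timings below; the output name is illustrative):

Code:
gcc Maths4asm.c -lm -Ofast -fprofile-generate -o bench
./bench            # instrumented run; writes *.gcda profile data to disk
gcc Maths4asm.c -lm -Ofast -fprofile-use -o bench
./bench            # recompiled using the recorded profile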

So I have this small benchmark that does 100mn loops of 20 divisions by 2. Periodically it bumps the values back up so that it always has something to divide by 2. I time this and see the results.

Code:
#include <stdio.h>
#include <math.h>
#include <time.h>
 
int main()
{
printf("\n");

const double a = 3333333.3456743289;  //initial randomly assigned values to start halving
const double aa = 4444555.444334244;
const double aaa = 6666777.66666666;
const double aaaa = 32769999.123458;

unsigned int i;
double score;
double g; //the number to be used for making the divisions, so essentially halving everything each round

double b;
double bb;
double bbb;
double bbbb;

g = 2;  

b = a;
bb = aa;
bbb = aaa;
bbbb = aaaa;

double total_time;
clock_t start, end;
 
start = 0;
end = 0;
score = 0;

start = clock();
 
 for (i = 1; i <100000001; i++)
 {
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
    
   if (b    < 1.0000001)  {b=b+i+12.432432432;}  //just adding more stuff  in order for the number
   if (bb   < 1.0000001)  {bb=bb+i+15.4324442;}  //to return back to larger values after several
   if (bbb  < 1.0000001)  {bbb=bbb+i+19.42884;}  //rounds of halving
   if (bbbb < 1.0000001)  {bbbb=bbbb+i+34.481;}
}

 end = clock();

 total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
  
 score = (10000000 / total_time);
 printf("\nFinal number: %0.20f", (b+bb+bbb+bbbb));
 
 printf("\nTime elapsed: %0.0f msecs", total_time);  
 printf("\nScore: %0.0f\n", score);
 
 return 0;
}

(executed in quad q8200 @ 1.75ghz underclock)

gcc Maths4asm.c -lm -O0  => 6224ms
gcc Maths4asm.c -lm -O2 and -O3  => 1527ms
gcc Maths4asm.c -lm -Ofast  => 1227ms
gcc Maths4asm.c -lm -Ofast -march=nocona => 1236ms
gcc Maths4asm.c -lm -Ofast -march=core2 => 1197ms  (I have a core quad, technically it's core2 arch)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate => 624ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-generate => 530ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-use => 1258ms (slower than without PGO, slower than -fprofile-generate)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use => 1222ms (slower than without PGO, slower than -fprofile-generate).

So PGO optimization made it worse (L O L), but the most mindblowing thing is that running the profiler got execution times down to 530ms. The profiled run (-generate) should normally take this to 4000-5000ms or above, as it monitors the process to create a log file. I have never run into a -fprofile-generate build that wasn't at least 2-3 times slower than a normal build - let alone 2-3 times faster. This is totally absurd.

And then, to top it all, -fprofile-use (using the logfile to create the best binary) created worse binaries.

Oh, and "nocona" (pentium4+) suddenly became ...the better architecture instead of core2 :D

This stuff is almost unbelievable. I thought initially that the profiler must be activating multithreading, but no. I scripted 4 simultaneous runs and they all gave the same time - meaning there was no extra CPU use in other threads.

Quote
Add more elegant syntax and less noisy implementation of polymorphism, first-class functions, etc.. Then fix corner cases (e.g. 'const') where the C++ compiler can't give you the correct warning nor (as explained in the video) enforce the programmer's intended semantics at the LLVM optimization layer.

There is no "can't give you". It's simply not programmed to give you that. It can be programmed to do whatever you want it to do: from compiling only safe code, to giving you the correct warning, and so on. You quoted something similar in the post above on how C++ is being "upgraded" to do such stuff.

The language is just a syntax written in a book for programmers ("that's how you'll code the X language"), plus the text file that the coder writes. But it's all happening in the compiler really. If you have a great compiler => "wow, the X language is fast". If you have a very verbose compiler that can actually help you code better => "wow, the X language is excellent in warnings and helping you code"... if the compiler can switch 'safe' and 'unsafe' execution styles on / off => "wow, the X language is very flexible", etc etc. A language, ultimately, is as good as its compiler - in terms of features. Syntax and structure are different issues, and I generally prefer simple-to-read (or write) code instead of high levels of abstraction. It's not that I'm bad at abstract thinking. It's simply more time consuming for me to start searching multiple files to see what each thing does, and then follow references to other parts of the code, etc etc. How is that supposed to be readable?

Quote
AlexGR, I think you would be well served by taking a compiler design course and implementing both a low-level imperative-paradigm language and a high-level functional-programming-paradigm language. This would start to help you see all the variables involved that you are trying to piecemeal or oversimplify. The issues are exceedingly complex.

Oh they are, I have no doubt about it.
sr. member
Activity: 420
Merit: 262
wait a sec shelby. shouldn't this be the job of the US non-accredited investors to decide what they can or can't do.

Originally I supported this perspective, which also seems to be supported by the actual history of the blue-sky laws in the USA:

http://digitalcommons.law.yale.edu/cgi/viewcontent.cgi?article=2679&context=fss_papers

What changed my mind is the altcoin arena has turned totally away from any sane decentralized designs to the proof-of-shit/stake designs which are really just MLM investment scams in disguise that prey on the gambling instinct in humans.

It is regressing the progress of our technologies, not advancing them.


More abstractly, you are essentially arguing that every member of society is an island and can protect him/herself from every threat alone.

Shouldn't it be the job of every citizen of a nation to defend themselves against a nuclear bomb using their handgun?

The reason the USA protects the little man from his own gambling instinct is because it is known to be an impoverishing addiction that leads to deleterious outcomes for society. Whether or not it entirely came about for that reason, the public seems to support the regulation of gambling.

My more abstract generative essence statement on regulation is as follows. Any market which is not self-regulating, i.e. which is dysfunctional and not behaving as a free market, will end up regulated by special interests. My definition of "free market" is a decentralized market. So the reason the altcoin market is not decentralized and is dysfunctional is asymmetric information. The speculators entirely lack the ability to understand the technobabble. They entirely rely on a few "experts" to guide them. And thus the market does not function. It regresses.



You need a Nash equilibrium for security aspects, but you also need one for adoption.  If someone premines the entire coin and has the market cornered by design on day one, there's zero incentive for anyone to adopt it when the issuer and his cronies have such advantage over everyone else.  You're just replicating central banking except the value of central banking units is derived through coercion, and without it, it would have no value.

Which is precisely why it can't become a "value transfer system" where that is supposed to mean a currency and unit-of-exchange, and not a transfer of value from gamblers and fools to snakeoiltechnocoolbabble salesmen.
sr. member
Activity: 420
Merit: 262
TPTB, I must say, I love when someone challenges you and you spill out all this good stuff.

I intend to pull all that deep theoretical stuff together one day more coherently and also delve more into CoinCube's interesting theoretical explorations.

Just been lacking available energy due to my illness (which appears to be on the mend), and now off on the tangent of deep coding and computer science research and practicalities.

I need to go back to sleep now, because the cure is asking for so much sleep.

My posts will be sort of jumbled and all over the place due to this chaotic state of my life, energy, falling asleep at any time, waking for an hour, sleepy again, brownouts interrupting my work or sleep (difficult to sleep without aircon here), etc.. In and out of consciousness. Sometimes I am not sure if I was awake or dreaming when I wrote something.
sr. member
Activity: 420
Merit: 262
Elegance and comprehensibility via holistic unification of design concepts...

Add more elegant syntax and less noisy implementation of polymorphism, first-class functions, etc...

One of the big issues these days is asynchronous programming. The following delve into this issue:

[...]

I am absolutely astonished they undid making Rust a tasklet-based runtime, i.e. like Stackless Python, with a language-native coroutine M:N threading model, especially with the excellent actor-based concurrency model already built in there, which should have been perfect. Originally their i/o was based on libuv, probably the leading portable asynchronous i/o library written in C, so all their i/o was also async. A full fat coroutine + all-async-i/o new systems programming language ticking every box for a C++ replacement - Rust v0.12 sounds great, doesn't it?

Unfortunately they ended up canning the greenlet support because theirs were slower than kernel threads which in turn demonstrates someone didn’t understand how to get a language compiler to generate stackless coroutines effectively (not surprising, the number of engineers wired the right way is not many in this world, but see http://www.reddit.com/r/rust/comments/2l0a4b/do_rust_web_servers_use_libuv_through_libgreen_or/ for more detail). And they canned the async i/o because libuv is “slow” (which it is only because it is single threaded only, plus forces a malloc + free per async operation as the buffers must last until completion occurs, plus it enforces a penalty over synchronous i/o see http://blog.kazuhooku.com/2014/09/the-reasons-why-i-stopped-using-libuv.html), which was a real shame - they should have taken the opportunity to replace libuv with something better (hint: ASIO + AFIO, and yes I know they are both C++, but Rust could do with much better C++ interop than the presently none it currently has) instead of canning always-async-everything in what could have been an amazing step up from C++ with most of the benefits of Erlang without the disadvantages of Erlang.

A huge missed opportunity I think, and sadly it looks like the ship has sailed on those decisions :(, and both untick pretty major boxes for some people including me. As it is a personal goal of mine to see AFIO become the asynchronous filesystem and file i/o implementation entering the C++ standard library to complement ASIO as the C++ Networking library entering in C++ 17, I can’t complain too much I suppose, it’s just it renders Rust as less likely to make C++ obsolete/encourage C++ to up its game.

[...]

 
+Steve Klabnik Oh for sure. As it happens, I'm currently in a week long private discussion about how best to tweak C++/LLVM to ideally support stackless coroutines via a new "safe C++" modifier which has the compiler enforce safer and much more coroutine efficient programming idioms on functions and namespaces so marked up (indeed such safe C++ would substantially close the gap with Rust and Go I think). So I absolutely understand the difficulties in not just deciding on a design, but persuading people - most of whom are not async wired - that implementing the design is a good idea. There is also a big problem that we need reference compiler implementations to demonstrate the design works, and that means raising at least $40k to fund the requisite developers.

No, my objection to dropping async i/o in Rust was more that there is nothing wrong with libuv, it's just it's slow. Slow is 100% fine for a v1.0 language release, so long as you're faster than Python it isn't important. I'd prefer that all the people who jump onboard with 1.0 code against the right kind of i/o library API with the right kind of semantics, and we'll fix up performance over the coming years. Moreover, my day job's main code base is currently wasting a chunk of time dealing with Rust's incomplete i/o facilities, and for us at least a slow but fully async i/o library baked into the language would be far better than currently wrestling with pushing networking apis into threads just to work around blocking issues and bugs. mio is a non starter for us, as are most of the other async i/o frameworks for Rust, because we need Windows and besides we don't want to get locked into an expensive to replace library which may get orphaned.

Anyway, I'm sure you all had the same discussions when you decided to drop built in libuv, I guess coming from an async background I like async. For many if not most programmers async just isn't important, and it's an easy drop given the average cost benefit.

[...]

 
+Niall Douglas I think that 'slow' wasn't the full objection here, exactly. Let me lay out a slightly fuller history, slightly modified from a HN comment:

------------

In the beginning, Rust had only green threads. Eventually, it was decided that a systems language without systems threads is... strange. So we needed to add them. Why not add choice? Since the interfaces could be the same, why not abstract over them, and you could just choose which one you wanted?

At the same time, the problems with green threads by default were becoming issues. Segmented stacks cause slow C interop. You need a runtime to manage them, etc. Furthermore, the overall abstraction was causing an unacceptable cost. The green threads weren't very green. Plus, with the need to actually release someday looming, decisions needed to be made regarding tradeoffs. And since Rust is supposed to be a systems language, having 1:1 threads and basically no runtime makes more sense than N:M threads and a runtime. So libgreen was removed, and the interface was re-done to be 1:1-thread centric.

[...]

+Steve Klabnik Regarding green threads - which aren't necessarily stackless coroutines - yes, a new userspace implementation is highly unlikely to beat the kernel, which has had years of tuning. This is one of the big objections to Microsoft's resumable functions proposal before WG21; some of us think it too heavy and not generically useful outside its narrowly defined use case.

Stackless coroutines are a bit different though, because they require no C stack at all. The way we're thinking about them for C++ is that the compiler will emit, for each compiled function, its maximum possible stack frame size with stack frame constructor and destructor. It also prevents you writing code which causes uncalculatable stack consumption, so that's the "safe C++" part (interestingly, via those same safeguards we can also guarantee some function can never fail unexpectedly which is great for high reliability ultra low latency scenarios, but that's an aside). To call that function, one simply constructs its stack at some arbitrary location the callee asks for (could be static mem, could be malloc, could be the C stack), and calls the function setting the stack frame register to the new stack frame.

One can now pause and resume the execution of that function with complete knowledge of what context needs to be saved and restored. What you get looks like stackful coroutines, but context switching is optimal and no actual stack is needed except if you call the C library, which can't be a resumption point. The price is that the programmer can't do some things, and can only call inline-defined C++ functions or other safe C++ functions or the C library, and alloca with a dynamically calculated value is obviously verboten.

Anyway, my point is that you can modify the language to make stackless coroutines efficient, and I do wish Rust had done that for 1.0. But I entirely accept that stuff must be cut to reach a 1.0 release, else you'd be at it forever. Thanks for the extra info though.

[...]

+Niall Douglas yeah, absolutely: there are different ways to build them. That's just about the particular way they had been built at the time, which is why they had to go. :) https://github.com/rustcc/coroutine-rs also popped up recently, though I haven't checked the implementation.


Coroutines are computer program components that generalize subroutines for nonpreemptive multitasking, by allowing multiple entry points for suspending and resuming execution at certain locations. Coroutines are well-suited for implementing more familiar program components such as cooperative tasks, exceptions, event loop, iterators, infinite lists and pipes.

Comparison with subroutines

When subroutines are invoked, execution begins at the start, and once a subroutine exits, it is finished; an instance of a subroutine only returns once, and does not hold state between invocations. By contrast, coroutines can exit by calling other coroutines, which may later return to the point where they were invoked in the original coroutine; from the coroutine's point of view, it is not exiting but calling another coroutine. Thus, a coroutine instance holds state, and varies between invocations; there can be multiple instances of a given coroutine at once. The difference between calling another coroutine by means of "yielding" to it and simply calling another routine (which then, also, would return to the original point), is that the latter is entered in the same continuous manner as the former. The relation between two coroutines which yield to each other is not that of caller-callee, but instead symmetric.

Any subroutine can be translated to a coroutine which does not call yield.

To implement a programming language with subroutines requires only a single stack that can be preallocated at the start of program execution. By contrast, coroutines, able to call on other coroutines as peers, are best implemented using continuations. Continuations may require allocation of additional stacks, and therefore are more commonly implemented in garbage-collected high-level languages. Coroutine creation can be done cheaply by preallocating stacks or caching previously allocated stacks.

Comparison with generators

Generators, also known as semicoroutines, are also a generalisation of subroutines, but are more limited than coroutines. Specifically, while both of these can yield multiple times, suspending their execution and allowing re-entry at multiple entry points, they differ in that coroutines can control where execution continues after they yield, while generators cannot, instead transferring control back to the generator's caller. That is, since generators are primarily used to simplify the writing of iterators, the yield statement in a generator does not specify a coroutine to jump to, but rather passes a value back to a parent routine.

However, it is still possible to implement coroutines on top of a generator facility, with the aid of a top-level dispatcher routine (a trampoline, essentially) that passes control explicitly to child generators identified by tokens passed back from the generators.

Comparison with mutual recursion

Using coroutines for state machines or concurrency is similar to using mutual recursion with tail calls, as in both cases the control changes to a different one of a set of routines. However, coroutines are more flexible and generally more efficient. Since coroutines yield rather than return, and then resume execution rather than restarting from the beginning, they are able to hold state, both variables (as in a closure) and execution point, and yields are not limited to being in tail position; mutually recursive subroutines must either use shared variables or pass state as parameters. Further, each mutually recursive call of a subroutine requires a new stack frame (unless tail call elimination is implemented), while passing control between coroutines uses the existing contexts and can be implemented simply by a jump.

Common uses

*    State machines within a single subroutine, where the state is determined by the current entry/exit point of the procedure; this can result in more readable code compared to use of goto, and may also be implemented via mutual recursion with tail calls.
*    Actor model of concurrency, for instance in video games. Each actor has its own procedures (this again logically separates the code), but they voluntarily give up control to a central scheduler, which executes them sequentially (this is a form of cooperative multitasking).
*    Generators, and these are useful for streams – particularly input/output – and for generic traversal of data structures.
*    Communicating sequential processes where each sub-process is a coroutine. Channel inputs/outputs and blocking operations yield coroutines and a scheduler unblocks them on completion events.


stackfulness

In contrast to a stackless coroutine a stackful coroutine can be suspended from within a nested stackframe. Execution resumes at exactly the same point in the code where it was suspended before. With a stackless coroutine, only the top-level routine may be suspended. Any routine called by that top-level routine may not itself suspend. This prohibits providing suspend/resume operations in routines within a general-purpose library.

first-class continuation

A first-class continuation can be passed as an argument, returned by a function and stored in a data structure to be used later. In some implementations (for instance C# yield) the continuation can not be directly accessed or directly manipulated.

Without stackfulness and first-class semantics, some useful execution control flows cannot be supported (for instance cooperative multitasking or checkpointing).

[...]

In general, a stackful coroutine is more powerful than a stackless one. So why do we want stackless coroutines? Short answer: efficiency.

A stackful coroutine typically needs to allocate a certain amount of memory to accommodate its runtime stack (which must be large enough), and the context switch is more expensive compared to the stackless one; e.g. Boost.Coroutine takes 40 cycles while CO2 takes just 7 cycles on average on my machine, because the only thing a stackless coroutine needs to restore is the program counter.

That said, with language support, a stackful coroutine could probably also take advantage of a compiler-computed maximum stack size, as long as there's no recursion in the coroutine, so the memory usage can also be improved.

Speaking of stackless coroutines, bear in mind that "stackless" doesn't mean there's no runtime stack at all; it only means the coroutine uses the same runtime stack as the host side, so you can call recursive functions as well - it's just that all the recursion happens on the host's runtime stack. In contrast, with a stackful coroutine, when you call recursive functions the recursion happens on the coroutine's own stack.


C#'s await/async is a compiler feature which rewrites the method into a special finite-state-machine class. All of the method's local variables are automatically moved into fields of that class, and the method's code is moved into a special class method which, depending on the current state, just jumps via a switch to the right await position (http://www.codeproject.com/Articles/535635/Async-Await-and-the-Generated-StateMachine ). That is somewhat similar to stackless coroutines.

In C++, stackless coroutines can be implemented in under 100 lines of code; for example, look at Boost.Asio. Of course the syntactic sugar is less sweet - local variables must be moved into class fields by hand, but everything else is similar in shape and content. For instance, the state machine is generated automatically (with a similar switch inside) by macros. (See the talk by Christopher M. Kohlhoff http://blip.tv/boostcon/why-c-0x-is-the-awesomest-language-for-network-programming-5368225 , code example - https://github.com/chriskohlhoff/awesome/blob/master/server.cpp ).
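The same switch-based trick works in plain C. A minimal sketch of my own (an illustration of the technique, not Boost.Asio's actual implementation) - note how the "local" counter must live in a struct field, exactly like C#'s rewritten locals above:

Code:
#include <stdio.h>

/* Persistent coroutine state: the resume point plus any "locals" that
   must survive suspension (they cannot live on the C stack). */
typedef struct { int state; int i; } counter_co;

/* Yields 0,1,...,limit-1, one value per call; returns -1 when finished. */
int counter_next(counter_co *c, int limit)
{
  switch (c->state) {
  case 0:
    for (c->i = 0; c->i < limit; c->i++) {
      c->state = 1;
      return c->i;  /* suspend */
  case 1:;          /* resume here: falls through to the loop increment */
    }
  }
  return -1;        /* done */
}

int main(void)
{
  counter_co c = {0, 0};
  int v;
  while ((v = counter_next(&c, 5)) != -1)
    printf("%d\n", v);
  return 0;
}

Because only state and i are saved, suspension is nearly free - but nothing called by counter_next can itself suspend, which is exactly the stackless limitation described above.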

Boost.Coroutine library provides stackful coroutines for C++ on wide range of platforms.
Stackful coroutines are much more powerful than stackless ones:

___________________
Major advantage #1:

Stackful coroutines allow encapsulating all asynchronous logic inside components, in such a way that client code looks EXACTLY THE SAME as SYNCHRONOUS code.
For example, Python's gevent library does monkey patching of standard sockets, which automatically turns all code that uses them "asynchronous", without any changes (http://www.gevent.org/intro.html#monkey-patching ).

Another demo can be found in Boost.Coroutine: it has an example of a simple server which asynchronously reads data from a TCP port and prints it on screen, all in one thread. The reading loop looks exactly the same as "normal" blocking code.

___________________
Advantage #2, Performance:

When you have chained levels of awaits - special code is executed at each of them - one level awaits another, and so on. It is O(N).

But in the case of stackful coroutines, no matter how long a chain you have, each level "implicitly" awaits the bottom one - that is O(1). Among other things this is extremely fast: http://www.boost.org/doc/libs/1_54_0/libs/coroutine/doc/html/coroutine/performance.html

___________________
Advantage #3, the Killer:

Exactly the same await syntax can be emulated with the help of stackful coroutines.
I have made a small proof-of-concept: https://github.com/panaseleus/await_emu


I am thinking about all these issues w.r.t. Rust and, for example, how to implement the await/async emulation I coded for ECMAScript 6:

Code: (asyncify.js)
'use strict';
/*
Inputs a generator function which must 'yield' only Promises, and may
optionally return a value. The return value may optionally be a Promise.

Returns a function which inputs the arguments of the generator function, and
returns a Promise which resolves when the generator function is done.
The return value of the generator function—or undefined—is the resolved value.

The optional '_this' input sets the value of 'this' for the generator function.

Inspiration from and wanting it to work without transpile in ES6:
  https://thomashunter.name/blog/the-long-road-to-asyncawait-in-javascript/
  http://jlongster.com/A-Study-on-Solving-Callbacks-with-JavaScript-Generators#Async-Solution--2--P
Wanted a solution built on correct primitives of Promises & generators (e.g. not node-fibers):
  https://github.com/yortus/asyncawait#2-featuregotcha-summary
  http://howtonode.org/generators-vs-fibers
  https://blog.domenic.me/youre-missing-the-point-of-promises/
*/
function asyncify(func, _this) {
  return function() {                             // inputs 'arguments'
    return new Promise((resolve, reject) => {     // inputs functions to invoke to resolve and reject the returned Promise
      // Function that iterates (via recursion) each Promise 'yield'ed then return value (of a generator function 'gen')
      function f(gen, previous/*= undefined*/) {
        const {value, done} = gen.next(previous)  // employ ES6 to destructure the returned named tuple into new const vars, https://davidwalsh.name/es6-generators
        if (done)
          // 'value' is the return value or undefined if none,
          // https://davidwalsh.name/es6-generators
          resolve(value)
        else
          // Assume the returned 'value' is a Promise; so
          // recurse our iteration function when the Promise resolves
          value.then(_ => f(gen, _)).catch(_ => reject(_))
      }
      f(func.apply(_this, arguments))             // iterate the input 'func' function
    })
  }
}


Code: (AsyncQueue.js)
'use strict';
/*
Queue of Promises resolved in FIFO order, that enables chaining asynchronous
events which operate on a value.

Pushing to the end of queue returns a Promise which resolves to the value set
by the last shift off the front of the queue, or to the initialized value in
case no prior shifts occurred.

Shifting off the start of the queue inputs the value to resolve the next queued
Promise (which is saved to the initialized value if no next Promise is queued).
*/
object('currentScript').then(_ => fetch[(_ ? _() : document.getElementsByTagName('script').last())['id']]( // register module output with fetch.js
  function(value) {
    var fifo = []
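    // push(): queue a resolver and return its Promise; if the queue was
    // empty, resolve immediately with the currently stored value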
    this.push = () => new Promise(_ => {fifo.push(_); if (fifo.length == 1) _(value)})
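    // shift(): drop the front resolver, then hand the new value to the next
    // queued Promise, or store it if nothing is waiting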
    this.shift = _ => {fifo.shift(); if (fifo.length) {fifo[0](_)} else {value = _}}
  }
))
sr. member
Activity: 420
Merit: 262
Watch the video.

"Look, we are like c++ but much safer"... but as he points out at some point, well, if c++ evolves they might go bust. I mean what's your selling point? That your compiler notifies you? And is there anything that prevents a compiler software of c++ to notify the user that what he is doing is unsafe?

Elegance and comprehensibility via holistic unification of design concepts. You basically have to know the C++ compiler source code now to know what it will do. The 1000+ pages of specification is a clusterfuck.

Add more elegant syntax and less noisy implementation of polymorphism, first-class functions, etc.. Then fix corner cases (e.g. 'const') where the C++ compiler can't give you the correct warning nor (as explained in the video) enforce the programmer's intended semantics at the LLVM optimization layer.

AlexGR, I think you would be well served by taking a compiler design course and implementing both a low-level imperative-paradigm language and a high-level functional-programming-paradigm language. This would start to help you see all the variables involved that you are trying to piecemeal or oversimplify. The issues are exceedingly complex.
sr. member
Activity: 420
Merit: 262
I was thinking synereo was possibly meant

Same answer though. I don't even really know what that is, beyond some vague thing about social media. No idea how it was launched, what it does, etc. Never looked at it.

A competing social network for maskcoin or jambox or w/e he's calling it now. He's trying to imply that you and Shelby intentionally gang up together on things, maybe?

I don't know, but it seems like you broke the fella, so I guess we probably won't know.

I've read most of the 50+ page Synereo white paper, expended several hours viewing some of their YouTube Hangouts, done some limited discussion with their founding developer (username here Elokane), and posted in every recent Synereo thread in Altcoin Discussion.

Synereo was launched as a vaporware ICO and the math whiz on the project is Greg Meredith who is into process calculus research and was one of key persons apparently on Microsoft's BizTalk design. Greg is into using Scala and also is collaborating on the math modeling of Ethereum's upcoming, promised Casper design (which btw several of us, excluding smooth, have criticized in the Ethereum Paradox thread for its fundamental insoluble flaws).

I have pointed out that there are numerous P2P (aka distributed) social networking projects, so the chance of Synereo being the first and able to sweep the world is very slim, especially as they have no compelling features afaics. Thus I have criticized them for preselling tokens ("AMPS") with no adoption, on hype alone. Their major claimed innovative feature is an "Attention Model" composed of reputation ("Reo") and a countervailing force of being able to pay to override reputation with the AMPS tokens. In other words, they aim to make the content that users share more relevant. I had pointed out that the Reo needs to be fine-grained on, for example, #hashtags, and Elokane indicated that although that is not in the white paper they are implementing something like it, yet there is no holistic public specification afaik. They are claiming to be very close to beta, but I've pointed out that doesn't mean they are anywhere near adoption. I have also pointed out that Facebook users don't seem to have major complaints about the relevance of shared content on their feeds, thus I doubt anyone will adopt Synereo (their friends won't be there, there will be much less content sharing, and other chicken-and-egg dilemmas).

Also I have pointed out that, given the economics of advertising, the most someone could expect to earn by being paid to share (the AMPS model) is perhaps about $1 (in the developing world) to $10 (first world) per day, and probably not even that much. It simply isn't worth anyone's time. People don't join social networks to be paid some paltry income. They join for other more important reasons. Thus I've argued the economic model for the AMPS is fundamentally flawed.

Thus I have argued they are preselling shit for which there is no market.

Also I don't really understand the process calculus well enough to know if it is technobabble bullshit or not, but it sure looks like it to me. It looks like ivory tower shit that has no real implications in the real world. What did BizTalk do that was relevant? I did a Google search and it seems basically no one used it? Excuse me for being skeptical but the selling of ICOs is becoming too lucrative and attractive for every Joe who has some technobabble to make n00bs drool.

Smooth is not involved in my JAMBOX project at all. I occasionally trade ideas with him about technology. My JAMBOX project, when it is crowdsourced (not for tokens, just for T-shirts!), will explain that it targets compelling features and economics. I have not yet announced that, because for one thing at the moment I am working on potentially creating a new programming language on top of Rust, or perhaps contributing to Rust. JAMBOX is based on the concept of empowering mobile apps, so I need to be sure the language we are using is the best in several ways, one of which is JIT compilation.

I don't hate Synereo's people. I just wish they hadn't done a vaporware ICO, both for the legal reason that selling unregistered investment securities to non-accredited USA investors is apparently in violation of securities law as provided for by the Supreme Court's Howey test, and simply because selling vaporware is the antithesis of the objective ethics (i.e. no zero-sum games) of meritocratic software development.
legendary
Activity: 1708
Merit: 1049
A better compiler certainly could do better with sqrt() in some cases, especially with the flag I mentioned (and even without, given sufficient global analysis, but as I said how much of that to do is somewhat of a judgement call), but I'm just pointing out that the program you fed it was not as simple as it appeared, in terms of what you were asking for.

I'm pretty sure it would choke even if I asked it to do
b=b+1 or b*1
bb=bb+1...
bbb=bbb+1...
bbbb=bbbb+1...

Maybe I'll try it out...

I made a variant of the program that does 100mn loops of divisions... It was finishing too fast, so I put each division in 5 times per loop.

   b=b/g;  //g=2 so it halves every time
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;

The Pascal compiler's binary was awfully slow... in such an arrangement it took ~7s, while C at -O2/-O3 -march=nocona was at 1500ms.

When I rearranged them as

   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;

...the pascal compiler took the hint that it could do b,bb,bbb,bbbb together, and dropped down to 6 secs.

GCC on the other hand was smart enough to understand that each line was not very dependent on the other so it got on with the job - although it still didn't use PACKED sse (=true SIMD), only SCALAR (SISD).

I then tried to multiply the result instead of dividing it (multiply by 1/g, which is 0.5, so it's the same as /2). Multiplications are often way faster than divisions. Pascal went down to ~4s that way. It means their compiler sucks because that should be automated anyway - I shouldn't have to manually convert divs to multiplications to get the job done faster. It's kind of elemental. Anyway GCC with C was unaffected. It was already converting the divisions to multiplications at -O2/-O3 levels. Only at -O0 it was around 5-6secs.

I then hardwired asm into pascal. Initially scalar multiplications and then packed multiplications - all SSE. Scalar took me down to 3.4s, while packed (actual SIMD use) took me to 2.6s.

Final code was like:

Code:
for i:= 1 to 100000000 do //100mn loop

   begin;

   asm   // THE PACKED WAY / SIMD doing 20 multiplications in 10 instructions= 2680ms

     movlpd xmm1, g      //the multiplier (value of 0.5) is loaded in xmm1 lower space
     movhpd xmm1, g      //the multiplier (value of 0.5) is loaded in xmm1 higher space
     movlpd xmm2, b      //b is loaded in xmm2 lower space
     movhpd xmm2, bb     //bb is loaded in xmm2 higher space
     movlpd xmm3, bbb    //bbb is loaded in xmm3 lower space
     movhpd xmm3, bbbb   //bbbb is loaded in xmm3 higher space
     MULPD xmm2, xmm1    //multiply b and bb residing on xmm2 with the multiplier of 0.5 that resides in xmm1
     MULPD xmm3, xmm1    //multiply bbb and bbbb residing on xmm3 with the multiplier of 0.5 that resides in xmm1
     MULPD xmm2, xmm1    //round 2
     MULPD xmm3, xmm1    //round 2
     MULPD xmm2, xmm1    //round 3
     MULPD xmm3, xmm1    //round 3
     MULPD xmm2, xmm1    //round 4
     MULPD xmm3, xmm1    //round 4
     MULPD xmm2, xmm1    //round 5
     MULPD xmm3, xmm1    //round 5
     movlpd b, xmm2      //returning results of b, from the lower part of xmm2, back to pascal's b variable
     movhpd bb, xmm2     //returning results of bb, from the higher part of xmm2, back to pascal's bb variable
     movlpd bbb, xmm3    //returning results of bbb, from the lower part of xmm3, back to pascal's bbb variable
     movhpd bbbb, xmm3   //returning results of bbbb, from the higher part of xmm3, back to pascal's bbbb variable

     end;

Most of the pascal delays that are taking it up to 2.6s are not related to my code. The loop itself doing zero calculations costs 1.4s by itself, so there is definitely overhead there.

Anyway I went back to gcc and c to see what it's doing.

At -O3 it was generating MULSD (sse scalar multiplier / SISD fashion):

The 20 divisions had been converted to 20 separate scalar multiplying SSE instructions. So Single Instruction Single Data. Again the compiler fails to pack the data and do them in batches. It's using 20 instructions where it could use 10.

Code:
Disassembly of section .text:

00000000004005a0 <main>:
  4005a0: 53                   push   %rbx
  4005a1: 48 83 ec 20           sub    $0x20,%rsp
  4005a5: bf 0a 00 00 00       mov    $0xa,%edi
  4005aa: e8 a1 ff ff ff       callq  400550
  4005af: e8 ac ff ff ff       callq  400560
  4005b4: 48 89 c3             mov    %rax,%rbx
  4005b7: f2 0f 10 15 59 03 00 movsd  0x359(%rip),%xmm2        # 400918 <_IO_stdin_used+0x48>
  4005be: 00
  4005bf: f2 0f 10 05 59 03 00 movsd  0x359(%rip),%xmm0        # 400920 <_IO_stdin_used+0x50>
  4005c6: 00
  4005c7: f2 0f 10 1d 59 03 00 movsd  0x359(%rip),%xmm3        # 400928 <_IO_stdin_used+0x58>
  4005ce: 00
  4005cf: f2 0f 10 25 59 03 00 movsd  0x359(%rip),%xmm4        # 400930 <_IO_stdin_used+0x60>
  4005d6: 00
  4005d7: 31 c0                 xor    %eax,%eax
  4005d9: f2 0f 10 0d 57 03 00 movsd  0x357(%rip),%xmm1        # 400938 <_IO_stdin_used+0x68>
  4005e0: 00
  4005e1: f2 0f 10 2d 57 03 00 movsd  0x357(%rip),%xmm5        # 400940 <_IO_stdin_used+0x70>
  4005e8: 00
  4005e9: f2 44 0f 10 0d 56 03 movsd  0x356(%rip),%xmm9        # 400948 <_IO_stdin_used+0x78>
  4005f0: 00 00
  4005f2: f2 44 0f 10 05 55 03 movsd  0x355(%rip),%xmm8        # 400950 <_IO_stdin_used+0x80>
  4005f9: 00 00
  4005fb: f2 0f 10 3d 55 03 00 movsd  0x355(%rip),%xmm7        # 400958 <_IO_stdin_used+0x88>
  400602: 00
**400603: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  400607: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  40060b: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  40060f: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  400613: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  400617: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  40061b: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  40061f: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  400623: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  400627: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  40062b: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  40062f: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  400633: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  400637: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  40063b: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  40063f: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  400643: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  400647: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  40064b: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  40064f: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  400653: 66 0f 2e ec           ucomisd %xmm4,%xmm5
  400657: 76 11                 jbe    40066a
  400659: 66 0f ef f6           pxor   %xmm6,%xmm6
  40065d: f2 0f 2a f0           cvtsi2sd %eax,%xmm6
  400661: f2 0f 58 e6           addsd  %xmm6,%xmm4
  400665: f2 41 0f 58 e1       addsd  %xmm9,%xmm4
  40066a: 66 0f 2e eb           ucomisd %xmm3,%xmm5
  40066e: 76 11                 jbe    400681
  400670: 66 0f ef f6           pxor   %xmm6,%xmm6
  400674: f2 0f 2a f0           cvtsi2sd %eax,%xmm6
  400678: f2 0f 58 de           addsd  %xmm6,%xmm3
  40067c: f2 41 0f 58 d8       addsd  %xmm8,%xmm3
  400681: 66 0f 2e ea           ucomisd %xmm2,%xmm5
  400685: 76 10                 jbe    400697
  400687: 66 0f ef f6           pxor   %xmm6,%xmm6
  40068b: f2 0f 2a f0           cvtsi2sd %eax,%xmm6
  40068f: f2 0f 58 d6           addsd  %xmm6,%xmm2
  400693: f2 0f 58 d7           addsd  %xmm7,%xmm2
  400697: 83 c0 01             add    $0x1,%eax
  40069a: 3d 00 e1 f5 05       cmp    $0x5f5e100,%eax
  40069f: 0f 85 5e ff ff ff     jne    400603
  4006a5: f2 0f 11 44 24 18     movsd  %xmm0,0x18(%rsp)
  4006ab: f2 0f 11 54 24 10     movsd  %xmm2,0x10(%rsp)
  4006b1: f2 0f 11 5c 24 08     movsd  %xmm3,0x8(%rsp)
  4006b7: f2 0f 11 24 24       movsd  %xmm4,(%rsp)
  4006bc: e8 9f fe ff ff       callq  400560
  4006c1: 48 29 d8             sub    %rbx,%rax
  4006c4: 66 0f ef c9           pxor   %xmm1,%xmm1
  4006c8: f2 48 0f 2a c8       cvtsi2sd %rax,%xmm1
  4006cd: f2 0f 5e 0d 8b 02 00 divsd  0x28b(%rip),%xmm1        # 400960 <_IO_stdin_used+0x90>
  4006d4: 00
  4006d5: f2 0f 59 0d 8b 02 00 mulsd  0x28b(%rip),%xmm1        # 400968 <_IO_stdin_used+0x98>
  4006dc: 00
  4006dd: 66 48 0f 7e cb       movq   %xmm1,%rbx
  4006e2: f2 0f 10 24 24       movsd  (%rsp),%xmm4
  4006e7: f2 0f 10 5c 24 08     movsd  0x8(%rsp),%xmm3
  4006ed: f2 0f 58 e3           addsd  %xmm3,%xmm4
  4006f1: f2 0f 10 44 24 18     movsd  0x18(%rsp),%xmm0
  4006f7: f2 0f 58 c4           addsd  %xmm4,%xmm0
  4006fb: f2 0f 10 54 24 10     movsd  0x10(%rsp),%xmm2
  400701: f2 0f 58 c2           addsd  %xmm2,%xmm0
  400705: bf d4 08 40 00       mov    $0x4008d4,%edi
  40070a: b8 01 00 00 00       mov    $0x1,%eax
  40070f: e8 5c fe ff ff       callq  400570
  400714: 66 48 0f 6e c3       movq   %rbx,%xmm0
  400719: bf ea 08 40 00       mov    $0x4008ea,%edi
  40071e: b8 01 00 00 00       mov    $0x1,%eax
  400723: e8 48 fe ff ff       callq  400570
  400728: f2 0f 10 05 40 02 00 movsd  0x240(%rip),%xmm0        # 400970 <_IO_stdin_used+0xa0>
  40072f: 00
  400730: 66 48 0f 6e fb       movq   %rbx,%xmm7
  400735: f2 0f 5e c7           divsd  %xmm7,%xmm0
  400739: bf 05 09 40 00       mov    $0x400905,%edi
  40073e: b8 01 00 00 00       mov    $0x1,%eax
  400743: e8 28 fe ff ff       callq  400570
  400748: 31 c0                 xor    %eax,%eax
  40074a: 48 83 c4 20           add    $0x20,%rsp
  40074e: 5b                   pop    %rbx
  40074f: c3                   retq  

-Ofast is the first level where packed instructions start making their appearance (time 1.1s), but they are coupled with a few extra unsafe-math flags for semi-intentional loss of accuracy, and that's problematic. The disassembly at that level shows some scalar muls and a lot of packed additions and packed moves. For some reason it's breaking the divisions down not into 10 packed multiplications but ~5 scalar ones plus a lot of extra (packed) additions.

Bottom line: Everyone seems to have a lot to do to get the best out of our hardware. The freepascal compiler is lacking the elementary logic of converting divs into muls and has several slow parts.

As for C... the SSE stuff has been there for like 15 years. LOL. When are they gonna (properly*) use it? And how about AVX, AVX2, etc? Should we wait till 2100? I bet they'll claim "we are taking advantage of AVX" while doing scalar stuff (SISD) there too - wasting the 256-bit / 512-bit width.

* One could argue that they are using SSE right now, but it's not that useful without exploiting the SIMD capability.


Quote

I did... the impression I get with all new language projects is that those with high targets often aim to be the next c/c++. The references of the speaker to c++ and how it's very similar (but safer) in many ways confirm that this is what they have in the back of their mind. "Look, we are like c++ but much safer"... but as he points out at some point, well, if c++ evolves they might go bust. I mean, what's your selling point? That your compiler notifies you? And is there anything that prevents a c++ compiler from notifying the user that what he is doing is unsafe? If they wanted, they could issue a warning or even block compilation altogether on suspected unsafeness. It's doable. It's not a language issue, it's a compiler issue. There could be a compiler flag in c or c++, like --only-allow-safe-code, and suddenly you'd get 100 warnings on how to change your code or it won't compile.
sr. member
Activity: 420
Merit: 262
The reason that programmers employ linked lists is precisely the reason we shouldn't ever use them. And analogously why we shouldn't use pointers. And thus why Rust is a major improvement because it allows us to avoid Java/JVM's "always boxed" design.

Stuff like integers overflowing, variables not having the precision they need, programs crashing due to idiotic stuff like divisions by zero, having buffers overflow etc etc - things like that are ridiculous and shouldn't even exist.

Impossible, unless you want to forsake performance and degrees-of-freedom. There is a tension here; infinite capability would require 0 performance and 0 degrees-of-freedom.

Sorry, some of the details of programming that you wish would disappear can't.


Watch the video.
legendary
Activity: 1708
Merit: 1049
A better compiler certainly could do better with sqrt() in some cases, especially with the flag I mentioned (and even without, given sufficient global analysis, but as I said how much of that to do is somewhat of a judgement call), but I'm just pointing out that the program you fed it was not as simple as it appeared, in terms of what you were asking for.

I'm pretty sure it would choke even if I asked it to do
b=b+1 or b*1
bb=bb+1...
bbb=bbb+1...
bbbb=bbbb+1...

Maybe I'll try it out because since yesterday I'm trying something else with no success:

Quote
I'm not sure the deal with Pascal, I never use it.

1) I like the Turbo-pascal-like IDE of Free Pascal in the terminal. It's very productive to me - although I'm not producing much of anything :D
2) I like the structure, syntax, simplicity and power.
3) See for example how I embedded ASM with my preferred syntax (intel, instead of the more complex at&t). See the elegance. See the interactivity with the program variables without breaking my balls about anything.

I just dropped in a few lines as a replacement:

Code:
asm
     movlpd xmm1, b
     movhpd xmm1, bb
     SQRTPD xmm1, xmm1
     movlpd xmm2, bbb
     movhpd xmm2, bbbb
     SQRTPD xmm2, xmm2
     movlpd b, xmm1
     movhpd bb, xmm1
     movlpd bbb, xmm2
     movhpd bbbb, xmm2
 end;

...and IT WORKED. Like a boss.

Now, trying to do the same since yesterday with c:

Code:
//This replaces the c sqrts

     asm("movlpd xmm1, b");      
     asm("movhpd xmm1, bb");    
     asm("SQRTPD xmm1, xmm1");  
     asm("movlpd xmm2, bbb");
     asm("movhpd xmm2, bbbb");
     asm("SQRTPD xmm2, xmm2");
     asm("movlpd b, xmm1");
     asm("movhpd bb, xmm1");
     asm("movlpd bbb, xmm2");
     asm("movhpd bbbb, xmm2");

...and the result is:

gcc Math3asm.c -lm -masm=intel
/tmp/ccNTa80M.o: In function `main':

Math3asm.c:(.text+0x4f): undefined reference to `b'
Math3asm.c:(.text+0x5c): undefined reference to `bb'
Math3asm.c:(.text+0x69): undefined reference to `bbb'
Math3asm.c:(.text+0x76): undefined reference to `bbbb'
Math3asm.c:(.text+0x91): undefined reference to `b'
Math3asm.c:(.text+0x9a): undefined reference to `bb'
Math3asm.c:(.text+0xa7): undefined reference to `bbb'
Math3asm.c:(.text+0xb0): undefined reference to `bbbb'
Math3asm.c:(.text+0xbd): undefined reference to `b'
Math3asm.c:(.text+0xc6): undefined reference to `bb'
Math3asm.c:(.text+0xcf): undefined reference to `bbb'
Math3asm.c:(.text+0xd8): undefined reference to `bbbb'
collect2: error: ld returned 1 exit status

Ah fuck me with this bullshit. I google to find what's going on and I drop into this:

https://gcc.gnu.org/ml/gcc-help/2009-07/msg00044.html

...where a guy gets something similar... and here comes da bomb:

Quote

> Compilation passes - but the linker shouts: "undefined reference to `n'"
>
> What am I doing wrong? Shouldn't it be straight forward to translate
> these simple commands to Linux?


gcc inline assembler does not work like that.  You can't simply refer to
local variables in the assembler code.


L O L
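For the record, GCC's extended asm syntax does make this work, because operands are passed in via constraints rather than by symbol name. A minimal sketch of my own (assuming a reasonably recent GCC with the vector extension; the values are placeholders, not the benchmark's):

Code:
#include <stdio.h>

typedef double v2df __attribute__((vector_size(16)));  /* two packed doubles */

int main(void)
{
  double b = 2.0, bb = 3.0, bbb = 5.0, bbbb = 7.0;
  v2df v1 = {b, bb}, v2 = {bbb, bbbb};

  /* "+x" binds each vector to an SSE register, read-write (AT&T syntax) */
  __asm__ ("sqrtpd %0, %0" : "+x" (v1));
  __asm__ ("sqrtpd %0, %0" : "+x" (v2));

  b = v1[0]; bb = v1[1]; bbb = v2[0]; bbbb = v2[1];
  printf("%f %f %f %f\n", b, bb, bbb, bbbb);
  return 0;
}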
sr. member
Activity: 420
Merit: 262
The reason that programmers employ linked lists is precisely the reason we shouldn't ever use them. And analogously why we shouldn't use pointers. And thus why Rust is a major improvement because it allows us to avoid Java/JVM's "always boxed" design.

Edit: Sean Parent relates this concept more from its generative essence of memory locality trumping the big O time complexity of the data structure.