Is not placing the Merkle root for the block's witness data in a transaction rather than in the block's header, not a quintessential example of a "kludge" or a "hack"?
The block header would emphatically _not_ be the right place to put it. Increasing the header size would drive up costs for applications of the system, including ones which do not need the witness data. No one proposing a hardfork is proposing header changes either, so even if was-- thats a false comparison. (But I wish you luck in a campaign to consign much of the existing hardware to the scrap heap, if you'd like to try!)
What about the accounting trick needed to permit more bytes of transactional data per block without a hard-forking change? This hack doesn't just add technical debt, it adversely affects the economics of the system itself.
Fixing the _incorrect costing_ of limiting based on a particular serialization's size-- one which ignores the fact that signatures are immediately prunable while UTXO are perpetual state which is far more costly to the system-- was an intentional and mandatory part of the design. Prior to segwit the discussion coming out of Montreal was that some larger sizes could potentially be safe, if there were a way to do so without worsening the UTXO bloat problem.
In a totally new system that costing would likely have been done slightly differently, e.g. also reducing the cost of the vin txid and indexes relative the txouts; but it's unclear that this would have been materially better; and whatever it was would likely have been considerably more complex (look at the complex table of costs in the ethereum greenfield system), likely with little benefit.
Again we also find that your technical understanding is lacking. No "discount" is required to get a capacity increase. It could have been constructed as two linear constraints: a 1MB limit on the non-witness data, and a 2MB limit on the composite. Multiple constraints are less desirable, because the multiple dimensions potentially makes cost calculation depend on the mixture of demands; but would be perfectly functional, the discounting itself is not driven by anything related to capacity; but instead is just a better approximation of the system's long term operating costs. By better accounting for the operating costs the amount of headroom needed for safety is reduced.