Old Is Gold: Optimizing Single-Threaded Applications with Exgen-Malloc

fanf2 a day ago

At a quick skim this looks like they reinvented something very similar to phkmalloc, but they didn’t cite phkmalloc nor include it in their benchmarks.

https://phk.freebsd.dk/sagas/phkmalloc/

https://cgit.freebsd.org/src/tree/lib/libc/stdlib/malloc.c?h...

jauntywundrkind a day ago

It feels like there's so many weird interesting wins from abandoning SMP CPU coherency. Giving each core its own memory space & own work skips by so many gotchas & contentions.

This is nicely moving down the stack from some other nearby work. ByteDance just released code for Parker, a Linux multi-kernel approach where each core gets its own copy of Linux (and there's one coordinator core). There's another more general multi-kernel on one system approach that also has been quite active recently, that's more general (not strictly 1:1 cores kernel). https://www.phoronix.com/news/Linux-Parker-Proposal https://www.phoronix.com/news/Multi-Kernel-Linux-v2

(Obviously we can and do do lots of single thread per core work already: these emerging multi-kernel ideas are trying to push new territory, new isolation, eliminate yet more contention.)

bcrl a day ago

Parker is what Larry McVoy advocated for Linux back during the early days of multiprocessor scaling work. The idea was basically to treat an MP system as a cluster. Everything old is new again!
Personally, I would never agree to give up SMP CPU coherency. Multiprocessor systems are hard enough to debug with hardware cache coherency that adding in entirely new unpredictable non-deterministic behaviour would lead to more developers losing the rest of their hair prematurely. And it would likely introduce an entirely new class of security issues that nobody ever imagined that would require even worse performance draining software workarounds.
Some things are best done in hardware.
- gregw2 19 hours ago
  
  Larry (SGI) had lived through IRIX fine grained locking and even SGI's NUMA hardware cache coherency based on Stanford research right? Was his take that the complexity wasn't worth it given his experiences at SGI, or that it was just too much for an open source community to tackle without owning the hardware layers?
  (And did Maddog (DEC) with a different set of experiences agree?)
  - vacuity 15 hours ago
    
    The trend of multicore and NUMA means that hardware increasingly looks like a traditional network of many separate computers. The natural conclusions of single-core scaled up to, say, 4 cores, shift when there are 8+ cores. Locality becomes crucial; just as you wouldn't split up data-path dependencies across LANs, you shouldn't split them up across NUMA sockets either. Ignoring arguments about locking, message passing, cache management, and whatever, the most pressing argument for multikernels (or at least, far increased per-core state and reduced shared state) is that locality is essential for performance.
    
    layla5alive 3 hours ago
    
    Yup, data movement and contention and coherencey are the things that will increasingly dominate power use as core scaling continues. Exploiting locality is a must for high performance systems.
    Linux would benefit from a scheduler per CCD (in AMD parlance) approach being a first-class option. CCD pinning is a mechanism to push in this direction today, but partitioning kernel scheduler(s) along hardware boundaries would reduce complexity and overhead for a lot of use cases..
- vacuity a day ago
  
  See also Barrelfish for a multikernel research implementation. I think fos also qualifies.
  > Personally, I would never agree to give up SMP CPU coherency. Multiprocessor systems are hard enough to debug with hardware cache coherency that adding in entirely new unpredictable non-deterministic behaviour would lead to more developers losing the rest of their hair prematurely. And it would likely introduce an entirely new class of security issues that nobody ever imagined that would require even worse performance draining software workarounds.
  What are you envisioning is the alternative hardware (or is it software?), and why? I assume this is referring to some mechanism for multikernel support that doesn't rely on cache coherence. It seems like there are probably alternatives to full cache coherence that would be neutral, or better, after experience. You didn't provide substantive evidence, but on the other hand, at least multikernels on unmodified hardware seem promising.