pclmulqdq 3 months ago

I read through this book last year, and it was a very good book. I loved the first half, and it hits on something I tell other engineers about performance: your performance tool will tell you the truth, but it will tell you the truth about a very narrow question. That means you should be using it as a hypothesis-testing tool, which in turn means developing the hypothesis first, based on an understanding of how the computer works. I'm pretty sure it's the first half, on how to think about software in order to generate these hypotheses, that sells the book.

However, the last third/half lost me. It primarily discusses (and advertises) the use of one tracing tool that the author built. All performance tools, particularly tracing tools, which tend to be very heavy, have strengths and weaknesses, and you are going to need to mix your tools if you want to really understand things.

It's well worth the money and the time for the first half of the book, though.

eatonphil 3 months ago

A gang of us are reading through this book right now, and as luck would have it, Dick Sites has joined in too. It's quite an interesting book and quite challenging too. I love the performance archaeology Sites has done, and I also like the emphasis on 1) understanding his five stated fundamental resources (disk, network, CPU, memory, and software critical sections) and 2) how profiling (with hardware performance counters) can be cheap and effective but will only help you with average performance, not p99 behavior. For that you need tracing.

We're halfway through the book, so my takeaways may differ by the end. But it's possibly the most densely packed book I've read. Will definitely require future rereading.

  • nsguy 3 months ago

    Book looks very interesting.

    Random nit (obviously not having read the book): I would say p99 behaviour can be captured in profiling; it's just going to be the p99 of the profiles. E.g. if you sample 10,000 stack traces out of your executable, 1% of those are going to be in that p99, sort of by definition. Tracing through requests is useful, but based on my experience I wouldn't make as strong a statement as saying it's the only way of understanding p99 performance.
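
    A back-of-the-envelope sketch of what I mean (Python, with made-up numbers; nothing here is from the book): if a small fraction of requests take ~10x longer than the rest, that slow path accounts for a disproportionate share of total time, so a time-based sampling profiler will show it prominently even though it never looks at individual requests.

        # Toy model: ~2% of requests hit a slow path. A sampling profiler takes
        # samples roughly in proportion to where time is spent, so the slow path
        # shows up in the flat profile even without any per-request bookkeeping.
        import random

        random.seed(0)

        N = 100_000
        latencies = []      # per-request latency in ms
        slow_time = 0.0     # time spent in the (hypothetical) slow path
        total_time = 0.0

        for _ in range(N):
            if random.random() < 0.02:       # ~2% of requests hit the slow path
                t = random.gauss(10.0, 1.0)  # ~10 ms
                slow_time += t
            else:
                t = random.gauss(1.0, 0.1)   # ~1 ms
            latencies.append(t)
            total_time += t

        latencies.sort()
        print(f"p99 latency: {latencies[int(0.99 * N)]:.1f} ms")
        print(f"share of samples a profiler would put in the slow path: {slow_time / total_time:.1%}")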

    • pclmulqdq 3 months ago

      While this is correct in a literal sense, you are missing something about the p99 (and p99.9) tails of the end-to-end system that the profiler won't tell you: sources of "slowness" are often correlated within these requests. Some systems I have worked on have p99 times that are built out of a combination of 90th-percentile events that you would find in a profiler. In that case, the profiler doesn't point to anything being particularly bad.

      Profilers also say nothing about queueing, and they can very much mislead you if you care specifically about latency.

      If your "slowness" is driven by a single function (or made of truly uncorrelated events), you can accurately measure your tails with a profile. If not, a trace will give you meaningfully more information.

      • nsguy 3 months ago

        Sure. The profiler is going to give you information related to what it is looking at. If your bottleneck is disk I/O then you need to look at disk I/O. If your bottleneck is some other mechanism that's not purely cycles then you need to look at the relevant metrics.

        Your slowness is always a function of the underlying building blocks, their performance distributions, and the bottleneck. And sure, two 90th percentiles can make for a 99th percentile. A profiler won't magically convey what sequence of operations a request is doing under the hood.

        I agree that having visibility into the requests via tracing can help zoom in on the problem. But so can having metrics on the underlying systems, e.g. if you have a queue in your system you could look at the performance of that queue.

        I'll admit that most of my experience is tuning for maximal throughput rather than for a given percentile; as a rule of thumb, systems with high performance/throughput yield a much flatter distribution of latencies at a given workload. I also tend to think about my "budget" in the various parts of the system to get to the desired performance characteristics, a luxury you don't have on "legacy" systems where you need to troubleshoot some behaviour and where tracing also lets you get a "cut" through the stack that shows you what's going on.

        • signa11 3 months ago

          > ... A profiler won't magically convey the information about what sequence of operations a request is doing under the hood. ...

          have a look at KUtrace [https://github.com/dicksites/KUtrace]. it does.

    • eatonphil 3 months ago

      Yeah, bad choice of words. Good thing I didn't write the book. :) How about: the long tail of bad behavior can hide behind sampling profilers.

CalChris 3 months ago

TL;DR? Well, his article Benchmarking "Hello, World!" develops a lot of the ideas that show up in his book.

https://queue.acm.org/detail.cfm?id=3291278

  • rramadass 3 months ago

    Nice; Thank You.

    This looks quite good and a better investment of time before diving into the book itself.

    • signa11 3 months ago

      if that is the kind of thing you like, this: https://www.youtube.com/watch?v=D_qRuKO9qzM is that kind of thing you would like.

      • rramadass 3 months ago

        Great; Almost a video summary of the book by the author himself!

        • signa11 3 months ago

          yeah, that’s how i ended up buying it. reading it (along with the exercises) is quite rewarding indeed. i feel that the description of caches should have been more detailed though. Curt Schimmel’s book comes to mind here…