After five years working on HTTP proxies, caching, high-speed file logging, and distributed quota, I believe the following summarizes my experience:
1) Make sure you use all the CPU
2) Reduce your CPU usage
As innocent as they look, these are by no means simple to achieve. Locks, system calls, network settings, kernel tuning, thread counts, contention, context switches, memory allocation, and so on all make it difficult to saturate the CPU. Reducing your CPU usage is comparatively easy: just use a profiler. Even that is a bit tricky, because some profilers take global locks while collecting data points, and you get the observer effect in action: measuring the program changes how it behaves.
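The contention point above can be sketched in a few lines. This is my illustration, not code from any of the systems mentioned: the function names, worker counts, and iteration counts are made up. It compares incrementing one mutex-protected counter from many goroutines against giving each goroutine its own counter and merging at the end.

```go
// Sketch: why lock contention keeps you from saturating the CPU.
// Many workers fighting over one mutex spend their time blocked and
// context-switching; the same work split across per-worker counters
// lets every core run flat out.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

const iterations = 1_000_000

// contended: every increment takes a single shared mutex.
func contended(workers int) (int64, time.Duration) {
	var mu sync.Mutex
	var counter int64
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter, time.Since(start)
}

// sharded: each worker owns its counter; merge once at the end.
// Real code would pad each slot to a cache line to avoid false
// sharing; skipped here to keep the sketch short.
func sharded(workers int) (int64, time.Duration) {
	counters := make([]int64, workers)
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				counters[id]++
			}
		}(w)
	}
	wg.Wait()
	var total int64
	for _, c := range counters {
		total += c
	}
	return total, time.Since(start)
}

func main() {
	workers := runtime.NumCPU()
	c, dc := contended(workers)
	s, ds := sharded(workers)
	fmt.Printf("contended: count=%d in %v\n", c, dc)
	fmt.Printf("sharded:   count=%d in %v\n", s, ds)
}
```

Both versions do the same amount of "work", but only the sharded one keeps every core busy doing it; the contended one shows high wall-clock time with the CPUs far from saturated, which is exactly what the list above is about.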
I think of software in three dimensions. Speed, Simplicity and Power.
Speed - How fast it runs
Simplicity - How simple is it to use / maintain
Power - The range of problems that can be solved using it
Think iPhone. Think SQL. Memcached has speed and simplicity, but it is not powerful enough.
Writing high-performance software is fun, but don't forget the simplicity and power aspects. If it is not simple and not powerful, most likely no one will care how fast it runs. If you make it simple and speedy but not powerful, your target customer set becomes small; not everyone is solving the exact same problem. Make it speedy and powerful, and customers will have a tough time using it. They will probably claim it is not powerful, because that power is hidden behind an interface that is not simple enough. Make it simple and powerful but not speedy, and customers will compare it with some other product with higher TPS and won't buy. Not that they need the speed, but only to future-proof their investment. Every customer feels they will grow and need more and more.
Overdoing speed, simplicity, or power will have some impact on the other two. Make sure to choose the right combination.