Profile, profile, profile
A common belief is that you should always profile your code to diagnose a performance problem before trying to optimize. This is generally a good idea, although sometimes it's taken to absurd extremes. You don't really need a profiler to see that an O(N^2) algorithm with N=40000 is likely a problem -- all you need is to break into the app with a debugger a few times and see where it stops. That said, the quicker your profiler is to use, the easier it is to skip the guesswork and get some hard data. There's nothing like a profile showing a particular function at 98% of the CPU to identify a culprit.
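To put rough numbers on that, here is a small back-of-the-envelope sketch. It is only an illustration under assumptions of my choosing: the function names are made up, the quadratic algorithm is taken to be a plain all-pairs loop, the hot loop is assumed to eat 90% of the running time, and each debugger break is treated as an independent, uniformly random sample of where the program is.

```python
def all_pairs_steps(n: int) -> int:
    """Inner-loop iterations of a naive all-pairs loop (i < j) over n items."""
    return n * (n - 1) // 2


def chance_of_hit(hot_fraction: float, breaks: int) -> float:
    """Probability that at least one of `breaks` random pauses lands in code
    that accounts for `hot_fraction` of the running time.

    Assumes each pause is an independent, uniformly random point in time.
    """
    return 1.0 - (1.0 - hot_fraction) ** breaks


if __name__ == "__main__":
    n = 40000  # the N from the text
    print(f"{all_pairs_steps(n):,} inner-loop steps for N={n}")

    # 0.9 is an assumed share of run time, not a measured one.
    for k in (1, 3, 5):
        p = chance_of_hit(0.9, k)
        print(f"{k} break(s): {p:.3f} chance of landing in the hot loop")
```

With a loop that dominant, a handful of breaks is very unlikely to miss it, which is really all a sampling profiler does, just many more times and with bookkeeping.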
Most people I know seem to prefer call graph profiling for the detail of the data it produces, but I like sampling profilers myself, because they're less intrusive and the data is more reliable, although less precise. They also often have the nice advantage that you can simply start them at any time and profile the whole system on the spot, without having to launch the application under the profiler or terminate the app when profiling completes. So when the program I was working on was unexpectedly running at one-third of the speed that it should have, I just launched AMD CodeAnalyst and fired off a standard 20-second no-launch sampling run (which lives in a profiling project named "whatever").
Well, the profile showed a bunch of function names that started with @ILT... which, if you're familiar with Visual C++, stands for "incremental link table." Those are the jump thunks the linker emits when incremental linking is enabled, as it normally is in a Debug configuration. In other words, the program was running slowly because I was running the unoptimized debug build.
Sheepishly, I stopped the program, changed the configuration from Debug to Release, and solved the performance problem of the day.