Mack, it looks like your post did not go through, but I may have finally had a breakthrough on this issue of mine. Tentatively, I am going to say the problem was that my Max Pooled Statements setting was not high enough for some of our datasources.
All of mine were set to 100, which is what they have been at FOREVER, so although the CF10 documentation says "300" is now the default, I think I simply copied over "100" from my old installations every time we upgraded CF, without even thinking obviously.
Meanwhile, our use of cfqueryparam has gone up over the years, and I never knew use of cfqueryparam had anything to do with Max Pooled Statements. Now for some reason that still doesn't make a lot of sense to me (although Anit tried explaining it to me on the phone), when your pool runs out, all threads become BLOCKED. They do not wait or queue up like you might expect; they just stop, and this apparently can and does lock up the whole damn CF server, which will then start giving you happy joy 503 errors. Even better is that none of this information gets logged in any sensible way to help trace back the problem, such as "you have x number of cfqueryparam statements, which exceeds your pool size" or something to that effect.
So last night after I upped my pools from 100 to 300 I did not have any problem with my server for a whole 12 hours (gasp!). Happy me, right? Wrong. It all came crashing down in the middle of the work day today with a different super joy error:
[Macromedia][Oracle JDBC Driver][Oracle]ORA-01000: maximum open cursors exceeded.
I knew intrinsically this had something to do with the changes I made to the Max Pooled Statements, so now the issue became our Oracle database itself, which I do not manage. So I asked our DBA and he said "open cursors" was set to exactly 300 on his end, so no wonder we ran out of cursors when I upped it on my end. And I upped it for several of our datasources, not just the "big" one, because I had upped it before but ONLY for the one I thought was the problem, not the others we maintain, and I still had the damn problem so I ruled out the whole Max Pooled Statements thing at one point and moved on. Turns out it was a different DSN that needed to be increased, or a combination of them all...I'm still not 100% sure.
So now Oracle has been tuned to allow 2000 open cursors, and I have increased Max Pooled Statements to a much larger value on all my datasources, because I actually counted how many cfqueryparams we were using and, yeah, there are a lot. To make things more confusing, they are not on the main production site, they are on the intranet, which gets very little load (except at night when our Google Appliance scans the crap out of it), so go figure. Most of the 503 / server lock-up events were at night, but once in a while it was during the day. Fun times troubleshooting that.
Moral of the story if you are having similar issues: count up your usage of unique cfquery tags that use cfqueryparam (and/or cfstoredproc) and make sure your Max Pooled Statements are at least equal to that number, THEN make sure your database is configured to allow at least that many open cursors for however many datasources you have configured. If you have 3 datasources, each with 300 max pooled statements, then your DB should allow at least 900 cursors to be safe. I am speaking for Oracle; I'm not sure how the other DBs handle it, although I can't imagine they are that different.
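For anyone who wants to do the counting without eyeballing every template, here is a rough Python sketch of the two steps above. It is only an approximation under some assumptions of mine: it treats every cfquery block containing a cfqueryparam, and every cfstoredproc tag, as one pooled statement, it ignores queries written in cfscript, and it does not split the count by datasource. The function names and the file extensions it scans are made up for illustration, so adjust to taste.

```python
import re
from pathlib import Path

# One <cfquery ...>...</cfquery> block, case-insensitive, spanning lines.
CFQUERY_BLOCK = re.compile(r"<cfquery\b.*?</cfquery\s*>", re.IGNORECASE | re.DOTALL)
CFQUERYPARAM = re.compile(r"<cfqueryparam\b", re.IGNORECASE)
CFSTOREDPROC = re.compile(r"<cfstoredproc\b", re.IGNORECASE)


def count_parameterized_statements(source: str) -> int:
    """Count cfquery blocks that contain a cfqueryparam, plus cfstoredproc tags.

    Assumption: each such tag corresponds to one pooled statement.
    """
    queries = sum(
        1
        for block in CFQUERY_BLOCK.finditer(source)
        if CFQUERYPARAM.search(block.group(0))
    )
    procs = len(CFSTOREDPROC.findall(source))
    return queries + procs


def count_in_tree(root: str, extensions=(".cfm", ".cfc")) -> int:
    """Sum the per-file counts for every template under root (hypothetical helper)."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in extensions:
            total += count_parameterized_statements(path.read_text(errors="ignore"))
    return total


def min_open_cursors(max_pooled_statements_per_dsn) -> int:
    """Worst case from the rule of thumb above: every pool is full at once."""
    return sum(max_pooled_statements_per_dsn)
```

Running count_in_tree against your web root gives a ballpark figure to compare with Max Pooled Statements, and min_open_cursors([300, 300, 300]) is just the 3 x 300 = 900 arithmetic from above. Treat the result as a floor, not an exact number.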
Here are some great articles that go on to explain: http://jehiah.cz/a/maximum-open-cursors-exceeded and http://headsplodeblog.blogspot.com/2010/06/coldfusion-and-ora-01000-too-many-open.html
I will only be convinced that I've finally found the problem once I have days, if not weeks, of flawless uptime, so we will see, and I will report back if the errors go away for good. If they don't, I will feel like an idiot for posting all this information! I am grateful for Adobe's help and to everyone else on these forums who has been listening and trying to help me solve this problem, which I think may have been the root of my other problems here.