Friday, July 10, 2020

Rebuilding a Sabermetric Workbench in a LOCKDOWN (part 1)

Three months into the COVID19 lockdown and I've finally decided to brush the cobwebs off my old database and get back into running some baseball queries again.  Initially I had been hesitant to re-acquaint myself with SQL because the sabermetric industry has evolved so far into a transcendent thing I can barely recognize since I became too busy to keep pace back in 2012 or so. But I think back to the last time there was this much economic anxiety in the air and the most pacifying refuge in those times was ripping retrosheet queries so why not give it a shot this time around as well? 

First,  the first few links at Sky's Saberizing a Mac series at BTB are still active-- downloading MAMP and SQL PRO went smoothly. The baseball-databank seems to be defunct but I did find the Lahman database here. The Colin Wyers tutorial has moved to here.


I tried some basic stuff to shake off the dust-- Highest single season walk rates (min 300 PA):

SELECT b.playerID, b.yearID
, SUM(b.ab)+SUM(b.bb)+SUM(b.sh)+SUM(b.sf)+SUM(b.ibb) as PA
, ROUND(SUM(b.bb)/(SUM(b.ab)+SUM(b.bb)+SUM(b.sh)+SUM(b.sf)+SUM(b.ibb))*100,1) as BB_pct
from batting b
GROUP BY b.playerId, b.yearID
HAVING PA > 300
ORDER BY BB_pct desc

LIMIT 100;
I've always had an affection for players that could earn walks without much power-- it's the underdog thing I suppose. But I'm not sure if I've ever pursued this particular rabbit hole before and that really surprises me.


Ah yes all the old classics are here. Yes, old reliables Bondsba and Ruthba and Willite, yes, of course. And some Robinya character but he's from 1890 though so that doesn't really count but wait wait WAIT-- who is this 'Fainfe01'? Who is this Fainfe01 with the third highest single-season walk rate ever???

You see, I was almost immediately reminded of the strange similarities between the unnatural abilities of baseball statistics and those of the dark arts. There is an uneasy way a query can unearth an old story or legend or myth. Or quite possibly at times a query can unite two long lost soul mates across space and time, kept apart for reasons only the logic of the gods of fate could possibly understand.

This Fainfe01 is Ferris Fain who was born in San Antonio, who grew up in Oakland and was recruited by Lefty O'Doul himself but because of his war service didn't see major league time until his debut at age 26. He ripped 400+ OBPs across his entire career, won two batting titles and led the league in doubles once, but still never hit more then 10 home runs in a season. This is Ferris Fain whose career was hampered by battles with the bottle and a terrible temper and suspensions and fights with teammates and fans alike, and finally succumbed to injuries and retired to growing marijuana in the fields of the Sierra Nevada Mountains.

So this 1954 season was absolutely no fluke. So now I run a query on all time career walk rates, and again there he shows up. Maybe I sound as though I am being a bit too dramatic about this, but you have to understand this is my wheelhouse and somehow Fain has slipped through my detection. I suppose it's the modestly sized career-- just 4900 PA. Maybe I never ran a query with that minimum stipulation before, I suppose I likely rounded up to 5000 in the past. (For that matter I don't know what possessed me to place the minimum at just 300 PA in the single-season query before but these are questions you don't ask when desperately seeking for the light in the darkness of a post-COVID world.)

SELECT CONCAT(p.namefirst, ' ', p.namelast) as player
, MAX(B.YEARID) AS LASTYR
, COUNT(DISTINCT b.yearID) as years
, SUM(b.ab) + SUM(b.bb) + SUM(b.hbp) + SUM(b.sf)+ SUM(b.sh) as PA
, ROUND(SUM(b.h) / SUM(b.ab), 3) as AVG
, ROUND( (SUM(b.h) + SUM(b.bb) + SUM(b.hbp)) / (SUM(b.ab) + SUM(b.bb) + SUM(b.hbp) + SUM(b.sf)), 3) as OBP
, ROUND( (SUM(b.h) + SUM(b.2b) + 2*SUM(b.3b) + 3*SUM(b.hr)) 
/ SUM(b.ab), 3) as SLG   
, ROUND((SUM(b.h) + SUM(b.bb) + SUM(b.hbp)) 
/ (SUM(b.ab) + SUM(b.bb) + SUM(b.hbp) + SUM(b.sf)) + (SUM(b.h) + SUM(b.2b)
+ 2*SUM(b.3b) + 3*SUM(b.hr)) / SUM(b.ab), 3) as OPS
, ROUND(SUM(b.bb) 
/(SUM(b.ab) + SUM(b.bb) + SUM(b.hbp) + SUM(b.sf)+ SUM(b.sh))*100,1) as BB_pct
    FROM batting b, people p
    WHERE b.playerID = p.playerID  AND b.yearID > 1900
    GROUP BY b.playerID
    HAVING PA > 4900       
    ORDER BY BB_pct desc

    limit 100;


The Lahman zip folder comes with a table called "People" that logs biographical info which used to be called "master" so I'll have to adjust my old saved code accordingly. I ran into a couple of errors initially-- mainly with the JOIN process, and some embarrassingly beginner mistakes with the GROUP BY command. Also had some trouble organizing the components in the BB% code-- for instance, I had to remember that IBB are included in the BB count, while SH and SF are separate events entirely. But ultimately I confirmed my results at Fangraphs.

Also had to run indexes on playerIDs, yearIDs, teamIDs, after realizing the JOIN was taking way too long:

CREATE INDEX player_idx ON people (playerID);
CREATE INDEX player_idx ON batting (playerID);CREATE INDEX year_idx ON batting (yearID);CREATE INDEX team_idx ON batting (teamID); 


Despite the rust, I was very much encouraged by the success of this first day back. Tomorrow I'll incorporate yearly WAR totals into the db.

Ferris Fain, I don't know where you've been all my life, or how we missed each other like two ships in the night for so long, but here we are.