Friday, July 10, 2020

Rebuilding a Sabermetric Workbench in a LOCKDOWN (part 1)

Three months into the COVID-19 lockdown, I've finally decided to brush the cobwebs off my old database and get back into running some baseball queries. Initially I had been hesitant to re-acquaint myself with SQL, because since I became too busy to keep pace back in 2012 or so, the sabermetric industry has evolved into a transcendent thing I can barely recognize. But I think back to the last time there was this much economic anxiety in the air, and the most pacifying refuge in those times was ripping Retrosheet queries, so why not give it a shot this time around as well?

First, the first few links in Sky's Saberizing a Mac series at BTB are still active-- downloading MAMP and SQL PRO went smoothly. The baseball-databank seems to be defunct but I did find the Lahman database here. The Colin Wyers tutorial has moved to here.


I tried some basic stuff to shake off the dust-- Highest single season walk rates (min 300 PA):

-- PA = AB + BB + HBP + SF + SH (IBB is already counted inside BB, so it is not added again)
SELECT b.playerID, b.yearID
, SUM(b.ab)+SUM(b.bb)+SUM(b.hbp)+SUM(b.sf)+SUM(b.sh) as PA
, ROUND(SUM(b.bb)/(SUM(b.ab)+SUM(b.bb)+SUM(b.hbp)+SUM(b.sf)+SUM(b.sh))*100,1) as BB_pct
from batting b
GROUP BY b.playerID, b.yearID
HAVING PA >= 300
ORDER BY BB_pct desc
LIMIT 100;

I've always had an affection for players who could earn walks without much power-- it's the underdog thing I suppose. But I'm not sure if I've ever pursued this particular rabbit hole before and that really surprises me.


Ah yes all the old classics are here. Yes, old reliables Bondsba and Ruthba and Willite, yes, of course. And some Robinya character but he's from 1890 though so that doesn't really count but wait wait WAIT-- who is this 'Fainfe01'? Who is this Fainfe01 with the third highest single-season walk rate ever???

You see, I was almost immediately reminded of the strange similarities between the unnatural abilities of baseball statistics and those of the dark arts. There is an uneasy way a query can unearth an old story or legend or myth. Or quite possibly at times a query can unite two long lost soul mates across space and time, kept apart for reasons only the logic of the gods of fate could possibly understand.

This Fainfe01 is Ferris Fain who was born in San Antonio, who grew up in Oakland and was recruited by Lefty O'Doul himself but because of his war service didn't see major league time until his debut at age 26. He ripped .400+ OBPs across his entire career, won two batting titles and led the league in doubles once, but still never hit more than 10 home runs in a season. This is Ferris Fain whose career was hampered by battles with the bottle and a terrible temper and suspensions and fights with teammates and fans alike, and who finally succumbed to injuries and retired to growing marijuana in the fields of the Sierra Nevada Mountains.

So this 1954 season was absolutely no fluke. So now I run a query on all time career walk rates, and again there he shows up. Maybe I sound as though I am being a bit too dramatic about this, but you have to understand this is my wheelhouse and somehow Fain has slipped past my detection. I suppose it's the modestly sized career-- just 4900 PA. Maybe I never ran a query with that minimum stipulation before; I suppose I likely rounded up to 5000 in the past. (For that matter I don't know what possessed me to place the minimum at just 300 PA in the single-season query before but these are questions you don't ask when desperately searching for the light in the darkness of a post-COVID world.)

-- Career batting lines since 1901, minimum 4900 PA
SELECT CONCAT(p.nameFirst, ' ', p.nameLast) AS player
, MAX(b.yearID) AS lastyr
, COUNT(DISTINCT b.yearID) AS years
, SUM(b.ab) + SUM(b.bb) + SUM(b.hbp) + SUM(b.sf) + SUM(b.sh) AS PA
, ROUND(SUM(b.h) / SUM(b.ab), 3) AS AVG
, ROUND((SUM(b.h) + SUM(b.bb) + SUM(b.hbp))
    / (SUM(b.ab) + SUM(b.bb) + SUM(b.hbp) + SUM(b.sf)), 3) AS OBP
-- total bases = H + 2B + 2*3B + 3*HR; column names that start with a digit are backtick-quoted
, ROUND((SUM(b.h) + SUM(b.`2B`) + 2*SUM(b.`3B`) + 3*SUM(b.hr))
    / SUM(b.ab), 3) AS SLG
, ROUND((SUM(b.h) + SUM(b.bb) + SUM(b.hbp))
    / (SUM(b.ab) + SUM(b.bb) + SUM(b.hbp) + SUM(b.sf))
    + (SUM(b.h) + SUM(b.`2B`) + 2*SUM(b.`3B`) + 3*SUM(b.hr))
    / SUM(b.ab), 3) AS OPS
-- IBB is already part of BB, so it is not added to PA
, ROUND(SUM(b.bb)
    / (SUM(b.ab) + SUM(b.bb) + SUM(b.hbp) + SUM(b.sf) + SUM(b.sh)) * 100, 1) AS BB_pct
FROM batting b
JOIN people p ON b.playerID = p.playerID
WHERE b.yearID > 1900
GROUP BY b.playerID, p.nameFirst, p.nameLast
HAVING PA >= 4900
ORDER BY BB_pct DESC
LIMIT 100;


The Lahman zip folder comes with a table called "People" that logs biographical info. It used to be called "master", so I'll have to adjust my old saved code accordingly. I ran into a couple of errors initially-- mainly with the JOIN process, and some embarrassingly beginner mistakes with the GROUP BY command. I also had some trouble organizing the components in the BB% code-- for instance, I had to remember that IBB are included in the BB count, while SH and SF are separate events entirely. But ultimately I confirmed my results at Fangraphs.
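To make the PA bookkeeping concrete, here is a minimal Python sketch of the same arithmetic the queries use. The function names and the season line are made up for illustration; they are not part of the Lahman data or my saved code.

```python
def plate_appearances(ab, bb, hbp, sf, sh):
    """PA = AB + BB + HBP + SF + SH.

    IBB is deliberately not a parameter: intentional walks are
    already counted inside BB, so adding them would double count.
    """
    return ab + bb + hbp + sf + sh


def bb_pct(ab, bb, hbp, sf, sh):
    """Walk rate as a percentage of plate appearances, rounded to 0.1."""
    pa = plate_appearances(ab, bb, hbp, sf, sh)
    if pa == 0:
        return None  # no plate appearances, no rate
    return round(100 * bb / pa, 1)


# Hypothetical season line, invented purely for illustration.
season = dict(ab=400, bb=80, hbp=5, sf=5, sh=10)
print(plate_appearances(**season))
print(bb_pct(**season))
```

The made-up line works out to 500 PA, so 80 walks is a 16.0% walk rate.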

I also had to create indexes on playerIDs, yearIDs, teamIDs, after realizing the JOIN was taking way too long:

CREATE INDEX player_idx ON people (playerID);
CREATE INDEX player_idx ON batting (playerID);
CREATE INDEX year_idx ON batting (yearID);
CREATE INDEX team_idx ON batting (teamID);


Despite the rust, I was very much encouraged by the success of this first day back. Tomorrow I'll incorporate yearly WAR totals into the db.

Ferris Fain, I don't know where you've been all my life, or how we missed each other like two ships in the night for so long, but here we are.




Sunday, April 22, 2018

Striking out and winning in Colorado

Friday night on the broadcast JD mentioned a pet theory he had about success in Coors Field. Considering how much the ball flies off the bat when put in play in the high elevation of Colorado, he wondered if the team that struck out less often therefore won more games.

So I brushed the dust off my retrosheet database that's been dormant since my two toddlers tied me up with duct tape and scrounged up some data to investigate JD's theory.

I'm using data from 2002-2012 for this mostly because it was readily available to me.

If the team that struck out more in a single game lost that game, I marked it as a "THEORY_YES". If the team that struck out more won, I marked it as a "THEORY_NO". If the teams struck out the same number of times, I marked it as a tie.
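The marking rule is simple enough to sketch in a few lines of Python. This is an illustration, not the query I actually ran: the function names and the four sample games are hypothetical, and the N/A column is not modeled.

```python
from collections import Counter


def classify(winner_so, loser_so):
    """Label one game from the strikeout totals of the winner and loser."""
    if winner_so < loser_so:
        return "THEORY_YES"   # the team that struck out less won
    if winner_so > loser_so:
        return "THEORY_NO"    # the team that struck out more won
    return "TIE"              # both teams struck out equally often


def tally(games):
    """Count labels over an iterable of (winner_so, loser_so) pairs."""
    return Counter(classify(w, l) for w, l in games)


def yes_pct(counts):
    """THEORY_YES share of decided games (ties thrown out), as a percent."""
    decided = counts["THEORY_YES"] + counts["THEORY_NO"]
    if decided == 0:
        return None
    return round(100 * counts["THEORY_YES"] / decided, 1)


# Hypothetical games, invented for illustration only.
games = [(5, 9), (10, 7), (6, 6), (4, 11)]
counts = tally(games)
print(counts["THEORY_YES"], counts["THEORY_NO"], counts["TIE"])
print(yes_pct(counts))
```

Note that YES% divides by decided games only, so ties never count for or against the theory.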

In Coors Field from 2002-2012:

           THEORY_YES  THEORY_NO  TIE  N/A  YES%
COL GAMES         442        352  105    0  55.7

So, yes, the team that strikes out less wins more often in Coors-- 55.7% of the time (if we throw out the "tie" games). JD's suspicions are correct in this respect.

But how does this compare to games in other parks? It stands to reason that if you aren't striking out as much as the other team you are probably having a better day offensively, no matter what park you are in. That's just the laws of baseball.

So for all other games in that same time period:

               THEORY_YES  THEORY_NO   TIE  N/A  YES%
COL                   442        352   105    0  55.7
NON-COL GAMES       13952       9495  2736    3  59.5

So that 55.7% rate is actually lower than the rate for the rest of the league. Your guess is as good as mine as to why this is the case, but my guess is randomness. The league rate sits at 60%-ish pretty steadily from year-to-year over that timeframe, but with a yearly sample of just 81 games, the rate for games in Colorado jumps around quite a bit.
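For a rough sense of how much a single Coors season can bounce around on sample size alone, here is a back-of-the-envelope sketch in Python. It assumes every decided game is an independent coin flip weighted at the league rate, which is a simplification I am imposing here, not something the data proves.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion p observed over n independent trials."""
    return math.sqrt(p * (1 - p) / n)


league_rate = 0.595  # non-Colorado YES% from the table above

# Decided (non-tie) Coors games per season, averaged over the
# 11 seasons in the sample: (442 + 352) / 11, a bit over 72.
decided_per_season = (442 + 352) / 11

se = standard_error(league_rate, decided_per_season)
low, high = league_rate - 2 * se, league_rate + 2 * se

print(f"one-season standard error: {100 * se:.1f} points")
print(f"rough two-sigma band: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under that assumption a single season's rate carries a standard error of nearly six percentage points, so year-to-year swings of ten points or more are not surprising on their own.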


               YEAR  THEORY_YES  THEORY_NO  TIE  N/A games  YES%
COL            2002          49         26    6          0  65.3
NON-COL GAMES  2002        1278        852  248          1  60.0
COL            2003          36         37    8          0  49.3
NON-COL GAMES  2003        1273        865  248          1  59.5
COL            2004          36         33   12          0  52.2
NON-COL GAMES  2004        1228        873  280          0  58.4
COL            2005          43         28   10          0  60.6
NON-COL GAMES  2005        1237        870  272          1  58.7
COL            2006          43         28   10          0  60.6
NON-COL GAMES  2006        1242        881  255          0  58.5
COL            2007          42         40    5          0  51.2
NON-COL GAMES  2007        1302        861  209          0  60.2
COL            2008          32         35   14          0  47.8
NON-COL GAMES  2008        1238        909  232          0  57.7
COL            2009          43         25   15          0  63.2
NON-COL GAMES  2009        1240        872  265          0  58.7
COL            2010          43         29    9          0  59.7
NON-COL GAMES  2010        1346        822  213          0  62.1
COL            2011          38         35    8          0  52.1
NON-COL GAMES  2011        1280        845  261          0  60.2
COL            2012          37         36    8          0  50.7
NON-COL GAMES  2012        1288        845  253          0  60.4


I cross-referenced a lot of this data with baseball-reference and found no discrepancies, but I'll readily admit I may be rusty, so if anyone is passionate enough to want to check my work, here it is:
https://docs.google.com/spreadsheets/d/1Z91VCEqpSDng2OJ8z1Wv9g3JMnJSCQyNaITVGSptoE4/edit?usp=sharing

Saturday, November 21, 2015

All Time Greatest Breakouts

Jeff Sullivan ran a post for Fangraphs/Fox Sports suggesting Bryce Harper's 2015 season was a breakout for the ages. This is undoubtedly true, but I ran a few queries that seem to better quantify the "breakoutness" of a breakout.

Jeff did a number of year 1/year 2 queries that look at the greatest improvements from one year to the next. This can be misleading in terms of a breakout season because the player could just as easily have had an off-year or an injury year prior to his "breakout", creating the illusion of a major jump in performance.

So here we are looking at the greatest jumps in wRC+ from a player's breakout year to his previous high. I used the same 300 PA requirement used in the article for both seasons.
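Here is a minimal Python sketch of that definition, to show how it differs from a plain year-over-year comparison. The function name, the tuple layout and the sample player are all hypothetical; the real numbers came out of a database query, not this code.

```python
def biggest_breakout(seasons, min_pa=300):
    """Find the largest jump over a player's previous career-high wRC+.

    seasons: list of (year, pa, wrc_plus) tuples for one player.
    Only seasons with at least min_pa plate appearances count, both
    as candidate breakouts and as previous highs.
    Returns a dict describing the best jump, or None if the player
    has fewer than two qualifying seasons.
    """
    qualified = sorted(s for s in seasons if s[1] >= min_pa)
    best = None
    prev_high = None  # (wrc_plus, year) of the best qualifying season so far
    for year, pa, wrc in qualified:
        if prev_high is not None:
            jump = wrc - prev_high[0]
            if best is None or jump > best["jump"]:
                best = {
                    "year": year,
                    "wrc_plus": wrc,
                    "prev_high": prev_high[0],
                    "prev_high_year": prev_high[1],
                    "jump": jump,
                }
        if prev_high is None or wrc > prev_high[0]:
            prev_high = (wrc, year)
    return best


# Hypothetical player, invented for illustration only.
player = [
    (2001, 450, 110),  # solid first full year
    (2002, 120, 160),  # hot but tiny sample, below the PA minimum
    (2003, 500, 85),   # down year
    (2004, 600, 150),  # the "breakout"
]
result = biggest_breakout(player)
print(result["year"], result["jump"], result["prev_high_year"])
```

For this made-up player a year-over-year comparison would credit 2004 with a 65-point jump (150 against 85), while the previous-high method credits 40 points (150 against the 110 from 2001), which is the off-year illusion this method is meant to remove.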

Harper falls to 29th on the list of greatest breakouts by this method. Comparing his 197 wRC+ in the 2015 season with his previous high of 137 in 2013 nets him a jump of 60 points, not quite as high as the top ten:

All time Breakouts 
 #  Name             Age  Season  wRC+   PA  Previous high wRC+  JUMP  Previous high season  Previous high PA  Years since previous high  Years since breakout
 1  Al Kaline         20    1955   156  681                  76    80                  1954               535                          1                     2
 2  Tito Francona     25    1959   168  443                  92    76                  1956               500                          3                     3
 3  Cito Gaston       26    1970   144  629                  68    76                  1969               419                          1                     3
 4  Chick Hafey       24    1927   159  394                  84    75                  1925               375                          2                     3
 5  Fernando Tatis    24    1999   141  639                  67    74                  1998               350                          1                     2
 6  Eli Marrero       28    2002   100  446                  26    74                  1999               343                          3                     5
 7  Fred Dunlap       25    1884   214  478                 142    72                  1881               369                          3                     4
 8  Clyde Barnhart    27    1923   150  386                  78    72                  1921               505                          2                     3
 9  Devin Mesoraco    26    2014   146  440                  74    72                  2013               352                          1                     3
10  Danny Thompson    25    1972   100  624                  29    71                  1970               318                          2                     2

Devin Mesoraco's showing inside the top ten is surprising to me; perhaps he's been forgotten about already since a dismal injury-plagued follow-up in 2015.

Al Kaline jumps from fifth in the article to number one here, but with just two full seasons before his breakout year.

Other recent notables include Jose Bautista's 2010 breakout at #19, Justin Turner and Brandon Moss at #26, and Alex Avila's standout 2011 tied with Harper.

And personal favorite John Lowenstein broke out in his 12th year as a major leaguer to hit for a 173 wRC+, 65 points higher than his previous best. He was 35 years old.

Here is the top 100:  https://docs.google.com/spreadsheets/d/1iF8gb76VmxzXUUjLIBs8dbkzWlVn1vSlknrkJOQCees/edit?usp=sharing


Here is a picture of a face, parts of it belonging to the author, parts to John Lowenstein:

Sunday, May 24, 2015

Doctored Ball Ejections

-- Courtesy of David Vincent

06/11/1920 Slim Sallee       CIN  Throwing doctored ball (ejected)
06/14/1922 Sam Jones         NYA  Throwing cut ball (not ejected)
07/27/1922 Dave Danforth     SLA  Throwing doctored ball (cover of ball cut, ejected)
08/01/1923 Dave Danforth     SLA  Throwing doctored ball (discolored, ejected)
08/14/1924 Bob Shawkey       NYA  Throwing doctored ball (discolored, ejected)
07/20/1944 Nelson Potter     SLA  Spit ball (ejected)
04/27/1968 Rich Nye (and others)  Spit ball (six automatic balls called in game; Durocher ejected)
08/18/1968 Phil Regan        CHN  Vaseline ball (three ejected in rhubarb; ruled twice in game)
07/14/1978 Don Sutton        LAN  Scuffed ball (ejected)
09/30/1980 Rick Honeycutt    SEA  Sandpaper and tack (ejected)
08/23/1982 Gaylord Perry     SEA  Throwing doctored ball (ejected)
08/03/1987 Joe Niekro        MIN  Emery board in back pocket (ejected)
08/10/1987 Kevin Gross       PHI  Sandpaper on glove (ejected)
10/08/1988 Jay Howell        LAN  Pine tar on glove (NLCS Game 3, ejected)
05/01/1999 Brian Moehler     DET  Sandpaper on left thumb (ejected)
06/09/1999 Byung-Hyun Kim    ARI  Heat balm on bandage in sleeve (ejected)
05/17/2003 Zach Day          MON  Glue on hand (ejected)
08/20/2004 Julian Tavarez    SLN  Pine tar on cap (ejected, 10-day suspension)
06/14/2005 Brendan Donnelly  ANA  Pine tar on glove (ejected)
10/22/2006 Kenny Rogers      DET  Foreign substance on heel of pitching hand (World Series, not ejected)
06/19/2012 Joel Peralta      TBA  Pine tar on glove (ejected before pitching)
04/23/2014 Michael Pineda    NYA  Pine tar on neck (ejected)
05/21/2015 Will Smith        MIL  Foreign substance on non-pitching arm (rosin and sunscreen; ejected)
05/23/2015 Brian Matusz      BAL  Foreign substance on non-pitching arm (ejected)