Tuesday 11 July 2017

Batted balls analysis - how would today's MLB players do in the old ballparks?

I created an addition to my previous dashboard that shows batted balls of current MLB players (2017 season, Mar - Jun) and compares them to the dimensions (including wall heights) of old ballparks. For example, here is what Aaron Judge's batted balls would have looked like in the old Yankee Stadium (a lot fewer home runs!):

See the dashboard or play with it right here:

Monday 10 July 2017

Batted ball analysis - UPDATE

I received quite a bit of feedback on the batted ball dashboard, most of it about my use of a "cricket style rope" for ballpark dimensions, which basically ignores the outfield wall.
As it was hard to disagree that this reduced the relevance significantly, I did some more work to include wall heights when guesstimating whether a batted ball would have been a HR in another park.
Basically, I calculated the height of the batted ball at the wall distance and compared it to the wall height. This still has some major caveats, as I use general ballpark dimensions, general wall heights, and my calculation of ball height at distance X is a rough estimate*. But the first results are promising in parks like Fenway, so I feel like I'm on the right track.
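
To make the idea concrete, here is a minimal Python sketch of one possible way to estimate ball height at the wall. It is not the calculation used in the dashboard. It assumes the ball follows a simple parabola that leaves the bat at the launch angle and lands at the reported hit distance, ignoring drag, spin and wind. The function names, the 3 ft contact height and the example wall numbers are all my own assumptions for illustration.

```python
import math

def height_at_distance(hit_distance_ft, launch_angle_deg, wall_distance_ft,
                       contact_height_ft=3.0):
    """Rough ball height (ft) when it has travelled wall_distance_ft.

    Assumed model: y(x) = h0 + x*tan(angle) - k*x^2, with k chosen so that
    the ball is back at ground level at hit_distance_ft.
    """
    if wall_distance_ft > hit_distance_ft:
        return 0.0  # the ball came down before reaching the wall
    slope = math.tan(math.radians(launch_angle_deg))
    k = (contact_height_ft + hit_distance_ft * slope) / hit_distance_ft ** 2
    x = wall_distance_ft
    return contact_height_ft + x * slope - k * x ** 2

def clears_wall(hit_distance_ft, launch_angle_deg, wall_distance_ft, wall_height_ft):
    """True if the estimated ball height at the wall is above the wall."""
    height = height_at_distance(hit_distance_ft, launch_angle_deg, wall_distance_ft)
    return height > wall_height_ft

# Hypothetical example: a 400 ft fly ball at a 30 degree launch angle,
# checked against a 37 ft wall that is 310 ft from home plate.
print(clears_wall(400, 30, 310, 37))
```

Because real trajectories are not parabolas, the result should be read as a rough guess, in line with the caveats above.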

The dashboard has been updated: https://public.tableau.com/profile/rj7974#!/vizhome/MLBBattedBallAnalysis2017/BattedBallsDashboard

*

Friday 7 July 2017

Batted ball analysis - ballpark factors

I am currently reading George Will's Men at Work, and at some point it mentions how hitters' successes are influenced by, or dependent on, the ballparks they play the majority of their games in. It made me wonder how many of the home runs (HRs) batters hit in a specific ballpark would be HRs in the other parks as well. I created a dashboard in Tableau to visualize this concept, and this post is about how I did it.

DATA
Scope: To keep the size of the data sources somewhat manageable, I used 2017 season data from March until June.

Batted balls: I used Baseball Savant for batted ball data (see my previous post on how to get Baseball Savant data into Tableau), filtered to records with Type = "X" (batted balls only) and a non-null Distance, as a distance value is required for the analysis.

Ballpark dimensions: I used the general park dimensions from Andrew Clem's website. Although these are limited to LF, LC, CF, RC and RF, they give a generic sense of the ballpark, which would have to be good enough for my analysis. At some point I want to refine these dimensions by calculating the spray / horizontal angle for every dimension change, but then to be really accurate I would also have to consider the wall height, so for now I accepted the general dimensions as a starting point. Here are all ballpark outlines based on these assumptions, with Fenway Park highlighted:

I divided the field into 4 areas (L/RF, LC, RC, CF) of 22.5 degrees each, and split the L/RF area into LF and RF (11.25 degrees each). I then assigned the provided values to the specific field zones, resulting in the outline per ballpark, as shown on the right.
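
As an illustration, the zone split can be sketched in Python as a lookup from spray angle to field zone. The exact boundaries are my reading of the description (fair territory from -45 to 45 degrees, straight center field at 0, negative angles to the left), and the names `ZONE_EDGES`, `ZONE_NAMES` and `field_zone` are hypothetical.

```python
from bisect import bisect_right

# Assumed zone boundaries in degrees:
# LF is 11.25 wide, LC / CF / RC are 22.5 wide each, RF is 11.25 wide.
ZONE_EDGES = [-33.75, -11.25, 11.25, 33.75]
ZONE_NAMES = ["LF", "LC", "CF", "RC", "RF"]

def field_zone(spray_angle_deg):
    """Return the field zone for a spray angle, or None for foul territory."""
    if not -45.0 <= spray_angle_deg <= 45.0:
        return None
    return ZONE_NAMES[bisect_right(ZONE_EDGES, spray_angle_deg)]

for angle in (-40, -20, 0, 20, 40, 60):
    print(angle, field_zone(angle))
```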


Calculations: Based on the X and Y of where the batted baseball landed (fields Hc X and Hc Y) and the distance provided, I calculated the horizontal or spray angle, with straight center field being 0 degrees. Whether or not a batted ball would have been a HR in a ballpark is calculated by comparing the batted ball distance and spray angle to the general ballpark outlines.
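
The comparison can be sketched in Python as follows. This is a simplified stand-in for the Tableau calculated fields, not a copy of them. The zone boundaries are my interpretation of the split described above, the two parks and their dimensions are made-up placeholders, and wall height is ignored, as in this first version of the dashboard.

```python
# Assumed zones: (name, lower bound, upper bound) in degrees, 0 = straight CF.
ZONES = [
    ("LF", -45.0, -33.75),
    ("LC", -33.75, -11.25),
    ("CF", -11.25, 11.25),
    ("RC", 11.25, 33.75),
    ("RF", 33.75, 45.0),
]

# Hypothetical parks with made-up wall distances in feet.
PARKS = {
    "Park A": {"LF": 310, "LC": 380, "CF": 400, "RC": 380, "RF": 320},
    "Park B": {"LF": 340, "LC": 370, "CF": 410, "RC": 370, "RF": 330},
}

def wall_distance(park, spray_angle_deg):
    """Wall distance for the zone the ball was hit to, or None if foul."""
    for name, low, high in ZONES:
        if low <= spray_angle_deg <= high:
            return park[name]
    return None

def would_be_hr(distance_ft, spray_angle_deg, park):
    """True if the ball travelled farther than the wall in its zone."""
    wall = wall_distance(park, spray_angle_deg)
    return wall is not None and distance_ft > wall

def hr_share(distance_ft, spray_angle_deg, parks):
    """Fraction of parks in which the ball would have been a HR."""
    hits = sum(would_be_hr(distance_ft, spray_angle_deg, p) for p in parks.values())
    return hits / len(parks)

# A 376 ft ball at a -14.5 degree spray angle, checked against both parks.
print(hr_share(376, -14.5, PARKS))
```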


CAVEATS
This dashboard does not include weather, elevation / air pressure, opposition, ball seam height and such, but strictly looks at general ballpark dimensions, also disregarding fence height (and possibly some outfielders who could have caught the ball). It is definitely my plan to continue working on this, but including these factors significantly complicates the calculations (yet also makes the result more relevant).

DASHBOARD
After bringing in the data in Tableau, I used calculated fields for actual events (results of the batted ball in the park it was hit in) and potential events (results if the ball was hit in another park, with the caveats mentioned above).
Tab 1 (batted balls dashboard) shows the actual HR top 30, and the individual batted ball results for a selected player in every ballpark, ball locations, HR ratios and HR ratios per ballpark.

For example, Paul Goldschmidt hit a sharp flyball to LF (376 ft., -14.5 degrees spray angle) that was caught for an out. This ball would have been a HR in 63% of all ballparks. The home run he hit on April 5th would have been a HR in 90% of all ballparks:

Tab 2 (ballpark dimensions) explains more about the dimensions used, and shows the batted balls of at least a selected distance for a selected player in a selected ballpark. Yes, this dashboard has a number of filters that can be changed by the user.
Sticking with Goldschmidt, it shows that his actual HRs would all have been HRs in Fenway, but of his 149 "Outs & errors" batted balls, 17 would have been HRs as well, as would 4 of his doubles:

The last tab, potential HR hitters 2017, ranks every player and his batted ball result had he played in a specific stadium the whole time; had Votto played all his games in Fenway, he would have had 44 HRs. The first non-Fenway park on the list? Minute Maid Park, again with Joey Votto.

CONCLUSION
The work I did on this dashboard mostly gave me a better understanding of the influence of ballpark dimensions on batted ball outcomes. The caveats reduce the reliability of these outcomes to the point where they are highly debatable, which is why further research is required. Using more detailed park dimensions, including fence height, and perhaps finding a way to incorporate realistic schedules (rather than 162 games in one park) will improve the significance of this analysis. But hey, if StatCast can get away with excluding walls (for now) I should be able to do so as well, no?

I do believe it gives some insight into how trades could impact player values based on the ballpark they would move to, even if the impact is not as large as this dashboard suggests.

If you like this dashboard, or have any comments or suggestions, look me up on Twitter: @rjweise
Cheers!

Tuesday 27 June 2017

Spray angle / horizontal angle for Baseball Savant data

When you first show batted balls from Baseball Savant (type = "X") in Tableau by dragging the Hc X and Hc Y fields onto the columns and rows, you see something like this:
Left and right field are correct, but the picture is upside down. You will also notice that home plate is not at 0,0 but more like 125, 200 (this does not seem to be the same for every ballpark, but it will do for this demo).

By creating two fields

X = [Hc X]-125
Y = ([Hc Y]-200)*-1

and replacing Hc X and Hc Y with them, we can easily change the picture (colour coded by hit location):


To calculate the spray angle / horizontal angle we can use Tableau's ATAN2 function (it returns the arc tangent of two given numbers, Y and X, with the result in radians):

((ATAN2([Y],[X]) / PI() * 180) - 90) * -1
where the / PI() * 180 converts the radians to degrees, and the - 90 makes the vertical axis (basically straight up from home plate) the 0 degrees line. The * -1 makes angles left of this 0-line negative, and angles right of the line positive.
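
For readers who want to check the formula outside Tableau, here is a small Python sketch of the same steps. The function name `spray_angle` is mine, and the default home plate position of (125, 200) is the rough value used above, so it is an approximation rather than an exact constant.

```python
import math

def spray_angle(hc_x, hc_y, home_x=125.0, home_y=200.0):
    """Spray angle in degrees: 0 = straight up from home plate,
    negative = left of that line, positive = right of it."""
    x = hc_x - home_x            # same as X = [Hc X] - 125
    y = (hc_y - home_y) * -1     # same as Y = ([Hc Y] - 200) * -1
    return (math.degrees(math.atan2(y, x)) - 90) * -1

print(spray_angle(125, 100))   # ball hit straight up the middle
print(spray_angle(25, 100))    # ball hit down the left field line
print(spray_angle(225, 100))   # ball hit down the right field line
```

With these inputs the three calls should come out at roughly 0, -45 and 45 degrees, which matches the sign convention described above.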

Because of pop-ups and foul balls behind the catcher you end up with some large angles, but by showing only the -45 to 45 degrees range, we get the picture we would expect:


Where am I going with this from here? Combining it with the ballpark data in this article and data from this excellent site about ballparks with the end goal to show for every hit ball in baseball savant in what park it would be a home run.

Friday 2 June 2017

Pitch elevation in 2017, and does it lead to success?

As I was listening to a number of podcasts that discussed pitch elevation in the current season, I wanted to put my Baseball Savant datasets stored on Google BigQuery into practice. So I created a view with all pitching data from Baseball Savant for the 2015 and 2016 seasons, as well as the 2017 season up to and including May, joined it with some data from the Crunch Time Baseball player map, saved it as a table in Google Cloud Storage and exported it to CSV from there. (The 1.6 million records are too many to export directly from GBQ. I did connect natively through Tableau Desktop at work, but this is not available in the Public version of Tableau.)


This simple visualization shows the average height above home plate in feet per team, type of pitcher, season and pitch type. As concluded in some of the podcasts, the Dodgers bullpen is throwing four-seam fastballs a lot higher to fight the increasing number of flyball hitters (see my "Flying it out to yonder" dashboard for more on that topic), but when looking at all pitch types Boston is actually the biggest "raiser". They also lead the pack in four-seamers thrown by starting pitchers, although they don't have the highest increase compared to 2016 (that looks like a race between the Twins, Brewers and the Marlins).
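
The aggregation behind such a view can be sketched in Python. This is not how the dashboard itself was built (that was done in BigQuery and Tableau). I am assuming the export uses the column names `plate_z`, `pitch_type`, `game_year`, `inning_topbot`, `home_team` and `away_team`. The split by starter or reliever is left out, and the sample rows are invented.

```python
from collections import defaultdict

def pitcher_team(row):
    """The pitcher plays for the home team in the top of an inning."""
    return row["home_team"] if row["inning_topbot"] == "Top" else row["away_team"]

def average_pitch_height(rows):
    """Average plate_z (ft) per (team, season, pitch type)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of heights, count]
    for row in rows:
        if row["plate_z"] in ("", None, "null"):
            continue  # skip pitches without a measured height
        key = (pitcher_team(row), row["game_year"], row["pitch_type"])
        totals[key][0] += float(row["plate_z"])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Invented sample rows, only to show the shape of the input.
sample = [
    {"home_team": "LAD", "away_team": "SF", "inning_topbot": "Top",
     "game_year": 2017, "pitch_type": "FF", "plate_z": "2.5"},
    {"home_team": "LAD", "away_team": "SF", "inning_topbot": "Top",
     "game_year": 2017, "pitch_type": "FF", "plate_z": "3.5"},
    {"home_team": "LAD", "away_team": "SF", "inning_topbot": "Bot",
     "game_year": 2017, "pitch_type": "FF", "plate_z": "2.0"},
]
print(average_pitch_height(sample))
```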

What I find interesting is that the Astros, the best team in baseball so far this season, have a decreasing average height above home plate for the bullpen throwing four-seamers, the lowest average for all pitches by RPs and by far the lowest average when including all pitchers and pitch types. With those settings the Dodgers are actually middle of the pack and declining compared to 2016.

As the Oakland A's are almost at the bottom in that graph, just above the Astros, Diamondbacks and Angels, let's hope this declining trend is the way to success in the current season!

Thursday 1 June 2017

Using raw Baseball Savant data in Tableau

In addition to my previous post on how to get Baseball Savant data out of the MLB website and into Tableau, the following shows the "code" for calculated fields in Tableau that make the BBS data more useful.

To determine the team of the batter (reverse for pitchers):
IF [Inning Topbot] = "Bot" THEN [Home Team] ELSE [Away Team] END

Identify barrels (although BBS now provides this in a column as well):
IF [Launch Speed] >= 98 AND [Launch Angle] >= 26 - ([Launch Speed] - 98)
AND [Launch Angle] <= 30 + ([Launch Speed] -98)
THEN "Barrel"
ELSE ""
END

To determine At Bats we need to first group the events as follows:


Then create a calculated field:
IF [Events At Bats] = "AB" THEN 1 ELSE 0 END









Same for Hits, On Base, and Plate Appearances:
IF [Events Hits] = "Hits" THEN 1 ELSE 0 END


IF [Events On Base] = "On Base" THEN 1 ELSE 0 END

IF [Events Plate Appearance group] = "PA" THEN 1 ELSE 0 END

From here it is easy to calculate things like:
OBP: SUM([On Base])/SUM([PA])
Batting Average: SUM([Hits])/SUM([AB])

To calculate Slugging we need to calculate total bases first:
IF [Events] = "single" THEN 1
ELSEIF [Events] = "double" THEN 2
ELSEIF [Events] = "triple" THEN 3
ELSEIF [Events] = "home_run" THEN 4
ELSE 0
END

Slugging: SUM([Total Bases])/SUM([AB])
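
As a cross-check outside Tableau, the same arithmetic can be sketched in Python. The total bases mapping mirrors the calculated field above. The set of events that do not count as an at bat is my own assumption, since the grouping used in the post is not reproduced in the text, and the input is assumed to hold one event per completed plate appearance.

```python
# Total bases per hit type, as in the Tableau calculated field.
HIT_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}

# Assumed grouping: plate appearances that do not count as an at bat.
NOT_AT_BAT = {"walk", "intent_walk", "hit_by_pitch", "sac_fly",
              "sac_bunt", "catcher_interf"}

def batting_line(events):
    """Return at bats, batting average and slugging for a list of events."""
    at_bats = sum(1 for e in events if e not in NOT_AT_BAT)
    hits = sum(1 for e in events if e in HIT_BASES)
    total_bases = sum(HIT_BASES.get(e, 0) for e in events)
    if at_bats == 0:
        return None  # no at bats, so the ratios are undefined
    return {"AB": at_bats, "AVG": hits / at_bats, "SLG": total_bases / at_bats}

# Invented example: 2 hits (5 total bases) in 4 at bats, plus a walk.
print(batting_line(["single", "home_run", "strikeout", "walk", "field_out"]))
```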

Let's double check if this is correct. I ran this for Josh Donaldson and compared it to his ESPN page:

Now that I know using the raw data from BBS works in Tableau, I can start focusing on doing some more interesting analysis and visualizations.



Wednesday 31 May 2017

From Baseball Savant to Tableau

Baseball Savant is a great site: very user friendly, interesting and good quality stats, nice visualizations, etc. Then why do I still have the need to "have this data"? As many will have noticed, the download function times out and is limited to a set number of records, somewhere in the range of 30,000. Also, there is no way that I am aware of to link "live" to the data. I have no intention of saying a bad thing about Baseball Savant, as what they do offer is great; it just doesn't always meet my needs. Combine that with a desire to learn more about R and cloud data storage, and we have a challenge: freely querying Baseball Savant data without limitations, and being able to use that from a cloud environment in tools like Tableau for analysis and visualization.

Part 1: downloading the data from Baseball Savant (I'll refer to it as BBS from here on)
As I mentioned, there are limitations on what you can download from BBS, and as I didn't want to download the data manually in chunks (like per player), I looked for a way to automate this process. Look no further than Bill Petti's awesome package for R, called baseballr. Among many other things, Bill's package lets you download BBS data in chunks, and then use R code to combine it all together.

I am very new to R / R-studio, and just started making my way through Analyzing Baseball Data with R which also has a great website: https://baseballwithr.wordpress.com/. That being said I do have coding experience in other languages, so I was able to hack together the following rather quickly:

library(baseballr)      #Bill Petti's package, provides the scrape_statcast_savant_* functions

setwd("<file path>")    #change this to the working directory you want

rm(list=ls())           #removes all datasets, vars etc from the environment

for (y in c(2015, 2016, 2017))              #loop through the years
{
  for (m in 3:10)                           #loop through the months (March - October)
  {
    for (d in 1:31)                         #loop through the days of the month
    {
      startpartD <- paste(y,"-",m,"-",d,sep="")
      endpartD <- startpartD

      #try is used so the loop continues when there is no play data for that day (or the date does not exist, like April 31). Here I'm using Bill's package
      scrapebb <- try(scrape_statcast_savant_batter_all(startpartD,endpartD),silent=T)
      if (inherits(scrapebb,"try-error")) next          #nothing was returned, so skip writing a file for this day

      Filetext <- paste("bbsavant_B_",startpartD,"_",endpartD,".csv",sep="")   #combine variables and text for the file name

      write.csv(scrapebb,file = Filetext)               #write the csv
      print(Filetext)
    }
  }
}

I know this is not the most efficient way to do this, but it works. It creates daily data sets from all batter data. Replace 'scrape_statcast_savant_batter_all' with 'scrape_statcast_savant_pitcher_all' and 'bbsavant_B_' with 'bbsavant_P_' to get and write pitcher data. The data is very similar, but there are some small differences.

From here I created folders per year for Pitchers and Batters, and moved the appropriate daily exports to the right folders. Per folder I ran the following to combine the daily data files into combined year data sets:

setwd("<file path>") #change this to the working directory you want

#Make sure to set the working directory with setwd() and clear the objects from the workspace

file_list <- list.files(pattern="\\.csv$") #Getting a list of the csv files in the directory (pattern is a regular expression)

#Merging the Files into a Single Dataframe
for (file in file_list){

  if (!exists("dataset")){
    # if the merged dataset doesn't exist, create it
    dataset <- read.table(file, header=TRUE, sep=",")
  } else {
    # if the merged dataset does exist, append to it (the else prevents the first file from being added twice)
    temp_dataset <- read.table(file, header=TRUE, sep=",")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }

}
write.csv(dataset, "combined_<year>_<Batters/Pitchers>.csv")

Part 2: querying the data in a place I can access from anywhere (aka Cloud)
Over the last year I tried numerous ways to store and access data from the cloud, of which Google Sheets worked best for me. However, when combining a few large data sources, queries and even Sheet files become slow, really too slow to work with. I didn't want to set up a server environment at home, so a hosted cloud option appeared to be the way to go. To store the data I was automatically thinking of a database, and since I am familiar with Postgres I wanted to set up a Postgres database and an admin tool to run queries.

Oh yeah, one more small requirement: free, or very low cost. This proved a hard requirement to meet. I think Amazon AWS is fairly affordable for hosting a Postgres database on a VM, but since I have been a long time user of the Google environment I checked that out first (after initially looking at other options, but not finding anything applicable) and created a Postgres database. Although that allowed me to store and query data, it felt somewhat over-complicated, and I got distracted by some other offerings from Google, specifically Google BigQuery:

"BigQuery is Google's fully managed, petabyte scale, low cost enterprise data warehouse for analytics. BigQuery is serverless. There is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights using familiar SQL."

That sounds like my thing. I have large datasets that I want to query and analyze fast, and querying will be fairly simple. And I can use GBQ for free for a year, which should give me enough time to decide if it works for me. And even better, Tableau connects natively to GBQ and Google Sheets, while Google also offers a new product called Data studio (currently in Beta, and limited compared to Tableau, but it appears to be actively improved on continuously) for data visualization.

So the Google Platform is my environment of choice, at least for the free year.

Loading the combined BBS yearly datasets is super easy, and querying uses typical SQL, which also makes it possible to link with other data sources, like Crunchtimebaseball's fantastic MLB player map. I haven't figured out yet if it is possible to create a Google sheet that pulls CrunchTimeBaseball's data as a csv on a weekly basis, and then let GBQ refresh directly from the Google sheet, but it looks promising.

So in summary I now have BBS data in a cloud environment that I can query and combine with other data sources, from anywhere.

Part 3: analysing and visualizing data in Tableau or other tools
As mentioned, Tableau Desktop connects natively to GBQ, but unfortunately this is not available in Tableau Public. Perhaps there is a way to build your own connector, or it might just be a matter of time before it becomes available. But for now, in GBQ you can create a dataset with a query, and save it as a view or to Google Sheets that Tableau Public can connect to. And lastly, Google's Data Studio can use the GBQ data natively, and looks promising.

I'll keep posting updates as I learn more about the Google Cloud platform and how it works with Tableau.

Note that I only downloaded data from 2015 onward from Baseball Savant, as the people of BBS mention the data prior to 2015 was not yet fully mature and reliable.

UPDATE: I tried loading the GBQ data into DataStudio and it works very easily:
Unfortunately embedding is currently not supported.