This is my first real project using R and Python. I am going to do some basic data analysis of NCAA Women’s Soccer, focusing on the Pac-12 conference. I do not claim to be an expert in any fashion. My favorite sport is hockey, though I do enjoy most of the other big sports. I grew up playing and watching soccer, but only really began to follow women’s collegiate soccer in the past couple of seasons. Specifically, I am a UCLA student and fan, so I will draw mostly on their roster when bringing up examples or doing individual analysis.
I am fairly new to Python and R, having entered the world of programming for data science just a few months ago. This project is meant to get my feet wet and test my capabilities. For those learning to code (or thinking about it), side projects are a great learning tool. You can apply your knowledge, through trial and error, to something you are passionate about. It is a very good way to get acclimated to a language without following step-by-step instructions from a website, book, or class.
I scraped and organized the data using Python and did the analysis in R. It is possible to do everything in a single language, but I felt most comfortable doing it this way. Acquiring and cleaning the data often presents the most challenges. Here are known issues in the data:
It’s important to be familiar with the data. Soccer is a particularly hard sport to quantify: there are very few goals per game and very few games in a season. Without advanced player tracking technology or passing data, we have very little to work with. We do have games played, games started, and minutes, which tell us about a player’s usage. Goals, assists, points (2 × goals + assists), shots, and shots on goal tell us about offensive production. We have miscellaneous categories such as game-winning goals and penalty kicks. But there are no specific measurements for defense. Using minutes and percentage of games started as a proxy can tell us how much a coach trusts a player, but it does not directly measure performance. In short, counting stats alone cannot be used to judge a player’s value. The eye test is more important in soccer than in most other major sports. However, numbers can still be applied in many situations, a few of which we will look at below. Let’s first get an overview of the conference dynamics for the 2017 season.
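For anyone who wants to see the stat definitions above in code, here is a minimal Python sketch. The function names and the sample stat line are my own inventions for illustration; they are not taken from the actual scraping or analysis scripts, and the numbers do not belong to any real player.

```python
def points(goals: int, assists: int) -> int:
    """Points as defined above: two per goal plus one per assist."""
    return 2 * goals + assists


def start_ratio(games_started: int, games_played: int) -> float:
    """Games started divided by games played (GS/GP).

    Returns 0.0 for a player who never appeared, to avoid dividing by zero.
    """
    if games_played == 0:
        return 0.0
    return games_started / games_played


def minutes_per_game(minutes: int, games_played: int) -> float:
    """Average minutes per game played; 0.0 if the player never appeared."""
    if games_played == 0:
        return 0.0
    return minutes / games_played


# Hypothetical stat line, made up for illustration only.
player = {"goals": 4, "assists": 2, "gs": 8, "gp": 10, "minutes": 650}

print(points(player["goals"], player["assists"]))
print(start_ratio(player["gs"], player["gp"]))
print(minutes_per_game(player["minutes"], player["gp"]))
```

The zero-games guard matters in practice, since rosters usually include players who never saw the field.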
Parity in college soccer is relatively low. Stanford is the top team in the country and on a tier of its own in the Pac-12. UCLA and USC can be considered the second tier. Both were legitimate contenders for a national championship. The third tier is made up of the schools that made the tournament: Arizona, Cal, Colorado, and Washington St. The bottom tier comprises the rest. A large disparity exists in the talent level between the best and the worst, resulting in very few upsets and relatively predictable games. We must be very careful when making comparisons across teams. Here is why:
There are some large discrepancies. If you’re on a good team, you’re going to have the ball a lot. Teams like Stanford and UCLA dominate possession and get a lot of high-quality opportunities (that lead to goals) while limiting their opponents’. Good players on bad teams simply do not have the quality of teammates around them to facilitate the same volume of scoring chances. Additionally, positions must be taken into account. Forwards and midfielders often have different roles. Jessie Fleming, a member of Canada’s national team and a 2016 Olympian, is described as the ‘straw that stirs the drink’ for UCLA. She finished fourth on the team with 5 points in 10 conference games. For context, UCLA’s top scorer was Hailie Mace with 15 points in 11 games. The point is that Jessie Fleming was not relied upon for offense. She played as a holding midfielder who would control possession and set up the attack from deep. Stanford’s Andi Sullivan holds a similar pedigree as one of the best players in the country and did not score at all in conference play. This is why we will not focus too much on offensive numbers and will instead explore other variables that can give more insight.
Enter the usage statistics mentioned above. Games played, games started, and minutes played contain a lot of information. Generally, good players will start most/all of their games played, and will play a lot of minutes. Of course, there will still be variability between teams and some schools may use players in situations where they otherwise wouldn’t if they had a better option. Still, usage stats are less volatile and more encompassing than pure offensive numbers. Now that you have become acquainted with NCAA women’s soccer, let’s dive right into some analysis.
How does the distribution among freshmen, sophomores, juniors, and seniors look? Pretty much what you would expect. The more experienced seniors play the most minutes on average. There are some superstar freshmen and some not-so-great seniors, which is why they all have pretty similar ranges. But the overall trend is the further along in college you are, the more minutes you’ll be playing.
Three quarters of freshmen come in as substitutes or only play about half of the game. Also note that freshmen average the fewest games played per player of the four classes, meaning that they are the most likely to serve as ‘healthy scratches’ and not play at all. There is typically an adjustment period between club soccer and college, and it’s hard to displace upperclassmen who know the system and have the experience. However, top recruits may be expected to step into a starting role immediately. Ten freshmen started at least 10 games last season, and six of them were from Stanford, UCLA, or USC. It’s no secret that the top schools usually get the top recruits. Recruiting is one of the major responsibilities of a coach and is crucial for sustained success. By sophomore year, a lot of the top talents have established themselves as starters, and the interquartile range keeps shrinking through senior year, when late bloomers see bigger minutes.
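The quartile comparisons above can be computed with nothing but Python’s standard library. This is only a sketch: the `rows` list is invented sample data (not the real Pac-12 numbers), the class labels are placeholders, and `quartiles_by_class` is a name I made up for this example.

```python
from collections import defaultdict
from statistics import quantiles

# Invented (class, minutes per game) rows, for illustration only.
rows = [
    ("FR", 45.0), ("FR", 10.0), ("FR", 80.0), ("FR", 20.0), ("FR", 35.0),
    ("SR", 60.0), ("SR", 90.0), ("SR", 40.0), ("SR", 80.0), ("SR", 70.0),
]


def quartiles_by_class(rows):
    """Group minutes per game by class and return quartiles and IQR for each."""
    grouped = defaultdict(list)
    for cls, mpg in rows:
        grouped[cls].append(mpg)

    summary = {}
    for cls, values in grouped.items():
        # n=4 yields the three cut points: Q1, median, Q3.
        # Each class needs at least two values for quantiles() to work.
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summary[cls] = {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}
    return summary


for cls, stats in quartiles_by_class(rows).items():
    print(cls, stats)
```

A narrower IQR for a class means its players’ minutes are bunched closer together, which is the pattern described for upperclassmen.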
I mentioned games started divided by games played (GS/GP) as another useful measure of usage. It correlates extremely well with minutes per game played:
Perhaps a model can be made in the future to ‘fill in’ the missing values of Stanford player minutes. Just by knowing games played and games started, you can make a pretty accurate guess of minutes per game. Taking into account position and year can paint a clear picture about the usage of a player. Let’s try it out.
Player X (UCLA): 10/11 GS/GP, senior, defense
Her ratio is ~90% GS/GP. Just looking at the graph, we can guess she probably plays between 60-80 minutes per game. This implies that she is a starter on defense for UCLA, and occupies a pretty big role on a top team in the conference.
Player X is Mackenzie Cerda. Sure enough, she plays 71 minutes per game and is an experienced starter on the back line for UCLA.
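The eyeball guess above could be turned into a simple model. Below is a minimal sketch, assuming a straight-line relationship between GS/GP and minutes per game, fitted with ordinary least squares. The training points are invented for illustration and are not the real Pac-12 data, so the output says nothing about actual players; `pearson_r`, `fit_line`, and `predict` are hypothetical helper names.

```python
from math import sqrt

# Invented (GS/GP ratio, minutes per game) pairs, for illustration only.
ratios = [0.0, 0.25, 0.5, 0.75, 1.0]
minutes = [14.0, 32.0, 44.0, 62.0, 78.0]


def _centered_sums(xs, ys):
    """Return Sxx, Syy, Sxy and both means for two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy, mean_x, mean_y


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    sxx, syy, sxy, _, _ = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    sxx, _, sxy, mean_x, mean_y = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, gs, gp):
    """Estimated minutes per game for a player with gs starts in gp games."""
    return intercept + slope * (gs / gp)


slope, intercept = fit_line(ratios, minutes)
print("correlation:", round(pearson_r(ratios, minutes), 3))
print("estimate for 10 starts in 11 games:", round(predict(slope, intercept, 10, 11), 1))
```

With real data, the same fit could be trained on players whose minutes are known and then used to estimate the missing ones. Adding position and class year as extra inputs would be the natural next step.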
A great set of information is released at the end of conference play: the All-Pac-12 Conference Teams. Officials for the conference who supposedly watch a lot of soccer come together to compile a first team, second team, and third team consisting of 36 players in total, as well as an All-Freshmen team of 12. First years are also eligible for the All-Conference teams. While this is a subjective ranking, it provides a solid baseline for cohort analysis. Here is a set of graphs distinguishing All-Conference players.