Introduction and data
The aim of the following analysis is to quantitatively evaluate finishing skill in football and to form an idea of who the best finishers are. By finishing skill we mean here the ability of a player to turn shots into goals. As simple as that. It should be noted that finishing, as defined above, is different from goalscoring, which is potentially a broader and more complicated concept. The best finisher isn’t necessarily the best goalscorer or the best forward. Let’s see what comes out!
Finding free, accessible football data is always a problem (it’s the main problem, actually …) but here and there something good appears. The data used in this analysis can be found here. The database includes club football data starting from the year 2001, covering club competitions in Europe (but not only) and gives information on various events/aspects related to a match. After initially examining the data and doing all the inevitable data wrangling/cleaning, I decided to keep the information about shots taken between the years 2006 and 2017 (12 years) by players in the 5 main European leagues (Spanish La Liga; Italian Serie A; English Premier League; German Bundesliga; and French Ligue 1). Data from before 2006 didn’t appear to have the same accuracy and completeness, so I excluded them in order not to endanger the accuracy of the analysis.
In total we have 685,953 non-penalty shots taken by 14,267 distinct players. For each shot we know the player who took it, the outcome (goal/no goal) and the location from which the shot was taken (therefore we are able to calculate the distance from the center of the goal).
Fig-1 gives an initial view of the data we are considering, by showing the total number of shots and goals for all players. From an initial look, we could divide the players in Fig-1 into three “groups”:
- The first group is composed of Cristiano Ronaldo and Messi, who have taken many more shots and have, of course, scored more than any other player. Cristiano in particular has taken an incredible number of shots and it’s interesting to note that he has scored fewer goals than Messi, despite taking 1000+ more shots.
- The second group includes some of the best forwards of the last 10-15 years in Europe, who have scored between 200 and 300 goals.
- The third group includes all the remaining players, whose numbers are more smoothly distributed.
Our data set of 685,953 shots is large and the time range of 12 years is considerable (12 years is the time frame covering 4 World Cups!!!) but, you know, data is never enough. One of the problems we face is that, although this time range includes most of the career of many players (e.g. it includes almost all of Messi’s career), it leaves out many years of football for a lot of players. For example, the data includes only 284 shots (56 goals) from Filippo Inzaghi and 484 shots (57 goals) from Alessandro Del Piero. The best years of their careers are excluded. Inevitably, the estimate of their finishing skill is expected to be affected by this and should be treated with additional caution.
This analysis is mainly based on the simple and primitive parameter of “conversion rate”, which is the ratio of goals to shots taken for each player. It obviously takes values in the 0.0 – 1.0 range (unless you can score two goals from one shot…Hi Leo!) and, in principle, the higher a player’s conversion rate, the better a finisher he is. The conversion rate for all players is shown in Fig-2. Overall, 70,085 of the 685,953 shots were scored, giving an average conversion rate of 0.102.
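As a quick check of the arithmetic, here is a minimal Python sketch of the conversion rate. The function name `conversion_rate` is just an illustrative choice; the numbers are the totals quoted in the text.

```python
def conversion_rate(goals: int, shots: int) -> float:
    """Share of shots that became goals; undefined for a player with no shots."""
    if shots <= 0:
        raise ValueError("conversion rate needs at least one shot")
    if not 0 <= goals <= shots:
        raise ValueError("goals must be between 0 and shots")
    return goals / shots


# Totals quoted in the text: all shots, then Inzaghi and Del Piero.
overall = conversion_rate(70_085, 685_953)
inzaghi = conversion_rate(56, 284)
del_piero = conversion_rate(57, 484)

print(round(overall, 3))    # 0.102
print(round(inzaghi, 3))    # 0.197
print(round(del_piero, 3))  # 0.118
```

These raw ratios are exactly the numbers that the corrections discussed next are meant to improve on.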
The conversion rate is an indicator of a player’s finishing ability but it can’t be used directly for our purpose. It has three main flaws. These flaws, along with the proposed corrections, are discussed in the following paragraphs.
1 – Firstly, not all shots are the same, i.e. they have different probabilities of being turned into goals. Some shots are easy to score and some are more difficult. Also, different players tend to take different types of shots and, because of this, directly comparing their conversion rates wouldn’t be a fair way to judge their finishing ability.
One reasonable way to deal with this is to build/apply an expected goals model to our data and then to categorize shots based on the expected goal value, allowing us to take into account the difficulty of each shot. Unfortunately, we don’t know enough parameters to do this. Instead, to categorize shots, we will use the only parameter we have: the distance from the center of the goal. This is obviously not ideal, since the shot difficulty depends on many other variables, but considering that the distance from the goal is the most influential parameter in almost every expected goals model, it makes sense to use it. This will hopefully allow us to take into account shot difficulty.
I have divided all shots into 25 categories based on their distance from the center of the goal. Each category has a range of 15m and they overlap each other by 14m (more than 93%). Namely, these categories are:
- Category 1: 0m to 15m from the center of the goal
- Category 2: 1m to 16m from the center of the goal
- Category 3: 2m to 17m from the center of the goal
- … (Categories 4 to 22 follow the same pattern, each shifted by 1m)
- Category 23: 22m to 37m from the center of the goal
- Category 24: 23m to 38m from the center of the goal
- Category 25: 24m or further from the center of the goal (the last category).
Although this categorization divides shots into discrete groups, the fact that consecutive categories overlap by more than 90 percent of their length and are offset by just 1m allows us to look at this division as a quasi-continuous one.
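The sliding categories above can be sketched in a few lines of Python. The function names and the boundary handling (lower bound inclusive, upper bound exclusive) are assumptions made for illustration; the article does not say how shots exactly on a boundary are treated.

```python
WINDOW_M = 15      # width of each category in metres (from the text)
STEP_M = 1         # offset between consecutive categories (from the text)
N_CATEGORIES = 25  # number of categories (from the text)


def category_bounds(k: int) -> tuple[float, float]:
    """Return (lower, upper) distance bounds in metres for category k (1-based)."""
    if not 1 <= k <= N_CATEGORIES:
        raise ValueError("k must be between 1 and 25")
    lower = float((k - 1) * STEP_M)
    # The last category is open-ended: 24m or further.
    upper = float("inf") if k == N_CATEGORIES else lower + WINDOW_M
    return lower, upper


def categories_for(distance_m: float) -> list[int]:
    """All categories a shot at this distance falls into.

    Assumption: lower bound inclusive, upper bound exclusive.
    """
    result = []
    for k in range(1, N_CATEGORIES + 1):
        lower, upper = category_bounds(k)
        if lower <= distance_m < upper:
            result.append(k)
    return result


print(category_bounds(1))    # (0.0, 15.0)
print(category_bounds(24))   # (23.0, 38.0)
print(categories_for(10.5))  # categories 1 to 11
```

Because the windows overlap, a single shot is counted in several categories at once, which is what makes the division behave like a quasi-continuous one.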
To see if this procedure improves our analysis, let’s have a look at the following graph.
Fig-3 shows the distribution of shots according to the distance from the goal for a few players. On the left we have all shots, while on the right we have the distribution of shots in the 0m to 15m range only. As we can observe from the histograms on the left, each player has his own distribution and shooting pattern (some take more short shots and some take more long shots). When we restrict the shot range we can still notice differences among players, but they are much smaller than when we plot the distribution of all shots. It’s not a perfect approach but it’s not a null procedure either. Replacing distance with expected goals, though, would considerably improve our analysis.
Using distance as a criterion to categorize shots is expected to make relatively more sense as we move away from the goal. Shots located near the goal can originate from various sources (e.g. cross, through ball, etc), they can be taken either with the foot or the head and you can also have special chances like one vs one with the goalkeeper, etc. All these variables influence the probability of the shot being turned into a goal. As we move away from the goal, the influence of all these parameters tends to decrease and the shot distance becomes even more clearly the main predictor. I don’t have any numbers supporting this claim, so consider it just speculation.
2 – The second problem that prevents us from directly using the conversion rate is the high variation in the number of shots taken by each player. Let’s have a look at the histogram of the number of shots per player (Fig-4). The vast majority of players have taken few shots. Actually, the median is around 6 shots/player, which is very low. Even though the later sections of this analysis exclude players with fewer than 10 shots, some correction in this regard is still needed. This is because it’s tricky to compare proportions built on different sample sizes (conversion rate is a proportion!). One player has scored 5 goals from 100 shots and another has scored 50 from 1000 shots. Who is better?
To account for the variation in the number of shots taken we use empirical Bayes estimation, a method that considerably improves the estimate of each player’s conversion rate. The idea is very simple: first we model the conversion rates in our data set with a beta distribution, which is a very appropriate distribution for parameters restricted between 0 and 1. We treat it as a prior distribution and then combine it with the individual data of each player (total number of shots and goals) to get an updated estimate of the conversion rate. As we’ll see later, this allows us to build a distinct distribution that represents the conversion rate of each player.
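To make the idea concrete, here is a minimal Python sketch of empirical Bayes shrinkage with a beta prior. It is only an illustration: the prior is fitted with the method of moments (an assumption, since the article does not say how the beta distribution was fitted), all function names are made up for this example, and the prior Beta(10, 90) used at the end is a hypothetical one with mean 0.10, not the fitted prior from the data.

```python
from statistics import mean, pvariance


def fit_beta_prior(rates):
    """Method-of-moments fit of Beta(alpha0, beta0) to observed conversion rates.

    Assumption: moments matching is used here for simplicity; a maximum
    likelihood fit would be another option.
    """
    m = mean(rates)
    v = pvariance(rates)
    if not 0 < v < m * (1 - m):
        raise ValueError("variance must be positive and smaller than m * (1 - m)")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def posterior_params(alpha0, beta0, goals, shots):
    """Beta prior + binomial data -> Beta posterior (conjugate update)."""
    return alpha0 + goals, beta0 + shots - goals


def shrunk_rate(alpha0, beta0, goals, shots):
    """Posterior mean of the conversion rate."""
    a, b = posterior_params(alpha0, beta0, goals, shots)
    return a / (a + b)


# Hypothetical prior with mean 0.10 (NOT the prior fitted in the article).
ALPHA0, BETA0 = 10.0, 90.0

# The two players from the text: both have a raw conversion rate of 0.05.
print(shrunk_rate(ALPHA0, BETA0, 5, 100))    # 15 / 200 = 0.075
print(shrunk_rate(ALPHA0, BETA0, 50, 1000))  # 60 / 1100, about 0.0545
```

Under this hypothetical prior, the player with 100 shots is pulled strongly towards the average, while the player with 1000 shots barely moves. The more shots we have, the more we trust the player’s own numbers.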
3 – Thirdly, as we previously noted, the overall average conversion rate is around 0.10, but it doesn’t stay constant across the range of shots taken per player. Looking again at Fig-2, we notice that players with a lot of shots tend to have a higher conversion rate. We need to include this information in our analysis, and for that we will use beta-binomial regression, a technique that basically incorporates the total number of shots into the construction of the prior distribution.
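A rough sketch of what such a model can look like follows. One common way to write it (an assumption here, not necessarily the exact parameterization used for this analysis) is to let the prior mean grow linearly with the logarithm of the number of shots, mu = mu0 + mu_shots * log(shots), with a dispersion parameter sigma so that alpha0 = mu / sigma and beta0 = (1 - mu) / sigma. The coefficient values at the bottom are hypothetical and only serve to show the mechanics; in practice they would be found by maximizing the log-likelihood.

```python
from math import lgamma, log


def prior_params(shots, mu0, mu_shots, sigma):
    """Beta prior whose mean depends on the player's number of shots."""
    if shots <= 0 or sigma <= 0:
        raise ValueError("shots and sigma must be positive")
    mu = mu0 + mu_shots * log(shots)
    if not 0 < mu < 1:
        raise ValueError("prior mean must lie strictly between 0 and 1")
    return mu / sigma, (1 - mu) / sigma


def log_beta(a, b):
    """Logarithm of the beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_loglik(players, mu0, mu_shots, sigma):
    """Log-likelihood of (goals, shots) pairs under the model.

    The binomial coefficient is dropped because it does not depend on
    the parameters being fitted.
    """
    total = 0.0
    for goals, shots in players:
        a0, b0 = prior_params(shots, mu0, mu_shots, sigma)
        total += log_beta(a0 + goals, b0 + shots - goals) - log_beta(a0, b0)
    return total


def shrunk_rate_with_trend(goals, shots, mu0, mu_shots, sigma):
    """Posterior mean, shrunk towards the shots-dependent prior mean."""
    a0, b0 = prior_params(shots, mu0, mu_shots, sigma)
    return (a0 + goals) / (a0 + b0 + shots)


# Hypothetical coefficients, for illustration only.
MU0, MU_SHOTS, SIGMA = 0.05, 0.01, 0.002

print(shrunk_rate_with_trend(5, 100, MU0, MU_SHOTS, SIGMA))
print(shrunk_rate_with_trend(50, 1000, MU0, MU_SHOTS, SIGMA))
```

The only difference from plain empirical Bayes is that each player gets his own prior: a player with many shots is shrunk towards a higher expected conversion rate than a player with few shots.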
The following section shows the application of the above-described methodology to the 1st of the 25 categories into which we divided shots.
Let’s go step-by-step and find out who the best finishers are for shots within the 1st category (shots in the 0m to 15m range).
Fig-5 shows shots and goals for each player while Fig-6 shows shots and the conversion rate, confirming that the conversion rate tends to rise with the number of shots taken.
In order to improve the estimation of the conversion rate for each player, we apply the empirical Bayes estimation and the beta-binomial regression methods. Fig-7 shows the histogram of the conversion rate (after we filter for players with 200+ shots, with the aim of having a more accurate representation of the conversion rate) and the fitted beta distribution, which we use as a prior. It doesn’t look like we have a very good fit here but the fit improves considerably as we move through the next shot categories.
This prior distribution and the data (shots and goals) of each player will allow us to build a distinct distribution of the conversion rate for each player. The properties of these distributions are what we will use to evaluate each player’s finishing ability.
The two graphs in Fig-8 show how the conversion rate of each player is transformed. What happened here is that all conversion rates moved towards the trend line, considerably narrowing the initial range of values. The role of the beta-binomial regression procedure here is that it makes it possible to shrink the conversion rates towards the trend line. Without it, the conversion rates would have shrunk towards the horizontal line of the overall average conversion rate.
Fig-9 is a combination of the two graphs of Fig-8. The vertical lines here show the degree of transformation of the initial conversion rates. As we can see, not all the players are influenced in the same way: players with a lot of shots (and also players close to the trend line) are only slightly affected, since the initial estimate of their conversion rate is much more representative.
We can now plot posterior distributions of the conversion rate for all the players in our database. We can use these distributions not only to judge a player’s finishing ability but also to compare and rank players based on this skill. For example, Fig-10 shows the posterior distributions of the conversion rate for three well-known players: Džeko, Icardi and Salah. The dashed curve represents the prior, which is basically the distribution of the average conversion rate of all players, i.e. that’s how an average finisher is expected to perform. If a player’s curve is to the right of the prior it means that he is probably a better finisher than the average, and the further to the right the better. On the other hand, players whose curve is positioned to the left of the prior (e.g. Džeko) are probably worse finishers than the average. The height and width of the curve are related to the number of shots taken by a player. Džeko’s curve being higher and narrower means that he took more shots than Icardi and Salah.
One positive aspect of representing conversion rates as a distribution (instead of as a single value) is that it makes comparisons between players more meaningful. Looking at Salah’s and Icardi’s curves, we might infer that Salah is probably a better finisher than Icardi. Only probably though…everything is expressed in probabilistic terms. The two curves overlapping means that there is also a small probability that Icardi is a better finisher. We can calculate these probabilities by using the properties of each distribution. The probability that Icardi is a better finisher than Salah is just 2.8%, while the probability that Salah is a better finisher than Icardi is 97.2%.
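A probability like this can be approximated by simulation: draw many values from each player’s posterior and count how often one beats the other. The sketch below does this in Python with the standard library. The posterior parameters are hypothetical placeholders (they are not the actual posteriors of Salah or Icardi), and simulation is just one possible way to do the calculation.

```python
import random


def prob_a_better(a_params, b_params, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_A > rate_B) for two beta posteriors.

    a_params and b_params are (alpha, beta) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(n_draws):
        if rng.betavariate(*a_params) > rng.betavariate(*b_params):
            wins += 1
    return wins / n_draws


# Hypothetical posteriors: player A around 0.20, player B around 0.17.
player_a = (60.0, 240.0)
player_b = (85.0, 415.0)

p = prob_a_better(player_a, player_b)
print(f"P(A better than B) is about {p:.3f}")
print(f"P(B better than A) is about {1 - p:.3f}")
```

The two probabilities always add up to 1, just like the 2.8% and 97.2% quoted above.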
Conversion rate distributions are very effective in visually comparing two or three players to each other but if we want to compare more players it gets difficult because curves will eventually overlap a lot. We could avoid this by building credible intervals. Credible intervals show the range of values within which a player’s conversion rate lies with a certain predefined probability (this predefined probability can be set to 50%, 90%, 95%, 99%, etc). Fig-11 shows the top 10 finishers for shots in the 0m to 15m range, specifying for each player the median conversion rate and the 90% credible interval. There are some surprising names on that list, but things start to become more “normal” as we move away from the goal.
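For illustration, a credible interval can be read off a beta posterior by sampling from it and taking quantiles of the draws. The Python sketch below uses only the standard library; the posterior parameters are hypothetical, and the sampling approach is an assumption chosen for simplicity (exact quantiles, e.g. from SciPy’s `scipy.stats.beta.ppf`, would work as well).

```python
import random


def credible_interval(alpha, beta, level=0.90, n_draws=50_000, seed=0):
    """Return (lower, median, upper) of an equal-tailed credible interval.

    Approximated from sorted draws of Beta(alpha, beta).
    """
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_draws))
    tail = (1 - level) / 2
    lower = draws[int(tail * n_draws)]
    median = draws[n_draws // 2]
    upper = draws[int((1 - tail) * n_draws) - 1]
    return lower, median, upper


# Hypothetical posterior for a player with a conversion rate around 0.20.
low, med, high = credible_interval(40.0, 160.0, level=0.90)
print(f"median {med:.3f}, 90% credible interval [{low:.3f}, {high:.3f}]")
```

A player with more shots has larger posterior parameters and therefore a narrower interval, which is the same effect as the higher and narrower curves in Fig-10.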
Full application and results
The previous section presented the application of our methodology on the 1st out of the 25 categories in which we divided shots. Now we will repeat the same procedure on all shots and, to avoid repetition, we won’t go into details again.
Let’s see how the number of shots and goals changes from one shot range to another. If we plot shots and goals for all players, separately for each of the categories, we get Fig-12. Looking at the graph, it’s quite nice how the slope of the trend line gradually decreases as we move away from the goal, reflecting the decrease in the conversion rate. We can also notice Messi and Cristiano Ronaldo fairly close to each other and separated from the rest of the players (those two dots in the upper part of each scatter), especially in the categories that cover shots close to the goal. As we move away from the goal though, we notice that Messi’s shots decrease until he eventually joins the rest of the players. Cristiano’s shots, on the other hand, remain almost constant as the shooting distance increases. It’s just incredible how many long-range shots he has taken…more than 1,300 shots from a distance of 24m or greater. Let’s see how this influences the estimation of his finishing ability.
So, for all the above shot ranges we applied the same procedure as in the detailed example we previously showed. As a result, for each player and for each shot range we get a distribution that characterizes their conversion rate (i.e. their finishing ability). Now let’s sum up the results!
Fig-13 shows the median conversion rates for all players and for all shot ranges. I have highlighted Messi, Higuaín and Ribéry, as the only players who hold 1st place in at least one of the categories. With the exception of a few categories, Messi’s median conversion rate line is completely isolated from all other players. To illustrate it better let’s look at Fig-14 (which basically shows the same data as Fig-13 but visualized differently). We have separated the shot ranges and plotted the median conversion rates of all players as a histogram and then added the overall median and Messi’s median.
In order to see how some of the best finishers and some of the main attackers of the last 12 years rank according to their finishing ability, I built 3 graphs in which, for each of the shot ranges, players are ranked according to their median conversion rate (Fig-15, Fig-16 and Fig-17). Please note that, in these graphs, the vertical axis is in logarithmic scale, since even a single player’s rank can change by a few orders of magnitude (the number of players for each range is up to almost 5,000).
It’s just incredible how much Messi dominates this stat. A few other players manage to be good at some of the shot ranges but their rank inevitably and dramatically drops in other ranges. Higuaín and Lacazette are probably the players whose finishing ability is closest to Messi’s, although their rank drops a lot in long-range shots (shots longer than 20m). Salah and Griezmann show a more or less similar pattern but with a worse ranking.
Other players show different trends. Del Piero and Dybala, for example, have a remarkably similar pattern, ranking far behind in short-range shots and then gradually improving as we move away from the goal. Cristiano Ronaldo’s ranking looks somewhat similar, although his rank in short and mid-range shots isn’t as bad as Del Piero’s or Dybala’s, as he ranks between the 40th and 190th place (Del Piero and Dybala can rank as low as the 400th and 500th place). The three of them eventually join the top 10 finishers in long-range shots. Also, Ibrahimović’s rankings are not so good for short-range shots but for mid/long-range shots he improves spectacularly, becoming the 2nd best finisher after Messi.
Just for illustration, in Fig-18 I built a Messi vs Cristiano comparison of the posterior distributions of the conversion rate, for all shot ranges. The difference is so huge that their curves barely overlap. They become comparable in the long-range shots, where Messi’s curve is still notably to the right.
As we previously explained, we can actually calculate the probability that Messi is a better finisher than Cristiano Ronaldo (and vice versa) for each shot range. These values are shown in Fig-19. This may look harsh on Cristiano but that’s how the numbers add up. Both players have very narrow curves, due to the large number of shots they have taken, and that limits Cristiano’s chances.
So, since all the above comparisons are made for each shot range, it would be nice to have a single ranking for all shots. For this I plotted (Fig-20) each player’s rank for each shot range (grey dots) and then ranked them according to the median rank (red dots). It’s not such a rigorous method but it will do…we are doing this just for fun.
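The median-rank idea can be sketched in Python as follows. The player names and ranks below are made-up placeholders, not values from the analysis, and the function name is only illustrative.

```python
from statistics import median


def overall_ranking(ranks_by_player):
    """Sort players by the median of their ranks across shot ranges.

    ranks_by_player maps a player name to the list of his ranks,
    one per shot range. Returns (name, median_rank) pairs, best first.
    """
    medians = {
        name: median(ranks)
        for name, ranks in ranks_by_player.items()
        if ranks  # skip players with no ranked shot range
    }
    return sorted(medians.items(), key=lambda item: item[1])


# Hypothetical ranks over three shot ranges.
example = {
    "Player A": [1, 2, 1],
    "Player B": [3, 1, 40],
    "Player C": [2, 50, 60],
}

for name, med_rank in overall_ranking(example):
    print(name, med_rank)
```

Using the median rather than the mean keeps a single very bad shot range (like Player B’s 40th place) from dominating a player’s overall position.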
Fig-20 shows the top 20 finishers since 2006. Actually, before posting the article I asked my followers on Twitter who they thought was the best finisher and I got very good suggestions (if I compare them to our final list). Apparently finishing skill is perceivable and many people have a very good intuition in this regard.
Many people on Twitter suggested Messi as the best finisher but a few of them noted that “Messi seems too obvious”. When I started this analysis my idea was that he would end up as a very good finisher but I was almost sure that 5-10 players would perform better than him. According to the analysis we conducted, he is the best finisher, by far. Now that I think about it, what kind of finisher scores 91 goals in one year?!
Some other frequent suggestions included Icardi, Agüero, Higuaín, Cavani. These are what we might call “the obvious good finishers”. Higuaín represents a very unfortunate case. One of the most clinical finishers of the 21st century and failing to deliver in those finals with Argentina…
If a Top 20 list doesn’t make you happy, I have a Top 200 finishers list (Fig-21). It doesn’t really make sense to rank so many players but anyway, here it is. You should not read this list literally…it’s just useful to get a general idea.