The 10 Steps of the Data Analysis Pipeline

While the concept of a data analysis pipeline may not initially excite you, the hours lost each week in sporting organisations to data entry and re-running reports highlight the need for a clearer understanding of its importance. In this article, we delve into the 10 essential steps of a data analysis pipeline, illustrating how transitioning each step to computer code can lead to a well-oiled system that gets you back on the tools and interacting with your athletes.

1. Data Collection

Without meaningful data, you cannot generate meaningful insights.

Have you ever completed a data stocktake? Can you succinctly articulate why you require each data source? If not, this should be your first step. A typical high-performance unit often has at least half a dozen data sources. These sources might include GPS units, force plates, velocity-based training tools, match statistics, and athlete self-reported questionnaires. It's easy for data to be collected for data's sake without progressing any further than that. Therefore, it's crucial to have a clear purpose for each data point collected, ensuring that all data serves a strategic role in driving performance and outcomes.
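To make the stocktake concrete, here's a minimal sketch of a data inventory kept as a small table in pandas; the sources, purposes, owners, and frequencies shown are purely illustrative, and a shared spreadsheet works just as well as code:

    # A minimal data-stocktake sketch: each source is listed alongside the
    # decision it informs and who owns it. All entries are illustrative.
    import pandas as pd

    data_inventory = pd.DataFrame({
        "source": ["GPS units", "Force plates", "VBT device",
                   "Match statistics", "Wellness questionnaire"],
        "purpose": ["External load monitoring", "Neuromuscular fatigue",
                    "Strength prescription", "Performance review",
                    "Internal load and readiness"],
        "owner": ["Sports scientist", "S&C coach", "S&C coach",
                  "Analyst", "Sports scientist"],
        "frequency": ["Each session", "Weekly", "Each gym session",
                      "Each match", "Daily"],
    })

    print(data_inventory)

If you can't fill in the 'purpose' column for a source, that's a strong hint it is being collected for data's sake.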

2. Data Compilation

'Matthew Brown'; 'Matthew J. Brown'; 'Brown, Matthew'; 'Matt Brown'; 'M Brown'

Have you tried to join datasets and faced a situation like this? For every new data source, a new potential for errors arises when merging data. To combat this, maintaining a single source of truth, such as an ID database, is crucial. An ID database lists the unique identifier number for each athlete in each data source. By using this ID database as an intermediary, the necessity for names to be perfectly aligned across all data sources is reduced, simplifying the data merging process and minimising errors.
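As a rough illustration, here's how that intermediary might look in Python with pandas; the file names and column names (athlete_id, source, source_name) are assumptions for the sketch, not a prescribed schema:

    # A minimal sketch of joining two exports through an ID database.
    # File and column names are illustrative assumptions.
    import pandas as pd

    # One row per athlete per source, mapping each source's spelling of the
    # name to a single universal athlete_id
    id_db = pd.read_csv("id_database.csv")    # columns: athlete_id, source, source_name

    gps = pd.read_csv("gps_export.csv")       # columns: source_name, total_distance, ...
    wellness = pd.read_csv("wellness.csv")    # columns: source_name, sleep_quality, ...

    # Attach athlete_id to each export via that source's own naming convention
    gps = gps.merge(id_db.query("source == 'gps'"), on="source_name", how="left")
    wellness = wellness.merge(id_db.query("source == 'wellness'"), on="source_name", how="left")

    # The two sources now join cleanly on athlete_id rather than on
    # inconsistent names (a real pipeline would usually join on date as well)
    combined = gps.merge(wellness, on="athlete_id", how="inner", suffixes=("_gps", "_wellness"))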

3. Data Wrangling

Life tip: NEVER use 999 (or any other number) to indicate missing values.

So you've joined the datasets together, and now you can run your analysis, right? Unfortunately, it's not often this easy. Two of the most common data wrangling tasks are removing duplicate entries and treating missing values. In statistical software like SPSS, a dummy value such as 999 was often used to indicate a missing value. However, when these data files are read into programming languages like R or Python, there is no indication of what the 999 represents. As a result, it's best to leave missing values as simply empty cells. This approach ensures clarity and makes it easier to handle missing data appropriately during analysis.
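In pandas, both tasks take only a line or two; the file name and the 999 sentinel below are placeholders following the SPSS example above:

    # A minimal wrangling sketch: drop duplicates and treat sentinel values as missing.
    import numpy as np
    import pandas as pd

    # Tell pandas up front that 999 means "missing" so it never enters the analysis
    df = pd.read_csv("legacy_spss_export.csv", na_values=[999, "999"])

    # Remove exact duplicate rows (e.g. the same session uploaded twice)
    df = df.drop_duplicates()

    # If a sentinel slipped through anyway, convert it to a true missing value
    df = df.replace(999, np.nan)

    # Check how much is actually missing before deciding how to treat it
    print(df.isna().sum())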

4. Data Exploration

There's a gorilla hiding in your data!

Ever jumped straight to answering a question without first considering what your data looked like? Well, it turns out you're not alone. In 2020, Yanai & Lercher provided college students with a data set: half were tasked with testing a specific hypothesis about the data, while the other half were simply asked to explore the data set and come to their own conclusions.

The catch?

When plotted, the data points traced out the figure of a gorilla, and those tasked with testing a specific hypothesis were 5 times less likely to find it. This illustrates the risk of tunnel vision when focusing too narrowly on a predetermined question. To avoid missing critical insights, approach your data with an open mind and a healthy dose of scepticism. Explore your data thoroughly before jumping to conclusions to ensure you don't overlook any hidden gorillas.
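A quick summary and a scatter plot are usually enough to surface anything unexpected; the file and column names in this sketch (total_distance, top_speed) are illustrative:

    # A minimal exploration sketch: summarise and plot before testing anything.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("combined_athlete_data.csv")

    # Quick numerical overview: ranges, means, counts, obvious outliers
    print(df.describe())

    # A simple scatter plot is often all it takes to spot the gorilla
    df.plot.scatter(x="total_distance", y="top_speed")
    plt.show()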

5. Data Manipulation

"A verb is an action or 'doing' word" - Grade 2 classroom teacher

Now that we have ensured there are no hidden issues in our data, it’s time to shape it by filtering, sorting, aggregating, and creating new variables. Remember your 2nd-grade grammar lessons about 'verbs'? In data manipulation, verbs are equally essential. In R, we use verbs like select, filter, mutate, summarise, and arrange from the dplyr package. In Python, we use similar verbs in pandas, such as loc or iloc for selection, query for filtering, assign for creating new variables, and groupby combined with agg for aggregation. By mastering these verbs in your chosen language, you can efficiently transform and analyse your data.
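Here's how those pandas verbs chain together in practice; the column names (position, total_distance, high_speed_running) are made up for the example:

    # A minimal sketch chaining the pandas "verbs": select, filter, mutate,
    # group, summarise, and arrange. File and column names are illustrative.
    import pandas as pd

    df = pd.read_csv("gps_sessions.csv")

    summary = (
        df.loc[:, ["athlete_id", "position", "total_distance", "high_speed_running"]]  # select
          .query("position == 'Forward'")                                              # filter
          .assign(hsr_pct=lambda d: d["high_speed_running"] / d["total_distance"])     # mutate
          .groupby("athlete_id")                                                       # group
          .agg(mean_distance=("total_distance", "mean"),                               # summarise
               mean_hsr_pct=("hsr_pct", "mean"))
          .sort_values("mean_distance", ascending=False)                               # arrange
    )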

6. Data Modelling

If all else fails, just use a linear model.

1-way ANOVA, paired t-test, repeated-measures ANOVA, linear regression, independent t-test... If these terms bring back haunting flashbacks to STAT101, you're in good company. But did you know that, under the hood, they are all essentially the same thing: a linear model? Each one predicts a continuous dependent variable and assumes normally distributed residuals; the only difference lies in the number and type of independent variables. Once you break free from the analysis paralysis of choosing the "right" statistical test and understand these tests holistically as linear models, learning more complex models like generalised linear models and mixed models won't seem so daunting. This perspective can simplify your approach to data modelling, allowing you to focus on the relationships in your data rather than getting bogged down by statistical jargon.
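You can see the equivalence for yourself. In this sketch (illustrative file and column names, with statsmodels and SciPy assumed to be installed), an independent t-test and a one-predictor linear model return the same t statistic and p-value:

    # A minimal sketch: an independent t-test is a linear model with one
    # two-level categorical predictor. Data and column names are illustrative.
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    df = pd.read_csv("jump_tests.csv")   # columns: group ("A" or "B"), jump_height

    # Classic independent t-test
    t, p = stats.ttest_ind(df.loc[df.group == "A", "jump_height"],
                           df.loc[df.group == "B", "jump_height"])

    # The same comparison expressed as a linear model: jump_height ~ group
    model = smf.ols("jump_height ~ group", data=df).fit()
    print(model.summary())   # the group coefficient's t and p match the t-test (up to sign)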

7. Data Visualisation

A picture can tell a thousand words, but make sure they're the right ones.

Ever heard of graph crimes? It's a fun rabbit hole to explore, and it teaches you what not to do. Here are a few simple steps to avoid ending up on one of those pages:

  1. Type of Graph: Different data requires different types of graphs. From Data to Viz offers excellent decision trees to help you choose the right graph for your data.

  2. Real Estate is Expensive: Evaluate every aspect of your graph. Ask yourself if each element is necessary. Can you incorporate your legend into the title by colouring words? Are all the grid lines needed? In nearly every case, a simple bar chart is more readable than a 'FIFA-style' radar chart.

  3. Colouring: Keep your colour scheme consistent. If you pick colours for certain groups, stick with them throughout. Coolors helps you choose eye-friendly colour palettes, including options for those with different types of colour blindness.

By keeping these three principles in mind, you'll greatly improve your data storytelling.
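As a rough sketch, here's what those three principles look like applied to a simple bar chart in matplotlib; the athlete names, values, and hex colour are invented for illustration:

    # A minimal sketch: a plain bar chart, the "legend" folded into a coloured
    # title, and chart junk stripped out. Data and colours are illustrative.
    import matplotlib.pyplot as plt

    athletes = ["Brown", "Singh", "Nguyen", "Taylor"]
    distance = [8200, 7900, 7400, 6900]

    fig, ax = plt.subplots()
    ax.bar(athletes, distance, color="#2a9d8f")                        # one consistent colour
    ax.set_title("Total distance (m), last match", color="#2a9d8f")    # title doubles as legend
    for side in ("top", "right"):                                      # remove unneeded lines
        ax.spines[side].set_visible(False)
    ax.grid(False)
    plt.show()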

8. Data Dissemination

Present the results. Use words if necessary.

During my undergraduate statistics course, I received a grade of 55% on a paper I felt I had aced. I assumed I must have made some calculation errors, but upon reviewing my paper, I found all my calculations were correct. When I asked my professor why I lost marks, they replied, "There is no point in learning statistics if you cannot convey the results to someone else." Initially, I felt aggrieved, but I learned a crucial lesson: no matter how perfect your analyses and visualisations are, if you cannot interpret and explain the results clearly, your work is worthless. If you have done the hard work in the previous three steps, the dissemination should simply be the narrative that ties the analyses and visualisations together.

9. Data Interactivity

If you build the dashboard, they will come.

Once your new reports, filled with robust analyses and beautiful data visualisations, are ready, you'll likely encounter an influx of data requests. Coaches will want to explore different aspects of the data, ask new questions, and gain insights for their specific needs. This is where the power of interactive dashboards comes into play. Interactive dashboards empower coaches to take ownership of the data, freeing you up to get back to what you love rather than writing more reports. While a simple Excel spreadsheet can sometimes accomplish this, learning frameworks such as Shiny in R or Streamlit in Python provide much more power and flexibility. If you don't have any coding ability, no-code options such as Tableau and Power BI make these capabilities accessible to the mass market as well.
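To give a feel for how little code a basic dashboard needs, here's a minimal Streamlit sketch; the CSV name and column names are placeholders, and a real dashboard would obviously go further:

    # A minimal Streamlit sketch (save as dashboard.py, run: streamlit run dashboard.py).
    # File and column names are illustrative assumptions.
    import pandas as pd
    import streamlit as st

    df = pd.read_csv("combined_athlete_data.csv")

    st.title("Athlete monitoring dashboard")

    # Let coaches filter the data themselves instead of emailing you
    athlete = st.selectbox("Athlete", sorted(df["athlete_name"].unique()))
    metric = st.selectbox("Metric", ["total_distance", "top_speed", "sleep_quality"])

    subset = df[df["athlete_name"] == athlete]
    st.line_chart(subset.set_index("date")[[metric]])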

10. Data Automation

If something can be automated, it should be automated.

With your dashboards up and running, it's time to set the wheels in motion and ensure that the entire pipeline hums along with minimal intervention. Most data collection tools, such as GPS units and force plates, offer APIs that allow different software to communicate and exchange data. If you've already completed the earlier pipeline steps in code (using R or Python, for example), it's a relatively straightforward process to schedule regular tasks on your computer that fetch the latest data and feed it into your dashboards. Setting this up may seem tedious, but once you tally the hours spent producing manual reports each week, you'll see that automating as many processes as possible quickly pays for itself.
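As a final, heavily simplified sketch, the script below pulls the latest data from a hypothetical vendor API and refreshes the file your dashboard reads; the URL, token, and response format are invented, so you would adapt it to your provider's documented API and then schedule it with cron (macOS/Linux) or Task Scheduler (Windows):

    # A minimal automation sketch: fetch new data and refresh the dashboard's file.
    # The endpoint, token, and response shape are hypothetical placeholders.
    import pandas as pd
    import requests

    API_URL = "https://api.example-gps-vendor.com/v1/activities"   # hypothetical endpoint
    TOKEN = "YOUR_API_TOKEN"

    response = requests.get(API_URL,
                            headers={"Authorization": f"Bearer {TOKEN}"},
                            params={"since": "2024-01-01"})
    response.raise_for_status()

    latest = pd.DataFrame(response.json())
    latest.to_csv("combined_athlete_data.csv", index=False)   # the file the dashboard reads
    print(f"Fetched {len(latest)} rows")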


Interested to learn more?

Click the button below to join our waitlist for the Sports Analytics Academy. You’ll receive FREE access to our ‘Fundamentals of Data Analysis’ course, in which we explore the 10 steps outlined in this post. Click below to sign up now.
