I passed my pre-employment checks today. The email from NHS England came in, the start date is confirmed, and in June I'll walk into the Northwest Manchester Office as a Data Analyst intern on the 10,000 Black Interns programme.

I've been sitting with it for a few hours now, trying to write something that isn't just a gratitude post. Gratitude is there, obviously. But what I actually want to do is look back at the thing that got me here, because I think the story is more useful than the announcement.

The thing that got me here was an NHS GP appointment no-show dataset, a lot of late nights in MySQL Workbench and Google Colab, and a finding that genuinely surprised me.

The problem nobody was talking about in my circle

Missed GP appointments cost the NHS about £1.22 billion a year. Forty point six million appointments, booked and never attended. Roughly £100 million evaporating every single month.


When I first saw those numbers I had to re-check them. Not because they seemed wrong, but because I couldn't understand how a number that large wasn't a bigger part of the public conversation. We talk about NHS funding. We talk about waiting lists. We don't really talk about the fact that one in every twenty-three appointments ends up as a wasted slot a sick person could have filled.

That was the hook. I wasn't chasing a Kaggle gold medal. I wanted to understand a real problem in a system I care about, using the public data NHS England already publishes every month.

Phase one: just SQL, and just questions

I started the project with MySQL and 52,855 aggregated rows covering 920 million appointments between July 2023 and November 2025. No machine learning. No fancy dashboards. Just a database, thirty-odd queries, and a list of questions I kept adding to as each answer opened up three more.

The overall DNA rate came out at 4.42%. Fine, that's the baseline. But the moment I broke it down by appointment mode, the picture changed completely.

Face-to-face appointments: 5.56% DNA rate. Video and online appointments: 0.46% DNA rate. That's a twelve-fold difference.

Twelve times. I remember running the query three different ways because I didn't trust the first answer. Same result each time.

When you sit with a number like that, it stops being a statistic and starts being a strategy. If the NHS shifted even a modest percentage of suitable face-to-face appointments to video, the reduction in wasted capacity would be measured in hundreds of millions of pounds.
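The mode breakdown is the kind of query I mean. Here's a toy reconstruction, using Python's built-in sqlite3 so the sketch is self-contained (the real work was in MySQL, and the table schema and counts below are invented to mirror the rates above, not the actual NHS columns):

```python
import sqlite3

# Toy version of the aggregated appointments table. Column names and
# counts are illustrative, not the real NHS Digital schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE appointments (
        appointment_mode TEXT,
        attended INTEGER,   -- attended appointments in this row
        dna INTEGER         -- did-not-attend appointments in this row
    )
""")
conn.executemany(
    "INSERT INTO appointments VALUES (?, ?, ?)",
    [
        ("Face-to-Face", 9444, 556),   # ~5.56% DNA
        ("Video/Online", 9954, 46),    # ~0.46% DNA
    ],
)

# DNA rate per appointment mode: the shape of the query behind
# the twelve-fold finding.
rows = conn.execute("""
    SELECT appointment_mode,
           ROUND(100.0 * SUM(dna) / (SUM(dna) + SUM(attended)), 2) AS dna_rate
    FROM appointments
    GROUP BY appointment_mode
    ORDER BY dna_rate DESC
""").fetchall()

for mode, rate in rows:
    print(f"{mode}: {rate}%")
```

The aggregation matters: you sum the counts first and divide once, rather than averaging per-row rates, which would weight tiny practices the same as huge ones.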

Other things fell out of the SQL work. October is consistently the worst month of the year, every year, across the three years in the data. Appointments with Other Practice Staff have nearly double the DNA rate of GP-led appointments (6.2% vs 3.2%). Routine consultations get missed almost twice as often as acute ones, because people show up when they feel ill and deprioritise preventive care.

Each of those is an intervention waiting to be designed.

Phase two: Python, and where the statistics started mattering

Once I'd squeezed what I could out of SQL, I moved the analysis into Python. Pandas, NumPy, a lot of matplotlib and seaborn, and Google Colab for the compute.

This is where the project stopped being description and started being analysis. I engineered features the raw data didn't have. Season. Month-over-month change in DNA rate. A risk_level flag splitting periods into high-risk and low-risk. A months_since_start variable to make time a continuous number rather than a label.
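Those engineered features take only a few lines in pandas. A minimal sketch, with made-up monthly values standing in for the real series:

```python
import pandas as pd

# Hypothetical monthly DNA-rate series; the values are invented
# for the sketch, not the real NHS figures.
df = pd.DataFrame({
    "month": pd.period_range("2023-07", periods=6, freq="M"),
    "dna_rate": [4.7, 4.6, 4.8, 5.1, 4.6, 4.5],
})

# Season label derived from the calendar month.
season_map = {12: "winter", 1: "winter", 2: "winter",
              3: "spring", 4: "spring", 5: "spring",
              6: "summer", 7: "summer", 8: "summer",
              9: "autumn", 10: "autumn", 11: "autumn"}
df["season"] = df["month"].dt.month.map(season_map)

# Month-over-month change in DNA rate.
df["mom_change"] = df["dna_rate"].diff()

# High/low risk flag relative to the series median.
df["risk_level"] = (df["dna_rate"] > df["dna_rate"].median()).map(
    {True: "high", False: "low"})

# Time as a continuous number rather than a label.
df["months_since_start"] = range(len(df))
```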

The statistical tests were where I learned the most. A chi-square test on appointment mode against DNA status came back significant at p < 0.001, with a Cramér's V of 0.073. Small effect size, but real. ANOVA across seasons confirmed autumn is genuinely different from winter, not just noisy. A t-test between high-risk and low-risk periods validated that the risk_level feature I'd engineered was meaningful rather than circular.
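The chi-square and Cramér's V calculation looks roughly like this in scipy. The contingency table below is invented to mirror the rates in the text; the real test ran on the full dataset, so the exact statistics differ:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: appointment mode (rows) against
# attended/DNA (columns). Counts invented to echo the 5.56% vs 0.46% rates.
table = np.array([
    [94440, 5560],   # face-to-face: attended, DNA
    [99540, 460],    # video/online: attended, DNA
])

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V: chi-square normalised by sample size and table shape,
# so it reads as an effect size between 0 and 1.
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"p = {p:.2e}, Cramér's V = {v:.3f}")
```

The normalisation is why a tiny p-value can coexist with a small V: with hundreds of thousands of observations, even a modest association is detected with near-certainty.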

I spent a long time on the seasonal decomposition. Splitting the monthly DNA rate into trend, seasonality, and residual made something click for me that a bar chart never did. The trend was gently downward, from about 4.75% in mid-2023 to 4.42% by late 2025. The seasonal component peaked in October every year with almost mechanical regularity. The residual was mostly small. Which is to say: the NHS is slowly improving, the October spike is predictable, and a sensible organisation would plan for it in August rather than react to it in November.
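A decomposition in that spirit can be hand-rolled in a few lines of pandas: a centred moving average for the trend, per-calendar-month averages of the detrended series for seasonality, and whatever is left as residual. The series below is synthetic, built to resemble the real one (gentle downward trend, October bump), not the actual data:

```python
import numpy as np
import pandas as pd

# Synthetic monthly DNA-rate series: gentle downward trend plus an
# October bump plus noise, standing in for the real NHS figures.
idx = pd.period_range("2023-07", "2025-11", freq="M")
t = np.arange(len(idx))
rng = np.random.default_rng(0)
rate = (4.75 - 0.012 * t
        + np.where(idx.month == 10, 0.4, 0.0)
        + rng.normal(0, 0.05, len(idx)))
s = pd.Series(rate, index=idx)

# Trend: centred 13-month moving average (13 so the window is symmetric).
trend = s.rolling(13, center=True).mean()

# Seasonal component: average detrended value per calendar month.
detrended = s - trend
seasonal = detrended.groupby(detrended.index.month).transform("mean")

# Residual: what's left after trend and seasonality.
residual = s - trend - seasonal
```

On the synthetic series, as on the real one, the seasonal component peaks in October and the trend drifts downward, which is exactly the planning signal described above.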

Phase three: the machine learning lesson I didn't expect

This is the part of the project I'm proudest of, and it's also the part that went least according to plan.

I trained three models: logistic regression, random forest, XGBoost. My first attempt at logistic regression gave me 93% accuracy and I got briefly very excited. Then I actually looked at what the model was using to predict, and realised I'd left the appointments volume column in the features. The model was essentially learning "high-volume rows tend to have more DNAs in absolute terms", which was data leakage dressed up as performance.

A year ago that result would have disappointed me. Now I think it's the most useful finding in the whole project.

Pulled the column out. Re-ran everything. Accuracy dropped to 56%. ROC-AUC of 0.57. Recall on the DNA class of 59% with class_weight='balanced'.
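The leak is easy to reproduce on synthetic data. A minimal sklearn sketch, with entirely invented features rather than the NHS dataset, showing how a volume column that encodes the target inflates accuracy, and what happens when you drop it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: a few service-level features plus a volume column
# that leaks the target (high-volume rows are labelled positive).
rng = np.random.default_rng(42)
n = 2000
features = rng.integers(0, 5, size=(n, 4)).astype(float)
volume = rng.integers(50, 5000, size=n)
y = (volume > 2500).astype(int)   # target driven by volume, not features

X_leaky = np.column_stack([features, volume])
X_clean = features

results = {}
for name, X in [("leaky", X_leaky), ("clean", X_clean)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X_tr, y_tr)
    results[name] = model.score(X_te, y_te)
print(results)
```

The leaky model looks brilliant and knows nothing; the clean model looks mediocre and is telling the truth. That gap is the whole lesson in miniature.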

Random forest scored worse. XGBoost scored about the same on accuracy but collapsed on DNA recall, down to 15%. Logistic regression, the simplest model in the set, was the best.

What a 56% accuracy on aggregated categorical data is really telling you is: you cannot predict individual patient no-shows from service-level features alone. You need patient-level data. Booking lead time. Prior DNA history. Distance from the practice. Time of day. Deprivation index. None of that is in the public NHS Digital release, and for good reason.

So the model isn't the product. The diagnosis is. If the NHS wanted to build a real risk-scoring system for appointments, the bottleneck is not algorithms, it's the data architecture that would let appointment-level records flow into a model without compromising patient confidentiality. That's a very different conversation from "let's try XGBoost with more trees".

What this work actually did for me

I started this project to have something real to point at. A reason for a recruiter to read past the top of my CV. What I didn't expect was how much it would sharpen the way I think.

It taught me to distrust my own first result. 93% accuracy felt amazing for about four minutes, and then it became a lesson I'll carry into every model I ever build.

It taught me that the interesting question in applied data science is rarely "which algorithm wins". It's "what does the winning algorithm's ceiling tell us about the data we do and don't have?"

It taught me that a good finding is one you can explain to a non-technical person and see them sit up straighter. The twelve-fold difference between video and face-to-face appointments did that every single time I mentioned it. People who had never heard the phrase "DNA rate" in their life immediately understood why it mattered.

And it taught me that work done for your own curiosity has a different texture than work done for a grade. I wasn't optimising for marks. I was trying to answer a question that genuinely bothered me.

The bridge to June

When I interviewed for the NHS England internship, the DNA project was the thing I kept coming back to. Not because the model was impressive, because it wasn't. Because the shape of the work matched the shape of what NHS data teams actually do. Start with a real operational problem. Interrogate the data honestly. Be willing to say the model doesn't work and explain why that's still useful. Translate the finding into something a manager can act on.

That framing is what I'll carry into the Northwest Manchester Office in June. I don't know yet what I'll be working on. I do know that the dissertation I'm running in parallel, unsupervised clustering of lung disease patient phenotypes with Dr Jenkins, will stretch a different set of muscles, and I'll need both.

Today I got a start date. What I'm actually celebrating is that the work I did on my own time, with free tools and public data, turned out to be the most honest preparation I could have done.

If you're somewhere in the middle of a project that feels slow and unglamorous, keep going. The 93% accuracy that turned out to be a bug taught me more than any clean result ever did. Real skill is built on the ones that don't work on the first try.

· · ·