The Agile approach in data science explained by an ML expert

Gone are the days when extensive planning and organisation of tasks were set in stone before teams embarked on a major project. It’s becoming increasingly normal for teams to update their Kanban boards weekly and adapt their projects to changing requirements.

Many of us would find this flexible approach familiar, mainly because many teams are attempting to apply the Agile approach in their work processes.

We wanted to get a real-life perspective on the Agile approach in data science, so we got in touch with Christos Hadjinikolis, the Lead ML Engineer at Data Reply UK, to talk about this topic. You can check out his personal website to learn more about his credentials and work in Machine Learning (ML).

Christos Hadjinikolis is a Lead ML Engineer with excellent presentation and communication skills. He's passionate about research, and through his doctoral studies and consulting experience he has been exposed to both deep learning and evolutionary AI. He sees himself as a competent problem-solver with a proven academic record in the field of AI. He also has practical experience in Big Data engineering and advanced analytics, specialising in graph analytics.
Christos Hadjinikolis, Lead Machine Learning Engineer.
Image Source: Christos’ personal website.

Agile in Data Science

Christos spoke about the Agile approach in data science in his session Doing Data Science The Agile Way at the Mcubed London event in 2018. You can watch the video for more information.

Hadjinikolis defines Agile Software Development as a time-boxed, iterative and incremental approach to software delivery. The Agile approach in data science is based on the Manifesto for Agile Software Development and its twelve principles.

According to Christos, a data scientist’s role in an Agile team is to experiment and learn what works and what doesn’t. Agile Software Development is all about creating experiments, not tasks. Experimentation tolerates failure as education.

“If practicing Agile Software Development produces value from learning through failing fast and safe… then a Data Scientist’s role in an Agile team is to help the team to fail faster!”

Christos Hadjinikolis at Mcubed London 2018.

When we do an experiment, we start by forming a hypothesis and determining the variables. Then, we proceed to collect and analyse data based on those variables.

When the results are not what we wanted or expected, it suggests that either the hypothesis doesn’t hold or there isn’t enough data to test it. So, we adjust the hypothesis and redo the experiment to see whether the revised hypothesis holds.

Story splitting

In Agile, a use case is split into different user stories. A user story is an explanation of a software feature written from the perspective of the end-user. A user would describe the functionality they want in a feature and the benefit gained from that functionality.

Here’s an example we came up with. In a non-tech use case of baking the town’s best range of cupcakes, a user would say in a user story, “As a cupcake seller, I want more vanilla essence in the cupcake batter, so that the cupcakes will be tastier for cupcake eaters.”

In this user story, the “cupcake seller” is the description of the user, “more vanilla essence in the cupcake batter” is the functionality requested and “the cupcakes will be tastier for cupcake eaters” is the benefit of the functionality.
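
The three parts of a user story map naturally onto a small data structure. The following is a minimal sketch in Python; the names `UserStory` and `as_sentence` are hypothetical and not taken from any Agile tool. It only illustrates the user/functionality/benefit template described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    """One user story: who wants what, and why (hypothetical structure)."""

    role: str           # description of the user
    functionality: str  # the functionality requested
    benefit: str        # the benefit gained from that functionality

    def as_sentence(self) -> str:
        # Render the common "As a ..., I want ..., so that ..." template.
        return (
            f"As a {self.role}, I want {self.functionality}, "
            f"so that {self.benefit}."
        )


cupcake_story = UserStory(
    role="cupcake seller",
    functionality="more vanilla essence in the cupcake batter",
    benefit="the cupcakes will be tastier for cupcake eaters",
)
print(cupcake_story.as_sentence())
```

Keeping the three fields separate makes it easy to check that every story in a backlog actually states a benefit, which is the part that turns a task into a testable hypothesis.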

The baker would test the hypothesis that more vanilla essence in the cupcake batter would result in tastier cupcakes. The baker would experiment with several batches of cupcake batter, each with a different amount of vanilla essence, and then record the data to analyse whether more or less vanilla essence is better.

The baker would repeat experiments for other user stories about other cupcake features like the amounts of sugar, flour, ground flaxseed, water, colouring, oven temperature, flavouring and icing.
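
The baker's analysis step can be sketched in a few lines of Python. The ratings below are invented purely for illustration, and the function names (`best_amount`, `hypothesis_supported`) are hypothetical; the point is only to show a hypothesis being checked against recorded data.

```python
from statistics import mean

# Hypothetical taste ratings (1-10) for each batch, keyed by teaspoons of
# vanilla essence. These numbers are made up to illustrate the analysis.
ratings_by_vanilla_tsp = {
    0.5: [6, 7, 6],
    1.0: [7, 8, 8],
    1.5: [8, 9, 8],
    2.0: [7, 6, 7],
}


def best_amount(ratings: dict[float, list[int]]) -> float:
    """Return the vanilla amount with the highest mean rating."""
    return max(ratings, key=lambda amount: mean(ratings[amount]))


def hypothesis_supported(ratings: dict[float, list[int]], baseline: float) -> bool:
    """Check 'more vanilla than the baseline tastes better' against the data.

    The hypothesis counts as supported if at least one batch with more
    vanilla than the baseline has a higher mean rating than the baseline.
    """
    baseline_score = mean(ratings[baseline])
    return any(
        mean(scores) > baseline_score
        for amount, scores in ratings.items()
        if amount > baseline
    )


print(best_amount(ratings_by_vanilla_tsp))
print(hypothesis_supported(ratings_by_vanilla_tsp, baseline=1.0))
print(hypothesis_supported(ratings_by_vanilla_tsp, baseline=1.5))
```

With this made-up data, adding vanilla helps up to a point and then stops helping, which is exactly the kind of result that would lead the baker to adjust the hypothesis and run another batch.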

The same can be done in software development. Let’s look at it from the lens of a customer service colleague who is logging an issue about an online shopping app feature as a user story in the use case of improving the online shopping app.

This user would say, “As a customer service representative, I want the quantity of items in the shopping cart to be changeable, so that the customers can buy more than one of the same product.”

The software developer would then experiment with hypothetical ways to enable the quantity of items to change from 0 or 1 to the maximum number in stock.
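
One possible first experiment for this story, sketched under assumptions, is a function that keeps the requested quantity between 0 and the number in stock. The name `set_quantity` and its parameters are hypothetical and do not come from any real shopping app.

```python
def set_quantity(requested: int, in_stock: int) -> int:
    """Clamp a requested cart quantity to the range 0..in_stock."""
    if in_stock < 0:
        raise ValueError("in_stock must not be negative")
    # Never go below 0 and never exceed what is available.
    return max(0, min(requested, in_stock))


print(set_quantity(3, in_stock=10))
print(set_quantity(15, in_stock=10))
print(set_quantity(-2, in_stock=10))
```

Silently clamping is only one design choice; another experiment might reject an out-of-range request and show the customer a message instead, and the team would learn from feedback which behaviour works better.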

Agile cuts time-consuming planning and speeds up turnaround of projects

Frameworks based on Agile including Scrum, Kanban, Extreme Programming, Feature-Driven Development and Dynamic Systems Development Method (DSDM) may emulate the adaptivity encouraged by Agile Software Development.

But they don’t have to be followed if they don’t meet the needs of failing faster, getting results faster and deploying software faster to give value to the customer faster.

That’s the beauty of the Agile approach. Nothing has to be concrete from the start and adjustments can be made along the way. Especially when there’s uncertainty about data availability, no time is wasted on excessive planning and documentation, and instead, more time is spent on pleasing the end-user.

Data Science-specific practices based on Agile Values will eventually be developed in research

Christos commented that the emergence of new Data Science-oriented practices will drive the application of Agile in the research domain.

“The focus there is about delivering business requirements, in the form of features and products, fast in a volatile, constantly evolving environment.”

“To support this, a number of underpinning practices have been developed, covering areas like modelling and design, coding and testing, risk handling and quality assurance.” 

“Some of these underpinning practices can directly be transferred in the Data Science world (e.g. user stories and backlogs, timeboxing and retrospectives).”

“So, a clear benefit of trying to apply Agile in Data Science is that gradually, similar Data Science-specific underpinning practices will eventually be developed and these will, of course, be based on the same Agile drives: adaptive planning, evolutionary development, early delivery and continual improvement, and more generally, flexible responses to change.”

The downsides of the Agile approach in data science

However, Agile may not work for everyone and for every situation. When we asked Christos about the downsides, he said that people tend to wrongly assume that Agile is the only way a team can work and be productive.

“Ever since Agile emerged—in the concrete form that we know it today through the Agile manifesto—many hurried to undermine the effectiveness of other development models, e.g. Waterfall.”

“There is nothing wrong with the Waterfall model either; the real question is whether these practices or models are fit for purpose! There are surely research projects as well as business requirements around the delivery of software that could potentially be delivered through the Waterfall approach or maybe through a combination of the two.”

He recommended that project managers and teams strive to increase their effectiveness and efficiency, whichever model they use.

“If that can be done by building on top of the Agile values then great; if not, then maybe they will need to try and come up with a different formula.”

“People focusing too much on what Agile is and what is not—if it needs to be Scrum or Kanban or if too much documentation or too much time spent in design is not Agile—are bound to make mistakes.” 

The Agile Industrial Complex

He also agreed that the imposition of Agile on teams, i.e. The Agile Industrial Complex, is defeating the purpose of Agile in finding what works best for teams in working adaptively.

“Once more, I can’t stress enough how there is no single perfect development model. Project managers need to always assess what is fit for purpose.”

“Primarily though, they should focus on the underpinning values and principles that Agile or other development models are characterised by. When they do, a recurrent mistake that I have experienced through my consulting career is the oversimplification of Agile as an anti-methodology, anti-documentation and anti-planning development model.”

“I appreciate that this makes understanding Agile much easier, but at the same time it is a very unfair representation of what Agile is! Imposing it on this basis is surely wrong. Equally, practicing Agile is definitely not something that comes through imposition.”

Many data scientists are not aware of the Agile manifesto

Christos asked the attendees at his Mcubed London 2018 session, who were mostly data scientists, to raise their hands if they knew about the Agile Manifesto. Sadly, very few hands went up. This indicated that most of the data scientists in the room were not aware of the Agile Manifesto, so we asked him why.

“I can’t be too sure about this but if I was to point at anything, that would be how Data Science has, until recently (5 years ago), been so disjoint from the delivery of production-ready solutions. It was more focused on research and discovery to aid decision making.”

“Lately, the evolution and growth of ML as well as of cost-effective services to support it, necessitates the interaction of the two worlds. Never before has it been so much the case that ML models are such an integral component of software.”

“Before, Data Scientists did not need to worry about the operationalisation and maintenance of their model. Concepts like versioning, robustness, code-coverage and testing were not so much imposed or needed, let alone challenges related to things like dealing with technical debt and refactoring.”

“The traditional work environment would be a Jupyter notebook with access to a database! So, Data Scientists did not need to be exposed to so many practices to govern how they would work to deliver new insights.”      

Challenges of operational production level Data Science solutions

“This mostly has to do with bridging the gap between software engineers and data scientists. Software engineers not exposed to data science can’t really do this because they fail to appreciate how exactly to maintain ML-pipelines.”

He notes that, in contrast to traditional software pipelines, ML pipelines raise many more issues that need to be addressed, and he recommends the seminal NIPS 2015 paper “Hidden Technical Debt in Machine Learning Systems” if you want to learn more.

“Equally, Data Scientists don’t appreciate the complexity of developing and maintaining code-bases and software solutions in a flexible and robust way to allow for things like CI/CD to be supported.”

“This gap is now partially addressed through the emergence of a new paradigm: the ML engineer, a hybrid data scientist and software engineer, equipped with the knowledge to deal with challenges from both worlds.”

“However, that is not enough to account for everything. What is also necessary is the emergence of appropriate tooling to support the development and maintenance of ML pipelines. Good examples are Kubeflow, AWS SageMaker and the less mature but fast evolving Google AI platform.”

“What is surely not helpful is the bad practice of finding ways to schedule and run python notebooks in production!”

“I can’t stress enough how many times I have dealt with this in my career! Python notebooks are not made to be run as part of production pipelines—yet so many companies just do so!”

His plea to every project manager running an ML project out there is, “This is madness! Please stop it!”
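
One common alternative, offered here only as a sketch, is to move the logic of a notebook cell into a plain function in a module, where it can be versioned, tested and run by a CI/CD pipeline. The function name and the min-max scaling step below are hypothetical stand-ins for whatever a real notebook cell does; they are not something Christos described.

```python
def scale_to_unit_range(values: list[float]) -> list[float]:
    """Min-max scale values to [0, 1]; a stand-in for one notebook cell's logic."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        # All values are identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def test_scale_to_unit_range() -> None:
    # A test like this can run automatically on every commit.
    assert scale_to_unit_range([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
    assert scale_to_unit_range([3.0, 3.0]) == [0.0, 0.0]
    assert scale_to_unit_range([]) == []


test_scale_to_unit_range()
```

The notebook can still import and call the function during exploration, so the experiment and the production pipeline share one tested implementation.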

Working together to find the right fit for the team

In Christos’ words, the most important factor in making ML-Ops (or any data science operations) agile is culture.

“ML-Ops are here to help cultivate collaboration between data scientists and engineers to support the ML-lifecycle. They are a manifestation of Agile for Data Science in a way!”

“What’s needed is for this mentality towards the development of production level ML solutions to be supported by practitioners, project managers and stakeholders the same.”

“Everyone needs to take risks and own responsibility. Data Scientists need to develop the courage of supporting their experiments even if they may appear to delay production; they need to help stakeholders and project managers appreciate the actual value of experimentation.”

“This will often prove to be very challenging; loss aversion will eventually kick in and when it does people will be more reluctant to change, and they will want to stick to what they know.”

“But this is to be expected! It is natural human behaviour, and this is what we, as a community, are up against.”

“At the end of the day, we need to remember that it is almost impossible to find the right balance or get it perfectly right. There is no formula for it. Nevertheless, value will come simply from trying to get it right, and that is more than enough!”