Agile data science projects

...and other fairy tales - Part I


Which personality type are you?

I'm not asking whether you're an INTP or something (besides, MBTI is out of fashion).
I'm asking what your reaction is when you hear the word agile.
Do you

  • laugh nervously

  • start shouting and leave the room

  • tremble in fear

  • stare into the void (and it looks back)

  • or prepare your paycheck because you're the elephant agile coach in the room?
    (Just joking, I really like good agile coaches, I only have problems with bs-generator people, but honestly, that's not limited to the agile coach profession.)

Actually, most people I know hate agile, and honestly, they have a point, because only a handful of companies do agile the right way.

To be honest, that makes me really sad. As an organizational psychologist and a fan of lean, Six Sigma and co., I believe in agile with all my heart.
But yes, I get why people are freaking out: scrum, standup, planning, etc. are no longer just words, but manifestations of disgust and evil.
Because at most places, agile means

Jira with micromanaged waterfall, but with more weird meetings.

But good agile is not about that.
Good agile could even skip the daily standup and Jira. Crazy, right?
Because all in all, what is the essence of agile?

Ship working software, and ask stakeholders if they like it or not. Repeat.

Ship, ask, repeat. That's all.
Show each small valuable release, and interpret the feedback, even if it contradicts the original specification.
Because specifications are NEVER CLEAR!
Let me tell you a story about one of my former projects. I was a junior then.
At the beginning, the stakeholder specifically asked us to develop a PDF export on the very page where our embedded PowerBI report was shown.
Guess what - we were working in sprints, we had Jira, we had refinement sessions and whatnot, but we didn't demo the progress. We spent weeks developing the product; the PDF export part alone took about 2 weeks to design, format, test, integrate, etc.

...yes, you already know the conclusion.
He said he couldn't even imagine that the PowerBI report would be this comprehensive. He told us to throw the PDF in the garbage - it's not 1990. LOL.

Not even stakeholders know what they want - not because they are stupid.
No, no, actually they are very far from stupid.
It's just how the human brain works - all of us have a limited scope in our minds. We can imagine the next 3 steps very well, the following 2 so-so, but after that we only have some misty assumptions. And there is nothing wrong with that, this is how we work - we just need to adjust the way we organize our development lifecycles accordingly.

So, as a fundamentalist agile fan, I beg you not to give up on agile because of former bad experiences. Start with this simple mantra: ship, review, repeat. And after that, if you feel that you could use some process refinement or that any of the scrum ceremonies would come in handy - but only then - start the standard agile stuff.

But how do we organize an agile data science project?

Well, let me draft you a blueprint.
It builds on elements of scrum processes and roles.

Let's start from the beginning: why is it worth talking about agile data science projects specifically?
Because data science (and all analytics) projects are extremely difficult to run in an agile way, and there's a strong reason for that:

All analytics projects are a mixture of R&D and standard development.

All analytics projects have research phases even if the tech stack is well known - because the team needs to find out the details of

  • data specifics (are those IDs distinct? does averaging this value make sense?)

  • data quality (is this usable? how should we fix bad data?)

  • possible analysis methods (is the data okay for ML, or should we start with just BI?)

and these are not known at the planning phase of a project.
Sure, you can get a big pile of example data, but everyone knows the real hassles come with productionalizing. Example data is only enough for a PoC, although you can improve the quality of your initial PoC by assessing the phases in the following section.

Step 0 - Validation and pre-development planning:

  • guess the ROI of the project - Are the benefits worth the development and operational costs? Look at a period of 2-3 years at most; nowadays projects become legacy in 2-3 years, so you shouldn't plan for longer. You need to guess the ROI because you don't want to be fired for flushing money down the toilet. Don't build a spaceship to go to the grocery store around the corner.

  • assess the pull power behind the project - Do the stakeholders really need this, or is it a higher-management idea? Are the stakeholders supportive of the project, or do they instantly focus on the mistakes? If the stakeholder climate seems hostile, try to avoid or rescope the project. Without pull power, the project will be sunset painfully soon.

  • explore the scale of the data - Can you use a compact Python script to extract it, or do you need to set up a Spark cluster? This might mean a totally different set of competences in the team and a different infrastructure; never skip this part.

  • assess the available resources - Do you have the expertise in your team? Do you need to hire contractors? Do you have the infrastructure, or do you need to build it from the ground up? Does it fit the enterprise architecture and standards?

  • assess security and compliance - Do you need to anonymize IDs? How long a history can you store? Do you need to work out a standard with the legal department?

  • find out the expected refresh times - Is it enough to run this monthly? If the business wants real-time refresh, is it truly real-time? Or are hourly batches fine?

  • explore the incremental load logic - Do you need append-only, or is there a complex upsert logic? What are the consequences of this logic?

  • validate that the desired analysis is doable - Create, say, a Jupyter notebook and do a max. 100-line validation analysis.
    Load a sample of the data (e.g. cut horizontally for a limited time window or vertically for a limited set of IDs).
    Check whether the desired analysis is doable (e.g. fit a baseline linear regression for ML, or write a simple SQL query for a KPI).
    Visualize the results and validate them with the stakeholders. It can be an Excel table, a pie chart or a text output, but they need to understand the result.
    You need to be sure they understand the output and are committed to pulling the project from their side.
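To make the last bullet concrete, here is a minimal sketch of such a validation notebook in Python. Everything in it is made up for illustration - the column names (marketing_spend, daily_revenue), the numbers and the linear relation; in a real project you would load a sample extract of your own data instead of generating one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# In a real notebook you'd load a horizontal/vertical sample here;
# synthetic data stands in so the sketch is self-contained.
rng = np.random.default_rng(42)
spend = rng.uniform(1_000, 10_000, size=90)             # 90 days of sample data
revenue = 3.2 * spend + rng.normal(0, 2_000, size=90)   # assumed linear-ish relation
df = pd.DataFrame({"marketing_spend": spend, "daily_revenue": revenue})

# Baseline linear regression: is the desired analysis doable at all?
X, y = df[["marketing_spend"]], df["daily_revenue"]
model = LinearRegression().fit(X, y)
pred = model.predict(X)

# Stakeholder-friendly output: plain numbers, no ML jargon.
print(f"Average prediction error: {mean_absolute_error(y, pred):,.0f} per day")
print(f"Share of variance explained: {model.score(X, y):.0%}")
```

If even this 20-line baseline can't produce an output the stakeholders understand and find promising, that's a cheap signal to rescope before any real development starts.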

Step 1 - Develop MVP(s)

You might wonder why I wrote MVP(s) and not just MVP.
This is because until the very end you will have multiple parallel streams that you need to review with the stakeholders. Let me go into detail.

In analytics products (fancy name in 2017) or data science products (fancy name in 2020) or machine learning products (fancy name in 2022) or AI products (current fancy name in 2024) you have multiple parallel streams that each bring value (or go wrong) on their own and can pull the others underwater.

These streams or sub-MVPs might overlap and might be done by the same people, but it is highly important that each be reviewed by the stakeholders separately - they are different areas of responsibility.

The three sub-MVPs are good data, good business logic, good serving.

Let's call them

  • The data engineering responsibility

  • The analytics responsibility

  • The serving responsibility

Data engineering responsibility

  • data is there

  • data is good quality

  • data is properly preprocessed

  • data is at the desired granularity

Data engineers (or whoever wears the data engineer hat) need to constantly validate the data with the stakeholders. If they are more advanced stakeholders, it can be an SQL table or an Excel with raw data. If not, aggregate, and help them understand the limitations and properties of the underlying data - even build your own validation visualizations. In BI projects these can overlap with the final viz layer, but that is not a necessity. The key point is that the data engineer should improve the underlying data by continuously reiterating with the stakeholders.
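A minimal sketch of what such a data validation could look like in pandas - the toy table and its column names (order_id, customer_id, amount) are invented for illustration; in practice this would run on a sample of the real source table:

```python
import pandas as pd

# Toy extract standing in for a sample of the real source table.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": ["a", "b", "b", None],
    "amount": [10.0, 25.0, 25.0, 40.0],
})

# Are the IDs really distinct, as the source system claims?
duplicate_ids = df["order_id"].duplicated().sum()

# How much of each column is missing?
null_share = df.isna().mean()

print(f"Duplicate order_ids: {duplicate_ids}")
print("Null share per column:")
print(null_share.to_string())
```

The point is not the code itself but the habit: turn each assumption about the data ("IDs are unique", "this field is always filled") into a small check, and walk the surprising results through with the stakeholders.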

Analytics responsibility

  • business logic is appropriate

  • stakeholders understand and are able to interpret the results

  • if it includes statistical/ML models, performance metrics are validated by stakeholders. If they don't understand the metrics, it is necessary to transform the numbers into a metric that they can understand. They can accept/refuse the model performance only if they understand the consequences

  • if it contains AI, there is a way to assure the quality of the outputs, and the limitations are clearly communicated and validated by stakeholders. Don't surprise them at the end of a project; they need to know where/when/how much the model hallucinates, etc.

The analytics stream can be the biggest win and the ugliest failure point in a project. Always be transparent and make sure the stakeholders understand the logic and the limitations so they can accept or refuse the results at an early stage. For analytics/ML/AI you need to fail fast. It is not a shame to reduce the scope to a high-quality BI or email report instead of a low-quality ML/AI project.
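As a sketch of transforming metrics into something stakeholders can accept or refuse, here is a tiny example that translates made-up precision/recall numbers of a hypothetical churn model into plain business language. All numbers (customer volume, churn rate, precision, recall) are invented for illustration:

```python
# Hypothetical monthly volumes and model quality - all numbers made up.
monthly_customers = 10_000
churn_rate = 0.05          # 500 churners per month
precision = 0.80           # of flagged customers, 80% really churn
recall = 0.60              # the model catches 60% of actual churners

churners = monthly_customers * churn_rate
caught = churners * recall          # true positives
flagged = caught / precision        # everyone the model flags
false_alarms = flagged - caught
missed = churners - caught

print(f"Each month the model flags ~{flagged:.0f} customers.")
print(f"Of those, ~{caught:.0f} will really churn; ~{false_alarms:.0f} are false alarms.")
print(f"~{missed:.0f} churners will slip through unflagged.")
```

"We'll bother 75 loyal customers and still miss 200 churners a month" is a statement a stakeholder can actually accept or refuse; "precision 0.8, recall 0.6" usually is not.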

Serving responsibility

  • Results are presented in the desired form
    (email, BI, chat, web app, whatever)

  • Results are refreshed at the required periods
    (this might be a cooperation between the front-end/viz team with the data engineer team)

  • Results are enjoyable to use

  • Results are displayed in a way adjusted to the consumers' behavior and skills.
    (don't overcomplicate things; don't give an API to a non-dev client circle, provide a report/email summary instead)

The serving of the outputs can be validated continuously, while acknowledging that the underlying data might not be perfect yet.

As you can see, these three responsibilities can go on their own track and need to be continuously validated.

Many projects fall into the trap of reviewing the analytics product only when the data, the business logic and the serving are all there and integrated. An error discovered at this point might mean a total reorganization of the project. Don't fall into this pit; always review the responsibility areas in parallel. Fail fast.

And at the end, don't forget the golden rule for Minimum Viable Product:

In MVP, it should be minimum AND viable - not minimum OR viable.

If you follow these concepts, there is a bigger chance that your project will succeed and that your team won't quit because they've been driven crazy by fake-agile ceremonies 12 hours a week.

In the following parts of this series, I'll share examples of agile data projects in detail.

I write this blog after putting 3 small kids to sleep. Tomorrow morning will be hard. Please support my morning coffee if you'd like to read more of my thoughts.