Data Science using Agile Methodology

INTRODUCTION
A data science team asks great questions, explores the data, and delivers key insights.
The best way to generate business value is to deliver a constant stream of key insights in short two-week sprints.
A short sprint will also help the team pivot so they can ask new questions based on what they learn from the data.

WORK ON A DATA SCIENCE PROJECT
A typical project has upfront requirements: we need to understand what we are going to build before we start planning the project. It also focuses on delivering on scope, schedule, and budget.
Finally, a typical project delivers a product or service.

A data science project is different. The team explores new opportunities, makes data more accessible to the company, and builds a better understanding of it. So there is no detailed description of what to find before you start looking into the data.
By contrast, project management requires you to define a process and fully understand your deliverable.
Data science is an empirical process: we need to expect the unexpected if we want to gain new insights. It is a process focused on research. We have to stop planning and start exploring. A data science process is not a project but a discovery.

Data science teams are empirical and exploratory. In meetings, however, we are used to talking about mission, objectives, and outcomes. That is why it is difficult to step back and imagine a team devoted to pure exploration.

TYPICAL PROJECT MANAGEMENT APPROACH
In a typical project management approach, a data science team is composed of one research lead, two data analysts, and one project manager. The goal is to make the needs of the company or the customer actionable and to increase value or revenue. Typical questions are:
What do we know about the customer?
What do we assume?
Why do our customers shop with us instead of with our competitors?
What might make our customers shop even more?

To answer these questions, data analysts create reports, or analyze social media platforms and build a word cloud of feedback from thousands of customers. The problem is that knowing more about our customers generates more questions. We keep dropping what we found to explore new areas. In other words, we know what we want to find, but we are not learning anything new. Moreover, if we don't know how long the team will work to find insights, then we don't know the budget or the delivery date.

In other words, a data science project won't fit into a traditional project management framework.

HOW TO USE SUCCESS CRITERIA
Thomas Edison said of a failed experiment that it was not a failure. There is always something new to learn from failures, and there is the freedom to try a different way. Data science teams take the same approach. They run many experiments on the data, and not every experiment leads to insights. It is a good idea to create success criteria and objectives for the team. That is a good compromise between project management and data science.
The goals are:
Make the team transparent and not isolated from the rest of the organization.
Try to solve large problems. This means the research lead needs to ask ambitious questions. If the questions are too timid, it might be difficult to show results.
Show what the team is learning using regularly scheduled storytelling sessions.

If you are the project manager of the data science team, you must be able to:
Work hard to stay connected with the rest of the company.
Stay transparent about what you find.
Give frequent demonstrations of insights. By doing that, the rest of the organization understands the value of data science.

USE A DSLC
We already said that data science is about experimentation and exploration, and exploration, by definition, is about looking for something unfamiliar.
We cannot plan the work as we do in a typical project. Moreover, we won't necessarily have a set of clear objectives. Crucially, the absence of a plan doesn't mean an absence of intent.
Data science increases organizational knowledge. New insights can:
decrease the time to market (TTM), the length of time from when work on a product begins until it is available for sale
create new revenue
avoid costs
build goodwill

To work differently we need a life cycle. Two established life cycles are worth comparing.
The first is the SDLC, the software development life cycle (plan, analyze, design, code, test, deploy). It is typically called the waterfall model, because each phase has to be completed before the next begins.
The second life cycle is CRISP-DM, the cross-industry standard process for data mining (business understanding, data understanding, data preparation, modeling, evaluation, deployment). This process is designed for data rather than software.

CRISP-DM is too rigid for quick results. That is why the DSLC, the data science life cycle, was introduced. It is composed of six steps:
1 - identify the key roles in the team and the rest of the organization
2 - question
3 - research
4 - results
5 - insights
6 - learn

The DSLC is loosely based on the scientific method.
The most important thing to keep in mind is that the DSLC is not designed to run in phases.
To tell a story with data, it is best to start by identifying the key roles.
Having a question is a key element of the data science process. Here it is important to include the running partners. The question helps the team find the right method for the research. For example, imagine that the question is how to optimize the ads, or that you have to launch a new product on the market. The team can immediately start researching social media, scraping profiles and performing so-called social media analysis, which is extremely helpful for finding influencers able to spread the ads as widely and quickly as possible across the social network.

A key point about the results is having the right knowledge to create graphical representations of the data that really add value for the rest of the organization.

As in scientific research, the question, research, and results are the engine that drives our data science team. The research lead identifies the right question and works together with the data analysts on the research and on creating reports. Then the project manager works to communicate the results to the rest of the organization.

This process was used, for example, by Netflix when they launched the series House of Cards.

WORK IN SPRINTS
Sprints are used in agile software development. A sprint is a fixed period of time (typically two weeks) for the life cycle, and it contains all six areas of the DSLC. The main advantage of running in sprints is that it reduces the time between concept and cash. Even if there are no insights, we have a finished question. Consider that dedicating a long time to a result is risky, because during that period the data might change.
Thanks to this method we can adapt to new ideas instead of being locked into one path. One of the best tools a research group can use is the QUESTION BOARD, filled with Post-it notes. The idea is to solicit questions from the rest of the organization. The question board should be open and inviting. An important point is that, during the presentation of the results, people can recognize their own questions, so they will be more inclined to ask new questions in the future and even encourage co-workers to ask questions as well. Crucially, not every question has to be handled: the research lead and the team have to prioritize the most interesting ideas. Moreover, if we pay attention to the QUESTION BOARD we can find patterns in the questions, so the board itself becomes another data source.
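
The idea that the board itself becomes a data source can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions: the sample questions, the stop-word list, and the function name question_themes are all hypothetical, and counting shared words is just one simple way to look for patterns in the questions.

```python
from collections import Counter
import re

# Hypothetical stop words: common words that carry no theme on their own.
STOP_WORDS = {
    "why", "what", "which", "do", "our", "with", "us",
    "might", "make", "even", "more", "the", "after",
}

def question_themes(questions, top_n=3):
    """Return the words that appear in the most questions on the board."""
    counts = Counter()
    for question in questions:
        # Count each word at most once per question.
        words = set(re.findall(r"[a-z]+", question.lower()))
        counts.update(words - STOP_WORDS)
    return counts.most_common(top_n)

# Hypothetical Post-it notes collected from the organization.
board = [
    "Why do our customers shop with us?",
    "What might make our customers shop even more?",
    "Which customers stop buying after the first month?",
]

print(question_themes(board, top_n=2))  # [('customers', 3), ('shop', 2)]
```

A recurring word such as "customers" suggests a theme the research lead may want to weigh when prioritizing the questions.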

The sprint lasts about two weeks and the team has a lot to do. Therefore, they need some structure to stay efficient.
The first step is to define the meetings, which are typically five:
1 - exploration planning meeting
2 - question breakdown meeting: large questions are broken down into much smaller and more manageable ones.
3 - visualization design meeting: the research lead and data analysts work together to create interesting visualizations.
4 - storytelling session: this covers what the team found in the sprint.
5 - team improvement meeting: after the sprint there is a one-hour improvement meeting to evaluate progress.

This means that the team has two weeks between the planning meeting and the storytelling session.
During a sprint you force the team to do the minimum amount of preparation, so the team can focus on insights. We do not want the team to spend months just setting up the data; they have to start exploring the data and finding insights immediately.
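
The two-week window between the planning meeting and the storytelling session can be laid out on a calendar with a short Python sketch. The function name sprint_windows and the start date are hypothetical, and the sketch assumes that each new sprint is planned on the same day the previous storytelling session takes place, which the text does not state.

```python
from datetime import date, timedelta

SPRINT_LENGTH_DAYS = 14  # two-week sprint, as described above

def sprint_windows(first_planning_day, count):
    """Return (planning_day, storytelling_day) pairs for consecutive sprints."""
    windows = []
    planning_day = first_planning_day
    for _ in range(count):
        storytelling_day = planning_day + timedelta(days=SPRINT_LENGTH_DAYS)
        windows.append((planning_day, storytelling_day))
        # Assumption: the next sprint is planned on the storytelling day.
        planning_day = storytelling_day
    return windows

# Hypothetical start date, used only for illustration.
for planning, storytelling in sprint_windows(date(2024, 1, 1), 3):
    print(planning.isoformat(), "->", storytelling.isoformat())
```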

There is a big difference between presenting data and TELLING A STORY; the latter requires a lot more work.
When you give a PowerPoint presentation, it is like saying: “Here is what I see”.
When we are telling a story, we are saying: “Here is what I believe”. To do that we have to:
1 - synthesize complexity: explain something complex in a simpler way.
2 - bring knowledge
3 - make the data more memorable: a good story has to get everyone engaged.
4 - call to action: a good story has to call for action based on the findings we are explaining.

RELY ON SERENDIPITY
There is a book called Why Greatness Cannot Be Planned: The Myth of the Objective, by Ken Stanley and Joel Lehman.
In the book they say that objectives actually become obstacles. The more you focus on clearly defined objectives, the less likely you are to make interesting discoveries. Professor Ken Stanley says the team should rely on pure serendipity. Serendipity is something unpredictable or unplanned. Professor Stanley describes such events as stepping stones that lead to insights. If we ignore serendipity, we will likely miss the key discoveries. We have to be able to pursue the unexpected.
By contrast, when people are focused on routine tasks, they are blind to unexpected events. This phenomenon is called perceptual blindness.

More information about Agile methodology in data science can be found in the book Data Science: Create Teams That Ask the Right Questions and Deliver Real Value, by Doug Rose.