joaorosa.io | João Rosa thoughts

My observations on the effects of CAPEX and OPEX on organisational behaviour

2022-12-19T22:00:00+00:00

During my career, I’ve been involved in different digital transformations, ranging from financial to telco institutions. Those digital transformations aim to evolve the operating model (or parts of it) and bring digital capabilities to the forefront. As part of it, new ways of working are implemented, which leads to new practices to manage capabilities. You might recognise those digital transformations by other names, such as Agile transformation, DevOps transformation, or cloud transformation.

From my learnings across the years, I noticed a pattern within organisations. There is a clash between the parts of the organisation that manage capital expenditure (CAPEX) and those that manage operating expenditure (OPEX). Let’s use the Wikipedia definitions to expand on those two concepts:

Capital expenditure from Wikipedia:

Capital expenditure or capital expense (capex or CAPEX) is the money an organisation or corporate entity spends to buy, maintain, or improve its fixed assets, such as buildings, vehicles, equipment, or land. It is considered a capital expenditure when the asset is newly purchased or when money is used towards extending the useful life of an existing asset, such as repairing the roof.

Operating expenditure from Wikipedia:

An operating expense, operating expenditure, operational expense, operational expenditure or opex is an ongoing cost for running a product, business, or system. Its counterpart, a capital expenditure (capex), is the cost of developing or providing non-consumable parts for the product or system. For example, the purchase of a photocopier involves capex, and the annual paper, toner, power and maintenance costs represent opex. For larger systems like businesses, opex may also include the cost of workers and facility expenses such as rent and utilities.

I added the definitions of CAPEX and OPEX to the article since, typically, they are not in the IT lexicon. But they are crucial for the observed behaviour.

What I’ve observed

The organisations that I worked with have some unspoken conflict between two very different tribes: on one side, people who want to move fast, test new products or services and abandon what doesn’t work; on the other side, people who must maintain the different assets that underpin the company’s operations. Using the accounting jargon: a group of people that manages OPEX and a separate group that manages CAPEX. It might not be a problem since companies have operated like that for decades. Or is it?

Figure 1 - Illustration of the two factions: one wants to move faster, other needs to maintain assets

Let’s use an example…

Let me use an example that you might recognise. Imagine an organisation in the logistics domain. They own warehouses and a fleet of trucks. The business proposition is focused on small businesses: the company acts as their end-to-end logistics partner, helping them to have a more significant geographic footprint. They have been operating since the 1980s, when business transactions were done via salespeople (in-person), phone or fax. Back then, the company set up an IT department to manage the technical assets, such as phones and faxes. As the company evolved and software became the norm, the IT department came to manage physical assets such as the data centre, phones and laptops, and digital assets such as software and software licenses. The IT department mainly manages CAPEX, given that most assets are physical. They also have a sales and marketing department, which transitioned from in-person sales to digital sales. They target the same small businesses, but these small businesses now leverage the internet to sell their goods. This department manages OPEX since they spend the budget on marketing campaigns, such as social media ads.

So what?

“So what?” might be the question in your head. Well, the definitions above leave out one interesting property of CAPEX: the cost cannot be fully deducted in the year it is paid. Instead, it is capitalised and spread across the lifecycle of the asset. This means that organisations maximise the use of assets under CAPEX, and people directly responsible for managing such assets must understand their lifecycle (which can vary from a couple of years to decades). On the other hand, OPEX can be offset in the same year it is incurred.
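To make the difference in time horizons concrete, here is a minimal sketch in Python. It assumes straight-line depreciation (one common method, not the only one), and every name and amount in it is hypothetical and chosen for illustration. It is not accounting advice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    """A hypothetical spend item; all figures are illustrative assumptions."""
    cost: float             # amount paid up front
    useful_life_years: int  # 1 means the cost is expensed in the year it is paid


def yearly_expense(purchase: Purchase, horizon_years: int) -> list[float]:
    """Expense recognised in each year, assuming straight-line depreciation."""
    per_year = purchase.cost / purchase.useful_life_years
    return [
        per_year if year < purchase.useful_life_years else 0.0
        for year in range(horizon_years)
    ]


# Hypothetical CAPEX item: a data centre depreciated over ten years.
data_centre = Purchase(cost=1_000_000, useful_life_years=10)
# Hypothetical OPEX item: a marketing campaign expensed in the same year.
ad_campaign = Purchase(cost=100_000, useful_life_years=1)

capex_schedule = yearly_expense(data_centre, horizon_years=10)
opex_schedule = yearly_expense(ad_campaign, horizon_years=10)

print("CAPEX per year:", capex_schedule)
print("OPEX per year: ", opex_schedule)
```

In this sketch the data centre shows up in the books for ten years, while the campaign is gone after the first. The people managing each item naturally plan on those two very different horizons.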

It provides an interesting challenge: OPEX and CAPEX have different time horizons, and people will have different behaviours when applied to assets (such as a data centre or a marketing campaign). It can affect decision-making, risk-taking, innovation, and overall organisational culture.

Figure 2 - Different challenges when managing OPEX (above) and CAPEX (below)

At an organisational level, it poses a tall challenge: as time passes, the group that manages OPEX wants to move faster, and the group that manages CAPEX needs to maintain assets with a long lifecycle. In other words, organisations (as described in the example above) have IT to manage CAPEX and sales-type groups to manage OPEX. Of course, it is a simplified view, but it should give you an idea.

When an organisation undergoes a digital transformation, this tension becomes apparent. Because people manage assets using methods that suit the finance and accounting constraints, their behaviour and mental models fit those methods. For example, if an IT department evolves as in the previous example, it will likely apply waterfall methods to software development, producing long lead times. But hey, if those are the methods people know, how can they do it differently? Of course, each organisation is different, and the behaviour that I described is more pronounced in organisations that are not digital-native.

The organisations that are winning in the digital space have short feedback cycles. By having short feedback cycles, they can learn quickly from the people using their services and/or products, giving them the opportunity to correct course (if they miss the mark) or even create new value propositions. By contrast, if you have long feedback cycles, you are playing catch-up, and the competition is likely to outpace you. And I realised that companies with long feedback cycles, which are trying to transition to a digital model but have yet to succeed, have a clear split between the groups managing CAPEX and OPEX.

Focus on the environment, not on people

Kurt Lewin wrote in 1936 the following equation:

Figure 3 - Lewin's behaviour equation

It means that behaviour is a function of a person in their environment. Let that sink in for a moment.
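For reference, Lewin's equation is usually written as follows, with B standing for behaviour, P for the person and E for the environment:

```latex
B = f(P, E)
```

Read this way, changing E (the environment) is a lever on behaviour that is separate from trying to change P (the person).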

Unfortunately, it is common in digital transformation to tell people how they should behave rather than actually change the environment where people conduct their work. And it is one of the strong reasons why those digital transformations fail. Without the proper incentives in the organisation, the behaviour will remain the same.

If you want a digital organisation, you must revisit your finance and accounting model. Those models reinforce behaviours that are different from those of a truly digital organisation. In the organisations I work with, it is a topic I put on the table early. Without adequate financial constraints, the digital transformation is at risk.

Some use cases to foster the environment

Let me describe some use cases. I’m sharing some of my experiences, where the goal was to increase the cooperation between different groups. As humans, we relate to examples. But be aware: it is not an exhaustive list of solutions, and I’m looking for feedback. What other use cases can help foster the environment?

Move your IT real estate to an OPEX model

Moving your IT real estate to an OPEX model is the obvious use case. In the example of the logistics company, by moving their workloads from the data centre to the cloud, the group managing the data centre will stop managing CAPEX (or a big chunk of CAPEX). They will be incentivised to think in smaller timelines since they don’t have assets to depreciate.

Platform as a product

Platform as a product has been gaining relevance in the IT world in recent years. In terms of CAPEX and OPEX, the platform can have assets that are managed via CAPEX but offer services that are OPEX based. It requires financial and accounting experts in the platform to have the proper mechanisms in place. And this model is not new: the cloud providers use it. The IT industry is realising that everything needs proper product management to be successful (I’m referring to product folks that scan the problem space to find opportunities, not people with a “product” role where their job is to be gatekeepers for software teams). Team Topologies has extensive material on this topic.

Focus on the capability, not on the type of role

Given the organisation has the ambition of being digital, there is an opportunity to have an organisational model where people from different roles (marketing, sales and IT) are together to form one capability. The capability has an outside-in view, starting with the user needs (external and internal) and what needs to be done to deliver. It is disruptive since most organisations have a cut between business and IT. This use case assumes that organisations are committed to being the best in their field and are willing to let go of the false dichotomy between business and IT. In this way, the CAPEX and OPEX blend over the different capabilities, and the people are incentivised to nurture organisational capabilities rather than manage assets.

All of the above use cases have profound implications for the sociotechnical system of the organisation. The architecture of the software needs to change to match the new boundaries, the software engineering practices need to evolve to provide fast feedback, and the structure and composition of the teams will evolve to enable a fast flow of change. And the financial and accounting processes need to be redesigned from the ground up to support it (it is the subject of the blog post, but hey, let me be explicit). It is a tall order to evolve an organisation, but with the right incentives in the environment (your organisation), the behaviour will start to change over time.

In conclusion

Any organisation is a complex system. And our industry tends to focus on the people rather than the interactions between people, technical procedures, policies and processes. However, there are better approaches than that. From complexity science, we know that interactions in a complex system are vital. Kurt Lewin was one of the modern pioneers of social, organisational, and applied psychology.

In any sociotechnical system, it is essential to understand how people are constrained and motivated to do their jobs. Changing the constraints and policies at the financial and accounting level allows people to think in different ways to deliver value, shortening the feedback cycles. An organisation can maximise its digital capabilities if it has shorter feedback cycles.

I would love to hear from you. What are the other constraints that you see at play? What are the success and horror stories that you experienced? You can leave a comment or contact me directly.

Skills required for a CTO - be transformational

2022-07-21T22:00:00+00:00

Today, most companies have a CTO. Title, responsibilities, and expectations of the job may vary per company, as there is no singular definition of a “CTO”. But what I notice from various assignments in the software and technology industry is a convergence of the skills required to fill the role of a CTO. Communication and technology vision are frequently mentioned as critical skills. Werner Vogels wrote a blog post in which he describes the CTO role history and the different CTO archetypes.

From my experience as interim CTO and consultant, I have observed an emerging pattern in the industry. Given the critical role of software and data in an organisation’s strategy and the ever-changing environment in which companies operate, the organisation will have a competitive advantage if the CTO also has transformational skills. It becomes more apparent in digital-native companies whose business model centres on selling a software-powered digital product.

Photo by Clark Tibbs on Unsplash

The concept is not new. In the book Accelerate, Dr. Nicole Forsgren, Gene Kim and Jez Humble found in their research that in high-performing organisations, the leadership is transformational instead of servant. They explore the premise that a high-performing organisation needs to be constantly evolving since the user needs shift, and transformational leadership can give direction and guide the organisation’s evolution.

But what does it mean?

Several qualities help to be a transformational CTO, which are interconnected. This blog post will not explore all the skills and qualities of a modern-day CTO but rather the ones connected to the skill of being transformational.

First and foremost is the ability to read weak signals. In short, there are signals before any event (a staff member resigning or an economic downturn, to name two examples). Those signals can be very subtle (weak) and go almost unnoticed. The ability to capture the signals is the first step in building a transformational skill. Alongside the ability to read weak signals comes the ability to make sense of them. Just capturing signals is not enough; it is necessary to make sense of them, which is a tremendous challenge since reality is complex and chaotic. Let me be more specific: to be transformational, it is crucial to relate different pieces of information. And those relations emerge from the various signals (both strong and weak).

Showing empathic communication is vital. Other authors have mentioned communication before, and I want to emphasise the empathic part. People can communicate with empathy by changing styles, adapting them to the context, and dealing with different situational nuances. My experiences taught me that using empathy in my communication reinforced relationships, even when the content of the message was not positive.

To be effective as a transformational leader, there are two other important qualities:

envision and communicate the principles in which the organisation operates (not only technology related), and
know how to navigate the organisation to implement structural changes. As the environment around the organisation changes, the organisation needs to evolve.

It means that the structure and responsibilities change, and as a transformational leader, it is crucial to identify the operating model principles that hamper or facilitate the organisation’s evolution. Due to organisational inertia, it is common for principles to become dogmas, and people stop challenging them. Evolving the operating model is not a one-off activity; having the organisation’s buy-in will smooth the process. Knowing how to navigate the stakeholders and taking people along on the journey will increase the likelihood of success in changing the organisation’s structure.

These qualities will compound to the transformational skill of a CTO. But it doesn’t end here. In a digital-native company, a CTO is privileged to contribute to the organisation’s strategy rather than having a supporting role. The business proposition is centred on a digital product, and typically it will fall under the responsibility of the CTO. The company’s operating model will be critical for its success. An operating model bridges the company strategy to execution and delivery. A misfit in the bridge (the operating model) will limit the execution and delivery capability of the company strategy. The operating model is connected to the size of the company and the product lifecycle, to name two dimensions.

Where do we go from here?

As a community of leaders in our industry, we still have a long journey ahead of us. Sharing our field stories, failures and successes will inspire others and accelerate our learnings. I believe that the new generation of CTOs (or CPTOs, as some companies redefine the role) are driving the evolution of a digital-native organisation. More specifically, the evolution of their operating model. Gone are the days when a CTO was required to manage an organisation’s technology. Today, a CTO is in a privileged position to capture weak signals and make sense of them. I also recognise that it is not an easy task; it’s based on several factors, such as the complexity of the challenge (evolving an operating model) and the demand of the role (yes, still responsible for the digital product).

In my personal journey, I started to invest in my qualities of reading and understanding weak signals. I’m inspired by the Cynefin community and how they practice sense-making. I even embedded some of their ways into my practice as an interim CTO/consultant.

For example, during department updates, I pose sense-making questions at the beginning and end of the update, allowing people to share their perceptions. In addition to that, I push myself to take notes and have self-reflection moments. The self-reflection is focused on my experiences, and by writing down the experiences, I can identify and amplify the signals. These regular practices helped me to create a foundation on which I can continue building the other qualities.

Being transformational as a CTO is a learnable skill, which I tuned with different professional experiences and knowledge from different communities. Every one of us has a different journey, but in the end, the goal should be to create an environment where people enjoy doing their work!

Thoughts on organizing architecture

2021-08-03T10:00:00+00:00

When you are part of an enterprise, you will meet different architects on any given day. The first one introduces themselves as a solution architect, the other calls themselves the enterprise architect, and they both mention a domain architect. It might feel like different names for the same thing, and perhaps there is an even bigger question: do we even need all of these different architects? Should the team not be able to make all of these architectural decisions by themselves?

Do we need architects anyway?

Looking at the current technological and organizational paradigm, we can only recognize the world is massively different from 10 or 20 years ago. We live in a world where we can instantly make use of infrastructure via cloud providers. Using common software, functionalities can be purchased and integrated with the click of a button and the availability of a credit card.

Gone are the days of creating large project plans and business cases. No need to negotiate the proposed solution for any given problem with the budget holder. No long debates with other engineers about the envisioned solution. Great for our agility, but it also has consequences.

Value-stream teams have been given more autonomy and possibilities to select, purchase and integrate hardware and software, be it via cloud providers where you can autoscale your infrastructure, or via Software-as-a-Service providers who offer you functionality out of the box. Gone are the days of making well-thought-out documents that are reviewed and tested by colleagues in the organization.

Clearly this benefits the speed of delivery and flexibility in choosing solutions. Consequently, however, it requires more maturity from a team. The solution proposed and decisions made have to fit not only the context of the team, but also that of the organization. The complexity and pressure to guard consistency and the best interest of the organization as a whole now rest on the value-stream teams. The tension of choosing between different options and stakeholders now falls solely on the team.

The architect function is key in getting to conscious decisions that are beneficial for the customers and the organization as a whole. Either by limiting the number of options that are available for a team, or by refining and improving the reasoning and acceptance of trade-offs between proposed solutions.

Organizing architecture guided by two perspectives

First of all, architectural scopes are not to be seen as static elements. Architects should be dynamic, understanding both the purpose management has for the organization and the engineering challenges of the development teams. Gregor Hohpe describes this as riding the elevator. One way to determine scope is to look at complexity. The scope of complexity for a value-stream team is different from the complexity of a whole organization.

Inspired by Ruth Malan and Dana Bredemeyer, we can use a model where the architecture function is organized along two axes: the vertical axis is the locality of decision-making, and the horizontal axis is the complexity and detail of the challenge:

Let’s start with the axis of complexity. The scope of a team often concerns a limited number of components, microservices or other functionalities. The main objective for the team here is to ensure that the functionality of the component meets or exceeds the customers’ expectations.

As we move up to domain-oriented units, business lines and enterprises, we see an increase in the complexity. On business-line and enterprise level, for example, the concerns are different from a team. One has to balance the best interest of multiple teams, commercial interests and organizational concerns. Simply put, the world is a bit bigger and increasingly complex.

Decision-making is the other axis to consider, especially the impact of a decision. The decisions made on an organizational level typically offer boundaries and guidelines to the organization. These are meant to bound the options available to the smaller units of the organization and aid in better decision-making for those units, in the best interest of both the unit and the organization. We can frame it as follows: stewardship increases with complexity, and decision-making increases with locality:

What about the architecture role?

As a starter, we see architecture as a function. A function that can be delegated to a day-to-day role, or a function that is attributed to an existing role. This could be a principal engineer or assigned to a group of people, e.g. the product engineering team. A senior engineer or team lead can be appointed as the tie-breaker.

For an organization, it is important to be conscious about the attitude of the architects it hires and puts in charge. The character and way of working of the architect function has a huge impact on the engineering culture. Putting a benevolent dictator in charge has a different impact than an architect that coaches and supports the team in their architecture decisions.

However, context matters here. An architect for an in-house product engineering department requires different capabilities and attitudes compared to an architect that has to work with vendors and ensure successful integration. The latter architect needs to be stronger in vendor management and the corresponding negotiation.

Do I need an architect?

Again, context matters and generic answers can’t apply here. However, we have some considerations to think about.

Do you have the capabilities and time to guide the technical decisions? Especially as a manager of a small technical department, can you afford the time, and do you have the capabilities, to guide the decision-making process?
Can the complexity of an organizational unit and the impact of its decisions be managed via a shared responsibility by a group of people? For example, a group of mature principal engineers with a clear decision framework could work better than a dedicated domain architect.

Whilst we are talking here about the architecture function, we also propose that these principles and guidance apply to a broader set of disciplines. Design and user experience is a prime example of this. A design system in a way is a decision-framework that enables local decision-making and ensures compliance to the corporate brand and experience.

This blog post was written by Paul de Raaij and João Rosa

EventStorming as a cultural assessment

2021-06-30T10:00:00+00:00

We are on a quest….

As consultants, we are challenged by our customers not only with the issues that they are facing in the technology space but also how it affects the organisational structures and the culture. Based on our experiences, EventStorming is a great technique to expose the underlying cultural aspects of an organisation, while focusing on the value streams and technology. In this post, we are sharing what we have learned, by giving examples from our experiences that hopefully inspire you to use EventStorming as a cultural assessment.

But before that, a bit of theory (yeah, we know, can be boring, but we promise that it’s short)…

Sociotechnical systems

We operate in sociotechnical systems, where the social practices, our cognitive processes, and technology are at play. The definition used in this post is from Jabe Bloom:

Sociotechnical architecture by Jabe Bloom

A sociotechnical system can be summarised as the collision space between the social practices and cognitive processes with the technology, and how we (humans) perceive and use it. Think about smartphones (and their predecessors): what started as a mobile device to make calls on the go, evolved to have more capabilities, to a point where we are able to do most of our shopping on the device. There are even regions around the globe where the usage of smartphones is higher than computers. The use cases for the usage of a smartphone increased, and it shaped how we (society) operate: from making a call on the go, to being able to book a table in your restaurant of choice in a few clicks.

Looking from a distance, we can see our behaviour changing, and value being created based on technological advancements. As our behaviour as humans changes, the culture also changes. Culture lives between us, and, most of the time, we are not aware of it. It is the unwritten set of protocols and expected behaviours, and all of this knowledge is implicit. It becomes visible when a person doesn’t use the expected protocol when communicating with others.

Culture impacts technology

Here, in The Netherlands, we start with greetings, a bit of chit-chat, and then dive into the content. Imagine if I skip the first two? Could feel awkward to the other party. Culture influences how we act, and in a sociotechnical system how we interact with technology and how we produce technology. Culture impacts technology. At the same time, every time that there is a technological enhancement, and it goes mainstream, it starts to affect culture. You can think about the smartphone and its impact on our culture. Technology impacts culture. It’s a dance, and most of the time we are not aware of it.

As such, it is necessary to take a holistic approach, and make an ongoing effort, to maintain a sound culture. One-off interventions will not produce the awareness that is needed to sustain the culture within your organisation. Although it sounds simple, it is a complex challenge.

EventStorming as cultural assessment

EventStorming is a technique created by Alberto Brandolini. Born inside the Domain-Driven Design community, it started to be used to create useful software models based on the Domain-Driven Design tactical patterns. It evolved as more practitioners adopted it, and we use it with a dual purpose: the original intent to create useful software models, and as a cultural assessment. The latter almost happened by accident. “How come?”, you may ask.

Let’s take a few steps, starting by explaining a bit about EventStorming, our facilitation approach and how it can be used as a cultural assessment.

EventStorming in two paragraphs (or less)

EventStorming is all about collaborative modelling; designing boundaries, a shared language and understanding. It’s a technique that depends on the knowledge and behaviour of people. The sessions aim to have people with questions (usually teams that create software) and people with answers (domain or subject matter experts) in the same room. Having people together, and using EventStorming as a visual collaboration technique, triggers discussions about processes and the underlying software (famously known as build the right thing).

Our facilitation approach

EventStorming is not the only tool in our toolbox, but it is our go-to tool to engage with organisations that want to improve their software development process (from any angle). As such, we prefer to have at least two facilitators in a given session. The idea is that one person actively facilitates the group, and the other(s) observe the group. When observing, we pay attention to non-verbal behaviour, facial expressions, who speaks up, and whether any knowledge is being suppressed. It works in person as well as remotely, although the way to facilitate and the way to observe are different.

After each session, the facilitators debrief about what happened in the group during the session. It is important to do it right after the session, while the experiences are fresh in memory; together with the session notes, this can generate new insights into the group dynamics.

How we started to use EventStorming as a cultural assessment

As time went on, and we facilitated more sessions, we started to discover patterns based on the EventStorming sessions. At that point, we put on our scientist hat and formulated different hypotheses for what we were observing. From there, we ran different experiments in different sessions, trying to understand whether our hypotheses were correct or not.

Our conclusion was that EventStorming exposes the group dynamics and cultural norms that influence the process of creating software: in essence, the sociotechnical system. Interestingly enough, when we cross-checked the observations with field interviews, we were able to verify the weak signals and collect more stories that allowed us to have a broader picture of the organisation. Note that EventStorming sessions are usually not a longer-term engagement. That means we will only observe a moment in time. Nevertheless, we feel that these observations are a good foundation for follow-up conversations about culture.

Today we can state that EventStorming is one of the tools that can be used in sense-making, as described by the Cynefin community. Sense-making is critical to understand the different perspectives of people and the thinking/opinion overlaps (if any) in a heterogeneous group. Bringing it closer to software: in an organisation that creates software, there are different opinions about the process or even the products to be created, and it is important to understand and manage the difference of opinions. We will share some of our sense-making techniques in a follow-up post.

EventStorming cultural observations

There are lots of signals - weak and strong - that tell something about the culture within a group of people, team or organisation. Over the years, we have identified a couple of categories that affect the social, cognitive and technical aspects of the sociotechnical systems we’re living in. Allow us to discuss a few of them with you.

Ranking and power play

Ranking is a phenomenon that we all have experience with. How we perceive ranking, and whether it’s hindering or helping us, depends on our personal rank in different situations. Ranking in itself is not a bad thing. It helps us to make sense of the world and process information quickly. It can become a problem when we are not aware of its presence and effect on our culture.

The main distinction we have to make here is between explicit ranking (organisational chart position, job title, a formal level of power, etc.) and implicit ranking (gender, skin colour, perceived level of charisma, level of informal power, etc.). Ranking is also very much present in collaborative modelling sessions. It determines who speaks more, who takes the lead, who expresses an opinion and who decides not to share an opinion. That last example means that ranking can lead to missed opportunities, or suppressed knowledge in these sessions.

We are all conditioned to have a “symbolic ideal” of what is dominant/strong, and what is dominated/weak. That’s not a judgement, it’s just how society is wired. We all share a similar picture. The more traits someone has from this infinite list, the more we see this person subconsciously as dominant/strong. And we are conditioned to yield more power to dominant/strong people. (If you want to know more about symbolic violence, make sure to watch this talk from Romeu Moura).

Especially during EventStorming sessions, ranking is present because we invite a lot of people from different departments and ranks. This can hinder the process if the group isn’t aware of it, or if people play their high rank to their own advantage: speak the most, don’t ask others to pitch in, downplay other perspectives. It’s up to us as facilitators to spot these patterns and break them. We do this mostly by asking questions and sharing so-called weather reports: “What makes you say this?” “Is there anyone that has a different perspective?” “We observe that not everyone is participating in the conversations, does anyone recognize that?” Whatever you do, NEVER put anyone on the spot by addressing them personally. That can create a very unsafe environment.

If you’re aware of your rank, make sure you act upon it. First of all, own your rank. If you’re the CEO you are not equal. You own budgets and can make decisions. That’s fine. Do make sure that you share your rank. Because of your rank, others might feel hesitant to share their opinion and knowledge. Make sure there’s room and psychological safety to do that. Ask questions, ask for other opinions and make it clear that you are looking for expert opinions from the people in the group.

Cognitive bias towards problems or solutions

There is so much information coming at us that we need to process continuously. Preferably in an optimal manner. So what we do in order to make sense of the world - to cope with all this information without going crazy over it - is creating our own “subjective reality” from our perception of the input we’re getting. Our brain is searching for patterns, familiar situations and recognition in an attempt to simplify information processing. This often results in cognitive bias.

A cognitive bias is a systematic error in thinking that occurs when people are processing and interpreting information in the world around them. It affects the decisions and judgments that people make. You can imagine that especially in collaborative modelling sessions, (new) bits of information are flying around. So we can use all the help we can get to make sense of it. Cognitive bias is a familiar guest at these sessions. Recognizing them and being able to challenge and/or counter them will benefit the outcomes of your session.

As facilitators, it’s easier to spot cognitive bias, because you’re not “part of the group”. Signalling “functional fixedness” for example. This bias keeps people from seeing the full range of solutions to a problem and affects the ideas that are generated and considered. What we know and are used to hinders us from taking on new perspectives and limits ideation and problem-solving. After signalling it, we try to counter this functional fixedness by encouraging people to “model something wrong” on purpose. Express the wild and crazy ideas to see if we can overcome our limitations. A quick example: in our training, we use the cinema domain. As part of searching for solutions “outside-of-the-box”, someone suggested removing all seats and letting everyone enter the cinema in an inflatable bubble. This is what will counter functional fixedness. Plus, it’s super fun!

There are way more cognitive biases to spot and signal. Take a look at this codex. In this talk, our colleague Kenny and Evelyn talk about a few more that show up in collaborative modelling sessions.

Group dynamics

EventStorming shouldn’t be done in isolation, which means you’re always dealing with group dynamics, relationships, conflicts and polarities. Group dynamics can amplify ranking, power play and cognitive bias. It can also bring conflicts and polarities to the surface that need to be managed properly in order to get the most out of a collaborative modelling session.

Let us be clear: conflict is not a bad thing. It’s often very helpful if managed properly. To us, conflict doesn’t mean raging anger. It means that there are underlying dilemmas that result in opposite views. The problem we see very often in organizations is that conflicts arise because a both/and polarity is treated like an either/or conflict (a problem) that can be solved. ‘Where are we going for dinner?’ is an either/or conflict we can solve; ‘When do we go from collaborative modelling to coding?’ is a both/and polarity that we need to manage. If you want to know more about this important distinction and how to map polarities, this blogpost is everything you need.

During EventStorming sessions, it’s not uncommon that people are convincing each other of why they are right and the other person is wrong. But what if they’re both right, but neither is complete? Incompleteness combined with our conviction of being right (accurate), fed by cognitive bias, is an important source of potential problems when managing polarities. As facilitators, we always strive for completeness rather than accuracy, making sure all perspectives and knowledge are included on the (digital) brown paper. We facilitate the conversation so everyone can let go of their view in order to see the other. That way, we can move back and forth in that polarity and leverage the benefits of both sides.

How conflict and polarities are managed during an EventStorming session can provide a lot of signals about the culture within the team. Are people open to other views or are they holding on to their version of the truth? Is there room for counter arguments? Does the group find it more important to be right or to be complete? Do people feel safe enough to share their (minority) view and thereby add a polarity? There are a lot of observations to discuss based on this group behaviour.

Ok, so where do I start now?

From our experience, EventStorming can help you assess the culture of a company. Due to the short-term nature of these sessions, you have to be mindful of generalizing. Nevertheless, the signals and observations you collect will help you to continue the valuable follow-up conversations. Be on the lookout for ranking and power play within the group, cognitive bias affecting decisions and collaboration, and the presence or absence of (un)healthy conflict. Based on this, you can determine follow-up actions, not only regarding the discovered process, but also in terms of organisational and team development.

This blog post was written by Evelyn van Kelle and João Rosa

Mental models: a reflection on AWS outage

2021-01-11T10:00:00+00:00

In November 2020, AWS had a major outage, which started with their Kinesis service and cascaded into failures in several other services. There are several articles and analyses of the outage, including the official note from AWS. This blog post reflects on the outage, but rather than focusing on the technical aspects, I will dive deep into the social ones, namely mental models.

Pre-reflection, what is a mental model?

I will borrow two definitions for the mental models; the first one from Kenneth Craik:

If the organism carries a “small-scale model” of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilise the knowledge of past events in dealing with the present and future, and in every way to react in a much fuller, safer, and more competent manner to the emergencies which face it.

And the second one from Jay Wright Forrester, where he defines general mental models:

The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.

I can say that a mental model is a representation of concepts and their relationships, which helps us to reason about a problem/challenge/whatever needs mental capacity. It is influenced by our bias, past experiences, knowledge and information. We use these models to try to anticipate future events, and they shape our behaviour. Our mental models are not static; they evolve, although we are not aware of it most of the time. Note that I’m not a specialist in behavioural or cognitive science, and there is more behind it, but we can stick with these definitions and my own spin for now.

My reflection

On the AWS official note, towards the end of the message, it states:

Outside of the service issues, we experienced some delays in communicating service status to customers during the early part of this event. We have two ways of communicating during operational events – the Service Health Dashboard, which is our public dashboard to alert all customers of broad operational issues, and the Personal Health Dashboard, which we use to communicate directly with impacted customers. With an event such as this one, we typically post to the Service Health Dashboard. During the early part of this event, we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event. We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators. To ensure customers were getting timely updates, the support team used the Personal Health Dashboard to notify impacted customers if they were impacted by the service issues. We also posted a global banner summary on the Service Health Dashboard to ensure customers had broad visibility into the event. During the remainder of the event, we continued using a combination of the Service Health Dashboard, both with global banner summaries and service-specific details, while also continuing to update impacted customers via Personal Health Dashboard. Going forward, we have changed our support training to ensure that our support engineers are regularly trained on the backup tool for posting to the Service Health Dashboard.

I was drawn to this paragraph, and it is the reason for this blog post: mental models and the ripple effects on a sociotechnical system. As stated in the paragraph, AWS service operators have a backup procedure to update the Service Health Dashboard. I noticed the delays, where the publicly available information was reporting that everything was fine, but services running on top of the AWS cloud were down. And the well-known DownDetector was reporting the outage.

I assume that AWS services are stable (actually top of the line, a fact), and people got used to the reliability of the services; people don’t factor the secondary processes during an outage into their mental models (this is my assumption). From my experience in the industry, we don’t test the incident procedures (whatever the type of incident); we still have a lot to learn from other industries, which have matured their practices to decrease the risks during an outage. I fell into this trap and saw my peers do the same (this is the start of my reflection). I call it the reliability trap.

How to escape from the reliability trap?

The reliability trap is common in our industry. I describe the reliability trap as what happens when people, teams and organisations take IT systems and processes for granted, leading to simplified versions of their mental models. It is common when organisations have outstanding engineering practices and the IT systems are highly reliable (the Amazon example), or when the IT systems and processes haven’t been exposed to a diversity of use cases (an increase in the load, or people using them for purposes other than the initial intent).

During my years as an engineer, architect, consultant and CTO, I “developed” ladders to get out of the trap and increased my awareness of it. My goal is to avoid falling into the reliability trap again. I will summarise my learnings and why I use them.

First and foremost, I use Wardley Maps to model the components that underpin the scope of the company I work with. It helps me to develop an awareness of the landscape, and I can generate different options for the various components (think about what-if type of scenarios). Next to it, and because we operate in sociotechnical systems, I practice and use sense-making, which I borrowed from Cynefin. Rather than drawing my own conclusions based on my mental models (which are highly influenced by my bias and assumptions), I use the sense-making practices and tools to generate insights. Having those two practices and steps together helped me reason about how people perceive their roles and their processes.

To verify the potential consequences of the options generated, I practice Chaos Engineering. Although Chaos Engineering was born at the technical level, I also use it at the process and organisational levels. The principles are the same, but we don’t have the same tools available. We are humans, and we can leverage our creativity to design experiments to verify our assumptions at a broader scope. The goal is to verify how the sociotechnical system behaves, how people and teams engage, and whether the processes are followed or are outdated. You can find more details in a previous post. It is helpful because it provides a safe space to verify the assumptions and mental models; if there are reliable systems, and people got used to them, it will uncover the risks that would otherwise only surface in a real-life scenario (a real outage). Having these insights and experiences firsthand, people can challenge their mental models, and the organisation can provide support to mitigate the risks (think about training programs, and adjusting processes that are not useful and only generate hindrance).

In the aftermath of the chaos experiments, and as part of the retrospectives, I tend to use Maturity Mapping, a variant of Wardley Maps that blends Meaning, Material and Competence (more on the concepts and the why here). Enhancing the initial Wardley Maps with the result of the chaos experiments allows people to discuss the delta; by delta, I mean the difference between the mental models of the landscape and the result of the chaos experiments. Having the delta in a map, namely a Maturity Map, captures the different perspectives in a sociotechnical system, and has the benefit of enabling a discussion about the agency and purpose of teams, on top of the different mental models. As a CTO, I can support the organisation with the proper training, facilitation, coaching and mentoring, since we have a common ground. People can also challenge me to improve since my own behaviour and mental models influence the sociotechnical system; I can look for training, coaching, and mentoring to close the gap.

In summary

Today, in a world that moves faster than we have ever experienced, where we are not fully aware of all the connections (it is a complex challenge, beyond our ability to have accurate mental models), it is paramount that organisations are empowered to keep their options open. Also, before executing a strategy, we have a body of knowledge that allows us to test and verify the strategy’s direction, minimise the impact, and avoid sunk cost fallacies that can have disastrous results (in the extreme, a company can fall).

I used these practices, and I’m still learning :) For the challenges that I faced, those practices were able to deliver the expected results. However, there is a big world out there, and I’m interested in your experiences in this field. How do you recover from and avoid the reliability trap?

Organisational structures to create autonomy: what I’ve learned from my daughter

2020-11-24T22:00:00+00:00

I’m grateful to learn from my daughter, and to be able to see how the brain develops and picks up new concepts, skills and words. Nowadays, I enjoy sitting down and watching her play. As a parent, I also need to help her to achieve her autonomy: emotionally, mentally and physically. It is the job, and my wife and I are navigating through it. There is no guide on how to parent, and we discuss what is working and what is not working. Sometimes it is just not the time for a new skill; other times our daughter doesn’t develop an interest in a particular activity. Well, we are all different, and that is what makes the world colourful!

Recently I was looking at one of the structures that we created to help our daughter to achieve her autonomy:

A learning tower, allowing my daughter to execute her tasks

Yes, you are seeing it right: we have a so-called learning tower, allowing kids to reach greater heights. They can climb and reach the countertops, with a high degree of safety. This one is in our bathroom, and she uses it to help her with her morning routine.

What the heck is the relationship to organisational design?!

I can imagine that this is the question in your head. Well, I will continue to tell the story… We built the climbing towers after she started to walk without our help. My wife saw this concept in a parenting community, and we applied it. In the beginning, we asked our daughter if she wanted to be in the climbing tower, and if she said yes, we would put her there. It was a “show and tell” type of style, given that it was a new concept to her. She used it for tasks such as brushing her teeth, washing her hands, watching us cook, or making a beverage. You can imagine the type of tasks.

What I observed over time is that the brain embraced the concept, and when we nudged her for a task, let’s say dinner, she would go near the tower. It is a sign that she recognises the patterns, and knows the tools that can help her achieve the goal. Lately, she is proactive: we wake up in the morning, and she goes to the bathroom, climbs the tower and asks for her toothbrush. Most of the time, we don’t need to nudge her.

Applicability to organisations

With this story, I hope that I sparked your imagination. The bottom line is that as managers, in a sociotechnical system, we should create structures that help people and teams to achieve their autonomy and agency. There is a wide range of practices, rituals and tools that can help, and all of them are context-dependent. To make it tangible, let’s say that you are a company with 375 people, and you are an online photo printing solution. The solution is in the cloud, and you take care of producing and delivering the photos worldwide. You have a diverse landscape, with custom software and standard hardware (printers). You rely on partners for the logistics operations, and you have a distributed workforce across the globe, to be close to the customers. You can picture other details… As director of software, you are responsible for the custom software that your customers are delighted by.

You can think about different practices, rituals and tools to help autonomy and agency within the company. Let’s explore some very concrete scenarios, which I hope you can relate to.

Onboarding of new colleagues

A new software engineer is hired. She will join one of the teams, and as part of the onboarding process, she has the starter pack, where she can find the information about the contact person for different operations. Information such as pension and salary declarations is requested from the HR department, vacations are booked in the internal portal, and the software development process can be consulted in the wiki. There is also a public schedule of the different rituals of the teams, such as demos.

The starter pack allows her to have an idea about the owners of different information and processes, allowing her to be autonomous. How many times did we move between jobs and at the same time move between houses? The real estate agencies require a lot of paperwork, and most of it is provided by our employer.

Communities of practice

The company is growing, and the software, which was a monolith managed by two teams, is now a distributed set of services. The services are loosely coupled and operated by 15 teams. As the complexity of the system increased, the teams adopted principles of observability, and they nail logging and metrics. However, they want to invest in tracing, to increase their capabilities and streamline the development and operations processes.

As such, a community of practice was created. People from different teams come together and work on common problems, with the end in mind. It allows space to co-create, diverge and converge, and experiment, until patterns and practices start to emerge and are useful to the different teams. Other communities of practice can exist in different areas. This is not an “or” exercise; it is more like an “and” one.

Communities of practice enable agency of teams, where they can continue to align with the organisation’s purpose, but devise their own way to execute their mission.

A new service is added to the landscape

The company has leveraged cloud services for the last 7 years, but recently they started to move their services from infrastructure-as-a-service to cloud-native, where applicable. It has an impact on the services and the way they are bootstrapped and wired. The platform team created a few templates with different goals: on one side, how to migrate the services when the paradigm changes, and on the other side, what the bare bones are to bootstrap the basics of a new service (authentication, authorisation, monitoring, etc.).

Having these templates, a team reduces their lead time to create a new service. The templates and documentation allow them to understand how things are wired and how cloud-native can be leveraged. Once again, the teams can maintain their agency, reducing the burden on the platform team for the creation of new services. Also, there is a chance for evolution, as teams can contribute to the templates, co-evolving them as new insights are generated.

Organisations are complex sociotechnical systems

My goal with the examples is to provide some relatable scenarios, and I believe that stories are a great vehicle to convey a message. However, stories are context-bound, and the way that people or teams achieve something might not be feasible in a different context. Thus my statement that **organisations are complex sociotechnical systems**, and as managers, we need to be aware of complexity theory and how to navigate the inherent complexity.

I’m a student of complexity theory, and of Cynefin, the work of Dave Snowden. The Cynefin framework is a sense-making framework, allowing us to make sense of the world. I will not go into the details of Cynefin; this is not the goal of the blog post, nor am I a specialist. I’m just a student.

My point is that you can use Cynefin to make sense of the different needs of your organisation and help people to set up the additional structures to create agency and autonomy. A great story that exemplifies it is “Cultivating Leadership with Cynefin: from tool to mindset” from Jennifer Garvey-Berger, Carolyn Coughlin, Keith Johnston and Jim Wicks, in the book Cynefin, weaving sense-making into the fabric of our world. In this story, they show us how an organisation grew, and which structures they added to allow the autonomy of the consultants, allowing their consultancy practice to evolve - a prime example of a Teal organisation.

Sometimes organisational structures to create autonomy and agency are not used on the spot. You need to nudge the system, observe and reflect. Sometimes the concept is new, and people and teams are getting used to it, trying to find the applicability. Other times, the structure is useless, and if that is the case just discard it. As I wrote in the beginning, our daughter got used to the learning tower, and across time increased her autonomy.

I hope this sparked some ideas for you. What are the organisational structures that allow autonomy and agency in your company?

Diverge and converge to create a Context Map

2020-10-23T09:22:51+00:00

Context Map was the first visualisation for the Bounded Context pattern from Domain-Driven Design. In a nutshell, it is a map of the different Bounded Contexts and their relationships. I tend to create a Context Map during or after a Big Picture EventStorming. Changing perspectives can be helpful, to challenge assumptions and get the best of different techniques.

However, sometimes it is hard to reach a consensus on the Context Map. I often operate in brownfield projects, with large organisations. Although people eventually agree on the different bounded contexts, it is a process that takes time and, most significantly, energy. This can lead to fatigue towards the method, and at the same time raises exciting patterns in the behaviours. But this blog post is not about emergent behaviour. :)

To tackle it, and inspired by the Divergence and Convergence cycles from Design Thinking, I started to apply the same pattern to my Context Mapping workshops. After a Big Picture EventStorming, I give the theory about Bounded Contexts and Context Mapping. You can find useful resources from the DDD Crew.

I split the group into small groups of 3 to 4 people. Each group needs to create a Context Map for their domain, using the output of the EventStorming. As they generate insights, they challenge the EventStorming and document those challenges. We can have several rounds, timeboxed (the wonders of timeboxing). After that, each group does a show and tell about their reasoning and thought process. People can ask questions, challenging their model. As the process goes, we collect these questions and ideas. This is what I call the diverge phase. Each group creates their own version of the Context Map, but because the discussion is in small groups, it can be more effective.

The next phase is to start to converge. I ask one person to volunteer and to collect all the bounded contexts that are the same across the different groups. In this way, we can get 50% to 80% of the Context Map done, with the advantage that it is the part that people agree on. It avoids the fatigue of rehashing discussions. The next step in this phase is to start to decide on the “leftovers”, which need more time to discuss. From my experience, this step is the most interesting one, because people have their own mental models about the bounded contexts. Focusing the energy on discussing the differences and trade-offs of the boundaries leads to a better outcome, and I can guide people through the discussion on the boundaries of the more complicated and complex problems. This last step can take a few hours, or a few sessions over time, depending on the size of the domain. As a rule of thumb, I tend to split it over multiple sessions, to allow people to think about it. Some of the best ideas happen in the shower!
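
For readers who like to see the converge step in code, here is a minimal sketch in Python. It assumes each group's Context Map has been reduced to a set of bounded context names; the group names, the context names (borrowed from a cinema domain) and the `converge` function are all hypothetical and only serve as illustration. Matching on names is a simplification: two groups can use the same name for different boundaries, which is exactly what the conversation in the room has to uncover.

```python
from collections import Counter


def converge(group_maps):
    """Split proposed bounded contexts into agreed ones and leftovers.

    group_maps maps a group name to the set of bounded context names
    that group proposed (a hypothetical, simplified representation).
    """
    # Count in how many groups each bounded context name appears.
    counts = Counter(
        name for contexts in group_maps.values() for name in set(contexts)
    )
    total = len(group_maps)
    # Agreed: every group proposed it. Leftovers: at least one group did not.
    agreed = sorted(name for name, n in counts.items() if n == total)
    leftovers = sorted(name for name, n in counts.items() if n < total)
    return agreed, leftovers


# Invented example output of three small groups.
group_maps = {
    "group_a": {"Ticketing", "Scheduling", "Payments", "Loyalty"},
    "group_b": {"Ticketing", "Scheduling", "Payments", "Concessions"},
    "group_c": {"Ticketing", "Scheduling", "Payments", "Loyalty"},
}

agreed, leftovers = converge(group_maps)
print("Agreed:", agreed)
print("Leftovers to discuss:", leftovers)
```

The agreed list is the part of the map that needs no further debate; the leftovers list is the agenda for the follow-up sessions.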

You may ask if it is useful both physically and remotely. And the answer is yes. Physically you need more modelling space, where groups can discuss but not bias each other. Online you need tools that allow people to be isolated in their own room. In the end, it will produce the same output.

I’m wondering about your experiences. How do you get consensus over a Context Map?

Chaos Engineering as management practice

2020-10-21T09:22:51+00:00

Chaos Engineering is a practice that has its roots at Netflix. It was born from the challenges of moving their workloads from the data centre to the cloud; the transient nature of the cloud affected the way that they build and operate a system at scale. The initial project was called Chaos Monkey, and it is almost 10 years old.

Since then, the community has grown, fuelled by Netflix practitioners. Today there are commercial and open-source tools, and we can see more initiatives in different communities. The technical practices have matured, and the knowledge has started to spread in the IT world.

However, it is still perceived as a technical practice. Can we leverage Chaos Engineering as a management practice?

Role of a manager in today’s world

The role of a manager has changed: from a prescriptive approach, where the manager is leading the activities of people (managing and sometimes micromanaging), towards a supportive and transformational role. Today a manager needs to give structure and create a safe space for people and teams to excel in their fields. And the reason is that organisations are striving to be adaptive (anti-fragile), and it requires a new type of leadership.

Organisations require managers that take a leadership role in creating safety for both individuals and teams. Research shows that elite organisations have 5 capabilities with regards to culture and work environment: climate for learning, Westrum organisation culture, culture of psychological safety, job satisfaction and identity.

The reality is that sociotechnical systems are getting more complex, and it is not possible to have the system’s model in our heads. Today there are more components in the system, and to generate value, we have more inter-dependencies than we can imagine. As such, managers need to shift their mental models and practices. As an example, from a command and control style to a mission command style. Moving between these two styles requires a safe environment, where trust is the cornerstone.

But what are other practices a manager should embrace?

A bit of theory… And then we move on.

Options and costs of events, and their stabilisation

Jabe Bloom distilled a model for events. By event, it means anything that might affect the sociotechnical system. You can think about a spike in usage of your service because it went viral on social media, the holiday season, mergers and acquisitions of companies, or the current pandemic. And everything in between. Some of the events are known upfront; others can be labelled as “unexpected”.

Options and costs of events, and their stabilisation. Copyright Jabe Bloom

In a nutshell, in the pre-event phase, there are several options and their associated costs. As time goes by and the event approaches, the options decrease and the costs increase, until the point where the event starts. Note that you might be aware of the event or not. It depends on the ability to interpret the signals (weak or strong) and map those to create awareness. In the post-event phase, it tends to stabilise, both in terms of options and costs. You might have heard the saying, “never waste a good crisis”. Organisations learn (I hope so), and can use different strategies to handle future events of the same type.

I’m providing a simple explanation of the model, and I don’t intend to go into complexity theory. It is a topic (several books maybe) in itself. My point here is that events can be costly if managers are not able to interpret the signals and create sound options. Back to the pandemic example, some organisations managed the transition to working from home more smoothly than others. I challenge you to map some hypotheses why. Also, the ones that are struggling with the change are incurring higher costs, and it can be related to loss of productivity and market share, amongst others.

Now, let’s go to the next piece of theory…

Dynamic Safety Model

The Dynamic Safety Model was described by Jens Rasmussen in his paper titled Risk management in a dynamic society: a modelling problem. It can be illustrated by the following picture:

Illustration of the Dynamic Safety Model. Copyright https://risk-engineering.org/concept/Rasmussen-practical-drift

Casey Rosenthal and Nora Jones, in the book Chaos Engineering: System Resiliency in Practice, summarise the Dynamic Safety Model in terms of Economics, Workload and Safety, where software engineers are good at understanding the limits and trade-offs with economics and workload. Still, we often miss the limits and trade-offs in safety. I agree with the analysis, and I learned the same in my career. Sometimes I underestimate my decisions towards safety, and other times I overestimate. This made me retrospect based on the learnings of the past decade.

My retrospective

During my career, I have performed different roles. As a software engineer, I often focused on solving the problem at hand, and as a manager, I tried to optimise the system environment to maximise the value delivered. And as interim CTO and consultant, I help organisations to change their operating models to accommodate a fast-changing world. As part of my retrospective, I discovered that when we try to induce change in an organisation, even when we take the human aspects into consideration, people (and teams) usually fall back to the old behaviour as soon as a stressful event happens in the new operating model.

Let’s take a scoped and narrow example for the article’s sake. I have observed and worked with organisations that moved from a siloed environment, where the responsibilities of creating and operating software sit in different teams and/or departments, to end-to-end responsible teams (we can call them teams working according to DevOps principles), which need to design, build, release and operate the software in production. However, in an outage scenario, the teams struggle to react and fix the issue because they are not used to operating in the new context. Hero culture often arises, where individual contributors are the ones solving the problems. In the long run, it is not sufficient and can lead to a dysfunctional and toxic culture.

I hope that you can relate to this example. There is a bigger and broader array of examples, and the theme of people’s ability to change and management capabilities to nurture it is a big topic. My colleague Thomas Kruitbosch and I have frequent conversations about it, and he keeps challenging me on the matter. :)

In the previous example, stress is introduced into the system. Teams tend to overestimate their decisions in the short term (e.g., when the event is still present in their memory) and underestimate them in the medium and long term (e.g., when the event is a foggy memory). It matches Casey and Nora’s analysis of the safety boundaries. It also relates to what Ben Mosior and Jabe Bloom describe as skilful coping. I believe that managers can create safety, and coach and mentor people to improve their decision-making process regarding the safety limits. At the same time, this has a positive effect on the organisation’s culture, moving it towards openness and trust.

Chaos Engineering and management

Well, we get back to the title of this blog post: Chaos Engineering! Not only as a technical practice for the resilience of the technical system but, more critically, for the ability of an organisation to be truly adaptive, i.e., anti-fragile. A sociotechnical system has people and technology interacting, and as a manager who is responsible for part of the system, you have a tool to run controlled experiments in the system, allowing teams and individuals to practise their ability to cope with an event (skilful coping). The experiments show not only whether the technical system (software, infrastructure and everything in between) has the right measures in place, but also whether the processes (whatever processes an organisation adopts) are adequate for the context at play. It is very similar to the way of working of first responders, namely firefighters: 80% of their time is practising, and 20% of the time is in action.

And maybe you are wondering: “So what?”. Getting back to the theory chapter of this blog post, the lifespan of a company is full of events. Some with a more significant impact than others:

Cost of events over time.

Based on my experience, and using Jabe’s model, both the event itself and the period after stabilisation carry higher costs: bureaucracy to “prevent” events of the same type, leading to loss of productivity; health issues, such as anxiety and burn-out; people leaving the organisation due to an unstable environment; and so on… However, when Chaos Engineering is used as a regular practice, the cost of events tends to decrease:

Cost of events over time when using Chaos Engineering.

The costs of an event tend to decrease, and the stabilisation period is shorter. Using Chaos Engineering to inject controlled events into the system helps the organisation to test its options, to validate the different models against reality (most of the time, code and people’s behaviour), and to help people build skills that allow them to cope with events that are not induced by Chaos Engineering. Going back to the initial example, the holiday season, the organisation can exercise what a significant outage during peak holiday traffic means, and how people and teams will react to it. Given the context of the exercise, i.e., a controlled environment, stress levels are lower, and individual resilience is built. Also, insights can be generated on the technical side (there is enough literature on that). Based on the insights, individuals and teams can increase their sensitivity to the safety boundaries of the system, and managers can nurture the environment for it.

Last but not least, using Chaos Engineering will inject more events into the sociotechnical system. And as we know, in nature, biological systems evolve based on the events that they are exposed to. The same happens with organisations: the more exposure to events the organisation has, the more adaptive it tends to be, because it is building knowledge and practices on the potential options.

We can adopt Chaos Engineering as a management practice to lead the organisation to an anti-fragile state, creating a safe environment for people and teams to excel. As some organisations demonstrate, when the environment is safe, people unleash their superpowers.

Be aware that I’m not advocating for a silver bullet. There is more to it when it comes to creating a healthy organisation, where people feel safe and valued for their contribution. I will continue to explore this subject in the next blog posts.

How do you create a safe environment? What are the practices and tools that you use?

This article was translated to Ukrainian by Temy. You can find it here: http://a.temy.co/temy/blog?page=67923

Using Team Topologies to discover and improve reliability qualities

2020-08-18T09:22:51+00:00

Team Topologies is the work of Matthew Skelton and Manuel Pais, and I use it as part of my job. From a sociotechnical perspective, a team-first approach is paramount for any organisation and helps to decrease the accidental complexity. As such, I’m often asked “How can we operate in DevOps?” or “How can I have a reliable service to deliver value to my customer?”.

TL;DR

Combining Team Topologies from the DevOps movement, with Context Mapping from the Domain-Driven Design community, can give insights about the potential friction contact points between software engineering teams. Below you can find how it can be combined and how to generate ideas to drive your organisations to new performance levels, creating a safe and healthy working environment.

Service reliability qualities

There are different reliability qualities, and there is literature about them. Think about the SRE books from Google or Building Evolutionary Architectures from Neal Ford, Rebecca Parsons and Patrick Kua. They describe different qualities for digital services and different approaches to achieving them. But before putting those concepts into action: how does it affect the team(s)?

Sociotechnical thinking into action

To answer that question, I use Context Mapping and Team Topologies in a workshop mode (lately in a remote mode; I will share my experiences in a later post) to visualise (1) the domains, bounded contexts and their interactions, and (2) the teams and how they interact. Following this path allows the people involved to see the complexity of their landscape and how the teams are organised to create solutions for the problem space.

These insights, both from a social and a technical perspective, enable people to reason about their design choices. By design choices, I mean the organisation of teams and individuals (organisational design) and the technical design (solution/software architecture). Hence the sociotechnical thinking, where the concept of a team is not an afterthought.

Right, nice introduction. How about my service reliability qualities?

Yes, I know the title of the post. :) I’m about to get there, but first I want to write about another important concept: complexity, namely in software land. Frederick P. Brooks Jr. wrote the paper No Silver Bullet - Essence and Accident in Software Engineering back in 1986, where he describes two types of complexity: essential complexity and accidental complexity. Complexity is inherent to a distributed system (as a thought exercise, you can think about the stack needed to connect one web server to one database server), and we build complex solutions for complex problems. More often than we expect, the solutions grow organically (to use Kent Beck’s punch line, we don’t know any other type of growth), but unfortunately, almost by accident, the complexity of the solution outgrows the complexity of the problem.

Back to service reliability qualities: I witness people demanding more availability (or any other -ility) for service X. And again, teams keep adding components to try to achieve it, adding more cognitive overhead. Aware of it (I’m guilty because I had the same behaviour in the past), I started to look at the problem from another angle: what if the answer is in the way that teams interact, rather than in throwing more technology at it?

Using the insights from the Context Map and the Team Topologies, we can reason about the organisational design to achieve the desired service reliability qualities. Usually, I have a few questions, such as:

Does the flow of value cross multiple teams? If so, what are those team types? A mix of platform and stream-aligned?
Do they belong to the same department or different departments?
Do they have the same managers?
Are the teams co-located, or are they in different timezones?
Are we sharing models across different bounded contexts?
Do we need to revise our SLA/SLO/SLI?
Should we adopt living documentation?
How many handovers do we have?
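
Some of these questions can be answered mechanically once the flow of value is written down. The following is a minimal sketch in Python, assuming a hypothetical flow in which a feature passes through two teams; the team names, team types and departments are invented for the illustration and do not come from a real organisation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    team_type: str   # e.g. "stream-aligned" or "platform"
    department: str


def count_handovers(flow):
    """Count how often work passes from one team to a different team."""
    return sum(1 for a, b in zip(flow, flow[1:]) if a.name != b.name)


def cross_department_handovers(flow):
    """List the handovers that also cross a department boundary."""
    return [
        (a.name, b.name)
        for a, b in zip(flow, flow[1:])
        if a.name != b.name and a.department != b.department
    ]


# Hypothetical value flow: the order in which teams touch one feature.
team_a = Team("Team A", "stream-aligned", "Sales")
team_b = Team("Team B", "platform", "IT")
flow = [team_a, team_b, team_a]

print(count_handovers(flow))
print(cross_department_handovers(flow))
```

The numbers are not the point; they are a conversation starter for the workshop. A flow with many handovers, or with handovers that cross departments, is a candidate for a closer look at the team interactions.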

In my experience, taking a look at how the teams interact and how we lay out our solutions is a great start to discover and improve service reliability qualities, before changing the technical aspects. As an organisation grows, the complexity increases, and it is important to make design decisions to cope with the complexity. The essential one!

Let’s visualise one example

Let’s use a fictional case, where a company provides a SaaS solution to analyse sales leads (I have no idea if this exists or not; maybe it is a business idea). Analysing Team Gold, responsible for the service that qualifies a lead, we can have the following Team Topologies:

As you can see, I placed Team Gold in the middle of the diagram. Also, the rest of the teams have names related to their purpose, and I will let your imagination fill in the technology details. Now, in this fictional case, Team Gold is asked to improve the cycle time of features to production, since they are losing their differentiating features to the competition. Based on Team Topologies, we can see that there are different interaction modes with various teams that might affect the feature cycle time.

Although we can see some room for improvement there, a Context Map can also indicate how flexible the relationships between bounded contexts are and, by extension, the relationships between the teams who own them:

For this exercise, let’s say that Team ERP owns the bounded context Lead Information, Team CRM owns the bounded contexts Customer Information and Customer Analytics, and Team Gold owns the bounded contexts Qualified Leads and Lead Analytics. Doing a quick analysis, we can spot that although the relationship between Team Gold and Team ERP is X-as-a-Service, in reality, Qualified Leads is conformist to Lead Information. If Team ERP decides to make changes to their bounded context, it will cause cascading failures for Team Gold. In the opposite direction, the relationship between Lead Analytics and Lead Information is customer-supplier, where Team Gold has influence over the decisions made by Team ERP. Last but not least, we have a team to manage documents; however, none of the bounded contexts is related to it. In this example, mismatches like these can be a potential cause of friction between teams, since the functional needs are not aligned with the sociotechnical needs, e.g., team boundaries and technological decisions.
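
The same analysis can be sketched as a small Python script. It is only an illustration under a few assumptions: the name "Team Documents" is a placeholder for the team that manages documents, and the rule that X-as-a-Service combined with a conformist relationship deserves attention is one reading of the example above, not a rule defined by Team Topologies or Domain-Driven Design.

```python
# Bounded context -> owning team, as described in the example.
owners = {
    "Lead Information": "Team ERP",
    "Customer Information": "Team CRM",
    "Customer Analytics": "Team CRM",
    "Qualified Leads": "Team Gold",
    "Lead Analytics": "Team Gold",
}

# Teams in the diagram; "Team Documents" is a placeholder name.
teams = {"Team ERP", "Team CRM", "Team Gold", "Team Documents"}

# (downstream context, upstream context) -> Context Map pattern.
context_relationships = {
    ("Qualified Leads", "Lead Information"): "conformist",
    ("Lead Analytics", "Lead Information"): "customer-supplier",
}

# (consuming team, providing team) -> Team Topologies interaction mode.
team_interactions = {
    ("Team Gold", "Team ERP"): "x-as-a-service",
}

# Assumption: combinations we want to flag for a conversation.
FRICTION = {("x-as-a-service", "conformist")}


def teams_without_context(teams, owners):
    """Teams that own none of the mapped bounded contexts."""
    return sorted(teams - set(owners.values()))


def friction_points(context_relationships, team_interactions, owners):
    """Context relationships that do not fit the declared team interaction."""
    findings = []
    for (down, up), pattern in context_relationships.items():
        mode = team_interactions.get((owners[down], owners[up]))
        if mode is None:
            findings.append((down, up, "no team interaction declared"))
        elif (mode, pattern) in FRICTION:
            findings.append((down, up, f"{mode} vs {pattern}"))
    return findings


print(teams_without_context(teams, owners))
print(friction_points(context_relationships, team_interactions, owners))
```

With the data from the example, the script points at the two observations made above: the team that owns no bounded context, and the conformist relationship hiding behind an X-as-a-Service interaction. The customer-supplier relationship is not flagged.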

In this naïve example, we can see the potential to improve the alignment of teams to the functional needs. In my experience, this is the first step to discover and improve reliability qualities. If the responsibilities of the teams are aligned with the bounded contexts, and the teams work towards the most beneficial interaction modes, the system reliability improves. For organisations that are adopting different collaboration patterns, it is a great starting point to discover the boundaries at the product level: which teams and bounded contexts are involved, and what might or should change to achieve the desired reliability. Designing organisations with that in mind (in other words, the essential complexity) can lead to better collaboration between teams while maintaining their autonomy.

Heuristics

A small list of heuristics that can be used to generate insights:

Team interactions match context relationships
Invest in the team environment, then on the technology
Improve reliability by improving team interactions

A reader’s note

I mainly operate in complex, highly regulated environments with hundreds of teams. Your context is different, so take my experience with a pinch of salt!

What does “system” mean in the socio-technical land?

2020-06-25T09:22:51+00:00

One of my interests is in socio-technical systems. However, when I discuss it, namely with IT folks, the word “system” implies that it is an IT system. Well, I believe that it is more than that, and I will try to convey my ideas in this blog post.

Define socio-technical system

The term was coined by Eric Trist, Ken Bamforth and Fred Emery in the World War II era. They studied coal mine workers during that period, and the relationship between people (and, by extension, society) and the technical aspects within an organisation. There is a good Wikipedia article about it.

What I observe in the IT industry is that we tend to look at ourselves as unique, and often reinvent the wheel. With regard to the social aspects, there is work from other disciplines, such as anthropology or sociology, that can inspire us when we design our organisations. The world is digital, but this digital era is created, supported, and consumed by people.

What does “system” mean (for IT folks)?

Before going into the nuts and bolts of the term “system”, I want to clarify what it (usually) means within the IT community. Whenever the word “system” pops up, people link it to an IT system: the CRM system, the order system, the ticketing system. In my opinion, we are not looking at the bigger picture, nor taking advantage of the insights and knowledge that other disciplines offer us. As such, I genuinely believe that although we have meaningful conversations about how and what we build (in IT terms), we miss the ecosystem around us. Thus, it is paramount to zoom out and look at the “system” as a whole, where different forces are at play.

What does “system” mean to me?

Getting to the point of this post (I know, a lengthy introduction), the “system” is the combination of people, processes and technology. These three aspects are paramount in the socio-technical system, and we should pay attention to the relationships between them. It is easy to overlook the social aspect and focus only on the technical one. Also, when problems arise, it is common to enforce processes to fix a symptom without taking the cause into account.

Why did I write this post

We need a new generation of leaders. And in this generation, we need two critical traits: (1) the ability to give structure and (2) the ability to create a safe place. Combining these traits with the ability to recognise the different patterns in a socio-technical system is a sturdy combination that will help organisations to be successful. We are on the verge of a new organisational design, the so-called Teal organisations. These organisations are learning ecosystems, where the decision-making power is distributed and closer to the information.

At the same time, we have emerged into the digital world, and I believe that we need to shift the focus from trying to align business and IT to product engineering thinking. We are leveraging technology to create better products and services, rather than trying to mimic analogue processes.

I intend to write smaller posts on this subject. I plan to navigate from the practices that are emerging, trying to pinpoint the heuristics and patterns that can help us to create the new type of organisations!