IT Risk Mitigation

Every IT project brings some level of risk. Risk mitigation is about understanding the risks that can impact the project’s objectives and then taking the appropriate actions to reduce them to a level the customer has agreed is acceptable. These deliberate actions shift the odds in your favor, reducing the chance of bad outcomes.

[Image: risk management]

Risk management is an active process that often requires a large degree of judgment because the data is insufficient; the architect has to make certain assumptions about the future. Technology itself is a source of risk, often because of unintended consequences. For this reason, you must validate that each mitigation is actually resolving the risk it was meant to address.

Risk is a function of the likelihood that a given threat source will exercise a particular vulnerability, and the resulting impact of that adverse event on the project.

[Image: risk assessment process]

So, to manage risk effectively, you must identify the risk, assess it, respond to it, and then monitor it. The sketch below shows the assess-and-respond steps in miniature.
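
To make the likelihood-times-impact idea concrete, here is a minimal, hypothetical Python sketch of a risk register; the risks, the 1–5 scales, and the acceptance threshold are all made up for illustration:

    # Hypothetical risk register: score = likelihood x impact.
    from dataclasses import dataclass

    @dataclass
    class Risk:
        name: str
        likelihood: int  # 1 (rare) .. 5 (almost certain)
        impact: int      # 1 (negligible) .. 5 (severe)

        @property
        def score(self) -> int:
            return self.likelihood * self.impact

    ACCEPTABLE = 6  # level agreed with the customer (illustrative)

    register = [
        Risk("Storage array firmware bug", likelihood=2, impact=5),
        Risk("Single top-of-rack switch", likelihood=3, impact=4),
    ]

    # Respond to the highest-scoring risks first, then keep monitoring.
    for risk in sorted(register, key=lambda r: r.score, reverse=True):
        action = "mitigate" if risk.score > ACCEPTABLE else "accept"
        print(f"{risk.name}: score={risk.score} -> {action}")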

I was in a project meeting recently where the project manager was asked what risks had been identified. The PM responded with “none,” and the whole room sat silent for a few seconds. Then he pulled up his risk log list, and the whole room chuckled a bit.

Resources:

VCDX No Troubleshooting Scenario

I just read the update from the VCDX program that they’re removing the troubleshooting scenario from all defenses. While I was looking forward to trying some on-the-spot troubleshooting, I’m glad that more time is now dedicated to the design scenario.

I’m planning to target 2016 to give the VCDX-DCV certification a try, and not having to work through a troubleshooting problem on top of an already stressful situation will alleviate some of that stress. Granted, the troubleshooting scenario was only 15 minutes long, but that time is now added to the design scenario posed by the panelists.

[Image: VCDX]

So the breakdown for all tracks will now be:

  • Orally defend the submitted design (75 minutes)
  • Work through a design problem posed by the panelists, in the format of an oral discussion (45 minutes instead of the previous 30)

Not that this makes the certification any easier to achieve. I’m glad the VCDX program has made this information available well ahead of the 2016 dates so that those who are preparing now can adjust their study strategies.

I’m looking forward to the challenge and the learning experience to come.

Resources:

AWS Regions and Availability Zones

Every IT architect strives to deliver an optimized, cost-effective solution to their customer. To do that, they must understand and be able to explain the options a customer has, and then help them choose the best one with the trade-offs taken into account. Producing a quality deliverable requires an ongoing partnership between the architect and the customer.

AWS infrastructure is broken up into regions and Availability Zones, and Amazon continues to expand it as the business grows. So what is a region? A region is a named set of AWS resources in a separate geographic area. AWS offers a choice of regions around the world to help customers meet their requirements, and each region is completely isolated from the others.

There is only a set number of regions to choose from today, but as customers grow, AWS continues to build out infrastructure to meet global requirements. In North America, AWS has three regions to choose from plus a GovCloud region:

[Image: North America AWS regions]

Inside each region there are Availability Zones (AZs). These are essentially distinct AWS data centers within the region, connected to one another via low-latency links.

[Image: AWS regions and Availability Zones]

Availability Zones let you architect your applications to be as resilient as possible: each AZ is a separate failure domain, so spreading an application across AZs avoids a single point of failure. As architects, we must design on the assumption that things will fail. A quick way to see which regions and AZs are available is to ask the API, as in the sketch below.
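
As a quick illustration (my own sketch, not part of the original post; it assumes boto3 is installed and AWS credentials are configured), here’s how you could enumerate the regions and the AZs inside one of them:

    # Minimal sketch: list AWS regions and the Availability Zones in one region.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Each region is an isolated, named set of AWS resources.
    regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
    print("Regions:", regions)

    # AZs are the failure domains inside the region you're connected to.
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], "-", zone["State"])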

With Auto Scaling, ELBs, and multiple Availability Zones, you can build a reliable architecture in minutes instead of days or weeks. I’ll cover Auto Scaling and ELBs in a separate post.

Resources:

AWS Certified Solutions Architect – Associate Level

I saw the challenge from Virtualization Design Masters (VDM) to write 30 blog posts in 30 days. I’m going into it intending to finish, since I’ve been meaning to post more and it might even get me into the habit of posting regularly. I don’t want to post just for the sake of posting, since that adds no value, so I’ve put together a list of topics I’m working through and I feel prepared.

So I decided to tackle the AWS Certified Solutions Architect – Associate Level certification. I’m planning to take it at the end of the month if everything works out. I’ve been meaning to learn more about AWS for a long time, and now is the time, especially since it’s only going to get more popular.

The exam is multiple choice and covers four domains:

Domain                                                             % of Examination
1.0 Designing highly available, cost efficient, fault tolerant,
    scalable systems                                               60%
2.0 Implementation/Deployment                                      10%
3.0 Data Security                                                  20%
4.0 Troubleshooting                                                10%
TOTAL                                                              100%

So obviously the first domain is the biggest bang for your buck when scheduling your study time. It also fits nicely with my work responsibilities and the other certifications I have my eye on, namely the elusive VCDX. I’ve started by signing up for the free 12-month AWS account to use as my lab, so I can try things I haven’t done before, like Amazon Aurora, a relational database engine.

I’ve also been reviewing the exam blueprint and getting familiar with it. Amazon does a good job on documentation: the user guides are very useful and easy to understand. They make it easy to consume their services and also offer self-paced labs via qwiklabs.com to help with studying. I’ve only done the free labs there, and they were good for getting your feet wet.

Inside my free AWS account I’ve been playing a lot with setting up Virtual Private Clouds (VPCs) and the different layers of security you can apply. A good image from the AWS user guide shows the security layers in your VPC:

[Image: VPC routing and network ACL security layers, from the AWS user guide]
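
As a hypothetical sketch of those layers (my own example, not from the AWS guide; all CIDR ranges and names are made up), here’s how the pieces could be created with boto3:

    # Sketch: a VPC, one subnet, and a security group that only allows SSH.
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
    subnet_id = ec2.create_subnet(
        VpcId=vpc_id, CidrBlock="10.0.1.0/24"
    )["Subnet"]["SubnetId"]

    # Security groups are the instance-level firewall layer;
    # network ACLs (not shown) apply at the subnet level.
    sg_id = ec2.create_security_group(
        GroupName="ssh-only", Description="Allow SSH only", VpcId=vpc_id
    )["GroupId"]
    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.0/24"}],  # example admin range
        }],
    )
    print(vpc_id, subnet_id, sg_id)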

I really like things that are broken down into easy-to-understand pieces with good explanations. Did I already say that the AWS user guides are well written? 🙂

Below are the resources I’m currently using for my studying.

Resources:

Troubleshooting Random Slowness with VMs

One question comes up quite often: what do you do when users complain about slowness on their VMs? Of course, the first thing we do is try to isolate the problem.

So we go through our troubleshooting steps:

  1. Define the problem
  2. Gather detailed information
  3. Consider probable cause for the failure
  4. Devise a plan to solve the problem
  5. Implement the plan
  6. Observe the results of the implementation
  7. Repeat the process if the plan doesn’t resolve the problem
  8. Document

For VM slowness, we first want to identify the exact nature of the problem. Identifying which VMs are experiencing the slowness, and whether they’re all on the same host, helps isolate the issue. First, find out whether any recent changes correlate with the slowness. Then I like to rule out networking, for example by pinging the VM, though this isn’t always conclusive since a VM may not respond to ping even when networking is fine.

You then want to find out if other VMs are experiencing the same problem and, if so, what they all have in common. If you can rule out networking, the next step is getting more detail on the storage, starting with the type of array backing those VMs: active/active, active/passive, or ALUA.

Using esxtop to check latency statistics like DAVG, KAVG, QAVG, and GAVG against known thresholds is a great way to further pinpoint where the performance degradation is happening (a small sketch for automating those checks follows the list):

  • High DAVG indicates a performance issue on the storage array or somewhere along the path to it.
  • High KAVG indicates a performance issue within the VMkernel; possible causes include queuing, drivers, etc.
  • High QAVG indicates queue latency is building up. Combined with high DAVG, it can point to underperforming storage.
  • GAVG is the sum of DAVG and KAVG (QAVG is already included in KAVG). If GAVG is high while the device latency looks fine, the issue may reside in the VM’s drivers or virtual hardware.
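
As a rough illustration, here’s the kind of script I mean: a small Python sketch that scans an esxtop batch-mode export for latency counters over a threshold. The counter-name substrings and thresholds below are assumptions from memory; verify them against your esxtop build:

    # Sketch: scan an esxtop batch export (esxtop -b -d 2 -n 30 > esxtop.csv)
    # for storage latency counters exceeding rule-of-thumb thresholds.
    import csv

    THRESHOLDS_MS = {
        "Average Device MilliSec/Command": 25,  # DAVG: array/path latency
        "Average Kernel MilliSec/Command": 2,   # KAVG: VMkernel latency
        "Average Queue MilliSec/Command": 2,    # QAVG: queueing latency
        "Average Guest MilliSec/Command": 25,   # GAVG: DAVG + KAVG
    }

    with open("esxtop.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:  # one sample interval per row
            for col, value in zip(header, row):
                for counter, limit in THRESHOLDS_MS.items():
                    if counter in col and value and float(value) > limit:
                        print(f"{col}: {value} ms (threshold {limit} ms)")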

Along with the steps above, you also want to verify that your storage array is keeping up with the demand from the VMs.

A lot of this information can be found in various resources online, but I like to have a tried-and-true script to work from when doing performance troubleshooting.

My VCAP-DCD Experience

I took the VCAP-DCD exam today and have finally passed. I believe it’s one of the unwritten rules that you have to write about your experience so you can share it. I feel like I just tackled a mountain; this was probably one of the hardest exams I’ve taken in a long time. It was my second attempt at this beast, the first being at VMworld 2014. I’ve been studying off and on for some time, but after reading other blogs about people passing this exam on multiple tries, I decided there was no better time than now to take the challenge on. The blog I read was Craig Waiters’s; he passed on his third attempt, and you can read his journey here.

I did find the 5.5 version has much better flow than the 5.1 exam. I had 42 questions, 8 of which were the Visio-like design questions. After being unsuccessful the first time, I decided I needed to brush up on the high-level, holistic, helicopter view of design. I also brushed up on PMP skills to help define requirements, assumptions, constraints, and risks.

In preparation for take two of this exam, here is what I used:

Tips:

  • I flagged and skipped all the standard questions first, which allowed me to concentrate on the design scenarios while I was still fresh. This left me over an hour and a half to answer the rest of the questions.
  • Know how to apply the infrastructure design qualities of AMPRS: Availability, Manageability, Performance, Recoverability, and Security (Cost can also be thrown in there)
  • Know how to identify requirements, assumptions, constraints, and risks.
  • Applying dependencies, upstream and downstream; there’s a good post by vBikerblog on this: Application Dependency – Upstream and Downstream Definitions
  • Conceptual, Logical, and Physical designs.
  • Storage design best practices (ALUA, VAAI, VASA, SIOC)

What’s next:

  • I’ll be taking the Install, Configure, and Manage (ICM) course for NSX and will likely take the VCP-NV
  • Maybe next year I’ll tackle the behemoth of the VCDX; need to finish grad school first

Resource pools, child pools, and who gets what…

I know there have been numerous blog posts on this topic already, all of them very good, but this one is mainly for my own understanding as I try to dig into the topic a bit further.

So what are resource pools?

They are a logical abstraction for management that lets you delegate control over the resources of a host (or a cluster). They can be grouped into hierarchies and used to progressively partition the available CPU and memory resources.

Resource pools use a construct called shares to deal with possible contention (the noisy-neighbor syndrome):

Setting   CPU share values        Memory share values
High      2000 shares per vCPU    20 shares per megabyte of configured VM memory
Normal    1000 shares per vCPU    10 shares per megabyte of configured VM memory
Low       500 shares per vCPU     5 shares per megabyte of configured VM memory

*Taken from the 5.5 vSphere Resource Management document
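
To make the table concrete, here’s a tiny sketch (my own, with an illustrative VM size) that computes the share values a VM would receive at each setting:

    # Sketch: compute default share values from the table above.
    SHARES_PER_VCPU = {"High": 2000, "Normal": 1000, "Low": 500}
    SHARES_PER_MB = {"High": 20, "Normal": 10, "Low": 5}

    def share_values(setting: str, vcpus: int, memory_mb: int) -> tuple[int, int]:
        """Return (cpu_shares, memory_shares) for a VM at the given setting."""
        return SHARES_PER_VCPU[setting] * vcpus, SHARES_PER_MB[setting] * memory_mb

    # A 2-vCPU, 4 GB VM at Normal: 2000 CPU shares and 40960 memory shares.
    print(share_values("Normal", vcpus=2, memory_mb=4096))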

My biggest issue with contention is this: why architect a design that can run into contention if you can avoid it? I understand that a good design accounts for the worst-case scenario, in which all VMs run at 100% of their provisioned resources, possibly creating over-commitment and contention, but that’s when you go to the business to request funds for more resources. I know over-commitment is one of the benefits of virtualization, but should you rely on it? The comfortable level of over-allocation also varies from architect to architect. Keep to the rule of protecting the workload and I don’t see an issue. I keep the worst case in mind, but I never want to get there. This is more of a rant than anything else.

Frank Denneman (@FrankDenneman), Rawlinson Rivera (@PunchingClouds), Duncan Epping (@DuncanYB), and Chris Wahl (@ChrisWahl) have already set us straight on the impacts of using resource pools incorrectly and the most efficient ways to use them.

The main takeaways of resource pools are:

  • Don’t use them as folders for VMs
  • Don’t have VMs on the same level as resource pools without understanding the impacts
  • If VMs are added to a resource pool after the shares have been set, you’ll need to update those share values (the sketch after this list shows why)
  • Don’t go over eight (8) levels deep [RP limitation; and why would you]
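
Here’s a quick, hypothetical calculation behind that third takeaway (the pool shares and VM counts are made up): a pool’s shares are divided among the VMs inside it, so adding VMs dilutes every VM’s slice, and a “High” pool crowded with VMs can end up giving each VM less than a “Normal” pool holding only a few:

    # Sketch: per-VM weight when sibling pools compete (illustrative numbers).
    def per_vm_weight(pool_shares: int, vm_count: int) -> float:
        return pool_shares / vm_count

    prod = per_vm_weight(pool_shares=8000, vm_count=20)  # "High" pool, many VMs
    test = per_vm_weight(pool_shares=4000, vm_count=2)   # "Normal" pool, few VMs

    # Under contention each prod VM gets a weight of 400 vs 2000 per test VM,
    # which is why share values need revisiting as VMs are added.
    print(f"prod VM weight: {prod}, test VM weight: {test}")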

Additional Resources:
Chris Wahl: Understanding Resource Pools in VMware vSphere

Frank Denneman and Rawlinson Rivera: VMworld 2012 Session VSP1683: VMware vSphere Cluster Resource Pools Best Practices

Duncan Epping: Shares set on Resource Pools