Tuesday, July 31, 2012

Got RUM (Real User Monitoring)?

This year I attended the O'Reilly Velocity Conference and had a very good experience.  Before I continue, let me give a little of my background.  Most of the conferences I have attended have been for developers or for a specific language.  The developer conferences all degrade into one large, endless commercial.  The programming-language conferences turn into juvenile language bashing or a religious revival.  It takes a lot of patience to glean valuable insights from those conferences, much like getting water from a dry sponge.  Velocity, on the other hand, was like drinking from a fire hose.  It is going to take months to get through all of the valuable information that was available.  One topic stood out almost ominously: the subject of this blog entry, real user monitoring, or RUM as I'll refer to it throughout the rest of this entry.

What is RUM?

RUM is a passive technology used for performance metrics and monitoring.  A simple definition is that it records all of a user's interactions with a website.  The user creating those interactions could be another website, a robot, or a human.  RUM is passive in that the collecting device gathers web traffic without having any effect on the site.  That hasn't always been true, but the technology has improved to the point where there is no longer any excuse not to have it.  Passive monitoring differs from other approaches, such as synthetic tests run by automated web browsers, in that it relies on actual inbound and outbound web traffic to take its measurements.

Why use RUM?

For years the performance community has been preaching that site owners need to test with tools like Webpagetest.org to get a real-world picture of performance.  That is good advice, but just because a test was successful doesn't mean users aren't experiencing problems:

  • The user could be on a different browser than the test system.
  • The user may be accessing a portion of the site that is not being tested.
  • The user may be following a navigation path that was not anticipated.
  • An outage could have been so brief that it occurred between two tests.
  • A user's input data could cause the site to behave erratically.
  • In a load-balanced environment, a user could hit a failed component while the synthetic tests hit a working one.
There are countless ways a site can be broken yet still be working, or at least hobbling along.  As I have experienced in my career, all the monitors can be green while the user experience is horrible.  RUM is a collection of technologies that capture, analyze, and report a site's performance and availability from an actual visitor's perspective.  RUM may involve sniffing a network connection, adding JavaScript to web pages, installing agents on boxes, or any combination thereof.
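
To make the "adding JavaScript to web pages" flavor concrete, here is a minimal sketch of a hand-rolled RUM beacon.  It uses the browser's Navigation Timing API to compute the load time an actual visitor experienced and reports it with a throw-away image request.  The /rum-beacon endpoint is a hypothetical collector, not something any particular product provides.

    <script>
    // Minimal hand-rolled RUM beacon (sketch only).
    // Assumes a hypothetical collector listening at /rum-beacon.
    window.addEventListener("load", function () {
      // loadEventEnd isn't filled in until the load handler finishes,
      // so take the measurement on the next tick.
      setTimeout(function () {
        var t = window.performance && window.performance.timing;
        if (!t) { return; } // older browsers without Navigation Timing support
        var loadMs = t.loadEventEnd - t.navigationStart;
        // Report the measurement with a throw-away image request.
        var beacon = new Image();
        beacon.src = "/rum-beacon?page=" + encodeURIComponent(location.pathname) +
                     "&load_ms=" + loadMs;
      }, 0);
    });
    </script>

A production tool like boomerang does essentially this, plus the edge cases, older-browser fallbacks, and richer metrics that make it worth using instead of rolling your own.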


Simple RUM!


If you are already using Google Analytics, you are already instrumented for RUM!  Take a look at the Real-Time pages for the RUM reports; they will open up a whole new vista for you.  Another simple implementation is boomerang.

boomerang always comes back, except when it hits something.

Boomerang (https://github.com/lognormal/boomerang) is a piece of JavaScript that you insert into your pages to capture measurements for a whole range of performance characteristics from an actual user's interactions.  I found Google Analytics' Real-Time reports to be the easiest way to get RUM up and running, with boomerang.js a close second.
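
Getting boomerang onto a page is about as simple as RUM gets.  Here's a rough sketch of what the snippet looks like, based on the project's documentation; the script path and beacon URL are placeholders for wherever you host the library and whatever collector you point it at.

    <script src="/js/boomerang.js"></script>
    <script>
    // Tell boomerang where to send its measurements.
    // The beacon_url below is a placeholder for your own collector.
    BOOMR.init({
      beacon_url: "http://yoursite.example.com/rum-beacon"
    });
    </script>

On page load boomerang gathers its measurements and fires them off to that beacon URL, where you can log and analyze them however you like.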


Note: Here is how I got my nickname/alias 'oldstinger'.  In high school I played strong safety/outside linebacker and was known for my hard hits.  One day my coach accidentally slurred my last name into 'stingham' instead of 'stringham'.  A friend said that, yeah, the hits do sting.  Thus 'stinger' came into being.  Now that I am getting grey hairs it has become 'oldstinger'. :)

Friday, July 6, 2012

My Agile IT Experience Report At AgileRoots2012

Late in June, I presented "Agile IT/Ops: A One Year Checkup" at AgileRoots. It seemed like IT and DevOps were popular topics because I heard a lot about them at other presentations and in the hallway track.

Specific talks that stood out (and about which I'll post more later) included Agile to the Rescue (a CIO's view of doing IT Agilely) and Agile 2.0 (which touched a lot on DevOps and IT themes). Also scheduled up against my experience report was Outgrowing the Cloud by my friend Mike Moore. I think that one will be available on the web soon; you should look for it.

So, what did I talk about? Good question. Here's the overview:

A little background

Our team of nine is part of a bigger group of about 30 (which includes data center folks; DBAs; infrastructure, network, and storage engineers; change management; and our sys-admin/Ops team). That group is, in turn, part of a much larger development organization.

Not only do we have a big group with diverse charters, we're also geographically spread out. We have people in two different office spaces about an hour apart. We also have three data centers spread across two states.

Several years ago, everyone else in the department went to a series of Scrum training events and became 'Agile'. At that point, the powers that be decided that IT couldn't be done within an Agile process, so we kept on doing things 'the old fashioned way'. Eventually, the dissonance became too much and we started exploring a move to Agile. That's where our story picks up.

Jun-Aug

The first quarter of our agile conversion was marked by Painful Planning and Guerilla Agile. Our group's management team met to figure out how to make the move. Since I'm an Agile Methodology junkie (well read in the topic, but only lightly seasoned in practice) I was pulled in as an advisor. We ran several planning exercises to see how that would look (it wasn't pretty), and eventually decided not to make the move. Honestly, I think I spent too much time on the mechanics and not enough time getting into the philosophy — that probably led to some of the problems we ran into downstream.

My team manager and I decided there was still a chance though, so we went underground and started running an Agile Ops Team. We focused a lot on a training-trying-repeat cycle. At this point, I started to slip more philosophy into the mix.

I was also careful to be really explicit about what we were doing and how. For example, in our retrospectives I would start with a review of the stages of a retrospective, then announce which stage we were moving into and what we were trying to do in it. This helped build some solid institutional knowledge among members of the team.

Sep-Nov

Our second quarter felt like we were Getting Into A Groove (we were hitting the "Norming" stage of Tuckman's stages of group development). This was also when we hit our Agile Mandate: you've probably heard of the Agile Manifesto; we got a top-down directive that said "Everyone in the department will now be Agile". We didn't all have to make the move at once, but the writing was on the wall.

Unfortunately, this quarter ended at a low-water mark for our year: a management re-org that really shook up our group. Our manager moved to another group, a PM took over responsibility for Agile in the group, and a variety of external changes reverberated through our group as well.

Dec-Jan

Soon after the re-org, we went through some additional changes that sent us back to Forming and Storming. We made it through December operating about where we had been and started January with a full-day, off-site 6 month retrospective. From my perspective, this was the highlight of the year. Our team was really humming at this point.

Then we came in for our 6-month planning session, which was pre-empted by the news that we were going to start "doing agile" as a group. This was rough. We had people in the group with very different skill (and interest) levels, we were now a much larger group trying to meet together, and we had to deal with meetings that crossed time zones and pulled people in via phone.

We had a variety of mis-steps in this quarter: some people ducked meetings to "work on the important stuff"; we often lost traction on improvement ideas that came out of retrospectives; and morale suffered as we learned just how bad our estimates and capacity planning were.

Mar-May

In the final quarter of our year, we got back to Norming. There were still some obstacles, but there were also some lessons learned and some wins for us.

We moved from the product we were using to track requests and problems (a bug tracker, which wasn't optimal, but entrenched) to an agile tool. This created more heat than light at first while we went through some growing pains. It also helped pull the group together — we identified the pain points in a retrospective and came up with some ways to work through the rough patches.

The group also decided to cut back on the time spent in retrospectives and hold them every other iteration. We're coming up on the end of this experiment, so we'll see if we stick to it or not.

We also had one team split back out and move to more of a Kanban, or "continuous flow", model. This provoked a lot of discussion on our team, as we feel it might be a good direction for us as well.

Our team decided to start applying retrospectives to our operational work as well as our iterations. We met each week for "The Week That Was" (imagine it being read in a booming, deep radio-announcer voice), where we would discuss what had happened over the last week, what we could learn from it, and what we were going to do about it.

Today & Tomorrow

Since my timeline ended in May, I also talked a bit about where things are now and where they're headed. Three things really stand out:

  1. We're breaking back out to the team level, and reporting up to the group, to make our meetings more manageable and effective.
  2. We're scheduling an annual retrospective and planning meeting as a team.
  3. We're going to experiment with Kanban ourselves.

Wrapping Up

Just before my presentation, my son went on a two week canoeing trip. So this next bit is an homage to him. The tradition in the program he attended was to hold a nightly reflection focusing on Wet Socks (things that didn't go well), Dry Socks (things that went well), and Gold Bond (things that could be done to make things better).

Wet Socks

  • our group was too big and too dispersed to be effective
  • we had too many disparate charters
  • there was no real product owner, so everyone tried to be one (and, to paraphrase Syndrome, "when everyone is a product owner, no one is a product owner")
  • the reorg
  • cutting retrospectives to every other iteration (in my opinion)

Dry Socks

  • we created a lot of transparency internally and externally
  • we held ourselves and each other accountable
  • we built a lot of team unity
  • just deciding to do it was a big win
  • the 6 month retrospective
  • starting "The Week That Was"

Gold Bond

  • Kanban
  • integrating Ops and Iteration retrospectives more completely
  • going back to team level meetings

Recommendations

If you're thinking about trying to run your IT shop using Agile principles, do it! It might be hard, but it can work.

Look at continuous flow from the get-go. We haven't gotten there yet, but we all think it will be a good move for us.

Train all the time. Make every meeting and communication a chance to do a little mini training. Why are we doing this? What does this mean? How can we improve?

Use your retrospectives wisely. Savor the wins, examine the pain points, and keep improving.

Be prepared for hard times. They will come. If you're careful and thoughtful, they'll make you better. If you just grit your teeth and endure them, they'll probably circle back and hit you again.

Keep records and use your metrics. This will give you a better sense of perspective, and ammunition to fight off the occasional attempt to shut things down.

Friday, June 29, 2012

What is best in life?

One of my favorite movies from my childhood was Conan the Barbarian. That movie has a scene where Conan is asked, "What is best in life?" 



Not sure the outcome of Conan's philosophy is congruent with a sustainable business model but it makes great theater!

Yesterday I participated in a roundtable discussion with my company's new CEO. It was a chance for him to get to know employees and get first-hand feedback. He also used the forum to pitch his vision. He made a convincing argument that what our customers really want is a wonderful experience. The reason people go to Disneyland, loyally shop at a particular store, use Facebook, and so on is the experience they have. People want particular goods or services, but what they crave is a unique and rewarding experience. That gave me a new perspective on how I judge my work. Does what I do on a day-to-day basis contribute to, or detract from, an excellent customer experience? Does what I do on a day-to-day basis contribute to what is best in the lives of our customers?

Continuous Delivery at Ancestry.com

I work for an organization whose main product resides in the “monolith, over-the-wall, deploy once a quarter” world. We are in the process of migrating to the “SaaS, DevOps, CD” world. This migration has proven non-trivial, to say the least. Ancestry.com recently completed a similar process, which is detailed here: http://www.youtube.com/watch?v=DoSrsjimXjE. I found the presentation very informative and decided to blog about the key lessons I took from it. I hope you will find them useful as well. (The slides run a bit ahead of the audio.)

The first ~16 minutes are a review of Opscode Private Chef. I’ve used Chef and it’s an excellent tool, and there is a lot of good material in the Opscode portion (“infrastructure is code”). But here I’m going to focus on the challenges and lessons of migrating from one set of practices to another.

Ancestry’s definition: “Continuous Delivery is reliably releasing high quality software fast through automated build, test, configuration and deployment.” Fast is critical because they want to increase the rate at which they can deliver value to the customer. You will see this theme repeated throughout the presentation. http://www.youtube.com/watch?v=DoSrsjimXjE&t=17m3s

Key benefits of CD: increased flow of value to the customer, a faster feedback rate, and lower risk due to smaller batch sizes. http://www.youtube.com/watch?v=DoSrsjimXjE&t=18m1s

“Doing CD right means being agile, not just doing Agile.” I call this ‘what’s right, not who’s right.’ http://www.youtube.com/watch?v=DoSrsjimXjE&t=18m39s

CD eliminates “emergency” deployments and all the bureaucratic rigidity they entail. http://www.youtube.com/watch?v=DoSrsjimXjE&t=26m17s

"Ultimately, what [CD] enabled, it enabled the development teams to own their services and applications from cradle to grave." (Emphasis added) http://www.youtube.com/watch?v=DoSrsjimXjE&t=26m58s

"Because the infrastructure was working on [the developers] behalf, they were able to focus on the service itself, and focus on owning that, and providing what the customer really needs and wants." (Emphasis added) http://www.youtube.com/watch?v=DoSrsjimXjE&t=27m06s

“Extending Agile into the enterprise put pressure on ops to change their methods and practices too.” We are lucky here, as our Ops team is already trying to work under the Agile model, and getting better at it, too! http://www.youtube.com/watch?v=DoSrsjimXjE&t=27m57s

“Engineering Productivity” is the bridge between ops and dev.  EP consists of the AppOps, CD Tools, and Test Infrastructure teams, and its mission is to accelerate development. There is a need for this third group to act as a buffer between Ops and Development. At Ancestry this group does a lot of the Opscode Chef work. Our organization is attempting to have Ops fill both the operations role and the AppOps role, with mixed results. http://www.youtube.com/watch?v=DoSrsjimXjE&t=31m15s

“Adopt a service model that enables development to do what they need quickly and easily, getting infrastructure out of the way. Rule #1: Developers don’t want to mess with or own ops infrastructure. Rule #2: If you don’t provide infrastructure and services to enable Rule #1, then developers break Rule #1.” The key to having these two rules obeyed is a CD platform (SaaS). http://www.youtube.com/watch?v=DoSrsjimXjE&t=32m40s

"Store your infrastructure code in source control." Enough said. http://www.youtube.com/watch?v=DoSrsjimXjE&t=35ms20

You can’t do CD without automated configuration and automated deployment. Ancestry uses Opscode Chef for both. Not having those automated tools makes you slow. Slow means you’re not doing CD. http://www.youtube.com/watch?v=DoSrsjimXjE&t=37m50s

All new services are configured with a fully provisioned CD pipeline, from build through production. Not the other way around. http://www.youtube.com/watch?v=DoSrsjimXjE&t=39m50s

"Deployment is monitored and overseen by the team itself." Clearly the development teams have accountability for production systems. They don’t go into exactly how that is done, unfortunately. My guess is that development has responsibility for a portion of production monitoring. http://www.youtube.com/watch?v=DoSrsjimXjE&t=47m54s

There it is. If you are doing CD and DevOps, the entire presentation is well worth the time; highly recommended.

Tuesday, June 26, 2012

About

Welcome to Melting Pot Ops! You're probably wondering what this is all about; let me try to tell you.


We're a team of eight Systems Administrators/Operators/System Managers. We come from a variety of technical backgrounds, and have varying levels of experience. But that's not really where the melting pot bit comes in — we're trying to bridge the gaps between traditional operations and DevOps, between old-school sys-admins and infrastructure as a service, between hand-built and maintained systems and automated configuration, between waterfall planning and agile processes.

We try to think a little differently. We want to explore new ideas and new ways of doing things. Mostly though, we want to provide rock solid infrastructure, monitoring, and service management to our dev teams and to our end users.

As you read Melting Pot Ops, you might see:
  • Reviews of books, courseware, hardware, and software that we've really liked.
  • Experience reports for things we've tried (both successes and failures).
  • Recaps of presentations we've made in house and at conferences.
  • Random other stuff as needed — we'll try to stay away from lolcats though.
One of the reasons we decided to start this blog, and that our management decided to approve it, was to get feedback from the big blue room outside our cubicles and data centers. Please feel free to drop a comment or two as you read. We hope this becomes a conversation.

Thanks for coming by to visit. We hope you like what you read and come back often.