Friday 5 February 2016

LCA 2016 - Day 2

Day 2 of LCA2016 kicked off with the first of four keynotes for the conference delivered by George Fong, President of Internet Australia. He surprised everyone by actually giving a keynote rather than a shameless sponsor promotion. The keynote was entitled "The Cavalry's not coming... We are the Cavalry" which was subtitled "The challenges of the Changing Social Significance of the Nerd."

The main thrust of the keynote was the insatiable greed for control over technology exhibited by Governments and legislators - particularly when they have little to no understanding of the technologies they are trying to legislate. Especially damaging is Governments' desire to hamstring encryption technology and impose export controls on intangibles, and the effect this has on open source. George emphasised the need to communicate technical concepts to lay people in language they can understand.

Day 2 was also the second day of the miniconfs. For me, that meant the sysadmin miniconf. This one did not have the structure exhibited by the OpenCloud symposium - each talk was an island of knowledge, some meatier than others. The talks were also shorter - meaning more of them. The sysadmin miniconf has its own page, so you could ignore this blog entry completely and go there.

 

2/1 Is that a data-center in your pocket? by Steven Ellis

Subtitled "There will be dragons" rather than the predictable "...or are you pleased to see me."
Steven provided a walk-through on how to create a portable, virtualised cloud infrastructure for demo, training and development purposes. This talk was heavy on detail and I found myself wanting to explore this more at a later date. He utilised a USB3 attached SSD drive connected to an ARM Pine64. The setup utilised nested virtualisation, thin LVM and docker.

According to Steve, the "cloud" will very soon be mostly ARM64 - so it's time to prepare for that. He also demonstrated how UEFI can be used to secure boot virtual machines.
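Before attempting a setup like Steven's, it is worth confirming the host can actually do nested virtualisation. A quick sanity check (these are the standard Linux KVM paths, not something from the talk) looks like:

```shell
#!/bin/sh
# Check whether this host can do nested virtualisation.
# These are the standard KVM procfs/sysfs paths on Linux.
if grep -qEw 'vmx|svm' /proc/cpuinfo; then
    echo "CPU exposes hardware virtualisation extensions"
else
    echo "no vmx/svm flags: virtualisation unavailable or hidden"
fi

# The 'nested' module parameter must be Y (or 1) for nested guests.
for p in /sys/module/kvm_intel/parameters/nested \
         /sys/module/kvm_amd/parameters/nested; do
    if [ -r "$p" ]; then
        echo "$p = $(cat "$p")"
    fi
done
```

If the parameter reads `N`, reload the module with `nested=1` (or set it in modprobe.d) before building the nested stack on top.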


2/2 System automation by Martin

Martin highlighted the fact that in the transition from Unix to Linux, we somehow forgot the habits born of Unix administration - in particular, we forgot about system automation, to wit:

  1. Monitoring
  2. Data collection
  3. Policy enforcement
Martin worked through scripts available at https://github.com/madduck/retrans
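The three habits above can each be reduced to a few lines of shell; the following is purely illustrative (it is not from Martin's retrans scripts, and the paths and checks are placeholders):

```shell
#!/bin/sh
# The three forgotten habits in their simplest shell form (illustrative):
# monitoring, data collection, policy enforcement.

# 1. Monitoring: alert if a service is down
pgrep -x sshd >/dev/null || echo "ALERT: sshd not running"

# 2. Data collection: append a load-average sample to a log
echo "$(date -u +%FT%TZ) $(cut -d' ' -f1-3 /proc/loadavg)" >> /tmp/load.log

# 3. Policy enforcement: ensure the log keeps safe permissions
chmod 600 /tmp/load.log
```

In practice each of these would run from cron (or a timer) rather than by hand - the point of the talk being that automating them used to be second nature.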

2/3 A Gentle Introduction to Ceph by Tim Serong (Suse) 

I didn't get a lot out of this talk, other than becoming aware that Ceph is a distributed storage system popular in open cloud deployments. His slides are here.

2/4 Keeping Pinterest Running by Joe Gordon

Joe talked about the challenges and differences in supporting a service as opposed to supporting a piece of software. His basic description is that it's like changing tyres whilst driving at 100MPH. The differences include:
  • stable branches
  • no drivers and configurations
  • no support matrix
  • dependency versions
  • dev support their own service
  • testing against prod traffic
One thing that really interested me is their use of a "Run Book" for the on-call support team. All recent changes are documented in the Run Book against anything it could potentially affect and who to contact about those changes. If on-call support has to respond to a problem, they consult the Run Book first.

In addition to a staging environment, they also have what they call the "canary" environment - akin to the canary in a coal mine metaphor. However, Joe said it was more akin to a rabbit in a sarin gas plant metaphor (insert chuckles).

Their dev->prod cycle looks like:
dev->staging->canary->prod

The staging system uses dark traffic, whereas the canary system operates on a small subset of live traffic. If problems occur at any point, they roll back and conduct a blameless post-mortem. Joe emphasised that the blameless component is the most critical.

Before deployment, they conduct a pre-mortem covering:

- Dependencies
- Define an SLA
- Alerting
- Capacity Planning
- Testing
- On call rotation
- Decider to turn feature off if needed
- Incremental launch plan
- Rate limiting 

2/5 Self-healing systems by Tammy

Tammy's central message was on developing self-healing systems through scripting and auto-remediation. For everything you can think of that might go wrong, rather than just logging and crashing, run a script to fix the problem. The motto for their team is KTLO - Keep The Lights On.
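The pattern is simple enough to sketch in a few lines of shell. Everything here is a placeholder - the threshold, the monitored filesystem and the "fix" (reduced to an echo) are all hypothetical, not Tammy's actual scripts:

```shell
#!/bin/sh
# Auto-remediation sketch: instead of only alerting when /var fills up,
# run a cleanup action. Threshold and action are placeholders.
THRESHOLD=90

usage_pct() {
    # Print the use% (digits only) of the filesystem holding $1.
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

remediate() {
    if [ "$(usage_pct /var)" -ge "$THRESHOLD" ]; then
        # A real version might prune rotated logs, e.g.:
        #   find /var/log -name '*.gz' -mtime +7 -delete
        echo "remediating: pruning old logs on /var"
    else
        echo "/var below ${THRESHOLD}%; nothing to do"
    fi
}

remediate
```

Hooked into monitoring, a script like this turns an alert that used to page a human into a log line saying the problem was already fixed.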

She also emphasised the need for a "Captain's Log" - a log of every on-call alert - and for cross-team disaster recovery testing.

2/6 Network Performance Tuning by Jamie Bainbridge

This talk was more of an in-depth tutorial on how to tune the network performance of your system and diagnose network-related problems. It was quite fast-paced; his slides are here.

2/7 'Can you hear me now?' Networking for containers by Jay Coles

I felt a little lost in this talk. This was part 3/3 in a series of talks by Jay on containerisation. As mentioned before, I have neglected containers - something I need to remedy, as it seems everyone has embraced them. Much of the material for this talk is available here.

2/8 Pingbeat: y'know, for pings! by Joshua Rich

This was a great talk! Josh gave a quick overview of ICMP ping and then introduced Pingbeat, a small open-source program written in Go that can be used to record pings to hundreds or thousands of hosts on a network.

Pingbeat's power lies in its ability to write ping responses to Elasticsearch, an open-source NoSQL-like data store with powerful built-in search and analytics. Combined with Kibana, a web-based front-end to Elasticsearch, you get an interactive interface to track, search and visualise your network health in near real-time.
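Conceptually, Pingbeat automates something like the loop below, with the records shipped to Elasticsearch instead of stdout. This is a crude stand-in to show the idea, not Pingbeat's actual code or configuration, and the target host is just an example:

```shell
#!/bin/sh
# What Pingbeat automates, crudely: ping each target once, emit a
# timestamped JSON record suitable for indexing.
for host in 127.0.0.1; do
    rtt=$(ping -c 1 -W 1 "$host" 2>/dev/null |
          awk -F'time=' '/time=/ { split($2, a, " "); print a[1] }')
    echo "{\"ts\":\"$(date -u +%FT%TZ)\",\"host\":\"$host\",\"rtt_ms\":\"${rtt:-timeout}\"}"
done
```

Run at scale, on a schedule, with proper batching into Elasticsearch, you get exactly the Kibana-searchable network-health history Josh described.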
2/9 Supporting research scientists

Being the ninth talk of the day, I kinda snoozed in this one. I didn't find it particularly useful or interesting hearing about the challenges of supporting an IT system used by research scientists.
2/10 Grafana by Andrew

Grafana is an open-source web charting dashboard. It can be configured to use a variety of backend data stores.

Andrew gave a live install, config and run demonstration of Grafana, starting from a fresh Ubuntu 14 VM with Docker (again!), where he installed and set up Graphite using Carbon to log both host CPU resources and MQTT feeds, and created a custom dashboard to suit.
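If you just want to kick the tyres, the Docker quick-start is a one-liner. The image name below is the public Docker Hub one as I understand it (verify before relying on it), and the snippet is guarded so it does nothing on machines without Docker:

```shell
#!/bin/sh
# Grafana quick-start sketch: run the public image, guarded so this
# is a no-op on machines without Docker installed.
if command -v docker >/dev/null 2>&1; then
    docker run -d --name grafana -p 3000:3000 grafana/grafana
    echo "Grafana should be on http://localhost:3000 (default login admin/admin)"
else
    echo "docker not found; install Docker first"
fi
```

From there you add Graphite (or another backend) as a data source in the web UI, much as Andrew did in his demo.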
2/11 Supporting a continuously available service

This was another talk on what it's like to support a continuously available service. Things to gain from this included:

- Fixed time iterations
- Plan and scope the known work for the next 2-3 weeks
- Leave sufficient slack for urgent work
- Be realistic

Interruptions:
- Assign team members to dev teams
- Have a rotating “ops goal keeper” with a day pager who is free from other work
- Have developers on pager as well. This helps in closing the feedback loop so that they are aware of issues in production

2/12 From Commit to Cloud by Daniel Hall

This talk focused on leveraging the benefits of microinstances when managing cloud-based services and infrastructure. Deployments should be:

- Fast (10 minutes)
- Small (ideally a single commit, aware of whole change)
- Easy (as little human involvement as possible, minimise context switching, simple to understand)

This leaves less to break, easier rollbacks and allows the dev team to focus on just one thing at a time rather than a multitude of tracked changes. The basic idea is that deployments should be frequent and nobody should be afraid to deploy.

In the setup Daniel works with, they have:

- 30 separate microservices
- 88 docker machines across 15 workers
- 7 deployments to prod each working day!
- Only 4 rollbacks in 1.5 years

Their deployment steps are:
  1. write some code
  2. push to git repository, build app
  3. automated tests run
  4. app is packaged
  5. deployed to staging
  6. test in staging
  7. approve for prod (single click)
  8. deploy to production
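The eight steps read like a textbook CI pipeline. As a skeleton, with every command an echo stand-in rather than Daniel's actual tooling:

```shell
#!/bin/sh
# Skeleton of the deployment flow above; each step is an echo
# stand-in for the real tool (git hook, CI runner, deploy script).
step() { echo "[pipeline] $*"; }

step "push to git repository (triggers build)"
step "run automated tests"
step "package the app"
step "deploy to staging"
step "test in staging"
step "await one-click approval for prod"
step "deploy to production"
```

The point is the shape, not the tools: small, fast, mostly automated steps with a single human decision just before production.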

2/13 LNAV

The final talk was a 10-minute ad hoc one discussing lnav as a replacement for tail -f /var/log/syslog when looking at the system log. I am fully converted to this tool and will be using it everywhere from now on. It is statically linked, so you can simply copy the binary from one system to another.
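Typical usage, from memory (lnav autodetects common log formats and colourises them); guarded here so the snippet is harmless where lnav isn't installed:

```shell
#!/bin/sh
# lnav usage sketch: point it at a file and it colourises, collates
# and lets you search and filter interactively.
if command -v lnav >/dev/null 2>&1; then
    lnav /var/log/syslog        # like tail -f, but navigable
else
    echo "lnav not installed; see https://lnav.org"
fi
```

Unlike tail -f, you can scroll back, filter, and query the log without restarting anything.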




