In this post, I’ll walk you through the process of upgrading an existing managed Amazon Elasticsearch cluster to Graviton2. We will upgrade the cluster in place, without creating a new one.
This migration is much easier than the self-managed one; it requires only one step ;). First, we need to figure out which Graviton2 instance type to use. As with regular EC2 instances, AWS provides a wide range of instance types for Elasticsearch. The main families are the “m” instances, which are general-purpose, and the “r” instances, which are memory-optimized. The size suffix (e.g. “2xlarge” or “16xlarge”) indicates the number of vCPUs and the amount of memory available on the instance:
The available Amazon OpenSearch-optimized Graviton2 instances are:
m6g.medium.elasticsearch
m6g.large.elasticsearch
m6g.xlarge.elasticsearch
m6g.2xlarge.elasticsearch
m6g.4xlarge.elasticsearch
m6g.8xlarge.elasticsearch
m6g.12xlarge.elasticsearch
m6g.16xlarge.elasticsearch
r6g.large.elasticsearch
r6g.xlarge.elasticsearch
r6g.2xlarge.elasticsearch
r6g.4xlarge.elasticsearch
r6g.8xlarge.elasticsearch
r6g.12xlarge.elasticsearch
r6g.16xlarge.elasticsearch
Note: we also need to make sure the domain runs a version of managed AWS Elasticsearch/OpenSearch that supports Graviton2 instances. For the older Elasticsearch engine, anything newer than 7.8 should work; if you are on the OpenSearch engine, any version works, as Graviton2 has been supported since 1.0.0. If you are running an older version, you will first need to upgrade to a supported version before moving forward.
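Before attempting the instance swap, you can query the domain for its current engine version and compare it against the minimum supported release. A minimal sketch, assuming the minimum supported Elasticsearch release is 7.9 (i.e. anything newer than 7.8); the domain name is a placeholder and the `supported` helper is hypothetical:

```shell
# does a given Elasticsearch engine version support Graviton2? (7.9 and newer)
supported() { printf '%s\n7.9\n' "$1" | sort -V | tail -1 | grep -qx "$1"; }

# hypothetical domain name; requires AWS credentials to be configured
ver=$(aws es describe-elasticsearch-domain --domain-name search-domain \
        --query 'DomainStatus.ElasticsearchVersion' --output text 2>/dev/null) || ver=""
if [ -n "$ver" ] && supported "$ver"; then
  echo "$ver supports Graviton2 instances"
fi
```

If the check fails, upgrade the domain first and re-run it before changing the instance type.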
The actual migration only requires us to change the instance type. This can be done in the AWS console, with the AWS CLI, or with a tool like Terraform. Since I use Terraform to manage all my cloud assets, I will show how this is done with Terraform; it looks something like this:
resource "aws_elasticsearch_domain" "elasticsearch_domain" {
  domain_name           = "search-domain"
  elasticsearch_version = "7.10"

  cluster_config {
    instance_type            = "m6g.large.elasticsearch" # this replaces the previous m5 type of instance we had
    instance_count           = 3
    dedicated_master_enabled = true
    dedicated_master_count   = 3
  }

  ebs_options {
    ebs_enabled = true
    volume_type = "gp3"
    volume_size = 1000
  }

  # ... other elasticsearch cluster configs
}
I also want to point out that we can now use gp3 for the EBS volumes, which allows for much better performance and a larger maximum size per data node. This is a great optimization that can make the cluster much faster and reduce the need for extra data nodes (we were able to cut our node count in half from this combination: Graviton2 for better performance and gp3 for higher storage capacity per node).
Once you run terraform apply with the new instance type, this kicks off the automatic blue/green deployment of AWS managed Elasticsearch: it spins up a new set of nodes and migrates the data to them; once this is done, the original nodes are automatically removed. Depending on the size of the data in your cluster this might take a long time, and terraform might time out (60m by default). If this happens, you can use the AWS console or CLI to monitor the status of the migration:
aws es describe-elasticsearch-domain --domain-name <domain-name> --query 'DomainStatus.Processing'
returns true while the blue/green deployment is still in progress and false once it has completed. You can also check the health of the cluster after the migration:
curl -s https://<domain-endpoint>/_cluster/health?pretty
This returns the current health of the Elasticsearch cluster (the AWS CLI does not expose cluster health directly, so we query the domain endpoint). If the migration completed successfully, the cluster should report a green health status; yellow or red indicates problems that need to be addressed.
Note: in theory there should be no downtime during the process, but performance might be slightly impacted during the blue/green migration.
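If you prefer to script the wait instead of watching the console, the Processing flag of the domain can be polled in a loop. A rough sketch (the domain name is a placeholder, the `finished` helper is hypothetical, and the AWS CLI is assumed to be configured):

```shell
# the CLI prints the DomainStatus.Processing flag as the text "True"/"False"
finished() { [ "$1" = "False" ]; }

DOMAIN="search-domain"   # hypothetical domain name
while :; do
  p=$(aws es describe-elasticsearch-domain --domain-name "$DOMAIN" \
        --query 'DomainStatus.Processing' --output text 2>/dev/null) || p=""
  if finished "$p"; then echo "blue/green migration complete"; break; fi
  if [ -z "$p" ]; then echo "could not query the domain"; break; fi
  echo "migration still in progress..."; sleep 60
done
```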
As you can see, there is a huge advantage to performing such a migration with a managed service compared with the self-managed approach, where we had to handle everything ourselves.
Upgrading a managed Amazon Elasticsearch cluster to Graviton2 is a straightforward process that can provide significant benefits. By upgrading to Graviton2 instances, you can improve performance, reduce costs, and increase the efficiency of your infrastructure. AWS offers several Graviton2 instance types optimized for Amazon OpenSearch, each with its own set of advantages.
In this post, I have walked you through the process of upgrading an existing managed Amazon OpenSearch cluster to Graviton2 instances in place, without creating a new cluster, and provided Terraform examples and command-line steps to help you through the process.
Overall, upgrading your managed Amazon OpenSearch cluster to Graviton2 instances is a great way to take advantage of the latest technology and improve the performance and cost efficiency of your search application.
The first Elasticsearch version that added support for ARM processors was Elasticsearch 7.8. This version introduced official support for the ARM64 architecture and was released on May 26, 2020. Before this release, Elasticsearch was only officially supported on x86 platforms. In our case, this required us to migrate to a supported version first: we were running an older release in the stable 7.x branch and upgraded to 7.17 using the standard Elasticsearch rolling-upgrade docs.
Here are the steps needed for this migration:
The first step in the migration process is to create new Graviton2-based EC2 instances. You can do this using the AWS Management Console or the AWS CLI, or, even better, with terraform as I do. Various Linux distributions run on ARM, but I chose an Amazon Linux 2 AMI because it is very well supported by AWS. In the AWS console you can filter AMIs with “Architecture” set to “arm64” to find the latest Amazon Linux 2 AMI, or use a simple AWS CLI command like:
aws ssm get-parameters --names /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-arm64-gp2 --region us-east-1
This returns the Graviton2 AMI for the region we are using. We then use it in our terraform code to create the new Graviton2 instances; for example:
# Elasticsearch nodes
resource "aws_instance" "es_nodes" {
  count           = 3
  ami             = "ami-XXX" # Replace with the AMI we found above
  instance_type   = "c6g.large"
  security_groups = [aws_security_group.es_node_sg.name]

  user_data = <<-EOF
    #!/bin/bash
    echo "cluster.name: es-cluster" >> /etc/elasticsearch/elasticsearch.yml
    echo "node.name: ${format("es-node-%02d", count.index + 1)}" >> /etc/elasticsearch/elasticsearch.yml
    echo "network.host: [_ec2:privateIpv4_, _local_]" >> /etc/elasticsearch/elasticsearch.yml
    systemctl restart elasticsearch
  EOF

  tags = {
    Name = "es-node-${count.index + 1}"
  }
}
Normally we would install Elasticsearch on the nodes from the user_data script, but during this migration we went with a more manual method: you can install Elasticsearch using the RPM or DEB packages provided by Elastic. Here is an example of installing Elasticsearch on an Amazon Linux 2 instance:
sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
sudo tee /etc/yum.repos.d/elasticsearch.repo <<EOF
[elasticsearch-7.x]
name=Elasticsearch repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF
sudo yum install -y elasticsearch
After installing Elasticsearch on the new Graviton2-based EC2 instance, the next step is to configure Elasticsearch to use the existing data and settings from the old Elasticsearch cluster. You can do this by copying the Elasticsearch configuration files from the old cluster to the new instance.
This might look something like this:
rsync -avz --progress --delete /path/to/old/cluster/config/ ec2-user@new-instance-ip:/etc/elasticsearch/
Finally, once the Elasticsearch configuration files are copied to the new Graviton2-based EC2 instance, the next step is to start Elasticsearch on the new instance. You can do this using the Elasticsearch service command. Here is an example command to start Elasticsearch on the new instance:
sudo service elasticsearch start
The final step in the migration process is to verify that the data and settings from the old Elasticsearch cluster have been successfully migrated to the new Graviton2-based EC2 instance. You can do this by checking the Elasticsearch logs and running some search queries on the new instance.
Here is an example command to check the Elasticsearch logs on the new instance (the log file is named after the cluster, es-cluster in our config):
sudo tail -f /var/log/elasticsearch/es-cluster.log
This command shows the Elasticsearch logs on the new instance, and you can use it to check if any errors or warnings are reported during the migration process.
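Beyond tailing the logs, the cluster HTTP API is the quickest way to confirm the new node joined and the cluster is healthy. A small sketch, where the endpoint is a placeholder for one of your nodes and the `is_green` helper is hypothetical (it just greps the health JSON):

```shell
# succeed when a _cluster/health JSON body reports a "green" status
is_green() { echo "$1" | grep -q '"status" *: *"green"'; }

ES="http://localhost:9200"        # hypothetical endpoint reachable on the node
curl -s "$ES/_cat/nodes?v" || true   # did the new node join the cluster?
health=$(curl -s "$ES/_cluster/health?pretty") || health="{}"
if is_green "$health"; then
  echo "cluster is green"
else
  echo "cluster is not green yet"
fi
```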
After all the new Graviton2 instances are in sync in the cluster, you can go ahead and remove the old Intel instances one by one and allow the cluster to rebalance.
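A gentler variant of removing an old node is to drain it first with a shard-allocation exclusion, terminating the instance only after its shards have relocated. A sketch under the assumption that the cluster is reachable locally; the endpoint, the IP, and the `drain_payload` helper are placeholders:

```shell
# build the cluster-settings payload that excludes one node's IP from allocation
drain_payload() {
  printf '{"transient":{"cluster.routing.allocation.exclude._ip":"%s"}}' "$1"
}

ES="http://localhost:9200"   # hypothetical endpoint
OLD_IP="10.0.0.11"           # hypothetical IP of the old node to retire
curl -s -X PUT "$ES/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d "$(drain_payload "$OLD_IP")" || true
# watch relocating_shards drop to 0, then it is safe to terminate the instance
curl -s "$ES/_cluster/health?pretty" || true
```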
Migrating an Elasticsearch cluster to running on Graviton EC2 instances can provide significant performance and cost benefits. In this blog post, I walked you through the process of migrating an existing Elasticsearch cluster to new Graviton2-based EC2 instances. By following the steps outlined in this post, you can easily migrate your Elasticsearch cluster to Graviton2-based EC2 instances and take advantage of the cost/performance improvements they offer.
mysql> set global innodb_max_dirty_pages_pct = 0;
You can check the number of dirty pages with the command:
mysqladmin ext -i10 | grep dirty
Let the server run like this for a while and after you see it settle in, the restart (or stop) should be much faster.
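If you prefer a one-shot reading over the repeating mysqladmin loop, the counter can be pulled out of the extended-status output with a tiny awk filter. A sketch, assuming credentials come from ~/.my.cnf; the `dirty_count` helper is hypothetical and just parses the table-style output:

```shell
# pull the value out of a line like: | Innodb_buffer_pool_pages_dirty | 4203 |
dirty_count() {
  awk -F'|' '/Innodb_buffer_pool_pages_dirty/ { gsub(/ /, "", $3); print $3 }'
}

# hypothetical: server credentials come from ~/.my.cnf
mysqladmin extended-status 2>/dev/null | dirty_count || true
```

Run it a few times; once the number stops falling, the flush has settled.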
When you are ready to upgrade, you will notice that unfortunately there is no official migration path. This howto documents what I’ve used myself for such migrations, and hopefully it will help you if you are trying to perform a similar upgrade.
Opscode has done an amazing job with the omnibus installers, and starting with Chef 11 the chef server supports this too: you can install a new chef server simply by installing the rpm or deb for your platform, and everything is installed for you (ruby/gems, chef, rabbitmq, solr, erlang, postgresql, nginx). Just head over to http://www.opscode.com/chef/install/ and download the version for your OS from the chef-server tab.
In order to migrate to a new chef server, we first need to export all the existing data from the old server.
It is important to have all the clients with their proper public keys because if not we would have to re-register each one of them.
Personally, I’ve used this process to migrate several servers from open source chef 0.10.x to chef 11, but in theory it should work from any chef server implementation (hosted, private, etc.) because we download and upload the assets using API calls.
You can use my knife-backup plugin for this. Once you install the gem you can just run it and it will backup all the objects from the existing server:
gem install knife-backup
knife backup export
This might take a while depending on how many nodes/clients, cookbooks, etc. you have. Once completed, you will have all the needed files in .chef/chef_server_backup.
Optional: if you have many unused cookbook versions, you might want to clean them up before the backup. You can use my knife-cleanup plugin for this:
gem install knife-cleanup
knife cleanup versions -D
I would recommend setting up a new server, as this is the safest approach in case something doesn’t work out, and you don’t have to touch your current environment. As mentioned earlier, you can install the new server very easily with the omnibus installer. For example, for Ubuntu 12.04 this looks like:
wget https://opscode-omnitruck-release.s3.amazonaws.com/ubuntu/12.04/x86_64/chef-server_11.0.6-1.ubuntu.12.04_amd64.deb
sudo dpkg -i chef-server*
sudo chef-server-ctl reconfigure
You can also use the chef-server cookbook to install your new server if you prefer that.
Once you have the new chef server up and running, you will need to set up a new admin account and a new knife config. I would recommend using a dedicated user for this, so it does not interfere with the users we are importing from the old server; I would call it transfer. From the local server this would look like:
mkdir -p ~/.chef
sudo cp /etc/chef-server/chef-webui.pem ~/.chef/
sudo cp /etc/chef-server/chef-validator.pem ~/.chef/
marius@chef:~# knife configure -i
WARNING: No knife configuration file found
Where should I put the config file? [/marius/.chef/knife.rb]
Please enter the chef server URL: [http://localhost:4000] https://localhost
Please enter a clientname for the new client: [transfer]
Please enter the existing admin clientname: [chef-webui]
Please enter the location of the existing admin client's private key: [/etc/chef/webui.pem] ~/.chef/chef-webui.pem
Please enter the validation clientname: [chef-validator]
Please enter the location of the validation key: [/etc/chef/validation.pem] ~/.chef/chef-validator.pem
Please enter the path to a chef repository (or leave blank):
Creating initial API user…
Created client[transfer]
Configuration file written to /marius/.chef/knife.rb
Note: the default server keys are now located in /etc/chef-server/ and not in /etc/chef like they used to be, which is definitely a welcome change. Also, knife configure still suggests a default server url with http and port 4000, but chef 11 runs behind an nginx load balancer and listens by default on the standard https port.
Finally, we can restore all the data from the old server. Transfer the backup and, for simplicity, drop it in your user’s .chef folder under .chef/chef_server_backup; be sure to install the knife-backup gem on the server, and then you should be able to run:
marius@chef:~# knife backup restore
WARNING: This will overwrite existing data!
WARNING: Backup is at least 1 day old
Do you want to restore backup, possibly overwriting exisitng data? (Y/N) y
Restoring clients
...
And this should restore all the data in the new server. Final step would be to regenerate the indexes:
chef-server-ctl reindex
Note: I want to point out that currently knife-backup skips any clients that already exist on the server, as I could not find a way to overwrite them using the API. This means the validation key will almost certainly need to be changed, since that client will already exist on the newly installed server.
After the data migration is completed, you will probably just have to point your DNS alias to the new server. One issue I’ve noticed is that the chef server, when installed, uses the local DNS record in various places in its config files. When working on a temporary server, this caused problems once we changed the DNS and activated the server: the chef server sends clients links from which to download assets (cookbook parts, for example), and if this was misconfigured at install time you might have to correct it to a DNS entry the clients can actually reach; check it out:
grep s3_url /var/opt/chef-server/erchef/etc/app.config
and restart the chef server after correcting the s3_url:
chef-server-ctl restart
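Before flipping the DNS, it is worth a quick sanity check that every object made it across. One way, assuming you keep two knife configs around (both file names here are hypothetical), is to diff the object lists from the old and new servers:

```shell
# hypothetical knife configs pointing at the old and the new chef server
knife node list -c ~/.chef/knife-old.rb 2>/dev/null | sort > /tmp/nodes-old || true
knife node list -c ~/.chef/knife.rb     2>/dev/null | sort > /tmp/nodes-new || true
# nodes present on the old server but missing on the new one
comm -23 /tmp/nodes-old /tmp/nodes-new
```

The same comparison works for clients, roles, environments, and data bags.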
Hopefully this post will help you migrate to Chef 11. Feel free to let me know in the comments below if you had any issues following this process, or if it worked without any problems. Also, if you find any problems with the tools used here (knife-cleanup or knife-backup), please open a ticket on github or submit a patch. Good luck!
knife-backup will back up all the cookbook versions available on the chef server. Cookbooks are normally kept in a repository and should be easy to upload from there, but if you are using different cookbook versions in each environment then it might not be trivial to find and upload them back to the server; downloading them and having them ready to upload is simple and clean. If you have too many cookbook versions, you might want to clean them up first using something like knife-cleanup.
If you want to check it out, just install the gem:
gem install knife-backup
and then just point it to an existing chef server to backup all its objects with:
knife backup export
If you need to restore then it is simple as:
knife backup restore [-d DIR]
Hope you find this useful; I’m looking forward to your feedback.
Patches are welcome: knife-backup on github
hadoop 0.1.118 0.1.116 0.1.115 0.1.114 0.1.113 0.1.111 0.1.109 0.1.108 0.1.106 0.1.105 0.1.104 0.1.103 0.1.102 0.1.101 0.1.99 0.1.98 0.1.97 0.1.96 0.1.95 0.1.94 0.1.93 0.1.92 0.1.91 0.1.90 0.1.89 0.1.88 0.1.87 0.1.86 0.1.85 0.1.84 0.1.83 0.1.82 0.1.81 0.1.80 0.1.79 0.1.78 0.1.77 0.1.76 0.1.75 0.1.74 0.1.73 0.1.72 0.1.71 0.1.70 0.1.69 0.1.68 0.1.67 0.1.66 0.1.65 0.1.64 0.1.63 0.1.62 0.1.61 0.1.60 0.1.59 0.1.58 0.1.57 0.1.56 0.1.55 0.1.54 0.1.53 0.1.52 0.1.51 0.1.50 0.1.49 0.1.48 0.1.47 0.1.46 0.1.45 0.1.44 0.1.43 0.1.42 0.1.41 0.1.40 0.1.39 0.1.38 0.1.37 0.1.36 0.1.35 0.1.34 0.1.33 0.1.32 0.1.31 0.1.30 0.1.29 0.1.28 0.1.25 0.1.24 0.1.23 0.1.22 0.1.21 0.1.20 0.1.19 0.1.18 0.1.17 0.1.16 0.1.15 0.1.13 0.1.12 0.1.11 0.1.10 0.1.9 0.1.8 0.1.7 0.1.6 0.1.5 0.1.4 0.1.3 0.1.2 0.1.0
(and this was the cookbook with the least versions that I’ve found to paste here).
While working on knife-backup I realized what a huge waste this was, and decided that I needed a way to clean these and keep on the server just the relevant ones.
To solve this problem I wrote knife-cleanup, and if you have similar needs you might find it useful. It will clean up all the unused versions of the cookbooks on your chef server (this might be the hosted opscode platform or the open source server). Before deleting anything, it backs up each version it touches (just in case).
If you want to check it out, just install the gem:
gem install knife-cleanup
and assuming you have a working knife config you can run it with:
knife cleanup versions
and this will output the versions it would delete.
If you are ready to delete, you can do that with:
knife cleanup versions -D
and you can find the backups of the versions deleted under .cleanup/cookbook_name
Notes: I’ve seen various cases where it is impossible to download a cookbook version (and knife will error out). From my experience there is not much we can do about that, so the script will skip the backup but still delete the corrupt version. You might want to take a full chef server backup first (see knife-backup) just in case. The way I use this is with exact version pinning of cookbooks in environments (for more details see chef-jenkins); if you use environments and cookbook versions differently, then this might not make sense for you.
Hope you find this useful; I’m looking forward to your feedback.
Patches are welcome: knife-cleanup on github
One of the first things we did last year was to introduce the Chef Cafes. These are small events (capped at 10 people) held consistently at the same time (the 1st and 3rd Thursday of the month) at the best coffee shop in Mountain View (Red Rock Coffee), with the intent to facilitate interaction between people and give them a place where they can regularly meet, discuss chef, ask questions, and help other members in the spirit of the open source community. The first Chef Cafe was on March 1st, 2012, and it was just me and Rob (we had a good time preparing the future events and just catching up). After that, we had 16 Chef Cafes throughout the year, many with 10 or even more people, and each one was unique and special in its own way. At some, new chef users brought various questions on how to use chef, and we tried to help them resolve their blockers and get up to speed. At others, really advanced users brainstormed about various unresolved problems and shared their take on things like cookbook testing, workflow, or orchestration. Overall, I think it was a great success that allowed us to be more connected with members, and more open and helpful to new chef users.
In 2013 we look forward to your suggestions on how we can improve the Chef Cafes, and we will try to keep them going. We hope to move one to San Francisco and keep the other in the South Bay, as we have had various requests for that. So if you are in the City and want to get involved, please ping me.
One other thing we have tried to do was to bring consistency and have at least one meetup every month with an awesome presentation on some hot topic in the chef community. This ended up being a little too optimistic :(. Still, we had 6 cool meetups with speakers like:
and we also had Aaron Peterson running an introductory Chef Workshop; considering the big and diverse audience I think we have done quite a great job with that.
With the experiences we had last year, we are more confident that this year we will be able to run one meetup every month, but we need your help: we are always looking for great speakers and interesting topics; if you want to present at one of our meetups please let us know; also if you know someone that we should invite to present to a meetup please let us know.
Most of our meetups last year were hosted by Survey Monkey in Palo Alto and we can’t thank them enough for their support (special thanks to Tim Sabat for making them possible). We also had one meetup in San Francisco hosted at Scalr offices (thanks Sebastian). This year, we hope to diversify and run each meetup in a different place to make things more interesting; and hopefully more meetups in the City. If you are interested in hosting and sponsoring one of our future meetups please contact me privately and let me know.
During the last year our group has grown a lot: we started with 132 members on the first day of January 2012 and ended the year with more than 400 members. This shows that the interest in Chef is clearly growing, and hopefully the events we have been organizing are helping grow our local chef community.
If you have any suggestions on what you would like us to do in the future, please let us know. Use the comments below, send us a message, whatever works for you; we would love to hear from you and see how we can serve you better. Overall, 2012 was great, and with your help we can make 2013 even better!
Believe it or not, I had 364 blog posts when I started the migration, meaning a lot of energy was spent importing the old articles. I used exitwp to convert the wordpress-xml export of the blog posts, which produced a reasonably good result. Still, I had to run some fixes…
for code blocks:
perl -pi -e 's/([^\`]|^)(\`)([^\`]|$)/$1\n\`\`\`\n$3/g' *
to enable comments (as ‘comments: true’ was missing from all posts)
find source/_posts/ -type f -print0 | xargs -0 -I file sed -i '' '2 i \
comments: true' file
I enabled the octopress category list plugin and tags plugin, which you can see in the sidebar. Since I already had tags and categories on all posts, it was very important to keep the same urls and not break them; the same goes for regular post urls. Here are the relevant settings from the octopress config file:
root: /
permalink: /:year/:month/:day/:title/
category_dir: category
tag_dir: "tag"
Just keep in mind that if you have many tags, as I do, page generation will take much longer after you enable the tags plugin. You’ve been warned!
Not working at all… I wrote a post specifically about this; check it out here
My wordpress blog has been around for a while (6 years, more or less), and even though I have always used feedburner, for some strange reason I always advertised my own feed url. This of course no longer worked with octopress, so I had to set up a rewrite rule to avoid breaking everyone’s feed reader:
RewriteEngine On
Options +FollowSymLinks -Multiviews
# Feed url
RewriteRule ^feed/?$ atom.xml [QSA,L]
Redirecting the bare domain to www was done automatically by wordpress, but octopress will happily serve the non-www domain as well. This can cause issues with search engines and such, so I wanted the same behaviour. Apache again to the rescue:
RewriteCond %{HTTP_HOST} !^www [NC]
RewriteRule $ http://www.%{HTTP_HOST}%{REQUEST_URI} [L,R]
After you generate your octopress site, everything is static and fast by default. Still, you want to make sure apache has some basic caching and compression settings to make it even better. Here are the relevant parts from my config:
#### CACHING ####
<IfModule mod_expires.c>
ExpiresActive On
# 1 MONTH
<FilesMatch "\.(ico|gif|jpe?g|png|flv|pdf|swf|mov|mp3|wmv|ppt)$">
ExpiresDefault A2419200
Header append Cache-Control "public"
</FilesMatch>
# 3 DAYS
<FilesMatch "\.(xml|txt|html|htm|js|css)$">
ExpiresDefault A259200
Header append Cache-Control "private, must-revalidate"
</FilesMatch>
# NEVER CACHE
<FilesMatch "\.(php|cgi|pl)$">
ExpiresDefault A0
Header set Cache-Control "no-store, no-cache, must-revalidate, max-age=0"
Header set Pragma "no-cache"
</FilesMatch>
</IfModule>
### Compression ####
<IfModule mod_deflate.c>
<IfModule mod_setenvif.c>
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html
</IfModule>
<IfModule mod_headers.c>
Header append Vary User-Agent env=!dont-vary
</IfModule>
<IfModule mod_filter.c>
AddOutputFilterByType DEFLATE text/css application/x-javascript text/x-component text/html text/richtext image/svg+xml text/plain text/xsd text/xsl text/xml image/x-icon
</IfModule>
</IfModule>
If you have many posts, generating the octopress site becomes extremely slow (in my case a full generate takes about 2 minutes), which makes it basically impossible to work on a new post and preview the feedback locally. The solution is well documented: isolate the post you are working on, and when you are done, integrate the other posts back before publishing:
rake new_post['Finally Migrated to Octopress']
rake isolate[finally-migrated-to-octopress]
and now rake generate and rake preview will only work with the new post. Finally, when done and ready to publish the awesome new post on the internets:
rake integrate
rake generate
rake deploy
Trying to understand and debug this issue, I looked in source/_includes/disqus.html and found the code that generates the javascript variable disqus_identifier for the posts. Looking at the html generated for some blog posts, the variables disqus_url and disqus_identifier looked ok, like this:
var disqus_identifier = 'http://www.ducea.com//2012/11/12/disqus-comments-not-visible-in-octopress/';
var disqus_url = 'http://www.ducea.com//2012/11/12/disqus-comments-not-visible-in-octopress/';
var disqus_script = 'embed.js';
Still, at a closer look I was able to identify the issue: the url above contains a double / and, even though that should resolve to the same url, Disqus was actually treating it as a separate identifier and hence not showing the comments associated with it. Once I figured that out, it was very simple to see where it came from: the site url in _config.yml was:
url: http://www.ducea.com/
and fixing it, by removing the trailing slash:
url: http://www.ducea.com
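A quick way to spot this class of problem before deploying is to grep the generated site for identifiers containing a double slash. A small sketch (octopress generates into public/ by default; the `find_bad_ids` helper is hypothetical, and the "com//" pattern is tailored to this site's domain):

```shell
# list disqus identifiers that contain a double slash after the host name
find_bad_ids() {
  grep -rh "var disqus_identifier" "$1" 2>/dev/null | grep "com//"
}

find_bad_ids public || echo "no double-slash identifiers found"
```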
Regenerating and deploying the site:
rake generate
rake deploy
fixed the issue, and the comments are now back on the site (you can even try it out here on this post ;).
Hopefully this will help others in the same situation… if you just added an extra slash to the Octopress site url config and didn’t realize it broke the Disqus comments.
Even though I did not attend any workshop (there were 2 flavors, one targeted at a sysadmin workflow and one at developers), the general feeling from the people I talked with who attended was that it was a very good experience, with a lot of hands-on practical examples. On Tuesday afternoon I attended the “ChefConf Pre-event Hackday: TEST ALL THE THINGS!!!” organized by Bryan Berry, and it was great; it showed how many people are interested in testing their infrastructure as code, focusing on cookbook testing (unit and integration), continuous integration with jenkins, and other things like that ;)
The first full day of ChefConf was Wednesday. The conference was structured with main presentations during the mornings and breakout sessions in the afternoons (with 2 main tracks plus a vendor one). From the beginning you could tell that this would be a very well run conference; even though it was the first one, people like Jesse Robbins have a lot of experience running such events. Not surprisingly, ChefConf kicked off with Adam Jacob’s “State of the Union Part 1: Chef, Past and Present” (video); Jesse Robbins talked about the community around chef, how it is a key part of Opscode’s strategy, and their efforts to take it to the next level. He showed a very nice visualization of the commits to the chef github repo.
There were many interesting talks during the day; most of them were recorded and hopefully they will be available online soon, so you can watch them if you didn’t have the chance to be here (or want to review them again). I particularly enjoyed:
Ron Vidal - Operations Secret Sauce: Incident Management (video); similar to Jesse Robbins’ GameDay talk, it was a very nice addition, inspirational and full of interesting points.
Jim Hopp - Test-driven Development for Chef Practitioners (video); very well prepared and presented. I hope to have Jim at our Chef Bay Area meetup group to present something similar and run a testing hackathon.
Patrick McDonnell - Lessons from Etsy: Avoiding Kitchen Nightmares; people seem to love everything Etsy is doing, and they share a lot of their chef workflow and open source various tools they write.
and many others…
In the evening we had a great Ignite event run by Andrew Shafer in his inimitable way. We had 10 ignite speakers, and in the middle there was a fun karaoke ignite where 10 volunteers riffed on slides they had never seen before. If they recorded this and put it online, look up the ones by Stephen Nelson-Smith and John Vincent, as they were very entertaining.
The second day of the conference started with Christopher Brown’s “State of the Union Part 2: Chef, the Future”, where he outlined some of the future features and main focus areas for Opscode and Chef: becoming easier to install and use (the omnibus installer), enterprise readiness, a focus on Windows, and a big focus on quality. Opscode is working on a project called kitchen chef that will make it possible to test the functionality of cookbooks on various environments and platforms, and quickly ensure cookbook quality is maintained across iterations. A lot of work has also been put into reporting and handlers. The server side has been completely rewritten in erlang and sql (from ruby and couchdb), and we should see this soon in the open-source and private chef servers. From the work done you can easily tell that a lot of effort has gone into private chef, and this is quickly becoming an important asset for Opscode going forward.
There were many great talks during the day from speakers like Artur Bergman, Ben Rockwood, Jason Stowe, John Esser, Rob Hirschfeld, Theo Schlossnagle, etc. I finished my day just like I started Tuesday, with another event focused on testing: the “Test Driven Development Roundtable”, run by Stephen Nelson-Smith with a panel of Seth Chisamore, Jim Hopp and my friend Rob Berger. They went over the tools people are using these days and what is still missing and needs to be worked on regarding testing.
Overall, I think this was an awesome event and I hope to be able to attend the next one too (hopefully at the same place). My impression is that Opscode is ready to take the next step and grow the community even bigger: “The revolution will not be televised - it will be coded with chef”.
First we need to identify the file that is causing this issue; to do so, we verify all the packed objects and look for the biggest ones:
git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -5
(and grab the object IDs of the biggest files). Then find the names of the files behind those objects:
git rev-list --objects --all | grep <object_id>
Next, remove the file from all revisions:
git filter-branch --index-filter 'git rm --cached --ignore-unmatch <filename>'
rm -rf .git/refs/original/
Edit .git/packed-refs and remove/comment any external pack-refs. Without this the cleanup might not work. In my case I had refs/remotes/origin/master and some other branches.
vim .git/packed-refs
Finally, repack, clean up, and remove those objects:
git reflog expire --all --expire-unreachable=0
git repack -A -d
git prune
Hopefully these steps will help you completely remove those unwanted files from your git history. Let me know if you have any problems after following these simple steps.
Note: if you want to test these steps here is how to quickly create a test repo:
# Make a small repo
mkdir test
cd test
git init
echo hi > there
git add there
git commit -m 'Small repo'
# Add a random 10M binary file
dd if=/dev/urandom of=testme.txt count=10 bs=1M
git add testme.txt
git commit -m 'Add big binary file'
# Remove the 10M binary file
git rm testme.txt
git commit -m 'Remove big binary file'
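Putting it all together, here is a hedged end-to-end sketch that builds the test repo above (with a 1MB file instead of 10MB, to keep it quick), purges the binary from history using the steps from this post, and checks that it is gone. The `-- --all` on filter-branch and the `FILTER_BRANCH_SQUELCH_WARNING` variable are my additions to make the run non-interactive across all refs on newer git versions:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q test && cd test
git config user.email test@example.com   # local identity so commits work anywhere
git config user.name test

# build the small test repo from the post
echo hi > there
git add there && git commit -qm 'Small repo'
dd if=/dev/urandom of=testme.txt count=1 bs=1M 2>/dev/null
git add testme.txt && git commit -qm 'Add big binary file'
git rm -q testme.txt && git commit -qm 'Remove big binary file'

# purge the file from every revision
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
  --index-filter 'git rm --cached --ignore-unmatch testme.txt' -- --all
rm -rf .git/refs/original/
git reflog expire --all --expire-unreachable=0
git repack -A -d
git prune

# the file should no longer appear in any revision
if git rev-list --objects --all | grep -q testme.txt; then
  echo "testme.txt still present"
else
  echo "testme.txt purged"
fi
```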
If you are going to LISA11 in Boston next week, we should definitely meetup. Contact me on twitter or email.
The Limoncelli Test was a very interesting presentation by Tom Limoncelli, based on a blog post he wrote earlier this year. If you haven’t done so already, I would strongly recommend taking the test to see how your sysadmin team ranks on “The Limoncelli Test”.
Recovering From Linux Hard Drive Disasters is Theodore Ts’o’s signature training material on what to do if you have any sort of hard drive failure; it covers in depth how to recover from such disasters, whether caused by software or hardware failures.
GameDay: Creating Resiliency Through Destruction (slides): I very much enjoyed Jesse Robbins’s presentation, where he draws parallels between two of his greatest passions: firefighting and operations. Watch the video.
SRE@Google: Thousands of DevOps Since 2004: Tom Limoncelli describes the technologies and policies that Google uses to do what is (now) called DevOps. Watch the video.
Also, my colleagues on the LISA11 blogging team (Ben, Rikki and Matt) have done some very interesting interviews with key people from LISA11 to get you prepared for the event. Check out the USENIX blog for more from us in the next week.
Here is also a quick intro of our team: “LISA11 Next Week – Meet your blog team!”
Luckily, Jordan Sissel has built a tool called FPM (Effing Package Management) exactly for this: to ease the pain of building new packages; packages that you will use in your own infrastructure and want customized to your own needs, without caring about upstream rules, standards, and other limitations. This can be very useful for people deploying their own applications as rpms (or debs) and can simplify a lot of the process of building those packages.
FPM can be easily installed on your build system using rubygems:
gem install fpm
Once installed, you can use fpm to build packages (the targets) from any of several input sources:
Use the command line help (fpm --help) or the wiki to see full details on how to use it. I’ll show some simple examples of building packages from various input sources that I’ve found useful myself.
This is how you would usually package an application that you would install with:
./configure; make; make install
For example, here is how you can create an rpm of the latest version of memcached:
wget http://memcached.googlecode.com/files/memcached-1.4.7.tar.gz
tar -zxvf memcached-1.4.7.tar.gz
cd memcached-1.4.7
./configure --prefix=/usr
make
So far everything looks like a normal manual installation (which would be followed by make install). Instead, we will now install it into a separate folder so we can capture the output:
mkdir /tmp/installdir
make install DESTDIR=/tmp/installdir
and finally using fpm to create the rpm package:
fpm -s dir -t rpm -n memcached -v 1.4.7 -C /tmp/installdir
where -s is the input source type (directory), -t is the type of package (rpm), -n is the name of the package, -v is the version, and -C is the directory where fpm will look for the files. Note: you might need to install various libraries to build your package; for example, in this case I had to install libevent-dev.
If you are packaging your own application you can do this just by pointing to your build folder and setting the version of the app. Here is an example for a deb package:
fpm -s dir -t deb -n myapp -v 0.0.1 -C /build/myapp/0.0.1/
There are various other parameters that you can use but basically this is how simple it is to build a package from a directory. Here is an example on how to define some dependencies on the package you are building (using -d; repeat it as many times as needed):
fpm -s dir -t deb -n memcached -v 1.4.7 -C /tmp/installdir \
-d "libstdc++6 (>= 4.4.5)" \
-d "libevent-1.4-2 (>= 1.4.13)"
You can create a deb or rpm from a gem very simply with fpm:
fpm -s gem -t deb <gem_name>
this will download the gem and create a package named rubygem-<gem_name>. For example:
fpm -s gem -t deb fpm
will create a debian package for fpm: rubygem-fpm_0.3.7_all.deb
You can inspect it with dpkg --info and notice that in this case it nicely fills in all the fields with the maintainer and the dependencies on various other gems. Very cool.
If you use python and want to package various python eggs, this works exactly the same; you just use -s python (it will download the python packages with easy_install first).
Overall FPM is a great tool and can help simplify the way you build your own packages. Check it out and let me know what you think and whether you found it useful. And if you did, don’t forget to thank Jordan for his great work on this awesome tool.
So if you have an idea for a chef cookbook, now is the time to start working on it. I’m offering my help for free to all my blog readers: I will help you write a cookbook by implementing your ideas, help review it or suggest improvements, or whatever else you might need help with. Use the contact form to email me (or DM me on twitter) and let me know how I can help.
If you don’t have time to write a new cookbook but you have a great idea for one that is missing from the opscode community site, please post it below in the comments section and I’m sure some of my blog readers will help create it.
Again this is a brilliant idea from Opscode and it creates a win-win situation for everyone. I’m just curious, is this the first idea from their new community manager? If this is the case, great job Jesse ;).
So let’s see how we can use veewee. I’m assuming you already have vagrant installed (and virtualbox), but if you don’t, please install them first. To install veewee we just have to install the veewee gem:
gem install veewee
Once veewee is installed you will see a new task added to vagrant: basebox.
Here is the list of the templates we get out of the box once we install veewee:
vagrant basebox templates
The following templates are available:
vagrant basebox define '' 'archlinux-i686'
vagrant basebox define '' 'CentOS-4.8-i386'
vagrant basebox define '' 'CentOS-5.6-i386'
vagrant basebox define '' 'CentOS-5.6-i386-netboot'
vagrant basebox define '' 'Debian-6.0.1a-amd64-netboot'
vagrant basebox define '' 'Debian-6.0.1a-i386-netboot'
vagrant basebox define '' 'Fedora-14-amd64'
vagrant basebox define '' 'Fedora-14-amd64-netboot'
vagrant basebox define '' 'Fedora-14-i386'
vagrant basebox define '' 'Fedora-14-i386-netboot'
vagrant basebox define '' 'freebsd-8.2-experimental'
vagrant basebox define '' 'freebsd-8.2-pcbsd-i386'
vagrant basebox define '' 'freebsd-8.2-pcbsd-i386-netboot'
vagrant basebox define '' 'gentoo-latest-i386-experimental'
vagrant basebox define '' 'opensuse-11.4-i386-experimental'
vagrant basebox define '' 'solaris-11-express-i386'
vagrant basebox define '' 'Sysrescuecd-2.0.0-experimental'
vagrant basebox define '' 'ubuntu-10.04.2-amd64-netboot'
vagrant basebox define '' 'ubuntu-10.04.2-server-amd64'
vagrant basebox define '' 'ubuntu-10.04.2-server-i386'
vagrant basebox define '' 'ubuntu-10.04.2-server-i386-netboot'
vagrant basebox define '' 'ubuntu-10.10-server-amd64'
vagrant basebox define '' 'ubuntu-10.10-server-amd64-netboot'
vagrant basebox define '' 'ubuntu-10.10-server-i386'
vagrant basebox define '' 'ubuntu-10.10-server-i386-netboot'
vagrant basebox define '' 'ubuntu-11.04-server-amd64'
vagrant basebox define '' 'ubuntu-11.04-server-i386'
vagrant basebox define '' 'windows-2008R2-amd64-experimental'
This means that we can build a box based on any of the above templates. That’s awesome! Let’s say we want to build a debian squeeze box using veewee; we would have to run:
vagrant basebox define 'debian-60' 'Debian-6.0.1a-amd64-netboot'
and this will create a folder definitions/debian-60 with the following files (the content of the veewee template):
definition.rb
postinstall.sh
preseed.cfg
We can modify/tune any of those files based on our custom needs. The file definition.rb is the main definition of the template: here you define the memory size, disk size, iso file, etc. The content is very easy to understand, and you would normally not have to change much here. preseed.cfg is a standard preseed file where you customize the actual install process (you could change the partitions or their type, the timezone setup, etc.). Finally, postinstall.sh is a bash script that runs at the end of the installation process and installs ruby, gems, chef and puppet, plus the virtualbox guest additions (needed for shared folders).
If you already have the iso, place it in ‘currentdir’/iso. If not, veewee will download it and place it in the appropriate folder before starting the install process:
vagrant basebox build 'debian-60'
this will start the installation and you can see all the steps it takes (the keystrokes as they are entered, etc.). This can take a while… Once it is done you can validate the build with:
vagrant basebox validate 'debian-60'
(this will run a few basic tests to see if it can connect to the vm as user vagrant, if chef and puppet were installed, if the shared folders are accessible, etc).
And finally you can export it as a vagrant box with:
vagrant basebox export 'debian-60'
and add it to vagrant:
vagrant box add 'debian-60' debian-60.box
and now you can use it in vagrant with:
vagrant init 'debian-60'
That’s it. Very simple, and now we have our own box built from scratch. As a side note, I found this very useful for testing and troubleshooting preseed configurations ;). As you can see there are plenty of templates available in veewee, but if you create a new one please consider sharing it with others and sending it to Patrick on github. I’m sure he will be happy to include it in newer versions of veewee. And if you found this useful don’t forget to thank Patrick for his great work on this awesome tool.
Monitoring with Icinga @ SF Bay Area LSPE meetup
@LSPEMeetup made the video available on justin.tv; unfortunately the video/sound quality is not the best; you can find it here.
vmbuilder kvm ubuntu --suite=lucid --flavour=virtual --arch=amd64 --mirror=http://en.archive.ubuntu.com/ubuntu -o --libvirt=qemu:///system --ip=10.0.0.11 --gw=10.0.0.1 --part=vmbuilder.partition --templates=mytemplates --user=username --pass=password --firstboot=/var/vms/vm1/boot.sh --mem=1024 --hostname=myhost --bridge=br0
Now, even without any tuning, I would have expected it to perform at least at the same level as a Xen instance, or even better. Still, this was not the case: the performance was really horrible, and any kind of IO-bound task would effectively lock up the instance. Looking into this and trying to understand the problem, I was able to isolate the issue to instances that had ext4 as the filesystem (the default for lucid); strangely enough, it didn’t happen on an older instance that was built with ext3 (actually a debian lenny instance). All the images built with the above command use the qcow2 sparse format as the default disk format.
In order to achieve good IO performance we had to use cache='writeback' for the instances; this significantly increases IO performance, bringing it almost to host-level performance, and in any case much better than the old xen instances we had. Here is how you can enable writeback for an instance: stop the vm, edit the guest domain and add cache='writeback' in the driver section, save, and start the vm again:
virsh --connect qemu:///system
shutdown guestdomain
edit guestdomain <-- add cache='writeback' in the driver section
start guestdomain
Here is how the disk section of my guest domain looks after adding the writeback cache:
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='writeback'/>
<source file='/var/vms/vm2/ubuntu-kvm/tmphAUcOB.qcow2'/>
<target dev='hda' bus='ide'/>
</disk>
In the process of debugging and searching for a fix for this issue, I found that it can also be useful to use elevator=noop as the default kernel IO scheduler; this definitely helps, but not to the same extent as the writeback cache setting on the virtio disk. You can add elevator=noop to your kernel command line in your grub config, and I have this by default on all the instances.
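As a sketch of that grub change, assuming a Debian/Ubuntu guest with the standard grub2 layout (the file path and variable name may differ on other distributions or with legacy grub):

```shell
# /etc/default/grub -- append elevator=noop to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=noop"
```

After editing, regenerate the config (update-grub on Debian/Ubuntu) and reboot the guest for the scheduler change to take effect.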
Hopefully this will help you greatly improve IO performance for your KVM guests and save you the time I lost while trying to find a solution to this problem. Please feel free to share your experiences using the comment form below; I’m also curious whether you have any other tips on how to improve this even more.
Here is a handy command that will stop all the chef server related services:
for svc in server server-webui solr expander
do
sudo /etc/init.d/chef-${svc} stop
done
Simply run:
sudo gem update chef chef-server --no-ri --no-rdoc
and this should upgrade all the other gems it needs to. A sample output will look like this:
gem update chef chef-server --no-ri --no-rdoc
Updating installed gems
Updating chef
Successfully installed chef-0.10.2
Updating chef-expander
Successfully installed chef-expander-0.10.2
Updating chef-server
Successfully installed chef-server-api-0.10.2
Successfully installed chef-server-webui-0.10.2
Successfully installed chef-solr-0.10.2
Successfully installed chef-server-0.10.2
Gems updated: chef, chef-expander, chef-server-api, chef-server-webui, chef-solr, chef-server
Optional: if you want, you can clean up old, unused gems with:
sudo gem cleanup
Again in a single command, now start them back up:
for svc in server server-webui solr expander
do
sudo /etc/init.d/chef-${svc} start
done
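The stop/upgrade/start dance above can also be wrapped in a small helper. This is a sketch of my own (the chef_services name and the INITDIR parameter are not from the post; INITDIR defaults to the real /etc/init.d but can be pointed at a scratch directory for a dry run; run the real thing as root or with sudo):

```shell
# directory holding the chef init scripts; parameterized for dry runs
INITDIR=${INITDIR:-/etc/init.d}

# run the given init-script action on every chef server service
chef_services() {
  local action=$1 svc
  for svc in server server-webui solr expander; do
    "$INITDIR/chef-${svc}" "$action"
  done
}

# The upgrade then becomes:
#   chef_services stop
#   gem update chef chef-server --no-ri --no-rdoc
#   chef_services start
```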
That’s it, now you should be running the latest and greatest chef server version 0.10.2.
First of all, it is a free event (compared with a regular O’Reilly conference, where prices usually start at $1k).
It is much more interactive: while Velocity is a classic conference where you normally have a presenter showing off something (hopefully not selling or hiring), with maybe some questions at the end, DevOpsDays is more like an open discussion, with people either on a panel or in open spaces.
The food was way better at DevOpsDays, no question about it. And the ice cream on Saturday added an extra special touch ;).
The first day, Friday, started with the “Devops State of the Union” by John Willis. This was a very good introduction to what DevOps means and a look back at what has happened during the past couple of years, especially considering that many people were there for the first time. For example, I met someone from Microsoft who was sent there to find out “what is this devops thing” and how they can use it; this just shows what huge progress the devops movement has made in such a short amount of time, and how many people are now interested in it. (In this particular case I’m not sure he returned to Microsoft with something useful, but just the fact that they are interested demonstrates my point.)
Next, we had some very interesting panels (4-5 people in general) like: “To Package or not to Package”, “Orchestration at Scale”, “DevOps Metrics and Measurement”, “DevOps..Where’s the QA?” and finally “Escaping the DevOps Echo Chamber”. Even though I believe some moderators could have done a better job (not leaving people standing and waiting 10-15 minutes to ask a question), I believe this is a great format: very informal and interactive, promoting open discussion and people sharing their experiences. We also had some great ignite presentations, and by far the most interesting and unexpected one was David Lutz with his DevOps song.
The second day was in the format of an unconference, with several open spaces and some short presentations around lunch time. Many people left, as they probably wanted to spend the weekend home with their families, but many (about half) stayed for Saturday too. From the sessions I attended, I really enjoyed the one about Kanban; it was very useful to see how others have used it in operations teams and what problems they had implementing it. I also enjoyed Patrick’s presentation about vagrant (I’m already playing with veewee), and I was very proud of Nate and Rich releasing their product Reactor8 on this occasion.
Overall I think this was an awesome event, much improved compared with last year: two days instead of only one, and I liked the format a lot (day 1 panels & day 2 open spaces). Personally I will probably skip Velocity next year (unless I have a talk accepted) and stick with DevOpsDays only. If you are in the area there is no reason to miss DevOpsDays; I would highly recommend it, or any of the other DevOpsDays events close to your area.
ps: They recorded the whole event (very professionally, with multiple angles, etc.); the content will probably come online very soon, and once that happens I will link it here as well.