Using instance-specific metadata in Eucalyptus

One of the great features of Amazon EC2 is the ability to dynamically query and use instance-specific metadata, or even custom data. This can be useful for various reasons; the greatest advantage I’ve personally seen is that it lets the instance receive information on how to configure itself when first booting (using chef, puppet, or some other configuration management tool).

The Amazon documentation explains how to get this information: basically, just by using simple HTTP GET requests against the IP 169.254.169.254. For example, for the metadata index:
curl http://169.254.169.254/latest/meta-data/
or for the custom data:
curl http://169.254.169.254/latest/user-data

Eucalyptus supports this great feature (starting with v1.4), but we obviously need to target a different IP to retrieve this information (as the Amazon IP has nothing to do with our internal cloud ;) ). We need to send the request to the cloud controller IP and the port it is bound to (8773 by default, if you have not changed it), and we need to run it from inside the actual instance.
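
As a rough illustration of such a request, here is a minimal sketch. It assumes the cloud controller exposes the same /latest/meta-data/ and /latest/user-data paths as EC2; the address 192.168.1.10 and the helper names are placeholders for illustration only, not anything defined by Eucalyptus.

```shell
#!/bin/sh
# Sketch: query instance metadata from a Eucalyptus cloud controller (CLC).
# CLC_IP, CLC_PORT and the helper names are placeholders, not Eucalyptus names.
CLC_IP="${CLC_IP:-192.168.1.10}"   # hypothetical cloud controller address
CLC_PORT="${CLC_PORT:-8773}"       # default port, unless you changed it

# Build the URL for a path such as "meta-data/" or "user-data".
# Assumption: the paths mirror the EC2 ones under /latest/.
euca_metadata_url() {
    printf 'http://%s:%s/latest/%s\n' "$CLC_IP" "$CLC_PORT" "$1"
}

# Fetch a path; -s silences progress output, -f makes curl fail on HTTP errors.
euca_metadata_get() {
    curl -sf "$(euca_metadata_url "$1")"
}

# Usage (from inside the instance):
#   euca_metadata_get meta-data/
#   euca_metadata_get user-data
```

Keeping the URL construction separate from the fetch makes it easy to check which address is being queried before anything is sent over the network.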


Running s3sync in parallel

s3sync is a great tool to synchronize local data with Amazon S3 for backups, or for whatever other reason you might want to put your data on S3. It is very simple to install (gem install s3sync) and use (s3sync -v -s -r --progress <source_dir> s3_bucket:<dir>); it runs very well and can easily be scripted to do regular backups or even to synchronize live data with S3. The only problem I found while using s3sync was that it can be very slow when uploading a lot of data (millions of files) to S3: each transfer is slow in itself, and it also processes a single file at a time instead of doing several uploads in parallel. I would have loved for s3sync to do this out of the box, but unfortunately it doesn’t. For my particular need I was able to work around this by running several s3sync commands at the same time. This will not apply directly to your data (unless it is structured the same way as mine, which is very unlikely), but it might give you an idea of how to do this with your own data if it is structured in a suitable way.
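
To give a feel for the idea, here is a minimal sketch that starts one s3sync per top-level subdirectory and keeps a few of them running at once with xargs -P. It assumes the data splits cleanly by subdirectory and that find and xargs support -print0, -0 and -P (GNU and BSD versions do). SRC, BUCKET, JOBS, S3SYNC_CMD and the function name are illustrative names, not s3sync settings, and the bucket prefix layout is an assumption.

```shell
#!/bin/sh
# Sketch: run one s3sync per top-level subdirectory, several at a time.
# SRC, BUCKET, JOBS and S3SYNC_CMD are illustrative names, not s3sync options.
SRC="${SRC:-/data}"                  # hypothetical local directory to upload
BUCKET="${BUCKET:-my_bucket}"        # hypothetical S3 bucket name
JOBS="${JOBS:-4}"                    # how many s3sync processes to run at once
S3SYNC_CMD="${S3SYNC_CMD:-s3sync}"   # set to "echo s3sync" for a dry run

parallel_s3sync() {
    # List the immediate subdirectories NUL-separated, then let xargs keep up
    # to $JOBS processes running, one subdirectory per s3sync invocation.
    # $S3SYNC_CMD is left unquoted on purpose so "echo s3sync" splits in two.
    find "$SRC" -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -P "$JOBS" -I{} sh -c '
            dir=$1; bucket=$2; shift 2
            "$@" -s -r "$dir/" "$bucket:${dir##*/}"
        ' _ {} "$BUCKET" $S3SYNC_CMD
}

# Usage:
#   S3SYNC_CMD="echo s3sync" parallel_s3sync   # print the commands only
#   parallel_s3sync                            # run the uploads
```

Files sitting directly in the top-level directory are not covered by this split and would need a separate, non-recursive pass.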


Apache2 umask

Many times you might want to fine-tune the default permissions of the files created on a Linux system. This is very simple: if you are using bash, all you have to do is define a new value for umask somewhere in the bash startup files (/etc/profile is a good place for this), like this:
umask 002
(this will allow by default group write permissions on the newly created files)

On modern Linux distributions this is normally set to 022 by default, and you can easily find out what it is on your system by running the umask command:
umask

Contrary to what you might think, this is not enough to make it work for all applications and daemons on the system. It works fine for any files created from a shell session, but files created by other processes, like the web server for example, will still use the default unless configured otherwise. To have apache use a different umask we can define it inside /etc/apache2/envvars (Debian and Ubuntu systems) or /etc/sysconfig/httpd (RHEL and CentOS systems) like this:
umask 002
and restart apache to enable it.

Other daemons have different locations where you can define this to override the default umask setting (check their documentation if you are unsure).
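
To see what the two umask values discussed above actually do to a newly created file, a small sketch like the following can help. It assumes GNU coreutils (stat -c and mktemp -d); mode_with_umask is just an illustrative helper name.

```shell
#!/bin/sh
# Sketch: print the octal mode a new file gets under a given umask.
# mode_with_umask is an illustrative helper; stat -c is the GNU coreutils form.
mode_with_umask() {
    (
        # The subshell keeps the umask change from leaking into the caller.
        umask "$1"
        d=$(mktemp -d)
        : > "$d/file"            # a plain redirect requests mode 0666
        stat -c '%a' "$d/file"   # print the mode left after the umask is applied
        rm -rf "$d"
    )
}

# Usage:
#   mode_with_umask 022   # prints 644 (group cannot write)
#   mode_with_umask 002   # prints 664 (group can write)
```

Running the same helper as the web server user (for example through sudo -u) is a quick way to check which umask a given account really ends up with.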


Linux Tips: get the list of subdirectories with their owner & permissions and full paths

I needed to get a list of all the subdirectories under /var that were owned by some user other than root, along with their permissions/owner and full paths. My first thought was to use ls and something like this:
ls -dlR */
drwxr-xr-x  2 root root  4096 2009-06-05 06:25 backups/
drwxr-xr-x  8 root root  4096 2009-05-11 06:02 cache/
drwxr-xr-x  2 root root  4096 2009-05-06 04:49 ec2/
drwxr-xr-x 25 root root  4096 2009-05-25 14:55 lib/
...

This shows the subdirectories just as I needed, but only at one level. Using */*/ would show the next level, etc. This obviously is not a solution, and unfortunately I found no other way to do this with ls. Using:
ls -alR | grep ^d
drwxr-xr-x 15 root root  4096 2009-05-11 06:02 .
drwxr-xr-x 22 root root  4096 2009-06-03 15:02 ..
drwxr-xr-x  2 root root  4096 2009-06-05 06:25 backups
drwxr-xr-x  8 root root  4096 2009-05-11 06:02 cache
drwxr-xr-x  2 root root  4096 2009-05-06 04:49 ec2
drwxr-xr-x 25 root root  4096 2009-05-25 14:55 lib
....

works somewhat, but since it doesn’t show the full paths it is useless.
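
One way to get full paths is to switch from ls to find. The following is only a sketch of that approach, assuming GNU find (needed for -printf); the helper name and the output format are illustrative choices.

```shell
#!/bin/sh
# Sketch: list directories under DIR that are NOT owned by USER (default root),
# printing permissions, owner:group and the full path, one directory per line.
# list_foreign_dirs is an illustrative name; -printf requires GNU find.
list_foreign_dirs() {
    # %M = symbolic permissions (as in ls -l), %u = owner, %g = group, %p = path
    find "$1" -type d ! -user "${2:-root}" -printf '%M %u:%g %p\n'
}

# Usage:
#   list_foreign_dirs /var            # directories under /var not owned by root
#   list_foreign_dirs /var www-data   # ... not owned by www-data
```

Because find prints each path relative to the starting point it was given, passing an absolute directory such as /var yields absolute paths in the output.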


HowTo update DNS hostnames automatically for your Amazon EC2 instances

A while ago, one of the major problems people faced when using Amazon EC2 in production environments was the dynamic nature of the instances’ IPs. Every time an instance was started it got a new, dynamic IP. This has been addressed with the introduction of Amazon Elastic IP Addresses, but even when using them the private IPs are still dynamic, and most of the time people will want several instances to communicate over the privately allocated IPs and not the public ones. This article will show how you can easily automate the process of updating DNS hostnames for your EC2 instances by adding the logic for this to the AMIs. I will use a master DNS server running bind9, but this can be adapted to any other DNS server.
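
As a rough idea of what such boot-time logic could look like, here is a minimal sketch built on nsupdate, the dynamic update client shipped with bind9. It assumes the zone accepts dynamic updates signed with a key available on the instance; the server name, zone, TTL, key path and function names are all placeholders.

```shell
#!/bin/sh
# Sketch: register this instance's private IP under a hostname via nsupdate.
# DNS_SERVER, ZONE, TTL, the key path and the function names are placeholders.
DNS_SERVER="${DNS_SERVER:-ns1.example.internal}"   # hypothetical master DNS
ZONE="${ZONE:-example.internal}"                   # hypothetical zone
TTL="${TTL:-60}"                                   # short TTL, as IPs change

# Private IP of the instance, taken from the EC2 metadata service.
private_ip() {
    curl -sf http://169.254.169.254/latest/meta-data/local-ipv4
}

# Print the nsupdate commands that replace the A record for HOST with IP.
dns_update_script() {
    host=$1; ip=$2
    printf 'server %s\n' "$DNS_SERVER"
    printf 'zone %s\n' "$ZONE"
    printf 'update delete %s.%s. A\n' "$host" "$ZONE"
    printf 'update add %s.%s. %s A %s\n' "$host" "$ZONE" "$TTL" "$ip"
    printf 'send\n'
}

# Usage (e.g. from a boot script baked into the AMI; key path is hypothetical):
#   dns_update_script web1 "$(private_ip)" | nsupdate -k /etc/bind/update.key
```

Deleting the old A record before adding the new one keeps the hostname pointing at a single address after each restart.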


iptables geoip match on debian lenny

The geoip iptables extension allows you to filter, NAT or mangle packets based on their source or destination country. It does exactly what the geoip apache module or the regular geoip binary does, but at the iptables level. I will not go into the details of why you would want to use it, but there are many ‘positive’ ways it can be useful… For example, I use it in a project where we want to serve customized content for different countries. Since this is a high-traffic site running on many web servers behind a load-balanced setup, we prefer to do the split at the load balancer level and not at the apache level, to simplify our setup. We serve customized content to US-based visitors, while visitors from other countries get a separate international site.

This has been working fine for a long time, using the original geoip module and the patch-o-matic-ng method of installation (similar to what is very well described here). Still, that module is unmaintained, and starting with kernel 2.6.22 it no longer works. There is a patch that makes it work with newer kernels, but if you run iptables 1.4.x it fails again, and even though there are some manual workarounds this is still not the best solution.

The solution is called Xtables-addons. Xtables-addons is the successor to patch-o-matic-ng. Likewise, it contains extensions that were not, or are not yet, accepted in the main kernel/iptables packages. Xtables-addons differs from patch-o-matic-ng in that you do not have to patch or recompile the kernel, and sometimes recompiling iptables is not needed either.
The latest version 1.12 supports: iptables >= 1.4.1 and kernel-source >= 2.6.17.
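
Once the geoip match is available, a US/international split like the one described above could be expressed with rules along these lines. This is only a sketch: it assumes the Xtables-addons geoip match and its country database are already installed, and the backend addresses, the IPTABLES variable and the function name are hypothetical.

```shell
#!/bin/sh
# Sketch: send US visitors to one backend and everyone else to another,
# using the geoip match from Xtables-addons. Backend addresses are made up.
IPTABLES="${IPTABLES:-echo iptables}"   # dry run by default; set to "iptables"

geoip_split_rules() {
    us_backend=$1; intl_backend=$2
    # Traffic whose source country is US goes to the US backend...
    $IPTABLES -t nat -A PREROUTING -p tcp --dport 80 \
        -m geoip --src-cc US -j DNAT --to-destination "$us_backend"
    # ...and everything that is NOT from the US goes to the international one.
    $IPTABLES -t nat -A PREROUTING -p tcp --dport 80 \
        -m geoip ! --src-cc US -j DNAT --to-destination "$intl_backend"
}

# Usage:
#   geoip_split_rules 10.0.0.10:80 10.0.0.20:80              # print the rules
#   IPTABLES=iptables geoip_split_rules 10.0.0.10:80 10.0.0.20:80   # apply them
```

Printing the rules first makes it easy to review them before touching a live load balancer.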


Lenny domU Xencons

Even though at some point it looked like Debian lenny would not have full Xen support (for the 2.6.26 amd64 kernel), in the end this was fixed and lenny fully supports Xen, even on amd64. Upgrading from 2.6.18 to 2.6.26 is very straightforward (though we were already using xen-hypervisor 3.2-1), and the only problem we noticed was that the console on the domU machines was no longer working: it showed the output correctly, but you could not type anything on the console.

This is caused by the ‘new Xen console’ (Xen now uses hvc0 for its console). To fix it, add one line to your virtual machine’s Xen configuration file and restart the VM. In /etc/xen/<myvm>.cfg add this line:
extra = "console=hvc0 xencons=tty"


HowTo get a small sample dataset from a mysql database using mysqldump

Here is a quick tip that shows how you can get a small sample dataset from a mysql database using mysqldump. We frequently need to take a small snapshot of a very big production database and import it into a development or staging database that does not need all the original data. Let’s say we need at most 1,000,000 records from each table in the database; we just use the option --where="true LIMIT X", with X being the number of records after which we want mysqldump to stop for each table.

We simply run something like this (add whatever other options you need to mysqldump):

mysqldump --opt --where="true LIMIT 1000000" mydb > mydb1M.sql

Mdadm Cheat Sheet

Mdadm is the modern tool most Linux distributions use these days to manage software RAID arrays; in the past, raidtools was the tool used for this. This cheat sheet shows the most common usages of mdadm to manage software RAID arrays; it assumes you have a good understanding of software RAID and Linux in general, and only explains the command-line usage of mdadm. The examples below use RAID1, but they can be adapted for any RAID level the Linux kernel driver supports.

1. Create a new RAID array

Create (mdadm --create) is used to create a new array:
mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb2

HowTo force remote devices (routers/switches) to refresh their arp cache entry for a machine

The Address Resolution Protocol (ARP) is the method for finding a host’s link layer (hardware) address when only its Internet Layer (IP) or some other Network Layer address is known. ARP is a Link Layer protocol (Layer 2) because it only operates on the local area network or point-to-point link that a host is connected to. When we migrate an IP from one machine to another, we might have problems caused by ‘ARP caching’. Various devices cache the ARP information for a certain amount of time, so even after we have moved the IP, some devices will not see the change and will keep using the cached information. I am talking about directly connected switches or routers, which we may or may not control. If we control all the external devices, we normally just connect to the router or switch and remove the ARP entry, forcing the device to query for the information again. This post will try to help in the situation where we don’t have direct control over the external devices (we are colocated or use rented servers in a remote datacenter, etc.), to minimize the downtime associated with this type of IP migration.

It is quite common to use separate IPs for various services on the same machine, and to move those IPs to another server if needed. These are sometimes called portable IPs, and they can be migrated to any server in a particular colo/LAN. This is normally done to minimize downtime and keep the maintenance of such operations minimal (and to not rely on DNS changes). Still, ARP caching on various network devices can cause big problems. Let’s assume we moved an IP from one server to another in the same LAN, to move some service away from our main web server. Taking down the IP on the existing server and bringing it up on the new server is all we can do directly if we don’t have access to the switches/routers in front of us. Again, if you have control over all the devices, just connect to them and delete the ARP cache entry for this IP so that it is re-learned from the new machine.
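
One common technique for nudging devices we don’t control is to send gratuitous (unsolicited) ARP from the new server right after bringing the IP up. A minimal sketch, assuming the iputils version of arping (whose -U and -A options send unsolicited ARP requests and replies); the interface, the address, the ARPING variable and the function name are placeholders.

```shell
#!/bin/sh
# Sketch: announce a freshly migrated IP with gratuitous ARP (iputils arping).
# The interface name, IP address and function name are placeholders.
ARPING="${ARPING:-echo arping}"   # dry run by default; set ARPING=arping to send

announce_ip() {
    iface=$1; ip=$2; count=${3:-3}
    # -U: unsolicited ARP requests, so neighbours update their ARP caches
    $ARPING -U -c "$count" -I "$iface" "$ip"
    # -A: the same announcement sent as ARP replies, for devices that only
    #     honour replies
    $ARPING -A -c "$count" -I "$iface" "$ip"
}

# Usage (on the NEW server, after the IP is configured on eth0):
#   ARPING=arping announce_ip eth0 192.0.2.10
```

Whether an upstream router accepts gratuitous ARP depends on its configuration, so this reduces the downtime in many setups but is not guaranteed to work everywhere.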



Marius on Twitter