Tuesday, 23 December 2014

I am moving to Wordpress.com (http://jotdownux.wordpress.com)


Dear Reader,

I have moved to my new WordPress.com blog.
Please visit the new page.
I may not continue with this blog anymore.

Wednesday, 26 November 2014

Designing a load-balancer based on OS load

This is my first post after two years. I was in an accident recently and had some spare time to think while resting for a month.

"Do we really consider OS load in load balancers?" I asked myself one day while drowsy. There is a myriad of load balancers which balance network traffic at various layers of the TCP/IP stack. Some are TCP-based, some work at the IP layer, some can load-balance at the application level (HTTP, for example). But most of them revolve around network latency and similar metrics, and almost none checks OS load before scheduling traffic to a particular node.

"Why?" Probably because it is less cumbersome to keep track of the metrics that determine or influence the load balancer's scheduler. But what's the harm in trying to make a tool which really load-balances traffic based on OS load metrics like memory usage, CPU usage, io-wait time, etc.? "No harm at all," I replied to my own question (yes, I am talking to myself nowadays after the head and eye injury during the accident). So let's start building a tool which can efficiently provide a "score card" for every node in the cluster and schedule traffic based on who has the better score. As simple as that!

As a result, I started a project on GitHub, named "Themis".

Every node in the cluster will have an agent running at a frequency depending upon the overall load of the cluster. Starting from an initial score of 100, the agent deducts points based on available resources like CPU wait time, load average, available RAM, swappiness, etc. These deductions are tunable through predefined thresholds, e.g. 20% of the total CPU time is acceptable as a normal wait time for a particular node. The better the score, the lower the load on that particular node. There is a plan to add heuristics as well, to make the scoring process a bit smarter over time by adjusting the thresholds automatically.
After the agent assigns a score to a particular node, it sends that score (along with the trend data, but probably less frequently) over to the load balancer as JSON through QPID (v0.30) messaging.
The load balancer keeps a queue of all the nodes, reverse-sorted by score, i.e. the node with the highest score sits at the front of the queue and is scheduled for the next connection; in case this node does not respond, the next node in the queue is considered.
The load balancer will also keep track of a "participation score", i.e. how many times a particular node has been selected by the load balancer. The lowest scorers will be reported by the load balancer along with the resource-limit/load data pinpointing the reason why each node was not selected so frequently, thus giving the sysadmins a hint about the bottlenecks.
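To make the scoring idea concrete, here is a rough Python sketch of what the agent's scoring step could look like. The metric names, thresholds, and weights below are all made up for illustration; they are not from the actual Themis code:

```python
# Hypothetical sketch of a Themis-style scoring agent.
# A node starts at 100 and loses points for each metric over its threshold.

import json

# Tunable thresholds (assumed values, purely for illustration)
THRESHOLDS = {
    "cpu_wait_pct": 20.0,      # % of CPU time in iowait considered normal
    "load_avg_per_core": 1.0,  # 1-minute load average per core
    "mem_used_pct": 80.0,      # RAM usage considered normal
}

# How many points each metric can cost at most (assumed weights)
WEIGHTS = {
    "cpu_wait_pct": 30,
    "load_avg_per_core": 40,
    "mem_used_pct": 30,
}

def score_node(metrics):
    """Return a 0-100 score; a higher score means a less loaded node."""
    score = 100.0
    for name, threshold in THRESHOLDS.items():
        value = metrics[name]
        if value > threshold:
            # Deduct proportionally to how far past the threshold we are,
            # capped at the metric's full weight.
            overshoot = min((value - threshold) / threshold, 1.0)
            score -= WEIGHTS[name] * overshoot
    return max(score, 0.0)

def score_message(node, metrics):
    """The JSON payload the agent would send to the balancer over QPID."""
    return json.dumps({"node": node, "score": score_node(metrics)})

# A lightly loaded node keeps its full score of 100
print(score_message("node1", {"cpu_wait_pct": 5.0,
                              "load_avg_per_core": 0.4,
                              "mem_used_pct": 50.0}))
```

The balancer side then only needs to keep the nodes sorted by the latest score it received from each agent.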

I have not yet started working on the load-balancer side, nor have I thought of a way to actually switch traffic to the nodes, well not yet. But then there are so many projects going on in that area and I can probably get inspired by one of them or just simply use my tool as an add-on to "influence" their scheduler policies.

So folks, if you are a Python devops/sysops guy and have some ideas to share with me or participate in the project, please drop a comment here, and we can catch up at GitHub.

Let's code it!!!

Tuesday, 25 December 2012

What's that /boot/initramfs file?

Have you ever wondered how the kernel mounts the root "/" filesystem? When I was learning Linux and Unix a few years ago, this question used to haunt me. Why? In order to mount a filesystem, the kernel needs to have that filesystem's module loaded into memory. You can achieve this in two ways: either you compile the filesystem module into the kernel, i.e. statically link it, or you load the module dynamically when needed. And that's where the problem arises. How is it possible for the Linux kernel to load kernel modules even before mounting the root filesystem? (If you look at dmesg, the "mounting root filesystem" message appears towards the end.) FYI, kernel modules are usually stored in the /lib/modules/`uname -r`/ directory.

To solve this problem, Linux uses the initram filesystem. The Grub boot loader takes an "initrd" argument along with the location of the initramfs file. Grub has enough knowledge of filesystems to find the kernel, initramfs, and config files on the /boot filesystem. But it does not yet support LVM or software RAID. That's why you don't get to see a /boot partition on LVM or software RAID.

Anyway, so grub takes the initramfs file, uncompresses it, and lays it out in memory. The initramfs contains enough modules and binaries to load the root filesystem's modules. When control passes to the kernel, it sees the initramfs as its root filesystem. It then executes the /init (yes, it's not a typo) script on the initramfs, which loads the modules required for mounting the real root filesystem, i.e. modules for LVM/software RAID and the filesystem itself. Once the root filesystem is mounted, the script chroots into it and control is passed on to the /sbin/init program. From then onwards, you all know what happens.

Are you not feeling excited to see the contents of the initramfs filesystem? Yes, you can see them. Just execute the following commands:

mkdir /tmp/init
cp /boot/initramfs-`uname -r`.img /tmp/init/initramfs.gz
gunzip /tmp/init/initramfs.gz
cd /tmp/init
cpio -idmv </tmp/init/initramfs

And explore the miniature root filesystem!! :)

Wednesday, 5 December 2012

How to create external Journal device in Linux with EXT4

Those of you who have worked on AIX must be smiling and thinking, "Is this really something to write about on a blog?". Well, it's true that an external journal device is an old story on AIX. On the other hand, I have seen little to no use of external journal devices on production Linux systems.

For people who do not know what a journaling filesystem is, this should be comprehensive enough: the whole idea of a journaling filesystem is to keep a write-ahead record of all filesystem changes. This way, creating/deleting/modifying a file becomes a transaction. Remember atomicity of transactions in DBMS, the All or None law? The same idea works here. So in case something goes wrong, the filesystem can always roll back to the last known good state. However, all awesome things come at a cost. Here the cost you pay is filesystem write performance: anytime you want to write data to a file, the filesystem will write to the journal first and then write the actual data.

Here's why you might want to have an external journal device for your ext4 file system:

1. You avoid corruption in the journal data itself by keeping it somewhere other than with the original data
2. Since a journaling filesystem writes data twice, keeping a separate journal device, i.e. a partition or logical volume on a separate physical disk altogether, can introduce a significant performance boost

There's a downside though: You cannot have multiple EXT4 filesystems sharing the same journal device as of now (AIX folks have probably started laughing now).

So how do you actually create a external journal device? Here's how:

mke2fs -O journal_dev /dev/block_device_name

While creating a new filesystem, you can easily point to the newly created journal device like this:

mke2fs -t ext4 -J device=/dev/journal_device_name /dev/block_device_for_new_fs

To change journaling from internal to external for an existing filesystem, first unmount the filesystem. Then, execute the following:

tune2fs -O ^has_journal /dev/blk_dev_for_existing_fs
tune2fs -j -J device=/dev/journal_device_name /dev/blk_dev_for_existing_fs

There are a few things to remember though:

1. The size of the journal device should be at least 1024 * the block size of the filesystem. So a filesystem with a 4 KB block size needs a journal device of at least 4 MB.
2. The block size of the journal device should be set to the same as that of the actual filesystem.
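Putting the sizing rule and the two commands together, a worked example could look like the following. The device names (/dev/sdb1 for the journal, /dev/sda5 for the data filesystem) are hypothetical:

```shell
# Rule of thumb from above: journal device >= 1024 * filesystem block size.
BLOCK_SIZE=4096
MIN_JOURNAL_BYTES=$((1024 * BLOCK_SIZE))
echo "minimum journal size: $MIN_JOURNAL_BYTES bytes"   # 4194304 bytes = 4 MiB

# Hypothetical devices: /dev/sdb1 sits on a separate physical disk.
# Both devices get the same 4 KiB block size:
#   mke2fs -b $BLOCK_SIZE -O journal_dev /dev/sdb1
#   mke2fs -t ext4 -b $BLOCK_SIZE -J device=/dev/sdb1 /dev/sda5
```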

You can gain around a 40% performance boost with an external journal device (provided the journal device resides on a separate physical disk). Amazing, huh?

Sunday, 30 September 2012

Turbo charge your web servers using Squid HTTP Intercept mode

Now, this is of course nothing new. But I thought of sharing the basic configuration of the Squid proxy caching server for caching web requests. There is of course a lot more you can do with Squid; one example is load balancing web servers. One thing to remember though: if your web content is dynamic, i.e. contains stuff that changes over a short time, you may not gain that much of a performance boost. Also, this does not work with secure HTTP, i.e. HTTPS, as it was designed to prevent man-in-the-middle attacks.

So here it is:

eth0 => client-facing interface
eth1 => web-server-facing interface

1. Install Squid

yum install squid

2. Before we start the service, we have to make a few changes in the config file to turn on the intercept mode. Open the /etc/squid/squid.conf file and set:

http_port 3128 intercept

This makes Squid listen on tcp port 3128 and enables the intercept mode, which is specifically designed for transparently intercepting web traffic.

3. Next, we have to set the cache directory and its size:

cache_dir aufs /var/spool/squid 90 16 256

aufs is the Advanced UNIX filesystem mode and is better than the ufs mode in terms of file operations. Alternatively, you may specify the diskd mode, which is almost similar but runs as a separate daemon and requires a little extra fine-tuning. 90 is the size of the cache in MB, 16 is the number of directories in the cache dir, and 256 is the number of directories under each of those. You may double these numbers depending on the load of the web server.

To get maximum performance, I have mounted /var/spool/squid as tmpfs to keep all of its contents in memory rather than on the hard disk (keep in mind that the cache is then lost on reboot).

My /etc/fstab has an entry like this

tmpfs /var/spool/squid tmpfs size=100m,rw,rootcontext="system_u:object_r:squid_cache_t:s0" 0 0                    

4. Set the cache memory size to be used

cache_mem 50 MB

Squid is intelligent enough to decide which content should go to the cache memory and which should be kept on the cache disk.
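For quick reference, here are the directives from steps 2-4 combined into one minimal /etc/squid/squid.conf fragment (the values are the ones used above; tune them to your hardware):

```
http_port 3128 intercept
cache_dir aufs /var/spool/squid 90 16 256
cache_mem 50 MB
```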

5. Next, you have to set your router to route traffic for the web server through the squid server so that web queries can take advantage of Squid. I used my squid server as the router as well (this is called getting the most out of it ;)).

So here are the iptables settings I had to do

A. Enable routing:
sysctl -w net.ipv4.ip_forward=1 >>/etc/sysctl.conf

B. To avoid looping, we have to tell iptables to accept any port 80 traffic which came from our squid server's own IP (replace <squid_server_ip> with yours):

iptables -t nat -A PREROUTING -s <squid_server_ip> -p tcp -m tcp --dport 80 -j ACCEPT

Then, redirect any other traffic for port 80 to tcp port 3128 on the squid server:

iptables -t nat -A PREROUTING -i eth0 -p tcp -m tcp --dport 80 -j DNAT --to-destination <squid_server_ip>:3128

While sending out web traffic to the web server, pose as if we are the client

iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE

Enable forwarding

iptables -A FORWARD -i eth0 -j ACCEPT

service iptables save

6. Start the squid service now

service squid start

7. Now, we have to point our clients and the web server to the squid server as their default gateway and test the network connection using the traceroute and ping commands.

Everything should be fine now and all web queries should go through squid.

Saturday, 22 September 2012

OpenLDAP + NFS + Automount = Complete Identity Solution

Well, we have lots of identity solutions these days. They are ready to use out of the box with very few configuration changes. Having said that, be it MS Windows Active Directory, Red Hat Directory Server, or IBM Tivoli Identity Manager, all are based on the rock-solid LDAP protocol. And I still see people using plain OpenLDAP in open source projects as well as in critical commercial environments.

I thought of setting up my own OpenLDAP server in my home lab, just for fun as well as to gain more in-depth knowledge about it. As always, I felt like sharing the knowledge I gained and the issues that I came across.

I am using RHEL 6.2 on both the server and clients.

Setting up the server:

1. Install the required packages

yum install openldap*

2. cd /etc/openldap/slapd.d
   find ./ -type f | xargs grep "dc=my-domain,dc=com"

This will usually point to the ./cn=config/olcDatabase={2}bdb.ldif file.
Open that file in vi and change the domain name to yours.


3. Change the domain admin's user name from Manager to root to look like this

olcRootDN: cn=root,dc=vmnet,dc=com

4. Press CTRL+Z while in vi to suspend the editor, and run slappasswd to set a new password for the domain admin (root in this case).

5. Copy the password hash and type 'fg' to resume the vi session. Make a new line after the olcRootDN directive and put in the password like this:

olcRootPW: {SSHA}wIEjnTE+CU6U1KsU5pGdcmEyqZ/jTsbt

6. At this point, you may check whether the configs are fine by running the following command:

slaptest -u

-u ignores warnings about missing database files; that's no issue right now, as we are yet to create them.

7. Now, we need to install the migrationtools package to migrate the existing users/groups databases to LDAP:

yum install migrationtools -y

8. cd to /usr/share/migrationtools/ and edit the following lines in the migrate_common.ph file to reflect the correct domain name:

# Default DNS domain
$DEFAULT_MAIL_DOMAIN = "vmnet.com";

# Default base
$DEFAULT_BASE = "dc=vmnet,dc=com";

9. Run the migrate_all_offline.sh script to build LDAP DBs out of local users, groups etc.

10. Now, change the owner of the newly created files in /var/lib/ldap directory

chown -R ldap:ldap /var/lib/ldap/*

11. Start the slapd service

service slapd start
chkconfig --level 35 slapd on

12. Open up LDAP port 389 both TCP and UDP on iptables

iptables -I INPUT -m state --state NEW -m tcp -p tcp --dport 389 -j ACCEPT
iptables -I INPUT -m state --state NEW -m udp -p udp --dport 389 -j ACCEPT

13. At this point, you should be able to see the objects in the LDAP domain using slapcat command

Setting up the client

1. Install the following packages

yum install pam_ldap nss-pam-ldapd -y

2. Run authconfig-tui, select LDAP for both User Information and Authentication, and select Next. You then have to provide the FQDN of your LDAP server and your domain name in the Base DN field.

Adding/removing/modifying LDAP objects

If you are not familiar with the ldif file format, use slapcat or migrate_passwd.pl script in /usr/share/migrationtools directory to get one example.
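If you just want a quick template, a minimal user entry looks something like the following (the uid, numeric IDs, and paths here are all made up for illustration; generate real entries with the migration scripts):

```
dn: uid=testuser,ou=People,dc=vmnet,dc=com
objectClass: top
objectClass: account
objectClass: posixAccount
objectClass: shadowAccount
cn: testuser
uid: testuser
uidNumber: 10001
gidNumber: 10001
homeDirectory: /home/users/testuser
loginShell: /bin/bash
```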

Then you may execute one of the following:

ldapadd -a -x -W -D "cn=root,dc=vmnet,dc=com" </tmp/testuser.ldif

Or else, you may install phpLDAPadmin to administer the LDAP server through the web:

yum install httpd php php-ldap

Getting a user's home directory automatically mounted on the client

It's better to create a separate home directory tree for the LDAP users; /home/users is what I chose.

Share this directory through the NFS server.
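The NFS share can be a one-line /etc/exports entry like the one below (the export options are just a sensible guess, adjust to your needs), followed by exportfs -ra to activate it:

```
/home/users    *.vmnet.com(rw,sync,no_subtree_check)
```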


Now, on the client side, configure autofs:

1. In the /etc/auto.master file, you may add the following

/home/users   /etc/auto.home

2. Create the /etc/auto.home file and add the following

*       -fstype=nfs     red.vmnet.com:/home/users/&

3. Create /home/users directory

With this approach, there will not be any clash between a local user and an LDAP user logged in on the same client machine, as they will have separate home directories. Otherwise, a local user would lose access to their home directory once an LDAP user's home directory got automounted on /home.

Now, you are highly likely to run into permission issues on the users' home directories if you have not already configured how IDs should be mapped. /etc/idmapd.conf on the client machine is the file you need to concentrate on.

This file must be edited for the below directives/options

Domain = vmnet.com
LDAP_server = red.vmnet.com
LDAP_base = dc=vmnet,dc=com

Now, restart the rpcidmapd service:

service rpcidmapd restart

You may ask the users to set up ssh keys, and they will be able to log in to any of the LDAP clients.

That's about it!!

Tuesday, 4 September 2012

Centralized logging system: rsyslog, logstash, Elasticsearch & kibana

So I have been a little busy with some voluntary work I do in my free time. I am associated with the Wikimedia Foundation as volunteer IT staff. People are very open and helpful there.

A few months back, I was browsing through their ongoing projects, hoping someone would need help with something I could contribute to. As I was new to the community, things were only half clear to me. I needed to work on a project which would make me understand their infrastructure while being simple enough for me. Of course, I did not want to dive into the most complicated project and then sit idle, looking at other people's scribbles on IRC.

One day I stumbled upon an interesting project. Its objective was to build a centralized logging system with good search capability. Although they use Nagios for alerting, if someone needed to search through logs, they had to log into that particular server and do a little grep or egrep against the logs. I thought this would be a perfect project for me: I would get to know the surroundings, plus it's relatively simple to set up something like this.

I was added to the project, and I was the only member. Sweet!!

So I first started experimenting with some open source products. A few did just fine, a few did not scale at all. At last I found a perfect combination: Logstash, Elasticsearch, and Kibana.

Logstash is very useful and versatile. It's written in JRuby (Ruby on the JVM). You can specify inputs and outputs as well as filters, and it supports various input types. One of them is Linux syslog, which means you do not have to install a logging agent on every server, increasing the overall load of the server; your default rsyslog client will do just fine. Then comes the filtering part: after taking input, you can filter logs within Logstash itself. It's awesome, but it didn't serve any purpose for me as I wanted to index every log. Next is the output part. Logstash can output logs to standard output (though why would anyone want that), but as with input, it supports multiple output types too. One of them is Elasticsearch.
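The kind of Logstash configuration this describes would be roughly as below (1.x-era syntax; the port and host are assumptions for illustration, not the values from the actual setup):

```
input {
  syslog {
    type => "syslog"
    port => 5514
  }
}

# no filters: every log gets indexed as-is

output {
  elasticsearch {
    host => "127.0.0.1"
  }
}
```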

Elasticsearch is a Java-based log indexer. You can search through Elasticsearch indices using the Lucene search syntax for more complicated queries, but simple wildcard searches work too.

Next comes Kibana. It provides the web frontend for Elasticsearch; it's written in JavaScript and PHP, and requires only one line to be edited for it to work out of the box.

As of now, I have configured all of them on one relatively large lab VM. There were several hitches in the beginning, but apart from that it all went pretty smoothly.


I had to write a little init script for these services (logstash and elasticsearch). It's written for Ubuntu 10.04.3 LTS, but should work on CentOS/RedHat as well with a little modification.

Here's the script:

#! /bin/sh

export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

. /lib/lsb/init-functions

name="logstash"

# Paths below are assumptions; adjust them to your layout
logstash_conf="/logstash/logstash.conf"
logstash_log="/var/log/logstash.log"
ls_pid_file="/var/run/logstash.pid"
es_pid_file="/var/run/elasticsearch.pid"
es_bin="/logstash/elasticsearch/bin/elasticsearch"

NICE_LEVEL=19

# This is needed by ElasticSearch
export ES_HOME="/logstash/elasticsearch"

# sets max. open files limit to 65000
# otherwise, elasticsearch throws java.io.IOException: Too many open files

ulimit -n 65000

start () {
        log_daemon_msg "Starting" "elasticsearch"
        if start-stop-daemon --start --chdir "$ES_HOME" --quiet --oknodo -b \
                --nicelevel $NICE_LEVEL --exec "$es_bin"; then
                # I had to do this as the -p option of elasticsearch gives a wrong PID
                # The same with the --pidfile option of start-stop-daemon
                sleep 1 # takes a little bit of time before getting caught by below
                # don't know why I chose to grep for "sigar"; maybe it looks like cigar
                ps -elf | grep "[e]lasticsearch" | grep sigar | awk '{ print $4 }' >"$es_pid_file"
                log_end_msg 0
        else
                log_end_msg 1
        fi

        log_daemon_msg "Starting" "logstash"
        if start-stop-daemon --start --chdir /logstash --quiet --oknodo \
                --pidfile "$ls_pid_file" -b -m --nicelevel $NICE_LEVEL \
                --exec /usr/bin/java -- -jar /logstash/logstash.jar agent \
                -f "$logstash_conf" --log "$logstash_log"; then
                log_end_msg 0
        else
                log_end_msg 1
        fi
}

stop () {
        start-stop-daemon --stop --quiet --oknodo --pidfile "$ls_pid_file"
        start-stop-daemon --stop --quiet --oknodo --pidfile "$es_pid_file"
}

status () {
        status_of_proc -p "$ls_pid_file" "" "$name"
        status_of_proc -p "$es_pid_file" "" "elasticsearch"
}

case "$1" in
        start)
                if status; then exit 0; fi
                start
                ;;
        stop)
                stop
                ;;
        restart|reload)
                stop
                start
                ;;
        status)
                status && exit 0 || exit $?
                ;;
        *)
                echo "Usage: $0 {start|stop|restart|reload|status}"
                exit 1
                ;;
esac

exit 0

The system is in the testing phase. We need to check how it scales to 2000+ servers; maybe we'll have to think about load balancing too. But as of now, it really does a great job in terms of memory consumption, disk space, processing power, etc.

Once the whole system is ready to go live on production servers, I will definitely publish more detailed technical stuff. Fingers crossed!!