Fixit Felix


first in a box

&

now in a slide


Image from http://www.zazzle.com/

The Problems

Boot

(/boot was full)


df shows /boot is full
we can't rm the file
but we can truncate it
and once we do
the space is freed
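
Roughly the shell steps, sketched out (the file name is just a stand-in):

    # confirm that /boot is full
    df -h /boot
    # empty the offender in place: the inode stays, the data goes
    truncate -s 0 /boot/huge.log      # hypothetical file name
    # verify the space came back
    df -h /boot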

But


is that sufficient?
we need to figure out who created it
will it come back?
is that legitimate behaviour?

lsof
tells us some process is still holding on to the file
spawned from init
so we have a handle to stop it
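
A sketch of that check:

    # which processes still have files under /boot open?
    lsof +D /boot
    # or, system-wide, files that were deleted but are still held open
    lsof -nP | grep '(deleted)'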

Rescue

(Permission problem for Apache)

PART - 1

intricacies of cron and crontab
note about using only necessary privileges in crontab
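
A sketch of what that looks like in practice (user and script names are made up):

    # edit the crontab of the unprivileged user that should own the job,
    # rather than piling everything into root's crontab
    crontab -u www-data -e

    # the entry itself: run the job once a minute
    * * * * * /usr/local/bin/update-rescue-image.sh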

Rescue

PART - 2

had the cron been working
we would have seen /var/www/rescue.png
getting updated once a minute

that gives us a hint
that some scheduled job is writing to it

scheduled often means cron
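
Two quick checks for that hint (paths as on this slide):

    # is the file actually being refreshed every minute?
    ls -l --full-time /var/www/rescue.png
    # is there a cron entry anywhere that writes it?
    grep -r rescue /etc/cron* /var/spool/cron 2>/dev/null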

Rescue

PART - 3

we could have done a chmod
but then again
not necessarily the right thing to do

understand what umask does
fix the root cause
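
A minimal illustration of the umask effect (values are illustrative):

    # with a restrictive umask, files the job creates are unreadable to Apache
    umask 077
    touch demo.png && ls -l demo.png      # -rw-------
    # with the usual default, they come out world-readable
    umask 022
    touch demo2.png && ls -l demo2.png    # -rw-r--r--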

Ticket Server

(MySQL Slow Query)

PART - 1

Request times out
Starting from port number 9292
work backwards using lsof

to find the Ruby process
running as the flo user
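
One way to do that walk:

    # who is listening on port 9292?
    sudo lsof -iTCP:9292 -sTCP:LISTEN
    # the output shows the PID and the owning user (flo)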

Ticket Server

PART - 2

to find the code
ps -f gives a hint
/proc/<pid>/cwd symlink is another strong hint
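
With the PID from the previous step (12345 is a stand-in):

    # the full command line hints at the script that was started
    ps -fp 12345
    # the working directory it was launched from
    ls -l /proc/12345/cwd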

once you find the code
you can look through it, find the query
and the timeout

Ticket Server

PART - 3

We could of course
increase the timeout

But we'd be setting ourselves up for future failure
  • breach our SLA with downstream services
  • put the overall throughput of the system at risk
  • not know what to do when the DB has 10x the rows

Basically, we might not have fixed the root cause
but only done a greedy fix of the immediate symptom

Ticket Server

PART - 4


/var/log/mysql/ has some logs
mysql-slow.log

If you peek into it, you see the offending query
but not much more

Next logical step:
generate a query plan
using EXPLAIN
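
Something along these lines (the query text is a guess based on the later slides, and the database name is made up):

    # take the query from mysql-slow.log and ask MySQL for its plan
    mysql tickets -e "
        EXPLAIN SELECT *
        FROM   ORDER_ITEMS
        WHERE  ORDER_ID = 42 AND PRODUCT_ID = 7\G"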

Ticket Server

PART - 5

Understand the query plan
See that the index is not getting used

But there is an Index
Ah, but it is a composite index

Understand what a composite index is
(assuming you know what an index is)

So we now either
modify this index and change the order
or add a new index
But again, not in haste!
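
Before touching anything, look at what is already defined (database name made up, as before):

    # list the existing indexes on the table
    mysql tickets -e "SHOW INDEX FROM ORDER_ITEMS;"
    # or the full definition, indexes included
    mysql tickets -e "SHOW CREATE TABLE ORDER_ITEMS\G"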

Ticket Server

PART - 6

The composite index should have gotten used
    ORDER_ITEMS (
        ORDER_ID,
        PRODUCT_ID,
        STATUS
    )
And our query used both ORDER_ID and PRODUCT_ID
Turns out it is an insidious bug.
    table: ORDER_ITEMS
    col:   ORDER_ID varchar(64)

    table: ORDER
    col:   ID int(10)
This is the real mistake:
comparing a varchar column to an int value
forces an implicit conversion,
so MySQL cannot use the index on ORDER_ID.
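
One possible fix, assuming the column really should be numeric (a sketch only; check the existing data first):

    # make ORDER_ITEMS.ORDER_ID match ORDER.ID,
    # so the comparison no longer forces a conversion
    mysql tickets -e "ALTER TABLE ORDER_ITEMS MODIFY ORDER_ID int(10);"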

Anger Management

(The Thrift Error)


So you find the source
but the error makes no sense

Digging deeper and following the source trail
Looks like it is expecting an I32 field
but is getting something else

Anger Management


The Thrift Server's source is not available
so what now?

Options

  • Enable remote debugger and step through
  • Use tcpdump or ngrep
    • works well with Thrift
    • because of the protocol's simplicity

Either way, our hypothesis is confirmed
We were expecting I32
We get a String instead
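
A sketch of the capture (the Thrift port here is a guess):

    # watch the raw Thrift traffic on the wire
    sudo ngrep -d any -x '' port 9090
    # or dump it with tcpdump and read the payload
    sudo tcpdump -i any -A port 9090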

Anger Management


Someone made a mistake
Changed the return type of the API
and did not communicate it
or create a migration path

Maybe it would have been easier
to simply ask the dev
on the other end ;-)

The

Takeaway

Hygiene


keep logs

take notes

use checklists

reproduce the problem
on a non-prod machine

Steps


problem isolation

forming hypotheses

listing assumptions

confirmation/validation

reproducibility

Awareness


Monitoring

host level monitoring
(sar, iostat, vmstat)

graphite

--- do not work in the dark ---

But also remember the streetlight effect.

https://en.wikipedia.org/wiki/Streetlight_effect

Collect fine-grained metrics if needed
e.g. sadc can be used to collect
host metrics at 1-second granularity
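
For example (the sadc path varies by distribution; this one is the Debian/Ubuntu location):

    # sample every second, 300 times, into a binary file...
    /usr/lib/sysstat/sadc 1 300 /tmp/sa.1sec
    # ...then read it back with sar
    sar -f /tmp/sa.1sec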

Awareness


Dependencies

Upstream and Downstream
dependencies
matter

Be it a
bug or a performance problem

Credits


Shashwat
Vishwas
Poornima
Kashyap
Kartik
Sameer
Pankaj
JJ

Yogi

(I hope you've been hitting the down arrow also)