Fixit Felix


first in a box

&

now in a slide


Image from http://www.zazzle.com/

The Problems

Boot

(/boot was full)


df shows /boot is full
we can't rm the file
but we can truncate it
and once we do
the space is freed
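
Roughly the shell steps, sketched out (the file name is just a stand-in):

    # confirm that /boot is full
    df -h /boot
    # empty the offender in place: the inode stays, the data goes
    truncate -s 0 /boot/huge.log      # hypothetical file name
    # verify the space came back
    df -h /boot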

But


is that sufficient?
we need to figure out who created it
will it come back?
is that legitimate behaviour?

lsof
tells us some process is still holding on to the file
spawned from init
so we have a handle to stop it
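
A sketch of that check:

    # which processes still have files under /boot open?
    lsof +D /boot
    # or, system-wide, files that were deleted but are still held open
    lsof -nP | grep '(deleted)'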

Rescue

(Permission problem for Apache)

PART - 1

intricacies of cron and crontab
note about using only necessary privileges in crontab
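
A sketch of what that looks like in practice (user and script names are made up):

    # edit the crontab of the unprivileged user that should own the job,
    # rather than piling everything into root's crontab
    crontab -u www-data -e

    # the entry itself: run the job once a minute
    * * * * * /usr/local/bin/update-rescue-image.sh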

Rescue

PART - 2

had the cron been working
we would have seen /var/www/rescue.png
getting updated once a minute

that gives us a hint
that some scheduled job is writing to it

scheduled often means cron
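
Two quick checks for that hint (paths as on this slide):

    # is the file actually being refreshed every minute?
    ls -l --full-time /var/www/rescue.png
    # is there a cron entry anywhere that writes it?
    grep -r rescue /etc/cron* /var/spool/cron 2>/dev/null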

Rescue

PART - 3

we could have done a chmod
but then again
not necessarily the right thing to do

understand what umask does
fix the root cause
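
A minimal illustration of the umask effect (values are illustrative):

    # with a restrictive umask, files the job creates are unreadable to Apache
    umask 077
    touch demo.png && ls -l demo.png      # -rw-------
    # with the usual default, they come out world-readable
    umask 022
    touch demo2.png && ls -l demo2.png    # -rw-r--r--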

Ticket Server

(MySQL Slow Query)

PART - 1

Request times out
Starting from port number 9292
work backwards using lsof

to find the Ruby process
running as the flo user
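
One way to do that walk:

    # who is listening on port 9292?
    sudo lsof -iTCP:9292 -sTCP:LISTEN
    # the output shows the PID and the owning user (flo)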

Ticket Server

PART - 2

to find the code
ps -f gives a hint
/proc/<pid>/cwd symlink is another strong hint
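
With the PID from the previous step (12345 is a stand-in):

    # the full command line hints at the script that was started
    ps -fp 12345
    # the working directory it was launched from
    ls -l /proc/12345/cwd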

once you find the code
you can look through it, find the query
and the timeout

Ticket Server

PART - 3

We could of course
increase the timeout

But we'd be setting ourselves up for future failure
  • breach our SLA with downstream services
  • put the overall throughput of the system at risk
  • not know what to do when the DB has 10x the rows

Basically, we might not have fixed the root cause
but only done a greedy fix of the immediate symptom

Ticket Server

PART - 4


/var/log/mysql/ has some logs
mysql-slow.log

If you peek into it, you see the offending query
but not much more

Next logical step:
generate a query plan
using EXPLAIN
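
Something along these lines (the query text is a guess based on the later slides, and the database name is made up):

    # take the query from mysql-slow.log and ask MySQL for its plan
    mysql tickets -e "
        EXPLAIN SELECT *
        FROM   ORDER_ITEMS
        WHERE  ORDER_ID = 42 AND PRODUCT_ID = 7\G"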

Ticket Server

PART - 5

Understand the query plan
See that the index is not getting used

But there is an Index
Ah, but it is a composite index

Understand what a composite index is
(assuming you know what an index is)

So we now either
modify this index and change the order
or add a new index
But again, not in haste!
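
Before touching anything, look at what is already defined (database name made up, as before):

    # list the existing indexes on the table
    mysql tickets -e "SHOW INDEX FROM ORDER_ITEMS;"
    # or the full definition, indexes included
    mysql tickets -e "SHOW CREATE TABLE ORDER_ITEMS\G"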

Ticket Server

PART - 6

The composite index should have gotten used
    ORDER_ITEMS (
        ORDER_ID,
        PRODUCT_ID,
        STATUS
    )
And our query used both ORDER_ID and PRODUCT_ID
Turns out it is an insidious bug.
    table: ORDER_ITEMS
    col:   ORDER_ID varchar(64)

    table: ORDER
    col:   ID int(10)
This is the real mistake:
comparing a varchar column to an int value
forces an implicit conversion,
so MySQL cannot use the index on ORDER_ID.
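
One possible fix, assuming the column really should be numeric (a sketch only; check the existing data first):

    # make ORDER_ITEMS.ORDER_ID match ORDER.ID,
    # so the comparison no longer forces a conversion
    mysql tickets -e "ALTER TABLE ORDER_ITEMS MODIFY ORDER_ID int(10);"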

Anger Management

(The Thrift Error)


So you find the source
but the error makes no sense

Digging deeper and following the source trail
Looks like it is expecting an I32 field
but is getting something else

Anger Management


The Thrift Server's source is not available
so what now?

Options

  • Enable remote debugger and step through
  • Use tcpdump or ngrep
    • works well with Thrift
    • because of the protocol's simplicity

Either way, our hypothesis is confirmed
We were expecting I32
We get a String instead
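
A sketch of the capture (the Thrift port here is a guess):

    # watch the raw Thrift traffic on the wire
    sudo ngrep -d any -x '' port 9090
    # or dump it with tcpdump and read the payload
    sudo tcpdump -i any -A port 9090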

Anger Management


Someone made a mistake
Changed the return type of the API
and did not communicate it
or create a migration path

Maybe it would have been easier
to simply ask the dev
on the other end ;-)

The

Takeaway

Hygiene


keep logs

take notes

use checklists

reproduce the problem
on a non-prod machine

Steps


problem isolation

forming hypotheses

listing assumptions

confirmation/validation

reproducibility

Awareness


Monitoring

host level monitoring
(sar, iostat, vmstat)

graphite

--- do not work in the dark ---

But also remember the streetlight effect.

https://en.wikipedia.org/wiki/Streetlight_effect

Collect fine-grained metrics if needed
e.g. sadc can be used to collect
host metrics at 1-second granularity
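
For example (the sadc path varies by distribution; this one is the Debian/Ubuntu location):

    # sample every second, 300 times, into a binary file...
    /usr/lib/sysstat/sadc 1 300 /tmp/sa.1sec
    # ...then read it back with sar
    sar -f /tmp/sa.1sec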

Awareness


Dependencies

Upstream and Downstream
dependencies
matter

Be it a
bug or a performance problem

Credits


Shashwat
Vishwas
Poornima
Kashyap
Kartik
Sameer
Pankaj
JJ

Yogi

(I hope you've been hitting the down arrow also)