Rescue
(Permission problem for Apache)
PART - 1
intricacies of cron and crontab
note about using only necessary privileges in crontab
Rescue
PART - 2
had the cron job been working
we would have seen /var/www/rescue.png
getting updated once a minute
that gives us a hint
that some
scheduled job is writing to it
scheduled often means cron
Rescue
PART - 3
we could have done a chmod
but then again
not necessarily the right thing to do
understand what umask does
fix the root cause
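
A minimal sketch of what umask does to a newly created file. It assumes (the slides do not say so) that the scheduled job creates the file under a restrictive umask such as 077; the function name and the temporary path are made up for illustration.

```python
import os
import stat
import tempfile

def mode_of_new_file(umask_value: int) -> str:
    """Create a file under the given umask and return its permission bits in octal."""
    old = os.umask(umask_value)  # set the process umask, remember the previous one
    try:
        with tempfile.TemporaryDirectory() as d:
            path = os.path.join(d, "rescue.png")  # hypothetical stand-in file
            # Ask for 0o666, as most programs do for regular files.
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            return oct(stat.S_IMODE(os.stat(path).st_mode))
    finally:
        os.umask(old)  # always restore the original umask

print(mode_of_new_file(0o077))  # group and others get nothing
print(mode_of_new_file(0o022))  # group and others can read
```

A chmod after the fact only repairs one file; if the job's umask stays restrictive, the next run may recreate the file with the same unreadable mode.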
Ticket Server
(MySQL Slow Query)
PART - 1
Request times out
Starting from port number 9292
work backwards using lsof
to find the ruby process
running under the flo user
Ticket Server
PART - 2
to find the code
ps -f gives a hint
/proc/pid/cwd symlink is another strong hint
once you find the code
you can look through it, find the query
and the timeout
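
A rough sketch of the same two hints read straight from /proc on Linux: the command line (what ps -f shows) and the cwd symlink. The function name is hypothetical, and it inspects its own pid here only so that it runs anywhere.

```python
import os

def describe_process(pid: int) -> dict:
    """Return the command line and working directory of a process (Linux /proc only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        # Arguments are NUL-separated in /proc/<pid>/cmdline.
        argv = [a.decode(errors="replace") for a in f.read().split(b"\0") if a]
    # /proc/<pid>/cwd is a symlink to the process's current working directory.
    cwd = os.readlink(f"/proc/{pid}/cwd")
    return {"argv": argv, "cwd": cwd}

info = describe_process(os.getpid())
print(info["argv"], info["cwd"])
```

For the ticket server you would pass the pid that lsof reports for port 9292; reading another user's /proc entries may need elevated privileges.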
Ticket Server
PART - 3
We could of course
increase the timeout
But we'd be setting ourselves up for future failure
- breach our SLA with downstream services
- put the overall throughput of the system at risk
- not know what to do when the db has 10x rows
Basically, we might not have fixed the root cause
but only done a greedy fix of the immediate symptom
Ticket Server
PART - 4
/var/log/mysql/ has some logs
mysql-slow.log
If you peek into it, you see the offending query
but not much more
Next logical step:
generate a query plan
using EXPLAIN
Ticket Server
PART - 5
Understand the query plan
See that the index is not getting used
But there is an index
Ah, but it is a composite index
Understand what a composite index is
(assuming you know what an index is)
So we now either
modify this index and change the order
or add a new index
But again, not in haste!
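
A small sketch of why column order in a composite index matters. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in for MySQL's EXPLAIN, so the plan text differs from what MySQL prints. The column types, the extra quantity column, and the index name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_items (
        order_id   INTEGER,
        product_id INTEGER,
        status     TEXT,
        quantity   INTEGER   -- hypothetical extra column, not part of the index
    );
    CREATE INDEX idx_items ON order_items (order_id, product_id, status);
""")

def plan(sql: str) -> str:
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Filters on the leading columns of the index: the index can be searched.
with_prefix = plan(
    "SELECT * FROM order_items WHERE order_id = 1 AND product_id = 2")

# Filters only on the last column: the leading columns are missing,
# so the planner falls back to scanning.
without_prefix = plan(
    "SELECT * FROM order_items WHERE status = 'OPEN'")

print(with_prefix)
print(without_prefix)
```

The index is sorted by order_id first, then product_id, then status, so a query that skips the leading columns cannot seek into it.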
Ticket Server
PART - 6
The composite index should have gotten used
ORDER_ITEMS (
ORDER_ID,
PRODUCT_ID,
STATUS
)
And our query used both ORDER_ID
and PRODUCT_ID
Turns out it is an insidious bug.
table: ORDER_ITEMS
col: ORDER_ID varchar(64)
table: ORDER
col: ID int(10)
This is the real mistake.
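
A sketch of why the type mismatch defeats the index, assuming the query compares the varchar ORDER_ID with the int ID (for example in a join). MySQL compares a string with a number numerically, and many different strings convert to the same number, so an index sorted by string cannot be used for the lookup. Python's float() stands in for MySQL's conversion here; the exact conversion rules differ, and the sample values are made up.

```python
# Hypothetical ORDER_ID values stored as strings.
candidates = ["1", "01", " 1", "1.0", "1e0", "2"]

# String comparison: only the exact text matches, so an index sorted
# by string can jump directly to the one matching entry.
string_matches = [s for s in candidates if s == "1"]

# Numeric comparison: every string is converted to a number first,
# so every row has to be converted and checked.
numeric_matches = [s for s in candidates if float(s) == 1]

print(string_matches)
print(numeric_matches)
```

Making the two columns the same type lets the comparison stay within one type and the existing composite index apply.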
Anger Management
(The Thrift Error)
So you find the source
but the error makes no sense
Digging deeper and following the source trail
Looks like it is expecting an I32 field
but is getting something else
Anger Management
The Thrift Server's source is not available
so what now?
Options
- Enable remote debugger and step through
- Use tcpdump or ngrep
  - works well with Thrift
  - because of the protocol simplicity
Either way, our hypothesis is confirmed
We were expecting I32
We get a String instead
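
A sketch of why the mismatch is visible in a packet capture, assuming the server speaks the Thrift binary protocol (the compact protocol encodes fields differently). Each field starts with a one-byte type ID and a two-byte field ID; I32 is type 8 and string is type 11. The helper names and the field ID 0 are assumptions for illustration.

```python
import struct

T_I32, T_STRING = 8, 11  # type IDs in the Thrift binary protocol

def encode_i32_field(field_id: int, value: int) -> bytes:
    # type (1 byte) + field id (2 bytes) + value (4 bytes), all big-endian
    return struct.pack(">bhi", T_I32, field_id, value)

def encode_string_field(field_id: int, text: str) -> bytes:
    data = text.encode("utf-8")
    # type + field id + byte length (4 bytes), followed by the raw bytes
    return struct.pack(">bhi", T_STRING, field_id, len(data)) + data

def decode_field(buf: bytes):
    """Decode one field header and its value from captured bytes."""
    ftype, field_id = struct.unpack_from(">bh", buf, 0)
    if ftype == T_I32:
        (value,) = struct.unpack_from(">i", buf, 3)
        return ("I32", field_id, value)
    if ftype == T_STRING:
        (length,) = struct.unpack_from(">i", buf, 3)
        return ("STRING", field_id, buf[7:7 + length].decode("utf-8"))
    raise ValueError(f"unhandled type id {ftype}")

expected = decode_field(encode_i32_field(0, 42))
actual = decode_field(encode_string_field(0, "42"))
print(expected, actual)
```

In a capture, the type byte in front of the field already tells the two cases apart, before looking at the value at all.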
Anger Management
Someone made a mistake
Changed the return type of the API
and did not communicate it
or create a migration path
Maybe it would have been easier
to simply ask the dev
on the other end ;-)