The Secret Life of Bears: November 2008

Sunday, November 30, 2008

Web-based auto-generated self-documentation

Fire up pydoc -p 8080, then browse to http://localhost:8080

It will generate a browseable directory of documentation for every module you have installed that's visible on sys.path .

Though the layout is hideous.

I'm not aware of another way to see every module you have installed.

Friday, November 28, 2008

Never use a [] or {} as a default argument.

I've done this a couple times.

When you make [] a default argument, it will be the same list on every subsequent invocation, with the same id(). If you append to it, what you've appended will be still be there for later invocations of the function.

Example:

def functionx(mylist = []):

mylist.append("F")

print mylist

for i in range(4): functionx()

The output of this program should be:

F, F

F, F, F

F, F, F, F

Needless to say, this is really, really stupid. The reason is because the default arguments get defined at definition time, and any mutable type will persist. This is also why somedict.setdefault({},...) will also mess up, any why defaultdict instead uses a callable.

So, you should only use constant values as default parameters, and never a mutable type. I've made this mistake a couple times.

I've seen some obtuse lunkheads actually saying it's better this way. It seems to be a rule that all language designers eventually start rationalizing their mistakes and the corners they cut (another example: see this interview with Håkon Lie , as he is unable to admit there could be any flaw in the design of CSS) and pointlessly defend bad decisions in the work they've invested so much time that no one who approached the language with fresh eyes would ever, ever defend.

I've also seen some people saying that you can use this to simulate static variables. To which I say: so, give us proper static variables then, and don't plant your language full of landmines sure to screw people up when they're not expecting it.

Tuesday, November 25, 2008

Change your readline config to traverse by words

The readline library controls the behavior of most shell programs, including bash, the mysql client and python (though you have to import readline to get this behavior).

This is what gives you ^A to go to the beginning of the line and ^E to go to the end of line. (Note these will also work in many text fields in Mac OS programs).

By default ^F moves the cursor forward one character, and ^B goes back one character. This is fairly useless as any modern keyboard comes with arrow keys.

I overloaded ^F and ^B to instead go forward and backward by words, by adding these lines to ~/.inputrc:

"\C-b" backward-word
"\C-f" forward-word

Another option I always, always set in .inputrc is set completion-ignore-case on.

See man readline for your options. There is actually a vast number of settings compared to the puny set of defaults.

Tell what distro you're using from the command line

I've always wanted to know a command that will tell you what linux distribution you're using.

The answer is:

cat /etc/*release

The results for my current system are:

DISTRIB_ID=Ubuntu

DISTRIB_RELEASE=8.04
DISTRIB_CODENAME=hardy
DISTRIB_DESCRIPTION="Ubuntu 8.04"

Though it will hardly work on all POSIX systems, say my Mac for instance.

Monday, November 24, 2008

drop temporary table

Good idea: when you're dropping a temporary table, use the explicit 'drop temporary table' syntax rather than 'drop table'. This ensures that you cannot drop a real table by accident when you only intend to drop a temporary table.

Let mysql recommend your table definition

If you have an existing table, you can have mysql tell you what it thinks the optimal definition for the table ought to be.

Like so:

select * from mytable procedure analyse();

Note the British spelling of 'analyse'.

This will return a row of results for every column in the select statement. Here's the result for an example column:

*************************** 1. row ***************************
Field_name: someschema.sometable.somecolumn
Min_value: #1 Cochran, Inc.
Max_value: ZZZ Capital Corp.
Min_length: 2
Max_length: 112
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 24.2406
Std: NULL
Optimal_fieldtype: VARCHAR(112) NOT NULL

This will tell you the smallest, largest, and average values of every column in the table, the number of empty or null values, and a recommended definition for the column. Getting the range of a column to determine its optimal size is something I do all the time, so it's nice to have a shortcut to help you do that.

This works well with the 'show table status' command to tell you the absolute table and row sizes of your tables.

Wednesday, November 19, 2008

OpenSQL camp Notes

I've finally got my notes from the saturday sessions of opensql camp.

What follows is a transcription of portions of them that I felt were worth sharing:

(Note also that many of the videos of the sessions have been uploaded.)

How SQL Database Engines Work, by Dr. Richard Hipp, founder of SQLite

- You can't mix ASC and DESC in a multi-column sort and use an index on those same columns to optimize away a filesort operation; you can ASC, or DESC, but not both; UNLESS you declare the columns to be ASC or DESC or vice versa in the index definition.

- To use an index to optimize both where and order by clauses: where clause must be first, order by cols must be second (but the two can overlap). Selected cols that are at least covered by the index can appear in any order.

- OR operations - if you can get the row ids out in order from indexes, you don't need to make a temporary table - could work for AND operations on different cols. and you can nest these intersection & unions using a bitmap index.

Unfortunately, Richard didn't have his slides on the internal mechanics of join operations, which I wold have wanted to see.

Lightening talks, misc presenters:

- Aker: Talks about Gearman, an open source messaging system

- Jay Pipes: talks about Google proto buffers, an open source object multi-language object serialization system

- Patrick Galbraith: talks about a suite of UDF functions for mysql for interacting with memcached. You can then use these in triggers to populate memcached data.

- Jim Starkey: talks about peer-to-peer cloud database system he's working on

Join-fu, by Jay Pipes

Schema is the foundation of performance

Different storage engines have different strengths/weaknesses
- in order to optimize, you need to understand these cost/benefits
- fully optimized systems rarely use just one storage engine

Varchars are silently converted to chars in memory/temporary/derived tables!
- memory tables are fixed-length only. no variable length columns.
- memory, temporary tables, and derived tables are all the same in this respect (I asked him about this; he said that temporary tables use the memory storage by default, but my testing and I'm pretty sure that temporary tables use by default whatever your current default storage engine is, which is usually MyISAM)
- Especially a problem if you have a wide varchar that's usually mostly empty,
- especially a problem if you're using utf8 because space used will become 3 x the char length (because that's what char() does to make sure it remains fixed length, expanding max theoretically possible byte length for that number of characters. Why can't temporary/memory/derived tables just convert to ucs2?)

Vertical partitioning
- put the most commonly used cols in a separate table with fixed length rows
- you can fit a lot more in a data page this way, speeds up scans/reads
- you can put text columns in a separate table w/ fulltext index if you need it; you can even replicate an innodb table to a myisam copy of itself if you need innodb features plus the fulltext index

Query cache
- Any modification to any table involved in the query will invalidate it in the query cache
- so, perhaps put frequently updated cols in another table
- or, if you do a select you know you're about to invalidate, use the sql_no_cache keyword
- there may be a plugin somewhere to see the contents of the query cache
- Jay: "I whip the query cache like a rented mule"

Try variations on a theme when optimizing (ed.:"trial and error")

Join hints are dangerous (use index(), force index(), ignore index())
- could get out of date
- inform your teammates when you use them

Explain output
- "type": the "access strategy" used to grab this data
- "row": estimated rows for this set
- "type: const" - accessing non-nullable unique key; must return 1 or 0 rows only. Can be propagated across joins: if 1 row returned only 1 row to join against another non-null unique index.
- low cardinality equals poor index optimization (cardinality = the number of distinct values in this col over the number of rows)
- "type: index" - index scan usually bad (ed.: but better than a table scan). I asked him about secondary vs. a clustered index scan in InnoDB
- "extra: using index" - usually good, means a covering index is used and no bookmark lookup required

don't functionally transform an indexed column, including coalesce() or isnull() (this will obviously prevent you from being able to a lookup on the index; mysql does not yet support indexes on derived columns); avoid using a non-deterministic function rather than a literal value if you can, because non-deterministic functions cannot be cached.

Stored procedures aren't very good in MySQL
- the compile cache is specific to each connection

The slides from an older version of this presentation are available here. I posted video of another presentation of his here.

I'm also trying to review the video from Peter Zaitsev's talk on sharding in MySQL, having a few issues there.

More complete regex support coming in 2.7?

Looking through the python issue tracker, looks like they're working on almost my exact wishlist of missing features in python's re module.

Including:

possessive quantifiers, atomic grouping
(?flag:...), (?-flag:...) support, (?flag) scoped correctly
variable length lookbehinds

So all of this will be ready for 2.7, right? right?

Tuesday, November 18, 2008

The blackhole storage engine

MySQL supports multiple storage engines, such as MyISAM and InnoDB. One such storage engine is the BLACKHOLE storage engine. Everything you write to a table in this storage engine goes to /dev/null, in other words, it doesn't go anywhere.

So, what is this useful for?

So, while the actual table content isn't stored, the binary log does still exist (the binary log stores a list of modifications to a table such that if the db crashes you don't lose anything). You can therefore use this to replicate the blackhole table to another table somewhere else that uses a more normal storage engine (recall that when you replicate a table, you don't have to use the same storage engine on each side). This allows you to insert into a table asynchronously according to your replication settings.

Monday, November 17, 2008

Row null mask

Mysql contains the nullness of a column in a byte or bytes called the row null mask. Your row size will be one byte larger per every 8 columns that are nullable; they each get a single bit in the same shared byte.

This means that if you have one nullable column, you get seven more for free! It also means that if you one nullable column, you have to pay the same overhead as if you had 7 more nullable columns.

Get subpattern matches balanced even if they're optional

This will give you a mysterious "unbalanced grouping" error:

re.sub('.(.)?)', r'aaa\1')

Because the parenthetical grouping is optional, trying to access it via the \1 backreference will result in an error.

The solution is to replace the ? operator with a null alternative in the paren:

re.sub('.(|.)', r'aaa\1')

Which gives the grouping the option of containing nothing, thereby maintaining the existence of a backreference (though the backreference will of course contain an empty string, which is what you would actually expect).

Back

"Next stop Old Greenwich, then we're in Stamford. Wouldn't want you to be sleeping. The station yard is has scary stuff, big rats. The yardmaster makes terrible carvings."

With these surreal words the conductor shook me awoke. I was totally passed out on the metronorth train, something I'm normally not able to do. That's an exact quote, as least as well as I can remember it.

Almost everything that could go wrong on the trip back did and I'm glad to be home.

I met lots of database luminaries this weekend, the guy who founded MySQL, the guy who founded SQLite, the guy who invented BLOBs, and pretty much everyone who's ever written a book about MySQL. I learned all sorts of esoteric knowledge. I'm eager to see the videos of the sessions in other tracks I missed. Monty seemed pleased that I seemed to like the strange scandinavian fish-flavored drink he tried to make everyone try after the sessions ended on Saturday.

Saturday, November 15, 2008

Jay Pipes

Jay's presentation was by far the most informative. I pinned him down afterward and asked him a bunch of questions.

I'll transcribe my notes later.

memcached msql UDFs

Patrick Galbraith has just given an intriguing talk on his memcached functions for mysql. This allows you to work with memcached seamlessly via triggers on mysql objects.

Meanwhile, Aker is talking about gearman, an open source messaging system

SQLite

Dr. Richard Hipp: "We're not trying to compete with MySQL, we're trying to compete with fopen()"

SQLite is intended for managing disk access for a single application.

Charlottesville

Charlottesville is a cute little down, with an extensive red brick pedestrian arcade. It actually has a European feel to it. I was expecting a fork in the road with two sticks stuck in the ground and the conference would take place around there.

I was expecting to see drunken rednecks, and while I have seen a few of those, there are a lot of hippies selling whatever on the street and hipsters as well. There seems to be hipsters everywhere you go, whether it be New York, Virginia, Thailand, Japan, all with their black-rimmed glasses and v-neck sweaters listening to Nirvana. There must be some pan-hipster consciousness, weaving all of the threads of humanity together.

Everyone is extremely friendly, people smile at you, wave at you when you walk by. The waitress at breakfast called me "sweetie".

Thanks to our MySQL contact, last night I had dinner with some MySQL top guns at a "Himalayan" restaurant -- I sat next to Monty, and Aker was a few seats away.

Aker's just finished his keynote, and a Postgres guy is giving a talk on concurrency.

Thursday, November 13, 2008

limit on update and delete statements

You can use the LIMIT clause on delete and update statements, which will limit you to deleting or updating a set number of rows.

This is useful if you are feeling especially paranoid about making a mistake with a delete or update and nuking your table (everyone has done this once or twice). Just throw an extra limit 1 clause on your statement and you can be absolutely sure that only one row will be affected.

Wednesday, November 12, 2008

Nested list comprehensions

Fun fact: you can nest list comprehensions.

This syntax looks like:

[(i, j) for i in range(4) for j in range(i)]

Which is exactly equivalent to:

x = []

for i in range(4):

for j in range(i):

x.append((i, j))

length() vs. char_length()

If you're using a multi-byte character encoding (e.g. utf8, ucs2) be aware than length() and char_length() may return different results.

length() will return the length in bytes; char_length() will return the number of character in the string, dependent on the character encoding.

The alternative insert syntax

Sometimes, if you're doing an insert, it's easier to type the alternative insert syntax.

insert into sometable set column1='new value', column2='new value'

Yes, this will work

Tuesday, November 11, 2008

Configure mysql to make it better

Create a file in your ~ directory, named .my.cnf . Drop everything and do this right now.

It should look like this:

[mysql]
prompt = 'mysql://\u@\h/\d> '
pager = less -FX

What do these settings do?

The prompt setting changes your mysql prompt to tell you what user, server and database you're currently using. This is vastly more useful if you have lots of connections to different mysql databases, which is constantly the case for me when I'm at work. I used the pseudo-url format used by the PEAR::DB module from PHP, which I think is kind of cool.

This makes my prompt look something like:

mysql://nick@localhost/service_demo>

The pager setting makes your query results appear in a pager program like less. Now, it's actually kind of annoying if every single result appears in a pager. So, the -FX flags tell less to avoid itself if what it is paging is less than 1 page height.

These two settings make the clui mysql client vastly more useful.

In 6.0 there will be an auto-vertical setting to use the \G vertical display automatically if the query result is wider than the current window. I thought of adapting this patch to make an auto-pager setting, but then I remembered about the -F flag to less which essentially does the same thing, although I don't know if there is a similar setting to newer pagers like most.

The last sections of the documentation contain lots of useful tips about these two settings -- it's a shame that they're so little known.

Populating summary tables using init_file

You can populate summary tables at server startup using the init_file variable. This will also allow you to keep summary tables under the memory storage engine, which is non-persistent and will be truncated on server restart.

Jay Pipes Google TechTalk Apr 2008

Monday, November 10, 2008

The hidden debug flag in the re module

The compile function of the re module, as I'm sure you already know, takes a second parameter of flags, such as re.I or re.X.

There is an undocumented flag, however, that will print out the parse tree of the regex. This is the re.DEBUG flag.

Here's some trivial output:

>>> re.compile(u'ell2\u1f4c', re.DEBUG)

literal 101

literal 108

literal 50

literal 8012

The 'literal' statements are followed by the ascii dec values of the characters in the pattern. I threw in a unicode character and you can see the unicode code point as the last item.

This will also tell you whether the pattern is being compiled for the first time or if it is being pulled out of the re module's cache.

Saturday, November 8, 2008

Who you are and who you have been authenticated as

In Mysql, to tell who you are and who you have been authenticated as, use the user() and current_user() functions.

Example:

mysql> select user(), current_user() \G
user(): nick@localhost
current_user(): nick@%

This is useful because you may have overlapping authentication rules, such as 'nick@192.168.%' and 'nick@%' and 'nick@localhost'.

Mysql permissions are confusing and frustrating, hopefully this helps debug some of its issues.

Friday, November 7, 2008

Create a new table from another table's definition

To create a table new_table as a blank copy of another table called another_table, use:

create table new_table like another_table;

Of course you can populate new_table with the contents of another_table like so:

insert into new_table select * from another_table;

You can combine these into one step! But, unlike the two-step method this way has the serious limitation of losing any of the indexes on the original table:

create table new_table select * from another_table;

Thursday, November 6, 2008

Make a more efficient list of integers

If you have a list that's all of the same type, and that type is numeric, use the array module for more efficiency.

Keep a list always sorted

Use the bisect module

You never need to cover the primary key in Innodb

In Innodb (unlike MyISAM), the primary key is stored as a lookup in every secondary index; therefore, there is never a need to add the primary key column to a secondary index in order to cover it.

In MyISAM secondary indexes store the internal row ID as a lookup, so this trick will not work and you need to explicitly add the primary key column if you want to cover it in an index.

Brief aside: in some database systems you can specify columns to add to the index pages without affecting the index structure or constraints. For example, an index like (some_unique_column, primary_key) could cover the primary key in the index without destroying the uniqueness constraint, since of course the primary key would be independently unique.

Tuesday, November 4, 2008

How to tell if the current output is a terminal

sys.stdout.isatty()

How is the encoding of stdout chosen?

It's actually determined by the LC_TYPE environmental variable

Python: how is sys.stdout.encoding chosen?

Monday, November 3, 2008

arithmatic and bitwise operators on sets

The set object supports basic arithmatic operators and bitwise operators.

>>> a = set('abracadabra')

>>> b = set('alacazam')

>>> a # unique letters in a set(['a', 'r', 'b', 'c', 'd'])

>>> a - b # letters in a but not in b set(['r', 'd', 'b'])

>>> a | b # letters in either a or b set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])

>>> a & b # letters in both a and b set(['a', 'c'])

>>> a ^ b # letters in a or b but not both set(['r', 'd', 'b', 'm', 'z', 'l'])

Cool way to sort a dict by key or value

A simple way to sort a dict is to use iteritems() with a lamdba function, like so:

somedict = {...}
[ x[0] for x in sorted(somedict.iteritems(), lambda x, y: cmp(x[1], y[1])) ]

The cooler (and faster) way is to use itemgetter from the operator module, with the key parameter to sorted().

To sort a dict by keys, use:

import operator
somedict = {...}
[k for k, v in sorted(d.iteritems(), key=itemgetter(0))]

To sort a dict by values, instead use the line:

import operator
somedict = {...}
[k for k, v in sorted(d.iteritems(), key=itemgetter(1))]

Remember how itemgetter works: it is identical to the key lookup operator. In other words, in the code below,

somedict = {'a': 1}
operator.itemgetter('a')(somedict)
somedict['a']

the last two lines do exactly the same thing. This is a neat trick to avoid a lambda call, since subroutine calls are relatively expensive.

How to get a builtin back after you've overloaded it

One of the things in python that I'm not such a fan of is how easy it is to overload a built-in function without realizing it, since functions and variables live in the same namespace. 'str' and 'type' are built-in functions that you need fairly often and also common variable names used in programming; if you have both type() and str() will break and misbehave without it necessarily being obvious why.

For instance:

i = 1
type = 'obj'
type(i)

will result in a "is not callable" TypeError, which may not be that clear at first glance.

To get a builtin back even though it's been overloaded, just specify the builtin namespace, as shown below:

i = 1
type = 'obj'
__builtins__.type(i)

Also, if you del the local 'type' variable, the builtin 'type' will be reinstated. After that, built-ins themselves cannot be deleted, and you will get a NameError if you try.

However, it's probably best to just avoid overloading builtins by convention. For instance, I try to remember to name a generic string variable 'astr' rather than 'str'.

enumerate

Say you want to loop over an iterable, but you want to keep an index of your position.

The most obvious way to do this is to use a separate counter variable, like so:

aseq = 'hello'
i = 0
for c in aseq:
   ....
   i += 1

While that will work, the cooler way is to use enumerate. Enumerate encapsulates any iterable in another iterable that will return the value as yielded by the inner iterable with the value of the current index.

aseq = 'hello'
for (i, c) in enumerate(aseq):
   ....