The Secret Life of Bears: 2008

Monday, December 29, 2008

Mysql sounds like operator

MySQL has an operator SOUNDS LIKE .

a sounds like b

is equivalent to

soundex(a) == soundex(b)

Cute, though not something I would use in production code.

Thursday, December 25, 2008

Put last word in quotes

Really useful tidbit from the readline manual:

Put the following in your .inputrc:

$if Bash
# quote the current or previous word
"\C-g": "\eb\"\ef\""
$endif

Now cntrl-g will surround the previous or current argument with quote characters.

Tuesday, December 23, 2008

Fix up screen good

Awhile ago, I set up screen on another machine.

I was setting up my screen config for a new machine, and I had completely forgotten how I had fixed it up.

Here's what I did:

don't bother with aliasing 'screen' to 'TERM=screen screen'
to fix backspace-delete, instead put 'bindkey -k kD stuff ^H' in .screenrc (note ^H should be entered as cntrl-V-cntrl-H to get a literal)
to get color shell prompts in bash:

there should be a line in .bashrc that looks like case "$TERM" in xterm-color) color_prompt=yes;; esac
add screen) color_prompt=yes;; into the body of the case statement

to get window titles added correctly, for bash:

add this to .screenrc: shelltitle '$ |bash'
add this to .bashrc: PS1=$PS1'\[\033k\033\\\]'

to get window titles added correctly for zsh, add the following to .zshrc:

preexec () { echo -ne "\ek${1%% *}\e\\"}
precmd () { echo -ne "\ekzsh\e\\" }

put 'shell zsh' in .screenrc to start with zsh automatically
startup_message off to disable the startup message
a status line config that I like: '%{kw}[%{kw}%?%-Lw%?%{=b wR}(%{k}%n*%f %t%?(%u)%?%{=b wR})%{= wk}%?%+Lw%?%?%= %{= }]%{g}[%l]'

I found some of these example configs helpful, especially for deciphering the caption/hardstatus lines:

Saturday, December 20, 2008

make python understand ~ for home directories

Python does not automatically expand ~ for user home directories. There is a function in os.path that does that for you:

mypath = os.path.expanduser('~/somepath')

mypath == '/home/nick/somepath'

Python has two builtin functions for evaluating iterables for truth/falseness: any() and all(). any() returns true if any member of the iterable evaluates to True, while all returns True only if every single member of the iterable evaluates to True.

>>> any([True, False, False])
True
>>> any([False, 0])
False
>>> all([True, 0])
False
>>> all([1])
True
>>> all([1, 1, True])
True

There are identically named subquery operators in sql, however they are only really useful for correlated subqueries which are poorly optimized and generally to be avoided in the current versions of mysql.

Wednesday, December 17, 2008

Commands that can see themselves

You can see the contents of the current command line in /proc/self/cmdline . Of course this is only supported by systems that support /proc, i.e. not a BSD-based system like OS X.

cat /proc/self/cmdline --> cat /proc/self/cmdline

I'm not sure what this is useful for.

The different permutations of grep

bzegrep (1) - search possibly bzip2 compressed files for a regular expression
bzfgrep (1) - search possibly bzip2 compressed files for a regular expression
bzgrep (1) - search possibly bzip2 compressed files for a regular expression
egrep (1) - print lines matching a pattern
fgrep (1) - print lines matching a pattern
grep (1) - print lines matching a pattern
pgrep (1) - look up or signal processes based on name and other attributes
rgrep (1) - print lines matching a pattern
zegrep (1) - search possibly compressed files for a regular expression
zfgrep (1) - search possibly compressed files for a regular expression
zgrep (1) - search possibly compressed files for a regular expression

According to passing the string 'grep' to the always useful 'apropos' command

commandline fun tr and seq

Uppercase on the command line:

tr '[:lower:]' '[:upper:]'

Strip all newlines:

tr -d '\n'

Strip all non-printing characters:

tr -dc '[:print:]'

print from 1 to 100:

seq 100

print from 2 to 50, incrementing by 3:

seq 2 3 50

print from 1 to 10, all on the same line:

seq 10 | xargs echo

Friday, December 12, 2008

Parse any date easily

Use dateutil.parser.parse, possibly with the fuzzy flag set to True.

>>> from dateutil.parser import parse

>>> parse("Jan. 1st, 1991", fuzzy=True)

datetime.datetime(1991, 1, 1, 0, 0)

This is a huge boon over the tedious and underfeatured time.strptime and similar. Unhelpfully, the documentation is non-existent.

How to swap two tables without downtime

You can swap two tables in the same schema without messing up any readers by combining rename table operations into a single statement.

rename my_table to my_table_old, my_table_new to my_table;

I haven't checked this but the statement should use a name lock so that attempts to access either table by other clients will be blocked until the statements complete.

Thursday, December 11, 2008

See how python compiles some code

use the dis module to get the bytecode of any python callable.

Example:

import dis
dis.dis(lambda: x is True)

  1 0 LOAD_GLOBAL 0 (var)
   3 LOAD_GLOBAL 1 (True)
   6 COMPARE_OP 8 (is)
   9 RETURN_VALUE

This is another way to get the efficiency of different parts of code

Sunday, December 7, 2008

See how python encodes a string internally

Fun fact: you can see how python encodes a string internally by asking it to encode a unicode object to the 'unicode_internal' codec.

Example:

>>> u'\N{SNOWMAN}'.encode('utf16')
'\xfe\xff&\x03'

>>> u'\N{SNOWMAN}'.encode('unicode_internal')

'&\x03'

As you can see, it's UTF16 without a BOM.

Fun stuff with unicode in python

Fun fact: you can interpolate a unicode character into a unicode string literal by using \N{name of character}

Example:

>>> print u'this sign is \N{VULGAR FRACTION ONE HALF} off'.encode('utf8')
this sign is ½ off

You can tell the name of any given character by using the unicodedata module

>>> import unicodedata
>>> unicodedata.name(u'\xbd')
'VULGAR FRACTION ONE HALF'

The name of the famous snowman character is simply enough 'SNOWMAN.

>>> unicodedata.name(unichr(9731))
'SNOWMAN'

So you can represent this character in python simply by u'\N{SNOWMAN}'

Thursday, December 4, 2008

Use vim's multiple paste buffers

Ever find yourself avoiding cutting something in vim because you don't want to lose the contents of your paste buffer?

Fear not! Vim has been silently saving away the last 16 things you've cut into multiple paste buffers, which you could have been accessing all this time had you only heard the first thing of it.

Type :reg to see the contents of your paste buffers. To paste something that you cut 3 cuts ago, just use "3p

Wednesday, December 3, 2008

Hash images

Now, this is what I call cool: findimagedupes.

It's a image hashing program, that outputs a "perceptual hash" of image files. Unlike a cryptological hash, a small change in an image should result in a small change, so you can use this to tell how similar two slightly different images are.

You can also use it to recursively search directories for duplicate images.

It's in apt for debian and ubuntu.

Tuesday, December 2, 2008

Keep a series of python objects in a shelf

You can store a dict of pickled objects in a "shelf", a dbm-based hash table stored to disk, using the shelve module.

Monday, December 1, 2008

Tell whether two clients are using the same connection

select connection_id() will tell you the unique id of your connection. Try it on two different client objects to see if they are using the same connection or not.

Sunday, November 30, 2008

Web-based auto-generated self-documentation

Fire up pydoc -p 8080, then browse to http://localhost:8080

It will generate a browseable directory of documentation for every module you have installed that's visible on sys.path .

Though the layout is hideous.

I'm not aware of another way to see every module you have installed.

Friday, November 28, 2008

Never use a [] or {} as a default argument.

I've done this a couple times.

When you make [] a default argument, it will be the same list on every subsequent invocation, with the same id(). If you append to it, what you've appended will be still be there for later invocations of the function.

Example:

def functionx(mylist = []):

mylist.append("F")

print mylist

for i in range(4): functionx()

The output of this program should be:

F, F

F, F, F

F, F, F, F

Needless to say, this is really, really stupid. The reason is because the default arguments get defined at definition time, and any mutable type will persist. This is also why somedict.setdefault({},...) will also mess up, any why defaultdict instead uses a callable.

So, you should only use constant values as default parameters, and never a mutable type. I've made this mistake a couple times.

I've seen some obtuse lunkheads actually saying it's better this way. It seems to be a rule that all language designers eventually start rationalizing their mistakes and the corners they cut (another example: see this interview with Håkon Lie , as he is unable to admit there could be any flaw in the design of CSS) and pointlessly defend bad decisions in the work they've invested so much time that no one who approached the language with fresh eyes would ever, ever defend.

I've also seen some people saying that you can use this to simulate static variables. To which I say: so, give us proper static variables then, and don't plant your language full of landmines sure to screw people up when they're not expecting it.

Tuesday, November 25, 2008

Change your readline config to traverse by words

The readline library controls the behavior of most shell programs, including bash, the mysql client and python (though you have to import readline to get this behavior).

This is what gives you ^A to go to the beginning of the line and ^E to go to the end of line. (Note these will also work in many text fields in Mac OS programs).

By default ^F moves the cursor forward one character, and ^B goes back one character. This is fairly useless as any modern keyboard comes with arrow keys.

I overloaded ^F and ^B to instead go forward and backward by words, by adding these lines to ~/.inputrc:

"\C-b" backward-word
"\C-f" forward-word

Another option I always, always set in .inputrc is set completion-ignore-case on.

See man readline for your options. There is actually a vast number of settings compared to the puny set of defaults.

Tell what distro you're using from the command line

I've always wanted to know a command that will tell you what linux distribution you're using.

The answer is:

cat /etc/*release

The results for my current system are:

DISTRIB_ID=Ubuntu

DISTRIB_RELEASE=8.04
DISTRIB_CODENAME=hardy
DISTRIB_DESCRIPTION="Ubuntu 8.04"

Though it will hardly work on all POSIX systems, say my Mac for instance.

Monday, November 24, 2008

drop temporary table

Good idea: when you're dropping a temporary table, use the explicit 'drop temporary table' syntax rather than 'drop table'. This ensures that you cannot drop a real table by accident when you only intend to drop a temporary table.

Let mysql recommend your table definition

If you have an existing table, you can have mysql tell you what it thinks the optimal definition for the table ought to be.

Like so:

select * from mytable procedure analyse();

Note the British spelling of 'analyse'.

This will return a row of results for every column in the select statement. Here's the result for an example column:

*************************** 1. row ***************************
Field_name: someschema.sometable.somecolumn
Min_value: #1 Cochran, Inc.
Max_value: ZZZ Capital Corp.
Min_length: 2
Max_length: 112
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 24.2406
Std: NULL
Optimal_fieldtype: VARCHAR(112) NOT NULL

This will tell you the smallest, largest, and average values of every column in the table, the number of empty or null values, and a recommended definition for the column. Getting the range of a column to determine its optimal size is something I do all the time, so it's nice to have a shortcut to help you do that.

This works well with the 'show table status' command to tell you the absolute table and row sizes of your tables.

Wednesday, November 19, 2008

OpenSQL camp Notes

I've finally got my notes from the saturday sessions of opensql camp.

What follows is a transcription of portions of them that I felt were worth sharing:

(Note also that many of the videos of the sessions have been uploaded.)

How SQL Database Engines Work, by Dr. Richard Hipp, founder of SQLite

- You can't mix ASC and DESC in a multi-column sort and use an index on those same columns to optimize away a filesort operation; you can ASC, or DESC, but not both; UNLESS you declare the columns to be ASC or DESC or vice versa in the index definition.

- To use an index to optimize both where and order by clauses: where clause must be first, order by cols must be second (but the two can overlap). Selected cols that are at least covered by the index can appear in any order.

- OR operations - if you can get the row ids out in order from indexes, you don't need to make a temporary table - could work for AND operations on different cols. and you can nest these intersection & unions using a bitmap index.

Unfortunately, Richard didn't have his slides on the internal mechanics of join operations, which I wold have wanted to see.

Lightening talks, misc presenters:

- Aker: Talks about Gearman, an open source messaging system

- Jay Pipes: talks about Google proto buffers, an open source object multi-language object serialization system

- Patrick Galbraith: talks about a suite of UDF functions for mysql for interacting with memcached. You can then use these in triggers to populate memcached data.

- Jim Starkey: talks about peer-to-peer cloud database system he's working on

Join-fu, by Jay Pipes

Schema is the foundation of performance

Different storage engines have different strengths/weaknesses
- in order to optimize, you need to understand these cost/benefits
- fully optimized systems rarely use just one storage engine

Varchars are silently converted to chars in memory/temporary/derived tables!
- memory tables are fixed-length only. no variable length columns.
- memory, temporary tables, and derived tables are all the same in this respect (I asked him about this; he said that temporary tables use the memory storage by default, but my testing and I'm pretty sure that temporary tables use by default whatever your current default storage engine is, which is usually MyISAM)
- Especially a problem if you have a wide varchar that's usually mostly empty,
- especially a problem if you're using utf8 because space used will become 3 x the char length (because that's what char() does to make sure it remains fixed length, expanding max theoretically possible byte length for that number of characters. Why can't temporary/memory/derived tables just convert to ucs2?)

Vertical partitioning
- put the most commonly used cols in a separate table with fixed length rows
- you can fit a lot more in a data page this way, speeds up scans/reads
- you can put text columns in a separate table w/ fulltext index if you need it; you can even replicate an innodb table to a myisam copy of itself if you need innodb features plus the fulltext index

Query cache
- Any modification to any table involved in the query will invalidate it in the query cache
- so, perhaps put frequently updated cols in another table
- or, if you do a select you know you're about to invalidate, use the sql_no_cache keyword
- there may be a plugin somewhere to see the contents of the query cache
- Jay: "I whip the query cache like a rented mule"

Try variations on a theme when optimizing (ed.:"trial and error")

Join hints are dangerous (use index(), force index(), ignore index())
- could get out of date
- inform your teammates when you use them

Explain output
- "type": the "access strategy" used to grab this data
- "row": estimated rows for this set
- "type: const" - accessing non-nullable unique key; must return 1 or 0 rows only. Can be propagated across joins: if 1 row returned only 1 row to join against another non-null unique index.
- low cardinality equals poor index optimization (cardinality = the number of distinct values in this col over the number of rows)
- "type: index" - index scan usually bad (ed.: but better than a table scan). I asked him about secondary vs. a clustered index scan in InnoDB
- "extra: using index" - usually good, means a covering index is used and no bookmark lookup required

don't functionally transform an indexed column, including coalesce() or isnull() (this will obviously prevent you from being able to a lookup on the index; mysql does not yet support indexes on derived columns); avoid using a non-deterministic function rather than a literal value if you can, because non-deterministic functions cannot be cached.

Stored procedures aren't very good in MySQL
- the compile cache is specific to each connection

The slides from an older version of this presentation are available here. I posted video of another presentation of his here.

I'm also trying to review the video from Peter Zaitsev's talk on sharding in MySQL, having a few issues there.

More complete regex support coming in 2.7?

Looking through the python issue tracker, looks like they're working on almost my exact wishlist of missing features in python's re module.

Including:

possessive quantifiers, atomic grouping
(?flag:...), (?-flag:...) support, (?flag) scoped correctly
variable length lookbehinds

So all of this will be ready for 2.7, right? right?

Tuesday, November 18, 2008

The blackhole storage engine

MySQL supports multiple storage engines, such as MyISAM and InnoDB. One such storage engine is the BLACKHOLE storage engine. Everything you write to a table in this storage engine goes to /dev/null, in other words, it doesn't go anywhere.

So, what is this useful for?

So, while the actual table content isn't stored, the binary log does still exist (the binary log stores a list of modifications to a table such that if the db crashes you don't lose anything). You can therefore use this to replicate the blackhole table to another table somewhere else that uses a more normal storage engine (recall that when you replicate a table, you don't have to use the same storage engine on each side). This allows you to insert into a table asynchronously according to your replication settings.

Monday, November 17, 2008

Row null mask

Mysql contains the nullness of a column in a byte or bytes called the row null mask. Your row size will be one byte larger per every 8 columns that are nullable; they each get a single bit in the same shared byte.

This means that if you have one nullable column, you get seven more for free! It also means that if you one nullable column, you have to pay the same overhead as if you had 7 more nullable columns.

Get subpattern matches balanced even if they're optional

This will give you a mysterious "unbalanced grouping" error:

re.sub('.(.)?)', r'aaa\1')

Because the parenthetical grouping is optional, trying to access it via the \1 backreference will result in an error.

The solution is to replace the ? operator with a null alternative in the paren:

re.sub('.(|.)', r'aaa\1')

Which gives the grouping the option of containing nothing, thereby maintaining the existence of a backreference (though the backreference will of course contain an empty string, which is what you would actually expect).

Back

"Next stop Old Greenwich, then we're in Stamford. Wouldn't want you to be sleeping. The station yard is has scary stuff, big rats. The yardmaster makes terrible carvings."

With these surreal words the conductor shook me awoke. I was totally passed out on the metronorth train, something I'm normally not able to do. That's an exact quote, as least as well as I can remember it.

Almost everything that could go wrong on the trip back did and I'm glad to be home.

I met lots of database luminaries this weekend, the guy who founded MySQL, the guy who founded SQLite, the guy who invented BLOBs, and pretty much everyone who's ever written a book about MySQL. I learned all sorts of esoteric knowledge. I'm eager to see the videos of the sessions in other tracks I missed. Monty seemed pleased that I seemed to like the strange scandinavian fish-flavored drink he tried to make everyone try after the sessions ended on Saturday.

Saturday, November 15, 2008

Jay Pipes

Jay's presentation was by far the most informative. I pinned him down afterward and asked him a bunch of questions.

I'll transcribe my notes later.

memcached msql UDFs

Patrick Galbraith has just given an intriguing talk on his memcached functions for mysql. This allows you to work with memcached seamlessly via triggers on mysql objects.

Meanwhile, Aker is talking about gearman, an open source messaging system

SQLite

Dr. Richard Hipp: "We're not trying to compete with MySQL, we're trying to compete with fopen()"

SQLite is intended for managing disk access for a single application.

Charlottesville

Charlottesville is a cute little down, with an extensive red brick pedestrian arcade. It actually has a European feel to it. I was expecting a fork in the road with two sticks stuck in the ground and the conference would take place around there.

I was expecting to see drunken rednecks, and while I have seen a few of those, there are a lot of hippies selling whatever on the street and hipsters as well. There seems to be hipsters everywhere you go, whether it be New York, Virginia, Thailand, Japan, all with their black-rimmed glasses and v-neck sweaters listening to Nirvana. There must be some pan-hipster consciousness, weaving all of the threads of humanity together.

Everyone is extremely friendly, people smile at you, wave at you when you walk by. The waitress at breakfast called me "sweetie".

Thanks to our MySQL contact, last night I had dinner with some MySQL top guns at a "Himalayan" restaurant -- I sat next to Monty, and Aker was a few seats away.

Aker's just finished his keynote, and a Postgres guy is giving a talk on concurrency.

Thursday, November 13, 2008

limit on update and delete statements

You can use the LIMIT clause on delete and update statements, which will limit you to deleting or updating a set number of rows.

This is useful if you are feeling especially paranoid about making a mistake with a delete or update and nuking your table (everyone has done this once or twice). Just throw an extra limit 1 clause on your statement and you can be absolutely sure that only one row will be affected.

Wednesday, November 12, 2008

Nested list comprehensions

Fun fact: you can nest list comprehensions.

This syntax looks like:

[(i, j) for i in range(4) for j in range(i)]

Which is exactly equivalent to:

x = []

for i in range(4):

for j in range(i):

x.append((i, j))

length() vs. char_length()

If you're using a multi-byte character encoding (e.g. utf8, ucs2) be aware than length() and char_length() may return different results.

length() will return the length in bytes; char_length() will return the number of character in the string, dependent on the character encoding.

The alternative insert syntax

Sometimes, if you're doing an insert, it's easier to type the alternative insert syntax.

insert into sometable set column1='new value', column2='new value'

Yes, this will work

Tuesday, November 11, 2008

Configure mysql to make it better

Create a file in your ~ directory, named .my.cnf . Drop everything and do this right now.

It should look like this:

[mysql]
prompt = 'mysql://\u@\h/\d> '
pager = less -FX

What do these settings do?

The prompt setting changes your mysql prompt to tell you what user, server and database you're currently using. This is vastly more useful if you have lots of connections to different mysql databases, which is constantly the case for me when I'm at work. I used the pseudo-url format used by the PEAR::DB module from PHP, which I think is kind of cool.

This makes my prompt look something like:

mysql://nick@localhost/service_demo>

The pager setting makes your query results appear in a pager program like less. Now, it's actually kind of annoying if every single result appears in a pager. So, the -FX flags tell less to avoid itself if what it is paging is less than 1 page height.

These two settings make the clui mysql client vastly more useful.

In 6.0 there will be an auto-vertical setting to use the \G vertical display automatically if the query result is wider than the current window. I thought of adapting this patch to make an auto-pager setting, but then I remembered about the -F flag to less which essentially does the same thing, although I don't know if there is a similar setting to newer pagers like most.

The last sections of the documentation contain lots of useful tips about these two settings -- it's a shame that they're so little known.

Populating summary tables using init_file

You can populate summary tables at server startup using the init_file variable. This will also allow you to keep summary tables under the memory storage engine, which is non-persistent and will be truncated on server restart.

Jay Pipes Google TechTalk Apr 2008

Monday, November 10, 2008

The hidden debug flag in the re module

The compile function of the re module, as I'm sure you already know, takes a second parameter of flags, such as re.I or re.X.

There is an undocumented flag, however, that will print out the parse tree of the regex. This is the re.DEBUG flag.

Here's some trivial output:

>>> re.compile(u'ell2\u1f4c', re.DEBUG)

literal 101

literal 108

literal 50

literal 8012

The 'literal' statements are followed by the ascii dec values of the characters in the pattern. I threw in a unicode character and you can see the unicode code point as the last item.

This will also tell you whether the pattern is being compiled for the first time or if it is being pulled out of the re module's cache.

Saturday, November 8, 2008

Who you are and who you have been authenticated as

In Mysql, to tell who you are and who you have been authenticated as, use the user() and current_user() functions.

Example:

mysql> select user(), current_user() \G
user(): nick@localhost
current_user(): nick@%

This is useful because you may have overlapping authentication rules, such as 'nick@192.168.%' and 'nick@%' and 'nick@localhost'.

Mysql permissions are confusing and frustrating, hopefully this helps debug some of its issues.

Friday, November 7, 2008

Create a new table from another table's definition

To create a table new_table as a blank copy of another table called another_table, use:

create table new_table like another_table;

Of course you can populate new_table with the contents of another_table like so:

insert into new_table select * from another_table;

You can combine these into one step! But, unlike the two-step method this way has the serious limitation of losing any of the indexes on the original table:

create table new_table select * from another_table;

Thursday, November 6, 2008

Make a more efficient list of integers

If you have a list that's all of the same type, and that type is numeric, use the array module for more efficiency.

Keep a list always sorted

Use the bisect module

You never need to cover the primary key in Innodb

In Innodb (unlike MyISAM), the primary key is stored as a lookup in every secondary index; therefore, there is never a need to add the primary key column to a secondary index in order to cover it.

In MyISAM secondary indexes store the internal row ID as a lookup, so this trick will not work and you need to explicitly add the primary key column if you want to cover it in an index.

Brief aside: in some database systems you can specify columns to add to the index pages without affecting the index structure or constraints. For example, an index like (some_unique_column, primary_key) could cover the primary key in the index without destroying the uniqueness constraint, since of course the primary key would be independently unique.

Tuesday, November 4, 2008

How to tell if the current output is a terminal

sys.stdout.isatty()

How is the encoding of stdout chosen?

It's actually determined by the LC_TYPE environmental variable

Python: how is sys.stdout.encoding chosen?

Monday, November 3, 2008

arithmatic and bitwise operators on sets

The set object supports basic arithmatic operators and bitwise operators.

>>> a = set('abracadabra')

>>> b = set('alacazam')

>>> a # unique letters in a set(['a', 'r', 'b', 'c', 'd'])

>>> a - b # letters in a but not in b set(['r', 'd', 'b'])

>>> a | b # letters in either a or b set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])

>>> a & b # letters in both a and b set(['a', 'c'])

>>> a ^ b # letters in a or b but not both set(['r', 'd', 'b', 'm', 'z', 'l'])

Cool way to sort a dict by key or value

A simple way to sort a dict is to use iteritems() with a lamdba function, like so:

somedict = {...}
[ x[0] for x in sorted(somedict.iteritems(), lambda x, y: cmp(x[1], y[1])) ]

The cooler (and faster) way is to use itemgetter from the operator module, with the key parameter to sorted().

To sort a dict by keys, use:

import operator
somedict = {...}
[k for k, v in sorted(d.iteritems(), key=itemgetter(0))]

To sort a dict by values, instead use the line:

import operator
somedict = {...}
[k for k, v in sorted(d.iteritems(), key=itemgetter(1))]

Remember how itemgetter works: it is identical to the key lookup operator. In other words, in the code below,

somedict = {'a': 1}
operator.itemgetter('a')(somedict)
somedict['a']

the last two lines do exactly the same thing. This is a neat trick to avoid a lambda call, since subroutine calls are relatively expensive.

How to get a builtin back after you've overloaded it

One of the things in python that I'm not such a fan of is how easy it is to overload a built-in function without realizing it, since functions and variables live in the same namespace. 'str' and 'type' are built-in functions that you need fairly often and also common variable names used in programming; if you have both type() and str() will break and misbehave without it necessarily being obvious why.

For instance:

i = 1
type = 'obj'
type(i)

will result in a "is not callable" TypeError, which may not be that clear at first glance.

To get a builtin back even though it's been overloaded, just specify the builtin namespace, as shown below:

i = 1
type = 'obj'
__builtins__.type(i)

Also, if you del the local 'type' variable, the builtin 'type' will be reinstated. After that, built-ins themselves cannot be deleted, and you will get a NameError if you try.

However, it's probably best to just avoid overloading builtins by convention. For instance, I try to remember to name a generic string variable 'astr' rather than 'str'.

enumerate

Say you want to loop over an iterable, but you want to keep an index of your position.

The most obvious way to do this is to use a separate counter variable, like so:

aseq = 'hello'
i = 0
for c in aseq:
   ....
   i += 1

While that will work, the cooler way is to use enumerate. Enumerate encapsulates any iterable in another iterable that will return the value as yielded by the inner iterable with the value of the current index.

aseq = 'hello'
for (i, c) in enumerate(aseq):
   ....

Wednesday, October 29, 2008

The evolution of coolness string interpolation in SQL in python

Phase 1, tuples:

c.execute("update person set name=%s, address=%s where id=%s", ('Bob Smith', '3 Red Rock Rd', 1))

Phase 2, dicts:

c.execute("update person set name=%(name)s, address=%(addr)s where id=%(id)s", {'name':'Bob Smith', 'addr':'3 Red Rock Rd', 'id':1))

Phase 3, locals():

name = 'Bob Smith'

addr = '3 Red Rock Rd'

id = 1

c.execute("update person set name=%(name)s, address=%(addr)s where id=%(id)s", locals())

Most people don't seem aware of dict string interpolation, which is really much more convenient when you're using a large number of variables. This was cool but seemed annoying because you now need to make a large dict, requiring more typing. Now, locals() returns a dict of all of variables that exist in the local scope. Since we probably have all of the stuff you want to stick into an SQL statement sitting in other variables anyway, combine dict interpolation and locals() and we can stick variables into SQL much more conveniently.

Tuesday, October 28, 2008

apply a custom sort order

The best way to apply a custom order to a field is to use the field() function.

Via this comment thread.

Monday, October 27, 2008

How to catch more than one kind of exception in the same except statement

Example:

import urllib
try:
return urllib.urlopen(source)
except (IOError, OSError):
pass

Example borrowed from the always excellent Dive Into Python.

speed up temporary table inserts

Temporary tables use the MyISAM storage engine by default (in fact, you can specify any storage engine for a temporary table, a fact that few people realize).

You can thusly speed up bulk inserts into a temporary table just like any other table. For the case of default MyISAM temporary tables, I was able to approximately halve the insert execution time of a big bulk insert by surrounding the insert operations with alter table index_builder disable keys and alter table index_builder enable keys.

Another tip for temporary tables is if the temporary table is reasonably small, you can specify engine=Memory to create the table in memory rather than MyISAM, just be aware of the implications of memory/heap tables.

replacement apply syntax

The useful apply() function is deprecated.

The alternative is the "extended call syntax". It looks identical to the *kargs and **kwargs syntax in function definitions, but in the inverse in that you use it to pass arguments to functions rather than defining their input.

def foo(a,b):

pass

foo(*(1,2))

foo(**{'a':1, 'b', 2})

either of these will generate errors:

foo(**{'v':3})

foo(*(1,2,3,))

Friday, October 24, 2008

Tell if a table has been modified

Want to tell if two tables are exactly the same?

Use CHECKSUM TABLE

Or, for more advanced methods, use maatkit's mk-table-checksum.

Thursday, October 23, 2008

Quick way to interpolate local variables into a string

x = "variable 1"

y = "variable 2"

"%(x)s %(y)s" % locals()

This is so easy it hurts.

We can use the dict syntax to interpolate variables into a string, which is often more convenient; and, we can use locals() to return a dict of all locally-bound variables.

Wednesday, October 22, 2008

How to import a bunch of local variables into a dict by name

Here's a cool little function I wrote:

def _vars2dict(vars, *vars_wanted):
vars_wanted = set(vars_wanted)
return dict(filter(lambda i: i[0] in vars_wanted, vars.iteritems()))

Call it with locals(), and any subsequent variables passed to it will be returned in a dict by name.

Here's an example, using a variable we create and a built-in function

>>> x =1
>>> _vars2dict(locals(), 'log', 'x')
{'x': 1, 'log': }

This is useful if you have a lot of local variables that you want to interpolate into a string using the more intelligible dict format rather than the tuple format.

Monday, October 13, 2008

The defaultdict object

In the collections module, there's a object called defaultdict. This is useful (and supposedly faster) if you were to instead call setdefault() for every insert into a regular dict.

This is especially useful for nested dicts; rather than testing for existence or calling setdefault() at every level of the dict; the following code below will run works and illustrates an example:

bidict = collections.defaultdict(dict)

bidict['a']['b'] = 1

Both levels of the dict will be initialized correctly automatically.

A more pedestrian usage might be to initialize every value to zero, which can be accomplished simply, since calling int() will always return 0:

termfreq = collections.defaultdict(int)

termfreq['unknownterm'] == 0

Sunday, October 12, 2008

itertools.groupby

dave-squared has a post with some examples of how to use the interesting itertools.groupby() function, which works kind of like the SQL group by clause.

Friday, October 10, 2008

Example usage of the glob module

The python glob module in the standard library is a simple, useful way of expanding wildcard matches to filenames. It dovetails nicely with the os, os.path, and shutil modules.

Example usage:

import os, glob

TEMP_DIR = '/tmp/whatever'

for file in glob.glob(os.path.join(TEMP_DIR, '*.tmp')):

os.remove(file)

The glob module contains just two functions: glob() and its ~~case-insensitive~~ generator-returning equivalent, iglob().

Thursday, October 9, 2008

How to convert all tabs into spaces in vim

One of the sometimes annoying things about python is the IndentationErrors you get from the conflicting appearance of spaces and tab characters, the difference between which is invisible to the naked eye.

In vim, to convert spaces to tabs throughout a file, just use the command %retab!

How to automatically include modules in your python shell

1. Create a environmental variable in your shell config PYTHONSTARTUP that contains a full path to a python script which we're about to create. For bashrc, this looks like:

export PYTHONSTARTUP="/home/nick/.pythonstart.py"

This file will be executed every time you start a python shell (but not every time you execute a script).

2. In the python file we specified put import statements including whatever modules you want. Mine looks like:

import os, sys, re

from pprint import pprint

from datetime import datetime

from math import *

Every time you start an interactive session these modules will be automatically included and you won't need to bother importing them.

You can see the official docs on the PYTHONSTARTUP variable in the manual.

The python set object

An oft-overlooked builtin python type is the set type, and its immutable equivalent, frozenset. A set is like a list or tuple except every member can only appear once; more accurately, it is like a dict that only has keys and no values.

Have you ever found yourself doing something like this?

found = {}

for item in someiterator():

if not found.has_key(item):

found[item] = 1

...

else:

...

The more optimal way of doing this in python is to use a set

found = set()

for item in someiterator():

if not item in found:

found.add(item)

...

else:

...

The 'in' operator used with sets appears to be as fast as lookup for a key in a dict, yet does not carry the memory overhead and inelegance of using a dict where you don't actually care about the values.

Another useful use of set is when you want to make the members of a list or tuple unique; just cast the list to a set, and cast it back if and when you need to do so.

Frozenset does not appear to have any speed advantages over a set but probably takes up less memory.

Tuesday, October 7, 2008

Expanding wildcard matches on files

Use the glob module.

See also the fnmatch.

Um, all of them

Monday, October 6, 2008

2 snakes enter, 1 snake leaves

Valuable slashdot discussion on python 2.6 and 3.0

Slashdot: Python 2.6 to clear the way for 3.0, coming next month

tuples are always greater than strings

Any tuple is greater than any string.

(1,2) > 'ab' # True

I'm told this is for archaic reasons.

In python 3.0, attempting this will raise an exception.

Tuples are compared according to a lexicographical comparison of their elements in order

If you have a list of tuples, sort() will sort them by lexicographical order.

Example:

[(3, 'zzzzd'), (2, 'hello'), (3, 'whatever')]

calling .sort() on this object will result in

[(1, 'whatever'), (2, 'hello'), (3, 'zzzzd')]

This is because python evaluates tuples in comparisons according to pairs of indexes in order. If there is a difference between the 0-index of the tuples, it will be evaluated according to that first pair; if the first indexes are the same, it will move on the second index, and so on.

(1, 1) < (2, 0) # True -- based on first elements

(1, 1) < (-1, 0) # False -- based on first elements

(1, 1) < (1, 2) # True -- first elements the same, comparison decided based on the second elements

To make the comparison explicit and to force it to use the first and only the first index in the comparison, use the cmp kwarg to sort:

[(1, 'whatever'), (2, 'hello'), (3, 'zzzzd')].sort(cmp=lambda x, y: cmp(x[0], y[0]))

(note: sort will sort the list in-place, so the above won't actually evaluate to anything. sorted() returns a sorted copy of whatever it's run on, and accepts an identical cmp parameter.)

This will make how the comparison is being executed explicit, and will also allow you to sort using any index of the tuple or any aspect of the objects in the list.

list.sort() and sorted() also accept a reverse kwarg that accepts a boolean that will return the list in the opposite order -- though a custom cmp kwarg could do the same implicitly.

String interpolation using dicts

Everyone knows about about the string interpolation operator, %:

"(%s, %s)" % (1, 2)

Rarely do you see the often more useful use of interpolation using dicts. Not only is it much more readable once the number of variables you're trying to interpolate grows large, it will also allow you to repeatedly insert the same element if need be:

"(%(key)s, %(key)s)" % {'key': 1}

The single most valuable thing I've learned about python

... pydoc and help()

pydoc is a executable that's installed with python, pass it the name of a module to see a generated man page for that module (e.g. pydoc string). You can also pass it a file path, but for a local python file, in order for it to work you have to use the full path or precede the filename with './', (e.g. pydoc ./whatever.py).

help() runs within a python shell. Run it without arguments to get general help with python; pass a string containing the name of a module to get documentation for that module, or pass any it any existent module or object reference and it will generate docs for that object.

This will save you a lot of lookups to the manual. For instance, to remind yourself of the methods of dict objects, you can just run help() on any dict object, or just an empty such object (e.g. help({})).

Sunday, October 5, 2008