Wednesday, November 19, 2008

OpenSQL camp Notes

I've finally got my notes from the saturday sessions of opensql camp.

What follows is a transcription of portions of them that I felt were worth sharing:

(Note also that many of the videos of the sessions have been uploaded.)

How SQL Database Engines Work, by Dr. Richard Hipp, founder of SQLite

- You can't mix ASC and DESC in a multi-column sort and use an index on those same columns to optimize away a filesort operation; you can ASC, or DESC, but not both; UNLESS you declare the columns to be ASC or DESC or vice versa in the index definition.

- To use an index to optimize both where and order by clauses: where clause must be first, order by cols must be second (but the two can overlap). Selected cols that are at least covered by the index can appear in any order.

- OR operations - if you can get the row ids out in order from indexes, you don't need to make a temporary table - could work for AND operations on different cols. and you can nest these intersection & unions using a bitmap index.

Unfortunately, Richard didn't have his slides on the internal mechanics of join operations, which I wold have wanted to see.

Lightening talks, misc presenters:

- Aker: Talks about Gearman, an open source messaging system

- Jay Pipes: talks about Google proto buffers, an open source object multi-language object serialization system

- Patrick Galbraith: talks about a suite of UDF functions for mysql for interacting with memcached. You can then use these in triggers to populate memcached data.

- Jim Starkey: talks about peer-to-peer cloud database system he's working on

Join-fu, by Jay Pipes

Schema is the foundation of performance

Different storage engines have different strengths/weaknesses
- in order to optimize, you need to understand these cost/benefits
- fully optimized systems rarely use just one storage engine

Varchars are silently converted to chars in memory/temporary/derived tables!
- memory tables are fixed-length only. no variable length columns.
- memory, temporary tables, and derived tables are all the same in this respect (I asked him about this; he said that temporary tables use the memory storage by default, but my testing and I'm pretty sure that temporary tables use by default whatever your current default storage engine is, which is usually MyISAM)
- Especially a problem if you have a wide varchar that's usually mostly empty,
- especially a problem if you're using utf8 because space used will become 3 x the char length (because that's what char() does to make sure it remains fixed length, expanding max theoretically possible byte length for that number of characters. Why can't temporary/memory/derived tables just convert to ucs2?)

Vertical partitioning
- put the most commonly used cols in a separate table with fixed length rows
- you can fit a lot more in a data page this way, speeds up scans/reads
- you can put text columns in a separate table w/ fulltext index if you need it; you can even replicate an innodb table to a myisam copy of itself if you need innodb features plus the fulltext index

Query cache
- Any modification to any table involved in the query will invalidate it in the query cache
- so, perhaps put frequently updated cols in another table
- or, if you do a select you know you're about to invalidate, use the sql_no_cache keyword
- there may be a plugin somewhere to see the contents of the query cache
- Jay: "I whip the query cache like a rented mule"

Try variations on a theme when optimizing (ed.:"trial and error")

Join hints are dangerous (use index(), force index(), ignore index())
- could get out of date
- inform your teammates when you use them

Explain output
- "type": the "access strategy" used to grab this data
- "row": estimated rows for this set
- "type: const" - accessing non-nullable unique key; must return 1 or 0 rows only. Can be propagated across joins: if 1 row returned only 1 row to join against another non-null unique index.
- low cardinality equals poor index optimization (cardinality = the number of distinct values in this col over the number of rows)
- "type: index" - index scan usually bad (ed.: but better than a table scan). I asked him about secondary vs. a clustered index scan in InnoDB
- "extra: using index" - usually good, means a covering index is used and no bookmark lookup required

don't functionally transform an indexed column, including coalesce() or isnull() (this will obviously prevent you from being able to a lookup on the index; mysql does not yet support indexes on derived columns); avoid using a non-deterministic function rather than a literal value if you can, because non-deterministic functions cannot be cached.

Stored procedures aren't very good in MySQL
- the compile cache is specific to each connection

The slides from an older version of this presentation are available here. I posted video of another presentation of his here.

I'm also trying to review the video from Peter Zaitsev's talk on sharding in MySQL, having a few issues there.

No comments: