Crate: Why require the PRIMARY KEY column to be present in a PARTITIONED BY clause? -
can me understand reading in documentation? https://crate.io/docs/reference/sql/partitioned_tables.html
in these example tables, column id long
not primary_key
; indeed, id
not primary key here, because noted below "if primary key set, must present in partition by
clause"
in app, i've historically had primary key
on id string not null
, want add partitioning on table, on generated date column in example partition_date timestamp generated date_trunc('day', created_at)
. i've read partitioning on date column speed of queries scoped time period (counting today's records, example, hit today's partitions), , helping me archive older frames of data (e.g. > 180 days), don't want lose performance of single pk lookups.
so since can't partitioned (partition_date)
, best if i...
a) remove primary key constraint id
? i'm nervous affect performance single row lookups! in context, makes sense pk must in partition key, because lookup where id = "abc-123"
should ideally have hit single node.
or
b) use both columns partition key, partitioned (id, partition_date)
-- seems weird, because instinctively, want assume id
have high cardinality , bad choice partition column, , 'day' or 'month' better, shown in example on docs. in case, pk lookup hitting every partition, or know go? if run aggregate query scoped today only, hit every partition or 1 holding today's data?
so that's great question! since partitions "sub tables" of sort, reducing size of queried data.
a primary key influences routing in cratedb, adding partitioned table (which requires more extensive routing) deny non-primary key column in partitioned clause. thus, options follows:
- a) while removes ability pk lookups effectively, seems sensible option - regular string lookups can accelerated using fulltext index - going remove read-after-write-consistent primary key lookups add. depending on way generate primary key, might feasible use internal
_id
column instead (for lookups), or issuerefresh table
before id-lookup. - b) result in many partitions there primary keys (and since unique ...) - option create many partitions
since option b) result in chaos, i'd recommend option a). however, if primary key lookups crucial application , expected amount of data not huge (a couple of million fine - depending on cluster size , machine specs of course), might work fine without partitioning!
cheers, claus
Comments
Post a Comment