Tuesday, April 10, 2012

Cassandra data model - Materialized Views or Index CF, which way is better to handle 20 single column indexes to support timeline query(s) in one table?


Howdy,

Can I ask a Cassandra data model question here about time series data and timeline query?
Materialized Views or Index CF, which way is better to handle 20 single column indexes in one table?

We have a book inventory table with 20 columns, 300 million rows, average row size is 1500 bytes.

create table book(
 book_id,
 isbn,
 price,
 author,
 titile,
 ...
 col_n1,
 col_n2,
 ...
 col_nm
);

Data usage:

We need to query data by each column, do pagination as below, typical timeline query,

select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
select * from book where price  < 992   order by price  descending limit 30;
select * from book where col_n1 < 789   order by col_n1 descending limit 30;
select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
...
select * from book where col_nm < 978 order by col_nm descending limit 30;

Write: 100 million updates a day.
Read : 16  million queries a day. 200 queries per second, one query returns 30 rows.

***
Materialized Views approach

{"ISBN_01",book_object1},{"ISBN_02",book_object2},...,{"ISBN_N",book_objectN}
...
We will end up with 20 timelines.


***
Index approach - create 2nd Column Family as Index

'ISBN_01': {'book_id_a01','book_id_a02',...,'book_id_aN'}
'ISBN_02': {'book_id_b01','book_id_b02',...,'book_id_bN'}
...
'ISBN_0m': {'book_id_m01','book_id_m02',...,'book_id_mN'}

This way, we will create 20 index Column Family(s).

---

If we choose Materialized Views approach, we have to update all
20 Materialized View column family(s), for each base row update.
Also, Materialized Views approach will use 20 times more storage space, increase from 500GB base table size to 10TB.
Will the Cassandra write performance acceptable?

Redis recommend building an index for the query on each column, that is your 1st strategy - create 2nd index CF:
http://redis.io/topics/data-types-intro
(see section [ Pushing IDs instead of the actual data in Redis lists ]

Should we just normalize the data, create base book table with book_id as primary key, and then
build 20 index column family(s), use wide row column slicing approach, with index column data value as column name and book_id as value?
This way, we only need to update fewer affected column family that column value changed, but not all 20 Materialized Views CF(s).

Another option would be using Redis to store master book data, using Cassandra Column Family to maintain 2nd index.

What will you recommend?


Thanks,
Charlie | DBA developer

p.s.

Gist from datastax dev blog ( http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra )
"
 If the same event is tracked in multiple timelines,
 it’s okay to denormalize and store all of the event data in each of those timelines.
 One of the main principles that Cassandra was built on is that disk space is very cheap resource;
 minimizing disk seeks at the cost of higher space consumption is a good tradeoff.
 Unless the data for each event is ^very large^, I always prefer this strategy over the index strategy.
"

Will 1500 bytes row size be large or small for Cassandra from your understanding?
A: 500MB is the limit. "performance degradation starts at 500MB rows, its very slow if you hit this limit." Answer from user@cassandra.apache.org