shakti
about
universal database and analytics
why: 100 times faster than spark, redshift, bigquery and snowflake.
who: arthur whitney and janet lustgarten, fintan quill, abby gruen, anton nikishaev
 e.l. whitney [1920-1966] multiple putnam winner
 k.e. iverson [1920-2004] turing award winner
mission: excellence

benchmark
about
k-sql is consistently 100 times faster (or more) than redshift, bigquery, snowflake, spark, mongodb, postgres, ..
same data. same queries. same hardware. anyone can run the scripts.
benchmarks:
 taq  1.1 trillion trades and quotes
 taxi 1.1 billion nyc taxi rides
 stac ..

Taq 1.1T https://www.nyse.com/publicdocs/nyse/data/Daily_TAQ_Client_Spec_v2.2a.pdf
q1: select max price by sym,ex from trade where sym in S
q2: select sum size by sym,time.hour from trade where sym in S
q3: do(100) select last bid by sym from quote where sym in S  / point select
q4: select from trade[s],quote[s] where price<bid             / asof join
S is top 100 (10%)

time(ms) 16core 100 days
          q1     q2     q3   q4
k         44     72     63   20
spark     80000  70000  DNF  DNF  - can't do it
postgres  20000  80000  DNF  DNF  - can't do it
..
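The four taq queries above map onto ordinary dataframe operations. A minimal pandas sketch on a toy trade/quote pair (the tables, symbols and values here are illustrative stand-ins, not the benchmark data; q3 is shown once rather than repeated 100 times):

```python
import pandas as pd

# toy stand-ins for the trade and quote tables (illustrative only)
trade = pd.DataFrame({
    "sym":   ["A", "A", "B", "B"],
    "ex":    ["P", "P", "Q", "Q"],
    "time":  pd.to_datetime(["09:30", "10:30", "09:45", "10:45"]),
    "price": [10.0, 12.0, 20.0, 18.0],
    "size":  [100, 200, 300, 400],
})
quote = pd.DataFrame({
    "sym":  ["A", "A", "B", "B"],
    "time": pd.to_datetime(["09:29", "10:29", "09:44", "10:44"]),
    "bid":  [11.0, 9.0, 21.0, 17.0],
})

S = ["A", "B"]  # the symbol subset

# q1: select max price by sym,ex from trade where sym in S
q1 = trade[trade.sym.isin(S)].groupby(["sym", "ex"]).price.max()

# q2: select sum size by sym,time.hour from trade where sym in S
t2 = trade[trade.sym.isin(S)]
q2 = t2.groupby(["sym", t2.time.dt.hour])["size"].sum()

# q3: select last bid by sym from quote where sym in S (point select)
q3 = quote[quote.sym.isin(S)].groupby("sym").bid.last()

# q4: asof join -- each trade picks up the most recent prior quote per sym,
#     then keep rows where price<bid
j = pd.merge_asof(trade.sort_values("time"), quote.sort_values("time"),
                  on="time", by="sym")
q4 = j[j.price < j.bid]
```

The asof join is the query the row stores in the table above could not finish: in k it is a primitive, while in ansi systems it becomes a correlated subquery or window-function workaround.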
Taxi 1.1B https://tech.marksblogg.com/benchmarks.html
q1: select count by type from trips
q2: select avg amount by pcount from trips
q3: select count by year,pcount from trips
q4: select count by year,pcount,_ distance from trips

          cpu   cost   core/ram  elapsed       machines
k         4     .0004  4/16      1             1*i3.2xlarge(8v/32/$.62+$.93)
redshift  864   .0900  108/1464  8(1 2 2 3)    6*ds2.8xlarge(36v/244/$6.80)
bigquery  1600  .3200  200/3200  8(2 2 1 3)
db/spark  1260  .0900  42/336    30(2 4 4 20)  21*m5.xlarge(4v/16/$.20+$.30)

Stac https://www.stacresearch.com/
..

taq.k
/ nyse taq: 1 trillion rows (50 million trades, 1 billion quotes per day)
e:{x%+/x:exp|6*(!x)%x}
N:1|_.5+50e6*x:e m:6000
t:{[[]t:09:30:00+x?06:30:00;e:x?"ABCD";p:10+x?90.;z:100+x?900]}
q:{[[]t:09:30:00+x?06:30:00;e:x?"ABCD";b:10+x?90.]}
A:100#S:m?`4
T:S!t'N
Q:S!q'N
d:16
\t:d select sum p by e from T A
\t:d select sum z by t.h from T A
\t:d*100 {select last b from x}'Q A
\
a:*A
\t select from T a,Q a where p<b

time(ms) 16core 100 days
          q1     q2     q3   q4
k         44     72     63   20
spark     80000  70000  DNF  DNF
postgres  20000  80000  DNF  DNF
..

/SAVE csv
\t `t.csv 2:t
\t `q.csv 2:q
/SAVE LOAD csv
\t t:`s=2:`t.csv
\t q:`s=2:`q.csv
\t `taq 2:(t;q)

taxi.k
/ apples to apples: k is 100 times faster than spark, redshift, snowflake, bigquery ..
/ same data. same hardware. same queries.
/ 1.1 billion taxi rides https://tech.marksblogg.com/benchmarks.html
select v,count(*)as n from t group by v
select p,avg(a)as a from t group by p
select extract(year from d)as year,p,count(*)as n from t group by year,p
select extract(year from d)as year,p,m,count(*)as n from t group by year,p,m

timings(aws i3.4xlarge)
 k    sparky  shifty  flaky
 .0   12      19      20+?
 .3   18      15      30+?
 .1   20      33      50+?
 .5   103     36      60+?
 ---------------------------
 .9   153     103     160+

https://tech.marksblogg.com/billion-nyc-taxi-rides-spark-2-4-versus-presto-214.html
https://tech.marksblogg.com/billion-nyc-taxi-rides-redshift-large-cluster.htm
https://blog.tropos.io/analyzing-2-billion-taxi-rides-in-snowflake-a1fbed8b7ba3
bottom line: sparky/shifty/flaky/bigqy are all the same. expensive and slow.
\
\l t
select count by v from t
select avg a by p from t
select count by d.year,p from t
select count by d.year,p,m from t
\
/data
curl -s https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv > 2017.01
..
import`csv
\t x:1:`2017.01
\t t:+`v`d`p`m`a!+csv["bd ii 2";x]
\t t:`d grp t
\t "t/"2:t
/synthetic
d:2017.01.01+!n:1461
g:{[[]v:x?2;p:x?9;m:x?100;a:.01*100+x?900]}
`t 2:t:d!g'n#375000

document
sql.d
shakti universal database includes:
 ansi-sql [1992..2011] ok for row/col select.
 real-sql [1974..2021] atw@ipsa does it better.

join: real-easy ansi-ok
 real: select from T,U
 ansi: select from T left outer join U
group: real-easy ansi-annoy
 real: select A by B from T
 ansi: select B, A from T group by B order by B
simple: real-easy ansi-easy
 real: select A from T where C or D, E
 ansi: select A from T where (C or D) and E
complex: real-easy ansi-awful
 asof/joins    select from t,q where price<bid
 first/last    select last bid from quote where sym=`A
 deltas/sums   select from t where 0<deltas price
 foreignkeys   select order.cust.nation.region ..
 arithmetic    x+y e.g. combine markets through time

example: TPC-H national market share query 8
http://www.qdpma.com/tpch/TPCH100_Query_plans.html
what market share does supplier.nation BRAZIL have by order.year
for order.customer.nation.region AMERICA and part.type STEEL?
real:
 select revenue avg supplier.nation=`BRAZIL by order.year from t
  where order.customer.nation.region=`AMERICA, part.type=`STEEL

ansi:
 select o_year,
  sum(case when nation = 'BRAZIL' then revenue else 0 end) / sum(revenue) as mkt_share
 from (
  select extract(year from o_orderdate) as o_year, revenue, n2.n_name as nation
  from t,part,supplier,orders,customer,nation n1,nation n2,region
  where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey
   and o_custkey = c_custkey and c_nationkey = n1.n_nationkey
   and n1.n_regionkey = r_regionkey and r_name = 'AMERICA'
   and s_nationkey = n2.n_nationkey
   and o_orderdate between date '1995-01-01' and date '1996-12-31'
   and p_type = 'STEEL') as all_nations
 group by o_year order by o_year;

comparison: real vs ansi (sqlserver/oracle/db2/sap/teradata/..)
          real                ansi
install   1 second            100,000 second
hardware  1 milliwatt         100,000 milliwatt
software  160 kilobyte        8,000,000 kilobyte (+ 10,000,000 kilobyte O/S)
mediandb  1,000,000 megarow   10 megarow
https://docs.microsoft.com/en-us/sql/database-engine/install-windows/install-sql-server?view=sql-server-ver15

shakti is essential for analyzing big (trillion row+) and/or complex data.

download
express license
 li2.0 04.28  you are accepting terms of eula by downloading li2.0
 mi2.0 04.28  you are accepting terms of eula by downloading mi2.0
enterprise
 Li2.0 04.28  you are accepting terms of eula by downloading Li2.0
 Mi2.0 04.28  you are accepting terms of eula by downloading Mi2.0

ffi
about
import`csv
csv["|tcnif";1:":ffi/t.csv"]
import`json
json"[3.14]"

csv.c
#include"k.h"
#include<string.h>
#define W while
#define Z(a,b...) if(a){b;}else
#define N_(n,a...) {I _n=(n);W(_n--){a;}}
#define K(a...) ({K f[]={a};I n=sizeof f;K x=k(' ',n/8);memcpy(xK,f,n);k(x);})
#define min(a,b) ({typeof(b)_a=(a),_b=(b);_b<_a?_b:_a;})
#define Xk xK[i]
#define Yk yK[i]
#define yt _t(y)
#define rI ((I*)r)
#define xG ((C*)x)
#define xQ ((M*)x)
#define ty (y>>56)
typedef char C;typedef unsigned M;
M _Z3fudPcS_Pd(S,S,F*);
ZI Dm[]={0,31,61,92,122,153,184,214,245,275,306,337};
ZI dm(I m){R m+=10,1461*(m/12)/4+Dm[m%12]-306;}
ZI uL(I n,S s){R n;}
ZK us(I n,S s){R(K)memcpy((S)k('C',n),s,n);}
ZI d2(S s){R*s*10+s[1]-528;}
ZI ub(I n,S s){R*s&1;}
ZI uc(I n,S s){R*(I*)s&~0ull>>64-8*min(8l,n);}
ZI d4(S s){R 100*d2(s)+d2(2+s);}
ZI ud(I,S),ut(I,S);
ZI ui(I n,S s){I r=0;N(n,r=s[i]-48+10*r);R r;}
ZK kC(S s){I n=strlen(s);R(K)memcpy((S)k('C',n),s,n);}
ZF uF(I m,I n,S s){P('-'==*s,-uF(m,n-1,s+1))I p=-1;W(++p<n&&'.'!=s[p]);n=min(n,1+m+p);p=n-1-p;I x=0;N(n,x+=('.'!=s[i])*(9*x+s[i]-48));F r=x;N(p,r/=10);R r;}
ZF uf(I n,S s){F x;_Z3fudPcS_Pd(s,n+s,&x);R x;}
ZI udt(I n,S s){int d=ud(10,s),t=ut(n-11,s+11);R 86.4e6*d+t;}
ZI ut(I n,S s){P(18<n,udt(n,s))P(!memchr(s,':',n),ui(n,s)/1000000000);int h=d2(s),m=d2(3+s),c=d2(6+s),p=ui(3,9+s);R p+1000*(c+60*(m+60*h));}
ZI ud(I n,S s){s[4]=s[7]='-';int y=d4(s),m=d2(5+s),d=d2(8+s);R dm(12*(y-2001)+m-1)+d-1;}
#define TY (nty>i&&' '!=ts[i])
#define IX(z) (0<z?ix[z-1]:-1)
#define out() ({k(0,(K)ix);R 5;})
#define NC(a...) {I ci=-1;N(cn,Z(TY,++ci;I t=ts[i];a))}
#define CE(u,a...) ({z+=d;I n=IX(1+z)-IX(z)-1;S s=p+(1+IX(z));I c='\r'==s[n-1];u(a n-c,s);})
#define c0(t,T,set) {K x=k(T,rn);c=i;Z(KT==xt,I z=c;xt+=8*(18<CE(uL)));N(rn,r=i;I z=r*cn+c;set);yK[ci]=x;break;}
#define c(t,T,xk,u) case t:c0(t,T,xk=CE(u))
K ck(K a,K b){P(!b,3);S ts,cl=0;C sp=',';
 {K x=a,y=b;P(tx||ty,8);Z(xt,P(1!=xt,8)ts=(S)x){P(t_(*xK),8);cl=(S)*xK,ts=(S)xK[1];}}
 I cn=_n((K)ts),nty=cn;
 Z(' '<*ts&&'A'>*ts||'z'<*ts,sp=*ts,--cn,--nty,++ts);
 P(cl&&cn!=_n(cl),7);
 int bn=_n(b);S p=(S)b;I qt=!!memchr(p,'"',bn);
 K fu=qt?kC("{&(x in\",\\n\")&=\\~\"\\\"\"=x}"):kC("{&x in\",\\n\"}");
 ((S)fu)[8-!qt]=sp;
 M*ix=(M*)K(fu,b=k(1,b));
 I n=_n((K)ix),d=0;
 Z(!cl,cn=1;I i=0;W('\n'!=p[i])cn+=sp==p[i++];n-=cn);
 I c=0,r=0,rn=n/cn,ocn=0;
 N(cn,ocn+=TY);
 K x=k(' ',ocn),y=k(' ',ocn);
 NC(Z(!cl,I z=i;xK[ci]=CE(us))xK[ci]=kC((C[]){cl[i],0}))
 Z(!cl,d=cn);
 NC(switch(t){c('n','S',Xk,uc)c('i','I',xI[i],ui)c('*',' ',Xk,us)c('c','C',xG[i],uc)c('d','D',xQ[i],ud)c('t','T',xI[i],ut)c('f','F',xF[i],uf)c('b','I',xI[i],ub)default:Z('9'>t&&'0'<t,c0(t,'F',xF[i]=CE(uF,t-48,)))out();})
 k(0,(K)ix);
 R k('A',x,y);}

json.cc
extern"C"{
#include"k.h"
}
#include"simdjson.h"
#define Xk xK[i]
#define Yk yK[i]
#define Na(b...) {I i=0;for(auto a:a){b;++i;}}
#define kss(s) kC((S)((std::string_view)s).data())
ZK kC(S s){I n=strlen(s);R(K)memcpy((S)k('C',n),s,n);}
typedef char C;
using namespace simdjson::dom;
K va(element o){S((I)o.type(),
 C('{',object a=o.get_object();I n=0;Na(++n);K x=k(' ',n),y=k(' ',n);Na(Xk=kss(a.key),Yk=va(a.value));D(x,y))
 C('[',array a=o.get_array();I n=a.size(),t=0;Na(I x=(I)a.type();t=!t||t==x?x:-1);S(t,C('d',K x=k('F',n);Na(xF[i]=a.get_double());x),K x=k(' ',n);Na(Xk=va(a));x);0)
 C('"',kss(o.get_string()))C('d',kf(o.get_double()))C('u',ki(o.get_uint64()))C('l',ki(o.get_int64()))C('t',ki(o.get_bool()))C('n',0),kC((S)"err"));}
K jk(K x){P(tx||1!=xt,8);parser p;try{element o=p.parse((S)x,xn);R va(o);}catch(...){R 2;}}

t.csv
t|e|s|v|p
071609644084992|P|AAPL|17|115
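The format string in the ffi example above (csv["|tcnif";...] against this t.csv) encodes an optional separator character followed by one type character per column; csv.c dispatches on 'n' name, 'i' int, 'c' char, 'd' date, 't' time, 'f' float, 'b' bool, '*' string, with space skipping a column. A minimal python sketch of that convention, covering only a subset of the types and keeping times as raw strings rather than converting them to integers as csv.c does:

```python
def parse_csv(fmt, text):
    # separator detection mirrors csv.c: the first format char is taken as the
    # separator if it sits below 'A' (and above space) or above 'z'
    sep = ","
    if " " < fmt[0] < "A" or fmt[0] > "z":
        sep, fmt = fmt[0], fmt[1:]
    conv = {
        "i": int,                             # int
        "f": float,                           # float
        "n": str,                             # name/symbol
        "c": lambda s: s[0],                  # single char
        "t": str,                             # time, kept raw (csv.c -> ns int)
        "b": lambda s: bool(ord(s[0]) & 1),   # csv.c ub: low bit of first char
    }
    lines = text.strip().split("\n")
    header = lines[0].split(sep)
    cols = {h: [] for h, ty in zip(header, fmt) if ty != " "}
    for line in lines[1:]:
        for h, ty, v in zip(header, fmt, line.split(sep)):
            if ty != " ":                     # space = skip this column
                cols[h].append(conv[ty](v))
    return cols

t = parse_csv("|tcnif", "t|e|s|v|p\n071609644084992|P|AAPL|17|115")
```

The same convention drives the taxi load above, where csv["bd ii 2";x] reads bool, date, skipped, two ints, skipped, and a two-decimal amount; the 'd', '*' and digit (fixed-decimal) types are omitted from this sketch.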