aboutk: fast fun universal database and language
for: hedgefunds banks manufacturers formula1 ..
why: 100 times faster than polars datatable biqquery redshift databricks snowflake ..
who: arthur whitney+ thanks to
e.l. whitney[1920-1966] dad:multiple putnam winner(beat john nash every time)
k.e. iverson[1920-2004] advisor:APL turing award'79
john cocke [1925-2002] advisor:RISC turing award'87
benchmarks: same machine. same data. same queries. (apples-to-apples)
h20: 1Billion rows k is about 100 times faster than polars datatable ..
taxi: 1Billion rows k is about 100 times faster than biggie shifty sparky ..
taq: 1000Billion rows only k (incl. kx.com) can do the asof joins in our lifetime.
stac: 2000Billion rows only k (incl. kx.com) can do these queries.
time - user and machine - is expensive. pandas and polars are free - god bless them - so
1,000 rows: use excel
1,000,000 rows: use polars
1,000,000,000 rows: use k
1,000,000,000,000 rows: only k
compareaboutreal-sql(k) is consistently 100 times faster (or more) than
redshift, bigquery, snowflake, spark, mongodb, postgres, ..
same data. same queries. same hardware. anyone can run the scripts.
benchmarks:
h2o 1Billion rows
taxi 1Billion rows
taq 1000Billion trades and quotes
stac 2000Billion trades and quotes
Taq 1.1T https://www.nyse.com/publicdocs/nyse/data/Daily_TAQ_Client_Spec_v2.2a.pdf
q1:select max price by sym,ex from trade where sym in S
q2:select sum size by sym,time.hour from trade where sym in S
q3:do(100)select last bid by sym from quote where sym in S / point select
q4:select from trade[s],quote[s] where price<bid / asof join
S is top 100 (10%)
time(ms) 16core 100 days
q1 q2 q3 q4
k 44 72 63 20
spark 80000 70000 DNF DNF - can't do it
postgres 20000 80000 DNF DNF - can't do it
..
Taxi 1.1B https://tech.marksblogg.com/benchmarks.html
q1:select count by type from trips
q2:select avg amount by pcount from trips
q3:select count by year,pcount from trips
q4:select count by year,pcount,_ distance from trips
cpu cost core/ram elapsed machines
k 4 .0004 4/16 1 1*i3.2xlarge(8v/32/$.62+$.93)
redshift 864 .0900 108/1464 8(1 2 2 3) 6*ds2.8xlarge(36v/244/$6.80)
bigquery 1600 .3200 200/3200 8(2 2 1 3)
db/spark 1260 .0900 42/336 30(2 4 4 20) 21*m5.xlarge(4v/16/$.20+$.30)
Stac https://www.stacresearch.com/
..h2o.kt:.`9.csv
\t select sum v1 by id1 from t
\t select sum v1 by id1,id2 from t
\t select sum v1,avg v3 by id3 from t
\t select avg v1,avg v2,avg v3 by id1 from t
\t select sum v1,sum v2,sum v3 by id3 from t
\t select med v3,dev v3 by id1,id2 from t
\t select min v1,max v1 by id3 from t;
\t select 2 max v3 by id3 from t
\t select v1 dev v2 by id1,id2 from t
\\
https://h2oai.github.io/db-benchmark
data: 50GB csv: 1e9 rows[id1 id2 id3 id4 id5 id6 v1 v2 v3]
query: 9 multi-column aggregations[sum avg var dev correlation median 2max]
machine: amd epyc 9374f
code: k p/polars r/datatable [and much slower:clickhouse spark pandas arrow duckdb ..]
query csvload (milliseconds)
k 950 1,600
* 97,000 606,000
p 258,000 265,000
r 257,000 1250,000
detail for the 9 queries
k 42 76 25 117 16 293 33 15 330
* 616 1509 6499 693 6260 20655 5817 51161 4231
p 1366 2401 42054 943 47177 5093 90847 29360 38500
r 3364 4494 7307 10008 7466 49770 63584 76673 31024
similar results for the 1e7 and 1e8. *(k4) is an old/1999 version of k.
t.km:_2922%n:16
d:2017.01.01+m*!n
g:{[[]v:x?2;p:x?9;m:x?100;a:x?2.3e]}
t:d!g'n#380000
m*n*.38e6
\t:m select count by v from t
\t:m select avg a by p from t
\t:m select count by d.year,p from t
\t:m select count by d.year,p,m from t
\
/data
curl -s https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv > 2017.01
..
import`csv
\t x:1:`2017.01
\t t:+`v`d`p`m`a!+csv["bd ii 2";x]
\t t:`d grp t
\t "t/"2:t
1.1billion taxi rides https://tech.marksblogg.com/benchmarks.html
apples to apples (same data. same hardware. same queries.)
k is 100 times faster than spark redshift snowflake bigquery ..
select v,count(*)as n from t group by v
select p,avg(a) as a from t group by p
select extract(year from d)as year,p, count(*)as n from t group by year,p
select extract(year from d)as year,p,m,count(*)as n from t group by year,p,m
timings(aws i3.4xlarge)
k sparky shifty flaky
.0 12 19 20+
.3 18 15 30+
.1 20 33 50+
.5 103 36 60+
----------------------
.9 153 103 160+
https://tech.marksblogg.com/billion-nyc-taxi-rides-spark-2-4-versus-presto-214.html
https://tech.marksblogg.com/billion-nyc-taxi-rides-redshift-large-cluster.htm
https://blog.tropos.io/analyzing-2-billion-taxi-rides-in-snowflake-a1fbed8b7ba3
bottomline: k good (sparky/shifty/flaky/googly bad)docmanfast universal database and language.
connect to everything. depend on nothing.
select min e,max e,avg e by n from`t.csv (billion row challenge)
`t 2'`t.csv [>>>k.k("2'","t",pandas.read_csv("t.csv"))]
select [count first last min max sum avg var dev med ..] by from where ..
while(..)if(..)else .. in .. exp log sqrt sin cos ..
flip flop list sort asc desc unique group key val ..
Verb (monad) Adverb Noun
+ + ' each char " ab"
- - / over right name ``ab
* * sqr \ scan left int 2 3 4
% div sqrt float 2 3.4
& and flip \l load
| or flop \t time
< < asc \v vars z.d date 2001.01.01
> > desc \w work z.t time 12:34:56.789
= = group \\ exit
~ ~ ~
. . value
! mod index
@ @ first I/O Class
? find unique 0' line List (2;3.4)
# take count 1' char Dict {a:2 3}
_ drop floor 2' data Table [a:2 3]
^ cut order 3' set* Expr :2+a
, , , 4' get* Func {[a]2+a}
rosetta Atom List atom list Ddd Mm
k ()[].;: +-*%&|<>=~$! @?#_^, ~-*%_$ @,#!!|^<>?= ..! @& f' F' while(a)if(b)c else d
python ()[].;= +-*/&|<> ^ % isssa ~- sfs lrwrsa u dvk @t [f(x)..] [F(x,y)..] while a:if b:c else:d
(index slices split append;sqrt floor str;len range where reverse sort argsort unique;dict key value;transpose)
Type int float complex boolean char name date time
Class List Dict Array Table Expression Function(* immutable)
ifgbcndt LDATEF* devs
k 1992 833...88 ++++++* 300spartans
sql 1992 32 . .22 - * 20million
python 1991 .... . -- 10million
numpy 2005 433 99 + * 5million
go 2009 422.. 2million
nodejs 1995 . . . -- 15million
excel 1982 . . . + * 80million
c 1972 42 5million
apl 1962 . .. + *sqlshakti universal database includes:
ansi-sql [1992..2011] ok for row/col select.
real-sql [1974..2021] atw@ipsa does it better.
join: real-easy ansi-ok
real: select from T,U
ansi: select from T left outer join U
group: real-easy ansi-annoy
real: select A by B from T
ansi: select B, A from T group by B order by B
simple: real-easy ansi-easy
real: select A from T where C or D, E
ansi: select A from T where (C or D)and E
complex: real-easy ansi-awful
asof/joins select from t,q where price<bid
first/last select last bid from quote where sym=`A
deltas/sums select from t where 0<deltas price
foreignkeys select order.cust.nation.region ..
arithmetic x+y e.g. combine markets through time
example: TPC-H National Market Share Query 8 http://www.qdpma.com/tpch/TPCH100_Query_plans.html
what market share does supplier.nation BRAZIL have by order.year for order.customer.nation.region AMERICA and part.type STEEL?
real: select revenue avg supplier.nation=`BRAZIL by order.year from t where order.customer.nation.region=`AMERICA, part.type=`STEEL
ansi: select o_year,sum(case when nation = 'BRAZIL' then revenue else 0 end) / sum(revenue) as mkt_share from (
select extract(year from o_orderdate) as o_year, revenue, n2.n_name as nation
from t,part,supplier,orders,customer,nation n1,nation n2,region
where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and
c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name = 'AMERICA' and
s_nationkey = n2.n_nationkey and o_orderdate between date '1995-01-01' and date '1996-12-31' and p_type = 'STEEL') as all_nations
group by o_year order by o_year;
Comparison: real ansi(sqlserver/oracle/db2/sap/teradata/..)
install 1 second 100,000 second
hardware 1 milliwatt 100,000 milliwatt
software 160 kilobyte 8,000,000 kilobyte (+ 10,000,000kilobyte O/S)
mediandb 1,000,000 megarow 10 megarow
https://docs.microsoft.com/en-us/sql/database-engine/install-windows/install-sql-server?view=sql-server-ver15
shakti is essential for analyzing big (trillion row+) and/or complex data.eduman$wget shakti.com/edu/k.zip;unzip k;make # linux/avx512
while if else .. select by from where .. in ..
+ - * % &and |or < > = ~ @ ?ind ^cut #take _drop , ! . F'/\ [,.! ~+-*%&|<>=@? ^#_]
abs - sqr sqrt flip rev asc desc freq ~ first uniq sort count floor , key val f'/\ [,.! ~+-*%_ <>=^?&|#@]
LMSAR(min max sum avg rms*) DEZ(log exp zgm*) rand rule grid gen 'map/reduce\scan
k-torch Zigmoid(a*x%1+exp-x) Zoftmax(x%+/x:exp x-max x)
fwd(24M/14)[6 14 288 288] f::x+g@fyZey:x%Rx+:d(li,:cy)S6Z'(ki,:z.by)*(%48)?(z:1e-4*#@k).ay:x%Rx
gen(37M/18)[ 32000 288] g::x?Mx:a@bf/ax
tcn(.6M/16)[13 50 192 64] a+0|b?d^0|c?d^x
mathematics from 4400BC (iraq egypt india ..)
s:1' /sort s rand 16 uruk -4400 divide&impera
q:avg/1>abs /quadrant p grid 96 thebes -3200 pi:abs log -1
a:(+\|)\ /fibonacci f 2 3 pingala ujjain -300
g:(|!\)/ /gcd g 4 6 euclid alexandria -300
p::(+x*)/1 2 3 /polynomial p rand 16 al-khwarizmi baghdad 820
r::(-x*)/1 3 5 /taylorseries q rand 16 madhava kerala 1390 arc cossin
e::x log x%+/x /entropy e freq kj napier louvain 1572 ratio zeno
f:(+,-.)0' /fft f rand 16 gauss brunswick 1805 divide&impera
w:(k log)w' /wordle w 2315 boltzmann graz 1872 divide&impera
m::2>abs(x+*)/x /mandelbrot m grid 96 fatou&julia paris 1915
rosetta infix prefix postfix
k (;){:}: +-*%&|<>=~@?#_^,! +-*%~_&|<>=?^@,#!!. '/\
python (,){:}= +-*/&|<>DD w[:]a% a-ss~ftra suc lrkv[len range keys values]
numpy.[where append abs square sqrt floor transpose reverse sort argsort unique collections.counter]k.zip