WIP: group by optimizations

master
Bill 2 years ago
parent 4942dc1f50
commit eebf507c6a

@ -1,31 +1,20 @@
# AQuery++ Database # AQuery++ Database
### Please try the latest code in dev branch if you encounter any problem. Use `git checkout dev` to switch branches.
## Introduction ## Introduction
AQuery++ Database is a cross-platform, In-Memory Column-Store Database that incorporates compiled query execution. (**Note**: If you encounter any problems, feel free to contact me via ys3540@nyu.edu) AQuery++ Database is a cross-platform, In-Memory Column-Store Database that incorporates compiled query execution. (**Note**: If you encounter any problems, feel free to contact me via ys3540@nyu.edu)
# Architecture
![Architecture](./docs/arch-hybrid.svg)
-## Docker (Recommended):
-- See installation instructions from [docker.com](https://www.docker.com). Run **docker desktop** to start docker engine.
-- In AQuery root directory, type `make docker` to build the docker image from scratch.
-- For Arm-based Mac users, you would have to build and run the **x86_64** docker image because MonetDB doesn't offer official binaries for arm64 Linux. (Run `docker buildx build --platform=linux/amd64 -t aquery .` instead of `make docker`)
-- Finally run the image in **interactive** mode (`docker run --name aquery -it aquery`)
-- When you need to access the container again run `docker start -ai aquery`
-- If there is a need to access the system shell within AQuery, type `dbg` to activate python interpreter and type `os.system('sh')` to launch a shell.
-- Docker image is available on [Docker Hub](https://hub.docker.com/repository/docker/sunyinqi0508/aquery) but building image yourself is highly recommended (see [#2](../../issues/2))
+## AQuery Compiler
+- The query is first processed by the AQuery Compiler which is composed of a frontend that parses the query into AST and a backend that generates target code that delivers the query.
+- Front end of AQuery++ Compiler is built on top of [mo-sql-parsing](https://github.com/klahnakoski/mo-sql-parsing) with modifications to handle AQuery dialect and extension.
+- Backend of AQuery++ Compiler generates target code dependent on the Execution Engine. It can either be the C++ code for AQuery Execution Engine or sql and C++ post-processor for Hybrid Engine or k9 for the k9 Engine.
+## Execution Engines
+- AQuery++ supports different execution engines thanks to the decoupled compiler structure.
+- Hybrid Execution Engine: decouples the query into two parts. The sql-compliant part is executed by an Embedded version of Monetdb and everything else is executed by a post-process module which is generated by AQuery++ Compiler in C++ and then compiled and executed.
+- AQuery Library: A set of header based libraries that provide column arithmetic and operations inspired by array programming languages like kdb. This library is used by C++ post-processor code which can significantly reduce the complexity of generated code, reducing compile time while maintaining the best performance. The set of libraries can also be used by UDFs as well as User modules which makes it easier for users to write simple, efficient yet powerful extensions.
## CIMS Computer Lab (Only for NYU affiliates who have access)
1. Clone this git repo in CIMS.
2. Download the [patch](https://drive.google.com/file/d/1YkykhM6u0acZ-btQb4EUn4jAEXPT81cN/view?usp=sharing)
3. Decompress the patch to any directory and execute script inside by typing (`source ./cims.sh`). Please use the source command or `. ./cims.sh` (dot space) to execute the script because it contains configurations for environment variables. Also note that this script can only work with bash and compatible shells (e.g. dash, zsh. but not csh)
4. Execute `python3 ./prompt.py`
-## Singularity Container
+# Installation:
1. build container `singularity build aquery.sif aquery.def`
2. execute container `singularity exec aquery.sif sh`
3. run AQuery `python3 ./prompt.py`
# Native Installation:
## Requirements ## Requirements
1. Recent version of Linux, Windows or MacOS, with recent C++ compiler that has C++17 (1z) support. (however c++20 is recommended if available for heterogeneous lookup on unordered containers) 1. Recent version of Linux, Windows or MacOS, with recent C++ compiler that has C++17 (1z) support. (however c++20 is recommended if available for heterogeneous lookup on unordered containers)
- GCC: 9.0 or above (g++ 7.x, 8.x fail to handle fold-expressions due to a compiler bug) - GCC: 9.0 or above (g++ 7.x, 8.x fail to handle fold-expressions due to a compiler bug)
@ -38,10 +27,6 @@ AQuery++ Database is a cross-platform, In-Memory Column-Store Database that inco
- On MacOS, Monetdb can be easily installed in homebrew `brew install monetdb`. - On MacOS, Monetdb can be easily installed in homebrew `brew install monetdb`.
3. Python 3.6 or above and install required packages in requirements.txt by `python3 -m pip install -r requirements.txt` 3. Python 3.6 or above and install required packages in requirements.txt by `python3 -m pip install -r requirements.txt`
## Installation
AQuery is tested on mainstream operating systems such as Windows, macOS and Linux
### Windows ### Windows
There're multiple options to run AQuery on Windows. But for better consistency I recommend using a simulated Linux environment such as **Windows Subsystem for Linux** (1 or 2), **Docker** or **Linux Virtual Machines**. You can also use the native toolchain from Microsoft Visual Studio or gcc from Winlabs/Cygwin/MinGW. There're multiple options to run AQuery on Windows. But for better consistency I recommend using a simulated Linux environment such as **Windows Subsystem for Linux** (1 or 2), **Docker** or **Linux Virtual Machines**. You can also use the native toolchain from Microsoft Visual Studio or gcc from Winlabs/Cygwin/MinGW.
@ -97,7 +82,24 @@ There're multiple options to run AQuery on Windows. But for better consistency I
In this case, upgrade anaconda or your compiler or use the python from your OS or package manager instead. Or (**NOT recommended**) copy/link the library from your system (e.g. /usr/lib/x86_64-linux-gnu/libstdc++.so.6) to anaconda's library directory (e.g. ~/Anaconda3/lib/). In this case, upgrade anaconda or your compiler or use the python from your OS or package manager instead. Or (**NOT recommended**) copy/link the library from your system (e.g. /usr/lib/x86_64-linux-gnu/libstdc++.so.6) to anaconda's library directory (e.g. ~/Anaconda3/lib/).
## Docker (Recommended):
- See installation instructions from [docker.com](https://www.docker.com). Run **docker desktop** to start docker engine.
- In AQuery root directory, type `make docker` to build the docker image from scratch.
- For Arm-based Mac users, you would have to build and run the **x86_64** docker image because MonetDB doesn't offer official binaries for arm64 Linux. (Run `docker buildx build --platform=linux/amd64 -t aquery .` instead of `make docker`)
- Finally run the image in **interactive** mode (`docker run --name aquery -it aquery`)
- When you need to access the container again run `docker start -ai aquery`
- If there is a need to access the system shell within AQuery, type `dbg` to activate python interpreter and type `os.system('sh')` to launch a shell.
- Docker image is available on [Docker Hub](https://hub.docker.com/repository/docker/sunyinqi0508/aquery) but building image yourself is highly recommended (see [#2](../../issues/2))
## CIMS Computer Lab (Only for NYU affiliates who have access)
1. Clone this git repo in CIMS.
2. Download the [patch](https://drive.google.com/file/d/1YkykhM6u0acZ-btQb4EUn4jAEXPT81cN/view?usp=sharing)
3. Decompress the patch to any directory and execute script inside by typing (`source ./cims.sh`). Please use the source command or `. ./cims.sh` (dot space) to execute the script because it contains configurations for environment variables. Also note that this script can only work with bash and compatible shells (e.g. dash, zsh. but not csh)
4. Execute `python3 ./prompt.py`
## Singularity Container
1. build container `singularity build aquery.sif aquery.def`
2. execute container `singularity exec aquery.sif sh`
3. run AQuery `python3 ./prompt.py`
# Usage # Usage
`python3 prompt.py` will launch the interactive command prompt. The server binary will be automatically rebuilt and started. `python3 prompt.py` will launch the interactive command prompt. The server binary will be automatically rebuilt and started.
### Commands: ### Commands:
@ -268,17 +270,6 @@ SELECT * FROM my_table WHERE c1 > 10
- `sqrt(x), trunc(x), and other builtin math functions`: value-wise math operations. `sqrt(x)[i] = sqrt(x[i])` - `sqrt(x), trunc(x), and other builtin math functions`: value-wise math operations. `sqrt(x)[i] = sqrt(x[i])`
- `pack(cols, ...)`: pack multiple columns with exact same type into a single column. - `pack(cols, ...)`: pack multiple columns with exact same type into a single column.
# Architecture
![Architecture](./docs/arch-hybrid.svg)
## AQuery Compiler
- The query is first processed by the AQuery Compiler which is composed of a frontend that parses the query into AST and a backend that generates target code that delivers the query.
- Front end of AQuery++ Compiler is built on top of [mo-sql-parsing](https://github.com/klahnakoski/mo-sql-parsing) with modifications to handle AQuery dialect and extension.
- Backend of AQuery++ Compiler generates target code dependent on the Execution Engine. It can either be the C++ code for AQuery Execution Engine or sql and C++ post-processor for Hybrid Engine or k9 for the k9 Engine.
## Execution Engines
- AQuery++ supports different execution engines thanks to the decoupled compiler structure.
- Hybrid Execution Engine: decouples the query into two parts. The sql-compliant part is executed by an Embedded version of Monetdb and everything else is executed by a post-process module which is generated by AQuery++ Compiler in C++ and then compiled and executed.
- AQuery Library: A set of header based libraries that provide column arithmetic and operations inspired by array programming languages like kdb. This library is used by C++ post-processor code which can significantly reduce the complexity of generated code, reducing compile time while maintaining the best performance. The set of libraries can also be used by UDFs as well as User modules which makes it easier for users to write simple but powerful extensions.
# Roadmap # Roadmap
- [x] SQL Parser -> AQuery Parser (Front End) - [x] SQL Parser -> AQuery Parser (Front End)

@ -186,6 +186,21 @@ decayed_t<VT, types::GetLongType<T>> sumw(uint32_t w, const VT<T>& arr) {
return ret; return ret;
} }
template<class T, template<typename ...> class VT>
void avgw(uint32_t w, const VT<T>& arr,
decayed_t<VT, types::GetFPType<types::GetLongType<T>>>& ret) {
typedef types::GetFPType<types::GetLongType<T>> FPType;
const uint32_t& len = arr.size;
uint32_t i = 0;
types::GetLongType<T> s{};
w = w > len ? len : w;
if (len) s = ret[i++] = arr[0];
for (; i < w; ++i)
ret[i] = (s += arr[i]) / (FPType)(i + 1);
for (; i < len; ++i)
ret[i] = ret[i - 1] + (arr[i] - arr[i - w]) / (FPType)w;
}
template<class T, template<typename ...> class VT> template<class T, template<typename ...> class VT>
decayed_t<VT, types::GetFPType<types::GetLongType<T>>> avgw(uint32_t w, const VT<T>& arr) { decayed_t<VT, types::GetFPType<types::GetLongType<T>>> avgw(uint32_t w, const VT<T>& arr) {
typedef types::GetFPType<types::GetLongType<T>> FPType; typedef types::GetFPType<types::GetLongType<T>> FPType;
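The `avgw` overload added in this hunk fills a caller-provided buffer with a windowed average: the first `w` outputs average a growing prefix, and later outputs slide the window by one element. The sketch below shows the same idea on plain `std::vector`, under stated assumptions: `moving_average` is a hypothetical standalone helper, not part of the AQuery library, and it keeps an exact integer running sum instead of updating the previous mean incrementally as the diff does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone helper (not from the AQuery sources).
// out[i] is the mean of the last min(i + 1, w) elements of `in`.
std::vector<double> moving_average(std::size_t w, const std::vector<long long>& in) {
    assert(w > 0 && "window size must be positive");
    std::vector<double> out(in.size());
    long long sum = 0;  // exact running sum of the current window
    for (std::size_t i = 0; i < in.size(); ++i) {
        sum += in[i];
        if (i >= w) sum -= in[i - w];              // drop the element that left the window
        const std::size_t n = i < w ? i + 1 : w;   // current window length
        out[i] = static_cast<double>(sum) / static_cast<double>(n);
    }
    return out;
}
```

Calling `moving_average(2, {1, 2, 3, 4})` yields one output per input element. Dividing a running integer sum at each step avoids accumulating floating-point rounding across the column, at the cost of one extra division per element compared with the incremental update.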

@ -132,3 +132,7 @@ namespace ankerl::unordered_dense{
struct hash<std::tuple<Types...>> : public hasher<Types...>{ }; struct hash<std::tuple<Types...>> : public hasher<Types...>{ };
} }
struct aq_hashtable_value_t{
uint32_t id;
uint32_t cnt;
};

@ -6,6 +6,8 @@
#include "monetdb_conn.h" #include "monetdb_conn.h"
#include "monetdbe.h" #include "monetdbe.h"
#include "table.h" #include "table.h"
#include <thread>
#undef ERROR #undef ERROR
#undef static_assert #undef static_assert
@ -86,7 +88,10 @@ void Server::connect(Context *cxt){
} }
server = (monetdbe_database*)malloc(sizeof(monetdbe_database)); server = (monetdbe_database*)malloc(sizeof(monetdbe_database));
-auto ret = monetdbe_open(server, nullptr, nullptr);
+monetdbe_options ops;
+AQ_ZeroMemory(ops);
+ops.nr_threads = std::thread::hardware_concurrency();
+auto ret = monetdbe_open(server, nullptr, &ops);
if (ret == 0){ if (ret == 0){
status = true; status = true;
this->server = server; this->server = server;
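This hunk sets `nr_threads` from `std::thread::hardware_concurrency()`. The C++ standard allows that function to return 0 when the core count cannot be determined, and the diff does not say how the embedded engine treats a value of 0. A minimal sketch of a guarded lookup follows; `pick_thread_count` and its `fallback` parameter are hypothetical names introduced for illustration only.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper (not from the AQuery sources): returns the detected
// hardware thread count, or `fallback` when the runtime reports 0 (unknown).
unsigned pick_thread_count(unsigned fallback = 1) {
    assert(fallback > 0 && "fallback must be a usable thread count");
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? fallback : n;
}
```

The result could then be assigned where the diff assigns `ops.nr_threads`, assuming a positive thread count is what the engine expects.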

@ -191,6 +191,21 @@ constexpr prt_fn_t monetdbe_prtfns[] = {
aq_to_chars<std::nullptr_t> aq_to_chars<std::nullptr_t>
}; };
#ifndef __AQ_USE_THREADEDGC__
void aq_init_gc(void *handle, Context* cxt)
{
typedef void (*aq_gc_init_t) (Context*);
if (handle && cxt){
auto sym = dlsym(handle, "__AQ_Init_GC__");
if(sym){
((aq_gc_init_t)sym)(cxt);
}
}
}
#else //__AQ_USE_THREADEDGC__
#define aq_init_gc(h, c)
#endif //__AQ_USE_THREADEDGC__
#include "monetdbe.h" #include "monetdbe.h"
#undef max #undef max
#undef min #undef min
@ -363,12 +378,7 @@ start:
recorded_queries.emplace_back(copy_lpstr("N")); recorded_queries.emplace_back(copy_lpstr("N"));
} }
handle = dlopen(proc_name, RTLD_NOW); handle = dlopen(proc_name, RTLD_NOW);
-#ifndef __AQ_USE_THREADEDGC__
-{
-typedef void (*aq_gc_init_t) (Context*);
-((aq_gc_init_t)dlsym(handle, "__AQ_Init_GC__"))(cxt);
-}
-#endif
+aq_init_gc(handle, cxt);
if (procedure_recording) { if (procedure_recording) {
recorded_libraries.emplace_back(handle); recorded_libraries.emplace_back(handle);
} }
@ -474,11 +484,13 @@ start:
p.__rt_loaded_modules = static_cast<void**>( p.__rt_loaded_modules = static_cast<void**>(
malloc(sizeof(void*) * p.postproc_modules)); malloc(sizeof(void*) * p.postproc_modules));
for(uint32_t j = 0; j < p.postproc_modules; ++j){ for(uint32_t j = 0; j < p.postproc_modules; ++j){
-auto pj = dlopen(p.name, RTLD_NOW);
+auto pj = dlopen((procedure_root + p.name + std::to_string(j) + ".so").c_str(), RTLD_NOW);
if (pj == nullptr){ if (pj == nullptr){
printf("Error: failed to load module %s\n", p.name); printf("Error: failed to load module %s\n", p.name);
return true; return true;
} }
aq_init_gc(pj, cxt);
p.__rt_loaded_modules[j] = pj; p.__rt_loaded_modules[j] = pj;
} }
} }
@ -528,6 +540,7 @@ start:
puts(p.queries[j-1]); puts(p.queries[j-1]);
} }
fclose(fp); fclose(fp);
p.__rt_loaded_modules = 0;
return load_modules(p); return load_modules(p);
}; };
switch(n_recvd[i][1]){ switch(n_recvd[i][1]){

@ -289,6 +289,7 @@ public:
uint32_t len = end - start; uint32_t len = end - start;
return ColView<_Ty>(orig, idxs.subvec(start, end)); return ColView<_Ty>(orig, idxs.subvec(start, end));
} }
ColRef<_Ty> subvec_deep(uint32_t start, uint32_t end) const { ColRef<_Ty> subvec_deep(uint32_t start, uint32_t end) const {
uint32_t len = end - start; uint32_t len = end - start;
ColRef<_Ty> subvec(len); ColRef<_Ty> subvec(len);

@ -1059,6 +1059,18 @@ public:
return do_insert_or_assign(std::move(key), std::forward<M>(mapped)).first; return do_insert_or_assign(std::move(key), std::forward<M>(mapped)).first;
} }
template <class K, class M>
auto hashtable_push(K&& key, M& mapped) {
++ mapped.id;
++ mapped.cnt;
auto it_isinserted = try_emplace(std::forward<K>(key), std::forward<M>(mapped));
if (!it_isinserted.second) {
--mapped.cnt;
return it_isinserted.first->second.id;
}
return mapped.id;
}
template <typename K, template <typename K,
typename M, typename M,
typename Q = T, typename Q = T,
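The `hashtable_push` method added above relies on `try_emplace`, which inserts only when the key is absent and otherwise returns an iterator to the existing entry, so each row costs a single lookup. The sketch below illustrates that pattern for group-by using `std::unordered_map`. It is a simplified illustration, not the commit's exact bookkeeping: `assign_group_ids` is a hypothetical helper, and it hands out dense ids in first-seen order rather than maintaining the `id`/`cnt` pair from `aq_hashtable_value_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the AQuery sources): maps every key to a
// dense group id, numbered in the order the groups are first seen.
std::vector<std::uint32_t> assign_group_ids(const std::vector<std::string>& keys) {
    std::unordered_map<std::string, std::uint32_t> ids;
    std::vector<std::uint32_t> out;
    out.reserve(keys.size());
    for (const auto& k : keys) {
        // The candidate id is the number of groups seen so far; try_emplace
        // stores it only if k is new and leaves an existing entry untouched.
        auto it = ids.try_emplace(k, static_cast<std::uint32_t>(ids.size())).first;
        out.push_back(it->second);
    }
    assert(out.size() == keys.size());  // one group id per input row
    return out;
}
```

The returned vector can serve as a per-row group index, and the number of distinct groups is one more than its largest element when the input is non-empty.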
