This PR adds GPU acceleration for all remaining ggml tensors that didn't yet have it. Especially for long generations this makes a large difference because the KV cache is still CPU only on master and gets larger as the context fills up.
蠻多人有不同測試的結果,要注意這次不是把 CPU 搬到 GPU 上面做,而是把本來因為比較 light 而還沒搬上 GPU 的部分搬上去,所以不會是數量級的加速,但看起來改善也已經很不賴了:
Early attempt this morning we're getting ~2.5-2.8x perf increase on 4090s and about 1.8-2x on 3090Ti.
Our original product, which we’re retroactively calling Pyston-full, is a fork of the entire CPython codebase. Having users install a fully-custom version of Python lets us make changes across the Python implementation, leading to the most optimizations and largest speedups.
但這種方式的安裝與維護都需要另外搞,而且因為 ABI 不相容的問題,遇到一些套件可能會需要自己編 (甚至自己改?),不能直接用編好的 binary:
The flip side is that it is fairly intensive to set up. While we believe Pyston-full is one of the most highly-compatible alternative Python implementations available, it can be difficult to switch Python implementations regardless of the ease of use of either implementation. Compounded on this, we decided to break the ABI which requires users to recompile extension modules. In theory this is not a big deal, but in practice the lack of available binary packages is a significant disincentive to use an alternative implementation.
The sum of all of this was that while we were very happy to achieve a 30% speedup with Pyston-full, it was very difficult to get people to start using it. We decided to try a different form factor: a pip-installable extension module called Pyston-lite.
但效能的提昇就不像 Pyston-full 這麼高,Pyston-lite 只剩下 10% 了:
So while it’s a bit difficult to accept that we are now providing a 10% speedup instead of 30%, we’ve decided that it’s much more important to provide something that people are willing to use.
In the longer-term future we are planning to submit our JIT upstream as well, but we expect retargeting it to 3.11 to be significantly more work than the other versions due to the extensive amount of changes that were made to the interpreter in that version.
This morning, I began another #pentest for a client. After some google-fu dorking combine with the major search engines, I found the id_rsa key that gave me access to the server and a bunch of others 😬😬
Today, we are excited to announce that, due to high customer demand for additional services in Osaka, the Osaka Local Region will be expanded into a full AWS Region with three Availability Zones by early 2021.
意外的看到有人拿 Star Trek 的材料來玩... 依照作者的說明,DS9 一直沒有 Full HD 版的其中一個原因反而是因為「數位化」了。使用類比膠卷的母帶可以透過更高規格的重新掃描而得到高畫質版本,但 DS9 的母帶似乎已經是數位版了,所以反而造成無法透過重新掃描的方式取得 Full HD 版本:
While you can rescan analog film at a higher resolution, video is digital and can't be rescanned. This makes it much costlier to remaster this TV show, which is one of the reasons why it hasn't happened.
Current releases of PostgreSQL need to read every page in the database at least once every 2 billion write transactions (less, with default settings) to verify that there are no old transaction IDs on that page which require "freezing".
這動作在資料量大的機器上就會吃大量資源導致各種討厭的現象:
All of a sudden, when the number of transaction IDs that have been consumed crosses some threshold, autovacuum begins processing one or more tables, reading every page. This consumes much more I/O bandwidth, and exerts much more cache pressure on the system, than a standard vacuum, which reads only recently-modified page.
而作者送了 patch 改成只會讀還沒搞定的部份:
Instead of whole-table vacuums, we now have aggressive vacuums, which will read every page in the table that isn't already known to be entirely frozen.
要注意的是,agreesive vacuum 相較於 vacuum 會多吃很多資源,但可以打散掉 (有點像一次大 GC 導致 lag 與多次 minor GC 讓程式反應時間變得比較順暢的比較):
An aggressive vacuum still figures to read more data than a regular vacuum, possibly a lot more. But at least it won't read the data that hasn't been touched since the last aggressive vacuum, and that's a big improvement.