Commit Briefs

fffd8298be Sergey Bronnikov

test/fuzz: speedup string serialization (ligurio/gh-xxxx-speedup-serializer, origin/ligurio/gh-xxxx-speedup-serializer)

- clamp before cleaning string because cleaning is not cheap (O(n), where max n is equal to kMaxStrLength)
- call cleaning for identifiers only, there is no sense in cleaning string literals
- replace symbols disallowed by Lua grammar in identifier names with '_'

The patch saves 16 sec on 145k samples (401 sec before the patch and 385 sec after the patch). It is not much, but it adds up to about 2.5 min per hour.

NO_CHANGELOG=testing
NO_DOC=testing
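The clamp-then-clean order described above can be sketched roughly as follows. This is an illustration only, not the fuzzer's actual serializer: the limit value and the helper names `clamp`, `clean_identifier` and `serialize_string_literal` are assumptions, and Lua reserved words are not handled.

```python
import re

# Assumed limit for illustration; the real kMaxStrLength value is not
# stated in the commit message.
K_MAX_STR_LENGTH = 20

# Anything outside [A-Za-z0-9_] is not allowed in a Lua identifier.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_]")


def clamp(s, max_len=K_MAX_STR_LENGTH):
    """Cut the string to at most max_len characters."""
    return s[:max_len]


def clean_identifier(name):
    """Clamp first, then clean, so the O(n) pass sees at most max_len chars."""
    name = clamp(name)
    name = _DISALLOWED.sub("_", name)
    if not name:
        return "_"
    # A Lua identifier cannot start with a digit: replace the first char.
    if name[0].isdigit():
        name = "_" + name[1:]
    return name


def serialize_string_literal(value):
    """String literals are only clamped; they are never cleaned."""
    return clamp(value)
```

For example, `clean_identifier("foo-bar baz")` yields `foo_bar_baz`, while `serialize_string_literal("a-b")` leaves the dash in place.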


9d3859b246 Vladimir Davydov

vinyl: fix gc vs vylog race leading to duplicate record

Vinyl run files aren't always deleted immediately after compaction, because we need to keep run files corresponding to checkpoints for backups. Such run files are deleted by the garbage collection procedure, which performs the following steps:

1. Loads information about all run files from the last vylog file.
2. For each loaded run record that is marked as dropped:
   a. Tries to remove the run files.
   b. On success, writes a "forget" record for the dropped run, which will make vylog purge the run record on the next vylog rotation (checkpoint).

(see `vinyl_engine_collect_garbage()`)

The garbage collection procedure writes the "forget" records asynchronously using `vy_log_tx_try_commit()`, see `vy_gc_run()`. This procedure can be successfully executed during vylog rotation, because it doesn't take the vylog latch. It simply appends records to a memory buffer which is flushed either on the next synchronous vylog write or vylog recovery.

The problem is that the garbage collection doesn't necessarily load the latest vylog file, because the vylog file may be rotated between its calls to `vy_log_signature()` and `vy_recovery_new()`. This may result in a "forget" record written twice to the same vylog file for the same run file, as follows:

1. GC loads last vylog N.
2. GC starts removing dropped run files.
3. CHECKPOINT starts vylog rotation.
4. CHECKPOINT loads vylog N.
5. GC writes a "forget" record for run A to the buffer.
6. GC is completed.
7. GC is restarted.
8. GC finds that the last vylog is N and blocks on the vylog latch trying to load it.
9. CHECKPOINT saves vylog M (M > N).
10. GC loads vylog N. This triggers flushing the forget record for run A to vylog M (not to vylog N), because vylog M is the last vylog at this point of time.
11. GC starts removing dropped run files.
12. GC writes a "forget" record for run A to the buffer again, because in vylog N it's still marked as dropped and not forgotten. (The previous "forget" record was written to vylog M).
13. Now we have two "forget" records for run A in vylog M.

Such duplicate run records aren't tolerated by the vylog recovery procedure, resulting in a permanent error on the next checkpoint:

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run XXXX forgotten but not registered
```

To fix this issue, we move the `vy_log_signature()` call into `vy_recovery_new()`, under the vylog latch. This makes sure that GC will see the vylog records that it wrote during the previous execution.

Catching this race in a functional test would require a bunch of ugly error injections, so let's assume that it'll be tested by fuzzing.

Closes #10128

NO_DOC=bug fix
NO_TEST=tested manually with fuzzer
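The core of the race and of the fix can be modelled with a toy example: reading the signature before taking the latch may return a stale value, while reading it under the latch cannot. The `VyLog` class, `load_racy()`, `load_fixed()` and the `between` hook are hypothetical names made up for this sketch; they do not mirror the real vylog code.

```python
import threading


class VyLog:
    """Toy stand-in for the vylog: a signature guarded by a latch."""

    def __init__(self, signature):
        self.latch = threading.Lock()
        self.signature = signature

    def rotate(self, new_signature):
        # Checkpoint-like rotation: switch to a newer vylog under the latch.
        with self.latch:
            self.signature = new_signature


def load_racy(log, between=lambda: None):
    # The signature is read BEFORE the latch is taken ...
    signature = log.signature
    between()  # ... so a rotation may slip in at this point.
    with log.latch:
        return signature  # possibly stale


def load_fixed(log, between=lambda: None):
    between()
    # The signature is read UNDER the latch, so it is always the latest.
    with log.latch:
        return log.signature
```

With `log = VyLog(1)`, calling `load_racy(log, between=lambda: log.rotate(2))` returns the stale signature 1, whereas `load_fixed()` under the same interleaving returns 2.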


19d1f1cc0a Vladimir Davydov

tuple: don't use offset_slot_cache in vinyl threads

`key_part::offset_slot_cache` and `key_part::format_epoch` are used for speeding up tuple field lookup in `tuple_field_raw_by_part()`. These structure members are accessed and updated without any locks, assuming this code is executed exclusively in the tx thread. However, this isn't necessarily true because we also perform tuple field lookups in vinyl read threads. Apparently, this can result in unexpected races and bugs, for example:

```
#1 0x590be9f7eb6d in crash_collect+256
#2 0x590be9f7f5a9 in crash_signal_cb+100
#3 0x72b111642520 in __sigaction+80
#4 0x590bea385e3c in load_u32+35
#5 0x590bea231eba in field_map_get_offset+46
#6 0x590bea23242a in tuple_field_raw_by_path+417
#7 0x590bea23282b in tuple_field_raw_by_part+203
#8 0x590bea23288c in tuple_field_by_part+91
#9 0x590bea24cd2d in unsigned long tuple_hint<(field_type)5, false, false>(tuple*, key_def*)+103
#10 0x590be9d4fba3 in tuple_hint+40
#11 0x590be9d50acf in vy_stmt_hint+178
#12 0x590be9d53531 in vy_page_stmt+168
#13 0x590be9d535ea in vy_page_find_key+142
#14 0x590be9d545e6 in vy_page_read_cb+210
#15 0x590be9f94ef0 in cbus_call_perform+44
#16 0x590be9f94eae in cmsg_deliver+52
#17 0x590be9f9583e in cbus_process+100
#18 0x590be9f958a5 in cbus_loop+28
#19 0x590be9d512da in vy_run_reader_f+381
#20 0x590be9cb4147 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*)+34
#21 0x590be9f8b697 in fiber_loop+219
#22 0x590bea374bb6 in coro_init+120
```

Fix this by skipping this optimization for threads other than tx. No test is added because reproducing this race is tricky. Ideally, bugs like this one should be caught by fuzzing tests or thread sanitizers.

Closes #10123

NO_DOC=bug fix
NO_TEST=tested manually with fuzzer
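The idea of the fix, using an unsynchronized cache only from the thread that owns it, might look like this in a simplified model. `KeyPart`, `lookup_slot()` and `slow_lookup` are hypothetical names for this sketch, not the actual C structures or functions.

```python
import threading


class KeyPart:
    """Toy key part with a lock-free cache owned by a single thread."""

    def __init__(self, owner_thread_id):
        self.owner_thread_id = owner_thread_id  # stand-in for the tx thread
        self.cached_slot = None
        self.cached_epoch = None


def lookup_slot(part, format_epoch, slow_lookup):
    # Any thread other than the owner skips the cache entirely:
    # it neither reads nor updates the unsynchronized fields.
    if threading.get_ident() != part.owner_thread_id:
        return slow_lookup()
    # The owner thread may use the cache if it matches the format epoch.
    if part.cached_epoch == format_epoch and part.cached_slot is not None:
        return part.cached_slot
    slot = slow_lookup()
    part.cached_slot = slot
    part.cached_epoch = format_epoch
    return slot
```

Non-owner threads pay for the slow lookup every time, which trades some speed for the absence of data races on the cache fields.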


7b72080ddf Vladimir Davydov

vinyl: fix cache iterator skipping tuples in read view

The tuple cache doesn't store older tuple versions so if a reader is in a read view, it must skip tuples that are newer than the read view, see `vy_cache_iterator_stmt_is_visible()`. A reader must also ignore cached intervals if any of the tuples used as a boundary is invisible from the read view, see `vy_cache_iterator_skip_to_read_view()`.

There's a bug in `vy_cache_iterator_restore()` because of which such an interval may be returned to the reader: when we step backwards from the last returned tuple we consider only one of the boundaries. As a result, if the other boundary is invisible from the read view, the reader will assume there's nothing in the index between the boundaries and skip reading older sources (memory, disk). Fix this by always checking if the other boundary is visible.

Closes #10109

NO_DOC=bug fix
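The rule that a cached interval may be trusted only if both of its boundaries are visible can be sketched as below. The model assumes that a statement is visible when its LSN does not exceed the read view LSN; `CachedStmt`, `is_visible()` and `interval_is_usable()` are illustrative names, not the vinyl API.

```python
from dataclasses import dataclass


@dataclass
class CachedStmt:
    key: int
    lsn: int  # LSN of the statement that produced this tuple version


def is_visible(stmt, read_view_lsn):
    # Assumed visibility rule: statements newer than the read view are hidden.
    return stmt.lsn <= read_view_lsn


def interval_is_usable(left, right, read_view_lsn):
    # The interval says "nothing lies between left and right" only for
    # readers that can see BOTH boundaries; otherwise the reader has to
    # fall back to older sources (memory, disk).
    return is_visible(left, read_view_lsn) and is_visible(right, read_view_lsn)
```

Checking only one boundary, as the buggy code path effectively did, would accept an interval whose other boundary is newer than the read view.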


72763f947b Vladimir Davydov

vinyl: fix run iterator skipping tuples following non-terminal statement

If a run iterator is positioned at a non-terminal statement (UPSERT or UPDATE), `vy_run_iterator_next()` will iterate over older statements with the same key using `vy_run_iterator_next_lsn()` to build the key history. While doing so, it may reach the end of the run file (if the current key is the last in the run). This would stop iteration permanently, which is apparently wrong for reverse iterators (LE or LT): if this happens the run iterator won't return any keys preceding the last one in the run file. Fix this by removing `vy_run_iterator_stop()` from `vy_run_iterator_next_lsn()`.

Part of #10109

NO_DOC=bug fix
NO_CHANGELOG=next commit


9b63ced353 Vladislav Shpilevoy

box: refactor synchro quorum update on deletion from `_cluster` space

For symmetry with the update of the synchronous replication quorum on insertion into the `_cluster` space, let's reuse the `on_replace_cluster_update_quorum` on_commit trigger.

Follows-up #10087

NO_CHANGELOG=<refactoring>
NO_DOC=<refactoring>
NO_TEST=<refactoring>


29d1c0fa6c Vladislav Shpilevoy

box: update synchro quorum in on_commit trigger instead of on_replace

Currently, we update the synchronous replication quorum from the `on_replace` trigger of the `_cluster` space when registering a new replica. However, during the join process, the replica cannot ack its own insertion into the `_cluster` space. In the scope of #9723, we are going to enable synchronous replication for most of the system spaces, including the `_cluster` space. There are several problems with this:

1. Joining a replica to a 1-member cluster without manually changing the quorum won't work: it is impossible to commit the insertion into the `_cluster` space with only 1 node, since the quorum will be equal to 2 right after the insertion.
2. Joining a replica to a 3-member cluster may fail: the quorum will become equal to 3 right after the insertion, and the newly joined replica cannot ACK its own insertion into the `_cluster` space. If one of the original 3 nodes fails, the reconfiguration will fail.

Generally speaking, it will be impossible to join a new replica to the cluster if a quorum that includes the newly added replica (which cannot ACK) cannot be gathered.

To solve these problems, let's update the quorum in the `on_commit` trigger. This way we'll be able to insert a node regardless of the current configuration. This somewhat contradicts the Raft specification, which requires applying all configuration changes in the `on_replace` trigger (i.e., as soon as they are persisted in the WAL, without quorum confirmation), but several reconfigurations at the same time are still forbidden.

Closes #10087

NO_DOC=<no special documentation page devoted to cluster reconfiguration>
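The two failure scenarios can be checked with a little arithmetic. The sketch below assumes a majority quorum of `N / 2 + 1` (which matches the numbers in the commit message: 2 for a 2-member cluster, 3 for a 4-member one); `majority()` and `can_commit_join()` are hypothetical helpers, not Tarantool functions.

```python
def majority(n):
    """Assumed quorum formula: strict majority of n registered members."""
    return n // 2 + 1


def can_commit_join(members_before, able_to_ack, update_on):
    """Can the `_cluster` insertion of a joining replica gather a quorum?

    members_before: cluster size before the join.
    able_to_ack:    how many of those members are alive and can ACK
                    (the joining replica itself never counts).
    update_on:      "on_replace" (old behavior) or "on_commit" (new one).
    """
    if update_on == "on_replace":
        # The quorum already counts the replica that cannot ACK yet.
        quorum = majority(members_before + 1)
    elif update_on == "on_commit":
        # The quorum still reflects the old configuration.
        quorum = majority(members_before)
    else:
        raise ValueError("unknown trigger: " + update_on)
    return able_to_ack >= quorum
```

Under these assumptions, a 1-member cluster cannot commit the join with `on_replace` (quorum 2, one acker) but can with `on_commit` (quorum 1), and a 3-member cluster with one failed node fails with `on_replace` (quorum 3, two ackers) but succeeds with `on_commit` (quorum 2).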


42c4c34bcc Vladislav Shpilevoy

box: add `confirm_lag` field to `box.info.synchro.queue`

Let's add a confirmation lag field to the limbo, and expose information about the time the latest successfully confirmed entry waited for quorum.

Part of #9918

@TarantoolBot document
Title: New `age` and `confirm_lag` fields in `box.info.synchro.queue`
Product: Tarantool
Since: 3.2
Root documents: https://www.tarantool.io/ru/doc/latest/reference/reference_lua/box_info/synchro/

The `age` field shows the time that the oldest entry currently present in the queue has spent waiting for quorum, while the `confirm_lag` field shows the time that the latest successfully confirmed entry waited for the quorum to gather.
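The difference between the two fields can be illustrated with a toy queue that stores insertion timestamps. This is a sketch only: the `Limbo` class and its methods are made up for the example, time is passed in explicitly, and returning 0 for the age of an empty queue is an assumption.

```python
from collections import deque


class Limbo:
    """Toy synchronous queue tracking `age` and `confirm_lag`."""

    def __init__(self):
        self.queue = deque()    # insertion timestamps, oldest first
        self.confirm_lag = 0.0  # wait time of the latest confirmed entry

    def append(self, now):
        # Remember when the entry entered the queue.
        self.queue.append(now)

    def confirm_oldest(self, now):
        # The quorum was gathered for the oldest entry: record how long
        # it waited and drop it from the queue.
        inserted_at = self.queue.popleft()
        self.confirm_lag = now - inserted_at

    def age(self, now):
        # Time the oldest pending entry has spent waiting for quorum.
        return now - self.queue[0] if self.queue else 0.0
```

So `age` describes an entry that is still waiting, while `confirm_lag` describes one that has already been confirmed.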


bded00616d Vladislav Shpilevoy

box: add `age` field to `box.info.synchro.queue`

Let's add an insertion timestamp to the limbo entry, and expose information about the time the current oldest limbo entry has spent in the queue waiting for the quorum.

Part of #9918

NO_CHANGELOG=<in the final patch>
NO_DOC=<in the final patch>


530aa82da2 Vladimir Davydov

error: finish changes to several C modules to use box.error

We have already taken some steps to use box.error in these C modules. Let's finish the job to draw a clear distinction between C modules that have been switched to box.error and those that have not.

Follow-up #9996

NO_CHANGELOG=internal
NO_DOC=internal


Tree

.editorconfig
.gdbinit
.gitattributes
.github/
.gitignore
.gitmodules
.luacheckrc
.pack.mk
.test.mk
AUTHORS
CMakeLists.txt
CONTRIBUTING.md
Doxyfile
Doxyfile.API.in
FreeBSD/
LICENSE
README.FreeBSD
README.MacOSX
README.OpenBSD
README.md
TODO
apk/
asan/
changelogs/
cmake/
debian/
doc/
docker/
extra/
patches/
perf/
rpm/
rump/
src/
static-build/
test/
test-run
third_party/
tools/

README.md

# Tarantool

[![Actions Status][actions-badge]][actions-url]
[![Code Coverage][coverage-badge]][coverage-url]
[![OSS Fuzz][oss-fuzz-badge]][oss-fuzz-url]
[![Telegram][telegram-badge]][telegram-url]
[![GitHub Discussions][discussions-badge]][discussions-url]
[![Stack Overflow][stackoverflow-badge]][stackoverflow-url]

[Tarantool][tarantool-url] is an in-memory computing platform consisting of a
database and an application server.

It is distributed under [BSD 2-Clause][license] terms.

Key features of the application server:

* Heavily optimized Lua interpreter with incredibly fast tracing JIT compiler,
  based on LuaJIT 2.1.
* Cooperative multitasking, non-blocking IO.
* [Persistent queues][queue].
* [Sharding][vshard].
* [Cluster and application management framework][cartridge].
* Access to external databases such as [MySQL][mysql] and [PostgreSQL][pg].
* A rich set of built-in and standalone [modules][modules].

Key features of the database:

* MessagePack data format and MessagePack based client-server protocol.
* Two data engines: 100% in-memory with complete WAL-based persistence and a
  custom LSM-tree implementation for use with large data sets.
* Multiple index types: HASH, TREE, RTREE, BITSET.
* Document oriented JSON path indexes.
* Asynchronous master-master replication.
* Synchronous quorum-based replication.
* RAFT-based automatic leader election for the single-leader configuration.
* Authentication and access control.
* ANSI SQL, including views, joins, referential and check constraints.
* [Connectors][connectors] for many programming languages.
* The database is a C extension of the application server and can be turned
  off.

Supported platforms are Linux (x86_64, aarch64), Mac OS X (x86_64, M1), FreeBSD
(x86_64).

Tarantool is ideal for data-enriched components of scalable Web architecture:
queue servers, caches, stateful Web applications.

To download and install Tarantool as a binary package for your OS or using
Docker, please see the [download instructions][download].

To build Tarantool from source, see detailed [instructions][building] in the
Tarantool documentation.

To find modules, connectors and tools for Tarantool, check out our [Awesome
Tarantool][awesome-list] list.

Please report bugs to our [issue tracker][issue-tracker]. We also warmly
welcome your feedback on the [discussions][discussions-url] page and questions
on [Stack Overflow][stackoverflow-url].

We accept contributions via pull requests. Check out our [contributing
guide][contributing].

Thank you for your interest in Tarantool!

[actions-badge]: https://github.com/tarantool/tarantool/workflows/release/badge.svg
[actions-url]: https://github.com/tarantool/tarantool/actions
[coverage-badge]: https://coveralls.io/repos/github/tarantool/tarantool/badge.svg?branch=master
[coverage-url]: https://coveralls.io/github/tarantool/tarantool?branch=master
[telegram-badge]: https://img.shields.io/badge/Telegram-join%20chat-blue.svg
[telegram-url]: http://telegram.me/tarantool
[discussions-badge]: https://img.shields.io/github/discussions/tarantool/tarantool
[discussions-url]: https://github.com/tarantool/tarantool/discussions
[stackoverflow-badge]: https://img.shields.io/badge/stackoverflow-tarantool-orange.svg
[stackoverflow-url]: https://stackoverflow.com/questions/tagged/tarantool
[oss-fuzz-badge]: https://oss-fuzz-build-logs.storage.googleapis.com/badges/tarantool.svg
[oss-fuzz-url]: https://oss-fuzz.com/coverage-report/job/libfuzzer_asan_tarantool/latest
[tarantool-url]: https://www.tarantool.io/en/
[license]: LICENSE
[modules]: https://www.tarantool.io/en/download/rocks
[queue]: https://github.com/tarantool/queue
[vshard]: https://github.com/tarantool/vshard
[cartridge]: https://github.com/tarantool/cartridge
[mysql]: https://github.com/tarantool/mysql
[pg]: https://github.com/tarantool/pg
[connectors]: https://www.tarantool.io/en/download/connectors
[download]: https://www.tarantool.io/en/download/
[building]: https://www.tarantool.io/en/doc/latest/dev_guide/building_from_source/
[issue-tracker]: https://github.com/tarantool/tarantool/issues
[contributing]: CONTRIBUTING.md
[awesome-list]: https://github.com/tarantool/awesome-tarantool/