|
Chi Gao and Hengbing Wang reported certain issues around transaction handling
and demonstrated via xlogdump how certain transactions were getting marked
committed/aborted repeatedly on a datanode. When an attempt is then made to
abort an already committed transaction, the result is a PANIC. Investigation
uncovered a serious, long-standing bug in transaction handling.
If the client is running in autocommit mode, we try to avoid starting a
transaction block on the datanode side if only one datanode is going to be
involved in the transaction. This is an optimisation to speed up short queries
touching only a single node. But when the query rewriter transforms a single
statement into multiple statements, we would still (and incorrectly) run each
statement in autocommit mode on the datanode. This can cause inconsistencies
when one statement commits but the next statement aborts, and it may also lead
to the PANIC described above if we continue to use the same global transaction
identifier for all the statements.
This can also happen when the user invokes a user-defined function. If the
function has multiple statements and is FQSed, each statement will run in
autocommit mode, again creating inconsistencies if a later statement in the
function fails.
We now have a more elaborate mechanism to handle autocommit and transaction
block requirements. The special casing for force_autocommit is removed, making
the behaviour more predictable. We also check specific conditions to ensure
that we don't mix up autocommit and transaction blocks for the same global
xid. Finally, if the query rewriter transforms a single statement into
multiple statements, we run those statements in a transaction block. Together
these changes should fix the problems.
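The scenario is easiest to reproduce with a rewrite rule. A minimal sketch,
assuming hypothetical table and rule names (none of these are from the
commit):
CREATE TABLE orders (id int, amount int) DISTRIBUTE BY HASH (id);
CREATE TABLE orders_audit (id int, amount int) DISTRIBUTE BY HASH (id);
CREATE RULE orders_audit_rule AS ON INSERT TO orders
    DO ALSO INSERT INTO orders_audit VALUES (NEW.id, NEW.amount);
INSERT INTO orders VALUES (1, 100);
The single INSERT is rewritten into two statements. Previously each ran as a
separate autocommit statement on the datanode; now both run inside one
transaction block and commit or abort together.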
|
|
A number of functions were defined in pgxcnode.h/pgxcnode.c, but
only ever used in poolmgr.c. Those are:
- PGXCNodeConnect - open a libpq connection using a connection string
- PGXCNodePing - ping a node using a connection string
- PGXCNodeClose - close a libpq connection
- PGXCNodeConnected - verify connection status
- PGXCNodeConnStr - build a connection string
So move them to poolmgr.c and make them static, so that the pool manager
is the only part dealing with libpq connections directly.
|
|
Similarly to a39b06b0c6, this does minor cleanup in the pool manager
code by removing unused functions and adding a lot of comments, both
at the file level (explaining the concepts and basic API methods)
and for individual functions.
|
|
Our efforts to improve shared queue synchronization continue. We now have a
per-queue producer lwlock that must be held for synchronization between
consumers and the producer. Consumers must hold this lock before setting the
producer latch to ensure the producer does not miss any signals and does not
go into unnecessary waits.
We still can't get rid of all the timeouts; in particular, we see that
sometimes a producer finishes and tries to unbind from the queue even before a
consumer gets a chance to connect to it. We have left the 10s wait in place to
allow consumers to connect. There is still a net improvement, because when a
consumer is not going to connect, it now tells the producer, and we avoid the
10s timeout we used to see earlier.
|
|
We never had this support and we never felt the need, because the use of FQS
was limited to utility statements and simple queries that can be completely
pushed down to the remote node. But in PG 10, we're seeing errors while using
cursors for queries which are FQSed. So instead of forcing a regular remote
subplan on such queries, we are adding support for rescan of the RemoteQuery
node.
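A minimal sketch of the previously failing pattern (the table name is
invented):
BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT * FROM accounts;
FETCH FORWARD 10 FROM c;
MOVE ABSOLUTE 0 IN c;
FETCH FORWARD 10 FROM c;
COMMIT;
Rewinding the cursor requires the executor to rescan the underlying
RemoteQuery node, which is the support this patch adds.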
Patch by Senhu <senhu@tencent.com>
|
|
This is the merge-base of PostgreSQL's master branch and REL_10_STABLE branch.
This should be the last merge from PG's master branch into XL 10 branch.
Subsequent merges must happen from the REL_10_STABLE branch.
|
|
Due to a coding issue in IsDistColumnForRelId() it was possible to drop
columns that were used as modulo distribution keys. A simple example
demonstrates the behavior:
CREATE TABLE t (a INT, b INT, c INT) DISTRIBUTE BY MODULO (a);
ALTER TABLE t DROP COLUMN a;
test=# \d+ t
                         Table "public.t"
 Column |  Type   | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
 b      | integer |           | plain   |              |
 c      | integer |           | plain   |              |
Distribute By: MODULO(........pg.dropped.1........)
Location Nodes: ALL DATANODES
With this commit, the ALTER TABLE command fails as expected:
ERROR: Distribution column cannot be dropped
The commit also simplifies the code a bit and removes several functions that
were no longer needed (and unused outside locator.c).
|
|
Fast-query-shipping consumes all rows produced by one datanode (connection)
before moving to the next connection. This leads to suboptimal performance
when the datanodes can't produce tuples at the desired pace. Instead, we now
switch between connections after every PGXLRemoteFetchSize
(pgx_remote_fetch_size) rows are fetched. This gives a datanode a chance to
produce more tuples while the coordinator consumes tuples already produced and
sent across by another datanode.
This seems to improve performance significantly for FQS-ed queries that return
a large number of rows from more than one datanode.
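For example, assuming the GUC can be set per session (the table name is
invented):
SET pgx_remote_fetch_size = 1000;
SELECT * FROM big_table;
With the smaller fetch size, the coordinator rotates across the datanode
connections every 1000 rows instead of draining one connection completely
before reading from the next.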
Report by Pilar de Teodoro <pteodoro@sciops.esa.int>, initial analysis and
performance tests by Tomas Vondra, further analysis and patch by me.
Backpatched to XL9_5_STABLE.
|
|
the remote side during RemoteSubplan execution.
This allows us to experiment with different sizes more easily. Playing with
the fetch size also exposed a couple of problems, fixed in this same commit:
1. We were incorrectly forgetting a connection's response combiner while
suspending a portal, leading to errors later when we try to buffer the results
because the connection must be used for other queries.
2. The remote cursor name was not being set properly, so datanodes complained
about non-existent cursors.
|
|
PostgreSQL's BINARY COPY protocol does not expect a newline character between
rows. But the coordinator was adding one while running the protocol between
the coordinator and the datanodes. While Postgres-XL had some provision to
deal with this on the datanode side, it proved either insufficient or buggy,
as became evident during tests with pglogical. So get rid of that and make the
protocol the same as vanilla PG.
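For example, a binary COPY round-trip now runs over the same byte stream as
vanilla PG (table and file names are invented):
COPY measurements TO '/tmp/measurements.bin' WITH (FORMAT binary);
TRUNCATE measurements;
COPY measurements FROM '/tmp/measurements.bin' WITH (FORMAT binary);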
|
|
commands to the remote nodes
Earlier we'd special-cased a few GUCs, such as those using memory or time
units or transaction isolation levels. But clearly that wasn't enough, as we
noticed recently with "application_name". So fix this problem in a more
comprehensive manner.
Added a few more test cases to cover these scenarios.
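For example, all of the following SET commands must now reach the remote
nodes verbatim (the values are illustrative):
SET application_name = 'batch_loader';
SET work_mem = '64MB';
SET statement_timeout = '5min';
SET default_transaction_isolation = 'repeatable read';
The first was the reported case; the others fall under the memory/time-unit
and isolation-level special cases mentioned above.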
|
|
As pgxc.h is included from src/common/relpath.c, this caused a circular
dependency failure, because it requires lwlocknames.h, which is generated in
src/backend/storage (and which in turn requires src/common to be built
already).
|
|
the correct order and correct position
Per report and test-case by Daniel Bunford-Jones
|
|
There was only a single function in this module, and it was not
referenced from anywhere. So remove the module and all references
to it (e.g. includes of the header).
In createplan.c this is a bit more involved, because it got getpid() through
this header, so include pgxcnode.h instead, as it includes unistd.h.
|
|
While tweaking the code, I noticed this is the only place using
NodeHealthStatus, apparently in an attempt to save a few bytes. That seems
rather pointless given how little memory this needs and how short-lived the
memory is (and also how AllocSet doubles the allocation sizes anyway), so I
got rid of it. This makes copying the old state easier (a simple memcpy
instead of for loops).
Note: PgxcNodeListAndCount() is a bit unfortunately named, as the name
suggests it lists and counts nodes, while in practice it refreshes the status
of nodes from the catalog, keeping the health status.
|
|
The ShippabilityStat type and shippability walker/test functions are not
really needed outside pgxcship.c.
|
|
enforce_two_phase_commit
These two options were ineffective - defined and listed in the sample
configuration file, but attached to unreferenced variables.
|
|
caller can free up the memory when done with it.
This fixes a memory leak while running ALTER TABLE DISTRIBUTE BY.
|
|
options, such as hostname/port changes for a node
This allows us to retain connections to all other nodes in the cluster and
recreate connections only to the node whose connection information has
changed. This will be especially handy when dealing with datanode/coordinator
failover.
|
|
dependent on other settings
shared_queue_size depends on the number of datanodes in the cluster, since
each datanode may attach itself as a consumer of the shared queue. The
shared_queue_size now signifies a per-datanode value, and the actual value
used will be (max_datanodes * shared_queue_size). Existing users should adjust
their settings with this in mind.
Similarly, shared_queues depends heavily on the number of concurrent queries.
We now conservatively set it to at least 1/4th of max_connections or the
user-specified value, whichever is higher.
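An illustrative sizing, assuming a 16-datanode cluster (the values and the
use of ALTER SYSTEM are assumptions, not from the commit):
ALTER SYSTEM SET shared_queue_size = 65536;   -- per-datanode value
ALTER SYSTEM SET shared_queues = 64;
The memory actually reserved is max_datanodes * shared_queue_size, i.e. 16
times the configured value here, and both settings require a server restart
to take effect.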
|
|
such a common name for a very important structure member
|
|
query protocol.
While running the extended query protocol, a backend that has thrown an error
will keep ignoring all messages until it sees a SYNC message. We now carefully
track the messages we send to the remote node and remember whether we must
send a SYNC message before sending a ROLLBACK command.
While the regression tests were running fine even without this patch, the
issue was noticed as part of some other work and hence fixed.
|
|
This fixes problems reported in plpgsql function calls since
fast-query-shipping was added. Fixing the problem also highlighted other bugs
in the protocol handling. For example, we would set a datanode connection's
state to IDLE upon receiving a CommandComplete, but for the simple query
protocol we should really wait for ReadyForQuery before changing the state.
This patch includes related fixes.
The plpgsql test case's expected output is also fixed now that the underlying
bug has been taken care of.
|
|
applying to the new repo.
Original commit from the sourceforge repo:
commit e61639b864e83b6b45d11b737ec3c3d67aeb4b56
Author: Mason Sharp <mason_s@users.sourceforge.net>
Date: Sun Jul 26 17:54:08 2015 -0700
Changed license from the Mozilla Public License
to the PostgreSQL License
|
|
'R' - now reports information about the remote end (type/name/pid), where
type can be 'C' for coordinator, 'D' for datanode, 'A' for application and 'U'
for the rest
'G' - now reports information about the originating coordinator (name/pid)
'S' - reports the global session identifier.
|
|
We now log the name of the remote node and the remote PID if an error is
received on a connection. Some more debugging messages were added to the
pooler code.
|
|
This information is usually readily available in PGconn, but since we only
pass socket descriptors around between the pooler and the coordinator (or
datanode backends), we pass this information back to the backends explicitly.
This is at least very useful for debugging purposes.
(This API definitely deserves some refactoring. In particular, I would like to
keep a copy of some other members of the PGconn structure in the
PGXCNodeHandle, but for now the backend PID is what we care about.)
|
|
one or more consumers to finish.
We have seen a bunch of cases where a consumer may never bind to the SharedQ,
and rightfully so. For example, in a join between 3 tables which requires
redistribution of tuples, a consumer may not bind to the SharedQ at all
because the top-level outer side did not produce any tuples to join against
the redistributed inner node.
This patch avoids the unnecessary FATAL errors, but what we still do not
handle nicely is avoiding the 10s timeout (as currently set for the producer).
So while queries such as the one included in the test case will ultimately
return successfully, they incur an unnecessary 10s delay in response time.
This is a TODO.
|
|
thus reducing the chances of the query getting fully shipped to the remote
node.
We now remember all nodes that can satisfy a READ request for a replicated
table and then choose one of them at random if no preferred datanode is
specified. This avoids non-deterministic selection of FQS query plans and also
allows us to ship more queries to the remote node.
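For example, with a replicated table (names invented):
CREATE TABLE ref_data (id int, label text) DISTRIBUTE BY REPLICATION;
SELECT label FROM ref_data WHERE id = 7;
Every datanode holds a full copy of ref_data, so the planner now remembers all
of them as candidates and picks one at random unless a preferred datanode is
configured.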
|
|
send/recv() errors just give us a hint that something may be wrong with a
node. But a mere send/recv failure does not mean that the node is down or
unreachable. So before changing the health status, ping the node once and
confirm its health status.
|
|
cluster.
Each node now maintains a healthmap in shared memory to keep track of the
availability of all other nodes in the cluster. Whenever a send() or recv()
call on a socket fails, we try to ping the node once and, if that fails, we
mark the node as UNHEALTHY. On the other hand, if a new connection is later
established successfully to a node, we mark the node as HEALTHY. We also
periodically ping UNHEALTHY nodes to see if they have come back to life and
update the healthmap accordingly.
The only consumer of the healthmap right now is SELECT queries on replicated
tables. When a table is replicated to more than one node, we now consult the
healthmap and discard nodes that are known to be UNHEALTHY. If all nodes are
found to be UNHEALTHY, one attempt is made to see if any of them have come
back online.
|
|
transactions
When a two-phase commit protocol is interrupted midway, for example because
of a server crash, it can leave behind unresolved prepared transactions which
must be resolved when the node comes back. Postgres-XL provides a pgxc_clean
utility to look up the list of prepared transactions and resolve them based on
the status of each transaction on every node. But there were many issues with
the utility because of which it would either fail to resolve all transactions
or, worse, resolve them in the wrong manner. This commit fixes all such issues
discovered during some simple crash recovery testing.
One problem with the current approach was that the utility would not know
which nodes were involved in a transaction. So if it saw a transaction as
prepared on one subset of nodes but missing on another, it could not tell
whether the transaction was originally executed only on the nodes where it is
prepared, or also on other nodes that failed before they could prepare it. If
it was indeed executed on other nodes, such a transaction must be aborted;
whereas if it was executed only on the set of nodes where it is currently
prepared, it can safely be committed.
We now store the information about the nodes participating in a 2PC directly
in the GID. This of course has the downside of increasing GIDSIZE, which in
turn increases the shared memory requirement. But with today's server sizes,
that should not be a big concern. We could also look at storing this
information externally, such as on the GTM, but the fix seems good enough for
now.
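To see what pgxc_clean has to resolve, the standard PostgreSQL view can be
queried on each node after a crash:
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
With this commit the GID itself records the participating nodes, which lets
the utility distinguish a transaction prepared everywhere it ran from one
prepared only on a subset of the nodes.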
|
|
Till now, we used to aggressively assign transaction identifiers on the
coordinator, even if the transaction might not actually need one because it
does no write operations. PostgreSQL improved this case many years back with
the use of Virtual XIDs. But in Postgres-XL, if a read-looking SELECT
statement later performs a write operation, it would be too late to assign a
transaction identifier while the query is running on the datanode. Such
aggressive XID assignment causes severe performance problems, especially for
read-only queries.
We now solve this by assigning XIDs on demand for SELECT queries. If a
datanode ever needs an XID, it talks to the GTM with a global session
identifier which remains unique for a given global transaction on all nodes.
The GTM can then correlate multiple BEGIN TXN requests for the same global
transaction and return the already assigned identifier instead of opening a
new transaction every time.
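For example (the table name is invented):
SELECT count(*) FROM accounts;
Previously even this read-only query forced a global XID to be assigned up
front; now it ships without one, and a datanode requests an XID through the
global session identifier only if the statement turns out to write.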
For DML queries, we still acquire the XID on the coordinator. This is a loss
if the DML ends up not doing any write operation, but since that is not a
common scenario, it makes sense to pre-acquire the XID. More tests are
required, though, to verify this.
|
|
simple queries that can be fully shipped to the datanodes
This patch backports some of the FQS code from the XC/master branch. We only
use it for simple queries that can be fully executed on the datanodes; other
queries still go through the XL planner/executor. This shows good performance
improvements for such simple queries. It will also allow us to support certain
missing features, such as GROUPING SETS on replicated tables or queries that
require no coordinator support for final aggregation. This hasn't been tested
yet, though.
This will change many regression test outputs, since the EXPLAIN plan for
FQS-ed queries looks very different.
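For instance, a simple shippable query (the table name is invented, and the
plan shape shown is only illustrative of FQS output):
EXPLAIN (COSTS off) SELECT * FROM accounts WHERE id = 42;
 Remote Fast Query Execution
   Node/s: datanode_1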
|
|
Right now the process is responsible for computing the local RecentGlobalXmin
and sending periodic updates to the GTM. The GTM then computes a cluster-wide
value of the RecentGlobalXmin and sends it back to all the reporting nodes
(coordinators as well as datanodes). This way the GTM does not need to track
all open snapshots in the system, which previously required a transaction to
remain open even for a read-only operation. While this patch by itself may not
show major performance improvements, it will act as a foundation for other
major improvements in transaction handling.
If a node gets disconnected for a long time or stops sending updates to the
GTM, it is removed from the computation of the RecentGlobalXmin. This ensures
that a failed node does not block advancement of the RecentGlobalXmin beyond a
certain point. Such a node can safely rejoin the cluster as long as it is not
using a snapshot with a stale view of the cluster, i.e. a snapshot with an
xmin less than the RecentGlobalXmin that the GTM is running with.
|
|
Patch by Krzysztof Nienartowicz, with some bug fixes and rework by me
|
|
Per discussion on the developer list, this patch removes the bulk of the
XC-specific code which is not relevant in XL. This code was mostly left over
in #ifdef blocks, complicating code reading, bug fixes and merges. One can
always run "git diff" against the XC code base to see the exact differences.
We still use #ifdef PGXC and #ifdef XCP interchangeably because of the way the
code was written; something we should change. Also, there is probably still
more dead code (because files were copied to a different place or because the
code is not referenced in XL). That requires another cleanup patch, but not
something I plan to do immediately.
|
|
establishment time
The psql client can use various environment variables such as PGDATESTYLE and
PGOPTIONS to set session parameters at connection establishment time.
Similarly, libpq clients can use the connection string to pass options down to
the server. The coordinator now picks up some of these options and passes them
further down to the remote nodes. Currently only the "DateStyle", "timezone",
"geqo", "intervalstyle" and "lc_monetary" GUCs are passed down.
|
|
A transaction may wait for one or more transactions to finish before
proceeding with its operation. For example, an updating transaction may wait
for another transaction if that transaction has already updated or deleted the
tuple, and decide its next action based on the other transaction's outcome as
well as its own isolation level.
Sometimes a transaction is marked as committed on a datanode while the GTM has
not yet received a message to this effect. We have seen that this can break
MVCC properties, with more than one tuple version satisfying MVCC checks. More
specifically, a transaction which is already committed on the datanode may
still be seen as in-progress, while a later transaction which again updated
the same tuple is seen as committed as per a snapshot obtained from the GTM.
Such snapshots can see both the old and the most recent versions of a tuple as
visible.
This patch adds the ability to track the XIDs a transaction has waited on and
later sends that list to the GTM. The GTM must not commit a transaction unless
all transactions it has waited for are also finished. Until then, the GTM
sends back a STATUS_DELAYED response to the client, and the client must retry
the commit until it succeeds. We believe the window is extremely small and
this is a corner case, so such retries should not add much overhead to the
system.
|
|
client along with the error message
While we were collecting error and detail messages, hints were lost during
coordinator-datanode communication, so clients would not see them. Fix that by
tracking the hints received in the response combiner and sending them back to
the client along with the error message.
|
|
command during shared queue acquire/bind time
The current code had calls to find this out during shared queue release.
Given that shared queue release may be called during transaction abort, and
that we can no longer make any syscache lookups in the abort path, it seems
much safer to note and remember this value when it is first used in the
session.
(We could have remembered this value when the SET global_session command is
processed. But that happens during backend startup, and we haven't yet set up
the datanode and coordinator node handles at that time. So we can't really
determine the node id, which is an index into the array of these handles, at
that point.)
|
|
Conflicts:
.gitignore
contrib/Makefile
src/backend/access/common/heaptuple.c
src/backend/access/transam/rmgr.c
src/backend/access/transam/xact.c
src/backend/catalog/Makefile
src/backend/catalog/catalog.c
src/backend/catalog/genbki.pl
src/backend/catalog/namespace.c
src/backend/commands/sequence.c
src/backend/executor/execMain.c
src/backend/executor/functions.c
src/backend/executor/nodeAgg.c
src/backend/executor/nodeModifyTable.c
src/backend/nodes/copyfuncs.c
src/backend/nodes/outfuncs.c
src/backend/nodes/readfuncs.c
src/backend/optimizer/plan/createplan.c
src/backend/optimizer/plan/planner.c
src/backend/optimizer/plan/setrefs.c
src/backend/optimizer/util/pathnode.c
src/backend/parser/gram.y
src/backend/parser/parse_agg.c
src/backend/parser/parse_utilcmd.c
src/backend/postmaster/postmaster.c
src/backend/replication/logical/decode.c
src/backend/storage/file/fd.c
src/backend/storage/ipc/procsignal.c
src/backend/tcop/utility.c
src/backend/utils/adt/lockfuncs.c
src/backend/utils/adt/ruleutils.c
src/backend/utils/sort/tuplesort.c
src/backend/utils/time/snapmgr.c
src/include/access/rmgrlist.h
src/include/catalog/pg_aggregate.h
src/include/catalog/pg_proc.h
src/include/nodes/execnodes.h
src/include/nodes/plannodes.h
src/include/nodes/primnodes.h
src/include/nodes/relation.h
src/include/storage/lwlock.h
src/include/storage/procsignal.h
src/include/utils/plancache.h
src/include/utils/snapshot.h
src/test/regress/expected/foreign_key.out
src/test/regress/expected/triggers.out
src/test/regress/expected/with.out
src/test/regress/input/constraints.source
src/test/regress/output/constraints.source
src/test/regress/pg_regress.c
src/test/regress/serial_schedule
src/test/regress/sql/returning.sql
|
|
This makes the code much more readable. In passing, also add checks for local
and remote coordinators at a bunch of places where we are not expected to send
commands down to the datanodes from a remote coordinator.
|
|
This is the second step towards getting a build that works. Some of the merge
conflicts are handled in a rather trivial fashion and will definitely take
more work to get right. A case in point is aggregates. It's amply clear that
aggregate support is broken post-merge. PostgreSQL 9.4 changed the way
aggregates store the results of transition functions and now uses INTERNALOID
as the type. Since we need to send these transient results to coordinators for
collection and finalisation, we need to find a way to do that. We made some
elementary changes to let compilation proceed, but they are not going to help
at run-time. There are more such issues post-merge.
Similarly, regression expected outputs were changed without much deliberation.
A complete regression test run and analysis is required to fix them.
|
|
Commit 974c23972d57e2072f2c8bc345e45d03dc688bf5 had the side effect of
causing problems when initializing the cluster with pgxc_ctl init.
|
|
They were taken out in XL. But at least pgxc_clean requires the
xc_maintenance_mode GUC to work properly. The other GUCs are probably not very
important from the XL perspective, but keep them anyway unless we have a
reason to remove them.
|
|
Many of the fixes are contributed by Pavan Deolasee
|
|
data node to data node communication, more stringent security,
and other performance enhancements. Please see release notes.
Key contributors are:
Andrei Martsinchyk
Nikhil Sontakke
Mason Sharp
|