System.Data.SQLite: View Ticket

2018-01-28
05:31		• Closed ticket [5cee5409f8]: Null reference exception on connection close plus 2 other changes artifact: dae793f0e1 user: mistachkin
05:28		Support asynchronous completion of distributed transactions, fix for [5cee5409f8]. Add experimental WaitForEnlistmentReset method to the SQLiteConnection class, pursuant to [7e1dd697dc]. check-in: a27556f61d user: mistachkin tags: trunk
2018-01-27
22:33		• Ticket [5cee5409f8] Null reference exception on connection close status still Pending with 3 other changes artifact: 36cf4ac552 user: mistachkin
22:24		Refinements to the test for ticket [5cee5409f8], with a special focus on the new WaitForEnlistmentReset method. check-in: 2cb2ae35a3 user: mistachkin tags: tkt-5cee5409f8
21:13		Add initial tests for ticket [5cee5409f8]. check-in: e30a873204 user: mistachkin tags: tkt-5cee5409f8
20:03		• Ticket [5cee5409f8] Null reference exception on connection close status still Pending with 4 other changes artifact: 1b744fd69a user: mistachkin
08:07		• Ticket [5cee5409f8]: 4 changes artifact: 0a4715a66a user: anonymous
04:15		• Ticket [5cee5409f8]: 3 changes artifact: 8edab0217a user: mistachkin
00:05		• Ticket [5cee5409f8]: 4 changes artifact: 4825e99f1a user: anonymous
2018-01-26
03:11		• Ticket [5cee5409f8]: 3 changes artifact: b466c02e83 user: mistachkin
03:10		• Ticket [5cee5409f8]: 3 changes artifact: 6e44e21629 user: mistachkin
02:35		• Ticket [5cee5409f8]: 3 changes artifact: 34691438fd user: mistachkin
2018-01-25
21:06		• Ticket [5cee5409f8]: 3 changes artifact: cc2f1a7275 user: mistachkin
19:21		Work in progress on ticket [5cee5409f8]: use the enlistment lock to protect against several race conditions between SQLiteEnlistment and SQLiteConnection. check-in: 0dc736558b user: mistachkin tags: tkt-5cee5409f8
19:20		• Ticket [5cee5409f8] Null reference exception on connection close status still Pending with 3 other changes artifact: e2866e1944 user: mistachkin
18:47		• Ticket [5cee5409f8]: 3 changes artifact: d108ec31eb user: mistachkin
18:44		• Ticket [5cee5409f8]: 3 changes artifact: 4104a27e03 user: mistachkin
18:04		• Ticket [5cee5409f8]: 4 changes artifact: d0040b9e2e user: anonymous
17:40		• Ticket [5cee5409f8]: 4 changes artifact: 81e1ea20fa user: anonymous
2018-01-24
19:33		• Ticket [5cee5409f8]: 4 changes artifact: 5a5b4588fb user: anonymous
19:03		• Ticket [5cee5409f8]: 3 changes artifact: 8603a2ca27 user: mistachkin
18:41		• Ticket [5cee5409f8]: 4 changes artifact: 93cc7bc05e user: anonymous
18:13		• Ticket [5cee5409f8]: 3 changes artifact: 3229e182d9 user: mistachkin
18:11		Synchronize access to the '_enlistment' field of SQLiteConnection objects. Even thought the SQLiteConnection class is not designed to be thread-safe, this field must be protected from multi-threaded access due to how (asynchronous) transaction scope completion is implemented by the .NET Framework. Candidate fix for ticket [5cee5409f8]. check-in: 8f44046b6f user: mistachkin tags: tkt-5cee5409f8
18:05		• Ticket [5cee5409f8] Null reference exception on connection close status still Pending with 4 other changes artifact: 63d70c41b2 user: anonymous
17:58		• Ticket [5cee5409f8]: 3 changes artifact: 1eb8b2f0a9 user: mistachkin
17:55		• Ticket [5cee5409f8]: 3 changes artifact: 64ce368e75 user: anonymous
17:48		• Ticket [5cee5409f8]: 3 changes artifact: 5321a7aea7 user: mistachkin
17:44		• Ticket [5cee5409f8]: 4 changes artifact: e253661f92 user: anonymous
16:33		• Pending ticket [5cee5409f8]. artifact: e0d9d2c62f user: mistachkin
2017-12-25
17:37		• New ticket [5cee5409f8]. artifact: 92c3b80acb user: anonymous

Ticket Hash:	5cee5409f84da5f6217221a09a5020bb49ac1ec7
Title:	Null reference exception on connection close
Status:	Closed	Type:	Incident
Severity:	Minor	Priority:	Blocker
Subsystem:	Connection	Resolution:	Fixed
Last Modified:	2018-01-28 05:31:11 7.43 years ago	Created:	2017-12-25 17:37:12 7.53 years ago
Version Found In:	1.0.102

User Comments:

anonymous added on 2017-12-25 17:37:12:

When closing a connection, a null reference exception may occur.
Stack trace indicates it occurs directly in the SQLiteConnection.Close method.

This happens when the connection was used with a distributed transaction scope, already disposed at time of closing.

Checking 1.0.106 code, I highly suspect the _enlisment member usage to cause this. It is checked for null then later used if not null, as if assuming it could not be nullify by a concurrent thread finishing the system transaction.

Unfortunately, when a system transaction was distributed, the scope disposal leave as soon as the first phase is finished (all votes for commit or rollback collected), letting code following the scope running, while processing on distinct threads the second phases of enlisted ressources and the transaction completed event.

You can read more on this on a discussion with a Microsoft employee working on MSDTC at following url: https://github.com/npgsql/npgsql/issues/1571#issuecomment-308651461

So this means that connection's enlistment cleanup logic, likely coded on a second phase notification or transaction completed event (I have not checked) can run concurrently to the connection closing even after having left the scope disposal.

In such context, checking a member for null then using it as do SQLiteConnection.Close is prone to failure when it is nullified by another thread, due to the transaction being distributed.

This issue is sometimes seen in my AppVeyor tests with NHibernate, as you may see with stack trace at following url (the null ref is encapsulated as an inner exception, so look down for the SQLite stack): https://ci.appveyor.com/project/fredericDelaporte/nhibernate-core/build/5.1.0.41/job/jwl0hbbsq20ioxu3/tests

I have set this as minor since SQLite does not have actually a two phases commit as far as I know, so using it in a distributed transaction is anyway not very recommandable.

mistachkin added on 2018-01-24 16:33:54:

How is it that the enlistment and connection are being used by different threads?

The SQLiteConnection object is not designed to be thread-safe.

anonymous (claiming to be fredericDelaporte) added on 2018-01-24 17:44:11:

> The SQLiteConnection object is not designed to be thread-safe.

But unfortunately it must support some minimal thread safety for being used with distributed transaction scopes, because distributed scopes do complete on different threads, without any option for the user to prevent that.

This is not a well documented concern, the best reference I have is the link I have provided:
https://github.com/npgsql/npgsql/issues/1571#issuecomment-308651461
As said by Roji on the following comment:
> First, as this affects application developers - who may assume that once TransactionScope.Complete() returns all resource managers have completed their individual commits - it may be good to add an explicit note about this in the documentation.

But that is not done as far as I know, so I guess most developers still assume that after a scope is disposed (the actual thing triggering its completion, Roji made a small mistake by writing Complete), all resources managers have finished their job. But that is not true when the scope is ditributed and they may be running their completion callbacks (phase 2 notification or TransactionCompleted event) on concurrent threads at that point. (Which by the way means those callbacks, if using the transaction object, must deal with the possibility it is already disposed. Causing accesses to TransactionInformation to throw...)

In the case of NHibernate (to which I contribute), I have put in place a synchronization lock (on the NHibernate session, not meant to be thread safe too), activated at the prepare phase and released once receiving the scope completion callback calls. It solves NHibernate own issues triggered by this distributed scope behavior. But it does not avoid the issue for the SQLiteConnection, since there is no way to guarantee its own completion callback will be called before the one of NHibernate.

So I think either SQLiteConnection has to do something similar for supporting distributed scopes, or maybe should it states somewhere it does only support non-distributed scopes (which is somewhat the case anyway since it does not have a two phases commit) and will have up-to threading issues with distributed ones.

A solution could also be put in place just for the null reference case I report, but then it is likely that other issues may stay, notably when using different scopes consecutively on the same connection.

mistachkin added on 2018-01-24 17:48:34:

I can certainly fix the NullReferenceException; however, I would really like to
understand the issue fully.

anonymous added on 2018-01-24 17:55:34:

Another way to ascertain what the previous link and I claim, is to check out these unit tests:
https://github.com/nhibernate/nhibernate-core/blob/5.0.3/src/NHibernate.Test/SystemTransactions/ResourceManagerFixture.cs

These are pure System.Transactions tests, no references to NHibernate classes are done in them. They demonstrate that distributed scopes run transaction completion callbacks on dedicated threads, introducing thread safety concerns, including for code running after scope disposal.

mistachkin added on 2018-01-24 17:58:48:

I think I've got a good fix for this issue.  If I commit it to a branch, are you
able to confirm the fix in your environment?

anonymous (claiming to be fredericDelaporte) added on 2018-01-24 18:05:29:

I do not have a Fossil client currently. The branch would need to be web browsable, and I would apply its changes over a SQLite release source zip.

Note about the tests from my previous comment: their multi-threaded aspect appears in the logs they output. So it needs some log4net configuration.

mistachkin added on 2018-01-24 18:13:05:

You can now download a ZIP containing the proposed fix from:

[/zip/System.Data.SQLite-8f44046b6f.zip?r=8f44046b6f672645]

anonymous (claiming to be fredericDelaporte) added on 2018-01-24 18:41:44:

It does not look to be the intended zip. It contains only a "Doc" and an "Externals" folders, without SQLite binaries or sources.

mistachkin added on 2018-01-24 19:03:07:

The ZIP file is correct when I download it here.  Here is a hard-wired link:

[https://system.data.sqlite.org/temporary/System.Data.SQLite-8f44046b6f.zip]

Please let me know when you've downloaded it so that I can remove it from the
server.  It should be 110389944 bytes in size with a SHA1 of:

f3a997e90324b7390e07bd5e8389ba144a0d4a20

anonymous (claiming to be fredericDelaporte) added on 2018-01-24 19:33:39:

Thanks, now it is complete. (The previous one was 95Mb but I have not attempted to download it again. 7zip "check" feature was detecting a truncation, browser download or unzipping  was not telling anything wrong.)

anonymous (claiming to be fredericDelaporte) added on 2018-01-25 17:40:20:

The null reference exception is fixed, but it experiences another trouble, because the lock is not enough "wide" on enlistment side.

    public void Commit(Enlistment enlistment)
    {
      CheckDisposed();

      SQLiteConnection cnn = _transaction.Connection;

      lock (cnn._enlistmentSyncRoot)
      {
          cnn._enlistment = null;
      }

      try
      {
        _transaction.IsValid(true);
        _transaction.Connection._transactionLevel = 1;
        _transaction.Commit();

        enlistment.Done();
      }
      finally
      {
        Cleanup(cnn);
      }
    }

The _transaction.Commit() can fail with a message "Connection was closed, statement was terminated". And between putting the connection in a local variable and locking it, it can be switched for another connection.

So I think the code should be:

    public void Commit(Enlistment enlistment)
    {
        CheckDisposed();
        SQLiteConnection cnn = null;
        try
        {
            var done = false;
            while (!done)
            {
                cnn = _transaction.Connection;
                lock (cnn._enlistmentSyncRoot)
                {
                    // While acquiring the lock, the connection may have been switched.
                    if (cnn != _transaction.Connection)
                        continue;
                    cnn._enlistment = null;
                    _transaction.IsValid(true);
                    _transaction.Connection._transactionLevel = 1;
                    _transaction.Commit();
                    done = true;
                }
            }

            enlistment.Done();
        }
        finally
        {
            if (cnn != null)
                Cleanup(cnn);
        }
    }

With above code my test for closing the connection after a distributed transaction does no more experiences failures, while otherwise it was failing around the third iteration. My test is:

	[Test]
	public void Loop1()
	{
		for (var i = 0; i < 1000; i++)
		{
			try
			{
				var cnx = Sfi.ConnectionProvider.GetConnection();
				try
				{
					using (var tx = new TransactionScope())
					{
						ForceEscalationToDistributedTx.Escalate();

						Assert.AreNotEqual(
							Guid.Empty,
							System.Transactions.Transaction.Current.TransactionInformation.DistributedIdentifier,
							"Transaction lacks a distributed identifier");

						cnx.EnlistTransaction(System.Transactions.Transaction.Current);
						using (var cmd = cnx.CreateCommand())
						{
							cmd.CommandText = "insert into Person (Id, NotNullData) values (@id, '')";
							var param = cmd.CreateParameter();
							param.DbType = DbType.Int32;
							param.ParameterName = "@id";
							param.Value = i;
							cmd.Parameters.Add(param);
							cmd.ExecuteNonQuery();
						}

						tx.Complete();
					}
				}
				finally
				{
					Sfi.ConnectionProvider.CloseConnection(cnx);
				}
			}
			catch
			{
				Console.WriteLine("Failed at iteration " + i);
				throw;
			}
		}
	}

(Where Sfi is a NHibernate thing that I use to get the SQLiteConnection on a test database with some tables in it, purged at each test run.)

anonymous (claiming to be fredericDelaporte) added on 2018-01-25 18:04:02:

Of course a similar adjustment to the lock should be done for Rollback too.

But the lock will still be not enough "wide" for another case, which may be worth another ticket: using the same connection in consecutive scopes. In this case, the subsequent scope may fail on enlistment with message "Already enlisted in a transaction", although the previous scope disposal has already been exited.

For supporting such a case a lock like a Semaphore is required. On connection side, its locking/unlocking would be don instead of current locks added by your patch. But on enlistment side, it would have to be locked from prepare phase and unlocked from Commit/Rollback.

It will still not be enough for the trickiest case: rollback. A rollback can happen without a prepare phase, and then it is hard to lock "enough soon" in the transaction completion process. On NHibernate side, I have used a "Wait" method on the enlisted resource, called from the session (so would be from the connection for SQLite) which activate the lock if it detects that the system transaction is no more active but the enlistment is not finished yet. (And here there is another trap when checking the system transaction: it may even already be disposed of, and ObjectDisposedException should then be catched, considering in such case that the transaction is no more active.)

Yes, scopes, when distributed, are quite a messy thing.

(About previous comment test case, source code of ForceEscalationToDistributedTx is at: https://github.com/nhibernate/nhibernate-core/blob/5.0.3/src/NHibernate.Test/SystemTransactions/DistributedSystemTransactionFixture.cs#L737 )

mistachkin added on 2018-01-25 18:44:29:

This is starting to become a much bigger change.  I need to figure out a good way
to test these changes.

mistachkin added on 2018-01-25 18:47:12:

Also, for the new code to be correct, I think that the line:

    _transaction.Connection._transactionLevel = 1;

Must be changed to:

    cnn._transactionLevel = 1;

Otherwise, there may still be a race condition here?

mistachkin added on 2018-01-25 19:20:57:

There is already an open ticket for consecutive transaction scopes, [7e1dd697dc].
If you have more information on that, please add it there.

mistachkin added on 2018-01-25 21:06:56:

I think I've got the branch changes in a good state.  Could you please test the
following code in your environment:

[https://system.data.sqlite.org/temporary/System.Data.SQLite-6d391f764a.zip]

It should be 110391457 bytes in size with a SHA1 of:

024de929763f9c13261bd753bac1b5b1001d21d7

mistachkin added on 2018-01-26 02:35:14:

Hold on a bit, I need to upload a newer file as the previous one appears to have
some issues.

mistachkin added on 2018-01-26 03:10:42:

The newly updated file is:

[https://system.data.sqlite.org/temporary/System.Data.SQLite-73c64a8452.zip]

It should be 110392470 bytes in size with a SHA1 of:

6f11972cbbb8ec50d81dac1faa6f129a38172428

mistachkin added on 2018-01-26 03:11:44:

This may also be related to [fcb5a89f5c].

anonymous (claiming to be fredericDelaporte) added on 2018-01-27 00:05:20:

> Must be changed to:
>
>    cnn._transactionLevel = 1;
>
> Otherwise, there may still be a race condition here?

No, it is not needed with my proposed code, since it is in the locked section. But doing that change cannot harm either.

> There is already an open ticket for consecutive transaction scopes

Well, this is also a ticket of mine. But is is strictly speaking another subject, and well, maybe not much legit. The real issue is not the failure of enlisting with null, we should not need to do such a thing. The real issue is the one I have explained here, the possibility to fail an enlistment in the next scope due to the previous being not already "cleaned" out of the connection. (By the way I have not seen your question on this other ticket, sorry. But without a mean to register, and having to input things as "anonymous", it does not help to get notified.)

I have downloaded the last zip, I will look into it later.

mistachkin added on 2018-01-27 04:15:06:

Thanks for the update.

For the consecutive transaction scopes (the other ticket), I do not fully
understand the problem.  The _enlistment field is cleared out when the
transaction is committed -OR- rolled back via the enlistment.  Also, it
is cleared out when the connection is closed.

When else would it need to be cleared out?

anonymous (claiming to be fredericDelaporte) added on 2018-01-27 08:07:00:

About this ticket here, the connection closing failure that may happen with distributed scopes due to their intrinsic asynchronism, I confirm your last zip does not more have any issue.

About the scopes chaining, I will comment further on the other ticket for avoiding polluting to much the ticket here.

https://system.data.sqlite.org/index.html/info/7e1dd697dc

mistachkin added on 2018-01-27 20:03:43:

Thanks for helping to confirm the fix.