System.Data.SQLite
View Ticket
Not logged in
Ticket UUID: 5cee5409f84da5f6217221a09a5020bb49ac1ec7
Title: Null reference exception on connection close
Status: Closed Type: Incident
Severity: Minor Priority: Blocker
Subsystem: Connection Resolution: Fixed
Last Modified: 2018-01-28 05:31:11
Version Found In: 1.0.102
User Comments:
anonymous added on 2017-12-25 17:37:12:
When closing a connection, a null reference exception may occur.
Stack trace indicates it occurs directly in the SQLiteConnection.Close method.

This happens when the connection was used with a distributed transaction scope, already disposed at time of closing.

Checking 1.0.106 code, I highly suspect the _enlisment member usage to cause this. It is checked for null then later used if not null, as if assuming it could not be nullify by a concurrent thread finishing the system transaction.

Unfortunately, when a system transaction was distributed, the scope disposal leave as soon as the first phase is finished (all votes for commit or rollback collected), letting code following the scope running, while processing on distinct threads the second phases of enlisted ressources and the transaction completed event.

You can read more on this on a discussion with a Microsoft employee working on MSDTC at following url: https://github.com/npgsql/npgsql/issues/1571#issuecomment-308651461

So this means that connection's enlistment cleanup logic, likely coded on a second phase notification or transaction completed event (I have not checked) can run concurrently to the connection closing even after having left the scope disposal.

In such context, checking a member for null then using it as do SQLiteConnection.Close is prone to failure when it is nullified by another thread, due to the transaction being distributed.

This issue is sometimes seen in my AppVeyor tests with NHibernate, as you may see with stack trace at following url (the null ref is encapsulated as an inner exception, so look down for the SQLite stack): https://ci.appveyor.com/project/fredericDelaporte/nhibernate-core/build/5.1.0.41/job/jwl0hbbsq20ioxu3/tests

I have set this as minor since SQLite does not have actually a two phases commit as far as I know, so using it in a distributed transaction is anyway not very recommandable.

mistachkin added on 2018-01-24 16:33:54:
How is it that the enlistment and connection are being used by different threads?

The SQLiteConnection object is not designed to be thread-safe.

anonymous (claiming to be fredericDelaporte) added on 2018-01-24 17:44:11:
> The SQLiteConnection object is not designed to be thread-safe.

But unfortunately it must support some minimal thread safety for being used with distributed transaction scopes, because distributed scopes do complete on different threads, without any option for the user to prevent that.

This is not a well documented concern, the best reference I have is the link I have provided:
https://github.com/npgsql/npgsql/issues/1571#issuecomment-308651461
As said by Roji on the following comment:
> First, as this affects application developers - who may assume that once TransactionScope.Complete() returns all resource managers have completed their individual commits - it may be good to add an explicit note about this in the documentation.

But that is not done as far as I know, so I guess most developers still assume that after a scope is disposed (the actual thing triggering its completion, Roji made a small mistake by writing Complete), all resources managers have finished their job. But that is not true when the scope is ditributed and they may be running their completion callbacks (phase 2 notification or TransactionCompleted event) on concurrent threads at that point. (Which by the way means those callbacks, if using the transaction object, must deal with the possibility it is already disposed. Causing accesses to TransactionInformation to throw...)

In the case of NHibernate (to which I contribute), I have put in place a synchronization lock (on the NHibernate session, not meant to be thread safe too), activated at the prepare phase and released once receiving the scope completion callback calls. It solves NHibernate own issues triggered by this distributed scope behavior. But it does not avoid the issue for the SQLiteConnection, since there is no way to guarantee its own completion callback will be called before the one of NHibernate.

So I think either SQLiteConnection has to do something similar for supporting distributed scopes, or maybe should it states somewhere it does only support non-distributed scopes (which is somewhat the case anyway since it does not have a two phases commit) and will have up-to threading issues with distributed ones.

A solution could also be put in place just for the null reference case I report, but then it is likely that other issues may stay, notably when using different scopes consecutively on the same connection.

mistachkin added on 2018-01-24 17:48:34:
I can certainly fix the NullReferenceException; however, I would really like to
understand the issue fully.

anonymous added on 2018-01-24 17:55:34:
Another way to ascertain what the previous link and I claim, is to check out these unit tests:
https://github.com/nhibernate/nhibernate-core/blob/5.0.3/src/NHibernate.Test/SystemTransactions/ResourceManagerFixture.cs

These are pure System.Transactions tests, no references to NHibernate classes are done in them. They demonstrate that distributed scopes run transaction completion callbacks on dedicated threads, introducing thread safety concerns, including for code running after scope disposal.

mistachkin added on 2018-01-24 17:58:48:
I think I've got a good fix for this issue.  If I commit it to a branch, are you
able to confirm the fix in your environment?

anonymous (claiming to be fredericDelaporte) added on 2018-01-24 18:05:29:
I do not have a Fossil client currently. The branch would need to be web browsable, and I would apply its changes over a SQLite release source zip.

Note about the tests from my previous comment: their multi-threaded aspect appears in the logs they output. So it needs some log4net configuration.

mistachkin added on 2018-01-24 18:13:05:
You can now download a ZIP containing the proposed fix from:

[/zip/System.Data.SQLite-8f44046b6f.zip?r=8f44046b6f672645]

anonymous (claiming to be fredericDelaporte) added on 2018-01-24 18:41:44:
It does not look to be the intended zip. It contains only a "Doc" and an "Externals" folders, without SQLite binaries or sources.

mistachkin added on 2018-01-24 19:03:07:
The ZIP file is correct when I download it here.  Here is a hard-wired link:

[https://system.data.sqlite.org/temporary/System.Data.SQLite-8f44046b6f.zip]

Please let me know when you've downloaded it so that I can remove it from the
server.  It should be 110389944 bytes in size with a SHA1 of:

f3a997e90324b7390e07bd5e8389ba144a0d4a20

anonymous (claiming to be fredericDelaporte) added on 2018-01-24 19:33:39:
Thanks, now it is complete. (The previous one was 95Mb but I have not attempted to download it again. 7zip "check" feature was detecting a truncation, browser download or unzipping  was not telling anything wrong.)

anonymous (claiming to be fredericDelaporte) added on 2018-01-25 17:40:20:
The null reference exception is fixed, but it experiences another trouble, because the lock is not enough "wide" on enlistment side.

    public void Commit(Enlistment enlistment)
    {
      CheckDisposed();

      SQLiteConnection cnn = _transaction.Connection;

      lock (cnn._enlistmentSyncRoot)
      {
          cnn._enlistment = null;
      }

      try
      {
        _transaction.IsValid(true);
        _transaction.Connection._transactionLevel = 1;
        _transaction.Commit();

        enlistment.Done();
      }
      finally
      {
        Cleanup(cnn);
      }
    }

The _transaction.Commit() can fail with a message "Connection was closed, statement was terminated". And between putting the connection in a local variable and locking it, it can be switched for another connection.

So I think the code should be:

    public void Commit(Enlistment enlistment)
    {
        CheckDisposed();
        SQLiteConnection cnn = null;
        try
        {
            var done = false;
            while (!done)
            {
                cnn = _transaction.Connection;
                lock (cnn._enlistmentSyncRoot)
                {
                    // While acquiring the lock, the connection may have been switched.
                    if (cnn != _transaction.Connection)
                        continue;
                    cnn._enlistment = null;
                    _transaction.IsValid(true);
                    _transaction.Connection._transactionLevel = 1;
                    _transaction.Commit();
                    done = true;
                }
            }

            enlistment.Done();
        }
        finally
        {
            if (cnn != null)
                Cleanup(cnn);
        }
    }

With above code my test for closing the connection after a distributed transaction does no more experiences failures, while otherwise it was failing around the third iteration. My test is:

	[Test]
	public void Loop1()
	{
		for (var i = 0; i < 1000; i++)
		{
			try
			{
				var cnx = Sfi.ConnectionProvider.GetConnection();
				try
				{
					using (var tx = new TransactionScope())
					{
						ForceEscalationToDistributedTx.Escalate();

						Assert.AreNotEqual(
							Guid.Empty,
							System.Transactions.Transaction.Current.TransactionInformation.DistributedIdentifier,
							"Transaction lacks a distributed identifier");

						cnx.EnlistTransaction(System.Transactions.Transaction.Current);
						using (var cmd = cnx.CreateCommand())
						{
							cmd.CommandText = "insert into Person (Id, NotNullData) values (@id, '')";
							var param = cmd.CreateParameter();
							param.DbType = DbType.Int32;
							param.ParameterName = "@id";
							param.Value = i;
							cmd.Parameters.Add(param);
							cmd.ExecuteNonQuery();
						}

						tx.Complete();
					}
				}
				finally
				{
					Sfi.ConnectionProvider.CloseConnection(cnx);
				}
			}
			catch
			{
				Console.WriteLine("Failed at iteration " + i);
				throw;
			}
		}
	}

(Where Sfi is a NHibernate thing that I use to get the SQLiteConnection on a test database with some tables in it, purged at each test run.)

anonymous (claiming to be fredericDelaporte) added on 2018-01-25 18:04:02:
Of course a similar adjustment to the lock should be done for Rollback too.

But the lock will still be not enough "wide" for another case, which may be worth another ticket: using the same connection in consecutive scopes. In this case, the subsequent scope may fail on enlistment with message "Already enlisted in a transaction", although the previous scope disposal has already been exited.

For supporting such a case a lock like a Semaphore is required. On connection side, its locking/unlocking would be don instead of current locks added by your patch. But on enlistment side, it would have to be locked from prepare phase and unlocked from Commit/Rollback.

It will still not be enough for the trickiest case: rollback. A rollback can happen without a prepare phase, and then it is hard to lock "enough soon" in the transaction completion process. On NHibernate side, I have used a "Wait" method on the enlisted resource, called from the session (so would be from the connection for SQLite) which activate the lock if it detects that the system transaction is no more active but the enlistment is not finished yet. (And here there is another trap when checking the system transaction: it may even already be disposed of, and ObjectDisposedException should then be catched, considering in such case that the transaction is no more active.)

Yes, scopes, when distributed, are quite a messy thing.

(About previous comment test case, source code of ForceEscalationToDistributedTx is at: https://github.com/nhibernate/nhibernate-core/blob/5.0.3/src/NHibernate.Test/SystemTransactions/DistributedSystemTransactionFixture.cs#L737 )

mistachkin added on 2018-01-25 18:44:29:
This is starting to become a much bigger change.  I need to figure out a good way
to test these changes.

mistachkin added on 2018-01-25 18:47:12:
Also, for the new code to be correct, I think that the line:

    _transaction.Connection._transactionLevel = 1;

Must be changed to:

    cnn._transactionLevel = 1;

Otherwise, there may still be a race condition here?

mistachkin added on 2018-01-25 19:20:57:
There is already an open ticket for consecutive transaction scopes, [7e1dd697dc].
If you have more information on that, please add it there.

mistachkin added on 2018-01-25 21:06:56:
I think I've got the branch changes in a good state.  Could you please test the
following code in your environment:

[https://system.data.sqlite.org/temporary/System.Data.SQLite-6d391f764a.zip]

It should be 110391457 bytes in size with a SHA1 of:

024de929763f9c13261bd753bac1b5b1001d21d7

mistachkin added on 2018-01-26 02:35:14:
Hold on a bit, I need to upload a newer file as the previous one appears to have
some issues.

mistachkin added on 2018-01-26 03:10:42:
The newly updated file is:

[https://system.data.sqlite.org/temporary/System.Data.SQLite-73c64a8452.zip]

It should be 110392470 bytes in size with a SHA1 of:

6f11972cbbb8ec50d81dac1faa6f129a38172428

mistachkin added on 2018-01-26 03:11:44:
This may also be related to [fcb5a89f5c].

anonymous (claiming to be fredericDelaporte) added on 2018-01-27 00:05:20:
> Must be changed to:
>
>    cnn._transactionLevel = 1;
>
> Otherwise, there may still be a race condition here?

No, it is not needed with my proposed code, since it is in the locked section. But doing that change cannot harm either.

> There is already an open ticket for consecutive transaction scopes

Well, this is also a ticket of mine. But is is strictly speaking another subject, and well, maybe not much legit. The real issue is not the failure of enlisting with null, we should not need to do such a thing. The real issue is the one I have explained here, the possibility to fail an enlistment in the next scope due to the previous being not already "cleaned" out of the connection. (By the way I have not seen your question on this other ticket, sorry. But without a mean to register, and having to input things as "anonymous", it does not help to get notified.)

I have downloaded the last zip, I will look into it later.

mistachkin added on 2018-01-27 04:15:06:
Thanks for the update.

For the consecutive transaction scopes (the other ticket), I do not fully
understand the problem.  The _enlistment field is cleared out when the
transaction is committed -OR- rolled back via the enlistment.  Also, it
is cleared out when the connection is closed.

When else would it need to be cleared out?

anonymous (claiming to be fredericDelaporte) added on 2018-01-27 08:07:00:
About this ticket here, the connection closing failure that may happen with distributed scopes due to their intrinsic asynchronism, I confirm your last zip does not more have any issue.

About the scopes chaining, I will comment further on the other ticket for avoiding polluting to much the ticket here.

https://system.data.sqlite.org/index.html/info/7e1dd697dc

mistachkin added on 2018-01-27 20:03:43:
Thanks for helping to confirm the fix.