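This post corrects a misconception I held for a long time about OIDs (Object Identifiers): I had always assumed that OIDs are globally unique. One reason for this is that a book I read when I first started with PostgreSQL described them in a way that left me with that impression.
I am not sure whether other readers come away from that description with the same feeling. But even though I knew that create database copies the system catalogs from the template, my brain still managed to believe, quite self-consistently, that OIDs are globally unique within each database. Even when I saw that PG12 removed the WITH OIDS option for table creation, I talked myself into it with the explanation that WITH OIDS would exhaust the OID space.
It was only recently, while reading section 8.19, Object Identifier Types, of the PG documentation, that I started to question this long-held misconception: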
Object identifiers (OIDs) are used internally by PostgreSQL as primary keys for various system tables.
The oid type is currently implemented as an unsigned four-byte integer. Therefore, it is not large enough to provide database-wide uniqueness in large databases, or even in large individual tables.
Then I went through the source code. GetNewObjectId generates OIDs from a single cluster-wide counter, but that counter wraps around:
/*
 * GetNewObjectId -- allocate a new OID
 *
 * OIDs are generated by a cluster-wide counter.  Since they are only 32 bits
 * wide, counter wraparound will occur eventually, and therefore it is unwise
 * to assume they are unique unless precautions are taken to make them so.
 * Hence, this routine should generally not be used directly.  The only direct
 * callers should be GetNewOidWithIndex() and GetNewRelFileNumber() in
 * catalog/catalog.c.
 */
Oid
GetNewObjectId(void)
{
    Oid         result;

    /* safety check, we should never get this far in a HS standby */
    if (RecoveryInProgress())
        elog(ERROR, "cannot assign OIDs during recovery");

    LWLockAcquire(OidGenLock, LW_EXCLUSIVE);

    /*
     * Check for wraparound of the OID counter.  We *must* not return 0
     * (InvalidOid), and in normal operation we mustn't return anything below
     * FirstNormalObjectId since that range is reserved for initdb (see
     * IsCatalogRelationOid()).  Note we are relying on unsigned comparison.
     *
     * During initdb, we start the OID generator at FirstGenbkiObjectId, so we
     * only wrap if before that point when in bootstrap or standalone mode.
     * The first time through this routine after normal postmaster start, the
     * counter will be forced up to FirstNormalObjectId.  This mechanism
     * leaves the OIDs between FirstGenbkiObjectId and FirstNormalObjectId
     * available for automatic assignment during initdb, while ensuring they
     * will never conflict with user-assigned OIDs.
     */
    if (TransamVariables->nextOid < ((Oid) FirstNormalObjectId))
    {
        if (IsPostmasterEnvironment)
        {
            /* wraparound, or first post-initdb assignment, in normal mode */
            TransamVariables->nextOid = FirstNormalObjectId;
            TransamVariables->oidCount = 0;
        }
        else
        {
            /* we may be bootstrapping, so don't enforce the full range */
            if (TransamVariables->nextOid < ((Oid) FirstGenbkiObjectId))
            {
                /* wraparound in standalone mode (unlikely but possible) */
                TransamVariables->nextOid = FirstNormalObjectId;
                TransamVariables->oidCount = 0;
            }
        }
    }

    /* If we run out of logged for use oids then we must log more */
    if (TransamVariables->oidCount == 0)
    {
        XLogPutNextOid(TransamVariables->nextOid + VAR_OID_PREFETCH);
        TransamVariables->oidCount = VAR_OID_PREFETCH;
    }

    result = TransamVariables->nextOid;

    (TransamVariables->nextOid)++;
    (TransamVariables->oidCount)--;

    LWLockRelease(OidGenLock);

    return result;
}
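To make the wraparound rule easier to see in isolation, here is a minimal standalone sketch of just the counter logic in normal (postmaster) mode. It is not PostgreSQL code: the names OidCounterSketch and sketch_next_oid are hypothetical, the value 16384 is assumed to match FirstNormalObjectId, and the locking, WAL logging, and bootstrap handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: same value as PostgreSQL's FirstNormalObjectId. */
#define SKETCH_FIRST_NORMAL_OID ((uint32_t) 16384)

/* Hypothetical stand-in for the shared nextOid counter. */
typedef struct OidCounterSketch
{
    uint32_t    next_oid;
} OidCounterSketch;

/*
 * Hand out the next OID.  When the 32-bit counter overflows it lands at 0,
 * inside the reserved range [0, SKETCH_FIRST_NORMAL_OID), and is forced
 * back up before the next value is returned.
 */
static uint32_t
sketch_next_oid(OidCounterSketch *counter)
{
    uint32_t    result;

    assert(counter != NULL);

    /* relies on unsigned comparison, like the original */
    if (counter->next_oid < SKETCH_FIRST_NORMAL_OID)
        counter->next_oid = SKETCH_FIRST_NORMAL_OID;

    result = counter->next_oid;
    counter->next_oid++;        /* unsigned overflow wraps to 0 */

    return result;
}
```

Starting the counter at 4294967295 (UINT32_MAX), three consecutive calls return 4294967295, 16384 and 16385. The counter itself makes no attempt to remember which values are still in use, which is exactly why it cannot guarantee uniqueness on its own.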
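As the comment says, the only two direct callers of this function are GetNewOidWithIndex and GetNewRelFileNumber.
After obtaining an OID, GetNewOidWithIndex uses the system catalog's OID index to check whether that OID already exists in that catalog. If it does, it calls GetNewObjectId again to get the next OID.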
/*
 * GetNewOidWithIndex
 *      Generate a new OID that is unique within the system relation.
 *
 * Since the OID is not immediately inserted into the table, there is a
 * race condition here; but a problem could occur only if someone else
 * managed to cycle through 2^32 OIDs and generate the same OID before we
 * finish inserting our row.  This seems unlikely to be a problem.  Note
 * that if we had to *commit* the row to end the race condition, the risk
 * would be rather higher; therefore we use SnapshotAny in the test, so that
 * we will see uncommitted rows.  (We used to use SnapshotDirty, but that has
 * the disadvantage that it ignores recently-deleted rows, creating a risk
 * of transient conflicts for as long as our own MVCC snapshots think a
 * recently-deleted row is live.  The risk is far higher when selecting TOAST
 * OIDs, because SnapshotToast considers dead rows as active indefinitely.)
 *
 * Note that we are effectively assuming that the table has a relatively small
 * number of entries (much less than 2^32) and there aren't very long runs of
 * consecutive existing OIDs.  This is a mostly reasonable assumption for
 * system catalogs.
 *
 * Caller must have a suitable lock on the relation.
 */
Oid
GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
{
    Oid         newOid;
    SysScanDesc scan;
    ScanKeyData key;
    bool        collides;
    uint64      retries = 0;
    uint64      retries_before_log = GETNEWOID_LOG_THRESHOLD;

    /* Only system relations are supported */
    Assert(IsSystemRelation(relation));

    /* In bootstrap mode, we don't have any indexes to use */
    if (IsBootstrapProcessingMode())
        return GetNewObjectId();

    /*
     * We should never be asked to generate a new pg_type OID during
     * pg_upgrade; doing so would risk collisions with the OIDs it wants to
     * assign.  Hitting this assert means there's some path where we failed to
     * ensure that a type OID is determined by commands in the dump script.
     */
    Assert(!IsBinaryUpgrade || RelationGetRelid(relation) != TypeRelationId);

    /* Generate new OIDs until we find one not in the table */
    do
    {
        CHECK_FOR_INTERRUPTS();

        newOid = GetNewObjectId();

        ScanKeyInit(&key,
                    oidcolumn,
                    BTEqualStrategyNumber, F_OIDEQ,
                    ObjectIdGetDatum(newOid));

        /* see notes above about using SnapshotAny */
        scan = systable_beginscan(relation, indexId, true,
                                  SnapshotAny, 1, &key);

        collides = HeapTupleIsValid(systable_getnext(scan));

        systable_endscan(scan);

        /*
         * Log that we iterate more than GETNEWOID_LOG_THRESHOLD but have not
         * yet found OID unused in the relation.  Then repeat logging with
         * exponentially increasing intervals until we iterate more than
         * GETNEWOID_LOG_MAX_INTERVAL.  Finally repeat logging every
         * GETNEWOID_LOG_MAX_INTERVAL unless an unused OID is found.  This
         * logic is necessary not to fill up the server log with the similar
         * messages.
         */
        if (retries >= retries_before_log)
        {
            ereport(LOG,
                    (errmsg("still searching for an unused OID in relation \"%s\"",
                            RelationGetRelationName(relation)),
                     errdetail_plural("OID candidates have been checked %llu time, but no unused OID has been found yet.",
                                      "OID candidates have been checked %llu times, but no unused OID has been found yet.",
                                      retries,
                                      (unsigned long long) retries)));

            /*
             * Double the number of retries to do before logging next until it
             * reaches GETNEWOID_LOG_MAX_INTERVAL.
             */
            if (retries_before_log * 2 <= GETNEWOID_LOG_MAX_INTERVAL)
                retries_before_log *= 2;
            else
                retries_before_log += GETNEWOID_LOG_MAX_INTERVAL;
        }

        retries++;
    } while (collides);

    /*
     * If at least one log message is emitted, also log the completion of OID
     * assignment.
     */
    if (retries > GETNEWOID_LOG_THRESHOLD)
    {
        ereport(LOG,
                (errmsg_plural("new OID has been assigned in relation \"%s\" after %llu retry",
                               "new OID has been assigned in relation \"%s\" after %llu retries",
                               retries,
                               RelationGetRelationName(relation), (unsigned long long) retries)));
    }

    return newOid;
}
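The retry loop boils down to "draw from the shared counter until the value is unused in this particular catalog". The following is a toy sketch of that idea, assuming a small in-memory array in place of a catalog and its unique OID index. ToyCatalog, toy_counter_next and toy_new_oid_for_catalog are hypothetical names, and snapshots, locking, and logging are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FIRST_NORMAL_OID ((uint32_t) 16384)  /* assumed FirstNormalObjectId */
#define TOY_CATALOG_CAPACITY 8

/* Hypothetical toy catalog: an array of OIDs standing in for table + index. */
typedef struct ToyCatalog
{
    uint32_t    oids[TOY_CATALOG_CAPACITY];
    size_t      count;
} ToyCatalog;

/* Stand-in for the index lookup: is this OID already present? */
static bool
toy_catalog_contains(const ToyCatalog *cat, uint32_t oid)
{
    for (size_t i = 0; i < cat->count; i++)
    {
        if (cat->oids[i] == oid)
            return true;
    }
    return false;
}

/* Shared counter step with the same wraparound rule as the sketch above. */
static uint32_t
toy_counter_next(uint32_t *next_oid)
{
    if (*next_oid < TOY_FIRST_NORMAL_OID)
        *next_oid = TOY_FIRST_NORMAL_OID;
    return (*next_oid)++;
}

/* Draw OIDs until one is unused in THIS catalog, then record it there. */
static uint32_t
toy_new_oid_for_catalog(ToyCatalog *cat, uint32_t *next_oid)
{
    uint32_t    oid;

    assert(cat->count < TOY_CATALOG_CAPACITY);

    do
    {
        oid = toy_counter_next(next_oid);
    } while (toy_catalog_contains(cat, oid));

    cat->oids[cat->count++] = oid;
    return oid;
}
```

With catalog a already holding 16384 and 16385, an empty catalog b, and the counter at UINT32_MAX, two draws for b yield 4294967295 and then 16384, so 16384 now exists in both a and b. A following draw for a skips 16385 and yields 16386. Uniqueness is enforced per catalog, not across catalogs.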
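GetNewRelFileNumber needs to find an OID that is unique within the tablespace (more precisely, as its comment puts it, within the database of the given tablespace). When that OID will also serve as the table's OID in pg_class (and stay with the table for its whole lifetime), it must additionally be unique within pg_class, which is guaranteed by calling GetNewOidWithIndex. Uniqueness within the tablespace is checked after the OID has been obtained, by calling access to see whether the corresponding main fork (MAIN_FORKNUM) file already exists there.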
/*
 * GetNewRelFileNumber
 *      Generate a new relfilenumber that is unique within the
 *      database of the given tablespace.
 *
 * If the relfilenumber will also be used as the relation's OID, pass the
 * opened pg_class catalog, and this routine will guarantee that the result
 * is also an unused OID within pg_class.  If the result is to be used only
 * as a relfilenumber for an existing relation, pass NULL for pg_class.
 *
 * As with GetNewOidWithIndex(), there is some theoretical risk of a race
 * condition, but it doesn't seem worth worrying about.
 *
 * Note: we don't support using this in bootstrap mode.  All relations
 * created by bootstrap have preassigned OIDs, so there's no need.
 */
RelFileNumber
GetNewRelFileNumber(Oid reltablespace, Relation pg_class, char relpersistence)
{
    RelFileLocatorBackend rlocator;
    char       *rpath;
    bool        collides;
    ProcNumber  procNumber;

    /*
     * If we ever get here during pg_upgrade, there's something wrong; all
     * relfilenumber assignments during a binary-upgrade run should be
     * determined by commands in the dump script.
     */
    Assert(!IsBinaryUpgrade);

    switch (relpersistence)
    {
        case RELPERSISTENCE_TEMP:
            procNumber = ProcNumberForTempRelations();
            break;
        case RELPERSISTENCE_UNLOGGED:
        case RELPERSISTENCE_PERMANENT:
            procNumber = INVALID_PROC_NUMBER;
            break;
        default:
            elog(ERROR, "invalid relpersistence: %c", relpersistence);
            return InvalidRelFileNumber;    /* placate compiler */
    }

    /* This logic should match RelationInitPhysicalAddr */
    rlocator.locator.spcOid = reltablespace ? reltablespace : MyDatabaseTableSpace;
    rlocator.locator.dbOid =
        (rlocator.locator.spcOid == GLOBALTABLESPACE_OID) ?
        InvalidOid : MyDatabaseId;

    /*
     * The relpath will vary based on the backend number, so we must
     * initialize that properly here to make sure that any collisions based on
     * filename are properly detected.
     */
    rlocator.backend = procNumber;

    do
    {
        CHECK_FOR_INTERRUPTS();

        /* Generate the OID */
        if (pg_class)
            rlocator.locator.relNumber = GetNewOidWithIndex(pg_class, ClassOidIndexId,
                                                            Anum_pg_class_oid);
        else
            rlocator.locator.relNumber = GetNewObjectId();

        /* Check for existing file of same name */
        rpath = relpath(rlocator, MAIN_FORKNUM);

        if (access(rpath, F_OK) == 0)
        {
            /* definite collision */
            collides = true;
        }
        else
        {
            /*
             * Here we have a little bit of a dilemma: if errno is something
             * other than ENOENT, should we declare a collision and loop?  In
             * practice it seems best to go ahead regardless of the errno.  If
             * there is a colliding file we will get an smgr failure when we
             * attempt to create the new relation file.
             */
            collides = false;
        }

        pfree(rpath);
    } while (collides);

    return rlocator.locator.relNumber;
}
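The file-based check can be sketched on its own as well. The snippet below assumes a simplified layout in which a relation's main fork is a file named after its relfilenumber directly inside some directory. The real path is built by relpath and depends on the tablespace, database, and backend, so toy_relfile_collides and toy_pick_relfilenumber are purely illustrative names. Only the POSIX access call is the same as in the original.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "<dir>/<relfilenumber>" and report whether a
 * file with that name already exists.  As in the original, any failure of
 * access() (not just ENOENT) is treated as "no collision".
 */
static bool
toy_relfile_collides(const char *dir, uint32_t relfilenumber)
{
    char        path[1024];
    int         len;

    len = snprintf(path, sizeof(path), "%s/%u", dir, (unsigned) relfilenumber);
    assert(len > 0 && (size_t) len < sizeof(path));

    return access(path, F_OK) == 0;
}

/*
 * Keep drawing candidates from a plain counter until no file of that name
 * exists in dir.  (The reserved-range handling on wraparound is omitted.)
 */
static uint32_t
toy_pick_relfilenumber(const char *dir, uint32_t *next_oid)
{
    uint32_t    candidate;

    do
    {
        candidate = (*next_oid)++;
    } while (toy_relfile_collides(dir, candidate));

    return candidate;
}
```

If the directory already contains a file named 4000000000 and the counter starts at that value, the first candidate is rejected and, provided no file named 4000000001 exists, that next value is returned. No catalog index is consulted in this path; the file system is the only source of truth.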
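Conclusion
OID generation just keeps going around in circles (wraparound). Whether a generated OID actually gets used is decided by the OID index of the system catalog it is meant for. The one case where uniqueness is not guaranteed by an index is when a table is given a new relfilenode, for example after TRUNCATE. In that case the check is whether a file named after the OID already exists in the corresponding directory.
Takeaways
- An OID is unique within a single system catalog
- A relfilenode's OID is unique within a tablespace (for the database in question)
Finally, you may be wondering why WITH OIDS was removed (for details, see the mailing list thread Removing "magic" oids):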
In my opinion the current WITH OIDs system has numerous weaknesses:
1) The fact that oids are so magic means that if we get pluggable
storage, the design of the potential pluggable systems is constrained
and similar magic has to be present everywhere.
2) The fact that the oids in each table have the same counter to be
based on means that oid wraparounds have much worse consequences
performance wise than necessary. E.g. once the global counter has
wrapped, all toast tables start to be significantly slower.
It would be much better if most database objects had their own
counters.
3) For some oid using objects (toast, large objects at the very least)
it'd be quite worthwhile to switch to 8 byte ids. Currently that's
hard to do, because it'd break on-disk compatibility.
4) There's a lot of special case code around for dealing with oids.
5a) The fact that system table oids don't show up in selects by default
makes it more work than necessary to look at catalogs.
5b) Similarly, it's fairly annoying when debugging not to trivially see
oids for catalog structs.