Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


tensorflow - RuntimeError: Unable to create link (name already exists) when saving a second model in Google Colab

I have created a model-testing pipeline for my internship that runs on Google Colab. It tests multiple sets of models and parameters back-to-back: it spins up a model and a set of parameters in a user-defined manner, trains for 15 epochs, and validates after every epoch. It uses two ModelCheckpoint callbacks to save models as h5 files: one saves every epoch, and the other saves only the best epoch under a known name in a different folder, so that it can be easily loaded later.

For reference, every model/parameter set tested is identified by a unique tester id and a model count number, which is incremented for every model. The checkpoints saved every epoch also have the epoch number appended to the end of the file name.
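The naming scheme described above could look roughly like the following. This is only a sketch: the function name, the directory names, and the exact file name layout are all hypothetical, since the post does not show them.

```python
def checkpoint_paths(tester_id, model_count, epoch_dir="checkpoints", best_dir="best"):
    """Return (per_epoch_template, best_path) for one model run.

    Every name and directory here is a hypothetical stand-in for the
    pipeline's real layout, which the post does not show.
    """
    stem = f"{tester_id}_model{model_count}"
    # The doubled braces leave a literal {epoch:02d} placeholder in the
    # string, to be filled in later with the epoch number.
    per_epoch = f"{epoch_dir}/{stem}_epoch{{epoch:02d}}.h5"
    # The best model gets a fixed, predictable name in a separate folder.
    best = f"{best_dir}/{stem}_best.h5"
    return per_epoch, best


per_epoch, best = checkpoint_paths("tester42", 3)
print(per_epoch.format(epoch=7))
print(best)
```

Each string could then be passed as the filepath argument of a tf.keras.callbacks.ModelCheckpoint, with save_best_only=True on the second one. Because the tester id and model count are part of the stem, two different models should never write to the same file name.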

After all 15 epochs, the best model is loaded and evaluated on our testing set. Then the next model and set of parameters is spun up and the process repeats until it hits a user-defined stopping point.

At least, that is how it is supposed to work.

What happens instead is that the first model runs according to plan. Then the next model is loaded and trains and validates for one epoch. However, when it comes time to save the checkpoint for that first epoch, the following error is thrown: RuntimeError: Unable to create link (name already exists)

After that occurs, the only way I have found to avoid the error at the end of the first epoch is to reset the Colab runtime, at which point I get one more model out of it before the error occurs again. (Note: this is not the same model that ran before; I adjusted the method parameters to start at the next model that needed to run.)
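Since a runtime reset clears the error for exactly one model, one thing that might be worth trying is resetting framework state between models instead of resetting the whole runtime. The sketch below shows only the loop structure; run_models, train_and_evaluate, and reset_state are hypothetical names, and nothing in the post confirms that leftover state is the actual cause.

```python
def run_models(configs, train_and_evaluate, reset_state):
    """Run each config in turn, calling reset_state() before every model.

    train_and_evaluate(model_count, config) is a hypothetical stand-in for
    building, training, checkpointing, and testing one model.
    """
    results = []
    for model_count, config in enumerate(configs, start=1):
        # Drop any global state left behind by the previous model.
        reset_state()
        results.append((model_count, train_and_evaluate(model_count, config)))
    return results


# Stand-in callables so the sketch runs without TensorFlow installed.
resets = []
results = run_models(
    ["config_a", "config_b"],
    train_and_evaluate=lambda count, config: f"trained {config}",
    reset_state=lambda: resets.append("reset"),
)
print(results)
```

With TensorFlow, reset_state could be tf.keras.backend.clear_session, which resets Keras's global state, including the counters used to generate automatic layer names. Whether that removes this particular error is an untested assumption.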

Finally, to rule out the most common causes of this error, I have run both model.summary() and for i, w in enumerate(model.weights): print(i, w.name). Neither shows any duplicate names.
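The duplicate-name check can also be done programmatically rather than by reading the printed list, which is easier to get right on large models. A minimal sketch, where find_duplicate_names is a hypothetical helper and not part of TensorFlow:

```python
from collections import Counter


def find_duplicate_names(names):
    """Return a sorted list of the names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up weight names, for illustration only.
example = ["dense/kernel:0", "dense/bias:0", "dense/kernel:0"]
print(find_duplicate_names(example))
```

For a Keras model, the input would be [w.name for w in model.weights]. An empty result matches what the manual check above found.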

I am unsure why this behavior is occurring. My best guess is that some combination of Colab's caching behavior and the way ModelCheckpoint saves files causes it to see a name overlap where there is none.

Any further insight that can be provided as to why this is occurring and how to solve it would be greatly appreciated.



1 Reply

Waiting for an expert to answer.
