SmartOS: svcs: Could not bind to repository server. Finding the cause.

Written by Claudio Kuenzler - 8 comments

Published on January 9th 2013 - Listed in Solaris Shell Databases SmartOS Unix

Today I've seen a strange SMF (Service Management Facility) issue on a SmartOS machine running joyent_20120614T001014Z.
All svc commands, like svcs or svcadm, didn't run anymore:

# svcs
svcs: Could not bind to repository server: repository server unavailable. Exiting.

Short summary: A reboot of the zone solved the problem and the commands like svcs were running again.

But I wanted to find out the reason for this of course. So if you have a little bit of time and curiosity, stick with me...

On my online research I came across Solaris bug reports, mailing list entries, etc. but none had the actual solution to the problem. One mailing list post however pointed me to the right direction by asking whether "svc.configd" is running or not. To my confirmation, configd was not running:

# ps -ef | grep svc
    root 72946 72871   0   Oct 30 ?        5569:20 /lib/svc/bin/svc.startd
    root 73017 72871   0   Oct 30 ?           0:05 /sbin/sh /lib/svc/method/svc-dlmgmtd
    root 54594 51575   0 10:34:58 pts/2       0:00 grep svc

There's svc.startd but as mentioned, there should also be a process /lib/svc/bin/svc.configd.
I tried to manually start the configd (config daemon) to see what would happen:

# /lib/svc/bin/svc.configd

# svcs
svc.configd: Fatal error: invalid integer "-2147483463" in field "lnk_prop_id"
svcs: svcs.c:389: Unexpected libscf error: connection to repository broken. Exiting.

The manual start actually worked without an error and the process appeared again in the processlist. But as soon as I re-used svcs, the configd process crashed and showed the error above.
This means that there is a problem with the repository. And now it get's a little bit complicated if no SMF knowledge is around.
Luckily I found this blog post (Racing in the depths of SMF) which explains the whole SMF architecture in a technical and graphical way.
What's necessary to know for this case is:
At the boot of the machine, startd and configd are launched. Commands like svcs are using startd while startd itself talks with configd. configd is using a SQLite database in the background to store service states, variables, etc.

Now that I knew there is an SQLite database in the background of configd, the error message makes much more sense (field "lnk_prop_id"). It looks like an invalid integer (in this case a negative number) was inserted into the database which configd doesn't seem to handle well and crashes. On the system I found two SQLite databases which could be responsible for this error message:

# find /etc/svc -name "*.db"
/etc/svc/repository.db
/etc/svc/volatile/svc_nonpersist.db

I began my research with /etc/svc/repository.db but I can already tell you this was not the one containing the negative number, so I skip that part and go directly to /etc/svc/volatile/svc_nonpersist.db.

SmartOS comes with an embedded SQLite client. I opened the database using /lib/svc/bin/sqlite and first took a look at the existing tables:

# /lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db
SQLite version 2.8.15-repcached-joyent_20120614T001014Z
Enter ".help" for instructions
sqlite> .tables
id_tbl pg_tbl prop_lnk_tbl schema_version value_tbl

The table called prop_lnk_tbl sounds promising as the svcs error message pointed to a field called "lnk_prop_id". So I wanted to see what kind of columns this table holds:

sqlite> explain select * from prop_lnk_tbl;
0|ColumnName|0|0|lnk_prop_id
1|ColumnName|1|0|lnk_pg_id
2|ColumnName|2|0|lnk_gen_id
3|ColumnName|3|0|lnk_prop_name
4|ColumnName|4|0|lnk_prop_type
5|ColumnName|5|1|lnk_val_id
[...]

Seems I got the right table - the mentioned field/column lnk_prop_id is here. Now I just wanted to see if there are such entries as seen in the error message:

sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
-2147483463|151|247577566|start_method_waitstatus|i|577678377
-2147483230|151|247577566|auxiliary_state|s|577678403
-2147483155|151|247577566|next_state|s|577678404
-2147483071|151|247577566|state|s|577678405
-2147483018|151|247577566|state_timestamp|t|577678406

Jackpot! There are actually several entries with a negative number. The first one matches the number of the error message. Unfortunately from this information I could not see which service is responsible for these entries. I tried to get search three tables with referencing the lnk_pg_id (151) but still couldn't find the name. Here's for documentation purpose only, might help someone for a similar case:

table prop_lnk_tbl
Query: select * from prop_lnk_tbl where lnk_prop_id < 0;

lnk_prop_id	lnk_pg_id	lnk_gen_id	lnk_prop_name	lnk_prop_type	lnk_val_id
-2147483463	151	247577566	start_method_waitstatus	i	577678377
-2147483230	151	247577566	auxiliary_state	s	577678403
-2147483155	151	247577566	next_state	s	577678404
-2147483071	151	247577566	state	s	577678405
-2147483018	151	247577566	state_timestamp	t	577678406

The field lnk_pg_id references to the table pg_tbl. lnk_val_id references to the table value_tbl.

table pg_tbl

Query: sqlite> select * from pg_tbl where pg_id = '151';

pg_id	pg_parent_id	pg_name	pg_type	pg_flags	pg_gen_id
151	823	restarter	framework	1	247577566

At first I thought, I finally found the responsible service name, but there were actually many entries restarter/framework in the database.

table value_tbl
Query: select * from value_tbl where value_id = '577678377' OR value_id = '577678403' OR value_id = '577678404' OR value_id =

value_id	value_type	value_value	value_order
577678377	i	0	0
577678403	s	ct_ev_exit	0
577678404	s	offline	0
577678405	s	offline	0
577678406	t	577678405	0

At this point I knew I had two ways how to solve the problem:

1) Try it with a simple reboot
and if that wouldn't work:
2) Manually delete these entries from the database

As written at the begin of the post, the reboot worked for me and svcs ran again without error. I also checked the database for negative numbers and they were all gone.

I'm sure though that the second option would have worked in case the reboot wouldn't have helped.

Update December 3rd 2013:
Today I ran into this very same issue. Time to test if option 2 (Manually delete these entries from the database) would work - and to answer the critics of Ahmed's comment ;-):

Search for the negative value entries in the SMF database:

/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db

sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
-2147483513|128|253154846|state_timestamp|t|590693950

sqlite> select * from pg_tbl where pg_id = '128';
128|819|restarter|framework|1|253154846

sqlite> select * from value_tbl where value_id = '590693950';
590693950|t|1386059595.646420000|0

sqlite> delete from prop_lnk_tbl where lnk_prop_id < 0;
sqlite> delete from value_tbl where value_id = '590693950';

Verification if the negative values are gone and then exit:

sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
sqlite> .exit

Would svcs work again? Let's try it:

svcs
svcs: Could not bind to repository server: repository server unavailable.Exiting.

Narf. Let's try a restart (as mentioned in the troubleshooting above):

/lib/svc/bin/svc.configd

Another try:

svcs
STATE          STIME    FMRI
online*        -        svc:/site/wsapp:movelogspam
online         Sep_12   svc:/system/svc/restarter:default
online         Sep_12   svc:/system/early-manifest-import:default
online         Sep_12   svc:/system/filesystem/root:default
online         Sep_12   svc:/network/datalink-management:default
online         Sep_12   svc:/network/ip-interface-management:default
[...]

Well here we go. Now I'm kind of in the mood to say "See, I told you so!".

Add a comment

Show form to leave a comment

Comments (newest first)

Italo Penna from wrote on Dec 5th, 2022:

Thank you very much for this detailed content.

Slava B from New York wrote on Jul 13th, 2017:

It seems on our zone the /var run out of space and created the problem. Was enough to restart the svc.configd though to take care of it.

Calixto Mercado from Bolivia wrote on Jun 25th, 2016:

I found this... it is easiest
Restoring svc repository

svcs: Could not bind to repository server: repository server unavailable. Exiting.

What this means is that your SMF repo might be currupt and broken. Luckily backups are kept and you can go on restoring the repository by running the following command:

# /lib/svc/bin/restore_repository

A menu of options will come up asking for what restore point you would like to go back to by default its “boot”, and that will go to the most recent one – thats usually what I do. Just hit ENTER and it will restore it. The on screen menu will make sense and explain your options and show all of the restore points.

It will then auto reboot your system, or you will have to manually do it

# reboot

francois scheurer from zurich wrote on Jul 17th, 2015:

(previous comment continued..)

delete from value_tbl where value_tbl.value_id in (select lnk_val_id from prop_lnk_tbl where lnk_prop_id < 0);

delete from prop_lnk_tbl where lnk_prop_id < 0;

.exit

/lib/svc/bin/svc.configd
svcs -xv

francois scheurer from zurich wrote on Jul 17th, 2015:

Thx Claudio, great research!

It worked! but i also needed to kill /lib/svc/bin/svc.configd that had zombies, else svcs was working but enabling/disabling services was not (it did nothing without showing an error):

svcs -xv
svcs: Could not bind to repository server: repository server unavailable. Exiting.

prstat -s rss
pgrep -lf svc.configd
pkill -f /lib/svc/bin/svc.configd
ptree
pkill -f /lib/svc/bin/svc.startd

/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db .dump >/root/svc_nonpersist.txt

/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db
.schema
.headers on

select * from prop_lnk_tbl where lnk_prop_id < 0;
lnk_prop_id|lnk_pg_id|lnk_gen_id|lnk_prop_name|lnk_prop_type|lnk_val_id
-2147483647|127|247807630|auxiliary_state|s|578217144
-2147483450|127|247807630|next_state|s|578217145
-2147483379|127|247807630|state|s|578217146
-2147483330|127|247807630|state_timestamp|t|578217147
-2147483113|128|247807631|logfile|s|578217097
-2147483078|128|247807631|contract|c|578217098
-2147482828|128|247807631|start_pid|c|578217099
-2147482699|128|247807631|start_method_timestamp|t|578217100
-2147482505|128|247807631|start_method_waitstatus|i|578217101
-2147482424|128|247807631|auxiliary_state|s|578217148
-2147482239|128|247807631|next_state|s|578217149
-2147482115|128|247807631|state|s|578217150
-2147482075|128|247807631|state_timestamp|t|578217151

select distinct pg_tbl.* from pg_tbl where pg_id in (select lnk_pg_id from prop_lnk_tbl where lnk_prop_id < 0);
127|823|restarter|framework|1|247807630
128|821|restarter|framework|1|247807631

select distinct value_tbl.* from value_tbl where value_tbl.value_id in (select lnk_val_id from prop_lnk_tbl where lnk_prop_id < 0);
578217097|s|/var/svc/log/site-wsapp:movelogsend.log|0
578217098|c|27664308|0
578217099|c|68991|0
578217100|t|1437114522.078017000|0
578217101|i|0|0
5782

Andre Sachs from Johannesburg wrote on May 7th, 2014:

Thanks that worked a treat !

I did however need to invoke /lib/svc/bin/svc.startd for services to become managed again.

Claudio from Switzerland wrote on Jul 1st, 2013:

Hi Ahmed. No, the final conclusion is that for some reason a negative value is added to the sqlite database. By deleting the affected rows and restarting /lib/svc/bin/svc.configd, this issue should be fixed.
However my superior preferred that I do a reboot rather than manually deleting entries in the sqlite database. That's why I wrote at the begin that a reboot solves the issue. But it is still unclear where it came from.

Ahmed Kamal from Egypt wrote on Jun 30th, 2013:

So that was some cool analysis, but the final conclusion was to yet again simply reboot ?! I was wishing for a more insightful resolution .. If you later find out more info about this issue, let me know

Thanks

Blog Tags:

AI AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PHP Perl Personal PostgreSQL PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Znuny Zoneminder