SmartOS: svcs: Could not bind to repository server. Finding the cause.

Written by - 8 comments

Published on - Listed in Solaris Shell Database SmartOS Unix


Today I've seen a strange SMF (Service Management Facility) issue on a SmartOS machine running joyent_20120614T001014Z.
All svc commands, like svcs or svcadm, didn't run anymore:

# svcs
svcs: Could not bind to repository server: repository server unavailable.  Exiting.

Short summary: A reboot of the zone solved the problem and the commands like svcs were running again.

But I wanted to find out the reason for this of course. So if you have a little bit of time and curiosity, stick with me...

On my online research I came across Solaris bug reports, mailing list entries, etc. but none had the actual solution to the problem. One mailing list post however pointed me to the right direction by asking whether "svc.configd" is running or not. To my confirmation, configd was not running:

# ps -ef | grep svc
    root 72946 72871   0   Oct 30 ?        5569:20 /lib/svc/bin/svc.startd
    root 73017 72871   0   Oct 30 ?           0:05 /sbin/sh /lib/svc/method/svc-dlmgmtd
    root 54594 51575   0 10:34:58 pts/2       0:00 grep svc

There's svc.startd but as mentioned, there should also be a process /lib/svc/bin/svc.configd.
I tried to manually start the configd (config daemon) to see what would happen:

# /lib/svc/bin/svc.configd

# svcs
svc.configd: Fatal error: invalid integer "-2147483463" in field "lnk_prop_id"
svcs: svcs.c:389: Unexpected libscf error: connection to repository broken.  Exiting.

The manual start actually worked without an error and the process appeared again in the processlist. But as soon as I re-used svcs, the configd process crashed and showed the error above.
This means that there is a problem with the repository. And now it get's a little bit complicated if no SMF knowledge is around.
Luckily I found this blog post (Racing in the depths of SMF) which explains the whole SMF architecture in a technical and graphical way.
What's necessary to know for this case is:
At the boot of the machine, startd and configd are launched. Commands like svcs are using startd while startd itself talks with configd. configd is using a SQLite database in the background to store service states, variables, etc.

Now that I knew there is an SQLite database in the background of configd, the error message makes much more sense (field "lnk_prop_id"). It looks like an invalid integer (in this case a negative number) was inserted into the database which configd doesn't seem to handle well and crashes. On the system I found two SQLite databases which could be responsible for this error message:

# find /etc/svc -name "*.db"
/etc/svc/repository.db
/etc/svc/volatile/svc_nonpersist.db

I began my research with /etc/svc/repository.db but I can already tell you this was not the one containing the negative number, so I skip that part and go directly to /etc/svc/volatile/svc_nonpersist.db.

SmartOS comes with an embedded SQLite client. I opened the database using /lib/svc/bin/sqlite and first took a look at the existing tables:

# /lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db
SQLite version 2.8.15-repcached-joyent_20120614T001014Z
Enter ".help" for instructions
sqlite> .tables
id_tbl          pg_tbl          prop_lnk_tbl    schema_version  value_tbl

The table called prop_lnk_tbl sounds promising as the svcs error message pointed to a field called "lnk_prop_id". So I wanted to see what kind of columns this table holds:

sqlite> explain select * from prop_lnk_tbl;
0|ColumnName|0|0|lnk_prop_id
1|ColumnName|1|0|lnk_pg_id
2|ColumnName|2|0|lnk_gen_id
3|ColumnName|3|0|lnk_prop_name
4|ColumnName|4|0|lnk_prop_type
5|ColumnName|5|1|lnk_val_id
[...]

Seems I got the right table - the mentioned field/column lnk_prop_id is here. Now I just wanted to see if there are such entries as seen in the error message:

sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
-2147483463|151|247577566|start_method_waitstatus|i|577678377
-2147483230|151|247577566|auxiliary_state|s|577678403
-2147483155|151|247577566|next_state|s|577678404
-2147483071|151|247577566|state|s|577678405
-2147483018|151|247577566|state_timestamp|t|577678406

Jackpot! There are actually several entries with a negative number. The first one matches the number of the error message. Unfortunately from this information I could not see which service is responsible for these entries. I tried to get search three tables with referencing the lnk_pg_id (151) but still couldn't find the name. Here's for documentation purpose only, might help someone for a similar case:

table prop_lnk_tbl
Query: select * from prop_lnk_tbl where lnk_prop_id < 0;

lnk_prop_id  lnk_pg_id
 lnk_gen_id  lnk_prop_name  lnk_prop_type  lnk_val_id
  -2147483463  151 247577566 start_method_waitstatus   i
577678377
  -2147483230  151 247577566 auxiliary_state   s
577678403
  -2147483155  151 247577566 next_state   s
577678404
  -2147483071  151 247577566 state
  s
577678405
  -2147483018  151 247577566 state_timestamp   t
577678406

The field lnk_pg_id references to the table pg_tbl. lnk_val_id references to the table value_tbl.

table pg_tbl

Query: sqlite> select * from pg_tbl where pg_id = '151';

 pg_id
 pg_parent_id
 pg_name  pg_type  pg_flags  pg_gen_id
 151  823  restarter
 framework   1
247577566

At first I thought, I finally found the responsible service name, but there were actually many entries restarter/framework in the database.


table value_tbl
Query: select * from value_tbl where value_id = '577678377' OR value_id = '577678403' OR value_id = '577678404' OR value_id =

 value_id  value_type
 value_value  value_order
577678377  i  0  0
577678403  s ct_ev_exit  0
577678404  s  offline  0
577678405  s  offline  0
577678406  t 577678405  0


At this point I knew I had two ways how to solve the problem:

1) Try it with a simple reboot
and if that wouldn't work:
2) Manually delete these entries from the database

As written at the begin of the post, the reboot worked for me and svcs ran again without error. I also checked the database for negative numbers and they were all gone.

I'm sure though that the second option would have worked in case the reboot wouldn't have helped.

Update December 3rd 2013:
Today I ran into this very same issue. Time to test if option 2 (Manually delete these entries from the database) would work - and to answer the critics of Ahmed's comment ;-):

Search for the negative value entries in the SMF database:

/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db

sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
-2147483513|128|253154846|state_timestamp|t|590693950

sqlite> select * from pg_tbl where pg_id = '128';
128|819|restarter|framework|1|253154846

sqlite> select * from value_tbl where value_id = '590693950';
590693950|t|1386059595.646420000|0

sqlite> delete from prop_lnk_tbl where lnk_prop_id < 0;
sqlite> delete from value_tbl where value_id = '590693950';

Verification if the negative values are gone and then exit:

sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
sqlite> .exit

Would svcs work again? Let's try it:

svcs
svcs: Could not bind to repository server: repository server unavailable.Exiting.

Narf. Let's try a restart (as mentioned in the troubleshooting above):

/lib/svc/bin/svc.configd

Another try:

svcs
STATE          STIME    FMRI
online*        -        svc:/site/wsapp:movelogspam
online         Sep_12   svc:/system/svc/restarter:default
online         Sep_12   svc:/system/early-manifest-import:default
online         Sep_12   svc:/system/filesystem/root:default
online         Sep_12   svc:/network/datalink-management:default
online         Sep_12   svc:/network/ip-interface-management:default
[...]

Well here we go. Now I'm kind of in the mood to say "See, I told you so!".


Add a comment

Show form to leave a comment

Comments (newest first)

Italo Penna from wrote on Dec 5th, 2022:

Thank you very much for this detailed content.


Slava B from New York wrote on Jul 13th, 2017:

It seems on our zone the /var run out of space and created the problem. Was enough to restart the svc.configd though to take care of it.


Calixto Mercado from Bolivia wrote on Jun 25th, 2016:

I found this... it is easiest
Restoring svc repository

svcs: Could not bind to repository server: repository server unavailable. Exiting.

What this means is that your SMF repo might be currupt and broken. Luckily backups are kept and you can go on restoring the repository by running the following command:

# /lib/svc/bin/restore_repository

A menu of options will come up asking for what restore point you would like to go back to by default its “boot”, and that will go to the most recent one – thats usually what I do. Just hit ENTER and it will restore it. The on screen menu will make sense and explain your options and show all of the restore points.

It will then auto reboot your system, or you will have to manually do it

# reboot


francois scheurer from zurich wrote on Jul 17th, 2015:

(previous comment continued..)

delete from value_tbl where value_tbl.value_id in (select lnk_val_id from prop_lnk_tbl where lnk_prop_id < 0);

delete from prop_lnk_tbl where lnk_prop_id < 0;

.exit

/lib/svc/bin/svc.configd
svcs -xv


francois scheurer from zurich wrote on Jul 17th, 2015:

Thx Claudio, great research!

It worked! but i also needed to kill /lib/svc/bin/svc.configd that had zombies, else svcs was working but enabling/disabling services was not (it did nothing without showing an error):




svcs -xv
svcs: Could not bind to repository server: repository server unavailable. Exiting.

prstat -s rss
pgrep -lf svc.configd
pkill -f /lib/svc/bin/svc.configd
ptree
pkill -f /lib/svc/bin/svc.startd

/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db .dump >/root/svc_nonpersist.txt

/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db
.schema
.headers on

select * from prop_lnk_tbl where lnk_prop_id < 0;
lnk_prop_id|lnk_pg_id|lnk_gen_id|lnk_prop_name|lnk_prop_type|lnk_val_id
-2147483647|127|247807630|auxiliary_state|s|578217144
-2147483450|127|247807630|next_state|s|578217145
-2147483379|127|247807630|state|s|578217146
-2147483330|127|247807630|state_timestamp|t|578217147
-2147483113|128|247807631|logfile|s|578217097
-2147483078|128|247807631|contract|c|578217098
-2147482828|128|247807631|start_pid|c|578217099
-2147482699|128|247807631|start_method_timestamp|t|578217100
-2147482505|128|247807631|start_method_waitstatus|i|578217101
-2147482424|128|247807631|auxiliary_state|s|578217148
-2147482239|128|247807631|next_state|s|578217149
-2147482115|128|247807631|state|s|578217150
-2147482075|128|247807631|state_timestamp|t|578217151

select distinct pg_tbl.* from pg_tbl where pg_id in (select lnk_pg_id from prop_lnk_tbl where lnk_prop_id < 0);
127|823|restarter|framework|1|247807630
128|821|restarter|framework|1|247807631

select distinct value_tbl.* from value_tbl where value_tbl.value_id in (select lnk_val_id from prop_lnk_tbl where lnk_prop_id < 0);
578217097|s|/var/svc/log/site-wsapp:movelogsend.log|0
578217098|c|27664308|0
578217099|c|68991|0
578217100|t|1437114522.078017000|0
578217101|i|0|0
5782


Andre Sachs from Johannesburg wrote on May 7th, 2014:

Thanks that worked a treat !

I did however need to invoke /lib/svc/bin/svc.startd for services to become managed again.


Claudio from Switzerland wrote on Jul 1st, 2013:

Hi Ahmed. No, the final conclusion is that for some reason a negative value is added to the sqlite database. By deleting the affected rows and restarting /lib/svc/bin/svc.configd, this issue should be fixed.
However my superior preferred that I do a reboot rather than manually deleting entries in the sqlite database. That's why I wrote at the begin that a reboot solves the issue. But it is still unclear where it came from.


Ahmed Kamal from Egypt wrote on Jun 30th, 2013:

So that was some cool analysis, but the final conclusion was to yet again simply reboot ?! I was wishing for a more insightful resolution .. If you later find out more info about this issue, let me know

Thanks


RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Database   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Icingaweb   Icingaweb2   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Office   PGSQL   PHP   Perl   Personal   PostgreSQL   Postgres   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder   


Update cookies preferences