Ray on Golem CLI
Ray on Golem exposes a command-line interface to help you manage the cluster.
This article explains the available commands, which are useful when working with Ray on Golem.
Commands Overview
Ray on Golem supports the following commands:
- ray-on-golem start: launches ray-on-golem's webserver and the yagna daemon to shorten the cluster spin-up time when you later run ray up (details below)
- ray-on-golem stop: stops ray-on-golem's webserver and the yagna daemon. Use it when you are done using Ray for a while (details below)
- ray-on-golem status: gives you basic information about the running ray-on-golem webserver
- ray-on-golem version: prints ray-on-golem package information. Please use it when reporting problems; it makes it easier for us to find and fix them
- ray-on-golem network-stats golem-cluster.yaml: scans the network and offers information about available providers (details below)
- ray-on-golem webserver: starts the Golem requestor service, which acts as an interface between Ray and the Golem network, managing the provider nodes that constitute the Ray cluster. The command is invoked internally, either when you start it manually with ray-on-golem start or automatically on ray up. In most cases you don't need to use it explicitly on its own.
ray-on-golem [start|stop]
Apart from Ray itself, Ray on Golem utilizes two additional services running on the user's computer:
- the yagna daemon acts as a gateway to the Golem Network
- the ray-on-golem webserver controls the Ray cluster and reacts to the Ray autoscaler's commands (spinning up new nodes or stopping nodes that are no longer required)
When you execute ray up, Ray on Golem ensures the two services are running, starting them if necessary. Conversely though, ray down doesn't stop them. The rationale is to allow them to run and keep gathering updates from Golem Network about available nodes, in order to speed up launching a new cluster on a subsequent ray up.
Both services are meant to be lightweight, and normally we don't recommend stopping them. However, you might want to do that, for example:
- if you've finished working with Ray on Golem and won't be using it for a while
- in case you wish to clear Ray on Golem's state and start afresh. Ideally, this shouldn't be necessary - if you often find such a restart helps, please let us know - probably something is wrong and we would love to fix it.
Because the process of gathering offers from Golem Network's market can take a couple of minutes, you might wish to launch them even before you first run ray up. That way, they'll have time to warm up and will already have more knowledge about the network when you request a new cluster to start.
There are two commands that allow you to manage those two services independently from Ray itself:
- ray-on-golem start: launches them and leaves them running in the background
- ray-on-golem stop: terminates them and ensures graceful shutdown and proper cleanup
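The lifecycle described above can be sketched as a short shell session. This is only an illustration, not an official script: it assumes the cluster file is called golem-cluster.yaml (as in the network-stats example in this article), and the run helper and the DRY_RUN variable are hypothetical names introduced here so that the script only prints the commands unless you explicitly ask it to execute them.

```shell
#!/bin/sh
# Sketch of a typical Ray on Golem session.
# Assumption: the cluster file is named golem-cluster.yaml.
set -eu

CLUSTER_YAML="${CLUSTER_YAML:-golem-cluster.yaml}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print, 0 = execute

# Hypothetical helper: print a command, and run it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

run ray-on-golem start        # start yagna and the webserver early so they can warm up
run ray-on-golem status       # check that the webserver is running
run ray up "$CLUSTER_YAML"    # start the cluster (the services are already running)
# ... run your Ray application here ...
run ray down "$CLUSTER_YAML"  # stop the cluster; the two services keep running
run ray-on-golem stop         # only when you are done for a while
```

Run it as is to preview the sequence, or with DRY_RUN=0 to execute the commands for real.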
Network stats
The tool scans the network and gives you an overview of the availability of the providers. It allows you to test the settings of your cluster yaml file.
You can use it to verify and fine-tune different budget control parameters, both on the testnet and the mainnet.
Example usage
ray-on-golem network-stats golem-cluster.yaml --duration 5
Gathering stats data for 5 minutes...
Gathering stats data done
Proposals count:
Initial: 48
Not blacklisted: 48
Passed Reject if per cpu expected cost exceeds `max_cost` = 1.5: 48
Passed Reject if start price exceeds `max_start_price = 0.5: 48
Passed Reject if cpu per hour price exceeds `max_cpu_per_hour` = 0.5: 48
Passed Reject if env per hour price exceeds `max_env_per_hour_price` = 0.5: 48
Negotiation initialized: 48
Negotiated successfully: 37
Negotiation errors:
Outbound rejected because: Everyone rule didn't match whitelist ; Audited-Payload rule requires manifest signature ; Partner rule requires node descriptor ; : 4
No capacity available. Reached Agreements limit: 1: 2
Failed to send proposal response! Request timed out: 1
Failed to send proposal response! 500: {"message":"Failed to send response for Proposal [***]. Error: Countering Proposal [***] GSB error: GSB failure: Net: error forwarding message: Connection timed out.."}: 1
Duration
Golem Network is peer-to-peer, which means there is no central registry of providers' offers. Therefore, providers' proposals are not immediately available to a newly connected requestor. They are rebroadcast at a certain interval and accumulate over time.
Negotiating with individual providers also takes some time.
Use the --duration parameter to set how long (in minutes) you want your network scan to last. It allows time to gather and negotiate the incoming proposals.
Output
- Initial: proposals found during the scan (there might be more than one proposal per provider)
- Not blacklisted: proposals coming from non-blacklisted providers (we blacklist providers with a history of misbehaving)
- Passed Reject if per cpu expected cost exceeds: proposals not exceeding your max_cost setting
- Passed Reject if start price exceeds: proposals not exceeding your max_start_price setting
- Passed Reject if cpu per hour price exceeds: proposals not exceeding your max_cpu_per_hour_price setting
- Passed Reject if env per hour price exceeds: proposals not exceeding your max_env_per_hour_price setting
- Negotiation initialized: proposals passing all the above limitations
- Negotiated successfully: providers ready to deploy the Ray on Golem image
At the end of the output, you can see the reasons for failing negotiations. The most interesting are the reasons around providers' outbound settings, but for now, we don't offer any actionable follow-up.
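To compare runs with different budget settings, it can help to reduce the counters to a single number. The following is a minimal sketch that computes the share of initialized negotiations that succeeded. It uses three lines copied from the example output in this article; it assumes each counter is printed as "label: number", which may not hold for every version of the tool.

```shell
# Sample lines copied from the example output in this article.
stats='Initial: 48
Negotiation initialized: 48
Negotiated successfully: 37'

# Pick the number after ": " on the matching line.
initialized=$(printf '%s\n' "$stats" | awk -F': ' '/Negotiation initialized/ {print $2}')
negotiated=$(printf '%s\n' "$stats" | awk -F': ' '/Negotiated successfully/ {print $2}')

# Integer percentage (rounded down) of negotiations that succeeded.
rate=$(( negotiated * 100 / initialized ))
echo "Negotiated ${negotiated} of ${initialized} proposals (${rate}%)"
```

For the sample above this prints "Negotiated 37 of 48 proposals (77%)". With real data, you could first save the scan output to a file, for example by piping the network-stats command through tee, and read the lines from that file instead of the stats variable.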
Nodes availability
The scanning ends with the successful negotiations - the Negotiated successfully line shows you the potential size of your cluster. Notice that the availability is dynamic - providers might be hired by someone else, more providers might become available, and more providers might get discovered while your cluster is alive.
The last thing is the image deployment. When starting the actual cluster (with ray up) the providers might also fail to start the Ray on Golem image. This might decrease the maximum size of your cluster too.