Codex/implement bounded retry policy for unity (#510)

* Add editor readiness v2, refresh tool, and preflight guards

* Detect external package changes and harden refresh retry

* feat: add TestRunnerNoThrottle and async test running with background stall prevention

- Add TestRunnerNoThrottle.cs: Sets editor to 'No Throttling' mode during test runs
  with SessionState persistence across domain reload
- Add run_tests_async and get_test_job tools for non-blocking test execution
- Add TestJobManager for async test job tracking with progress monitoring
- Add ForceSynchronousImport to all AssetDatabase.Refresh() calls to prevent stalls
- Mark DomainReloadResilienceTests as [Explicit] with documentation explaining
  the test infrastructure limitation (internal coroutine waits vs MCP socket polling)
- MCP workflow is unaffected - socket messages provide external stimulus that
  keeps Unity responsive even when backgrounded

* refactor: simplify and clean up code

- Remove unused Newtonsoft.Json.Linq import from TestJobManager
- Add throttling to SessionState persistence (once per second) to reduce overhead
- Critical job state changes (start/finish) still persist immediately
- Fix duplicate XML summary tag in DomainReloadResilienceTests

* docs: add async test tools to README, document domain reload limitation

- Add run_tests_async and get_test_job to main README tools list
- Document background stall limitation for domain reload tests in DEV readme

* ci: add separate job for domain reload tests

Run [Explicit] domain_reload tests in their own job using -testCategory

* ci: run domain reload tests in same job as regular tests

Combines into single job with two test steps to reuse cached Library

* fix: address coderabbit review issues

- Fix TOCTOU race in TestJobManager.StartJob (single lock scope for check-and-set)
- Store TestRunnerApi reference with HideAndDontSave to prevent GC/serialization issues

* docs: update tool descriptions to prefer run_tests_async

- run_tests_async is now marked as preferred for long-running suites
- run_tests description notes it blocks and suggests async alternative

* docs: update README screenshot to v8.6 UI

* docs: add v8.6 UI screenshot

* docs: update v8.6 UI screenshot

* docs: update v8.6 UI screenshot

* docs: update v8.6 UI screenshot

* Update README for MCP version and instructions for v8.7

* fix: handle preflight busy signals and derive job status from test results

- manage_asset, manage_gameobject, manage_scene now check preflight return
  value and propagate busy/retry signals to clients (fixes Sourcery #1)
- TestJobManager.FinalizeCurrentJobFromRunFinished now sets job status to
  Failed when resultPayload.Failed > 0, not always Succeeded (fixes Sourcery #2)

* fix: increase HTTP server startup timeout for dev mode

When 'Force fresh server install' is enabled, uvx uses --no-cache --refresh
which rebuilds the package and takes significantly longer to start.

- Increase timeout from 10s to 45s when dev mode is enabled
- Add informative log message explaining the longer startup time
- Show actual timeout value in warning message

* fix: derive job status from test results in FinalizeFromTask fallback

Apply same logic as FinalizeCurrentJobFromRunFinished: check result.Failed > 0
to correctly mark jobs as Failed when tests fail, even in the fallback path
when RunFinished callback is not delivered.

* Bound Unity reload/session waits

* refactor: improve env var parsing and reason extraction

Address code review feedback:

- Catch ValueError specifically (instead of broad Exception) when parsing
  UNITY_MCP_RELOAD_MAX_WAIT_S, UNITY_MCP_SESSION_RESOLVE_MAX_WAIT_S, and
  UNITY_MCP_SESSION_READY_WAIT_SECONDS, with logging for easier diagnosis
  of misconfiguration

- Normalize reason values to lowercase in _extract_response_reason() to
  avoid case-sensitive mismatches in comparisons

- Simplify refresh_unity.py by removing redundant isinstance check and
  reusing _extract_response_reason instead of duplicating reason parsing

* Add upper bound clamp and custom exception for bounded retry

- Add upper bound (30s) to UNITY_MCP_RELOAD_MAX_WAIT_S to prevent
  misconfiguration from causing excessive waits
- Add upper bound (30s) to UNITY_MCP_SESSION_RESOLVE_MAX_WAIT_S for
  consistency with readiness probe
- Introduce NoUnitySessionError custom exception to replace fragile
  string matching in send_command_for_instance

Addresses code review feedback for bounded retry policy PR.
main
dsarno 2026-01-04 14:45:05 -08:00 committed by GitHub
parent 6bbf137685
commit 96b81caf26
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
3 changed files with 155 additions and 34 deletions

View File

@ -10,7 +10,7 @@ from models import MCPResponse
from services.registry import mcp_for_unity_tool
from services.tools import get_unity_instance_from_context
import transport.unity_transport as unity_transport
from transport.legacy.unity_connection import async_send_command_with_retry
from transport.legacy.unity_connection import async_send_command_with_retry, _extract_response_reason
from services.state.external_changes_scanner import external_changes_scanner
@ -47,10 +47,12 @@ async def refresh_unity(
if isinstance(response, dict) and not response.get("success", True):
hint = response.get("hint")
err = (response.get("error") or response.get("message") or "")
reason = _extract_response_reason(response)
is_retryable = (hint == "retry") or ("disconnected" in str(err).lower())
if (not wait_for_ready) or (not is_retryable):
return MCPResponse(**response)
recovered_from_disconnect = True
if reason not in {"reloading", "no_unity_session"}:
recovered_from_disconnect = True
# Optional server-side wait loop (defensive): if Unity tool doesn't wait or returns quickly,
# poll the canonical editor_state v2 resource until ready or timeout.
@ -86,5 +88,3 @@ async def refresh_unity(
)
return MCPResponse(**response) if isinstance(response, dict) else response

View File

@ -686,28 +686,46 @@ def get_unity_connection(instance_identifier: str | None = None) -> UnityConnect
# Centralized retry helpers
# -----------------------------
def _extract_response_reason(resp: object) -> str | None:
"""Extract a normalized (lowercase) reason string from a response.
Returns lowercase reason values to enable case-insensitive comparisons
by callers (e.g. _is_reloading_response, refresh_unity).
"""
if isinstance(resp, MCPResponse):
data = getattr(resp, "data", None)
if isinstance(data, dict):
reason = data.get("reason")
if isinstance(reason, str):
return reason.lower()
message_text = f"{resp.message or ''} {resp.error or ''}".lower()
if "reload" in message_text:
return "reloading"
return None
if isinstance(resp, dict):
if resp.get("state") == "reloading":
return "reloading"
data = resp.get("data")
if isinstance(data, dict):
reason = data.get("reason")
if isinstance(reason, str):
return reason.lower()
message_text = (resp.get("message") or resp.get("error") or "").lower()
if "reload" in message_text:
return "reloading"
return None
return None
def _is_reloading_response(resp: object) -> bool:
"""Return True if the Unity response indicates the editor is reloading.
Supports both raw dict payloads from Unity and MCPResponse objects returned
by preflight checks or transport helpers.
"""
# Structured MCPResponse from preflight/transport
if isinstance(resp, MCPResponse):
# Explicit "please retry" hint from preflight
if getattr(resp, "hint", None) == "retry":
return True
message_text = f"{resp.message or ''} {resp.error or ''}".lower()
return "reload" in message_text
# Raw Unity payloads
if isinstance(resp, dict):
if resp.get("state") == "reloading":
return True
message_text = (resp.get("message") or resp.get("error") or "").lower()
return "reload" in message_text
return False
return _extract_response_reason(resp) == "reloading"
def send_command_with_retry(
@ -738,15 +756,82 @@ def send_command_with_retry(
max_retries = getattr(config, "reload_max_retries", 40)
if retry_ms is None:
retry_ms = getattr(config, "reload_retry_ms", 250)
try:
max_wait_s = float(os.environ.get(
"UNITY_MCP_RELOAD_MAX_WAIT_S", "2.0"))
except ValueError as e:
raw_val = os.environ.get("UNITY_MCP_RELOAD_MAX_WAIT_S", "2.0")
logger.warning(
"Invalid UNITY_MCP_RELOAD_MAX_WAIT_S=%r, using default 2.0: %s",
raw_val, e)
max_wait_s = 2.0
# Clamp to [0, 30] to prevent misconfiguration from causing excessive waits
max_wait_s = max(0.0, min(max_wait_s, 30.0))
response = conn.send_command(command_type, params)
retries = 0
wait_started = None
reason = _extract_response_reason(response)
while _is_reloading_response(response) and retries < max_retries:
delay_ms = int(response.get("retry_after_ms", retry_ms)
) if isinstance(response, dict) else retry_ms
time.sleep(max(0.0, delay_ms / 1000.0))
if wait_started is None:
wait_started = time.monotonic()
logger.debug(
"Unity reload wait started: command=%s instance=%s reason=%s max_wait_s=%.2f",
command_type,
instance_id or "default",
reason or "reloading",
max_wait_s,
)
if max_wait_s <= 0:
break
elapsed = time.monotonic() - wait_started
if elapsed >= max_wait_s:
break
delay_ms = retry_ms
if isinstance(response, dict):
retry_after = response.get("retry_after_ms")
if retry_after is None and isinstance(response.get("data"), dict):
retry_after = response["data"].get("retry_after_ms")
if retry_after is not None:
delay_ms = int(retry_after)
sleep_ms = max(50, min(int(delay_ms), 250))
logger.debug(
"Unity reload wait retry: command=%s instance=%s reason=%s retry_after_ms=%s sleep_ms=%s",
command_type,
instance_id or "default",
reason or "reloading",
delay_ms,
sleep_ms,
)
time.sleep(max(0.0, sleep_ms / 1000.0))
retries += 1
response = conn.send_command(command_type, params)
reason = _extract_response_reason(response)
if wait_started is not None:
waited = time.monotonic() - wait_started
if _is_reloading_response(response):
logger.debug(
"Unity reload wait exceeded budget: command=%s instance=%s waited_s=%.3f",
command_type,
instance_id or "default",
waited,
)
return MCPResponse(
success=False,
error="Unity is reloading; please retry",
hint="retry",
data={
"reason": "reloading",
"retry_after_ms": min(250, max(50, retry_ms)),
},
)
logger.debug(
"Unity reload wait completed: command=%s instance=%s waited_s=%.3f",
command_type,
instance_id or "default",
waited,
)
return response

View File

@ -34,6 +34,10 @@ class PluginDisconnectedError(RuntimeError):
"""Raised when a plugin WebSocket disconnects while commands are in flight."""
class NoUnitySessionError(RuntimeError):
"""Raised when no Unity plugins are available."""
class PluginHub(WebSocketEndpoint):
"""Manages persistent WebSocket connections to Unity plugins."""
@ -361,11 +365,20 @@ class PluginHub(WebSocketEndpoint):
if cls._registry is None:
raise RuntimeError("Plugin registry not configured")
# Use the same defaults as the stdio transport reload handling so that
# HTTP/WebSocket and TCP behave consistently without per-project env.
max_retries = max(1, int(getattr(config, "reload_max_retries", 40)))
# Bound waiting for Unity sessions so calls fail fast when editors are not ready.
try:
max_wait_s = float(
os.environ.get("UNITY_MCP_SESSION_RESOLVE_MAX_WAIT_S", "2.0"))
except ValueError as e:
raw_val = os.environ.get("UNITY_MCP_SESSION_RESOLVE_MAX_WAIT_S", "2.0")
logger.warning(
"Invalid UNITY_MCP_SESSION_RESOLVE_MAX_WAIT_S=%r, using default 2.0: %s",
raw_val, e)
max_wait_s = 2.0
# Clamp to [0, 30] to prevent misconfiguration from causing excessive waits
max_wait_s = max(0.0, min(max_wait_s, 30.0))
retry_ms = float(getattr(config, "reload_retry_ms", 250))
sleep_seconds = max(0.05, retry_ms / 1000.0)
sleep_seconds = max(0.05, min(0.25, retry_ms / 1000.0))
# Allow callers to provide either just the hash or Name@hash
target_hash: str | None = None
@ -394,7 +407,7 @@ class PluginHub(WebSocketEndpoint):
return None, count
session_id, session_count = await _try_once()
deadline = time.monotonic() + (max_retries * sleep_seconds)
deadline = time.monotonic() + max_wait_s
wait_started = None
# If there is no active plugin yet (e.g., Unity starting up or reloading),
@ -408,14 +421,18 @@ class PluginHub(WebSocketEndpoint):
if wait_started is None:
wait_started = time.monotonic()
logger.debug(
f"No plugin session available (instance={unity_instance or 'default'}); waiting up to {deadline - wait_started:.2f}s",
"No plugin session available (instance=%s); waiting up to %.2fs",
unity_instance or "default",
max_wait_s,
)
await asyncio.sleep(sleep_seconds)
session_id, session_count = await _try_once()
if session_id is not None and wait_started is not None:
logger.debug(
f"Plugin session restored after {time.monotonic() - wait_started:.3f}s (instance={unity_instance or 'default'})",
"Plugin session restored after %.3fs (instance=%s)",
time.monotonic() - wait_started,
unity_instance or "default",
)
if session_id is None and not target_hash and session_count > 1:
raise RuntimeError(
@ -425,11 +442,13 @@ class PluginHub(WebSocketEndpoint):
if session_id is None:
logger.warning(
f"No Unity plugin reconnected within {max_retries * sleep_seconds:.2f}s (instance={unity_instance or 'default'})",
"No Unity plugin reconnected within %.2fs (instance=%s)",
max_wait_s,
unity_instance or "default",
)
# At this point we've given the plugin ample time to reconnect; surface
# a clear error so the client can prompt the user to open Unity.
raise RuntimeError("No Unity plugins are currently connected")
raise NoUnitySessionError("No Unity plugins are currently connected")
return session_id
@ -440,7 +459,20 @@ class PluginHub(WebSocketEndpoint):
command_type: str,
params: dict[str, Any],
) -> dict[str, Any]:
session_id = await cls._resolve_session_id(unity_instance)
try:
session_id = await cls._resolve_session_id(unity_instance)
except NoUnitySessionError:
logger.debug(
"Unity session unavailable; returning retry: command=%s instance=%s",
command_type,
unity_instance or "default",
)
return MCPResponse(
success=False,
error="Unity session not available; please retry",
hint="retry",
data={"reason": "no_unity_session", "retry_after_ms": 250},
).model_dump()
# During domain reload / immediate reconnect windows, the plugin may be connected but not yet
# ready to process execute commands on the Unity main thread (which can be further delayed when
@ -450,7 +482,11 @@ class PluginHub(WebSocketEndpoint):
if command_type in cls._FAST_FAIL_COMMANDS and command_type != "ping":
try:
max_wait_s = float(os.environ.get("UNITY_MCP_SESSION_READY_WAIT_SECONDS", "6"))
except Exception:
except ValueError as e:
raw_val = os.environ.get("UNITY_MCP_SESSION_READY_WAIT_SECONDS", "6")
logger.warning(
"Invalid UNITY_MCP_SESSION_READY_WAIT_SECONDS=%r, using default 6.0: %s",
raw_val, e)
max_wait_s = 6.0
max_wait_s = max(0.0, min(max_wait_s, 30.0))
if max_wait_s > 0: