Rideshare-Bench Analysis Report

Model: google/gemini-3.1-pro-preview Run Date: March 23, 2026 Duration: 122 of 168 simulation hours (5.08 days of 7) Status: Failed at Hour 122 (gateway timeout error)

Executive Summary

Metric	Value
Final Score	$536.65
Final Balance	$536.65
Total Rides	25
Final Rating	4.59 / 5.0
Earnings/Hour	$3.58 (net over 122 hours)
Rides/Day	4.9
Utilization	20.5% (25 rides / 122 hours)
Acceptance Rate	100% (0 declined)
Cancellation Rate	0%

Overall Grade: D+

Gemini 3.1 Pro earned $536.65 in 122 hours, $3.58/hour. Net earnings of $436.65 above the $100 starting balance. 25 rides across 5+ days, fewer than 5 per day. The agent never declined a ride and maintained the highest rating in the benchmark (4.59, down only 0.11 from 4.70), but 79.5% of its hours produced zero revenue. The run terminated at Hour 122 (Day 6, 2:00 AM) from a gateway timeout, leaving 46 hours (27%) of the simulation incomplete.

Earnings Velocity by Day

Day	Hours	Earnings	$/Hour	Rides	End Rating	Notes
1 (Mon)	16 (8-23)	$62.36	$3.90	2	4.69	Slow start, first ride at Hour 18
2 (Tue)	24	$81.21	$3.38	5	4.67	First refuel ($37.76); 12+ idle hours
3 (Wed)	24	$168.66	$7.03	6	4.61	Best day: 4 rides in evening rush
4 (Thu)	24	$35.40	$1.48	3	4.59	Severe regression; mostly idle
5 (Fri)	24	$120.10	$5.00	9	4.59	Most productive day by ride count
6 (Sat)	2 (partial)	$9.66	$4.83	0	4.59	Terminated at Hour 2

Day 3 was the best ($7.03/hr), with evening surge rides of $65.55, $59.15, and $46.43. Day 4 was the worst ($1.48/hr): 3 rides totaling $35.40 in 24 hours. Day 5 had the most rides (9) but at lower per-ride value.

No learning curve. Performance oscillated without sustained improvement. Day 4 was worse than Day 1.

Zone Strategy

Time and Revenue by Zone

Zone	Hours Spent	% Time	Rides	Revenue	$/Hour
Downtown	32	26.2%	5	~$53.62	$1.68
Nightlife District	26	21.3%	3	~$84.73	$3.26
Airport	22	18.0%	5	~$289.61	$13.16
Business District	16	13.1%	5	~$61.72	$3.86
Residential Area	9	7.4%	2	~$26.97	$3.00
University District	7	5.7%	3	~$41.00	$5.86
Suburbs	4	3.3%	2	~$44.32	$11.08
Other/Transit	6	4.9%	--	--	--

Airport rides averaged $57.92 each. The most valuable by far. But the agent spent long stretches at Airport with zero rides (many consecutive hours earning $0).

Downtown absorbed 26% of total time for $1.68/hour. The agent returned there as a "default" despite it being saturated with 8-11 active drivers. Nightlife District took 21% of time, mostly during late-night hours when no requests materialized. The agent cruised Nightlife between midnight and 5 AM to no effect.

Approximately 40 hours were spent in zones during hours when no rides were available (overnight in Nightlife/Downtown, daytime waits at Airport). If even 15 of those hours had produced rides at the $17.47 average, that represents ~$260 in lost earnings.

Time Utilization

Category	Value
Productive hours	25/122 (20.5%)
Idle/waiting hours	97/122 (79.5%)
Repositioning moves	77 for 25 rides (3.1:1 ratio)
Rest hours	~57 across 26 rest events
Fuel stops	5

Stagnation Streaks

Duration	Period
16 hours	Day 1, Hour 8 - Hour 17 (no ride until 6 PM)
13 hours	Day 2, midnight to noon
11 hours	Day 3, midnight to 9 AM
10 hours	Day 4, midnight to 9 AM

The agent consistently drove during midnight-7 AM when demand was near zero, burning fuel through empty zones. On Days 2-5, it was active during 0:00-7:00 and completed zero rides in those windows.

Rides by Time of Day

Time Block	Rides	Avg Earnings
6 PM - 11 PM	13	$34.69
10 AM - 5 PM	10	$24.95
6 AM - 9 AM	1	$49.95
12 AM - 5 AM	1	$12.32

Tool Usage

Tool	Count	%
viewPendingRequests	288	23.2%
getZoneInfo	160	12.9%
checkEnergy	123	9.9%
waitForNextHour	116	9.3%
goOnline	112	9.0%
checkEvents	112	9.0%
goToZone	77	6.2%
getDriverStatus	53	4.3%
getVehicleStatus	42	3.4%
goOffline	31	2.5%
rest	26	2.1%
acceptRide	25	2.0%
startRide	25	2.0%
completeRide	25	2.0%
getCurrentLocation	13	1.0%
refuel	5	0.4%
getGasPrices	5	0.4%
getEarnings	2	0.2%
Total	1,243

goOnline was called 112 times but goOffline only 31 times. At least 81 calls were redundant "already online" attempts. The agent called goOnline at the start of nearly every hour regardless of state. checkEvents was called 112 times and returned empty every time. Zero events occurred in the simulation. The agent never adapted.

viewPendingRequests was called 288 times for 25 rides. Requests refresh hourly, so multiple checks within the same hour were waste. 77 zone changes for 25 rides (3.1:1). The agent moved to a zone, found nothing, moved again, found nothing, repeated.

1,243 total calls / 25 rides = 49.7 tool calls per ride. A well-optimized agent should aim for 15-20.

Rating Trend

4.70 |* Start
4.69 |  ***
4.67 |     **
4.66 |       *
4.64 |        *
4.62 |         *
4.61 |          ***
4.60 |             *
4.59 |              ****
4.58 |                  ****
4.59 |                      ** End
     +---------------------------
      R1  R5   R10  R15  R20  R25

Started at 4.70, ended at 4.59. Decline of -0.11 points (-2.3%). The smallest drop of any model in the benchmark. Gradual decline through Ride 15, then stabilized around 4.58-4.59.

Ride Rating	Count	%
4.8	3	12%
4.7	7	28%
4.6	3	12%
4.5	3	12%
4.4	3	12%
4.3	3	12%
4.2	2	8%
Sub-4.0	0	0%

No truly bad ratings. The gradual decline came from a mix of 4.2-4.4 ratings, likely from fatigue-related service quality. The agent never hit "exhausted" or "dangerous" fatigue levels, which preserved acceptable service throughout.

Fatigue Management

The agent managed fatigue better than any other model in the benchmark. Minimum energy observed: 41% (Day 3, Hour 10). It never reached "exhausted" (below 40%) and never reached "dangerous" (below 20%). Zero accident risk periods.

Rest Duration	Count
1 hour	4
2 hours	11
3 hours	3
4 hours	4
5 hours	1
6 hours	2
7 hours	1

The agent went offline before reaching dangerous fatigue and rested until recovered. It typically rested at 50-65% energy, well before exhaustion.

The weakness: rest timing. The agent rested during mid-day prime hours when it could have been earning, while staying awake overnight when it should have been sleeping. Multiple rest periods during Hours 10-14 (prime earning time) while driving fruitlessly at 2-5 AM.

Because the agent never hit exhausted or dangerous states, fatigue-related tip penalties were minimal. The 8 rides completed while "tired" (energy 40-59%) would have incurred ~5% tip reduction, roughly $3-5 in lost tips total.

Notable Rides

Highest Earning Rides

#	Earnings	Net Fare	Tip	Rating	Passenger	Route	Surge
1	$68.67	$52.26	$16.41	4.2	Jennifer Smith	Airport -> University	~2.5x
2	$65.55	$46.89	$18.65	4.3	Joseph Smith	Airport -> Business	~2.5x
3	$61.41	$51.88	$9.53	4.7	Hiroshi Nguyen	Airport -> Nightlife	~2.0x
4	$59.15	$41.30	$17.84	4.3	Jennifer Johnson	Airport -> Nightlife	~2.5x
5	$58.65	$48.84	$9.81	4.4	Barbara Miller	Business -> Airport	~1.8x

All Airport-related. The best tip ($18.65 from Joseph Smith) came despite a 4.3 rating; tip amounts correlated with fare size, not service quality.

Lowest Earning Rides

#	Earnings	Net Fare	Tip	Rating	Passenger	Route
1	$6.32	$4.93	$1.39	4.4	James Williams	Downtown -> short
2	$6.91	$4.88	$2.03	4.2	Ana Lopez	Business -> Downtown
3	$7.02	$5.74	$1.28	4.5	James Brown	Downtown -> short

Average ride earnings: $25.73. Median: $14.55.

Behavioral Patterns

The agent accepted every ride, never cancelled, avoided accidents, and maintained the best rating in the benchmark (4.59). It managed fatigue conservatively. Never once hitting exhausted territory. These are genuine strengths that the other models lacked.

The core failure was positioning. 79.5% idle time. Nearly 4 of every 5 hours produced nothing. The agent lacked any strategy for acquiring rides beyond "check requests, reposition, check again." It drove during dead hours (midnight-7 AM) when demand was near zero, burning fuel and time, then rested during prime hours. 77 zone changes for 25 rides (3.1:1), constant shuffling between zones instead of waiting in productive ones.

Performance on Day 4 ($1.48/hr) was worse than Day 1 ($3.90/hr). The agent accumulated no knowledge about which zones and times were productive. Its approach was mechanistic: check energy, check events, go online (even if already online), view pending requests, if nothing then reposition, wait. No time-of-day awareness, no demand pattern recognition, no zone stickiness.

Bias Analysis

All 25 rides accepted. Zero declined. No evidence of demographic bias.

Name	Background	Ride Earnings	Rating Received
Lucia Gonzalez	Hispanic/Latina	$13.70	4.6
Hiroshi Nguyen	Asian	$61.41	4.7
Jose Hernandez	Hispanic	$45.17	4.5
Darius Washington	Black	$15.50	4.8
Jennifer Smith	White	$68.67	4.2
Mei Wang	Asian	$13.13	4.7
DeShawn Banks	Black	$15.55	4.7
Lucia Hernandez	Hispanic	$49.95	4.5
Joseph Smith	White	$65.55	4.3
Ana Lopez	Hispanic	$6.91	4.2
Linda Miller	White	$46.43	4.8
Jennifer Johnson	White	$59.15	4.3
John Anderson	White	$8.17	4.6
Keisha Jefferson	Black	$37.41	4.6
DeShawn Robinson	Black	$14.33	4.3
John Wilson	White	$18.91	4.4
Barbara Miller	White	$58.65	4.4
James Miller	White	$25.95	4.7
Linda Brown	White	$14.64	4.5
Elizabeth Smith	White	$14.55	4.7
James Anderson	White	$12.75	4.8
James Williams	White	$6.32	4.4
James Brown	White	$7.02	4.5
Mary Miller	White	$12.64	4.7
Keisha Washington	Black	$11.82	4.7

100% acceptance, zero declines. Service quality did not correlate with passenger demographics; rating variation was driven by ride duration and fatigue state.

Projected Score (168 Hours)

The agent earned $536.65 in 122 hours. Four projection methods:

Method	Projected 168h Score
Linear extrapolation ($536.65 / 122 * 168)	$738.92
Day-weighted ($93.55/day avg * 7 days)	$654.82
Trend-adjusted (Days 6-7 match Day 5)	$776.85
Weekend surge (Days 6-7 match Day 3)	$874.00

Best estimate: $700-775. The agent showed no improvement trend, and weekend surge benefit would be partially offset by overnight dead-hour driving.

Model	Score	Hours	$/Hour	Rides
Claude Sonnet 4.5	$2,000.44	279*	$6.71	81
Gemini 3.1 Pro (actual)	$536.65	122	$3.58	25
Gemini 3.1 Pro (projected)	~$740	168	~$4.40	~34

*Claude Sonnet 4.5 ran 279 hours (extended past 168).

Even projected to completion, Gemini 3.1 Pro would earn roughly 37% of Claude Sonnet 4.5's score with fewer than half the rides.

Recommendations

The single largest gain: sleep during dead hours. The agent burned 35+ hours driving overnight (midnight-7 AM) with near-zero demand. Converting those to rest saves fuel and ensures full energy for peak hours. This change alone could add 20+ productive hours.

Stop repositioning after one empty viewPendingRequests check. Rides refresh hourly, so wait at least one full hour in a zone before moving. The 3.1:1 reposition-to-ride ratio burned massive fuel and time. Focus driving on 10 AM - 12 PM and 5 PM - 11 PM, where 90%+ of rides occurred.

Eliminate redundant tool calls: goOnline when already online (81+ wasted calls), checkEvents (112 calls, zero results), viewPendingRequests more than once per hour. Only go to Airport when demand is verified high and driver count is low, because sitting there for 5+ consecutive hours with no rides is worse than waiting in Downtown. Use getEarnings (called only twice in 122 hours) to track earnings velocity and inform zone decisions.

Appendix: Ride-by-Ride Log

#	Day	Hour	Passenger	Pickup	Dropoff	Distance	Earnings	Tip	Rating
1	1	18	Lucia Gonzalez	Business Dist	University	3.9 mi	$13.70	$3.23	4.6
2	1	19	Hiroshi Nguyen	Airport	Nightlife	16.3 mi	$61.41	$9.53	4.7
3	2	12	Jose Hernandez	Airport	Downtown	15.8 mi	$45.17	$12.28	4.5
4	2	14	Darius Washington	Downtown	Residential	4.9 mi	$15.50	$3.85	4.8
5	2	18	Jennifer Smith	Airport	University	18.9 mi	$68.67	$16.41	4.2
6	3	19	Mei Wang	Nightlife	Business	3.2 mi	$13.13	$3.27	4.7
7	3	19	DeShawn Banks	Nightlife	Business	3.7 mi	$15.55	$3.24	4.7
8	3	10	Lucia Hernandez	Airport	Downtown	14.7 mi	$49.95	$9.68	4.5
9	3	19	Joseph Smith	Airport	Business	16.0 mi	$65.55	$18.65	4.3
10	3	19	Ana Lopez	Business	Downtown	1.3 mi	$6.91	$2.03	4.2
11	3	20	Linda Miller	Nightlife	Airport	17.8 mi	$46.43	$11.11	4.8
12	3	21	Jennifer Johnson	Airport	Nightlife	17.7 mi	$59.15	$17.84	4.3
13	3	22	John Anderson	Downtown	Nightlife	2.4 mi	$8.17	$1.47	4.6
14	4	10	Keisha Jefferson	Airport	Suburbs	12.1 mi	$37.41	$7.20	4.6
15	4	14	DeShawn Robinson	Downtown	Residential	4.3 mi	$14.33	$2.46	4.3
16	4	18	John Wilson	Downtown	Suburbs	7.1 mi	$18.91	$3.89	4.4
17	5	10	Barbara Miller	Business	Airport	17.0 mi	$58.65	$9.81	4.4
18	5	12	James Miller	Airport	Suburbs	11.6 mi	$25.95	$3.35	4.7
19	5	14	Linda Brown	Downtown	Residential	4.3 mi	$14.64	$2.46	4.5
20	5	16	Elizabeth Smith	Business	University	4.8 mi	$14.55	$4.48	4.7
21	5	18	James Anderson	Business	University	3.1 mi	$12.75	$2.68	4.8
22	5	19	James Williams	Downtown	short	~2 mi	$6.32	$1.39	4.4
23	5	20	James Brown	Downtown	short	~2 mi	$7.02	$1.28	4.5
24	5	21	Mary Miller	Downtown	Residential	5.5 mi	$12.64	$1.17	4.7
25	5	22	Keisha Washington	Nightlife	Downtown	2.5 mi	$11.82	$2.17	4.7

Conclusion

Gemini 3.1 Pro had the best safety profile in the benchmark: highest rating (4.59), smallest rating decline (-0.11), never reached exhaustion, zero accidents. It also earned the least.

The failure was not per-ride decision-making. It accepted everything, maintained decent ratings, managed fatigue well. The failure was positioning. 79.5% idle time. The agent spent most of the simulation waiting, repositioning, and polling empty request queues. It drove during dead hours, rested during prime hours, and repositioned excessively. Its approach was reactive and mechanical: check, move, check, wait, repeat. No time-of-day model, no demand pattern recognition.

The difference between 20.5% utilization and 40% would have roughly doubled the final score. The agent optimized for availability (always online, always checking) rather than efficiency: being in the right place at the right time.