I have observed over the months that both automated and manual restart mechanisms seem to not work right:
pulseaudio often won't restart under phosh if killed #692 (closed)
phosh/phoc will sometimes not restart if terminated, try killall -9 phosh phoc and killall -9 phoc phosh in both orders and you'll see it sometimes won't come back up
if pulseaudio or phosh/phoc are dead like that, they seem to be impossible to start again even manually
restarting lightdm also often spammed polkitd for me and seems to break things
This means that, in addition to the understandable not-perfect-yet code causing the occasional issue in phosh, phoc, etc., I'm often forced to reboot the entire device even when SSH access still works perfectly fine (I've seen all of the above with SSH still working). When I'm out in public without SSH access, the lack of automated recovery also means I often need to hard reboot and enter the LUKS password on public transport with cameras around, or run around with my phone turned off for a while. In summary, I feel like it'd be useful to focus on these restart scenarios, test them thoroughly, and figure out why so many of them seem not to work right.
phosh/phoc will sometimes not restart if terminated, try killall -9 phosh phoc and killall -9 phoc phosh in both orders and you'll see it sometimes won't come back up
Since you're saying it sometimes does come up again, this seems like an upstream bug. Could you report it there too?
It does sometimes when SIGTERMed, but I always assumed that was some openrc or lightdm or supervise script by Alpine/postmarketOS. So my current guess would be that it is a problem with whatever that exact restart mechanism is, which would then not make it an upstream/phosh bug (unless this mechanism is provided by phosh too? Is it?)
So, there are several pieces of software here. Here is what I believe is the full stack:
OpenRC, which starts lightdm and restarts it if lightdm exits, up to ten times before giving up;
LightDM, which will start the dbus-run-session wrapper
dbus-run-session, which will start /usr/bin/phosh, which is actually a wrapper
Phoc, phoc … -E "bash -lc 'gnome-session $(gnome_session_args)'", which will ask gnome session to run its stuff (in a bash login shell to grab some env, just in case)
GNOME Session, which will start every component like the gnome settings dæmon and /usr/libexec/phosh (the real phosh)
Phosh
GNOME Session is smart enough to exit when phosh is killed, and Phoc is smart enough to exit when phosh is killed. I had fun with that here in #540 (closed).
What I think your issue is: you sometimes kill phoc before it has exited gracefully in response to phosh (/usr/libexec/phosh) being killed. It's a bit of a race condition the way you trigger it, but I guess that pkill -9 phoc is enough to reproduce. When the wayland compositor is killed, every wayland client usually segfaults, but some pieces of software like pulseaudio will live on. Since some piece of software is still running, dbus-run-session will not exit (the session still has some software running!), lightdm will not exit (the command it started, dbus-run-session /usr/bin/phosh, hasn't finished!), and OpenRC will not restart the graphical session.
(Assuming the issue is the one above, I can't test at the moment) my humble opinion is that phoc should never crash, and so we are fine. It's a piece of software based on a very good library and the people behind it do look competent. It also tries to do as little as possible to be more stable (it doesn't talk to pulseaudio, for example)
Anyway, to maybe fix that, maybe we could make GNOME Session start pulseaudio? And if we can get gnome-session to restart pulseaudio (can we?), I think that would help #692 (closed) too.
Please make another issue for polkitd and probably one more for the kernel that will not give another DRM lease (I think that was why you couldn't start phosh again, right?). So that proper investigation can take place :)
IMHO this is an unworkable assumption. I think the assumption of the OpenRC scripts should always be that the wayland compositor can and will crash, even if rarely. Just think of the OOM killer, 3d graphics driver bugs, etc, this is just the reality and really not related to software quality of phosh/phoc at all. This is a complex system and will never meet the criteria of never crashing.
Therefore, I strongly suggest changing the restart behavior to recognize both a phosh and a phoc crash and handle it gracefully. In my opinion, the correct behavior in that case is to SIGTERM all remaining session pieces, give them a short amount of time (maybe 3-4 seconds) to terminate, then force terminate whatever is left and restart everything.
In terms of implementation, I suggest there should be some sort of supervising script outside of dbus-run-session launched by e.g. OpenRC that checks phosh/phoc running every few seconds or so, and acts accordingly to make this behavior happen.
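The "SIGTERM, wait a few seconds, then force terminate" step could look roughly like the sketch below. It is only an illustration of the proposal, not something that exists in postmarketOS: the function name terminate_gracefully and the GRACE_SECONDS variable are made up here, and how the leftover session PIDs are found is deliberately left open.

```shell
# Hypothetical sketch: stop the given PIDs gracefully, then forcefully.
# GRACE_SECONDS is an assumed name; the default of 4 follows the
# "3-4 seconds" suggested above.
GRACE_SECONDS="${GRACE_SECONDS:-4}"

terminate_gracefully() {
    # Ask every given process to exit.
    for p in "$@"; do
        kill -TERM "$p" 2>/dev/null
    done

    # Poll once per second until all are gone or the grace period ends.
    waited=0
    while [ "$waited" -lt "$GRACE_SECONDS" ]; do
        alive=0
        for p in "$@"; do
            kill -0 "$p" 2>/dev/null && alive=1
        done
        [ "$alive" -eq 0 ] && return 0
        sleep 1
        waited=$((waited + 1))
    done

    # Force-terminate whatever ignored SIGTERM.
    for p in "$@"; do
        kill -KILL "$p" 2>/dev/null
    done
    return 0
}
```

A supervising service could call this with the PIDs of the leftover session processes and then restart lightdm through OpenRC (for example with rc-service lightdm restart).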
I've also just run into another phosh crash, with the screen turned on and stuck at a specific frame again, and I terminated all of these processes that were still running: 1. dbus-run-session ..., 2. dbus-daemon --session ..., 3. /usr/bin/phoc ..., and it still didn't come back. I then also terminated 4. lightdm --session-child ... and still nothing. Now there's just supervise-daemon lightdm --start ... and /usr/bin/lightdm and that's it: no session, no phosh, no phoc, and it's just happily sitting there with the stuck frame still on the screen. So something seems to be very broken about this, whatever it is. (Could be DRM I guess, with the stuck frame? So that phosh/phoc instantly quit when restarted? Is there any way I could see that in the logs?)
You can see the stdout and stderr of what lightdm started in ~/.xsession-errors. You seemed to have some issues with DRM the other day. If that's indeed what prevents the second phoc from displaying anything, please open a second issue, because that would not be an issue with lightdm not restarting.
At this point, I'm wondering if you are not seeing a memory corruption issue or similar. Could you include a dmesg log next time you experience your issue?
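To make it easier to grab both logs right after a crash, something like the following could be run over SSH. This is just a sketch: the function name collect_crash_logs and the default ~/crash-logs directory are made up, and on some systems dmesg needs root.

```shell
# Hypothetical helper: snapshot ~/.xsession-errors and the kernel log
# into a timestamped directory and print that directory's path.
collect_crash_logs() {
    out="${1:-$HOME/crash-logs}/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out" || return 1

    # stdout/stderr of whatever lightdm started
    [ -f "$HOME/.xsession-errors" ] && cp "$HOME/.xsession-errors" "$out/xsession-errors"

    # Kernel log; if dmesg fails, its error message ends up in the file.
    dmesg > "$out/dmesg.log" 2>&1 || true

    echo "$out"
}
```

Running collect_crash_logs with no argument writes into ~/crash-logs; the printed directory can then be attached to the issue.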
For what it's worth, this happens a lot and I've often continued to use the device remotely for hours afterwards with heavy work (working with vim on larger C projects and recompiling them with gcc over and over) with no issues of any kind. So I doubt this is an overall system stability issue outside of the graphics stack at least, because SSH work appears to be completely unaffected even when it's really intense. I never had the compiler crash or misbehave even one single time, same for the resulting binaries. SSH also never drops out or anything.
I also know that at least some of the times this happened, dmesg showed absolutely nothing about it. But I've stopped even checking, so who knows for the more recent times. I'll recheck for you the next time phosh crashes.
Well, even under heavy memory corruption issues (#606 (closed)), I have only ever had trouble with ssh once. And I have only seen the kernel give up once or twice.
It was really severe though. To the point where you would never see plamo on screen (#547 (closed)), where Kodi would crash after about 5 clicks, where Phosh would crash quite often (which made me open #540 (closed)) and where Mate, the most stable UI, would see an X.Org crash every ~30 seconds.
Interesting! Is there a way I could downclock DRAM to see if the phosh crashes go away? I did notice they tend to happen when I interact with the screen a lot (many touch events, many changing screens), so I guess it could be at least somewhat GPU load related. However, I've rarely had any "regular" app crash, like VLC playing a video or firefox playing youtube and other heavy browsing (edit: actually, I think firefox did crash sometimes, I just forgot about it). Otherwise it's mostly just been phosh. And it definitely also happens when little else is going on, so it's not temperature related (it happens too quickly after a short burst of activity when the device was idle for a long while) and not related to longer periods of heavier load. My personal guess would be that the mesa driver just crashes due to some bugginess.
In any case, I was really just hoping the recovery from these kinds of issues would be better. But if I can help out fix the underlying crashes too, I'd be happy to do so.
For what it's worth, I've also permanently downclocked my CPU with cpupower to the lowest setting mainly because I've noticed I don't really care even when waiting for a bigger compile and it helps a lot with avoiding the device accidentally running too hot under heavier work. So if that also alleviates DRAM being heavily hit somehow (no idea if it does, not a hw gal) then that'd be another possible indicator it's not DRAM. And I've never seen any message about DRAM in dmesg for the few times that I checked, usually there's just nothing indicating a problem at all.
I guess you can take the patch here (temp/u-boot-pinephone/dram-speed.patch) but choose to underclock instead? (you may need to use /usr/sbin/update-u-boot for your updated uboot to be used)
Thinking more about this, I'm curious: why was it changed to 624 MHz anyway if that is known to cause trouble with some devices? Shouldn't the default be something that is known to work for everyone? I had somehow assumed that change had been reverted a long while ago.
Edit: oh, I guess it has been? So it's at 552 right now? interesting
Ok, phosh & phoc just were gone again and wouldn't restart. And as @afontain speculated, there is definitely a problem with the session being stuck half running anyway:
$ ps aux | grep lightdm
 3137 root 0:00 supervise-daemon lightdm --start /usr/bin/lightdm --
 9029 root 0:00 /usr/bin/lightdm
 9035 user 0:00 grep lightdm
$ ps aux | grep phosh
 9037 user 0:00 grep phosh
$ ps aux | grep phoc
 9039 user 0:00 grep phoc
$ ps aux | grep session
 9102 root 0:00 lightdm --session-child 11 14
 9108 user 0:00 grep session
The lightdm --session-child ... process still running appears to be the problem: once I kill it, lightdm will try to relaunch. So I think that should be fixed, and it should happen automatically after some detection timeout. Something like a "session is half-dead, so it should be auto-terminated to allow for a clean restart" script in whatever OpenRC supervising thing handles this.
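The "half-dead" check itself could be as simple as the sketch below, which only looks at a process listing: a lightdm session child is present but no phoc is. The function name session_is_half_dead is made up, the matching is a heuristic based on the ps output above, and a real supervisor would presumably want to see this state for a few seconds in a row before acting.

```shell
# Hypothetical sketch: read a process listing on stdin (e.g. `ps aux`)
# and succeed (exit 0) when a "lightdm --session-child" process exists
# but no phoc process does.
session_is_half_dead() {
    awk '
        /grep/                      { next }      # skip grep helper lines
        /lightdm --session-child/   { child = 1 } # session child is alive
        $0 ~ "(^|[ /])phoc( |$)"    { phoc = 1 }  # compositor is alive
        END { exit !(child && !phoc) }
    '
}
```

For example, ps aux | session_is_half_dead && echo "session is stuck" would print the message in the situation shown above.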
I also made a ticket about the kernel DRM problem here: #715 (closed)