Overall usability: restart mechanisms being lacking appears to result in notably worse usability

phosh/phoc will sometimes not restart if terminated. Try killall -9 phosh phoc and killall -9 phoc phosh (both orders) and you'll see that it sometimes won't come back up.

Since you're saying it sometimes does come up again, this seems like an upstream bug. Could you report it there too?

By Bart Ribbers on 2020-08-11T15:26:30

It does sometimes when SIGTERMed, but I always assumed that was some openrc or lightdm or supervise script by Alpine/postmarketOS. So my current guess would be that it is a problem with whatever that exact restart mechanism is, which would then not make it an upstream/phosh bug (unless this mechanism is provided by phosh too? Is it?)

By Ellie on 2020-08-11T15:26:57

We auto-start lightdm with OpenRC, which then launches Phosh. I guess LightDM is the one doing it here. You're right, maybe not an upstream bug.

By Bart Ribbers on 2020-08-12T10:50:53

added discussion help wanted ui-phosh labels

So, there are several pieces of software here. Here is what I believe is the full stack:

OpenRC, which starts lightdm and restarts it if lightdm exits, up to ten times before giving up;
LightDM, which will start the dbus-run-session wrapper;
dbus-run-session, which will start /usr/bin/phosh, which is actually a wrapper;
Phoc, phoc … -E "bash -lc 'gnome-session $(gnome_session_args)'", which will ask gnome session to run its stuff (in a bash login shell to grab some env, just in case)
GNOME Session, which will start every component like the gnome settings dæmon and /usr/libexec/phosh (the real phosh)
Phosh

GNOME Session is smart enough to exit when phosh is killed, and Phoc is smart enough to exit when phosh is killed. I had fun with that here in #540 (closed).

What I think your issue is, is that you sometimes kill phoc before it exits gracefully because you killed phosh (/usr/libexec/phosh). It's a bit of a race condition the way you use it, but I guess that pkill -9 phoc is enough to reproduce. When the wayland compositor is killed, every wayland client usually segfaults, but some pieces of software like pulseaudio will live on. Since some piece of software is still running, dbus-run-session will not exit (the session still has some software running!), lightdm will not exit (the command it started, dbus-run-session /usr/bin/phosh, hasn't finished!), and OpenRC will not restart the graphical session.

(Assuming the issue is the one above, I can't test at the moment) my humble opinion is that phoc should never crash, and so we are fine. It's a piece of software based on a very good library and the people behind it do look competent. It also tries to do as little as possible to be more stable (it doesn't talk to pulseaudio, for example)

Anyway, to maybe fix that, maybe we could make GNOME Session start pulseaudio? And if we can get gnome-session to restart pulseaudio (can we?), I think that would help #692 (closed) too.

Please make another issue for polkitd and probably one more for the kernel that will not give another DRM lease (I think that was why you couldn't start phosh again, right?), so that proper investigation can take place :)

By Antoine Fontaine on 2020-08-13T04:09:11

> my humble opinion is that phoc should never crash

IMHO this is an unworkable assumption. I think the assumption of the OpenRC scripts should always be that the wayland compositor can and will crash, even if rarely. Just think of the OOM killer, 3D graphics driver bugs, etc. This is just the reality and really not related to the software quality of phosh/phoc at all. This is a complex system and will never meet the criteria of never crashing.

Therefore, I strongly suggest changing the restart behavior to recognize both a phosh and a phoc crash and handle it gracefully. In my opinion, the correct behavior in that case is to SIGTERM all remaining session pieces, give them a short amount of time (maybe 3-4 seconds) to terminate, then force terminate whatever is left and restart everything.

In terms of implementation, I suggest there should be some sort of supervising script outside of dbus-run-session, launched by e.g. OpenRC, that checks every few seconds or so whether phosh/phoc are running, and acts accordingly to make this behavior happen.
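
A minimal sketch of the "SIGTERM, wait a few seconds, then force terminate" step described above, assuming a POSIX sh environment. The function name stop_pids is hypothetical, and both the way the leftover session PIDs are collected and the restart itself are deliberately left out:

```shell
#!/bin/sh
# Hypothetical helper (not part of phosh, LightDM or OpenRC):
# ask the given PIDs to exit, then force-kill whatever is left.
# usage: stop_pids GRACE_SECONDS PID...
stop_pids() {
    grace=$1
    shift
    # Polite request first; errors for PIDs that are already gone are ignored.
    for pid in "$@"; do
        kill -TERM "$pid" 2>/dev/null
    done
    # Poll once per second until everything has exited or the grace period ends.
    while [ "$grace" -gt 0 ]; do
        alive=0
        for pid in "$@"; do
            kill -0 "$pid" 2>/dev/null && alive=1
        done
        [ "$alive" -eq 0 ] && return 0
        sleep 1
        grace=$((grace - 1))
    done
    # Grace period is over: force-terminate the stragglers.
    for pid in "$@"; do
        kill -KILL "$pid" 2>/dev/null
    done
    return 0
}
```

A supervising script could call this as stop_pids 4 followed by the PIDs of the remaining session processes, and only then trigger the restart.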

By Ellie on 2020-08-13T13:16:53

I've also just run into another phosh crash, with the screen turned on and stuck at a specific frame again, and I terminated all of these processes that were still running: 1. dbus-run-session ..., 2. dbus-daemon --session ..., 3. /usr/bin/phoc ..., and it still didn't come back. I then also terminated 4. lightdm --session-child ... and still nothing. Now there's just supervise-daemon lightdm --start ... and /usr/bin/lightdm and that's it: no session, phosh, or phoc, and it's just happily sitting there with the stuck frame still on the screen. So something seems to be very broken about this, whatever it is. (Could be DRM I guess, with the stuck frame? So that phosh/phoc instantly quit when restarted? Any way I could see that in the logs?)

By Ellie on 2020-08-13T15:21:44

You can see the stdout and stderr of what lightdm started in ~/.xsession-errors. You seemed to have some issues with DRM the other day. If that's indeed what prevents the second phoc from displaying anything, please open a second issue, because that would not be an issue with lightdm not restarting.

By Antoine Fontaine on 2020-08-13T15:21:44

At this point, I'm wondering if you are not seeing a memory corruption issue or similar. Could you include a dmesg log next time you experience your issue?

By Antoine Fontaine on 2020-08-13T15:53:49

For what it's worth, this happens a lot and I've often continued to use the device remotely for hours afterwards with heavy work (working with vim on larger C projects and recompiling them with gcc over and over) with no issues of any kind. So I doubt this is an overall system stability issue outside of the graphics stack at least, because SSH work appears to be completely unaffected even when it's really intense. I never had the compiler crash or misbehave even one single time, same for the resulting binaries. SSH also never drops out or anything.

I also know that at least some of the times this happened, dmesg showed absolutely nothing about it. But I've stopped even checking, so who knows for the more recent times. I'll recheck for you the next time phosh crashes.

By Ellie on 2020-08-13T15:56:19

Well, even under heavy memory corruption issues (#606 (closed)), I have only ever had trouble with ssh once. And I have only seen the kernel give up once or twice.

It was really severe though. To the point where you would never see plamo on screen (#547 (closed)), where Kodi would crash after about 5 clicks, where Phosh would crash quite often (which made me open #540 (closed)) and where Mate, the most stable UI, would see an X.Org crash every ~30 seconds.

By Antoine Fontaine on 2020-08-13T16:03:19

Interesting! Is there a way I could downclock DRAM to see if the phosh crashes go away? I did notice they tend to happen when I interact with the screen a lot (many touch events, many changing screens), so I guess it could be at least somewhat GPU load related. However, I've rarely had any "regular" app crash, like VLC playing a video, ~~firefox playing youtube and other heavy browsing,~~ edit: actually, I think firefox did crash sometimes, I just forgot about it. Otherwise it's mostly just been phosh. And it definitely happens when otherwise little happened too, so it's not temperature related (it happens too quickly after some short burst of activity after the device was idle for a long while) and not related to longer periods of stronger load. My personal guess would be that the mesa driver just crashes due to some bugginess.

In any case, I was really just hoping the recovery from these kinds of issues would be better. But if I can help out fix the underlying crashes too, I'd be happy to do so.

For what it's worth, I've also permanently downclocked my CPU with cpupower to the lowest setting mainly because I've noticed I don't really care even when waiting for a bigger compile and it helps a lot with avoiding the device accidentally running too hot under heavier work. So if that also alleviates DRAM being heavily hit somehow (no idea if it does, not a hw gal) then that'd be another possible indicator it's not DRAM. And I've never seen any message about DRAM in dmesg for the few times that I checked, usually there's just nothing indicating a problem at all.

By Ellie on 2020-08-13T16:43:11

I guess you can take the patch here (temp/u-boot-pinephone/dram-speed.patch) but choose to underclock instead? (you may need to use /usr/sbin/update-u-boot for your updated uboot to be used)

By Antoine Fontaine on 2020-08-13T16:16:24

Thinking more about this, I'm curious: why was it changed to 624 MHz anyway if that is known to cause trouble with some devices? Shouldn't the default be something that is known to work for everyone? I had somehow assumed that change had been reverted a long while ago.

Edit: oh, I guess it has been? So it's at 552 right now? interesting

By Ellie on 2020-08-13T16:37:17

Yes it has been. It's the file as it was on an old commit

By Antoine Fontaine on 2020-08-13T16:38:31

Ok, phosh & phoc just were gone again and wouldn't restart. And as @afontain speculated, there is definitely a problem with the session being stuck half running anyway:

$ ps aux | grep lightdm
 3137 root      0:00 supervise-daemon lightdm --start /usr/bin/lightdm --
 9029 root      0:00 /usr/bin/lightdm
 9035 user      0:00 grep lightdm
$ ps aux | grep phosh
 9037 user      0:00 grep phosh 
$ ps aux | grep phoc
 9039 user      0:00 grep phoc
$ ps aux | grep session
 9102 root      0:00 lightdm --session-child 11 14
 9108 user      0:00 grep session

The lightdm --session-child ... process still running appears to be the problem: once I kill it, lightdm will try to relaunch. So I think that should be fixed, and it should happen automatically somehow after some detection timeout. Like, some "session is half-dead, so it should be auto terminated to allow for clean restart" script in whatever OpenRC supervising thing handles this.
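
A rough sketch of such a "half-dead session" check, assuming POSIX sh. The function name session_is_half_dead is made up for illustration, and the two match strings are simply taken from the process names seen in this thread (lightdm --session-child and /usr/bin/phoc), so they are assumptions that may need adjusting:

```shell
#!/bin/sh
# Hypothetical check: reads a process listing on stdin and succeeds (exit 0)
# when a LightDM session child exists but no phoc compositor does.
# usage: ps aux | session_is_half_dead
session_is_half_dead() {
    listing=$(cat)
    # No session child at all: LightDM is free to start a fresh session.
    printf '%s\n' "$listing" | grep -q -F -e 'lightdm --session-child' || return 1
    # Session child plus compositor: the session looks healthy.
    if printf '%s\n' "$listing" | grep -q -F -e '/usr/bin/phoc'; then
        return 1
    fi
    # Session child without a compositor: the stuck state shown above.
    return 0
}
```

A supervising loop could run this every few seconds and, only after it has succeeded several times in a row (the detection timeout), terminate the leftover session child so that lightdm relaunches the session.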

I also made a ticket about the kernel DRM problem here: #715 (closed)

By Ellie on 2020-08-13T19:43:53

closed
