Automation of casting matrices - Sally's Journal
Readthroughs are generally better when people don't spend hours having conversations with themselves-as-different-characters because the person casting did it in a rush. In the past, I have read through scripts and made lots of semi-incoherent tally marks and a few vague assumptions. However, it strikes me this ought to be an easy thing to automate.
1) Take an online text, sort of like this one
2) Turn it into just a long list of character names. This is a bit tricky to think through. But if it's not perfect (for example if we get the strange new character 'Enter') the current method of ad-hoc by hand assessment isn't 100% either, so it only needs to be Good Enough. I assume you could look for every word which was after a carriage return and the first two letters were capitals (although that might fail if there were two people with two part names that had the same first part). Anyway, one could check or tweak.
3) [As a side point, go through that list of character names and sort them to make them unique. The play has a cast list already, but it would be good to double check it, and should be fairly easy at this point. Err, except I don't know how you'd do that. But it must be a clearly solved problem. And one could do the really kludgy 'does next-name equal any of the things in this array? If not, stick it on the bottom as a new thing'. Maybe at this point you get rid of the names and work with numbers because it makes other things easier]
4) Make a score of how much people talk. Once the script is just a long list of character names, you just need to count how often each appears. So say Adam has n_a lines, and Ben has n_b lines. [This would under-measure the parts of people with Very Long Speeches, but I think would be Good Enough, and makes sense for Step 5]
5) Make a score of how much people talk to each other. I guess I'm thinking of a symmetric matrix where each row / column is a character (so the matrix is c by c where c is the number of characters). For each pair of characters (Adam, Ben) you then count how many times Ben speaks immediately before Adam, and how many times Ben speaks immediately after Adam (say this count is n_ab). I'm 99% certain that (Ben, Adam) = (Adam, Ben), as for each time Adam speaks after Ben, Ben speaks before Adam, but it's late, and I might have missed something subtle. Then calculate the incompatibility ratio for Adam also playing Ben as n_ab / (2*n_a) (so out of all the times Adam is talking, how many times is it next to Ben). Anyway, then if you had a play where Adam had n lines, and the entire play consisted of Adam and Ben in dialogue, the incompatibility ratio would be 1, I think...
And that would give me what I wanted - the number of lines each character had, and for any pair of characters, how much they talked to each other, both as an absolute and as a % of their part. Of course, it would be snazzy to take this a step further and find optimal solutions that equalised the number of lines and minimised the incompatibility score for a given number of actors, but actually at that point you want to let more subtle considerations take over, and you have the raw data at a stage you can do it by eye without having to go through 300 pages tally-marking...
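For anyone who wants to try steps 2 to 5, here is a minimal Python sketch. It assumes each speaker name sits alone on its own line in capitals, so headings such as "ACT I" would also be picked up and need weeding out by hand. The function names are made up for illustration, and "lines" here means speeches, as in step 4.

```python
import re
from collections import Counter

# Assumption: a speaker name is a line made only of capitals and spaces.
NAME_LINE = re.compile(r"^[A-Z][A-Z ]*$")

def speaker_sequence(text):
    """Step 2: reduce the script to the ordered list of speaker names."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if NAME_LINE.match(line)]

def line_counts(speakers):
    """Step 4: n_a, the number of speeches each character has."""
    return Counter(speakers)

def adjacency_counts(speakers):
    """Step 5: n_ab, stored once per unordered pair, so it is symmetric."""
    pairs = Counter()
    for prev, cur in zip(speakers, speakers[1:]):
        if prev != cur:  # ignore a character following themselves
            pairs[frozenset((prev, cur))] += 1
    return pairs

def incompatibility(a, b, counts, pairs):
    """n_ab / (2 * n_a): how much of a's part sits next to b's speeches."""
    return pairs[frozenset((a, b))] / (2 * counts[a])
```

Step 3 (the unique cast list) then falls out as `sorted(set(speaker_sequence(text)))`. Storing each pair as a `frozenset` is what makes (Adam, Ben) and (Ben, Adam) the same entry. For a long two-person dialogue the ratio approaches 1 rather than hitting it exactly, because the first and last speeches each have only one neighbour.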
So three questions:
1) Has anyone done anything like this before?
2) If you were going to do something like this, how would you do it technically? (ie program languages of choice, etc)
3) Does my algorithm for how to do it seem fairly sound? Where are the difficult bits I haven't spotted?
Date: October 18th, 2009 10:22 pm (UTC)
I've been playing around with this for bardcamp scripts for a bit, and there are some problems that I've not quite been able to resolve:
1) Picking up when people speak is relatively easy. Picking up when they're in a scene and not speaking is quite hard.
2) Picking up when they /leave/ a scene is really really hard. Because you have to cope with "Exit" "Exeunt" and all variations thereof.
Line count is, in general, problematic, though there are online things that can give you that number.
I've not found a way to do it that was less time-consuming and error-prone than a person and a spreadsheet. I'm sure it's possible, though.
Date: October 19th, 2009 05:39 am (UTC)
I wrote some hacky python to parse Buffy transcripts, but those tend to be a bit easier, because each line is of the form:
Person: Stuff! Things!
[here again, you end up with horrible edge-cases, which tend to need tweaking by hand]
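This is not the hacky python mentioned above, just a small sketch of how such a parser might look. It assumes every speech is a single line of the form "Name: words" and sets aside anything that doesn't fit for hand review. The pattern and the function name `parse_transcript` are illustrative guesses.

```python
import re

# Assumption: a short name, then a colon, then the speech, all on one line.
SPEECH = re.compile(r"^([A-Za-z][A-Za-z' .-]{0,30}):\s+(.*)$")

def parse_transcript(text):
    """Return (speeches, unparsed): matched (name, words) pairs and leftover lines."""
    speeches, unparsed = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEECH.match(line)
        if match:
            speeches.append((match.group(1).strip(), match.group(2)))
        else:
            unparsed.append(line)  # edge cases to tweak by hand
    return speeches, unparsed
```

Keeping the unparsed lines rather than dropping them makes the edge cases visible, which is where most of the hand-tweaking goes.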
This is a nice problem. I imagine you would want a two-stage process. The first stage parses as much automatically as it can, producing some sort of summary. You would then most likely need to manually review the summary to tweak for the edge cases that you just won't be able to get automatically.
The second stage then takes the summary and outputs all the statistics that you require. Because you are going to (try) to guarantee that the summary is correct before being turned into statistics, it makes the writing of this second phase easier as you'll be working within much tighter bounds.
As for programming language, the answer here is "whatever one you're most comfortable with". I don't do programming language arguments so I won't say any more unless challenged. :-)
Date: October 19th, 2009 09:53 pm (UTC)
Advice would be nice, as the only languages I have used are C, Matlab and BBC BASIC, and although I don't know the right answer I'm fairly certain they're the wrong answer.
OK, I'll proffer advice, but I'll be flamed by all your friends because I'm a Windows person and never use Unix. :-)
None of the languages you've stated are ideal for the job, although you'd certainly be able to do it with C or BBC BASIC - just that there's a lot of grunt work to go with it.
Python would be good - nice scripting language with lots of built-in and third-party library functions to help you manipulate strings. If you fancied learning a new language anyway, Python would be a nice choice. (Or Russian if you're planning on going to Moscow anytime soon.)
I tend to reach for C# but that's because it's what I do for a living and so I'm very familiar with it.
I'm sure on the Unix command line you could do all the tasks with a combination of grep and a spot of sed or awk, but I'm sure someone else will advise here.
Surely before step 2, it would be possible to take the cast list and use it as a preset? (Admittedly doesn't help with things like "Cesario [who is actually played by Viola]").
Just the fact that A doesn't talk to B is insufficient; there may well be cases where A talks to C, who then talks to B, who then talks to D who talks to A, within the same scene. This renders step 5 kind of pointless. I think what would be needed here is not a "how often A talks to B" matrix, but "how many scenes do A and B both appear in?" That should be easy enough to discover if you can write a script to find scene and act boundaries.
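A rough Python sketch of the "how many scenes do A and B both appear in?" count. It assumes scene and act headings are lines starting with "ACT" or "SCENE" and that speaker names are all-capital lines. It only sees characters who speak, so anyone silently on stage would still need adding by hand. All names in the code are illustrative.

```python
import re
from collections import Counter
from itertools import combinations

# Assumptions: headings start with ACT or SCENE; names are all-capital lines.
SCENE_LINE = re.compile(r"^(ACT|SCENE)\b")
NAME_LINE = re.compile(r"^[A-Z][A-Z ]*$")

def speakers_by_scene(text):
    """Return one set of speaking characters per scene."""
    scenes = []
    for raw in text.splitlines():
        line = raw.strip()
        if SCENE_LINE.match(line):
            scenes.append(set())  # a heading opens a new scene
        elif NAME_LINE.match(line) and scenes:
            scenes[-1].add(line)
    return [cast for cast in scenes if cast]  # drop empty act-level entries

def shared_scene_counts(scenes):
    """Count, for each pair of characters, the scenes in which both speak."""
    shared = Counter()
    for cast in scenes:
        for a, b in combinations(sorted(cast), 2):
            shared[(a, b)] += 1
    return shared
```

Pairs are stored in alphabetical order, so look them up as `shared[("ADAM", "BEN")]`; a pair that never shares a scene simply counts as 0.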
If they are one person removed you have more time to do the 'actor jumps into different bit of set, possibly changing hats' and make it silly and amusing than if the person is talking to themself, which is harder to pull off.
That's only really important if you're doing a standing-up, acting-it-out readthrough. If it's just a casual, sitting around, heads-in-copies event, then it's less of a problem if one person plays two different characters in the same scene who don't talk to each other. Plus if there aren't quite enough participants, then some doubling within scenes might be necessary. atreic's scenario seems to be casual readthroughs that are organised in a rush. If it's organised properly, the person doing the casting should really have read the script carefully and know what's going on all the time, instead of relying on a computer.
Another issue is avoiding lightning-fast costume changes (remember gnimmel playing Lady Macbeth and a murderer!). Though again, that should only be an issue for more serious readthroughs.
The first bit, getting the list of speaking characters, is relatively straightforward - you can do it with the *nix command
grep -E "^[A-Z ]+$" script.txt | sort | uniq
which gives a sorted list of all the unique lines containing only capitals and spaces... though that does give you a character of "ACT I" ;-)
Counting the lines, with the underestimate you describe, is doable with a simple state machine. Consider states of "Blank" (blank lines), "Speech" (non-blank lines following a name) and "Action" (non-blank lines not following a name). Transitions are Blank->Speech on a line containing only capitals/spaces, Blank->Action on any other non-empty line, Blank->Blank/Speech->Blank/Action->Blank on an empty line and Action->Action/Speech->Speech on a non-empty line. (Probably clearer as a diagram!) The number of lines you process in the Speech state is the count for the speaker identified in the Blank->Speech transition. It'd end up being about a dozen lines of Perl to do it relatively cleanly.
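The state machine above, written out in Python rather than Perl as a sketch. It assumes the same "capitals and spaces only" test for a name line as the grep command; the function name is made up.

```python
import re
from collections import Counter

# Assumption: a name line contains only capitals and spaces.
NAME_LINE = re.compile(r"^[A-Z][A-Z ]*$")

def count_speech_lines(text):
    """Count the lines processed in the Speech state, per speaker."""
    counts = Counter()
    state, speaker = "Blank", None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            state = "Blank"                      # any state -> Blank
        elif state == "Blank":
            if NAME_LINE.match(line):
                state, speaker = "Speech", line  # Blank -> Speech
            else:
                state = "Action"                 # Blank -> Action
        elif state == "Speech":
            counts[speaker] += 1                 # Speech -> Speech
        # In the Action state a non-empty line changes nothing.
    return counts
```

The name line itself only triggers the transition and is not counted. A stage direction in the middle of a speech, after a blank line, ends the count for that speech, which is the underestimate being discussed.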
To fix the underestimate problem, you'll probably want to insert extra "name" lines after an action in the middle of a speech, so it's labelled as a second consecutive speech by the same character - then the automatic counts should work fairly well.
The idea with the matrix is good, and there's a fair amount of computer science literature on similar things: clash graphs and register allocation/colouring. See lectures 5 & 6 here. Admittedly, I never thought I'd be suggesting applying it to Shakespeare, but the problem of how to optimally allocate roles ("variables") to a small, fixed number of actors ("registers") is completely analogous. One can even see a costume change as the "spill" penalty for an actor having to change roles.
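To make the analogy concrete, here is a small greedy colouring sketch in Python. It is a simple heuristic rather than an optimal allocator, and not necessarily the method the lectures describe. The input format (a dictionary from each role to the set of roles it clashes with) and the name `assign_roles` are assumptions for illustration.

```python
def assign_roles(clashes, n_actors):
    """Greedily give each role an actor number that none of its clashing roles has.

    clashes: dict mapping a role to the set of roles it must not share an actor with.
    Returns a dict role -> actor index, or None where no clash-free actor is left.
    """
    casting = {}
    # Handle the most-constrained roles first (ties broken alphabetically).
    for role in sorted(clashes, key=lambda r: (-len(clashes[r]), r)):
        taken = {casting[other] for other in clashes[role] if other in casting}
        free = [actor for actor in range(n_actors) if actor not in taken]
        casting[role] = free[0] if free else None  # None marks a "spill"
    return casting
```

A role left as `None` is one that cannot be cast without a clash for the given number of actors, which is the point where a human has to decide which doubling is least awkward.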
The main thing you're missing is non-speaking characters in a scene - a king addressing his army is very nearly indistinguishable from a monologue. (Adam,Ben) and (Ben,Adam) probably should be the same, but that's true even if only Adam does the talking - he still needs someone to talk at. If you can sort that out (probably by hand-tagging who's in each scene and introducing artificial scene breaks when someone leaves) then the matrix/graph stuff will take care of the rest...
I think it would be more useful for all the purposes I can think of to have a list of who's on stage together rather than who talks directly before or after one another. Even if it's just a talky readthrough, it's really confusing when you're imagining someone being on stage as two different people at the same time.
For Bardcamp I just do who's in each scene, though that might not be good enough for readthroughs/plays with tiny casts.
Sometimes people speak to one another but not directly before or after one another, and sometimes people are really important in a scene but don't say much. Or anything. It would be really irritating to have to 'come in' as Messenger when you're busy tragically bleeding at everyone as Lavinia.
Date: October 19th, 2009 09:57 pm (UTC)
Thinking on this more, it would be useful to have both, but directly talking to each other seems more amenable to automation (because it's far more Stuff to Count, whereas lists of who is in approx. 20 scenes is easier to do by hand). I think for small readthroughs I'd want to try to definitely avoid adjacent lines (unless it turned out to be Impossible), whereas I'd want to make a carefully thought through decision about shared scenes. Clearly this is a case where necessary is not the same as sufficient (the Lavinia point is a good one, but I think that _does_ come under 'a bit of knowledge of how the play works')
I don't see how you could get stats for 'directly talking to each other' - it isn't the same as adjacent lines. In many cases it isn't even nearly the same as adjacent lines. Counting adjacent lines just seems pointless and messy. (Though perhaps my anti-mess prejudice is making me dislike it more than is rational?)
What I do is find a casting pattern that avoids people being two different characters in the same scene insofar as possible, then manually adjust if (a) there aren't enough actors or (b) I particularly want someone to play two different characters who appear in a scene together. The only time-consuming part of the process is working out who is in which scene. The actual casting is very quick and simple, and yes, that applies even when I don't have the luxury of 22 actors.
I don't see what having an 'adjacent lines' analysis, or even a 'who speaks to whom' analysis (if such a thing were possible) would add.
Ummmmmm... Why do you want to know this?
Are you trying to set up a schedule whereby 17 people could rush around doing five different simultaneous read-throughs, as long as they didn't have to shout "eheu!" in one and then run across the camp to weep tragically at their lover's bedside whilst changing hats for the third play?
In which case, maybe you could just write your own play about that? It'd be a corker.
If it's just various raw numbers, you could set up some statements with simple word-counts where you then pair-count characters: in scenes where if scene = 2 characters and Laertes follows Othello an even number of times you assume they divide the words between them evenly, and if not (e.g., Laertes has three entries and Othello has four) then you assume O has the greater number of words, and probably the last word, as well. If it's Laertes, Othello, and Elizabeth Bennet then you have more character-pair permutations to play with; you could set up an upper boundary number, like six, and then work in threes instead of twos.
I can't help thinking that there must be textual analysis programmes which do this thing - have you put your request over to Language Log?
I can't help thinking that there must be textual analysis programmes which do this thing - have you put your request over to Language Log?
Not sure it's really their thing. When I first read the problem spec, I did start thinking along computational linguistics lines - Named Entity Recognition and using a Part-of-Speech tagger to find the proper nouns in the text, but then how would you get the characters from ...?
Sally's question is really more Pragmatics than anything else - metadata on the structure of the discourse, rather than anything actually said in it.
Why do you want to know this?
It's an incredibly useful thing to know with any readthrough where the number of characters exceeds the number of actors by any significant amount.
Ah. In all the drama stuff in my experience the easy thing is to set an upper limit of characters a person can be expected to voice (three, max, seems to work for most people) and have a large repertoire of plays upon which one might draw, so that you don't end up with four people doing a 25-hander. But I'm an old-fashioned sucker for first reading the plays one wishes to read through with others, and I wouldn't find a purely numerical analysis of the structural metadata useful, as against what I thought about the play and the readers. Working out where, once in a while, you might have Actor A having to answer herself because she's suddenly encountered one of her own Other Characters just requires a bit of knowledge of how the play works. I can't imagine a scenario in which I might have 16 actors wanting to read through Waiting for Godot, or three wanting to do one of Shakespeare's historical plays.
I think textual analysis would work this one fine if it were configured reasonably well, as suggested by others in other comments. It would be fun to know the answers, too, for the sake of knowing.
But I'm an old-fashioned sucker for first reading the plays one wishes to read through with others, and I wouldn't find a purely numerical analysis of the structural metadata useful, as against what I thought about the play and the readers.
Me too, but when you're casting eight different plays, some of which have forty, fifty or sixty characters (something I do once a year), the numerical analysis is a vital aide memoire to supplement the more qualitative analysis.
In that scenario, since the people with big parts will only have one of them, those with minor parts will often have five or six, all called things like "third senator", "fifth citizen". Storing in your head which of these interact with which others, and doing so for several plays simultaneously, takes way more than "a bit of knowledge of how the play[s] work[s]"!
There are extra problems when you're doing something like a television series, where the same characters appear in more than one episode. And extra extra problems with Jacobethan history plays where that applies and also there are lots of people with the same name, and some people who change titles (and therefore character names) between or within plays, and you have to grapple with the fact that Shakespeare (at least) has a tendency to merge more than one historical person into a single character.
[On a tangent, I've done two or three person history play readthroughs, and think they're ace, but I'm probably very strange, and they tend to be something that arises when some people are drunk rather than being planned or cast in advance.]
Date: October 19th, 2009 12:44 pm (UTC)
I've done things like this whenever I've organized a readthrough, but not so automated, alas; that requires there be a usable online text of the play and often there just isn't, more's the pity.
Anyway, I just make a grid. Take the cast of characters on one side and the scenes on the other, and use that to see what characters show up in each scene. I only go to looking closely at who speaks to whom if it turns out there are so few actors that parts in scenes must be doubled. Less mathematically interesting than what you're doing, though.
Date: October 19th, 2009 12:48 pm (UTC)
...though actually I was at one readthrough that involved a lot of people-having-conversations-with-themselves, because there were only four of us doing the Firefly pilot and it was unavoidable in those circumstances. Actually I found it a fun acting challenge. I had to work really hard at not making my characters sound the same as each other. Clearly wouldn't work on stage but for a readthrough it was neat.
Date: October 19th, 2009 04:41 pm (UTC)
[As a side point, go through that list of character names and sort them to make them unique. The play has a cast list already, but it would be good to double check it, and should be fairly easy at this point. Err, except I don't know how you'd do that.]
If I had the data in excel in (for sake of argument column A, starting at cell A1), I'd:
a) sort the character names alphabetically
b) put a formula in B1 of '=A1=A2'
c) copy the formula in B1 all the way down the column.
d) copy and paste values for the column B
e) sort columns A and B alphabetically by the contents of B
The top of column A now contains all the character names come up with.
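Outside Excel, the same sort-then-compare-neighbours idea is only a few lines of Python. This is a sketch with a made-up function name, and it keeps the first of each run of duplicates rather than the last, which gives the same list.

```python
def unique_names(names):
    """Sort the names, then keep each one that differs from the name before it."""
    ordered = sorted(names)
    return [name for i, name in enumerate(ordered)
            if i == 0 or name != ordered[i - 1]]
```

In practice `sorted(set(names))` gives the same result in one step; the longer version is only there to mirror steps a) to e).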
I followed essentially the same process when setting 'The Importance of Being Earnest' for four actors, except I did the matrix by means of a piece of paper with lines on it for when characters spoke. The matrix is probably a better solution, but less visualisable.
Date: October 19th, 2009 09:58 pm (UTC)